
Beyond $\mathcal{O}(\sqrt{T})$ Regret: Decoupling Learning and Decision-making in Online Linear Programming

Wenzhi Gao (gwz@stanford.edu), ICME, Stanford University
Dongdong Ge (dongdong@gmail.com, corresponding author), Antai College of Economics and Management, Shanghai Jiao Tong University
Chunlin Sun (chunlin@stanford.edu), ICME, Stanford University
Chenyu Xue (xcy2721d@gmail.com, corresponding author), RIIS, Shanghai University of Finance and Economics
Yinyu Ye (yyye@stanford.edu), ICME and Management Science and Engineering, Stanford University
This paper is an extended version of [11].
Abstract

Online linear programming plays an important role in both revenue management and resource allocation, and recent research has focused on developing efficient first-order online learning algorithms. Despite the empirical success of first-order methods, they typically achieve regret no better than $\mathcal{O}(\sqrt{T})$, which is suboptimal compared to the $\mathcal{O}(\log T)$ bound guaranteed by the state-of-the-art linear programming (LP)-based online algorithms. This paper establishes a general framework that improves upon the $\mathcal{O}(\sqrt{T})$ result when the LP dual problem exhibits certain error bound conditions. For the first time, we show that first-order learning algorithms achieve $o(\sqrt{T})$ regret in the continuous support setting and $\mathcal{O}(\log T)$ regret in the finite support setting beyond the non-degeneracy assumption. Our results significantly improve the state-of-the-art regret results and provide new insights for sequential decision-making.

1 Introduction

This paper presents a new algorithmic framework to solve the online linear programming (OLP) problem. In this context, a decision-maker receives a sequence of resource requests with bidding prices and sequentially makes irrevocable allocation decisions for these requests. OLP aims to maximize the accumulated reward subject to inventory or resource constraints. OLP plays an important role in a wide range of applications. For example, in online advertising [4], an online platform has limited advertising slots on a web page. When a web page loads, online advertisers bid for ad placement, and the platform decides within milliseconds the slot allocation based on the features of advertisers and the user. The goal is to maximize the website's revenue and improve user experience. Another example is online auction, where an online auction platform hosts a large number of auctions for different items. The platform must handle bids and update the auction status in real time. Besides the aforementioned examples, OLP is also widely used in revenue management [29], resource allocation [15], cloud computing [12], and many other applications [3].

Most state-of-the-art algorithms for OLP are dual linear program (LP)-based [1, 16, 19, 22]. More specifically, these algorithms require solving a sequence of LPs to make online decisions. However, the high computational cost of these LP-based methods prevents their application in time-sensitive or large-scale problems. For example, in the aforementioned online advertising example, a decision has to be made in milliseconds, while LP-based methods can take minutes to hours on large-scale problems. This challenge motivates a recent line of research using first-order methods to address OLP [18, 10, 4, 5], which rely on gradient information and are more scalable and computationally efficient than LP-based methods.

Despite the advantage in computational efficiency, first-order methods are still not comparable to LP-based methods in terms of regret in many settings. Existing first-order OLP algorithms achieve only an $\mathcal{O}(\sqrt{T})$ regret bound, with a few exceptions. When the distribution of requests and bidding prices has finite support [28], $\mathcal{O}(T^{3/8})$ regret is obtainable using a three-stage algorithm; if first-order methods are used to solve the subproblems of LP-based methods infrequently, $\mathcal{O}(\log^{2}T)$ regret is achievable in the continuous support setting under a uniform non-degeneracy assumption [22]. However, these methods are either complicated to implement or require strong assumptions. It remains open whether there exists a general framework that allows first-order methods to break the $\mathcal{O}(\sqrt{T})$ regret barrier. This paper answers this question affirmatively.

1.1 Contributions

  • β€’

    We show that first-order methods achieve $o(\sqrt{T})$ regret under weaker assumptions than LP-based methods. In particular, we identify a dual error bound condition that is sufficient to guarantee lower regret for first-order methods when the dual LP problem has a unique optimal solution. In the continuous-support setting, we establish an $\mathcal{O}(T^{1/3})$ regret result under weaker assumptions than the existing methods; in the finite-support setting, we establish an $\mathcal{O}(\log T)$ result, which significantly improves on the state-of-the-art $\mathcal{O}(T^{3/8})$ result and almost matches the $\mathcal{O}(1)$ regret of LP-based methods. For problems with a $\gamma$-Hölder growth condition, we establish a general $\mathcal{O}(T^{\frac{\gamma-1}{2\gamma-1}}\log T)$ regret result, which interpolates among the settings of no growth ($\gamma=\infty$), continuous support with quadratic growth ($\gamma=2$), and finite support with sharpness ($\gamma=1$). Our results show that first-order methods perform well under strictly weaker conditions than LP-based methods and still achieve competitive performance, significantly advancing their applicability in practice.

  • β€’

    We design a general exploration-exploitation framework to exploit the dual error bound condition. The idea is to learn a good approximation of the distribution dual optimal solution. Then, the online decision-making algorithm can be localized around a neighborhood of this approximate dual solution and makes decisions in an effective domain of size $o(1)$, thereby achieving improved regret guarantees. We reveal an important dilemma in using a single first-order method simultaneously as both the learning and the decision-making algorithm: a good learning algorithm can perform poorly in decision-making. This dilemma implies an important discrepancy between stochastic optimization and online decision-making, and it is addressed by decoupling learning and decision-making: two different first-order methods are adopted for learning and decision-making, respectively. This simple idea yields a highly flexible framework for online sequential decision-making. Our analysis can be of independent interest in the broader context of online convex optimization.

Table 1: Regret results in the current OLP literature. $\log\log$ factors are ignored.
Paper Setting and assumptions Algorithm Regret Lower bound
[19] Bounded, continuous support, uniform non-degeneracy LP-based $\mathcal{O}(\log T)$ Yes
[7] Bounded, continuous support, uniform non-degeneracy LP-based $\mathcal{O}(\log T)$ Yes
[13] Bounded, finite support of $\mathbf{a}_{t}$, quadratic growth LP-based $\mathcal{O}(\log^{2}T)$ Unknown
[22] Bounded, continuous support, uniform non-degeneracy LP-based $\mathcal{O}(\log T)$ Yes
[9] Bounded, finite support, non-degeneracy LP-based $\mathcal{O}(1)$ Yes
[18] Bounded Subgradient $\mathcal{O}(\sqrt{T})$ Yes
[4] Bounded Mirror Descent $\mathcal{O}(\sqrt{T})$ Yes
[10] Bounded Proximal Point $\mathcal{O}(\sqrt{T})$ Yes
[5] Bounded Momentum $\mathcal{O}(\sqrt{T})$ Yes
[28] Bounded, finite support, non-degeneracy Subgradient $\mathcal{O}(T^{3/8})$ No ($\mathcal{O}(1)$)
[22] Bounded, continuous support, uniform non-degeneracy Subgradient $\mathcal{O}(\log^{2}T)$ No ($\mathcal{O}(1)$)
This paper Bounded, continuous support, quadratic growth Subgradient $\mathcal{O}(T^{1/3})$ No ($\mathcal{O}(\log T)$)
This paper Bounded, finite support, sharpness Subgradient $\mathcal{O}(\log T)$ No ($\mathcal{O}(1)$)
This paper Bounded, $\gamma$-dual error bound, unique solution Subgradient $\mathcal{O}(T^{\frac{\gamma-1}{2\gamma-1}}\log T)$ Unknown
Related Literature.

There is a vast amount of literature on OLP [23, 25, 24, 2], and we review some recent developments that go beyond $\mathcal{O}(\sqrt{T})$ regret in the stochastic input setting (Table 1). These algorithms mostly follow the same principle of making decisions based on the learned information: learning and decision-making are closely coupled with each other. We refer the interested readers to [3] for a more detailed review of OLP and relevant problems.

LP-based OLP Algorithms.

Most LP-based methods leverage the dual LP problem [1], with only a few exceptions [16]. Under assumptions of either non-degeneracy or finite support on resource requests and/or rewards, $\mathcal{O}(\log T)$ regret has been achieved in different settings. More specifically, [19] establish the dual convergence of the finite-horizon LP solution to the optimal dual solution of the underlying stochastic program; in the continuous support setting, $\mathcal{O}(\log T\log\log T)$ regret is achievable. [7] considers the multi-secretary problem and establishes an $\mathcal{O}(\log T)$ regret result. [22] consider the setting where a regularization term is imposed on the resource and also establish an $\mathcal{O}(\log T)$ regret result. [13] establish $\mathcal{O}(\log^{2}T)$ regret without the non-degeneracy assumption, assuming that the distribution of resource requests has finite support. [9] consider the case where both resource requests and prices have finite support and show that $\mathcal{O}(1)$ regret can be achieved under a non-degeneracy assumption. Recently, attempts have been made to address the computational cost of LP-based methods by solving the LP subproblems infrequently [17, 30, 27]. Most LP-based methods follow the action-history-dependent approach developed in [19] to achieve $o(\sqrt{T})$ regret, and in the continuous support case, the non-degeneracy assumption is required to hold uniformly for the resource vector $\mathbf{b}$ in some pre-specified region. Compared to the aforementioned LP-based methods, our framework works under strictly weaker assumptions.

First-order OLP Algorithms.

Early explorations of first-order OLP algorithms start from [18], [4] and [21], where $\mathcal{O}(\sqrt{T})$ regret is established using mirror descent and subgradient methods. [10] show that the proximal point update also achieves $\mathcal{O}(\sqrt{T})$ regret. [5] analyze a momentum variant of mirror descent and obtain $\mathcal{O}(\sqrt{T})$ regret. In the finite support setting, [28] design a three-stage algorithm that achieves $\mathcal{O}(T^{3/8})$ regret when the distribution LP is non-degenerate. [22] apply a first-order method to solve subproblems in LP-based methods infrequently and achieve $\mathcal{O}(\log^{2}T)$ regret. However, [22] still require a uniform non-degeneracy assumption. Our framework is motivated directly by the properties of first-order methods and provides a unified analysis under different distribution settings.

Structure of the paper.

The rest of the paper is organized as follows. Section 2 introduces OLP and first-order OLP algorithms. Section 3 defines the dual error bound condition and discusses its implications for first-order learning algorithms. In Section 4, we introduce a general framework that exploits the error bound condition and improves the state-of-the-art regret results. We verify the theoretical findings in Section 5.

2 Online linear programming with first-order methods

Notations.

Throughout the paper, we use $\|\cdot\|$ to denote the Euclidean norm and $\langle\cdot,\cdot\rangle$ to denote the Euclidean inner product. Bold letters $\mathbf{A}$ and $\mathbf{a}$ denote matrices and vectors, respectively. Given a convex function $f(\mathbf{x})$, its subdifferential is denoted by $\partial f(\mathbf{x}):=\{\mathbf{v}:f(\mathbf{y})\geq f(\mathbf{x})+\langle\mathbf{v},\mathbf{y}-\mathbf{x}\rangle \text{ for all } \mathbf{y}\}$, and $f^{\prime}(\mathbf{x})\in\partial f(\mathbf{x})$ is called a subgradient. We use $\mathbf{g}_{\mathbf{x}}$ satisfying $\mathbb{E}[\mathbf{g}_{\mathbf{x}}]\in\partial f(\mathbf{x})$ to denote a stochastic subgradient. $[\cdot]_{+}=\max\{\cdot,0\}$ denotes the component-wise positive-part function, and $\mathbb{I}\{\cdot\}$ denotes the 0-1 indicator function. The relation $\mathbf{x}\geq\mathbf{y}$ denotes element-wise inequality. Given $\mathbf{x}$ and a closed convex set $\mathcal{X}$, we define $\mathrm{dist}(\mathbf{x},\mathcal{X}):=\min_{\mathbf{y}\in\mathcal{X}}\|\mathbf{x}-\mathbf{y}\|$ and $\mathrm{diam}(\mathcal{X}):=\max_{\mathbf{x},\mathbf{y}\in\mathcal{X}}\|\mathbf{x}-\mathbf{y}\|$.

2.1 OLP and duality

An online resource allocation problem with linear inventory and rewards can be modeled as an OLP problem: given time horizon $T\geq 1$ and $m\geq 1$ resources represented by $\mathbf{b}\in\mathbb{R}^{m}_{+}$, at time $t$, a customer with $(c_{t},\mathbf{a}_{t})\in\mathbb{R}\times\mathbb{R}^{m}$ arrives and requests resources $\mathbf{a}_{t}$ at price $c_{t}$. A decision $x^{t}\in[0,1]$ is made to (partially) accept the order or reject it. With compact notation $\mathbf{c}=(c_{1},\ldots,c_{T})^{\top}\in\mathbb{R}^{T}$ and $\mathbf{A}:=(\mathbf{a}_{1},\ldots,\mathbf{a}_{T})\in\mathbb{R}^{m\times T}$, the problem can be written as

$\max_{\mathbf{x}}\ \langle\mathbf{c},\mathbf{x}\rangle \quad \text{subject to} \quad \mathbf{A}\mathbf{x}\leq\mathbf{b},\ \mathbf{0}\leq\mathbf{x}\leq\mathbf{1},$

where $\mathbf{0}$ and $\mathbf{1}$ are vectors of all zeros and ones. The dual problem

$\min_{(\mathbf{y},\mathbf{s})\geq\mathbf{0}}\ \langle\mathbf{b},\mathbf{y}\rangle+\langle\mathbf{1},\mathbf{s}\rangle \quad \text{subject to} \quad \mathbf{s}\geq\mathbf{c}-\mathbf{A}^{\top}\mathbf{y}$ (DLP)

can be transformed into the following finite sum form

$\min_{\mathbf{y}\geq\mathbf{0}}~f_{T}(\mathbf{y}):=\frac{1}{T}\sum_{t=1}^{T}\big(\langle\mathbf{d},\mathbf{y}\rangle+[c_{t}-\langle\mathbf{a}_{t},\mathbf{y}\rangle]_{+}\big),$ (1)

where $\mathbf{d}=T^{-1}\mathbf{b}$ is the average resource. When the $(c_{t},\mathbf{a}_{t})$ are i.i.d., $f_{T}(\mathbf{y})$ can be viewed as a sample approximation of the expected dual function $f(\mathbf{y})$, where

$f(\mathbf{y}):=\mathbb{E}[f_{T}(\mathbf{y})]=\langle\mathbf{d},\mathbf{y}\rangle+\mathbb{E}_{(c,\mathbf{a})}[c-\langle\mathbf{a},\mathbf{y}\rangle]_{+}.$ (2)

Define the sets of dual optimal solutions, and let $\mathbf{y}_{T}^{\star},\mathbf{y}^{\star}$ denote corresponding dual optimal solutions:

$\mathbf{y}_{T}^{\star}\in\mathcal{Y}_{T}^{\star}=\operatorname*{arg\,min}_{\mathbf{y}\geq\mathbf{0}}~f_{T}(\mathbf{y})\quad\text{and}\quad\mathbf{y}^{\star}\in\mathcal{Y}^{\star}=\operatorname*{arg\,min}_{\mathbf{y}\geq\mathbf{0}}~f(\mathbf{y}),$

and we can determine the primal optimal solution $\mathbf{x}_{T}^{\star}=(x_{1}^{\star},\ldots,x_{T}^{\star})\in\mathbb{R}^{T}$ by the LP optimality conditions:

$x_{t}^{\star}\in\begin{cases}\{0\}, & \text{if } c_{t}<\langle\mathbf{a}_{t},\mathbf{y}_{T}^{\star}\rangle,\\ [0,1], & \text{if } c_{t}=\langle\mathbf{a}_{t},\mathbf{y}_{T}^{\star}\rangle,\\ \{1\}, & \text{if } c_{t}>\langle\mathbf{a}_{t},\mathbf{y}_{T}^{\star}\rangle.\end{cases}$

This connection between primal and dual solutions motivates dual-based online learning algorithms: a dual-based learning algorithm maintains a dual sequence $\{\mathbf{y}^{t}\}_{t=1}^{T}$ in the learning process, while primal decisions are made based on the optimality condition. Given the sample approximation interpretation, first-order methods are natural candidate learning algorithms.

2.2 First-order methods on the dual problem

First-order methods leverage the sample approximation structure and apply (sub)gradient-based updates. One commonly used first-order method is the online projected subgradient method (Algorithm 1):

$x^{t}=\mathbb{I}\{c_{t}\geq\langle\mathbf{a}_{t},\mathbf{y}^{t}\rangle\},$
$\mathbf{g}^{t}\in\partial_{\mathbf{y}}\{\langle\mathbf{d},\mathbf{y}\rangle+[c_{t}-\langle\mathbf{a}_{t},\mathbf{y}\rangle]_{+}\}\big|_{\mathbf{y}=\mathbf{y}^{t}},$
$\mathbf{y}^{t+1}=[\mathbf{y}^{t}-\alpha_{t}\mathbf{g}^{t}]_{+}.$ (3)

Upon the arrival of each customer $(c_{t},\mathbf{a}_{t})$, a decision $x^{t}$ is made based on the optimality condition. Then, the dual variable $\mathbf{y}^{t}$ is adjusted with the stochastic subgradient. Other learning algorithms, such as mirror descent, also apply to the OLP setting. This paper focuses on the subgradient method.

Input: Initial dual solution guess $\mathbf{y}^{1}$, subgradient stepsizes $\{\alpha_{t}\}_{t=1}^{T}$
for $t=1$ to $T$ do
       Make primal decision $x^{t}=\mathbb{I}\{c_{t}\geq\langle\mathbf{a}_{t},\mathbf{y}^{t}\rangle\}$
       Compute subgradient $\mathbf{g}^{t}=\mathbf{d}-\mathbf{a}_{t}x^{t}$
       Subgradient update $\mathbf{y}^{t+1}=[\mathbf{y}^{t}-\alpha_{t}\mathbf{g}^{t}]_{+}$
end for
Algorithm 1: First-order subgradient OLP algorithm
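For concreteness, a minimal Python sketch of Algorithm 1 is given below. The data-generating function sample_customer is a hypothetical placeholder for the request stream and is not part of the paper; everything else follows the pseudocode above.

import numpy as np

def online_subgradient_olp(d, T, alpha, sample_customer, rng):
    """Algorithm 1 (sketch): online projected subgradient OLP.
    d: average resource vector (b / T); alpha: constant stepsize.
    sample_customer(rng) returns one request (c_t, a_t)."""
    m = d.shape[0]
    y = np.zeros(m)                            # dual iterate y^1 = 0
    decisions, rewards, consumption = [], [], np.zeros(m)
    for _ in range(T):
        c_t, a_t = sample_customer(rng)
        x_t = 1.0 if c_t >= a_t @ y else 0.0   # primal decision via optimality condition
        g_t = d - a_t * x_t                    # stochastic subgradient of the dual objective
        y = np.maximum(y - alpha * g_t, 0.0)   # projected subgradient step
        decisions.append(x_t)
        rewards.append(c_t * x_t)
        consumption += a_t * x_t
    return np.array(decisions), np.array(rewards), consumption

# Usage on a one-dimensional multi-secretary instance with uniform prices.
rng = np.random.default_rng(0)
T, d = 10_000, np.array([0.5])
sample = lambda rng: (rng.uniform(0.0, 1.0), np.array([1.0]))
x_hat, rew, used = online_subgradient_olp(d, T, alpha=1 / np.sqrt(T), sample_customer=sample, rng=rng)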

2.3 Performance metric

Given the online algorithm output $\hat{\mathbf{x}}_{T}=(x^{1},\ldots,x^{T})$, its regret and constraint violation are defined as

$r(\hat{\mathbf{x}}_{T}):=\max_{\mathbf{A}\mathbf{x}\leq\mathbf{b},\,\mathbf{0}\leq\mathbf{x}\leq\mathbf{1}}\langle\mathbf{c},\mathbf{x}\rangle-\langle\mathbf{c},\hat{\mathbf{x}}_{T}\rangle\qquad\text{and}\qquad v(\hat{\mathbf{x}}_{T}):=\|[\mathbf{A}\hat{\mathbf{x}}_{T}-\mathbf{b}]_{+}\|.$

These metrics are widely used in the OLP literature [18, 10].
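As an illustration, both metrics can be evaluated after the fact by solving the hindsight LP. This is a minimal sketch, assuming the full data $(\mathbf{c},\mathbf{A},\mathbf{b})$ and the algorithm output are available; the helper name regret_and_violation is ours.

import numpy as np
from scipy.optimize import linprog

def regret_and_violation(c, A, b, x_hat):
    """Evaluate r(x_hat) and v(x_hat) against the hindsight optimal LP."""
    T = c.shape[0]
    # linprog minimizes, so negate c to maximize <c, x> s.t. Ax <= b, 0 <= x <= 1.
    res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, 1)] * T, method="highs")
    offline_opt = -res.fun
    regret = offline_opt - c @ x_hat
    violation = np.linalg.norm(np.maximum(A @ x_hat - b, 0.0))
    return regret, violation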

2.4 Main assumptions and summary of the results

We make the following assumptions throughout the paper.

  1. A1.

    (Stochastic input) $\{(c_{t},\mathbf{a}_{t})\}_{t=1}^{T}$ are generated i.i.d. from some distribution $\mathcal{P}$.

  2. A2.

    (Bounded data) There exist constants $\bar{a},\bar{c}>0$ such that $\|\mathbf{a}\|_{\infty}\leq\bar{a}$ and $|c|\leq\bar{c}$ almost surely.

  3. A3.

    (Linear resource) The average resource $\mathbf{d}=T^{-1}\mathbf{b}$ satisfies $\underline{d}\cdot\mathbf{1}\leq\mathbf{d}\leq\bar{d}\cdot\mathbf{1}$, where $0<\underline{d}\leq\bar{d}$.

A1 to A3 are standard and minimal in the OLP literature [4, 18, 10], and it is known that the online subgradient method (Algorithm 1) with constant stepsize $\alpha_{t}\equiv\mathcal{O}(1/\sqrt{T})$ achieves $\mathcal{O}(\sqrt{T})$ regret.

Theorem 2.1 (Sublinear regret benchmark [10, 18]).

Under A1 to A3, the online subgradient method (3) with $\alpha_{t}\equiv\sqrt{\frac{2\bar{c}}{m\underline{d}(\bar{a}+\bar{d})^{2}}}\cdot\frac{1}{\sqrt{T}}$ outputs $\hat{\mathbf{x}}_{T}$ such that

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq\frac{m(\bar{a}+\bar{d})^{2}}{\underline{d}}+\sqrt{m}(\bar{a}+\bar{d})+\sqrt{\frac{m\bar{c}}{2\underline{d}}}(\bar{a}+\bar{d})\sqrt{T}=\mathcal{O}(\sqrt{T}).$

Theorem 2.1 will be used as a benchmark for our results. Under A1 to A3, $\mathcal{O}(\sqrt{T})$ regret has been shown to match the lower bound [2]. Under further assumptions such as non-degeneracy, LP-based OLP algorithms can efficiently leverage this structure to achieve $\mathcal{O}(\log T)$ and $\mathcal{O}(1)$ regret, respectively, in the continuous [19] and finite support settings [9]. However, how first-order methods can efficiently exploit these structures remains less explored. This paper establishes a new online learning framework to resolve this issue. In particular, we consider the $\gamma$-error bound condition from the optimization literature and summarize our main results below:

Theorem 2.2 (TheoremΒ 4.1, informal).

Suppose $f(\mathbf{y})$ satisfies the $\gamma$-dual error bound condition ($\gamma\geq 1$) and that $\mathcal{Y}^{\star}=\{\mathbf{y}^{\star}\}$ is a singleton. Then, our framework achieves

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq\mathcal{O}(T^{\frac{\gamma-1}{2\gamma-1}}\log T)$

using first-order methods.

It turns out that the dual error bound condition is the key to improved regret for first-order methods. In the next section, we formally define the dual error bound condition and introduce its consequences.

3 Dual error bound and subgradient method

In this section, we discuss the dual error bound condition that allows first-order OLP algorithms to go beyond $\mathcal{O}(\sqrt{T})$ regret. We also introduce and explain several important implications of the error bound condition for the subgradient method. Unless otherwise specified, we take $\mathbf{y}^{\star}=\operatorname*{arg\,min}_{\mathbf{y}\in\mathcal{Y}^{\star}}\|\mathbf{y}\|$ and $\mathbf{y}^{\star}_{T}=\operatorname*{arg\,min}_{\mathbf{y}\in\mathcal{Y}^{\star}_{T}}\|\mathbf{y}\|$ to be the unique minimum-norm solutions of the distribution dual problem (2) and the sample dual problem (1), respectively.

3.1 Dual error bound condition

Our key assumption, also known in the literature as the Hölder error bound condition [14], is stated as follows.

  1. A4.

    (Dual error bound) $f(\mathbf{y})-f(\mathbf{y}^{\star})\geq\mu\cdot\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})^{\gamma}$ for all $\mathbf{y}\in\mathcal{Y}=\{\mathbf{y}:\mathbf{y}\geq\mathbf{0},\|\mathbf{y}\|\leq\frac{\bar{c}+\underline{d}}{\underline{d}}\}$, where $\gamma\in[1,\infty)$.

Assumption A4 states a growth condition on the expected dual function: as $\mathbf{y}$ leaves the distribution dual optimal set $\mathcal{Y}^{\star}$, the objective grows at least at rate $\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})^{\gamma}$. It is implied by the assumptions used in the analysis of LP-based OLP algorithms, which we summarize below.

Remark 3.1.

The set $\mathcal{Y}$ is chosen such that $\mathcal{Y}^{\star}\subseteq\mathcal{Y}$, since $\mathbf{y}^{\star}\geq\mathbf{0}$ and

$\underline{d}\|\mathbf{y}^{\star}\|\leq\underline{d}\|\mathbf{y}^{\star}\|_{1}\leq\langle\mathbf{d},\mathbf{y}^{\star}\rangle\leq f(\mathbf{y}^{\star})=\mathbb{E}[\langle\mathbf{d},\mathbf{y}^{\star}\rangle+[c-\langle\mathbf{a},\mathbf{y}^{\star}\rangle]_{+}]\leq f(\mathbf{0})\leq\bar{c}.$ (4)

Similarly, we can show that $\mathbf{y}_{T}^{\star}\in\mathcal{Y}$.

Example 3.1 (Continuous-support, non-degeneracy [19, 7, 22]).

Suppose there exist $\lambda_{1},\lambda_{2},\lambda_{3}>0$ such that

  • β€’

    $\mathbb{E}[\mathbf{a}\mathbf{a}^{\top}]\succeq\lambda_{1}\mathbf{I}$.

  • β€’

    $\lambda_{3}|\langle\mathbf{a},\mathbf{y}-\mathbf{y}^{\star}\rangle|\geq|\mathbb{P}\{c\geq\langle\mathbf{a},\mathbf{y}\rangle|\mathbf{a}\}-\mathbb{P}\{c\geq\langle\mathbf{a},\mathbf{y}^{\star}\rangle|\mathbf{a}\}|\geq\lambda_{2}|\langle\mathbf{a},\mathbf{y}-\mathbf{y}^{\star}\rangle|$ for all $\mathbf{y}\in\mathcal{Y}$.

  • β€’

    $y_{i}^{\star}=0$ for every $i$ such that $d_{i}-\mathbb{E}_{(c,\mathbf{a})}[a_{i}\mathbb{I}\{c>\langle\mathbf{a},\mathbf{y}^{\star}\rangle\}]>0$.

Then $\mathcal{Y}^{\star}=\{\mathbf{y}^{\star}\}$ and $f(\mathbf{y})-f(\mathbf{y}^{\star})\geq\frac{\lambda_{1}\lambda_{2}}{2}\|\mathbf{y}-\mathbf{y}^{\star}\|^{2}$. Here $\mathrm{diam}(\mathcal{Y}^{\star})=0$, $\mu=\frac{\lambda_{1}\lambda_{2}}{2}$ and $\gamma=2$.

Example 3.2 (Finite-support, non-degeneracy [9]).

Suppose $(c,\mathbf{a})$ has finite support. Then there exists $\mu>0$ such that

$f(\mathbf{y})-f(\mathbf{y}^{\star})\geq\mu\cdot\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star}).$

Here $\mathrm{diam}(\mathcal{Y}^{\star})\geq 0$, $\gamma=1$, and $\mu$ is determined by the data distribution. If the expected LP is non-degenerate, then $\mathrm{diam}(\mathcal{Y}^{\star})=0$.

Example 3.3 (General growth).

Suppose $\mathcal{Y}^{\star}\subseteq\operatorname{int}(\mathcal{Y})$ and there exist $\lambda_{4},\lambda_{5}>0$ such that

  • β€’

    $\mathbb{E}[|\langle\mathbf{a},\mathbf{y}\rangle|]\geq\lambda_{4}\|\mathbf{y}\|$ for all $\mathbf{y}\in\mathcal{Y}$.

  • β€’

    $|\mathbb{P}\{c\geq\langle\mathbf{a},\mathbf{y}\rangle|\mathbf{a}\}-\mathbb{P}\{c\geq\langle\mathbf{a},\mathbf{y}^{\star}\rangle|\mathbf{a}\}|\geq\lambda_{5}|\langle\mathbf{a},\mathbf{y}-\mathbf{y}^{\star}\rangle|^{p}$ for some $p\in[1,\infty)$.

Then $\mathcal{Y}^{\star}=\{\mathbf{y}^{\star}\}$ and $f(\mathbf{y})-f(\mathbf{y}^{\star})\geq\frac{\lambda_{4}^{p+1}\lambda_{5}}{2(p+1)}\|\mathbf{y}-\mathbf{y}^{\star}\|^{p+1}$. Here $\mathrm{diam}(\mathcal{Y}^{\star})=0$, $\mu=\frac{\lambda_{4}^{p+1}\lambda_{5}}{2(p+1)}$ and $\gamma=p+1$.

The detailed verification of these results is deferred to Section A.3 in the appendix.

While A4 is implied by the non-degeneracy assumptions in the literature, A4 does not rule out degenerate LPs: an LP can be degenerate and still satisfy the error bound. Therefore, A4 is weaker than the existing assumptions in the OLP literature. The error bound has several important consequences for our algorithm design, which we summarize next.

3.2 Consequences of the dual error bound

In the stochastic input setting, the online subgradient method (Algorithm 1) can be viewed as a stochastic subgradient method (SGM), for which the error bound condition is widely studied in the optimization literature [32, 14, 31]. We will use three implications of A4 to facilitate OLP algorithm design. The first implication is the existence of efficient first-order methods that learn $\mathcal{Y}^{\star}$.

Lemma 3.1 (Efficient learning algorithm).

Under A1 to A4, there exists a first-order method $\mathcal{A}_{L}$ such that after $T_{\varepsilon}$ iterations, it outputs some $\bar{\mathbf{y}}^{T_{\varepsilon}+1}\in\{\mathbf{y}:\mathbf{y}\geq\mathbf{0},\|\mathbf{y}\|\leq\frac{\bar{c}}{\underline{d}}\}$ such that for all $T_{\varepsilon}\geq\mathcal{O}(\varepsilon^{-2(1-\gamma^{-1})}\log(\frac{1}{\varepsilon})\log(\frac{1}{\delta}))$,

$f(\bar{\mathbf{y}}^{T_{\varepsilon}+1})-f(\mathbf{y}^{\star})\leq\varepsilon$

with probability at least $1-\delta$. Moreover, for all $T_{\varepsilon}\geq\mathcal{O}(\frac{1}{\mu}\varepsilon^{-2(1-\gamma^{-1})}\log(\frac{1}{\varepsilon})\log(\frac{1}{\delta}))$,

$\mathrm{dist}(\bar{\mathbf{y}}^{T_{\varepsilon}+1},\mathcal{Y}^{\star})^{\gamma}\leq\varepsilon$

with probability at least 1βˆ’Ξ΄1-\delta.

Lemma 3.1 shows that there is an efficient learning algorithm (in particular, Algorithm 4 in the appendix) that learns an approximate dual optimal solution $\hat{\mathbf{y}}$ with suboptimality $\varepsilon$ at sample complexity $\mathcal{O}(\varepsilon^{-2(1-\gamma^{-1})}\log(\frac{1}{\varepsilon}))$. The sample complexity increases as the growth parameter $\gamma$ becomes larger. Moreover, A4 allows us to transform the dual suboptimality into the distance to optimality: $\mathrm{dist}(\hat{\mathbf{y}},\mathcal{Y}^{\star})^{\gamma}\leq\varepsilon$. Back in the context of OLP, when the growth parameter $\gamma$ is small, it is possible to learn the distribution optimal solution with a small amount of customer data. For example, with $\gamma=1$ and $\varepsilon=\delta=1/T$, we only need the information of $\mathcal{O}(\log^{2}T)$ customers to learn a highly accurate approximate dual solution satisfying $\mathrm{dist}(\hat{\mathbf{y}},\mathcal{Y}^{\star})\leq T^{-1}$. In other words, $\gamma$ characterizes the complexity or difficulty of the distribution of $(c,\mathbf{a})$: a smaller $\gamma$ implies that the distribution is easier to learn.

The second implication of A4 comes from the stochastic optimization literature [20]: suppose the subgradient method (Algorithm 1) runs with constant stepsize $\alpha$; then the last iterate will end up in a noise ball around the optimal set, whose radius is determined by the initial distance to optimality $\mathrm{dist}(\mathbf{y}^{1},\mathcal{Y}^{\star})$ and the subgradient stepsize $\alpha$.

Lemma 3.2 (Noise ball and last iterate convergence).

Under A1 to A3, suppose Algorithm 1 uses $\alpha_{t}\equiv\alpha$ for all $t$. Then

$\mathbb{E}[f(\mathbf{y}^{T+1})-f(\mathbf{y}^{\star})]\leq\mathcal{O}(\frac{\Delta^{2}}{\alpha T}+\alpha\log T),$

where $\Delta:=\mathrm{dist}(\mathbf{y}^{1},\mathcal{Y}^{\star})$. Moreover, if A4 holds, then

$\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{\gamma}]\leq\mathcal{O}\big(\frac{\Delta^{2}}{\mu\alpha T}+\frac{\alpha}{\mu}\log T\big).$

To demonstrate the role of Lemma 3.2 in our analysis, suppose $\Delta$ is sufficiently small and $\alpha=\mathcal{O}(\Delta)$ is fixed. Then applying Lemma 3.2 with $T=1,\ldots$ shows that all the iterates generated by Algorithm 1 satisfy

$\mathbb{E}[\mathrm{dist}(\mathbf{y}^{t+1},\mathcal{Y}^{\star})^{\gamma}]\leq\mathcal{O}(\Delta\log T),\quad t=1,\ldots,T.$

In other words, if $\mathbf{y}^{1}$ is close to the optimal set $\mathcal{Y}^{\star}$, then a proper choice of subgradient stepsize will keep all the iterates in a noise ball around $\mathcal{Y}^{\star}$. This noise ball is key to our improved regret guarantee.

The last implication, which connects the behavior of the subgradient method and OLP, states that the hindsight optimal dual solution $\mathbf{y}^{\star}_{T}$ will be close to $\mathcal{Y}^{\star}$.

Lemma 3.3 (Dual convergence).

Under A1 to A4, for any $\mathbf{y}^{\star}_{T}\in\mathcal{Y}^{\star}_{T}$, we have

$\mathbb{E}[\mathrm{dist}(\mathbf{y}^{\star}_{T},\mathcal{Y}^{\star})^{\gamma}]\leq\mathcal{O}(\sqrt{\frac{\log T}{\mu^{2}T}}).$

Lemma 3.3 states a standard dual convergence result when A4 is present. This type of result is key to the analysis of LP-based methods [19, 7, 22]. Although our analysis will not explicitly invoke Lemma 3.3, it provides sufficient intuition for our algorithm design: suppose $\mathrm{diam}(\mathcal{Y}^{\star})=0$; then the hindsight solution $\mathbf{y}^{\star}_{T}$, which has no regret, will be in an $o(1)$ neighborhood around $\mathbf{y}^{\star}$. In other words, if we have prior knowledge of the customer distribution (and thereby $\mathbf{y}^{\star}$), we can localize around $\mathbf{y}^{\star}$ since we know $\mathbf{y}^{\star}_{T}$ will not be far off. Moreover, according to Lemma 3.1 and Lemma 3.2, the subgradient method has the ability to get close to, and more importantly, to stay in proximity (the noise ball) of $\mathcal{Y}^{\star}$. Intuitively, if $\mathbf{y}^{1}$ is in an $o(1)$ neighborhood of $\mathcal{Y}^{\star}$, then we can adjust the stepsize of the subgradient method so that the online decision-making happens in an $o(1)$ neighborhood around $\mathcal{Y}^{\star}$, and better performance is naturally expected. Even if $\mathrm{diam}(\mathcal{Y}^{\star})>0$, the same argument still applies and can improve performance by a constant. In the next section, we formalize the aforementioned intuitions and establish a general framework for first-order methods to achieve better performance.

4 Improved regret with first-order methods

This section formalizes the intuitions established in Section 3 and introduces a general framework that allows first-order methods to go beyond $\mathcal{O}(\sqrt{T})$ regret.

4.1 Regret decomposition and localization

We start by formalizing the intuition that if $\mathbf{y}^{1}$ is sufficiently close to $\mathbf{y}^{\star}$, then adjusting the stepsize of the subgradient method allows us to make decisions in a noise ball around $\mathbf{y}^{\star}$ and achieve improved performance.

Lemma 4.1 (Regret).

Under A1 to A4, if $\|\mathbf{y}^{1}\|\leq\frac{\bar{c}}{\underline{d}}$, then the output of Algorithm 1 satisfies

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})]\leq\frac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T+\frac{\mathsf{R}}{\alpha}\big[\|\mathbf{y}^{1}-\mathbf{y}^{\star}\|+\mathbb{E}[\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|]\big],$

where $\mathsf{R}=\frac{\bar{c}}{\underline{d}}+\big[\frac{m(\bar{a}+\bar{d})^{2}}{2\underline{d}}+\sqrt{m}(\bar{a}+\bar{d})\big]\alpha$.

Lemma 4.2 (Violation).

Under the same conditions as LemmaΒ 4.1, the output of AlgorithmΒ 1 satisfies

$\mathbb{E}[v(\hat{\mathbf{x}}_{T})]\leq\frac{1}{\alpha}\big[\|\mathbf{y}^{1}-\mathbf{y}^{\star}\|+\mathbb{E}[\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|]\big].$
Remark 4.1.

Note that in our analysis, $\alpha$ will always be $o(1)$ if $T$ is sufficiently large. Therefore, we can treat $\mathsf{R}$ as a constant without loss of generality.

Putting LemmaΒ 4.1 and LemmaΒ 4.2 together, the performance of AlgorithmΒ 1 is characterized by

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq\mathcal{O}(\alpha T+\frac{1}{\alpha}\|\mathbf{y}^{1}-\mathbf{y}^{\star}\|+\frac{1}{\alpha}\mathbb{E}[\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|]).$ (5)

In the standard OLP analysis, it is only possible to ensure boundedness of $\|\mathbf{y}^{1}-\mathbf{y}^{\star}\|$ and $\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|$. In other words,

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq\mathcal{O}(\alpha T+\frac{1}{\alpha})$

and the optimal trade-off at $\alpha=\mathcal{O}(1/\sqrt{T})$ gives the $\mathcal{O}(\sqrt{T})$ performance in Theorem 2.1. However, under A4, our analysis more accurately characterizes the behavior of the subgradient method, and we can do much better when $\mathbf{y}^{1}$ is close to $\mathbf{y}^{\star}$: suppose for now that $\mathrm{diam}(\mathcal{Y}^{\star})=0$ ($\mathcal{Y}^{\star}$ is a singleton) and that $\|\mathbf{y}^{1}-\mathbf{y}^{\star}\|=0$. Lemma 3.2 with $\Delta=0$ ensures that

$\mathbb{E}[\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|]=\mathcal{O}((\alpha\log T)^{1/\gamma}).$ (6)

Plugging (6) back into (5),

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq\mathcal{O}(\alpha T+\frac{1}{\alpha}\alpha^{1/\gamma}(\log T)^{1/\gamma})=\mathcal{O}(\alpha T+\frac{1}{\alpha^{1-1/\gamma}}(\log T)^{1/\gamma})$ (7)

and taking $\alpha=T^{-\frac{\gamma}{2\gamma-1}}$ gives

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq\mathcal{O}(T^{\frac{\gamma-1}{2\gamma-1}}(\log T)^{1/\gamma}).$
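The stepsize choice comes from balancing the two dominating terms in (7); ignoring the logarithmic factor, a short derivation reads:

$\alpha T=\alpha^{-(1-1/\gamma)}\ \Longleftrightarrow\ \alpha^{2-1/\gamma}=T^{-1}\ \Longleftrightarrow\ \alpha=T^{-\frac{\gamma}{2\gamma-1}},\qquad\text{so that}\quad\alpha T=T^{1-\frac{\gamma}{2\gamma-1}}=T^{\frac{\gamma-1}{2\gamma-1}}.$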

This simple argument provides two important observations.

  • β€’

    When $\gamma<\infty$ and $\mathrm{diam}(\mathcal{Y}^{\star})=0$, the knowledge of $\mathbf{y}^{\star}$ significantly improves the performance of first-order methods by shrinking the stepsize of the subgradient method from $\mathcal{O}(1/\sqrt{T})$ to $\mathcal{O}(T^{-\frac{\gamma}{2\gamma-1}})$: a small stepsize implies localization around $\mathbf{y}^{\star}$. If $\gamma=1$, we achieve $\mathcal{O}(\log T)$ regret; if $\gamma\rightarrow\infty$, we recover $\mathcal{O}(\sqrt{T})$ regret.

  • β€’

    Even if $\mathbf{y}^{\star}$ is known, the optimal strategy is not to take $\alpha=0$ and stay at $\mathbf{y}^{\star}$. Instead, $\alpha$ should be chosen according to $\gamma$, the strength of the error bound.

In summary, when $\mathbf{y}^{1}$ is close to $\mathbf{y}^{\star}$, we achieve improved performance guarantees through localization. The smaller $\gamma$ is, the smaller the stepsize we take and the better the regret we achieve. This observation matches Lemma 3.1: when a distribution is "easy", we can trust $\mathbf{y}^{\star}$ and stay close to it.

Although it is sometimes reasonable to assume prior knowledge of $\mathbf{y}^{\star}$, this is not always a practical assumption. Therefore, a natural strategy is to learn it online from the customers. This is where the efficient learning algorithm from Lemma 3.1 comes into play, leading to an exploration-exploitation framework.

4.2 Exploration and exploitation

When $\mathbf{y}^{\star}$ is not known beforehand, Lemma 3.1 shows that first-order methods can learn it from data, and an exploration-exploitation strategy (Algorithm 2) is easily applicable: specify a target accuracy $\Delta$ and define

Exploration horizon $T_{e}:=\mathcal{O}(\frac{1}{\Delta^{2(\gamma-1)}}\log(\frac{1}{\Delta^{\gamma}})\log\frac{1}{T^{-2\gamma}})=\mathcal{O}(\frac{1}{\Delta^{2(\gamma-1)}}\log(\frac{1}{\Delta^{\gamma}})\log T)$ (8)
Exploitation horizon $T_{p}:=T-T_{e}$,

where $T_{e}$ is obtained by taking $\varepsilon=\Delta^{\gamma}$ and $\delta=T^{-2\gamma}$ in Lemma 3.1. Without loss of generality, we assume that $T_{e}$ is an integer and that $T\gg T_{e}$. Then Lemma 3.1 guarantees $\mathrm{dist}(\bar{\mathbf{y}}^{T_{e}+1},\mathcal{Y}^{\star})\leq\Delta$ with probability at least $1-T^{-2\gamma}$. In the exploitation phase, we use the subgradient method (Algorithm 1) with a properly configured stepsize to localize around $\mathcal{Y}^{\star}$ and achieve better performance. Lemma 4.3 characterizes the behavior of this two-phase method, Algorithm 2.

Input: $\mathbf{y}^{1}=\mathbf{0}$ (no prior knowledge), learning algorithm $\mathcal{A}_{L}$ in Lemma 3.1, exploration length $T_{e}$
explore: learn $\bar{\mathbf{y}}^{T_{e}+1}\approx\mathbf{y}^{\star}$ for $t=1$ to $T_{e}$ with $\mathcal{A}_{L}$
exploit: for $t=T_{e}+1$ to $T$ do
       Run Algorithm 1 starting with $\mathbf{y}^{T_{e}+1}=\bar{\mathbf{y}}^{T_{e}+1}$ with a proper stepsize.
end for
Algorithm 2: Exploration-exploitation
Lemma 4.3.

Under the same assumptions as LemmaΒ 4.1, the output of AlgorithmΒ 2 satisfies

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq V(T_{e})+\mathcal{O}(\alpha T_{p}+\frac{\Delta}{\alpha}+\frac{\Delta^{2/\gamma}}{\alpha^{1/\gamma+1}T_{p}^{1/\gamma}}+\alpha^{1/\gamma-1}(\log T)^{1/\gamma}+\frac{\mathrm{diam}(\mathcal{Y}^{\star})}{\alpha}+\frac{1}{\alpha T^{2\gamma}}+\frac{1}{\alpha^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}}),$

where $V(T_{e}):=\mathbb{E}[\|[\sum_{t=1}^{T_{e}}(\mathbf{a}_{t}x^{t}-\mathbf{d})]_{+}\|+\sum_{t=1}^{T_{e}}f(\mathbf{y}^{\star})-c_{t}x^{t}]$ is the performance metric in the exploration phase.

LemmaΒ 4.3 presents two trade-offs:

  • β€’

    Trade-off between exploration and exploitation.

    A highly accurate approximate dual solution, $\mathrm{dist}(\bar{\mathbf{y}}^{T_{e}+1},\mathcal{Y}^{\star})=\Delta\approx 0$, allows localization and improves the performance in exploitation. However, reducing $\Delta$ requires a longer exploration phase and a larger $V(T_{e})$.

  • β€’

    Trade-off of stepsize within the exploitation phase.

    As in (7), the following terms dominate the performance in the exploitation phase

    $\alpha T_{p}+\frac{\Delta}{\alpha}+\frac{\Delta^{2/\gamma}}{\alpha^{1/\gamma+1}T_{p}^{1/\gamma}}+\alpha^{1/\gamma-1}(\log T)^{1/\gamma}+\frac{\mathrm{diam}(\mathcal{Y}^{\star})}{\alpha}$

    and we need to set the optimal $\alpha$ based on $(T_{p},\Delta,\gamma,\mathrm{diam}(\mathcal{Y}^{\star}))$.

Note that we have not yet specified the expression of $V(T_{e})$, since it depends on the dual sequence used for decision-making in the exploration phase. Ideally, $V(T_{e})$ should grow slowly in $T_{e}$ so that exploration provides a high-quality solution without compromising the overall algorithm performance. One natural idea is to make decisions based on the dual solutions produced by the efficient learning algorithm in Lemma 3.1. This is exactly what LP-based methods do [19]. However, as we will demonstrate in the next section, a good first-order learning algorithm can be inferior for decision-making. This counter-intuitive observation motivates the idea of decoupling learning and decision-making, and finally provides a general framework for first-order methods to go beyond $\mathcal{O}(\sqrt{T})$ regret.

4.3 Dilemma between learning and decision-making

Lemma 4.3 requires controlling $V(T_{e})$, the performance metric during exploration, by specifying the sequence $\{\mathbf{y}^{t}\}_{t=1}^{T_{e}}$ used for decision-making. It seems natural to adopt $\{\mathbf{y}^{t}_{L}\}_{t=1}^{T_{e}}$, the dual iterates produced by $\mathcal{A}_{L}$, for decision-making, and one may also wonder whether running $\mathcal{A}_{L}$ for decision-making over the whole horizon $T$ leads to further improved performance guarantees. However, this is not the case: using a good learning algorithm for decision-making leads to worse performance guarantees. To demonstrate this issue, we give a concrete example and consider the following one-dimensional multi-secretary online LP:

$\max_{0\leq x^{t}\leq 1}~\sum_{t=1}^{T}c_{t}x^{t}~~\text{subject to}~~\sum_{t=1}^{T}x^{t}\leq\frac{T}{2},$ (9)

where $\{c_{t}\}_{t=1}^{T}$ are sampled uniformly from $[0,1]$. For this problem, $\mu=\frac{1}{2}$, $\gamma=2$, and $y^{\star}=\frac{1}{2}$ is unique. The subgradient method with stepsize $\alpha_{t}=1/(\mu t)$, which satisfies the convergence result of Lemma 3.1, is a suitable candidate for $\mathcal{A}_{L}$:

Lemma 4.4.

For the multi-secretary problem (9), the subgradient method

$y^{t+1}=[y^{t}-\alpha_{t}g^{t}]_{+}$

with stepsize $\alpha_{t}=\frac{1}{\mu t}$ satisfies $|y^{T+1}-y^{\star}|^{2}\leq\mathcal{O}(\frac{\log\log T+\log(1/\delta)}{T})$ with probability at least $1-\delta$.

Lemma 4.4 suggests that using $\mathcal{A}_{L}$, we indeed approximate $y^{\star}$ efficiently. However, to approximate $y^{\star}$ to high accuracy, the algorithm will inevitably take small stepsizes $\alpha_{t}$ when $t\geq\Omega(T)$. Following our discussion in Section 4.1, even with perfect information of $y^{\star}$, the online algorithm for $\gamma=2$ should remain adaptive to the environment by taking stepsize $\mathcal{O}(T^{-2/3})$. Taking an $\mathcal{O}(1/T)$ stepsize nullifies this adaptivity, and the most direct consequence of the lack of adaptivity is that, when the learning algorithm deviates from $y^{\star}$ due to noise, the overly small stepsize will make the algorithm take a long time to get back. From an optimization perspective, this does not necessarily hurt, since we only care about the quality of the final output $y^{T+1}$. However, for a decision-making algorithm, regret accumulates while the algorithm tries to get back. This observation shows a clear distinction between stochastic optimization and online decision-making. Lemma 4.5 formalizes the aforementioned consequence:

Lemma 4.5.

Denote by $y^{t}$ the estimated dual solution for the online secretary problem (9) at time $t$ produced by the subgradient method with stepsize $1/(\mu t)$ specified in Lemma 4.4. If there exists $t_{0}\geq T/10+1$ such that $y^{t_{0}}\geq y^{\star}+\frac{1}{\sqrt{T}}$, then $\mathbb{E}[y^{t}|y^{t_{0}}]\geq y^{\star}+\frac{1}{20\sqrt{T}}$ for all $t\geq t_{0}$.

As a consequence, a good learning algorithm, due to its lack of adaptivity, is a bad decision-making algorithm:

Proposition 4.1 (Dilemma between learning and decision-making).

If the subgradient method with stepsize $1/(\mu t)$ is used for decision-making, it cannot achieve $\mathcal{O}(T^{\beta})$ regret and constraint violation simultaneously for any $\beta<\frac{1}{2}$.

Although our example only covers $\gamma=2$, similar issues arise for other values of $\gamma$: the stepsizes used by learning algorithms (Lemma 3.1) near convergence are much smaller than the optimal choice dictated by (7). This argument reveals a dilemma between learning and decision-making: a learning algorithm needs a small stepsize to achieve high accuracy, while a decision-making algorithm needs a larger stepsize to maintain adaptivity to the environment. This dilemma is inevitable for a single first-order method. However, the low computational cost of first-order methods opens up another way: it is feasible to use two separate algorithms for learning and decision-making.
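To see the dilemma numerically, the following minimal sketch simulates the multi-secretary problem (9) and runs decision-making twice from $y^1=y^\star=1/2$: once with the learning stepsize $1/(\mu t)$ and once with the constant stepsize $T^{-2/3}$ suggested in Section 4.1. The helper name, seed, and horizon are ours and only for illustration.

import numpy as np

def multi_secretary_regret(T, stepsizes, rng):
    """Run the 1-D subgradient OLP on problem (9); return regret + violation."""
    d, y = 0.5, 0.5                                  # average resource and y^1 = y* = 1/2
    c = rng.uniform(0.0, 1.0, size=T)
    x = np.zeros(T)
    for t in range(T):
        x[t] = 1.0 if c[t] >= y else 0.0
        g = d - x[t]                                 # stochastic subgradient
        y = max(y - stepsizes[t] * g, 0.0)
    offline = np.sort(c)[::-1][: T // 2].sum()       # hindsight optimum: keep the T/2 largest prices
    regret = offline - c @ x
    violation = max(x.sum() - T / 2, 0.0)
    return regret + violation

rng = np.random.default_rng(1)
T, mu = 100_000, 0.5
learning_steps = 1.0 / (mu * np.arange(1, T + 1))    # A_L stepsize 1/(mu t): accurate, not adaptive
constant_steps = np.full(T, T ** (-2.0 / 3.0))       # localized decision stepsize from Section 4.1
print(multi_secretary_regret(T, learning_steps, rng),
      multi_secretary_regret(T, constant_steps, rng))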

4.4 Decoupling learning and decision-making

As discussed, a single first-order method may not simultaneously achieve good regret and an accurate approximation of $\mathcal{Y}^{\star}$. However, this dilemma can be easily addressed if we decouple learning and decision-making and employ two first-order methods, one for learning and one for decision-making. The iteration cost of first-order methods is inexpensive, so it is feasible to maintain multiple of them and take the best of both worlds: the best possible learning algorithm $\mathcal{A}_{L}$ and decision algorithm $\mathcal{A}_{D}$. Back in the exploration-exploitation framework, in the exploration phase, we can take $\mathcal{A}_{D}$ to be the same subgradient method with constant stepsize, which we know guarantees at least $\mathcal{O}(\sqrt{T_{e}})$ performance for horizon length $T_{e}$. The algorithm maintains two dual sequences in the exploration phase, and when exploration is over, the solution learned by $\mathcal{A}_{L}$ is handed over to the exploitation phase, where $\mathcal{A}_{D}$ adjusts its stepsize based on the trade-off in Lemma 4.3. Since subgradient methods with different stepsizes are used for decision-making in both exploration and exploitation, the actual effect of the framework is to restart the subgradient method. The final algorithm is presented in Algorithm 3 (Figure 1).

Figure 1: The exploration phase sends $\mathbf{y}^{T_{e}+1}$ into a neighborhood of $\mathcal{Y}^{\star}$, and in the exploitation phase, $\{\mathbf{y}^{t}\}_{t=T_{e}+1}^{T}$ localizes in this neighborhood while retaining enough adaptivity to make adjustments.
Input: $\mathbf{y}^{1}=\mathbf{0}$ (no prior knowledge), learning algorithm $\mathcal{A}_{L}$ in Lemma 3.1,
    decision algorithm $\mathcal{A}_{D}=$ Algorithm 1, exploration length $T_{e}$
explore: for $t=1$ to $T_{e}$ do
       Run Algorithm 1 with stepsize $\alpha_{e}$.
       Run $\mathcal{A}_{L}$ and learn $\bar{\mathbf{y}}^{T_{e}+1}\approx\mathbf{y}^{\star}$.
end for
exploit: for $t=T_{e}+1$ to $T$ do
       Run Algorithm 1 starting with $\mathbf{y}^{T_{e}+1}=\bar{\mathbf{y}}^{T_{e}+1}$ with stepsize $\alpha_{p}$.
end for
Algorithm 3: Exploration-exploitation, decoupling learning and decision-making, and localization
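A compact Python sketch of Algorithm 3 is shown below. It reuses the subgradient update of Algorithm 1 for decision-making in both phases and, purely for illustration, plugs in the $1/(\mu t)$ subgradient method of Lemma 4.4 as $\mathcal{A}_L$; the function sample_customer and all parameter values are placeholders for the quantities specified in Theorem 4.1.

import numpy as np

def explore_exploit_olp(d, T, T_e, alpha_e, alpha_p, mu, sample_customer, rng):
    """Algorithm 3 (sketch): decoupled learning (A_L) and decision-making (A_D)."""
    m = d.shape[0]
    y_D = np.zeros(m)        # decision path (A_D), constant stepsize alpha_e during exploration
    y_L = np.zeros(m)        # learning path (A_L), stepsize 1/(mu t)
    x = np.zeros(T)
    for t in range(T):
        c_t, a_t = sample_customer(rng)
        x[t] = 1.0 if c_t >= a_t @ y_D else 0.0            # decisions always come from A_D
        g_D = d - a_t * x[t]
        if t < T_e:                                         # exploration phase
            y_D = np.maximum(y_D - alpha_e * g_D, 0.0)
            g_L = d - a_t * (1.0 if c_t >= a_t @ y_L else 0.0)
            y_L = np.maximum(y_L - g_L / (mu * (t + 1)), 0.0)
            if t == T_e - 1:
                y_D = y_L.copy()                            # hand the learned solution to exploitation
        else:                                               # exploitation phase: localize with alpha_p
            y_D = np.maximum(y_D - alpha_p * g_D, 0.0)
    return x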

With learning and decision-making decoupled, Lemma 4.6 characterizes the performance of the whole framework.

Lemma 4.6.

Under the same assumptions as LemmaΒ 4.3, the output of AlgorithmΒ 3 satisfies

$V(T_{e})\leq\mathcal{O}(\frac{1}{\alpha_{e}}+\alpha_{e}T_{e})$

and we have the following performance guarantee:

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq\mathcal{O}\Big(\frac{1}{\alpha_{e}}+\alpha_{e}T_{e}+\alpha_{p}T_{p}+\frac{\Delta}{\alpha_{p}}+\frac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}+\alpha_{p}^{1/\gamma-1}(\log T)^{1/\gamma}+\frac{\mathrm{diam}(\mathcal{Y}^{\star})}{\alpha_{p}}+\frac{1}{\alpha_{p}T^{2\gamma}}+\frac{1}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}}\Big).$ (10)

After balancing the trade-off by considering all the terms, we arrive at TheoremΒ 4.1.

Theorem 4.1 (Main theorem).

Under the same assumptions as Lemma 4.6, suppose $T$ is sufficiently large. If $\mathrm{diam}(\mathcal{Y}^{\star})=0$, then with

$T_{e}=\mathcal{O}(T^{\frac{2\gamma-2}{2\gamma-1}}\log^{2}T),\quad\alpha_{e}=\mathcal{O}\big(\frac{T^{-\frac{\gamma-1}{2\gamma-1}}}{\log T}\big),\quad\alpha_{p}=\mathcal{O}(T^{-\frac{\gamma}{2\gamma-1}}),$

we have

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq\mathcal{O}(T^{\frac{\gamma-1}{2\gamma-1}}\log T).$

In particular, if $\gamma=2$, there is no $\log T$ term. If $\mathrm{diam}(\mathcal{Y}^{\star})>0$, then with

$T_{e}=\frac{2\mathrm{diam}(\mathcal{Y}^{\star})}{2\mathrm{diam}(\mathcal{Y}^{\star})+1}T,\quad\alpha_{e}=\sqrt{\frac{2\bar{c}}{m(\bar{a}+\bar{d})^{2}\underline{d}}\cdot\frac{2\mathrm{diam}(\mathcal{Y}^{\star})+1}{2\mathrm{diam}(\mathcal{Y}^{\star})T}},\quad\alpha_{p}=\sqrt{\frac{2\bar{c}}{m(\bar{a}+\bar{d})^{2}\underline{d}}\cdot\frac{2\mathrm{diam}(\mathcal{Y}^{\star})(2\mathrm{diam}(\mathcal{Y}^{\star})+1)}{T}},$

we have

$\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq 4\sqrt{\frac{m\bar{c}}{2\underline{d}}}\sqrt{2\mathrm{diam}(\mathcal{Y}^{\star})}\sqrt{T}.$

Theorem 4.1 shows that when the dual optimal set $\mathcal{Y}^{\star}$ is a singleton, first-order methods can achieve $o(\sqrt{T})$ regret using our framework. If $\mathrm{diam}(\mathcal{Y}^{\star})>0$, it is still possible to achieve better regret in terms of the constant when $\mathrm{diam}(\mathcal{Y}^{\star})\ll\mathrm{diam}(\mathcal{Y})$. As a realization of our framework, we recover $o(\sqrt{T})$ performance guarantees in the traditional setting of LP-based methods.

Corollary 4.1.

In the non-degenerate continuous support case, we get $\mathcal{O}(T^{1/3})$ performance.

Corollary 4.2.

In the non-degenerate finite-support case, we get $\mathcal{O}(\log T)$ performance.

Again, the intuitions behind the algorithm are simple: the error bound ensures that $\mathbf{y}_{T}^{\star}$ is close to $\mathcal{Y}^{\star}$ and allows the online algorithm to localize in an $o(1)$ neighborhood around $\mathcal{Y}^{\star}$; exploration-exploitation allows us to learn from data and get close to $\mathcal{Y}^{\star}$; decoupling learning and decision-making, we get the best of both worlds and control the regret in the exploration phase. These pieces together make first-order methods go beyond $\mathcal{O}(\sqrt{T})$ regret.

5 Numerical experiments

This section conducts experiments to illustrate the empirical performance of our framework. We consider both the continuous and finite support settings. To benchmark our algorithms, we compare

  1. M1.

    Benchmark subgradient method: Algorithm 1 with constant stepsize $\mathcal{O}(1/\sqrt{T})$.

  2. M2.

    Our framework AlgorithmΒ 3.

  3. MLP.

    State-of-the-art LP-based methods. In the continuous support setting, MLP is the action-history-dependent algorithm [19, Algorithm 3]; in the finite support setting, MLP is the adaptive allocation algorithm [9, Algorithm 1].

In the following, we provide the details of M2 for each setting.

  • β€’

    For the continuous support setting, $\mathcal{A}_{L}$ is the subgradient method with $\mathcal{O}(1/(\mu t))$ stepsize (Lemma 4.4). As suggested by Theorem 4.1, $\mathcal{A}_{D}$ is the subgradient method with stepsize $\alpha_{e}=1/\sqrt{T_{e}}=T^{-1/3}$ in the exploration phase of length $T_{e}=T^{2/3}$. In the exploitation phase, $\mathcal{A}_{D}$ takes stepsize $\alpha_{p}=T^{-2/3}$. We always set $\mu=1$ and do not tune it throughout the experiments.

  • β€’

    For the finite support setting, $\mathcal{A}_{L}$ is ASSG [31] (Algorithm 4 in the appendix); $\mathcal{A}_{D}$ is the subgradient method with stepsize $\alpha_{e}=1/\sqrt{T}$ in the exploration phase of length $T_{e}=50\log T$. In the exploitation phase, $\mathcal{A}_{D}$ takes stepsize $\alpha_{p}=T^{-1}$.

5.1 Continuous support

We generate $\{(c_{t},\mathbf{a}_{t})\}_{t=1}^{T}$ from different continuous distributions. The performance of the three algorithms is evaluated in terms of $r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})$ (which we will call regret for simplicity). We choose $m\in\{1,5\}$ and 10 different $T$ evenly spaced over $[10^{2},10^{5}]$ on a log-scale. All the results are averaged over 100 independent random trials. For all the distributions, each $d_{i}$ is sampled i.i.d. from the uniform distribution $\mathcal{U}[1/3,2/3]$. The data $\{(c_{t},\mathbf{a}_{t})\}_{t=1}^{T}$ are generated as follows: 1) the first distribution [18] takes $m=1$ and samples each $a_{it},c_{t}$ i.i.d. from $\mathcal{U}[0,2]$; 2) the second distribution [19] takes $m=1$, $a_{it}=1$, and samples each $c_{t}$ i.i.d. from $\mathcal{U}[0,1]$; 3) the third distribution takes $m=5$ and samples $a_{it}$ from $\text{Beta}(\alpha,\beta)$ with $(\alpha,\beta)=(1,8)$ and each $c_{t}$ i.i.d. from $\mathcal{U}[0,3]$; 4) the last distribution takes $m=5$ and samples $a_{it}$ and $c_{t}$ i.i.d. from $\mathcal{U}[1,6]$ and $\mathcal{U}[0,3]$, respectively.
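As an illustration, a minimal sketch of the data generation for the first and third continuous distributions might look as follows; the helper name generate_instance is ours, not from the paper.

import numpy as np

def generate_instance(dist, T, rng):
    """Generate (c, A, d) for two of the tested continuous distributions."""
    if dist == 1:                       # first distribution: m = 1, a, c ~ U[0, 2]
        m = 1
        A = rng.uniform(0.0, 2.0, size=(m, T))
        c = rng.uniform(0.0, 2.0, size=T)
    elif dist == 3:                     # third distribution: m = 5, a ~ Beta(1, 8), c ~ U[0, 3]
        m = 5
        A = rng.beta(1.0, 8.0, size=(m, T))
        c = rng.uniform(0.0, 3.0, size=T)
    else:
        raise ValueError("only distributions 1 and 3 are sketched here")
    d = rng.uniform(1.0 / 3.0, 2.0 / 3.0, size=m)    # average resource d_i ~ U[1/3, 2/3]
    return c, A, d

c, A, d = generate_instance(1, T=10_000, rng=np.random.default_rng(0))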

For each distribution and algorithm, we plot the growth behavior of regret with respect to TT. The performance statistics are normalized by the performance at T=102T=10^{2}. Figure 2 suggests that M2 has a better order of regret compared to M1, which is consistent with our theory. Although MLP achieves the best performance in r+vr+v, it requires significantly more computation time than M2, since it solves an LP for each tt. To demonstrate this empirically, we also compare the computation time of M1, M2, and MLP. We generate instances according to the first distribution with m=2m=2 and T∈{103,104,105}T\in\{10^{3},10^{4},10^{5}\}. For each (m,T)(m,T) pair, we average r+vr+v and computation time over 1010 independent trials and summarize the result in Table 2: MLP takes more than one hour when T=105T=10^{5}, whereas M2 only needs 0.0640.064 seconds and achieves significantly better regret than M1. Our proposed framework effectively balances efficiency and regret performance.
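For completeness, the following is a short sketch of how the reported metric r​(𝐱^T)+v​(𝐱^T)r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T}) can be evaluated for a given decision sequence: the regret term is the gap to the offline LP optimum and the violation term is the norm of the positive part of the constraint residual. The helper name, the use of SciPy's LP solver, and the assumed offline benchmark (maximize ⟨𝐜,𝐱⟩\langle\mathbf{c},\mathbf{x}\rangle subject to 𝐀𝐱≀T​𝐝\mathbf{A}\mathbf{x}\leq T\mathbf{d} and 0≀x≀10\leq x\leq 1) are our own illustrative choices.

import numpy as np
from scipy.optimize import linprog

def regret_plus_violation(c, A, d, x_hat):
    # offline benchmark: max <c, x> subject to A x <= T d, 0 <= x <= 1
    T = len(c)
    b = T * d
    res = linprog(-c, A_ub=A, b_ub=b, bounds=(0.0, 1.0), method="highs")
    offline_opt = -res.fun
    r = offline_opt - c @ x_hat                           # regret r(x_hat)
    v = np.linalg.norm(np.maximum(A @ x_hat - b, 0.0))    # constraint violation v(x_hat)
    return r + v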

Figure 2: Growth of normalized r​(𝐱^T)+v​(𝐱^T)r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T}) of different algorithms under the continuous distributions.
Table 2: Computation time of different algorithms under the first tested continuous distribution.
T Algorithm Avg. Regret Avg. Time (s)
10^3 M1 12.37 <0.001
10^3 M2 4.18 <0.001
10^3 MLP 3.82 0.95
10^4 M1 38.24 <0.01
10^4 M2 13.83 <0.01
10^4 MLP 4.12 37.5
10^5 M1 123.03 0.063
10^5 M2 24.00 0.064
10^5 MLP 5.91 4742.9

5.2 Finite support

We generate {(ct,𝐚t)}t=1T\{(c_{t},\mathbf{a}_{t})\}_{t=1}^{T} from different discrete distributions. The performance of three algorithms is evaluated in terms of r​(𝐱^T)+v​(𝐱^T)r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T}). We choose m∈{2,5}m\in\{2,5\} and 1010 different TT evenly spaced over [103,105][10^{3},10^{5}] on log-scale. All the results are averaged over 100100 independent random trials. To generate a discrete distribution with support size KK, we first sample KK different {(ck,𝐚k)}k=1K\{(c_{k},\mathbf{a}_{k})\}_{k=1}^{K} from some distribution, then randomly generate a finite probability distribution 𝐩=(p1,p2,…,pK)\mathbf{p}=(p_{1},p_{2},\ldots,p_{K}) over {(ck,𝐚k)}k=1K\{(c_{k},\mathbf{a}_{k})\}_{k=1}^{K}. At time tt, we sample (ck,𝐚k)(c_{k},\mathbf{a}_{k}) with probability pkp_{k}. We generate four discrete distributions as follows: 1). The first distribution takes m=2,K=5m=2,K=5 and samples ck,ak​ic_{k},a_{ki} i.i.d. from 𝒰​[0,1]\mathcal{U}[0,1] and 𝒰​[0,3]\mathcal{U}[0,3]. Each did_{i} is sampled from 𝒰​[1/3,2/3]\mathcal{U}[1/3,2/3]. 2). The second distribution takes m=5,K=5m=5,K=5 and samples ckc_{k} from the folded normal distribution with parameters ΞΌ=0\mu=0 and Οƒ=1\sigma=1; ak​ia_{ki} is sampled from the folded normal distribution with ΞΌ=Οƒ=1\mu=\sigma=1. Each did_{i} is sampled from 13​(1+|X|)\frac{1}{3}(1+|X|) with XβˆΌπ’©β€‹(0,1)X\sim\mathcal{N}(0,1). 3). The third distribution takes m=5m=5, K=10K=10 and samples ckc_{k} i.i.d. from exponential distribution exp​(1)\text{exp}(1); ak​ia_{ki} is sampled from exp​(2)\text{exp}(2). Each did_{i} is sampled from (1+|X|)/3(1+|X|)/3, where X∼exp​(1)X\sim\text{exp}(1). 4). The last distribution takes m=2m=2, K=10K=10 and samples ckc_{k} i.i.d. from 𝒰​[1,2]\mathcal{U}[1,2] and ak​ia_{ki} from Γ​(Ξ±,ΞΈ)\Gamma(\alpha,\theta) with (Ξ±,ΞΈ)=(2,3)(\alpha,\theta)=(2,3). Each did_{i} is sampled from 𝒰​[1/3,2/3]\mathcal{U}[1/3,2/3].
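As an illustration, below is a minimal sketch of the finite-support instance generation described above, specialized to the first discrete distribution; the function name, the seeding, and the use of a Dirichlet draw for the probability vector 𝐩\mathbf{p} are our own choices.

import numpy as np

def make_discrete_instance(m=2, K=5, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    c_support = rng.uniform(0.0, 1.0, size=K)         # c_k ~ U[0, 1]
    A_support = rng.uniform(0.0, 3.0, size=(m, K))    # a_ki ~ U[0, 3]
    p = rng.dirichlet(np.ones(K))                     # random probability vector over the K types
    d = rng.uniform(1.0 / 3.0, 2.0 / 3.0, size=m)     # d_i ~ U[1/3, 2/3]
    types = rng.choice(K, size=T, p=p)                # type of the arrival at each time t
    return c_support[types], A_support[:, types], d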

Figure 3: Growth of normalized r​(𝐱^T)+v​(𝐱^T)r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T}) of different algorithms under the finite distributions.

For each distribution and algorithm, we plot the normalized regret with respect to TT. FigureΒ 3 indicates that M2 consistently outperforms M1 and exhibits π’ͺ​(log⁑T)\mathcal{O}(\log T) regret. Moreover, M2 significantly reduces the computation time compared to MLP. To demonstrate this empirically, we also compare the computation time of M1, M2, and MLP. We generate instances according to the fourth distribution with m=2m=2 and T∈{103,104,105}T\in\{10^{3},10^{4},10^{5}\}. For each (m,T)(m,T) pair, we average r+vr+v and computation time over 1010 independent trials and summarize the result in TableΒ 3: M2 greatly reduces computation time compared to MLP while achieving comparable regret performance. First-order methods can thus replace LP-based methods in this case.

Table 3: Computation time of different algorithms under the last tested finite distribution.
T Algorithm Avg. Regret Avg. Time (s)
10^3 M1 15.26 <0.001
10^3 M2 3.61 <0.001
10^3 MLP 3.04 0.69
10^4 M1 24.39 <0.01
10^4 M2 3.00 <0.01
10^4 MLP 4.03 6.91
10^5 M1 71.38 0.080
10^5 M2 3.23 0.084
10^5 MLP 3.62 69.23

6 Conclusion

In this paper, we propose an online decision-making framework that allows first-order methods to go beyond the π’ͺ​(T)\mathcal{O}(\sqrt{T}) regret barrier. We identify an error bound condition on the dual problem that suffices for first-order methods to obtain improved regret, and we design an online learning framework to exploit this condition. We believe that our results provide important new insights for sequential decision-making problems.

References

  • [1] Shipra Agrawal, Zizhuo Wang, and Yinyu Ye. A dynamic near-optimal algorithm for online linear programming. Operations Research, 62(4):876–890, 2014.
  • [2] Alessandro Arlotto and Itai Gurvich. Uniformly bounded regret in the multisecretary problem. Stochastic Systems, 9(3):231–260, 2019.
  • [3] SantiagoΒ R Balseiro, Omar Besbes, and Dana Pizarro. Survey of dynamic resource-constrained reward collection problems: Unified model and analysis. Operations Research, 2023.
  • [4] SantiagoΒ R Balseiro, Haihao Lu, and Vahab Mirrokni. The best of many worlds: Dual mirror descent for online allocation problems. Operations Research, 2022.
  • [5] SantiagoΒ R Balseiro, Haihao Lu, Vahab Mirrokni, and Balasubramanian Sivan. From online optimization to PID controllers: Mirror descent with momentum. arXiv preprint arXiv:2202.06152, 2022.
  • [6] Dimitris Bertsimas and JohnΒ N Tsitsiklis. Introduction to linear optimization, volumeΒ 6. Athena Scientific Belmont, MA, 1997.
  • [7] RobertΒ L Bray. Logarithmic regret in multisecretary and online linear programming problems with continuous valuations. arXiv e-prints, pages arXiv–1912, 2019.
  • [8] JamesΒ V Burke and MichaelΒ C Ferris. Weak sharp minima in mathematical programming. SIAM Journal on Control and Optimization, 31(5):1340–1359, 1993.
  • [9] Guanting Chen, Xiaocheng Li, and Yinyu Ye. An improved analysis of LP-based control for revenue management. Operations Research, 2022.
  • [10] Wenzhi Gao, Dongdong Ge, Chunlin Sun, and Yinyu Ye. Solving linear programs with fast online learning algorithms. In International Conference on Machine Learning, pages 10649–10675. PMLR, 2023.
  • [11] Wenzhi Gao, Chunlin Sun, Chenyu Xue, and Yinyu Ye. Decoupling learning and decision-making: Breaking the o​(T)o(\sqrt{T}) barrier in online resource allocation with first-order methods. In International Conference on Machine Learning, pages 14859–14883. PMLR, 2024.
  • [12] Hameed Hussain, Saif UrΒ Rehman Malik, Abdul Hameed, SameeΒ Ullah Khan, Gage Bickler, Nasro Min-Allah, MuhammadΒ Bilal Qureshi, Limin Zhang, Wang Yongji, Nasir Ghani, etΒ al. A survey on resource allocation in high performance distributed computing systems. Parallel Computing, 39(11):709–736, 2013.
  • [13] Jiashuo Jiang, Will Ma, and Jiawei Zhang. Degeneracy is OK: Logarithmic Regret for Network Revenue Management with Indiscrete Distributions. arXiv, 2022.
  • [14] PatrickΒ R Johnstone and Pierre Moulin. Faster subgradient methods for functions with hΓΆlderian growth. Mathematical Programming, 180(1):417–450, 2020.
  • [15] Naoki Katoh and Toshihide Ibaraki. Resource allocation problems. Handbook of Combinatorial Optimization: Volumes 1–3, pages 905–1006, 1998.
  • [16] Thomas Kesselheim, Andreas TΓΆnnis, Klaus Radke, and Berthold VΓΆcking. Primal beats dual on online packing lps in the random-order model. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 303–312, 2014.
  • [17] Guokai Li, Zizhuo Wang, and Jingwei Zhang. Infrequent resolving algorithm for online linear programming. arXiv preprint arXiv:2408.00465, 2024.
  • [18] Xiaocheng Li, Chunlin Sun, and Yinyu Ye. Simple and fast algorithm for binary integer and online linear programming. Advances in Neural Information Processing Systems, 33:9412–9421, 2020.
  • [19] Xiaocheng Li and Yinyu Ye. Online linear programming: Dual convergence, new algorithms, and regret bounds. Operations Research, 70(5):2948–2966, 2022.
  • [20] Zijian Liu and Zhengyuan Zhou. Revisiting the last-iterate convergence of stochastic gradient methods. arXiv preprint arXiv:2312.08531, 2023.
  • [21] Alfonso Lobos, Paul Grigas, and Zheng Wen. Joint online learning and decision-making via dual mirror descent. In International Conference on Machine Learning, pages 7080–7089. PMLR, 2021.
  • [22] Wanteng Ma, Ying Cao, DannyΒ HK Tsang, and Dong Xia. Optimal regularized online allocation by adaptive re-solving. Operations Research, 2024.
  • [23] Will Ma and David Simchi-Levi. Algorithms for online matching, assortment, and pricing with tight weight-dependent competitive ratios. Operations Research, 68(6):1787–1803, 2020.
  • [24] Mohammad Mahdian, Hamid Nazerzadeh, and Amin Saberi. Online optimization with uncertain information. ACM Transactions on Algorithms (TALG), 8(1):1–29, 2012.
  • [25] VahabΒ S Mirrokni, ShayanΒ Oveis Gharan, and Morteza Zadimoghaddam. Simultaneous approximations for adversarial and stochastic online budgeted allocation. In Proceedings of the twenty-third annual ACM-SIAM symposium on Discrete Algorithms, pages 1690–1701. SIAM, 2012.
  • [26] Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647, 2011.
  • [27] Jingruo Sun, Wenzhi Gao, Ellen Vitercik, and Yinyu Ye. Wait-less offline tuning and re-solving for online decision making. arXiv preprint arXiv:2412.09594, 2024.
  • [28] Rui Sun, Xinshang Wang, and Zijie Zhou. Near-optimal primal-dual algorithms for quantity-based network revenue management. arXiv preprint arXiv:2011.06327, 2020.
  • [29] KalyanΒ T Talluri, Garrett VanΒ Ryzin, and Garrett VanΒ Ryzin. The theory and practice of revenue management, volumeΒ 1. Springer, 2004.
  • [30] Haoran Xu, PeterΒ W Glynn, and Yinyu Ye. Online linear programming with batching. arXiv preprint arXiv:2408.00310, 2024.
  • [31] YiΒ Xu, Qihang Lin, and Tianbao Yang. Stochastic convex optimization: Faster local growth implies faster global convergence. In International Conference on Machine Learning, pages 3821–3830. PMLR, 2017.
  • [32] Tianbao Yang and Qihang Lin. RSG: Beating subgradient method without smoothness and strong convexity. The Journal of Machine Learning Research, 19(1):236–268, 2018.

Appendix


Appendix A Proof of Results in SectionΒ 3

A.1 Auxiliary results

Lemma A.1 (Hoeffding’s inequality).

Let X1,…,XnX_{1},\ldots,X_{n} be independent random variables such that 0≀Xi≀u0\leq X_{i}\leq u almost surely. Then for all ΞΆβ‰₯0\zeta\geq 0,

ℙ​{1nβ€‹βˆ‘i=1nXiβˆ’π”Όβ€‹[1nβ€‹βˆ‘i=1nXi]β‰₯ΞΆ}≀exp⁑{βˆ’2​n​uβˆ’2​΢2}.\textstyle\mathbb{P}\{\tfrac{1}{n}\sum_{i=1}^{n}X_{i}-\mathbb{E}[\tfrac{1}{n}\sum_{i=1}^{n}X_{i}]\geq\zeta\}\leq\exp\{-2nu^{-2}\zeta^{2}\}.
Lemma A.2.

Consider standard form LP min𝐀𝐱=𝐛,𝐱β‰₯𝟎⁑⟨𝐜,𝐱⟩\min_{\mathbf{A}\mathbf{x}=\mathbf{b},\mathbf{x}\geq\mathbf{0}}\langle\mathbf{c},\mathbf{x}\rangle and suppose both primal and dual problems are non-degenerate. Then the primal LP solution 𝐱⋆\mathbf{x}^{\star} is unique and there exists ΞΌ>0\mu>0 such that

⟨𝐜,π±βŸ©βˆ’βŸ¨πœ,π±β‹†βŸ©β‰₯ΞΌβ€‹β€–π±βˆ’π±β‹†β€–\langle\mathbf{c},\mathbf{x}\rangle-\langle\mathbf{c},\mathbf{x}^{\star}\rangle\geq\mu\|\mathbf{x}-\mathbf{x}^{\star}\|

for all primal feasible 𝐱∈{𝐱:𝐀𝐱=𝐛,𝐱β‰₯𝟎}\mathbf{x}\in\{\mathbf{x}:\mathbf{A}\mathbf{x}=\mathbf{b},\mathbf{x}\geq\mathbf{0}\}.

Proof.

Denote β€–π±β€–βˆ’βˆž=minj⁑xj\|\mathbf{x}\|_{-\infty}=\min_{j}{x_{j}}. Since both primal and dual problems are non-degenerate, 𝐱⋆\mathbf{x}^{\star} is unique [6]. Denote (B,N)(B,N) to be the partition of basic and non-basic variables, and we have 𝐱⋆=(𝐱B⋆,𝐱N⋆)\mathbf{x}^{\star}=(\mathbf{x}_{B}^{\star},\mathbf{x}_{N}^{\star}), where 𝐱B⋆>𝟎\mathbf{x}_{B}^{\star}>\mathbf{0} and 𝐱N⋆=𝟎\mathbf{x}_{N}^{\star}=\mathbf{0}. Similarly, denote 𝐬\mathbf{s} to be the dual slack for 𝐱\mathbf{x}, we can partition 𝐬⋆=(𝐬B⋆,𝐬N⋆)\mathbf{s}^{\star}=(\mathbf{s}_{B}^{\star},\mathbf{s}_{N}^{\star}) where 𝐬B⋆=𝟎\mathbf{s}_{B}^{\star}=\mathbf{0} and 𝐬N⋆>0\mathbf{s}_{N}^{\star}>0. We have 𝐀B​𝐱B⋆=𝐛\mathbf{A}_{B}\mathbf{x}_{B}^{\star}=\mathbf{b} by primal feasibility of 𝐱⋆\mathbf{x}^{\star}. With dual feasibility, 𝐜N=𝐀NβŠ€β€‹π²β‹†+𝐬N⋆,𝐜B=𝐀BβŠ€β€‹π²β‹†\mathbf{c}_{N}=\mathbf{A}_{N}^{\top}\mathbf{y}^{\star}+\mathbf{s}_{N}^{\star},\mathbf{c}_{B}=\mathbf{A}_{B}^{\top}\mathbf{y}^{\star} for some 𝐲⋆\mathbf{y}^{\star}. Next, consider any feasible LP solution 𝐱\mathbf{x}, and we can write

𝐀𝐱=𝐀B​𝐱B+𝐀N​𝐱N=𝐛=𝐀B​𝐱B⋆.\mathbf{A}\mathbf{x}=\mathbf{A}_{B}\mathbf{x}_{B}+\mathbf{A}_{N}\mathbf{x}_{N}=\mathbf{b}=\mathbf{A}_{B}\mathbf{x}_{B}^{\star}.

Since 𝐀B\mathbf{A}_{B} is non-degenerate, taking inverse on both sides gives 𝐱B⋆=𝐱B+𝐀Bβˆ’1​𝐀N​𝐱N\mathbf{x}_{B}^{\star}=\mathbf{x}_{B}+\mathbf{A}_{B}^{-1}\mathbf{A}_{N}\mathbf{x}_{N} and we deduce that

⟨𝐜,π±βŸ©βˆ’βŸ¨πœ,π±β‹†βŸ©=\displaystyle\langle\mathbf{c},\mathbf{x}\rangle-\langle\mathbf{c},\mathbf{x}^{\star}\rangle={} ⟨𝐜B,𝐱B⟩+⟨𝐜N,𝐱NβŸ©βˆ’βŸ¨πœB,𝐱Bβ‹†βŸ©\displaystyle\langle\mathbf{c}_{B},\mathbf{x}_{B}\rangle+\langle\mathbf{c}_{N},\mathbf{x}_{N}\rangle-\langle\mathbf{c}_{B},\mathbf{x}^{\star}_{B}\rangle (11)
=\displaystyle={} ⟨𝐜B,𝐱B⟩+⟨𝐜N,𝐱NβŸ©βˆ’βŸ¨πœB,𝐱B+𝐀Bβˆ’1​𝐀N​𝐱N⟩\displaystyle\langle\mathbf{c}_{B},\mathbf{x}_{B}\rangle+\langle\mathbf{c}_{N},\mathbf{x}_{N}\rangle-\langle\mathbf{c}_{B},\mathbf{x}_{B}+\mathbf{A}_{B}^{-1}\mathbf{A}_{N}\mathbf{x}_{N}\rangle (12)
=\displaystyle={} ⟨𝐜Nβˆ’π€NβŠ€β€‹π€Bβˆ’βŠ€β€‹πœB,𝐱N⟩\displaystyle\langle\mathbf{c}_{N}-\mathbf{A}_{N}^{\top}\mathbf{A}_{B}^{-\top}\mathbf{c}_{B},\mathbf{x}_{N}\rangle
=\displaystyle={} βŸ¨π€NβŠ€β€‹π²β‹†+𝐬Nβ‹†βˆ’π€NβŠ€β€‹π€Bβˆ’βŠ€β€‹π€BβŠ€β€‹π²β‹†,𝐱N⟩\displaystyle\langle\mathbf{A}_{N}^{\top}\mathbf{y}^{\star}+\mathbf{s}_{N}^{\star}-\mathbf{A}_{N}^{\top}\mathbf{A}_{B}^{-\top}\mathbf{A}_{B}^{\top}\mathbf{y}^{\star},\mathbf{x}_{N}\rangle (13)
=\displaystyle={} ⟨𝐬N⋆,𝐱N⟩β‰₯‖𝐬Nβ‹†β€–βˆ’βˆžβ€‹β€–π±Nβ€–,\displaystyle\langle\mathbf{s}_{N}^{\star},\mathbf{x}_{N}\rangle\geq\|\mathbf{s}_{N}^{\star}\|_{-\infty}\|\mathbf{x}_{N}\|, (14)

where (11) uses 𝐱N⋆=𝟎\mathbf{x}_{N}^{\star}=\mathbf{0}, (12) plugs in 𝐱B⋆=𝐱B+𝐀Bβˆ’1​𝐀N​𝐱N\mathbf{x}_{B}^{\star}=\mathbf{x}_{B}+\mathbf{A}_{B}^{-1}\mathbf{A}_{N}\mathbf{x}_{N}, (13) plugs in 𝐜N=𝐀NβŠ€β€‹π²β‹†+𝐬N⋆\mathbf{c}_{N}=\mathbf{A}_{N}^{\top}\mathbf{y}^{\star}+\mathbf{s}_{N}^{\star} and 𝐜B=𝐀BβŠ€β€‹π²β‹†\mathbf{c}_{B}=\mathbf{A}_{B}^{\top}\mathbf{y}^{\star}, (14) uses the fact that 𝐬N⋆>𝟎\mathbf{s}^{\star}_{N}>\mathbf{0} and ⟨𝐬N⋆,𝐱N⟩β‰₯‖𝐬Nβ€–βˆ’βˆžβ€‹β€–π±Nβ€–1β‰₯‖𝐬Nβ€–βˆ’βˆžβ€‹β€–π±Nβ€–\langle\mathbf{s}_{N}^{\star},\mathbf{x}_{N}\rangle\geq\|\mathbf{s}_{N}\|_{-\infty}\|\mathbf{x}_{N}\|_{1}\geq\|\mathbf{s}_{N}\|_{-\infty}\|\mathbf{x}_{N}\|. Re-arranging the terms,

‖𝐱N‖≀‖𝐬Nβ‹†β€–βˆ’βˆžβˆ’1​(⟨𝐜,π±βŸ©βˆ’βŸ¨πœ,π±β‹†βŸ©).\|\mathbf{x}_{N}\|\leq\|\mathbf{s}_{N}^{\star}\|_{-\infty}^{-1}(\langle\mathbf{c},\mathbf{x}\rangle-\langle\mathbf{c},\mathbf{x}^{\star}\rangle). (15)

On the other hand, we have

β€–π±βˆ’π±β‹†β€–2=\displaystyle\|\mathbf{x}-\mathbf{x}^{\star}\|^{2}={} ‖𝐱Bβˆ’π±B⋆‖2+‖𝐱Nβˆ’π±N⋆‖2\displaystyle\|\mathbf{x}_{B}-\mathbf{x}_{B}^{\star}\|^{2}+\|\mathbf{x}_{N}-\mathbf{x}_{N}^{\star}\|^{2}
=\displaystyle={} ‖𝐀Bβˆ’1​𝐀N​𝐱Nβ€–2+‖𝐱Nβ€–2\displaystyle\|\mathbf{A}_{B}^{-1}\mathbf{A}_{N}\mathbf{x}_{N}\|^{2}+\|\mathbf{x}_{N}\|^{2} (16)
=\displaystyle={} ⟨𝐱N,(𝐀NβŠ€β€‹π€Bβˆ’βŠ€β€‹π€Bβˆ’1​𝐀N+𝐈)​𝐱N⟩\displaystyle\langle\mathbf{x}_{N},(\mathbf{A}_{N}^{\top}\mathbf{A}_{B}^{-\top}\mathbf{A}_{B}^{-1}\mathbf{A}_{N}+\mathbf{I})\mathbf{x}_{N}\rangle
≀\displaystyle\leq{} (‖𝐀Nβ€–2Οƒmin​(𝐀B)2+1)​‖𝐱Nβ€–2\displaystyle\big{(}\tfrac{\|\mathbf{A}_{N}\|^{2}}{\sigma_{\min}(\mathbf{A}_{B})^{2}}+1\big{)}\|\mathbf{x}_{N}\|^{2}
≀\displaystyle\leq{} (‖𝐀Nβ€–2Οƒmin​(𝐀B)2+1)​(⟨𝐜,π±βŸ©βˆ’βŸ¨πœ,π±β‹†βŸ©β€–π¬Nβ‹†β€–βˆ’βˆž)2,\displaystyle\big{(}\tfrac{\|\mathbf{A}_{N}\|^{2}}{\sigma_{\min}(\mathbf{A}_{B})^{2}}+1\big{)}\big{(}\tfrac{\langle\mathbf{c},\mathbf{x}\rangle-\langle\mathbf{c},\mathbf{x}^{\star}\rangle}{\|\mathbf{s}_{N}^{\star}\|_{-\infty}}\big{)}^{2}, (17)

where (16) again plugs in 𝐱B⋆=𝐱B+𝐀Bβˆ’1​𝐀N​𝐱N\mathbf{x}_{B}^{\star}=\mathbf{x}_{B}+\mathbf{A}_{B}^{-1}\mathbf{A}_{N}\mathbf{x}_{N} and 𝐱N⋆=𝟎\mathbf{x}_{N}^{\star}=\mathbf{0}; (17) uses the relation (15). Taking square-root on both sides gives

β€–π±βˆ’π±β‹†β€–β‰€(‖𝐀Nβ€–2Οƒmin​(𝐀B)2+1)1/2​1‖𝐬Nβ‹†β€–βˆ’βˆžβ€‹[⟨𝐜,π±βŸ©βˆ’βŸ¨πœ,π±β‹†βŸ©]≀‖𝐀Nβ€–+Οƒmin​(𝐀B)Οƒmin​(𝐀B)​‖𝐬Nβ‹†β€–βˆ’βˆžβ€‹[⟨𝐜,π±βŸ©βˆ’βŸ¨πœ,π±β‹†βŸ©]\|\mathbf{x}-\mathbf{x}^{\star}\|\leq(\tfrac{\|\mathbf{A}_{N}\|^{2}}{\sigma_{\min}(\mathbf{A}_{B})^{2}}+1)^{1/2}\tfrac{1}{\|\mathbf{s}_{N}^{\star}\|_{-\infty}}[\langle\mathbf{c},\mathbf{x}\rangle-\langle\mathbf{c},\mathbf{x}^{\star}\rangle]\leq\tfrac{\|\mathbf{A}_{N}\|+\sigma_{\min}(\mathbf{A}_{B})}{\sigma_{\min}(\mathbf{A}_{B})\|\mathbf{s}_{N}^{\star}\|_{-\infty}}[\langle\mathbf{c},\mathbf{x}\rangle-\langle\mathbf{c},\mathbf{x}^{\star}\rangle]

and another re-arrangement of the inequality completes the proof. ∎

Lemma A.3 (Learning algorithm for HΓΆlder growth [31]).

Consider stochastic optimization problem minπ²βˆˆπ’΄β‘f​(𝐲):=𝔼ξ​[f​(𝐲,ΞΎ)]\min_{\mathbf{y}\in\mathcal{Y}}f(\mathbf{y}):=\mathbb{E}_{\xi}[f(\mathbf{y},\xi)] with optimal set 𝒴⋆\mathcal{Y}^{\star} and suppose the following conditions hold:

  1. there exists some 𝐲1βˆˆπ’΄\mathbf{y}^{1}\in\mathcal{Y} such that f​(𝐲1)βˆ’f​(𝐲⋆)≀Ρ0f(\mathbf{y}^{1})-f(\mathbf{y}^{\star})\leq\varepsilon_{0},

  2. 𝒴⋆\mathcal{Y}^{\star} is a nonempty compact set,

  3. there exists some constant GG such that β€–f′​(𝐲,ΞΎ)‖≀G\|f^{\prime}(\mathbf{y},\xi)\|\leq G for all ΞΎ\xi,

  4. there exist constants Ξ»>0\lambda>0 and θ∈(0,1]\theta\in(0,1] such that for all π²βˆˆπ’΄\mathbf{y}\in\mathcal{Y},

     f​(𝐲)βˆ’f​(𝐲⋆)β‰₯Ξ»β‹…dist​(𝐲,𝒴⋆)1/ΞΈ.f(\mathbf{y})-f(\mathbf{y}^{\star})\geq\lambda\cdot\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})^{1/\theta}.

Then, there is a first-order method (AlgorithmΒ 4, Algorithm 1, 2, and 4 of [31]) that outputs f​(𝐲¯T+1)βˆ’f​(𝐲⋆)≀Ρf(\bar{\mathbf{y}}^{T+1})-f(\mathbf{y}^{\star})\leq\varepsilon after

TΞ΅β‰₯{max⁑{9,1728​{log⁑(1Ξ΄)+log⁑⌈log2⁑(2​Ρ0Ξ΅)βŒ‰}}​22​(1βˆ’ΞΈ)β€‹Ξ»βˆ’2​θ​G2Ξ΅2​(1βˆ’ΞΈ)+1}β€‹βŒˆlog2⁑(2​Ρ0Ξ΅)βŒ‰T_{\varepsilon}\geq\{\textstyle\max\{9,1728\{\log(\tfrac{1}{\delta})+\log\lceil\log_{2}(\tfrac{2\varepsilon_{0}}{\varepsilon})\rceil\}\}\tfrac{2^{2(1-\theta)}\lambda^{-2\theta}G^{2}}{\varepsilon^{2(1-\theta)}}+1\}\lceil\log_{2}(\tfrac{2\varepsilon_{0}}{\varepsilon})\rceil

iterations with probability at least 1βˆ’Ξ΄1-\delta.

Lemma A.4 (Last-iterate convergence of stochastic subgradient [20]).

Consider stochastic optimization problem min𝐲β‰₯𝟎⁑f​(𝐲):=𝔼ξ​[f​(𝐲,ΞΎ)]\min_{\mathbf{y}\geq\mathbf{0}}f(\mathbf{y}):=\mathbb{E}_{\xi}[f(\mathbf{y},\xi)]. Suppose the following conditions hold:

  1. There exists Mβ‰₯0M\geq 0 such that

     f​(𝐱)βˆ’f​(𝐲)βˆ’βŸ¨f′​(𝐲),π±βˆ’π²βŸ©β‰€Mβ€‹β€–π±βˆ’π²β€–f(\mathbf{x})-f(\mathbf{y})-\langle f^{\prime}(\mathbf{y}),\mathbf{x}-\mathbf{y}\rangle\leq M\|\mathbf{x}-\mathbf{y}\|

     for all 𝐱,𝐲\mathbf{x},\mathbf{y} and f′​(𝐲)βˆˆβˆ‚f​(𝐲)f^{\prime}(\mathbf{y})\in\partial f(\mathbf{y}),

  2. It is possible to compute 𝐠𝐲\mathbf{g}_{\mathbf{y}} such that 𝔼​[𝐠𝐲]=f′​(𝐲)\mathbb{E}[\mathbf{g}_{\mathbf{y}}]=f^{\prime}(\mathbf{y}),

  3. 𝔼​[β€–π π²βˆ’f′​(𝐲)β€–2]≀σ2\mathbb{E}[\|\mathbf{g}_{\mathbf{y}}-f^{\prime}(\mathbf{y})\|^{2}]\leq\sigma^{2}.

Then, the last iterate of the projected subgradient method with stepsize Ξ±\alpha: 𝐲t+1=[𝐲tβˆ’Ξ±β€‹π t]+\mathbf{y}^{t+1}=[\mathbf{y}^{t}-\alpha\mathbf{g}^{t}]_{+} satisfies

𝔼​[f​(𝐲T+1)βˆ’f​(𝐲)]≀‖𝐲1βˆ’π²β€–2T​α+2​α​(M2+Οƒ2)​(1+log⁑T)\mathbb{E}[f(\mathbf{y}^{T+1})-f(\mathbf{y})]\leq\tfrac{\|\mathbf{y}^{1}-\mathbf{y}\|^{2}}{T\alpha}+2\alpha(M^{2}+\sigma^{2})(1+\log T)

for all 𝐲β‰₯𝟎\mathbf{y}\geq\mathbf{0}.

LemmaΒ A.4 is an application of Theorem C.1, equation (24) of [20] with L=0,h​(𝐲)=0L=0,h(\mathbf{y})=0 and Οˆβ€‹(𝐱)=12​‖𝐱‖2\psi(\mathbf{x})=\frac{1}{2}\|\mathbf{x}\|^{2}.

A.2 Dual learning algorithm

We include two algorithms in [31] that can exploit A4 and achieve the sample complexity in LemmaΒ 3.1. AlgorithmΒ 4 is the baseline algorithm and AlgorithmΒ 5 is its parameter-free variant that adapts to unknown Ξ»\lambda. Note that the algorithm has an explicit projection routine onto 𝒴′={𝐲β‰₯𝟎:‖𝐲‖≀cΒ―dΒ―}\mathcal{Y}^{\prime}=\{\mathbf{y}\geq\mathbf{0}:\|\mathbf{y}\|\leq\frac{\bar{c}}{\underline{d}}\}. According to [31], given parameters (Ξ΄,Ξ΅,Ξ΅0,Ξ³,G)(\delta,\varepsilon,\varepsilon_{0},\gamma,G), AlgorithmΒ 4 is configured as follows:

K=⌈log2⁑(2​Ρ0Ξ΅)βŒ‰,D1=21βˆ’Ξ³β€‹Ξ»βˆ’ΞΈβ€‹Ξ΅0Ξ΅1βˆ’Ξ³,t=max⁑{9,1728​log⁑(KΞ΄)}​G2​D12Ξ΅02.K=\lceil\log_{2}(\tfrac{2\varepsilon_{0}}{\varepsilon})\rceil,\quad D_{1}=\tfrac{2^{1-\gamma}\lambda^{-\theta}\varepsilon_{0}}{\varepsilon^{1-\gamma}},\quad t=\max\{9,1728\log(\tfrac{K}{\delta})\}\tfrac{G^{2}D_{1}^{2}}{\varepsilon_{0}^{2}}.
Input: Initial point 𝐲0βˆˆπ’΄β€²={𝐲β‰₯𝟎:‖𝐲‖≀cΒ―dΒ―}\mathbf{y}_{0}\in\mathcal{Y}^{\prime}=\{\mathbf{y}\geq\mathbf{0}:\|\mathbf{y}\|\leq\frac{\bar{c}}{\underline{d}}\}, outer iteration count KK, inner iteration count tt, initial error estimate Ξ΅0\varepsilon_{0}, initial diameter D1D_{1}, Lipschitz constant GG
Set Ξ·1=Ξ΅03​G2\eta_{1}=\frac{\varepsilon_{0}}{3G^{2}}
forΒ k=1,2,…,Kk=1,2,\ldots,KΒ do
Β Β Β Β Β Β  Let 𝐲1k=𝐲kβˆ’1\mathbf{y}_{1}^{k}=\mathbf{y}_{k-1}
Β Β Β Β Β Β forΒ Ο„=1,2,…,tβˆ’1\tau=1,2,\ldots,t-1Β do
Β Β Β Β Β Β Β Β Β Β Β Β  𝐲τ+1k=βˆπ’΄βˆ©β„¬β€‹(𝐲kβˆ’1,Dk)[𝐲τkβˆ’Ξ·k​𝐠𝐲τk]\mathbf{y}_{\tau+1}^{k}=\prod_{\mathcal{\mathcal{Y}}\cap\mathcal{B}(\mathbf{y}_{k-1},D_{k})}[\mathbf{y}^{k}_{\tau}-\eta_{k}\mathbf{g}_{\mathbf{y}_{\tau}^{k}}]
Β Β Β Β Β Β  end for
Β Β Β Β Β Β 
Β Β Β Β Β Β Let 𝐲k=1tβ€‹βˆ‘Ο„=1t𝐲τk\mathbf{y}_{k}=\frac{1}{t}\sum_{\tau=1}^{t}\mathbf{y}_{\tau}^{k}
Β Β Β Β Β Β Let Ξ·k+1=12​ηk\eta_{k+1}=\frac{1}{2}\eta_{k} and Dk+1=12​DkD_{k+1}=\frac{1}{2}D_{k}
end for
Output: 𝐲K\mathbf{y}_{K}
AlgorithmΒ 4 Accelerated Stochastic SubGradient Method (ASSG)
Input: Initial point 𝐲0βˆˆπ’΄β€²={𝐲β‰₯𝟎:‖𝐲‖≀cΒ―dΒ―}\mathbf{y}^{0}\in\mathcal{Y}^{\prime}=\{\mathbf{y}\geq\mathbf{0}:\|\mathbf{y}\|\leq\frac{\bar{c}}{\underline{d}}\}, outer iteration count KK, initial distance D1(1)D_{1}^{(1)}, inner iteration count t1t_{1}, initial error estimate Ξ΅0\varepsilon_{0} and Ο‰βˆˆ(0,1]\omega\in(0,1], error bound parameter Ξ³\gamma, restart round SS, Lipschitz constant GG
Set Ξ΅0(1)=Ξ΅0\varepsilon_{0}^{(1)}=\varepsilon_{0}, Ξ·1=Ξ΅03​G2\eta_{1}=\frac{\varepsilon_{0}}{3G^{2}}
forΒ s=1,2,…,Ss=1,2,\ldots,SΒ do
Β Β Β Β Β Β  𝐲(s)←\mathbf{y}^{(s)}\leftarrow ASSG(𝐲(sβˆ’1),K,ts,D1(s),Ξ΅0(s))(\mathbf{y}^{(s-1)},K,t_{s},D_{1}^{(s)},\varepsilon_{0}^{(s)})
Β Β Β Β Β Β Let ts+1=ts​22​(1βˆ’Ξ³βˆ’1),D1(s+1)=D1(s)​21βˆ’Ξ³βˆ’1t_{s+1}=t_{s}2^{2(1-\gamma^{-1})},D_{1}^{(s+1)}=D_{1}^{(s)}2^{1-\gamma^{-1}}, and Ξ΅0(s+1)=ω​Ρ0(s)\varepsilon_{0}^{(s+1)}=\omega\varepsilon_{0}^{(s)}
end for
Output: 𝐲(S)\mathbf{y}^{(S)}
AlgorithmΒ 5 ASSG with Restart (RASSG)

A.3 Verification of the examples

A.3.1 Continuous support

The result is a direct application of Proposition 2 of [19].

A.3.2 Finite support

Denote {(ΞΎk,𝜢k)}k=1K\{(\xi_{k},\bm{\alpha}_{k})\}_{k=1}^{K} to be the support of LP data associated with distribution π©βˆˆβ„K\mathbf{p}\in\mathbb{R}^{K}. i.e., there are KK types of customers and customers of type kk arrive with probability pkp_{k}. We can write the expected dual problem as

min(𝐲,𝝈)β‰₯𝟎⟨𝐝,𝐲⟩+βˆ‘k=1Kpk​σksubject toΟƒkβ‰₯ΞΎkβˆ’βŸ¨πœΆk,𝐲⟩,k∈[K].\min_{(\mathbf{y},\bm{\sigma})\geq\mathbf{0}}\quad\langle\mathbf{d},\mathbf{y}\rangle+\textstyle\sum_{k=1}^{K}p_{k}\sigma_{k}\quad\text{subject to}\quad\sigma_{k}\geq\xi_{k}-\langle\bm{\alpha}_{k},\mathbf{y}\rangle,\;k\in[K].

More compactly, we introduce the slack π€βˆˆβ„K\bm{\lambda}\in\mathbb{R}^{K} and define 𝐟:=(𝐝;𝐩;𝟎),𝐳:=(𝐲;𝝈;𝝀)β‰₯𝟎,𝐐:=(π€βŠ€,𝐈,βˆ’πˆ){{\mathbf{f}}}:=(\mathbf{d};\mathbf{p};\mathbf{0}),\mathbf{z}:=(\mathbf{y};\bm{\sigma};\bm{\lambda})\geq\mathbf{0},\mathbf{Q}:=(\mathbf{A}^{\top},\mathbf{I},-\mathbf{I}), where 𝐀=(𝜢1,…,𝜢K)\mathbf{A}=(\bm{\alpha}_{1},\ldots,\bm{\alpha}_{K}). Then the dual problem can be written in standard form:

min𝐳β‰₯𝟎⟨𝐟,𝐳⟩subject to𝐐𝐳=𝝃.\min_{\mathbf{z}\geq\mathbf{0}}\quad\langle{{\mathbf{f}}},\mathbf{z}\rangle\quad\text{subject to}\quad\mathbf{Q}\mathbf{z}=\bm{\xi}. (18)

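As a concrete illustration of this reformulation, the sketch below assembles the standard-form data (𝐟,𝐐,𝝃)(\mathbf{f},\mathbf{Q},\bm{\xi}) of (18) from the support points; the function name and the convention that the kk-th column of 𝐀\mathbf{A} is 𝜢k\bm{\alpha}_{k} are our own illustrative choices.

import numpy as np

def finite_support_dual_lp(xi, alpha, p, d):
    # xi: (K,) support rewards; alpha: (m, K) support columns; p: (K,) probabilities; d: (m,) capacities
    m, K = alpha.shape
    f = np.concatenate([d, p, np.zeros(K)])           # objective (d; p; 0) over z = (y; sigma; lambda)
    Q = np.hstack([alpha.T, np.eye(K), -np.eye(K)])   # equality constraint matrix (A^T, I, -I)
    return f, Q, xi                                   # solve: min <f, z>  s.t.  Q z = xi,  z >= 0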
When diam​(𝒴⋆)>0\mathrm{diam}(\mathcal{Y}^{\star})>0, the result is an application of weak sharp minima to LP [8]. When the primal-dual problems are both non-degenerate, 𝒴⋆={𝐲⋆}\mathcal{Y}^{\star}=\{\mathbf{y}^{\star}\} and applying Lemma A.2, we get the following error bound in terms of the LP optimal basis.

Lemma A.5.

Let (B,N)(B,N) denote the optimal basis partition for (18) and let 𝐬N\mathbf{s}_{N} denote the dual slack of primal variables 𝐳\mathbf{z}, then

⟨𝐟,π³βŸ©βˆ’βŸ¨πŸ,π³β‹†βŸ©β‰₯ΞΌβ€‹β€–π³βˆ’π³β‹†β€–,\langle{{\mathbf{f}}},\mathbf{z}\rangle-\langle{{\mathbf{f}}},\mathbf{z}^{\star}\rangle\geq\mu\|\mathbf{z}-\mathbf{z}^{\star}\|,

where ΞΌ=Οƒmin​(𝐐B)​‖𝐬Nβ€–βˆ’βˆžβ€–πNβ€–+Οƒmin​(𝐐B)\mu=\frac{\sigma_{\min}(\mathbf{Q}_{B})\|\mathbf{s}_{N}\|_{-\infty}}{\|\mathbf{Q}_{N}\|+\sigma_{\min}(\mathbf{Q}_{B})}. Moreover, we have f​(𝐲)βˆ’f​(𝐲⋆)β‰₯ΞΌβ€‹β€–π²βˆ’π²β‹†β€–f(\mathbf{y})-f(\mathbf{y}^{\star})\geq\mu\|\mathbf{y}-\mathbf{y}^{\star}\|.

Proof.

⟨𝐟,π³βŸ©βˆ’βŸ¨πŸ,π³β‹†βŸ©β‰₯ΞΌβ€‹β€–π³βˆ’π³β‹†β€–\langle{{\mathbf{f}}},\mathbf{z}\rangle-\langle{{\mathbf{f}}},\mathbf{z}^{\star}\rangle\geq\mu\|\mathbf{z}-\mathbf{z}^{\star}\| follows from LemmaΒ A.2 applied to the compact LP formulation. Next, define 𝐳𝐲:=(𝐲;𝝈𝐲;𝝀𝐲)\mathbf{z}_{\mathbf{y}}:=(\mathbf{y};\bm{\sigma}_{\mathbf{y}};\bm{\lambda}_{\mathbf{y}}) where σ𝐲=[πƒβˆ’βˆ‘k=1K𝜢i​yi]+\sigma_{\mathbf{y}}=[\bm{\xi}-\textstyle\sum_{k=1}^{K}\bm{\alpha}_{i}y_{i}]_{+} and 𝝀𝐲=Οƒπ²βˆ’πƒ+βˆ‘k=1K𝜢i​yi\bm{\lambda}_{\mathbf{y}}=\sigma_{\mathbf{y}}-\bm{\xi}+\textstyle\sum_{k=1}^{K}\bm{\alpha}_{i}y_{i}. We deduce

f​(𝐲)βˆ’f​(𝐲⋆)=⟨𝐟,π³π²βŸ©βˆ’βŸ¨πŸ,π³β‹†βŸ©β‰₯ΞΌβ€‹β€–π³π²βˆ’π³β‹†β€–β‰₯ΞΌβ€‹β€–π²βˆ’π²β‹†β€–f(\mathbf{y})-f(\mathbf{y}^{\star})=\langle\mathbf{f},\mathbf{z}_{\mathbf{y}}\rangle-\langle\mathbf{f},\mathbf{z}^{\star}\rangle\geq\mu\|\mathbf{z}_{\mathbf{y}}-\mathbf{z}^{\star}\|\geq\mu\|\mathbf{y}-\mathbf{y}^{\star}\|

and this completes the proof. ∎

A.3.3 General growth

Given π²β‹†βˆˆarg⁑min𝐲⁑f​(𝐲)βŠ†int(𝒴)\mathbf{y}^{\star}\in\arg\min_{\mathbf{y}}f(\mathbf{y})\subseteq\operatornamewithlimits{int}(\mathcal{Y}), by optimality condition, 𝟎=πβˆ’π”Όβ€‹[πšβ€‹π•€β€‹{cβ‰₯⟨𝐚,π²β‹†βŸ©}]\mathbf{0}=\mathbf{d}-\mathbb{E}[\mathbf{a}\mathbb{I}\{c\geq\langle\mathbf{a},\mathbf{y}^{\star}\rangle\}] and

𝐝=βˆ«πšβ€‹βˆ«βŸ¨πš,π²β‹†βŸ©βˆždF​(c|𝐚)​dF​(𝐚),\mathbf{d}=\textstyle\int\mathbf{a}\textstyle\int_{\langle\mathbf{a},\mathbf{y}^{\star}\rangle}^{\infty}\mathrm{d}F(c|\mathbf{a})\mathrm{d}F(\mathbf{a}),

where F​(c,𝐚)F(c,\mathbf{a}) denotes the cdf. of the distribution of (c,𝐚)(c,\mathbf{a}). Then we deduce that

f​(𝐲)βˆ’f​(𝐲⋆)=\displaystyle f(\mathbf{y})-f(\mathbf{y}^{\star})={} ⟨𝐝,π²βˆ’π²β‹†βŸ©+𝔼​[[cβˆ’βŸ¨πš,𝐲⟩]+βˆ’[cβˆ’βŸ¨πš,π²β‹†βŸ©]+]\displaystyle\langle\mathbf{d},\mathbf{y}-\mathbf{y}^{\star}\rangle+\mathbb{E}[[c-\langle\mathbf{a},\mathbf{y}\rangle]_{+}-[c-\langle\mathbf{a},\mathbf{y}^{\star}\rangle]_{+}]
=\displaystyle={} ∫∫⟨𝐚,π²β‹†βŸ©βˆžβŸ¨πš,π²βˆ’π²β‹†βŸ©β€‹dF​(c|𝐚)​dF​(𝐚)+∫∫⟨𝐚,𝐲⟩⟨𝐚,π²β‹†βŸ©dF​(c|𝐚)​dF​(𝐚)\displaystyle\textstyle\int\textstyle\int_{\langle\mathbf{a},\mathbf{y}^{\star}\rangle}^{\infty}\langle\mathbf{a},\mathbf{y}-\mathbf{y}^{\star}\rangle\mathrm{d}F(c|\mathbf{a})\mathrm{d}F(\mathbf{a})+\textstyle\int\textstyle\int_{\langle\mathbf{a},\mathbf{y}\rangle}^{\langle\mathbf{a},\mathbf{y}^{\star}\rangle}\mathrm{d}F(c|\mathbf{a})\mathrm{d}F(\mathbf{a})
=\displaystyle={} ∫∫⟨𝐚,𝐲⟩⟨𝐚,π²β‹†βŸ©π•€β€‹{cβ‰₯v}β€‹βŸ¨πš,π²βˆ’π²β‹†βŸ©β€‹dv​dF​(c,𝐚)+∫∫⟨𝐚,𝐲⟩⟨𝐚,π²β‹†βŸ©dF​(c|𝐚)​dF​(𝐚)\displaystyle\textstyle\int\textstyle\int_{\langle\mathbf{a},\mathbf{y}\rangle}^{\langle\mathbf{a},\mathbf{y}^{\star}\rangle}\mathbb{I}\{c\geq v\}\langle\mathbf{a},\mathbf{y}-\mathbf{y}^{\star}\rangle\mathrm{d}v\mathrm{d}F(c,\mathbf{a})+\textstyle\int\textstyle\int_{\langle\mathbf{a},\mathbf{y}\rangle}^{\langle\mathbf{a},\mathbf{y}^{\star}\rangle}\mathrm{d}F(c|\mathbf{a})\mathrm{d}F(\mathbf{a})
=\displaystyle={} ∫∫⟨𝐚,𝐲⟩⟨𝐚,π²β‹†βŸ©π•€β€‹{cβ‰₯v}βˆ’π•€β€‹{cβ‰₯⟨𝐚,π²β‹†βŸ©}​d​v​d​F​(c,𝐚).\displaystyle\textstyle\int\textstyle\int_{\langle\mathbf{a},\mathbf{y}\rangle}^{\langle\mathbf{a},\mathbf{y}^{\star}\rangle}\mathbb{I}\{c\geq v\}-\mathbb{I}\{c\geq\langle\mathbf{a},\mathbf{y}^{\star}\rangle\}\mathrm{d}v\mathrm{d}F(c,\mathbf{a}).

Next, we invoke the assumptions and

∫∫⟨𝐚,𝐲⟩⟨𝐚,π²β‹†βŸ©π•€β€‹{cβ‰₯v}βˆ’π•€β€‹{cβ‰₯⟨𝐚,π²β‹†βŸ©}​d​v​d​F​(c,𝐚)\displaystyle\textstyle\int\textstyle\int_{\langle\mathbf{a},\mathbf{y}\rangle}^{\langle\mathbf{a},\mathbf{y}^{\star}\rangle}\mathbb{I}\{c\geq v\}-\mathbb{I}\{c\geq\langle\mathbf{a},\mathbf{y}^{\star}\rangle\}~{}\mathrm{d}v\mathrm{d}F(c,\mathbf{a})
β‰₯\displaystyle\geq{} Ξ»52β€‹βˆ«βˆ«βŸ¨πš,𝐲⟩⟨𝐚,π²β‹†βŸ©|⟨𝐚,π²β‹†βŸ©βˆ’v|p​dv​dF​(𝐚)\displaystyle\tfrac{\lambda_{5}}{2}\textstyle\int\textstyle\int_{\langle\mathbf{a},\mathbf{y}\rangle}^{\langle\mathbf{a},\mathbf{y}^{\star}\rangle}|\langle\mathbf{a},\mathbf{y}^{\star}\rangle-v|^{p}\mathrm{d}v\mathrm{d}F(\mathbf{a})
=\displaystyle={} Ξ»52​(p+1)​𝔼​[|⟨𝐚,π²βˆ’π²β‹†βŸ©|p+1]\displaystyle\tfrac{\lambda_{5}}{2(p+1)}\mathbb{E}[|\langle\mathbf{a},\mathbf{y}-\mathbf{y}^{\star}\rangle|^{p+1}]
β‰₯\displaystyle\geq{} Ξ»52​(p+1)​𝔼​[|⟨𝐚,π²βˆ’π²β‹†βŸ©|]p+1\displaystyle\tfrac{\lambda_{5}}{2(p+1)}\mathbb{E}[|\langle\mathbf{a},\mathbf{y}-\mathbf{y}^{\star}\rangle|]^{p+1} (19)
β‰₯\displaystyle\geq{} Ξ»4p+1​λ52​(p+1)β€‹β€–π²βˆ’π²β‹†β€–p+1,\displaystyle\tfrac{\lambda_{4}^{p+1}\lambda_{5}}{2(p+1)}\|\mathbf{y}-\mathbf{y}^{\star}\|^{p+1},

where (19) uses pβ‰₯0p\geq 0 and Jensen's inequality 𝔼​[|X|p+1]β‰₯𝔼​[|X|]p+1\mathbb{E}[|X|^{p+1}]\geq\mathbb{E}[|X|]^{p+1}. Since β€–π²βˆ’π²β‹†β€–p+1>0\|\mathbf{y}-\mathbf{y}^{\star}\|^{p+1}>0 for 𝐲≠𝐲⋆\mathbf{y}\neq\mathbf{y}^{\star}, this completes the proof.

A.4 Proof of Lemma 3.1

We verify the conditions in LemmaΒ A.3.

Condition 1. Take 𝐲1=πŸŽβˆˆπ’΄\mathbf{y}^{1}=\mathbf{0}\in\mathcal{Y}. Then

f​(𝐲1)βˆ’f​(𝐲⋆)≀f​(𝐲1)=𝔼​[⟨𝐝,𝐲1⟩+[cβˆ’βŸ¨πš,𝐲1⟩]+]=𝔼​[[c]+]≀cΒ―,f(\mathbf{y}^{1})-f(\mathbf{y}^{\star})\leq f(\mathbf{y}^{1})=\mathbb{E}[\langle\mathbf{d},\mathbf{y}^{1}\rangle+[c-\langle\mathbf{a},\mathbf{y}^{1}\rangle]_{+}]=\mathbb{E}[[c]_{+}]\leq\bar{c},

where the first inequality holds since f​(𝐲⋆)β‰₯0f(\mathbf{y}^{\star})\geq 0.

Condition 2 holds since π’΄β‹†βŠ†π’΄\mathcal{Y}^{\star}\subseteq\mathcal{Y}, 𝒴⋆\mathcal{Y}^{\star} is closed and 𝒴\mathcal{Y} is a compact set.

Condition 3 holds since 𝐠𝐲=πβˆ’πšβ€‹π•€β€‹{cβ‰₯⟨𝐚,𝐲⟩}\mathbf{g}_{\mathbf{y}}=\mathbf{d}-\mathbf{a}\mathbb{I}\{c\geq\langle\mathbf{a},\mathbf{y}\rangle\} and ‖𝐠𝐲‖≀m​(aΒ―+dΒ―)\|\mathbf{g}_{\mathbf{y}}\|\leq\sqrt{m}(\bar{a}+\bar{d}). Hence we can take G=m​(aΒ―+dΒ―)G=\sqrt{m}(\bar{a}+\bar{d}).

Condition 4 holds by the dual error bound condition f​(𝐲)βˆ’f​(𝐲⋆)β‰₯ΞΌβ‹…dist​(𝐲,𝒴⋆)Ξ³f(\mathbf{y})-f(\mathbf{y}^{\star})\geq\mu\cdot\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})^{\gamma} with Ξ»=ΞΌ\lambda=\mu and ΞΈ=1/Ξ³\theta=1/\gamma.

Now invoke LemmaΒ A.3 and we get that, after

TΞ΅β‰₯\displaystyle T_{\varepsilon}\geq{} {max⁑{9,1728​{log⁑(1Ξ΄)+log⁑⌈log2⁑(2​cΒ―Ξ΅)βŒ‰}}​22​(1βˆ’Ξ³βˆ’1)β€‹ΞΌβˆ’2/γ​m​(aΒ―+dΒ―)2Ξ΅2​(1βˆ’Ξ³βˆ’1)+1}β€‹βŒˆlog2⁑(2​cΒ―Ξ΅)βŒ‰\displaystyle\{\textstyle\max\{9,1728\{\log(\tfrac{1}{\delta})+\log\lceil\log_{2}(\tfrac{2\bar{c}}{\varepsilon})\rceil\}\}\tfrac{2^{2(1-\gamma^{-1})}\mu^{-2/\gamma}m(\bar{a}+\bar{d})^{2}}{\varepsilon^{2(1-\gamma^{-1})}}+1\}\lceil\log_{2}(\tfrac{2\bar{c}}{\varepsilon})\rceil
=\displaystyle={} π’ͺ​(Ξ΅βˆ’2​(1βˆ’Ξ³βˆ’1)​log⁑(1Ξ΄)​log⁑(1Ξ΅))\displaystyle\mathcal{O}(\varepsilon^{-2(1-\gamma^{-1})}\log(\tfrac{1}{\delta})\log(\tfrac{1}{\varepsilon}))

iterations, the algorithm outputs 𝐲¯T+1\bar{\mathbf{y}}^{T+1} such that with probability at least 1βˆ’Ξ΄1-\delta,

ΞΌβ‹…dist​(𝐲¯T+1,𝒴⋆)γ≀f​(𝐲¯T+1)βˆ’f​(𝐲⋆)≀Ρ\mu\cdot\mathrm{dist}(\bar{\mathbf{y}}^{T+1},\mathcal{Y}^{\star})^{\gamma}\leq f(\bar{\mathbf{y}}^{T+1})-f(\mathbf{y}^{\star})\leq\varepsilon

and this completes the proof.

A.5 Proof of Lemma 3.2

We verify the conditions in LemmaΒ A.4.

Condition 1. Since f​(𝐲)f(\mathbf{y}) is convex and has Lipschitz constant m​(aΒ―+dΒ―)\sqrt{m}(\bar{a}+\bar{d}), we take M=2​m​(aΒ―+dΒ―)M=2\sqrt{m}(\bar{a}+\bar{d}) and deduce that

f​(𝐱)βˆ’f​(𝐲)βˆ’βŸ¨f′​(𝐲),π±βˆ’π²βŸ©β‰€\displaystyle f(\mathbf{x})-f(\mathbf{y})-\langle f^{\prime}(\mathbf{y}),\mathbf{x}-\mathbf{y}\rangle\leq{} m​(aΒ―+dΒ―)β€‹β€–π±βˆ’π²β€–+β€–f′​(𝐲)β€–β‹…β€–π±βˆ’π²β€–\displaystyle\sqrt{m}(\bar{a}+\bar{d})\|\mathbf{x}-\mathbf{y}\|+\|f^{\prime}(\mathbf{y})\|\cdot\|\mathbf{x}-\mathbf{y}\|
≀\displaystyle\leq{} 2​m​(aΒ―+dΒ―)β€‹β€–π±βˆ’π²β€–\displaystyle 2\sqrt{m}(\bar{a}+\bar{d})\|\mathbf{x}-\mathbf{y}\|
=\displaystyle={} Mβ€‹β€–π±βˆ’π²β€–.\displaystyle M\|\mathbf{x}-\mathbf{y}\|.

Condition 2 holds in the stochastic i.i.d. input setting.

Condition 3 holds by taking Οƒ2=4​m​(aΒ―+dΒ―)2\sigma^{2}=4m(\bar{a}+\bar{d})^{2} and noticing that

𝔼​[β€–π π²βˆ’f′​(𝐲)β€–2]≀2​𝔼​[‖𝐠𝐲‖2]+2​‖f′​(𝐲)β€–2≀4​m​(aΒ―+dΒ―)2.\mathbb{E}[\|\mathbf{g}_{\mathbf{y}}-f^{\prime}(\mathbf{y})\|^{2}]\leq 2\mathbb{E}[\|\mathbf{g}_{\mathbf{y}}\|^{2}]+2\|f^{\prime}(\mathbf{y})\|^{2}\leq 4m(\bar{a}+\bar{d})^{2}.

Next, we invoke LemmaΒ A.4 and obtain the following last-iterate convergence bound for Tβ‰₯3T\geq 3:

𝔼​[f​(𝐲T+1)βˆ’f​(𝐲)]≀\displaystyle\mathbb{E}[f(\mathbf{y}^{T+1})-f(\mathbf{y})]\leq{} ‖𝐲1βˆ’π²β€–2T​α+2​α​(M2+Οƒ2)​(1+log⁑T)\displaystyle\tfrac{\|\mathbf{y}^{1}-\mathbf{y}\|^{2}}{T\alpha}+2\alpha(M^{2}+\sigma^{2})(1+\log T)
≀\displaystyle\leq{} ‖𝐲1βˆ’π²β€–2T​α+16​α​m​(aΒ―+dΒ―)2​(1+log⁑T)\displaystyle\tfrac{\|\mathbf{y}^{1}-\mathbf{y}\|^{2}}{T\alpha}+16\alpha m(\bar{a}+\bar{d})^{2}(1+\log T) (20)
≀\displaystyle\leq{} ‖𝐲1βˆ’π²β€–2T​α+32​α​m​(aΒ―+dΒ―)2​log⁑T,\displaystyle\tfrac{\|\mathbf{y}^{1}-\mathbf{y}\|^{2}}{T\alpha}+32\alpha m(\bar{a}+\bar{d})^{2}\log T,

where (20) plugs in M=2​m​(aΒ―+dΒ―)M=2\sqrt{m}(\bar{a}+\bar{d}) and Οƒ2=4​m​(aΒ―+dΒ―)2\sigma^{2}=4m(\bar{a}+\bar{d})^{2}. Taking 𝐲=Π𝒴⋆​(𝐲1)\mathbf{y}=\Pi_{\mathcal{Y}^{\star}}(\mathbf{y}^{1}) completes the proof.

A.6 Proof of Lemma 3.3

By definition and the fact that π’΄β‹†βŠ†π’΄\mathcal{Y}^{\star}\subseteq\mathcal{Y},

π²β‹†βˆˆarg⁑minπ²βˆˆπ’΄β‘f​(𝐲)and𝐲Tβ‹†βˆˆarg⁑minπ²βˆˆπ’΄β‘fT​(𝐲).\mathbf{y}^{\star}\in\arg\min_{\mathbf{y}\in\mathcal{Y}}~{}f(\mathbf{y})\qquad\text{and}\qquad\mathbf{y}_{T}^{\star}\in\arg\min_{\mathbf{y}\in\mathcal{Y}}~{}f_{T}(\mathbf{y}).

According to A4, f​(𝐲⋆)≀f​(𝐲T⋆)βˆ’ΞΌβ€‹dist​(𝐲T⋆,𝒴⋆)Ξ³f(\mathbf{y}^{\star})\leq f(\mathbf{y}_{T}^{\star})-\mu\mathrm{dist}(\mathbf{y}_{T}^{\star},\mathcal{Y}^{\star})^{\gamma} and

ΞΌβ‹…dist​(𝐲T⋆,𝒴⋆)γ≀\displaystyle\mu\cdot\mathrm{dist}(\mathbf{y}_{T}^{\star},\mathcal{Y}^{\star})^{\gamma}\leq{} f​(𝐲T⋆)βˆ’f​(𝐲⋆)\displaystyle f(\mathbf{y}_{T}^{\star})-f(\mathbf{y}^{\star})
=\displaystyle={} f​(𝐲T⋆)βˆ’fT​(𝐲T⋆)+fT​(𝐲T⋆)βˆ’fT​(𝐲⋆)+fT​(𝐲⋆)βˆ’f​(𝐲⋆).\displaystyle f(\mathbf{y}_{T}^{\star})-f_{T}(\mathbf{y}_{T}^{\star})+f_{T}(\mathbf{y}_{T}^{\star})-f_{T}(\mathbf{y}^{\star})+f_{T}(\mathbf{y}^{\star})-f(\mathbf{y}^{\star}).
≀\displaystyle\leq{} f​(𝐲T⋆)βˆ’fT​(𝐲T⋆)+fT​(𝐲⋆)βˆ’f​(𝐲⋆),\displaystyle f(\mathbf{y}_{T}^{\star})-f_{T}(\mathbf{y}_{T}^{\star})+f_{T}(\mathbf{y}^{\star})-f(\mathbf{y}^{\star}), (21)

where (21) uses fT​(𝐲T⋆)βˆ’fT​(𝐲⋆)≀0f_{T}(\mathbf{y}_{T}^{\star})-f_{T}(\mathbf{y}^{\star})\leq 0. Taking expectation and using 𝔼​[fT​(𝐲⋆)]=f​(𝐲⋆)\mathbb{E}[f_{T}(\mathbf{y}^{\star})]=f(\mathbf{y}^{\star}), we arrive at

μ​𝔼​[dist​(𝐲T⋆,𝒴⋆)Ξ³]≀𝔼​[f​(𝐲T⋆)βˆ’fT​(𝐲T⋆)]\mu\mathbb{E}[\mathrm{dist}(\mathbf{y}_{T}^{\star},\mathcal{Y}^{\star})^{\gamma}]\leq\mathbb{E}[f(\mathbf{y}_{T}^{\star})-f_{T}(\mathbf{y}_{T}^{\star})]

and it remains to bound f​(𝐲T⋆)βˆ’fT​(𝐲T⋆)f(\mathbf{y}_{T}^{\star})-f_{T}(\mathbf{y}_{T}^{\star}). For any fixed π²βˆˆπ’΄\mathbf{y}\in\mathcal{Y},

fT​(𝐲)=1Tβ€‹βˆ‘t=1T⟨𝐝,𝐲⟩+[ctβˆ’βŸ¨πšt,𝐲⟩]+f_{T}(\mathbf{y})=\tfrac{1}{T}\textstyle\sum_{t=1}^{T}\langle\mathbf{d},\mathbf{y}\rangle+[c_{t}-\langle\mathbf{a}_{t},\mathbf{y}\rangle]_{+}

and for each tt, since 𝐲β‰₯𝟎\mathbf{y}\geq\mathbf{0},

0≀\displaystyle 0\leq{} ⟨𝐝,𝐲⟩+[ctβˆ’βŸ¨πšt,𝐲⟩]+\displaystyle\langle\mathbf{d},\mathbf{y}\rangle+[c_{t}-\langle\mathbf{a}_{t},\mathbf{y}\rangle]_{+}
≀\displaystyle\leq{} ‖𝐝‖⋅‖𝐲‖+|ct|+β€–πšt‖⋅‖𝐲‖\displaystyle\|\mathbf{d}\|\cdot\|\mathbf{y}\|+|c_{t}|+\|\mathbf{a}_{t}\|\cdot\|\mathbf{y}\|
≀\displaystyle\leq{} m​d¯​(cΒ―+dΒ―)dΒ―+cΒ―+m​a¯​(cΒ―+dΒ―)dΒ―\displaystyle\sqrt{m}\bar{d}\tfrac{(\bar{c}+\underline{d})}{\underline{d}}+\bar{c}+\sqrt{m}\bar{a}\tfrac{(\bar{c}+\underline{d})}{\underline{d}}
=\displaystyle={} m​(aΒ―+dΒ―)​(cΒ―+dΒ―)dΒ―+cΒ―\displaystyle\sqrt{m}\tfrac{(\bar{a}+\bar{d})(\bar{c}+\underline{d})}{\underline{d}}+\bar{c}

Using LemmaΒ A.1,

ℙ​{f​(𝐲)βˆ’fT​(𝐲)β‰₯ΞΆ}≀exp⁑{βˆ’2​dΒ―2​T(m​(aΒ―+dΒ―)​(cΒ―+dΒ―)+c¯​dΒ―)2​΢2}.\mathbb{P}\{f(\mathbf{y})-f_{T}(\mathbf{y})\geq\zeta\}\leq\exp\{-\tfrac{2\underline{d}^{2}T}{(\sqrt{m}(\bar{a}+\bar{d})(\bar{c}+\underline{d})+\bar{c}\underline{d})^{2}}\zeta^{2}\}.

Recall that 𝐲Tβ‹†βˆˆπ’΄\mathbf{y}_{T}^{\star}\in\mathcal{Y} by (4), and we construct an Ξ΅\varepsilon-net of 𝒴\mathcal{Y} as follows:

π’΄βŠ†π’©k:=⋃{ji}i=1m∈{0,…,k}m{𝐲:β€–π²βˆ’βˆ‘i=1mcΒ―+dΒ―k​d¯​jiβ€‹πžiβ€–βˆžβ‰€cΒ―+dΒ―k​dΒ―},\mathcal{Y}\subseteq\mathcal{N}_{k}:=\bigcup_{\{j_{i}\}_{i=1}^{m}\in\{0,\ldots,k\}^{m}}\{\mathbf{y}:\|\mathbf{y}-\textstyle\sum_{i=1}^{m}\tfrac{\bar{c}+\underline{d}}{k\underline{d}}j_{i}\mathbf{e}_{i}\|_{\infty}\leq\tfrac{\bar{c}+\underline{d}}{k\underline{d}}\},

where we denote the centers of each net as π’žk\mathcal{C}_{k} and |π’žk|=(k+1)m|\mathcal{C}_{k}|=(k+1)^{m}. In each member of the net, we have, by Lipschitz continuity of f​(𝐲)f(\mathbf{y}) and fT​(𝐲)f_{T}(\mathbf{y}), that

f​(𝐲1)βˆ’f​(𝐲2)≀m​(aΒ―+dΒ―)​‖𝐲1βˆ’π²2‖≀m​(aΒ―+dΒ―)​‖𝐲1βˆ’π²2β€–βˆžβ‰€m​(aΒ―+dΒ―)​(cΒ―+dΒ―)k​dΒ―.f(\mathbf{y}_{1})-f(\mathbf{y}_{2})\leq\sqrt{m}(\bar{a}+\bar{d})\|\mathbf{y}_{1}-\mathbf{y}_{2}\|\leq m(\bar{a}+\bar{d})\|\mathbf{y}_{1}-\mathbf{y}_{2}\|_{\infty}\leq\tfrac{m(\bar{a}+\bar{d})(\bar{c}+\underline{d})}{k\underline{d}}.

Next, with union bound,

ℙ​{maxπ²βˆˆπ’žk⁑f​(𝐲)βˆ’fT​(𝐲)β‰₯ΞΆ}≀\displaystyle\mathbb{P}\{\textstyle\max_{\mathbf{y}\in\mathcal{C}_{k}}f(\mathbf{y})-f_{T}(\mathbf{y})\geq\zeta\}\leq{} βˆ‘π³βˆˆπ’žkℙ​{f​(𝐳)βˆ’fT​(𝐳)β‰₯ΞΆ}\displaystyle\textstyle\sum_{\mathbf{z}\in\mathcal{C}_{k}}\mathbb{P}\{f(\mathbf{z})-f_{T}(\mathbf{z})\geq\zeta\}
≀\displaystyle\leq{} (k+1)m​exp⁑{βˆ’2​dΒ―2​T(m​(aΒ―+dΒ―)​(cΒ―+dΒ―)+c¯​dΒ―)2​΢2}.\displaystyle(k+1)^{m}\exp\{-\tfrac{2\underline{d}^{2}T}{(\sqrt{m}(\bar{a}+\bar{d})(\bar{c}+\underline{d})+\bar{c}\underline{d})^{2}}\zeta^{2}\}.

Taking k=Tk=\sqrt{T}, we have

ℙ​{supπ²βˆˆπ’΄f​(𝐲)βˆ’fT​(𝐲)≀΢+2​m​(aΒ―+dΒ―)​(cΒ―+dΒ―)T​dΒ―}\displaystyle\mathbb{P}\{\textstyle\sup_{\mathbf{y}\in\mathcal{Y}}f(\mathbf{y})-f_{T}(\mathbf{y})\leq\zeta+\tfrac{2m(\bar{a}+\bar{d})(\bar{c}+\underline{d})}{\sqrt{T}\underline{d}}\}
β‰₯\displaystyle\geq{} ℙ​{supπ²βˆˆπ’΄f​(𝐲)βˆ’fT​(𝐲)≀΢+2​m​(aΒ―+dΒ―)​(cΒ―+dΒ―)T​dΒ―|maxπ²βˆˆπ’žk⁑f​(𝐲)βˆ’fT​(𝐲)≀΢}⋅ℙ​{maxπ²βˆˆπ’žk⁑f​(𝐲)βˆ’fT​(𝐲)≀΢}\displaystyle\mathbb{P}\{\textstyle\sup_{\mathbf{y}\in\mathcal{Y}}f(\mathbf{y})-f_{T}(\mathbf{y})\leq\zeta+\tfrac{2m(\bar{a}+\bar{d})(\bar{c}+\underline{d})}{\sqrt{T}\underline{d}}|\textstyle\max_{\mathbf{y}\in\mathcal{C}_{k}}f(\mathbf{y})-f_{T}(\mathbf{y})\leq\zeta\}\cdot\mathbb{P}\{\textstyle\max_{\mathbf{y}\in\mathcal{C}_{k}}f(\mathbf{y})-f_{T}(\mathbf{y})\leq\zeta\}
=\displaystyle={} ℙ​{maxπ²βˆˆπ’žk⁑f​(𝐲)βˆ’fT​(𝐲)≀΢}\displaystyle\mathbb{P}\{\textstyle\max_{\mathbf{y}\in\mathcal{C}_{k}}f(\mathbf{y})-f_{T}(\mathbf{y})\leq\zeta\}
β‰₯\displaystyle\geq{} 1βˆ’(T+1)m​exp⁑{βˆ’4​dΒ―2​T(m​(aΒ―+dΒ―)​(cΒ―+dΒ―)+c¯​dΒ―)2​΢2}.\displaystyle 1-(\sqrt{T}+1)^{m}\exp\{-\tfrac{4\underline{d}^{2}T}{(\sqrt{m}(\bar{a}+\bar{d})(\bar{c}+\underline{d})+\bar{c}\underline{d})^{2}}\zeta^{2}\}.

Taking ΞΆ=3​m​(m​(aΒ―+dΒ―)​(cΒ―+dΒ―)+c¯​dΒ―)24​dΒ―2​log⁑TT\zeta=\sqrt{\tfrac{3m(\sqrt{m}(\bar{a}+\bar{d})(\bar{c}+\underline{d})+\bar{c}\underline{d})^{2}}{4\underline{d}^{2}}\tfrac{\log T}{T}} gives

ℙ​{supπ²βˆˆπ’΄f​(𝐲)βˆ’fT​(𝐲)≀π’ͺ​(log⁑TT)}β‰₯1βˆ’1T\mathbb{P}\Big{\{}\textstyle\sup_{\mathbf{y}\in\mathcal{Y}}f(\mathbf{y})-f_{T}(\mathbf{y})\leq\mathcal{O}\big{(}\sqrt{\tfrac{\log T}{T}}\big{)}\Big{\}}\geq 1-\tfrac{1}{T}

and

𝔼​[dist​(𝐲T⋆,𝒴⋆)]γ≀𝔼​[dist​(𝐲T⋆,𝒴⋆)Ξ³]≀1μ​𝔼​[f​(𝐲T⋆)βˆ’fT​(𝐲T⋆)]=π’ͺ​(log⁑TT)=o​(1).\mathbb{E}[\mathrm{dist}(\mathbf{y}_{T}^{\star},\mathcal{Y}^{\star})]^{\gamma}\leq\mathbb{E}[\mathrm{dist}(\mathbf{y}_{T}^{\star},\mathcal{Y}^{\star})^{\gamma}]\leq\tfrac{1}{\mu}\mathbb{E}[f(\mathbf{y}_{T}^{\star})-f_{T}(\mathbf{y}_{T}^{\star})]=\mathcal{O}(\sqrt{\tfrac{\log T}{T}})=o(1).

This completes the proof.

Appendix B Proof of results in SectionΒ 4

B.1 Auxiliary results

Lemma B.1 (Bounded dual solution [10]).

Assume that A1 to A3 hold and suppose AlgorithmΒ 1 with Ξ±t≑α\alpha_{t}\equiv\alpha starts from 𝐲1\mathbf{y}^{1} and ‖𝐲1‖≀cΒ―dΒ―\|\mathbf{y}^{1}\|\leq\tfrac{\bar{c}}{\underline{d}}, then

‖𝐲t‖≀\displaystyle\|\mathbf{y}^{t}\|\leq{} cΒ―dΒ―+m​(aΒ―+dΒ―)2​α2​dΒ―+α​m​(aΒ―+dΒ―)=𝖱,Β for all ​t,\displaystyle\tfrac{\bar{c}}{\underline{d}}+\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2\underline{d}}+\alpha\sqrt{m}(\bar{a}+\bar{d})=\mathsf{R},\text{ for all }t, (22)

almost surely. Moreover, if α≀2​dΒ―3​m​(aΒ―+dΒ―)2\alpha\leq\frac{2\underline{d}}{3m(\bar{a}+\bar{d})^{2}}, then 𝐲tβˆˆπ’΄\mathbf{y}^{t}\in\mathcal{Y} for all tt almost surely.

Proof.

The relation (22) follows immediately from Lemma 5 of [10]. To see 𝐲tβˆˆπ’΄\mathbf{y}^{t}\in\mathcal{Y}, we successively deduce, for α≀2​dΒ―3​m​(aΒ―+dΒ―)2\alpha\leq\frac{2\underline{d}}{3m(\bar{a}+\bar{d})^{2}}, that

m​(aΒ―+dΒ―)2​α2​dΒ―+α​m​(aΒ―+dΒ―)≀13+2​dΒ―3​m​(aΒ―+dΒ―)≀13+2​(aΒ―+dΒ―)3​m​(aΒ―+dΒ―)≀1\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2\underline{d}}+\alpha\sqrt{m}(\bar{a}+\bar{d})\leq\tfrac{1}{3}+\tfrac{2\underline{d}}{3\sqrt{m}(\bar{a}+\bar{d})}\leq\tfrac{1}{3}+\tfrac{2(\bar{a}+\bar{d})}{3\sqrt{m}(\bar{a}+\bar{d})}\leq 1

and this completes the proof. ∎

Lemma B.2 (Subgradient method on strongly convex problems [26]).

Let δ∈(0,eβˆ’1)\delta\in(0,e^{-1}) and assume Tβ‰₯4T\geq 4. Suppose f​(𝐲)f(\mathbf{y}) is ΞΌ\mu-strongly convex and ‖𝐠𝐲‖≀G\|\mathbf{g}_{\mathbf{y}}\|\leq G. Then, the subgradient method with stepsize Ξ±t=1/(μ​t)\alpha_{t}=1/(\mu t) satisfies

‖𝐲T+1βˆ’π²β‹†β€–2≀624​log⁑(log⁑TΞ΄+1)​G2ΞΌ2​T\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|^{2}\leq\tfrac{624\log(\frac{\log T}{\delta}+1)G^{2}}{\mu^{2}T}

with probability at least 1βˆ’Ξ΄1-\delta.

Lemma B.3 (Subgradient method for Ξ³=2\gamma=2).

Suppose A1 to A3 and A4 with Ξ³=2\gamma=2 hold. Then, the subgradient method with Ξ±t=1/(μ​(t+1))\alpha_{t}=1/(\mu(t+1)) outputs 𝐲T+1\mathbf{y}^{T+1} such that 𝔼​[dist​(𝐲T+1,𝒴⋆)2]≀m​(aΒ―+dΒ―)2ΞΌ2​T\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{2}]\leq\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu^{2}T}.

Proof.

For any 𝐲^βˆˆπ’΄β‹†\hat{\mathbf{y}}\in\mathcal{Y}^{\star}, we deduce that

‖𝐲t+1βˆ’π²^β€–2=\displaystyle\|\mathbf{y}^{t+1}-\hat{\mathbf{y}}\|^{2}={} ‖Π𝒴​[𝐲tβˆ’Ξ±t​𝐠t]βˆ’π²^β€–2\displaystyle\|\Pi_{\mathcal{Y}}[\mathbf{y}^{t}-\alpha_{t}\mathbf{g}^{t}]-\hat{\mathbf{y}}\|^{2}
≀\displaystyle\leq{} ‖𝐲tβˆ’Ξ±t​𝐠tβˆ’π²^β€–2\displaystyle\|\mathbf{y}^{t}-\alpha_{t}\mathbf{g}^{t}-\hat{\mathbf{y}}\|^{2} (23)
=\displaystyle={} ‖𝐲tβˆ’π²^β€–2βˆ’2​αtβ€‹βŸ¨π²tβˆ’π²^,𝐠t⟩+Ξ±t2​‖𝐠tβ€–2,\displaystyle\|\mathbf{y}^{t}-\hat{\mathbf{y}}\|^{2}-2\alpha_{t}\langle\mathbf{y}^{t}-\hat{\mathbf{y}},\mathbf{g}^{t}\rangle+\alpha_{t}^{2}\|\mathbf{g}^{t}\|^{2},

where (23) uses the non-expansiveness of the projection operator. Taking 𝐲^=Π𝒴⋆​[𝐲t]\hat{\mathbf{y}}=\Pi_{\mathcal{Y}^{\star}}[\mathbf{y}^{t}] and using ‖𝐠tβ€–2≀m​(aΒ―+dΒ―)2\|\mathbf{g}^{t}\|^{2}\leq m(\bar{a}+\bar{d})^{2}, we get

‖𝐲t+1βˆ’π²^β€–2≀dist​(𝐲t,𝒴⋆)2βˆ’2​αtβ€‹βŸ¨π²tβˆ’π²,𝐠t⟩+Ξ±t2​m​(aΒ―+dΒ―)2.\|\mathbf{y}^{t+1}-\hat{\mathbf{y}}\|^{2}\leq\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}-2\alpha_{t}\langle\mathbf{y}^{t}-\mathbf{y},\mathbf{g}^{t}\rangle+\alpha_{t}^{2}m(\bar{a}+\bar{d})^{2}.

Since 𝔼​[𝐠t]βˆˆβˆ‚f​(𝐲t)\mathbb{E}[\mathbf{g}^{t}]\in\partial f(\mathbf{y}^{t}), we have, by convexity of ff, that

βˆ’2β€‹βŸ¨π²tβˆ’π²β‹†,Ξ±t​𝔼​[𝐠t]βŸ©β‰€βˆ’2​αt​(f​(𝐲t)βˆ’f​(𝐲)).-2\langle\mathbf{y}^{t}-\mathbf{y}^{\star},\alpha_{t}\mathbb{E}[\mathbf{g}^{t}]\rangle\leq-2\alpha_{t}(f(\mathbf{y}^{t})-f(\mathbf{y})).

Next, we invoke A4 to get

f​(𝐲t)βˆ’f​(𝐲^)β‰₯ΞΌβ‹…dist​(𝐲t,𝒴⋆)2.f(\mathbf{y}^{t})-f(\hat{\mathbf{y}})\geq\mu\cdot\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}.

Conditioned on history and taking expectation, we have

𝔼​[dist​(𝐲t+1,𝒴⋆)2|𝐲t]≀\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{t+1},\mathcal{Y}^{\star})^{2}|\mathbf{y}^{t}]\leq{} 𝔼​[‖𝐲t+1βˆ’π²^β€–2|𝐲t]\displaystyle\mathbb{E}[\|\mathbf{y}^{t+1}-\hat{\mathbf{y}}\|^{2}|\mathbf{y}^{t}]
≀\displaystyle\leq{} dist​(𝐲t,𝒴⋆)2βˆ’2​αt​μ​dist​(𝐲t,𝒴⋆)2+Ξ±t2​m​(aΒ―+dΒ―)2\displaystyle\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}-2\alpha_{t}\mu\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}+\alpha_{t}^{2}m(\bar{a}+\bar{d})^{2}
=\displaystyle={} (1βˆ’2​αt​μ)​dist​(𝐲t,𝒴⋆)2+Ξ±t2​m​(aΒ―+dΒ―)2.\displaystyle(1-2\alpha_{t}\mu)\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}+\alpha_{t}^{2}m(\bar{a}+\bar{d})^{2}. (24)

With Ξ±t=1μ​(t+1)\alpha_{t}=\tfrac{1}{\mu(t+1)}, we have

𝔼​[dist​(𝐲t+1,𝒴⋆)2]≀\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{t+1},\mathcal{Y}^{\star})^{2}]\leq{} (1βˆ’2​αt​μ)​dist​(𝐲t,𝒴⋆)2+Ξ±t2​m​(aΒ―+dΒ―)2\displaystyle(1-2\alpha_{t}\mu)\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}+\alpha^{2}_{t}m(\bar{a}+\bar{d})^{2}
=\displaystyle={} tβˆ’1t+1​dist​(𝐲t,𝒴⋆)2+m​(aΒ―+dΒ―)2ΞΌ2​(t+1)2.\displaystyle\tfrac{t-1}{t+1}\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu^{2}(t+1)^{2}}.

Multiply both sides by (t+1)2(t+1)^{2} and we get

(t+1)2​𝔼​[dist​(𝐲t+1,𝒴⋆)2]≀\displaystyle(t+1)^{2}\mathbb{E}[\mathrm{dist}(\mathbf{y}^{t+1},\mathcal{Y}^{\star})^{2}]\leq{} (t2βˆ’1)​dist​(𝐲t,𝒴⋆)2+m​(aΒ―+dΒ―)2ΞΌ2\displaystyle(t^{2}-1)\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu^{2}} (25)
4​𝔼​[dist​(𝐲2,𝒴⋆)2]≀\displaystyle 4\mathbb{E}[\mathrm{dist}(\mathbf{y}^{2},\mathcal{Y}^{\star})^{2}]\leq{} m​(aΒ―+dΒ―)2ΞΌ2\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu^{2}} (26)

Re-arranging the terms, we arrive at

(t+1)2​𝔼​[dist​(𝐲t+1,𝒴⋆)2]βˆ’t2​dist​(𝐲t,𝒴⋆)2≀m​(aΒ―+dΒ―)2ΞΌ2.(t+1)^{2}\mathbb{E}[\mathrm{dist}(\mathbf{y}^{t+1},\mathcal{Y}^{\star})^{2}]-t^{2}\mathrm{dist}(\mathbf{y}^{t},\mathcal{Y}^{\star})^{2}\leq\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu^{2}}.

Taking expectation over all the randomness and telescoping from t=2t=2 to TT, with (26) added, gives

𝔼​[dist​(𝐲T+1,𝒴⋆)2]≀m​(aΒ―+dΒ―)2​TΞΌ2​(T+1)2≀m​(aΒ―+dΒ―)2ΞΌ2​T\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{2}]\leq\tfrac{m(\bar{a}+\bar{d})^{2}T}{\mu^{2}(T+1)^{2}}\leq\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu^{2}T}

and this completes the proof. ∎

Lemma B.4 (Subgradient with constant stepsize).

Under the same assumptions as LemmaΒ B.3, if Ξ±t≑α<1/(2​μ)\alpha_{t}\equiv\alpha<1/(2\mu), then

𝔼​[dist​(𝐲T+1,𝒴⋆)2]≀Δ2μ​α​T+m​(aΒ―+dΒ―)2μ​α,\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{2}]\leq\tfrac{\Delta^{2}}{\mu\alpha T}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu}\alpha,

where Ξ”=dist​(𝐲1,𝒴⋆)\Delta=\mathrm{dist}(\mathbf{y}_{1},\mathcal{Y}^{\star}).

Proof.

Taking Ξ±t≑α<1/(2​μ)\alpha_{t}\equiv\alpha<1/(2\mu) and unrolling the recursion from (24) till 𝐲1\mathbf{y}^{1}, we have

𝔼​[dist​(𝐲T+1βˆ’π’΄β‹†)2]≀\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1}-\mathcal{Y}^{\star})^{2}]\leq{} (1βˆ’2​μ​α)​𝔼​[dist​(𝐲T,𝒴⋆)2]+Ξ±2​m​(aΒ―+dΒ―)2\displaystyle(1-2\mu\alpha)\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T},\mathcal{Y}^{\star})^{2}]+\alpha^{2}m(\bar{a}+\bar{d})^{2}
≀\displaystyle\leq{} (1βˆ’2​μ​α)T​dist​(𝐲1,𝒴⋆)2+βˆ‘j=0Tβˆ’1Ξ±2​m​(aΒ―+dΒ―)2​(1βˆ’2​μ​α)j\displaystyle(1-2\mu\alpha)^{T}\mathrm{dist}(\mathbf{y}^{1},\mathcal{Y}^{\star})^{2}+\textstyle\sum_{j=0}^{T-1}\alpha^{2}m(\bar{a}+\bar{d})^{2}(1-2\mu\alpha)^{j}
≀\displaystyle\leq{} (1βˆ’2​μ​α)T​dist​(𝐲1,𝒴⋆)2+m​(aΒ―+dΒ―)2μ​α\displaystyle(1-2\mu\alpha)^{T}\mathrm{dist}(\mathbf{y}^{1},\mathcal{Y}^{\star})^{2}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu}\alpha (27)
≀\displaystyle\leq{} 1μ​α​T​dist​(𝐲1,𝒴⋆)2+m​(aΒ―+dΒ―)2μ​α\displaystyle\tfrac{1}{\mu\alpha T}\mathrm{dist}(\mathbf{y}^{1},\mathcal{Y}^{\star})^{2}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu}\alpha (28)
=\displaystyle={} Ξ”2μ​α​T+m​(aΒ―+dΒ―)2μ​α,\displaystyle\tfrac{\Delta^{2}}{\mu\alpha T}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu}\alpha,

where (27) uses the relation βˆ‘j=0Tβˆ’1(1βˆ’2​μ​α)j=1βˆ’(1βˆ’2​μ​α)T2​μ​α≀1μ​α\sum_{j=0}^{T-1}(1-2\mu\alpha)^{j}=\tfrac{1-(1-2\mu\alpha)^{T}}{2\mu\alpha}\leq\tfrac{1}{\mu\alpha} and (28) is by (1βˆ’2​μ​α)T≀11+2​μ​α​T≀1μ​α​T(1-2\mu\alpha)^{T}\leq\tfrac{1}{1+2\mu\alpha T}\leq\tfrac{1}{\mu\alpha T}. This completes the proof. ∎

B.2 Proof of Lemma 4.1

By definition of regret, we deduce that

𝔼​[r​(𝐱^T)]=\displaystyle\mathbb{E}[r(\hat{\mathbf{x}}_{T})]={} 𝔼​[⟨𝐜,𝐱Tβ‹†βŸ©βˆ’βŸ¨πœ,𝐱^T⟩]\displaystyle\mathbb{E}[\langle\mathbf{c},\mathbf{x}^{\star}_{T}\rangle-\langle\mathbf{c},\hat{\mathbf{x}}_{T}\rangle]
=\displaystyle={} 𝔼​[T​fT​(𝐲T⋆)βˆ’βŸ¨πœ,𝐱^T⟩]\displaystyle\mathbb{E}[Tf_{T}(\mathbf{y}_{T}^{\star})-\langle\mathbf{c},\hat{\mathbf{x}}_{T}\rangle] (29)
≀\displaystyle\leq{} 𝔼​[T​fT​(𝐲⋆)βˆ’βŸ¨πœ,𝐱^T⟩]\displaystyle\mathbb{E}[Tf_{T}(\mathbf{y}^{\star})-\langle\mathbf{c},\hat{\mathbf{x}}_{T}\rangle] (30)
=\displaystyle={} T​f​(𝐲⋆)βˆ’π”Όβ€‹[⟨𝐜,𝐱^T⟩]\displaystyle Tf(\mathbf{y}^{\star})-\mathbb{E}[\langle\mathbf{c},\hat{\mathbf{x}}_{T}\rangle] (31)
≀\displaystyle\leq{} 𝔼​[βˆ‘t=1Tf​(𝐲t)βˆ’βŸ¨πœ,𝐱^T⟩]\displaystyle\mathbb{E}[\textstyle\sum_{t=1}^{T}f(\mathbf{y}^{t})-\langle\mathbf{c},\hat{\mathbf{x}}_{T}\rangle]
=\displaystyle={} βˆ‘t=1T𝔼​[⟨𝐝,𝐲t⟩+[ctβˆ’βŸ¨πšt,𝐲t⟩]+βˆ’ct​xt]\displaystyle\textstyle\sum_{t=1}^{T}\mathbb{E}[\langle\mathbf{d},\mathbf{y}^{t}\rangle+[c_{t}-\langle\mathbf{a}_{t},\mathbf{y}^{t}\rangle]_{+}-c_{t}x^{t}] (32)
=\displaystyle={} βˆ‘t=1T𝔼​[βŸ¨πβˆ’πšt​xt,𝐲t⟩],\displaystyle\textstyle\sum_{t=1}^{T}\mathbb{E}[\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle],

where (29) uses strong duality of LP; (30) uses the fact 𝐲⋆\mathbf{y}^{\star} is a feasible solution and that 𝐲T⋆\mathbf{y}_{T}^{\star} is the optimal solution to the sample LP; (32) uses the definition of f​(𝐲)f(\mathbf{y}) and that (ct,𝐚t)(c_{t},\mathbf{a}_{t}) are i.i.d. generated. Then we have

‖𝐲t+1β€–2βˆ’β€–π²tβ€–2=\displaystyle\|\mathbf{y}^{t+1}\|^{2}-\|\mathbf{y}^{t}\|^{2}={} β€–[𝐲tβˆ’Ξ±β€‹(πβˆ’πšt​xt)]+β€–2βˆ’β€–π²tβ€–2\displaystyle\|[\mathbf{y}^{t}-\alpha(\mathbf{d}-\mathbf{a}_{t}x^{t})]_{+}\|^{2}-\|\mathbf{y}^{t}\|^{2}
≀\displaystyle\leq{} ‖𝐲tβˆ’Ξ±β€‹(πβˆ’πšt​xt)β€–2βˆ’β€–π²tβ€–2\displaystyle\|\mathbf{y}^{t}-\alpha(\mathbf{d}-\mathbf{a}_{t}x^{t})\|^{2}-\|\mathbf{y}^{t}\|^{2} (33)
=\displaystyle={} βˆ’2β€‹Ξ±β€‹βŸ¨πβˆ’πšt​xt,𝐲t⟩+Ξ±2β€‹β€–πβˆ’πšt​xtβ€–2\displaystyle-2\alpha\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle+\alpha^{2}\|\mathbf{d}-\mathbf{a}_{t}x^{t}\|^{2}
≀\displaystyle\leq{} βˆ’2β€‹Ξ±β€‹βŸ¨πβˆ’πšt​xt,𝐲t⟩+m​(aΒ―+dΒ―)2​α2,\displaystyle-2\alpha\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle+m(\bar{a}+\bar{d})^{2}\alpha^{2}, (34)

where (33) uses β€–[𝐱]+‖≀‖𝐱‖\|[\mathbf{x}]_{+}\|\leq\|\mathbf{x}\| and (34) uses A2, A3. A simple re-arrangement gives

βŸ¨πβˆ’πšt​xt,𝐲tβŸ©β‰€m​(aΒ―+dΒ―)2​α2+‖𝐲tβ€–2βˆ’β€–π²t+1β€–22​α.\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle\leq\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}+\tfrac{\|\mathbf{y}^{t}\|^{2}-\|\mathbf{y}^{t+1}\|^{2}}{2\alpha}. (35)

Next, we telescope the relation (35) from t=1t=1 to TT and get

𝔼​[r​(𝐱^T)]=\displaystyle\mathbb{E}[r(\hat{\mathbf{x}}_{T})]={} βˆ‘t=1T𝔼​[βŸ¨πβˆ’πšt​xt,𝐲t⟩]\displaystyle\textstyle\sum_{t=1}^{T}\mathbb{E}[\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle]
≀\displaystyle\leq{} m​(aΒ―+dΒ―)2​α2​T+βˆ‘t=1T𝔼​[‖𝐲tβ€–2]βˆ’π”Όβ€‹[‖𝐲t+1β€–2]2​α\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T+\textstyle\sum_{t=1}^{T}\tfrac{\mathbb{E}[\|\mathbf{y}^{t}\|^{2}]-\mathbb{E}[\|\mathbf{y}^{t+1}\|^{2}]}{2\alpha} (36)
=\displaystyle={} m​(aΒ―+dΒ―)2​α2​T+𝔼​[‖𝐲1β€–2]βˆ’π”Όβ€‹[‖𝐲T+1β€–2]2​α\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T+\tfrac{\mathbb{E}[\|\mathbf{y}^{1}\|^{2}]-\mathbb{E}[\|\mathbf{y}^{T+1}\|^{2}]}{2\alpha}
=\displaystyle={} m​(aΒ―+dΒ―)2​α2​T+𝔼​[⟨𝐲1+𝐲T+1,𝐲1βˆ’π²T+1⟩]2​α\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T+\tfrac{\mathbb{E}[\langle\mathbf{y}^{1}+\mathbf{y}^{T+1},\mathbf{y}^{1}-\mathbf{y}^{T+1}\rangle]}{2\alpha} (37)
≀\displaystyle\leq{} m​(aΒ―+dΒ―)2​α2​T+𝖱α​𝔼​[‖𝐲1βˆ’π²T+1β€–]\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T+\tfrac{\mathsf{R}}{\alpha}\mathbb{E}[\|\mathbf{y}^{1}-\mathbf{y}^{T+1}\|] (38)
≀\displaystyle\leq{} m​(aΒ―+dΒ―)2​α2​T+𝖱α​𝔼​[‖𝐲1βˆ’π²β‹†β€–+‖𝐲T+1βˆ’π²β‹†β€–],\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T+\tfrac{\mathsf{R}}{\alpha}\mathbb{E}[\|\mathbf{y}^{1}-\mathbf{y}^{\star}\|+\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|], (39)

where (36) again uses relation (35); (38) uses the Cauchy–Schwarz inequality

⟨𝐲1+𝐲T+1,𝐲1βˆ’π²T+1βŸ©β‰€β€–π²1+𝐲T+1‖⋅‖𝐲1βˆ’π²T+1β€–\langle\mathbf{y}^{1}+\mathbf{y}^{T+1},\mathbf{y}^{1}-\mathbf{y}^{T+1}\rangle\leq\|\mathbf{y}^{1}+\mathbf{y}^{T+1}\|\cdot\|\mathbf{y}^{1}-\mathbf{y}^{T+1}\|

and almost sure boundedness of iterations derived from Lemma B.1:

‖𝐲1+𝐲T+1‖≀\displaystyle\|\mathbf{y}^{1}+\mathbf{y}^{T+1}\|\leq{} ‖𝐲T+1β€–+‖𝐲1‖≀2​𝖱.\displaystyle\|\mathbf{y}^{T+1}\|+\|\mathbf{y}^{1}\|\leq{}2\mathsf{R}.

Finally (39) is obtained from the triangle inequality

‖𝐲1βˆ’π²T+1β€–=\displaystyle\|\mathbf{y}^{1}-\mathbf{y}^{T+1}\|={} ‖𝐲1βˆ’π²β‹†+π²β‹†βˆ’π²T+1‖≀‖𝐲1βˆ’π²β‹†β€–+‖𝐲T+1βˆ’π²β‹†β€–\displaystyle\|\mathbf{y}^{1}-\mathbf{y}^{\star}+\mathbf{y}^{\star}-\mathbf{y}^{T+1}\|\leq\|\mathbf{y}^{1}-\mathbf{y}^{\star}\|+\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|

and this completes the proof.

B.3 Proof of Lemma 4.2

For constraint violation, recall that

𝔼​[v​(𝐱^T)]=𝔼​[β€–[𝐀​𝐱^Tβˆ’π›]+β€–]=𝔼​[β€–[βˆ‘t=1T(𝐚t​xtβˆ’π)]+β€–]\mathbb{E}[v(\hat{\mathbf{x}}_{T})]=\mathbb{E}[\|[\mathbf{A}\hat{\mathbf{x}}_{T}-\mathbf{b}]_{+}\|]=\mathbb{E}\big{[}\big{\|}\big{[}\textstyle\sum_{t=1}^{T}(\mathbf{a}_{t}x^{t}-\mathbf{d})\big{]}_{+}\big{\|}\big{]}

and that

𝐲t+1=[𝐲tβˆ’Ξ±β€‹(πβˆ’πšt​xt)]+β‰₯𝐲tβˆ’Ξ±β€‹(πβˆ’πšt​xt).\mathbf{y}^{t+1}=[\mathbf{y}^{t}-\alpha(\mathbf{d}-\mathbf{a}_{t}x^{t})]_{+}\geq\mathbf{y}^{t}-\alpha(\mathbf{d}-\mathbf{a}_{t}x^{t}).

A re-arrangement gives

𝐚t​xt≀𝐝+1α​(𝐲t+1βˆ’π²t).\mathbf{a}_{t}x^{t}\leq\mathbf{d}+\tfrac{1}{\alpha}(\mathbf{y}^{t+1}-\mathbf{y}^{t}). (40)

and that

βˆ‘t=1T(𝐚t​xtβˆ’π)≀\displaystyle\textstyle\sum_{t=1}^{T}(\mathbf{a}_{t}x^{t}-\mathbf{d})\leq{} 1Ξ±β€‹βˆ‘t=1T(𝐲t+1βˆ’π²t)\displaystyle\tfrac{1}{\alpha}\textstyle\sum_{t=1}^{T}(\mathbf{y}^{t+1}-\mathbf{y}^{t}) (41)
=\displaystyle={} 1α​(𝐲T+1βˆ’π²1)\displaystyle\tfrac{1}{\alpha}(\mathbf{y}^{T+1}-\mathbf{y}^{1})

where (41) uses (40). Now, we apply triangle inequality again:

𝔼​[β€–[𝐀​𝐱^Tβˆ’π›]+β€–]≀\displaystyle\mathbb{E}[\|[\mathbf{A}\hat{\mathbf{x}}_{T}-\mathbf{b}]_{+}\|]\leq{} 1α​𝔼​[‖𝐲T+1βˆ’π²1β€–]\displaystyle\tfrac{1}{\alpha}\mathbb{E}[\|\mathbf{y}^{T+1}-\mathbf{y}^{1}\|]
≀\displaystyle\leq{} 1α​𝔼​[‖𝐲1βˆ’π²β‹†β€–+‖𝐲T+1βˆ’π²β‹†β€–],\displaystyle\tfrac{1}{\alpha}\mathbb{E}[\|\mathbf{y}^{1}-\mathbf{y}^{\star}\|+\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|], (42)

and this completes the proof.

B.4 Proof of Lemma 4.3

Similar to the proof of LemmaΒ 4.1 and LemmaΒ 4.2, we deduce that

𝔼​[r​(𝐱^T)]≀\displaystyle\mathbb{E}[r(\hat{\mathbf{x}}_{T})]\leq{} T​f​(𝐲⋆)βˆ’π”Όβ€‹[⟨𝐜,𝐱^T⟩]\displaystyle Tf(\mathbf{y}^{\star})-\mathbb{E}[\langle\mathbf{c},\hat{\mathbf{x}}_{T}\rangle] (43)
=\displaystyle={} Te​f​(𝐲⋆)βˆ’π”Όβ€‹[βˆ‘t=1Tect​xt]+βˆ‘t=Te+1T𝔼​[f​(𝐲⋆)βˆ’ct​xt]\displaystyle T_{e}f(\mathbf{y}^{\star})-\mathbb{E}[\textstyle\sum_{t=1}^{T_{e}}c_{t}x^{t}]+\textstyle\sum_{t=T_{e}+1}^{T}\mathbb{E}[f(\mathbf{y}^{\star})-c_{t}x^{t}]
≀\displaystyle\leq{} Te​f​(𝐲⋆)βˆ’π”Όβ€‹[βˆ‘t=1Tect​xt]+βˆ‘t=Te+1T𝔼​[f​(𝐲t)βˆ’ct​xt]\displaystyle T_{e}f(\mathbf{y}^{\star})-\mathbb{E}[\textstyle\sum_{t=1}^{T_{e}}c_{t}x^{t}]+\textstyle\sum_{t=T_{e}+1}^{T}\mathbb{E}[f(\mathbf{y}^{t})-c_{t}x^{t}]
=\displaystyle={} Te​f​(𝐲⋆)βˆ’π”Όβ€‹[βˆ‘t=1Tect​xt]+βˆ‘t=Te+1T𝔼​[βŸ¨πβˆ’πšt​xt,𝐲t⟩],\displaystyle T_{e}f(\mathbf{y}^{\star})-\mathbb{E}[\textstyle\sum_{t=1}^{T_{e}}c_{t}x^{t}]+\textstyle\sum_{t=T_{e}+1}^{T}\mathbb{E}[\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle],

where (43) is directly obtained from (31). Next, we analyze βˆ‘t=Te+1T𝔼​[βŸ¨πβˆ’πšt​xt,𝐲t⟩]\textstyle\sum_{t=T_{e}+1}^{T}\mathbb{E}[\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle]. Using (35),

βŸ¨πβˆ’πšt​xt,𝐲tβŸ©β‰€m​(aΒ―+dΒ―)2​α2+‖𝐲tβ€–2βˆ’β€–π²t+1β€–22​α,\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle\leq\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}+\tfrac{\|\mathbf{y}^{t}\|^{2}-\|\mathbf{y}^{t+1}\|^{2}}{2\alpha},

and we deduce that

βˆ‘t=Te+1T𝔼​[βŸ¨πβˆ’πšt​xt,𝐲t⟩]≀\displaystyle\textstyle\sum_{t=T_{e}+1}^{T}\mathbb{E}[\langle\mathbf{d}-\mathbf{a}_{t}x^{t},\mathbf{y}^{t}\rangle]\leq{} βˆ‘t=Te+1T[m​(aΒ―+dΒ―)2​α2+12​α​𝔼​[‖𝐲tβ€–2βˆ’β€–π²t+1β€–2]]\displaystyle\textstyle\sum_{t=T_{e}+1}^{T}\big{[}\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}+\tfrac{1}{2\alpha}\mathbb{E}[\|\mathbf{y}^{t}\|^{2}-\|\mathbf{y}^{t+1}\|^{2}]\big{]}
=\displaystyle={} m​(aΒ―+dΒ―)2​α2​Tp+12​α​𝔼​[‖𝐲Te+1β€–2βˆ’β€–π²T+1β€–2]\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T_{p}+\tfrac{1}{2\alpha}\mathbb{E}[\|\mathbf{y}^{T_{e}+1}\|^{2}-\|\mathbf{y}^{T+1}\|^{2}]
≀\displaystyle\leq{} m​(aΒ―+dΒ―)2​α2​Tp+𝖱α​𝔼​[‖𝐲Te+1βˆ’π²β‹†β€–+‖𝐲T+1βˆ’π²β‹†β€–],\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T_{p}+\tfrac{\mathsf{R}}{\alpha}\mathbb{E}[\|\mathbf{y}^{T_{e}+1}-\mathbf{y}^{\star}\|+\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|], (44)

where (44) uses the same argument as in (37)-(39). Next, we consider the constraint violation:

𝔼​[v​(𝐱^T)]=\displaystyle\mathbb{E}[v(\hat{\mathbf{x}}_{T})]={} 𝔼​[β€–[𝐀​𝐱^Tβˆ’π›]+β€–]\displaystyle\mathbb{E}[\|[\mathbf{A}\hat{\mathbf{x}}_{T}-\mathbf{b}]_{+}\|]
=\displaystyle={} 𝔼​[β€–[βˆ‘t=1Te(𝐚t​xtβˆ’π)+βˆ‘t=Te+1T(𝐚t​xtβˆ’π)]+β€–]\displaystyle\mathbb{E}[\|[\textstyle\sum_{t=1}^{T_{e}}(\mathbf{a}_{t}x^{t}-\mathbf{d})+\textstyle\sum_{t=T_{e}+1}^{T}(\mathbf{a}_{t}x^{t}-\mathbf{d})]_{+}\|]
≀\displaystyle\leq{} 𝔼​[β€–[βˆ‘t=1Te(𝐚t​xtβˆ’π)]+β€–]+𝔼​[β€–[βˆ‘t=Te+1T(𝐚t​xtβˆ’π)]+β€–],\displaystyle\mathbb{E}[\|[\textstyle\sum_{t=1}^{T_{e}}(\mathbf{a}_{t}x^{t}-\mathbf{d})]_{+}\|]+\mathbb{E}[\|[\textstyle\sum_{t=T_{e}+1}^{T}(\mathbf{a}_{t}x^{t}-\mathbf{d})]_{+}\|], (45)

where (45) is by β€–[𝐱+𝐲]+‖≀‖[𝐱]+β€–+β€–[𝐲]+β€–\|[\mathbf{x}+\mathbf{y}]_{+}\|\leq\|[\mathbf{x}]_{+}\|+\|[\mathbf{y}]_{+}\| and we bound

𝔼​[β€–[βˆ‘t=Te+1T(𝐚t​xtβˆ’π)]+β€–]≀1α​𝔼​[‖𝐲Te+1βˆ’π²β‹†β€–+‖𝐲T+1βˆ’π²β‹†β€–]\mathbb{E}[\|[\textstyle\sum_{t=T_{e}+1}^{T}(\mathbf{a}_{t}x^{t}-\mathbf{d})]_{+}\|]\leq\tfrac{1}{\alpha}\mathbb{E}[\|\mathbf{y}^{T_{e}+1}-\mathbf{y}^{\star}\|+\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|]

with the same argument as (42). Putting the two relations together and using the definition

V​(Te)=𝔼​[β€–[βˆ‘t=1Te(𝐚t​xtβˆ’π)]+β€–+βˆ‘t=1Tef​(𝐲⋆)βˆ’ct​xt].V(T_{e})=\mathbb{E}[\|[\textstyle\sum_{t=1}^{T_{e}}(\mathbf{a}_{t}x^{t}-\mathbf{d})]_{+}\|+\textstyle\sum_{t=1}^{T_{e}}f(\mathbf{y}^{\star})-c_{t}x^{t}].

we arrive at

𝔼​[r​(𝐱^T)+v​(𝐱^T)]\displaystyle\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]
≀\displaystyle\leq{} V​(Te)+m​(aΒ―+dΒ―)2​α2​Tp+𝖱+1α​𝔼​[‖𝐲Te+1βˆ’π²β‹†β€–+‖𝐲T+1βˆ’π²β‹†β€–]\displaystyle V(T_{e})+\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T_{p}+\tfrac{\mathsf{R}+1}{\alpha}\mathbb{E}[\|\mathbf{y}^{T_{e}+1}-\mathbf{y}^{\star}\|+\|\mathbf{y}^{T+1}-\mathbf{y}^{\star}\|]
≀\displaystyle\leq{} V​(Te)+m​(aΒ―+dΒ―)2​α2​Tp+𝖱+1α​𝔼​[dist​(𝐲Te+1,𝒴⋆)+dist​(𝐲T+1,𝒴⋆)+2​d​i​a​m​(𝒴⋆)],\displaystyle V(T_{e})+\tfrac{m(\bar{a}+\bar{d})^{2}\alpha}{2}T_{p}+\tfrac{\mathsf{R}+1}{\alpha}\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})+\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})+2\mathrm{diam}(\mathcal{Y}^{\star})], (46)

where (46) uses

β€–π²βˆ’π²β‹†β€–=β€–π²βˆ’Ξ π’΄β‹†β€‹[𝐲]+Π𝒴⋆​[𝐲]βˆ’π²β‹†β€–β‰€dist​(𝐲,𝒴⋆)+diam​(𝒴⋆)\|\mathbf{y}-\mathbf{y}^{\star}\|=\|\mathbf{y}-\Pi_{\mathcal{Y}^{\star}}[\mathbf{y}]+\Pi_{\mathcal{Y}^{\star}}[\mathbf{y}]-\mathbf{y}^{\star}\|\leq\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})+\mathrm{diam}(\mathcal{Y}^{\star})

for all 𝐲\mathbf{y} and it remains to analyze 𝔼​[dist​(𝐲Te+1,𝒴⋆)+dist​(𝐲T+1,𝒴⋆)]\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})+\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})].

By LemmaΒ 3.1, we have with probability 1βˆ’1/T2​γ1-1/T^{2\gamma} that

dist​(𝐲Te+1,𝒴⋆)≀Δ.\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})\leq\Delta.

Decomposing the expectation according to whether this event holds, we deduce that

𝔼​[dist​(𝐲Te+1,𝒴⋆)]=\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})]={} 𝔼​[dist​(𝐲Te+1,𝒴⋆)|dist​(𝐲Te+1,𝒴⋆)≀Δ]⋅ℙ​{dist​(𝐲Te+1,𝒴⋆)≀Δ}\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})|\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})\leq\Delta]\cdot\mathbb{P}\{\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})\leq\Delta\}
+𝔼​[dist​(𝐲Te+1,𝒴⋆)​|dist​(𝐲Te+1,𝒴⋆)>​Δ]⋅ℙ​{dist​(𝐲Te+1,𝒴⋆)>Ξ”}\displaystyle+\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})|\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})>\Delta]\cdot\mathbb{P}\{\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})>\Delta\}
≀\displaystyle\leq{} Ξ”+𝖱T2​γ,\displaystyle\Delta+\tfrac{\mathsf{R}}{T^{2\gamma}}, (47)
𝔼​[dist​(𝐲Te+1,𝒴⋆)2]=\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})^{2}]={} 𝔼​[dist​(𝐲Te+1,𝒴⋆)2|dist​(𝐲Te+1,𝒴⋆)≀Δ]⋅ℙ​{dist​(𝐲Te+1,𝒴⋆)≀Δ}\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})^{2}|\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})\leq\Delta]\cdot\mathbb{P}\{\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})\leq\Delta\}
+𝔼​[dist​(𝐲Te+1,𝒴⋆)2​|dist​(𝐲Te+1,𝒴⋆)>​Δ]⋅ℙ​{dist​(𝐲Te+1,𝒴⋆)>Ξ”}\displaystyle+\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})^{2}|\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})>\Delta]\cdot\mathbb{P}\{\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})>\Delta\}
≀\displaystyle\leq{} Ξ”2+𝖱2T2​γ,\displaystyle\Delta^{2}+\tfrac{\mathsf{R}^{2}}{T^{2\gamma}}, (48)

where both (47) and (48) use the fact that 𝐲Te+1βˆˆπ’΄\mathbf{y}^{T_{e}+1}\in\mathcal{Y} imposed by AlgorithmΒ 2. Using LemmaΒ 3.2, we have, conditioned on 𝐲Te+1\mathbf{y}^{T_{e}+1}, that

𝔼​[dist​(𝐲T+1,𝒴⋆)]γ≀\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})]^{\gamma}\leq{} 𝔼​[dist​(𝐲T+1,𝒴⋆)Ξ³]\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{\gamma}] (49)
=\displaystyle={} \mathbb{E}[\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{\gamma}|\mathbf{y}^{T_{e}+1}]]
≀\displaystyle\leq{} 1μ​𝔼​[1α​Tp​dist​(𝐲Te+1,𝒴⋆)2+32​m​(aΒ―+dΒ―)2​α​log⁑Tp]\displaystyle\tfrac{1}{\mu}\mathbb{E}[\tfrac{1}{\alpha T_{p}}\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})^{2}+32m(\bar{a}+\bar{d})^{2}\alpha\log T_{p}] (50)
≀\displaystyle\leq{} 1μ​[1α​Tp​𝔼​[dist​(𝐲Te+1,𝒴⋆)2]+32​m​(aΒ―+dΒ―)2​α​log⁑T]\displaystyle\tfrac{1}{\mu}[\tfrac{1}{\alpha T_{p}}\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})^{2}]+32m(\bar{a}+\bar{d})^{2}\alpha\log T] (51)
≀\displaystyle\leq{} 1μ​[1α​Tp​(Ξ”2+𝖱2T2​γ)+32​m​(aΒ―+dΒ―)2​α​log⁑T],\displaystyle\tfrac{1}{\mu}[\tfrac{1}{\alpha T_{p}}(\Delta^{2}+\tfrac{\mathsf{R}^{2}}{T^{2\gamma}})+32m(\bar{a}+\bar{d})^{2}\alpha\log T], (52)

where (49) uses Jensen's inequality \mathbb{E}[X]^{\gamma}\leq\mathbb{E}[X^{\gamma}], valid for a nonnegative random variable X since \gamma\geq 1; (50) invokes Lemma 3.2; (51) uses T_{p}\leq T; and (52) plugs in (48). Putting the results together, we get

𝔼​[dist​(𝐲T+1,𝒴⋆)]≀\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})]\leq{} (1μ​α​Tp​(Ξ”2+𝖱2T2​γ)+32​m​(aΒ―+dΒ―)2​αμ​log⁑T)1/Ξ³\displaystyle(\tfrac{1}{\mu\alpha T_{p}}(\Delta^{2}+\tfrac{\mathsf{R}^{2}}{T^{2\gamma}})+\tfrac{32m(\bar{a}+\bar{d})^{2}\alpha}{\mu}\log T)^{1/\gamma}
≀\displaystyle\leq{} (1ΞΌ)1/γ​1Ξ±1/γ​Tp1/γ​[Ξ”2/Ξ³+𝖱2/Ξ³T2]+(32​m​(aΒ―+dΒ―)2ΞΌ)1/γ​α1/γ​(log⁑T)1/Ξ³,\displaystyle(\tfrac{1}{\mu})^{1/\gamma}\tfrac{1}{\alpha^{1/\gamma}T_{p}^{1/\gamma}}[\Delta^{2/\gamma}+\tfrac{\mathsf{R}^{2/\gamma}}{T^{2}}]+(\tfrac{32m(\bar{a}+\bar{d})^{2}}{\mu})^{1/\gamma}\alpha^{1/\gamma}(\log T)^{1/\gamma}, (53)

where (53) repeatedly applies (a+b)^{1/\gamma}\leq a^{1/\gamma}+b^{1/\gamma}, which holds for a,b\geq 0 since \gamma\geq 1, and we arrive at

𝔼​[r​(𝐱^T)+v​(𝐱^T)]\displaystyle\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]
≀\displaystyle\leq{} V​(Te)+m​(aΒ―+dΒ―)22​α​Tp\displaystyle V(T_{e})+\tfrac{m(\bar{a}+\bar{d})^{2}}{2}\alpha T_{p}
+𝖱+1α​[Ξ”+𝖱T2​γ+(1ΞΌ)1/γ​1Ξ±1/γ​Tp1/γ​(Ξ”2/Ξ³+𝖱2/Ξ³T2)+(32​m​(aΒ―+dΒ―)2ΞΌ)1/γ​α1/γ​(log⁑T)1/Ξ³+2​d​i​a​m​(𝒴⋆)]\displaystyle+\tfrac{\mathsf{R}+1}{\alpha}[\Delta+\tfrac{\mathsf{R}}{T^{2\gamma}}+(\tfrac{1}{\mu})^{1/\gamma}\tfrac{1}{\alpha^{1/\gamma}T_{p}^{1/\gamma}}(\Delta^{2/\gamma}+\tfrac{\mathsf{R}^{2/\gamma}}{T^{2}})+(\tfrac{32m(\bar{a}+\bar{d})^{2}}{\mu})^{1/\gamma}\alpha^{1/\gamma}(\log T)^{1/\gamma}+2\mathrm{diam}(\mathcal{Y}^{\star})]
=\displaystyle={} V​(Te)+m​(aΒ―+dΒ―)22​α​Tp+(𝖱+1)​[Δα+(1ΞΌ)1/γ​(Ξ”2/Ξ³Ξ±1/Ξ³+1​Tp1/Ξ³)+(32​m​(aΒ―+dΒ―)2ΞΌ)1/γ​α1/Ξ³βˆ’1​(log⁑T)1/Ξ³]\displaystyle V(T_{e})+\tfrac{m(\bar{a}+\bar{d})^{2}}{2}\alpha T_{p}+(\mathsf{R}+1)[\tfrac{\Delta}{\alpha}+(\tfrac{1}{\mu})^{1/\gamma}(\tfrac{\Delta^{2/\gamma}}{\alpha^{1/\gamma+1}T_{p}^{1/\gamma}})+(\tfrac{32m(\bar{a}+\bar{d})^{2}}{\mu})^{1/\gamma}\alpha^{1/\gamma-1}(\log T)^{1/\gamma}]
+(𝖱+1)​[𝖱α​T2​γ+(1ΞΌ)1/γ​𝖱2/Ξ³Ξ±1/Ξ³+1​Tp1/γ​T2]+2​(𝖱+1)α​diam​(𝒴⋆)\displaystyle+(\mathsf{R}+1)[\tfrac{\mathsf{R}}{\alpha T^{2\gamma}}+(\tfrac{1}{\mu})^{1/\gamma}\tfrac{\mathsf{R}^{2/\gamma}}{\alpha^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}}]+\tfrac{2(\mathsf{R}+1)}{\alpha}\mathrm{diam}(\mathcal{Y}^{\star})
=\displaystyle={} V​(Te)+π’ͺ​(α​Tp+Δα+Ξ”2/Ξ³Ξ±1/Ξ³+1​Tp1/Ξ³+Ξ±1/Ξ³βˆ’1​(log⁑T)1/Ξ³+1α​diam​(𝒴⋆)+1α​T2​γ+1Ξ±1/Ξ³+1​Tp1/γ​T2)\displaystyle V(T_{e})+\mathcal{O}(\alpha T_{p}+\tfrac{\Delta}{\alpha}+\tfrac{\Delta^{2/\gamma}}{\alpha^{1/\gamma+1}T_{p}^{1/\gamma}}+\alpha^{1/\gamma-1}(\log T)^{1/\gamma}+\tfrac{1}{\alpha}\mathrm{diam}(\mathcal{Y}^{\star})+\tfrac{1}{\alpha T^{2\gamma}}+\tfrac{1}{\alpha^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}})

and this completes the proof. Here, the explicit expression of TeT_{e} can be obtained from LemmaΒ 3.1:

Te=1μ​{max⁑{9,1728​{2​γ​log⁑T+log⁑⌈log2⁑(2​c¯Δγ)βŒ‰}}β€‹ΞΌβˆ’2/γ​m​(aΒ―+dΒ―)2Ξ”2​(Ξ³βˆ’1)+1}β€‹βŒˆlog2⁑(2​c¯Δγ)βŒ‰.T_{e}=\tfrac{1}{\mu}\{\max\{9,1728\{2\gamma\log T+\log\lceil\log_{2}(\tfrac{2\bar{c}}{\Delta^{\gamma}})\rceil\}\}\tfrac{\mu^{-2/\gamma}m(\bar{a}+\bar{d})^{2}}{\Delta^{2(\gamma-1)}}+1\}\lceil\log_{2}(\tfrac{2\bar{c}}{\Delta^{\gamma}})\rceil.
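
As a rough illustration (with hypothetical constants; not part of the proof), this expression can be evaluated numerically: for \Delta=T^{-\beta}, it grows on the order of T^{2\beta(\gamma-1)}\log^{2}T, which is the order invoked later in the proof of Theorem 4.1. The ratio printed below stabilizes (up to a large instance-dependent constant) as T grows.

import numpy as np

# Hypothetical constants; evaluate the explicit exploration length T_e above and
# compare its growth with the order T^(2*beta*(gamma-1)) * (log T)^2 used in the
# proof of Theorem 4.1 when Delta = T^(-beta).
def exploration_length(T, gamma, mu=0.5, m=3, a_bar=1.0, d_bar=0.7, c_bar=1.0):
    beta = 1.0 / (2.0 * gamma - 1.0)                      # beta* from Theorem 4.1
    Delta = T ** (-beta)
    K = np.ceil(np.log2(2.0 * c_bar / Delta ** gamma))    # ceil(log2(2*c_bar/Delta^gamma))
    inner = max(9.0, 1728.0 * (2.0 * gamma * np.log(T) + np.log(K)))
    return (1.0 / mu) * (inner * mu ** (-2.0 / gamma) * m * (a_bar + d_bar) ** 2
                         / Delta ** (2.0 * (gamma - 1.0)) + 1.0) * K

for gamma in (1.5, 2.0):
    beta = 1.0 / (2.0 * gamma - 1.0)
    for T in (1e6, 1e8, 1e10):
        ratio = exploration_length(T, gamma) / (T ** (2.0 * beta * (gamma - 1.0)) * np.log(T) ** 2)
        print(f"gamma={gamma}, T={T:.0e}: T_e / (T^(2*beta*(gamma-1)) * log(T)^2) = {ratio:.2f}")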

B.5 Proof of Lemma 4.4

Using LemmaΒ B.2, it suffices to verify that the expected dual objective is strongly convex:

f​(y)=12​y+𝔼c​[[cβˆ’y]+]=12​y+∫y1(cβˆ’y)​dc=12​y2βˆ’12​y+12.\displaystyle\textstyle f(y)=\tfrac{1}{2}y+\mathbb{E}_{c}[[c-y]_{+}]=\tfrac{1}{2}y+\int_{y}^{1}(c-y)\mathrm{d}c=\tfrac{1}{2}y^{2}-\tfrac{1}{2}y+\tfrac{1}{2}.

Since f''(y)=1, f(y) is indeed 1-strongly convex.
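
This one-dimensional computation is easy to verify symbolically; the following sketch (an illustration, not part of the proof) uses sympy to recover the closed form and the strong-convexity constant.

import sympy as sp

# Symbolic check of the expected dual objective in Lemma 4.4:
# f(y) = y/2 + E_c[(c - y)_+] with c ~ Uniform[0, 1] and y in [0, 1].
y, c = sp.symbols("y c", nonnegative=True)
f = sp.Rational(1, 2) * y + sp.integrate(c - y, (c, y, 1))  # E[(c - y)_+] = int_y^1 (c - y) dc
print(sp.expand(f))          # y**2/2 - y/2 + 1/2
print(sp.diff(f, y, 2))      # 1  -> f is 1-strongly convex on [0, 1]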

B.6 Proof of Lemma 4.5

First, we establish a recursion for \mathbb{E}[y^{t+1}] in terms of \mathbb{E}[y^{t}]. Specifically, we have

𝔼​[yt+1]=\displaystyle\mathbb{E}[y^{t+1}]={} 𝔼​[[ytβˆ’1t​(12βˆ’π•€β€‹{ct>yt})]+]\displaystyle\mathbb{E}[[y^{t}-\tfrac{1}{t}(\tfrac{1}{2}-\mathbb{I}\{c_{t}>y^{t}\})]_{+}] (54)
β‰₯\displaystyle\geq{} 𝔼​[ytβˆ’1t​(12βˆ’π•€β€‹{ct>yt})]\displaystyle\mathbb{E}[y^{t}-\tfrac{1}{t}(\tfrac{1}{2}-\mathbb{I}\{c_{t}>y^{t}\})] (55)
β‰₯\displaystyle\geq{} 𝔼​[ytβˆ’1t​yt+12​t]\displaystyle\mathbb{E}[y^{t}-\tfrac{1}{t}y^{t}+\tfrac{1}{2t}] (56)

where (54) is obtained from the subgradient update rule, (55) uses [x]_{+}\geq x, and (56) follows from the fact that c_{t} is independent of y^{t} and drawn uniformly from [0,1]. Indeed, we have

\mathbb{E}[\mathbb{I}\{c_{t}>y^{t}\}]=\mathbb{E}[\mathbb{E}[\mathbb{I}\{c_{t}>y^{t}\}|y^{t}]]=\mathbb{E}\big{[}\textstyle\int_{0}^{1}\mathbb{I}\{c>y^{t}\}\mathrm{d}c\big{]}=\mathbb{E}[1-y^{t}].

Subtracting 1/2 from both sides and multiplying the inequality by t, we have

t​(𝔼​[yt+1]βˆ’12)β‰₯(tβˆ’1)​(𝔼​[yt]βˆ’12),for allΒ t=1,…,T.\displaystyle t(\mathbb{E}[y^{t+1}]-\tfrac{1}{2})\geq(t-1)(\mathbb{E}[y^{t}]-\tfrac{1}{2}),\quad\text{for all $t=1,\dots,T$.}

Next, we condition on the value of y^{t_{0}} and iterate the recursion from t_{0} to t to obtain

t​(𝔼​[yt+1|yt0]βˆ’12)β‰₯(t0βˆ’1)​(yt0βˆ’12).\displaystyle t(\mathbb{E}[y^{t+1}|y^{t_{0}}]-\tfrac{1}{2})\geq(t_{0}-1)(y^{t_{0}}-\tfrac{1}{2}). (57)

Thus, given yt0>y⋆+1T=12+1Ty^{t_{0}}>y^{\star}+\tfrac{1}{\sqrt{T}}=\frac{1}{2}+\tfrac{1}{\sqrt{T}} for some t0t_{0}, we have

t​(𝔼​[yt+1|yt0]βˆ’12)β‰₯(t0βˆ’1)​(yt0βˆ’12)β‰₯t0βˆ’1T,\displaystyle t(\mathbb{E}[y^{t+1}|y^{t_{0}}]-\tfrac{1}{2})\geq(t_{0}-1)(y^{t_{0}}-\tfrac{1}{2})\geq\tfrac{t_{0}-1}{\sqrt{T}}, (58)

As a result, when t0β‰₯T10+1t_{0}\geq\tfrac{T}{10}+1, (58) implies

𝔼​[yt+1|yt0]β‰₯12+t0βˆ’1tΓ—Tβ‰₯12+110​T,\displaystyle\mathbb{E}[y^{t+1}|y^{t_{0}}]\geq\tfrac{1}{2}+\tfrac{t_{0}-1}{t\times\sqrt{T}}\geq\tfrac{1}{2}+\tfrac{1}{10\sqrt{T}},

since we assume t0β‰₯T/10+1t_{0}\geq T/10+1. This completes the proof.
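
The drift established above can also be observed empirically. The following Monte Carlo sketch (a hypothetical illustration with arbitrary horizon and sample sizes) runs the 1/t-stepsize update from an initialization above 1/2+1/\sqrt{T} at t_{0}\approx T/10 and compares the resulting mean gap \mathbb{E}[y^{T+1}]-1/2 with the lower bound 1/(10\sqrt{T}) in (58).

import numpy as np

# Monte Carlo sketch for Lemma 4.5 (hypothetical experiment, not from the paper):
# with stepsize 1/t, an iterate initialized above 1/2 + 1/sqrt(T) at t0 ~ T/10
# keeps its conditional mean at least ~ 1/(10*sqrt(T)) above y* = 1/2, as in (58).
rng = np.random.default_rng(1)
T = 10_000
t0 = T // 10 + 1
n_runs = 5_000
y = np.full(n_runs, 0.5 + 2.0 / np.sqrt(T))   # y^{t0} strictly above 1/2 + 1/sqrt(T)

for t in range(t0, T + 1):
    c = rng.uniform(size=n_runs)
    y = np.maximum(y - (1.0 / t) * (0.5 - (c > y)), 0.0)  # projected subgradient step

print(f"empirical  E[y^(T+1)] - 1/2 ~= {y.mean() - 0.5:.5f}")
print(f"lower bound 1/(10*sqrt(T))  = {1.0 / (10.0 * np.sqrt(T)):.5f}")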

B.7 Proof of Proposition 4.1

Based on [26], there exists some universal constant c>0c>0 such that with probability no less than 1βˆ’1/T41-1/T^{4}, |ytβˆ’y⋆|≀c​log⁑T/T|y^{t}-y^{\star}|\leq c\log T/\sqrt{T} for all tβ‰₯t0t\geq t_{0}, where y⋆=12y^{\star}=\tfrac{1}{2} and t0=π’ͺ​(log⁑T)t_{0}=\mathcal{O}(\log T). Thus, without loss of generality, we assume

yt∈[14,34],Β and ​yt+1=ytβˆ’1t​(12βˆ’π•€β€‹{ct>yt})\displaystyle y^{t}\in[\tfrac{1}{4},\tfrac{3}{4}],\text{ and }y^{t+1}=y^{t}-\tfrac{1}{t}(\tfrac{1}{2}-\mathbb{I}\{c_{t}>y^{t}\}) (59)

for all t\geq t_{0}, by setting a new random initialization y^{t_{0}}\in[1/4,3/4] and ignoring all decision steps before step t_{0}. In the following, we show that SGM with an \mathcal{O}(1/(\mu t)) stepsize must incur \Omega(T^{1/2}) regret or constraint violation for any initialization y^{t_{0}}. We first compute \mathbb{E}[y^{t}-\tfrac{1}{2}] and \mathbb{E}[(y^{t}-\tfrac{1}{2})^{2}], similarly to the proof of Lemma 4.5. Specifically, for \mathbb{E}[y^{t}-1/2], we have

𝔼​[yt+1|yt]=(1βˆ’1t)​yt+12​t,\displaystyle\mathbb{E}[y^{t+1}|y^{t}]=\big{(}1-\tfrac{1}{t}\big{)}y^{t}+\tfrac{1}{2t},

which implies

\displaystyle\mathbb{E}[y^{t+1}-\tfrac{1}{2}|y^{t_{0}}]=\tfrac{t_{0}-1}{t}(y^{t_{0}}-\tfrac{1}{2}). (60)

Also, similarly, for 𝔼​[(ytβˆ’1/2)2]\mathbb{E}[(y^{t}-1/2)^{2}] we have under assumption (59)

𝔼​[(yt+1βˆ’12)2|yt]\displaystyle\mathbb{E}[(y^{t+1}-\tfrac{1}{2})^{2}|y^{t}] =𝔼​[(ytβˆ’1t​(12βˆ’π•€β€‹{ct>yt})βˆ’12)2|yt]\displaystyle=\mathbb{E}[(y^{t}-\tfrac{1}{t}(\tfrac{1}{2}-\mathbb{I}\{c_{t}>y^{t}\})-\tfrac{1}{2})^{2}|y^{t}]
=(1βˆ’1t)2​(ytβˆ’12)2+14​t2βˆ’1t2​(ytβˆ’12)2\displaystyle=(1-\tfrac{1}{t})^{2}(y^{t}-\tfrac{1}{2})^{2}+\tfrac{1}{4t^{2}}-\tfrac{1}{t^{2}}(y^{t}-\tfrac{1}{2})^{2}
β‰₯(1βˆ’1t)2​(ytβˆ’12)2+14​t2βˆ’ct3,\displaystyle\geq(1-\tfrac{1}{t})^{2}(y^{t}-\tfrac{1}{2})^{2}+\tfrac{1}{4t^{2}}-\tfrac{c}{t^{3}},

which implies

\displaystyle\mathbb{E}[(y^{t+1}-\tfrac{1}{2})^{2}|y^{t_{0}}]\geq\tfrac{(t_{0}-1)^{2}}{t^{2}}(y^{t_{0}}-\tfrac{1}{2})^{2}+\tfrac{1}{4t}-\tfrac{c\log t+t_{0}}{t^{2}}. (61)

Combining (60) and (61), we can then compute

𝔼​[(βˆ‘t=t0T𝕀​{ct>yt}βˆ’Tβˆ’t0+12)2]\displaystyle~{}~{}~{}~{}\textstyle\mathbb{E}[(\sum_{t=t_{0}}^{T}\mathbb{I}{\{c_{t}>y^{t}\}}-\tfrac{T-t_{0}+1}{2})^{2}] (62)
=βˆ‘t=t0T𝔼​[(𝕀​{ct>yt}βˆ’12)2]+2β€‹βˆ‘t0≀i<j≀T𝔼​[(𝕀​{cj>yj}βˆ’12)​(𝕀​{ci>yi}βˆ’12)]\displaystyle=\textstyle\sum_{t=t_{0}}^{T}\mathbb{E}[(\mathbb{I}{\{c_{t}>y^{t}\}}-\tfrac{1}{2})^{2}]+2\sum_{t_{0}\leq i<j\leq T}\mathbb{E}[(\mathbb{I}{\{c_{j}>y^{j}\}}-\tfrac{1}{2})(\mathbb{I}{\{c_{i}>y^{i}\}}-\tfrac{1}{2})]
=Tβˆ’t04+2β€‹βˆ‘t0≀i<j≀T𝔼​[(𝕀​{cj>yj}βˆ’12)​(𝕀​{ci>yi}βˆ’12)]\displaystyle=\textstyle\tfrac{T-t_{0}}{4}+2\sum_{t_{0}\leq i<j\leq T}\mathbb{E}[(\mathbb{I}{\{c_{j}>y^{j}\}}-\tfrac{1}{2})(\mathbb{I}{\{c_{i}>y^{i}\}}-\tfrac{1}{2})]
=Tβˆ’t04+2β€‹βˆ‘t0≀i<j≀Tiβˆ’1jβˆ’1​𝔼​[(yiβˆ’12)2]βˆ’iβˆ’14​i​(jβˆ’1)\displaystyle=\textstyle\tfrac{T-t_{0}}{4}+2\sum_{t_{0}\leq i<j\leq T}\tfrac{i-1}{j-1}\mathbb{E}[(y^{i}-\tfrac{1}{2})^{2}]-\tfrac{i-1}{4i(j-1)} (63)
β‰₯Tβˆ’t04βˆ’2β€‹βˆ‘t0≀i<j≀Tc​log⁑T+t0(iβˆ’1)2\displaystyle\geq\textstyle\tfrac{T-t_{0}}{4}-2\sum_{t_{0}\leq i<j\leq T}\tfrac{c\log T+t_{0}}{(i-1)^{2}}
=Ω​(T).\displaystyle=\Omega(T).

In addition, since |ytβˆ’12|≀cT|y^{t}-\tfrac{1}{2}|\leq\tfrac{c}{\sqrt{T}}, by LemmaΒ A.1, we have with probability no less than 1βˆ’1T21-\tfrac{1}{T^{2}}

|βˆ‘t=t0T𝕀​{ct>yt}βˆ’Tβˆ’t0+12|=π’ͺ​(T​log⁑T).\displaystyle\big{|}\textstyle\sum_{t=t_{0}}^{T}\mathbb{I}{\{c_{t}>y^{t}\}}-\tfrac{T-t_{0}+1}{2}\big{|}=\mathcal{O}(\sqrt{T}\log T).

Consequently, combining (62) with the high-probability bound above (and the trivial bound that the sum is at most T), we have

𝔼​[|βˆ‘t=t0T𝕀​{ct>yt}βˆ’Tβˆ’t0+12|]=Ω​(Tlog⁑T).\displaystyle\mathbb{E}\big{[}\big{|}\textstyle\sum_{t=t_{0}}^{T}\mathbb{I}{\{c_{t}>y^{t}\}}-\tfrac{T-t_{0}+1}{2}\big{|}\big{]}=\Omega(\tfrac{\sqrt{T}}{\log T}). (64)

This quantity is the sum of the constraint violation and the (resource) leftover, and thus the sum of the constraint violation and the regret must be no less than \Omega(\sqrt{T}/\log T).
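
The \sqrt{T}-scale fluctuation behind (64) is visible in simulation; the following sketch (a hypothetical illustration, not the paper's experiments) tracks the imbalance |\sum_{t\geq t_{0}}\mathbb{I}\{c_{t}>y^{t}\}-(T-t_{0}+1)/2| under the 1/t stepsize and compares it with \sqrt{T} across horizons.

import numpy as np

# Hypothetical illustration of Proposition 4.1: with stepsize 1/t, the imbalance
# |sum_t I{c_t > y^t} - (T - t0 + 1)/2|, which drives (64), grows on the order of
# sqrt(T) as the horizon increases.
rng = np.random.default_rng(2)

def mean_imbalance(T, t0=10, n_runs=200):
    y = np.full(n_runs, 0.5)
    accepted = np.zeros(n_runs)
    for t in range(t0, T + 1):
        c = rng.uniform(size=n_runs)
        accept = c > y
        accepted += accept
        y = np.maximum(y - (1.0 / t) * (0.5 - accept), 0.0)  # 1/t subgradient step
    return np.abs(accepted - (T - t0 + 1) / 2.0).mean()

for T in (1_000, 4_000, 16_000):
    print(f"T = {T:6d}: E|imbalance| ~= {mean_imbalance(T):7.2f},  sqrt(T) = {np.sqrt(T):6.1f}")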

B.8 Proof of Theorem 4.1

First note that for sufficiently large TT, the condition Ξ±e≀2​dΒ―3​m​(aΒ―+dΒ―)2\alpha_{e}\leq\frac{2\underline{d}}{3m(\bar{a}+\bar{d})^{2}} from LemmaΒ B.1 will be satisfied and all the dual iterates {𝐲t}t=Te+1T\{\mathbf{y}^{t}\}_{t=T_{e}+1}^{T} will stay in 𝒴\mathcal{Y} almost surely. When diam​(𝒴⋆)=0\mathrm{diam}(\mathcal{Y}^{\star})=0, we consider

1Ξ±e+Ξ±e​Te+Ξ±p​Tp+Δαp+Ξ”2/Ξ³Ξ±p1/Ξ³+1​Tp1/Ξ³+Ξ±p1/Ξ³βˆ’1​(log⁑T)1/Ξ³+1Ξ±p​T2​γ+1Ξ±p1/Ξ³+1​Tp1/γ​T2.\tfrac{1}{\alpha_{e}}+\alpha_{e}T_{e}+\alpha_{p}T_{p}+\tfrac{\Delta}{\alpha_{p}}+\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}+\alpha_{p}^{1/\gamma-1}(\log T)^{1/\gamma}+\tfrac{1}{\alpha_{p}T^{2\gamma}}+\tfrac{1}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}}.

Since Ξ±e\alpha_{e} only appears in 1Ξ±e+Ξ±e​Te\tfrac{1}{\alpha_{e}}+\alpha_{e}T_{e}, we let Ξ±e=π’ͺ​(1/Te)\alpha_{e}=\mathcal{O}(1/\sqrt{T_{e}}) to optimize the trade-off. Hence, it suffices to consider

Te+Ξ±p​Tp+Δαp+Ξ”2/Ξ³Ξ±p1/Ξ³+1​Tp1/Ξ³+Ξ±p1/Ξ³βˆ’1​(log⁑T)1/Ξ³+1Ξ±p​T2​γ+1Ξ±p1/Ξ³+1​Tp1/γ​T2.\sqrt{T_{e}}+\alpha_{p}T_{p}+\tfrac{\Delta}{\alpha_{p}}+\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}+\alpha_{p}^{1/\gamma-1}(\log T)^{1/\gamma}+\tfrac{1}{\alpha_{p}T^{2\gamma}}+\tfrac{1}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}}.

Taking Ξ”=π’ͺ​(Tβˆ’Ξ²)\Delta=\mathcal{O}(T^{-\beta}) and Ξ±p=π’ͺ​(Tβˆ’Ξ»)\alpha_{p}=\mathcal{O}(T^{-\lambda}) with (Ξ²,Ξ»)β‰₯0(\beta,\lambda)\geq 0, we have Te=π’ͺ​(T2​β​(Ξ³βˆ’1)​log2⁑T)T_{e}=\mathcal{O}(T^{2\beta(\gamma-1)}\log^{2}T) according to LemmaΒ 3.1 and (8), and Te=π’ͺ​(TΞ²β€‹Ξ³βˆ’Ξ²β€‹log⁑T)\sqrt{T_{e}}=\mathcal{O}(T^{\beta\gamma-\beta}\log T). Moreover, we have, using β‰…\cong to denote equivalence under π’ͺ​(β‹…)\mathcal{O}(\cdot) notation, that

Ξ±p​Tpβ‰…\displaystyle\alpha_{p}T_{p}\cong{} T1βˆ’Ξ»\displaystyle T^{1-\lambda}
Δαpβ‰…\displaystyle\tfrac{\Delta}{\alpha_{p}}\cong{} TΞ»βˆ’Ξ²\displaystyle T^{\lambda-\beta}
Ξ”2/Ξ³Ξ±p1/Ξ³+1​Tp1/Ξ³β‰…\displaystyle\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}\cong{} Tβˆ’2​β/Ξ³Tβˆ’Ξ»/Ξ³βˆ’Ξ»β€‹T1/γ​(1βˆ’T2​β​(Ξ³βˆ’1)βˆ’1​log2⁑T)1/Ξ³=Tβˆ’2​β+Ξ»βˆ’1Ξ³+Ξ»(1βˆ’T2​β​(Ξ³βˆ’1)βˆ’1​log2⁑T)1/Ξ³\displaystyle\tfrac{T^{-2\beta/\gamma}}{T^{-\lambda/\gamma-\lambda}T^{1/\gamma}(1-T^{2\beta(\gamma-1)-1}\log^{2}T)^{1/\gamma}}=\tfrac{T^{\frac{-2\beta+\lambda-1}{\gamma}+\lambda}}{(1-T^{2\beta(\gamma-1)-1}\log^{2}T)^{1/\gamma}}
Ξ±p1/Ξ³βˆ’1​(log⁑T)1/Ξ³β‰…\displaystyle\alpha_{p}^{1/\gamma-1}(\log T)^{1/\gamma}\cong{} TΞ»βˆ’Ξ»/γ​(log⁑T)1/Ξ³\displaystyle T^{\lambda-\lambda/\gamma}(\log T)^{1/\gamma}
1Ξ±p​T2​γ=\displaystyle\tfrac{1}{\alpha_{p}T^{2\gamma}}={} π’ͺ​(1)\displaystyle\mathcal{O}(1)
\displaystyle\tfrac{1}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}}={} \mathcal{O}(1).

Suppose 2​β​(Ξ³βˆ’1)βˆ’1<02\beta(\gamma-1)-1<0. Then Ξ”2/Ξ³Ξ±p1/Ξ³+1​Tp1/Ξ³β‰…Tβˆ’2​β+Ξ»βˆ’1Ξ³+Ξ»\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}\cong{}T^{\frac{-2\beta+\lambda-1}{\gamma}+\lambda} and

Te+Ξ±p​Tp+Δαp+Ξ”2/Ξ³Ξ±p1/Ξ³+1​Tp1/Ξ³+Ξ±p1/Ξ³βˆ’1​(log⁑T)1/Ξ³+1Ξ±p​T2​γ+1Ξ±p1/Ξ³+1​Tp1/γ​T2\displaystyle\sqrt{T_{e}}+\alpha_{p}T_{p}+\tfrac{\Delta}{\alpha_{p}}+\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}+\alpha_{p}^{1/\gamma-1}(\log T)^{1/\gamma}+\tfrac{1}{\alpha_{p}T^{2\gamma}}+\tfrac{1}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}}
β‰…\displaystyle\cong{} TΞ²β€‹Ξ³βˆ’Ξ²β€‹log⁑T+T1βˆ’Ξ»+TΞ»βˆ’Ξ²+Tβˆ’2​β+Ξ»βˆ’1Ξ³+Ξ»+TΞ»βˆ’Ξ»/γ​(log⁑T)1/Ξ³\displaystyle T^{\beta\gamma-\beta}\log T+T^{1-\lambda}+T^{\lambda-\beta}+T^{\frac{-2\beta+\lambda-1}{\gamma}+\lambda}+T^{\lambda-\lambda/\gamma}(\log T)^{1/\gamma}
≲\displaystyle\lesssim{} [TΞ²β€‹Ξ³βˆ’Ξ²+T1βˆ’Ξ»+TΞ»βˆ’Ξ²+Tβˆ’2​β+Ξ»βˆ’1Ξ³+Ξ»+TΞ»βˆ’Ξ»/Ξ³]​log⁑T,\displaystyle[T^{\beta\gamma-\beta}+T^{1-\lambda}+T^{\lambda-\beta}+T^{\frac{-2\beta+\lambda-1}{\gamma}+\lambda}+T^{\lambda-\lambda/\gamma}]\log T, (65)

where (65) uses Ξ³β‰₯1\gamma\geq 1 and that (log⁑T)1/γ≀log⁑T(\log T)^{1/\gamma}\leq\log T. To find the optimal trade-off, we solve the following optimization problem

minΞ»,Ξ²\displaystyle\min_{\lambda,\beta} max⁑{Ξ²β€‹Ξ³βˆ’Ξ²,1βˆ’Ξ»,Ξ»βˆ’Ξ²,βˆ’2​β+Ξ»βˆ’1Ξ³+Ξ»,Ξ»βˆ’Ξ»Ξ³}\displaystyle\max\{\beta\gamma-\beta,1-\lambda,\lambda-\beta,\tfrac{-2\beta+\lambda-1}{\gamma}+\lambda,\lambda-\tfrac{\lambda}{\gamma}\}
subject to (Ξ»,Ξ²)β‰₯0.\displaystyle(\lambda,\beta)\geq 0.

The solution yields λ⋆=Ξ³2β€‹Ξ³βˆ’1\lambda^{\star}=\frac{\gamma}{2\gamma-1} and β⋆=12β€‹Ξ³βˆ’1\beta^{\star}=\frac{1}{2\gamma-1} and

2​β⋆​(Ξ³βˆ’1)βˆ’1=2β€‹Ξ³βˆ’22β€‹Ξ³βˆ’1βˆ’1=βˆ’12β€‹Ξ³βˆ’1<02\beta^{\star}(\gamma-1)-1=\tfrac{2\gamma-2}{2\gamma-1}-1=-\tfrac{1}{2\gamma-1}<0

always holds. Hence

TΞ²β‹†β€‹Ξ³βˆ’Ξ²β‹†+T1βˆ’Ξ»β‹†+TΞ»β‹†βˆ’Ξ²β‹†+Tβˆ’2​β⋆+Ξ»β‹†βˆ’1Ξ³+λ⋆+TΞ»β‹†βˆ’Ξ»β‹†/Ξ³=π’ͺ​(TΞ³βˆ’12β€‹Ξ³βˆ’1​log⁑T)T^{\beta^{\star}\gamma-\beta^{\star}}+T^{1-\lambda^{\star}}+T^{\lambda^{\star}-\beta^{\star}}+T^{\frac{-2\beta^{\star}+\lambda^{\star}-1}{\gamma}+\lambda^{\star}}+T^{\lambda^{\star}-\lambda^{\star}/\gamma}=\mathcal{O}(T^{\frac{\gamma-1}{2\gamma-1}}\log T)

and this completes the proof for diam​(𝒴⋆)=0\mathrm{diam}(\mathcal{Y}^{\star})=0.
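
The claimed minimizer (\lambda^{\star},\beta^{\star}) of the min-max problem above can be cross-checked numerically; the following sketch compares a grid search over (\lambda,\beta) with the closed-form solution (agreement is up to grid resolution).

import numpy as np

# Numerical cross-check that lambda* = gamma/(2*gamma - 1), beta* = 1/(2*gamma - 1)
# minimize the worst-case exponent max{beta*gamma - beta, 1 - lam, lam - beta,
# (-2*beta + lam - 1)/gamma + lam, lam - lam/gamma}.
def worst_exponent(lam, beta, gamma):
    return max(beta * gamma - beta,
               1.0 - lam,
               lam - beta,
               (-2.0 * beta + lam - 1.0) / gamma + lam,
               lam - lam / gamma)

for gamma in (1.0, 1.5, 2.0, 3.0):
    grid = np.linspace(0.0, 1.5, 301)
    best = min((worst_exponent(l, b, gamma), l, b) for l in grid for b in grid)
    lam_star, beta_star = gamma / (2.0 * gamma - 1.0), 1.0 / (2.0 * gamma - 1.0)
    print(f"gamma={gamma}: grid-best exponent {best[0]:.4f} at (lam, beta)=({best[1]:.3f}, {best[2]:.3f}); "
          f"claimed {worst_exponent(lam_star, beta_star, gamma):.4f} "
          f"= (gamma-1)/(2*gamma-1) = {(gamma - 1.0) / (2.0 * gamma - 1.0):.4f}")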

Next, consider the case diam​(𝒴⋆)>0\mathrm{diam}(\mathcal{Y}^{\star})>0. In this case we need to consider the trade-off:

\tfrac{1}{\alpha_{e}}+\alpha_{e}T_{e}+\alpha_{p}T_{p}+\tfrac{\Delta}{\alpha_{p}}+\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}+\alpha_{p}^{1/\gamma-1}(\log T)^{1/\gamma}+\tfrac{\mathrm{diam}(\mathcal{Y}^{\star})}{\alpha_{p}}+\tfrac{1}{\alpha_{p}T^{2\gamma}}+\tfrac{1}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}T^{2}}.

Note that \tfrac{1}{\alpha_{e}}+\alpha_{e}T_{e}+\alpha_{p}T_{p}+\tfrac{\mathrm{diam}(\mathcal{Y}^{\star})}{\alpha_{p}}\geq 2\sqrt{T_{e}}+2\sqrt{T_{p}\mathrm{diam}(\mathcal{Y}^{\star})}, which, together with T_{e}+T_{p}=T, makes it impossible to achieve better than \mathcal{O}(\sqrt{T}) regret. Hence, we instead focus on improving the constant associated with \sqrt{T}.

Using \mathsf{R}=\tfrac{\bar{c}}{\underline{d}}+\mathcal{O}(\max\{\alpha_{e},\alpha_{p}\}) and supposing that \alpha_{e},\alpha_{p} are of the same order with respect to T, we obtain

𝔼​[r​(𝐱^T)+v​(𝐱^T)]≀\displaystyle\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq{} m​(aΒ―+dΒ―)22​(Ξ±e​Te+Ξ±p​Tp)+𝖱αe+2​(𝖱+1)Ξ±p​diam​(𝒴⋆)\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}}{2}(\alpha_{e}T_{e}+\alpha_{p}T_{p})+\tfrac{\mathsf{R}}{\alpha_{e}}+\tfrac{2(\mathsf{R}+1)}{\alpha_{p}}\mathrm{diam}(\mathcal{Y}^{\star})
+(𝖱+1)​[Δαp+(1ΞΌ)1/γ​Δ2/Ξ³Ξ±p1/Ξ³+1​Tp1/Ξ³+(32​m​(aΒ―+dΒ―)2ΞΌ)1/γ​αp1/γ​(log⁑T)1/Ξ³]+π’ͺ​(1)\displaystyle+(\mathsf{R}+1)[\tfrac{\Delta}{\alpha_{p}}+(\tfrac{1}{\mu})^{1/\gamma}\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}+(\tfrac{32m(\bar{a}+\bar{d})^{2}}{\mu})^{1/\gamma}\alpha^{1/\gamma}_{p}(\log T)^{1/\gamma}]+\mathcal{O}(1)
=\displaystyle={} m​(aΒ―+dΒ―)22​(Ξ±e​Te+Ξ±p​Tp)+cΒ―d¯​1Ξ±e+cΒ―d¯​2​d​i​a​m​(𝒴⋆)Ξ±p\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}}{2}(\alpha_{e}T_{e}+\alpha_{p}T_{p})+\tfrac{\bar{c}}{\underline{d}}\tfrac{1}{\alpha_{e}}+\tfrac{\bar{c}}{\underline{d}}\tfrac{2\mathrm{diam}(\mathcal{Y}^{\star})}{\alpha_{p}}
+(𝖱+1)​[Δαp+(1ΞΌ)1/γ​Δ2/Ξ³Ξ±p1/Ξ³+1​Tp1/Ξ³+(32​m​(aΒ―+dΒ―)2ΞΌ)1/γ​αp1/γ​(log⁑T)1/Ξ³]+π’ͺ​(1).\displaystyle+(\mathsf{R}+1)[\tfrac{\Delta}{\alpha_{p}}+(\tfrac{1}{\mu})^{1/\gamma}\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}+(\tfrac{32m(\bar{a}+\bar{d})^{2}}{\mu})^{1/\gamma}\alpha^{1/\gamma}_{p}(\log T)^{1/\gamma}]+\mathcal{O}(1).

Suppose we take Te=θ​TT_{e}=\theta T and Tp=(1βˆ’ΞΈ)​TT_{p}=(1-\theta)T for θ∈(0,1)\theta\in(0,1) and we let Ξ±e=Ξ²eTe=Ξ²eθ​T,Ξ±p=Ξ²pTp=Ξ²p(1βˆ’ΞΈ)​T\alpha_{e}=\tfrac{\beta_{e}}{\sqrt{T_{e}}}=\tfrac{\beta_{e}}{\sqrt{\theta T}},\alpha_{p}=\tfrac{\beta_{p}}{\sqrt{T_{p}}}=\tfrac{\beta_{p}}{\sqrt{(1-\theta)T}}. Then, Ξ”=o​(1)\Delta=o(1) and

(𝖱+1)​[Δαp+(1ΞΌ)1/γ​Δ2/Ξ³Ξ±p1/Ξ³+1​Tp1/Ξ³+(32​m​(aΒ―+dΒ―)2ΞΌ)1/γ​αp1/γ​(log⁑T)1/Ξ³]=o​(T).(\mathsf{R}+1)[\tfrac{\Delta}{\alpha_{p}}+(\tfrac{1}{\mu})^{1/\gamma}\tfrac{\Delta^{2/\gamma}}{\alpha_{p}^{1/\gamma+1}T_{p}^{1/\gamma}}+(\tfrac{32m(\bar{a}+\bar{d})^{2}}{\mu})^{1/\gamma}\alpha^{1/\gamma}_{p}(\log T)^{1/\gamma}]=o(\sqrt{T}).

Hence, it suffices to consider

m​(aΒ―+dΒ―)22​(Ξ±e​Te+Ξ±p​Tp)+cΒ―d¯​1Ξ±e+cΒ―d¯​2​d​i​a​m​(𝒴⋆)Ξ±p\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}}{2}(\alpha_{e}T_{e}+\alpha_{p}T_{p})+\tfrac{\bar{c}}{\underline{d}}\tfrac{1}{\alpha_{e}}+\tfrac{\bar{c}}{\underline{d}}\tfrac{2\mathrm{diam}(\mathcal{Y}^{\star})}{\alpha_{p}}
=\displaystyle={} m​(aΒ―+dΒ―)22​βe​θ​T+m​(aΒ―+dΒ―)22​βp​(1βˆ’ΞΈ)​T+cΒ―d¯​θ​TΞ²e+cΒ―d¯​2​d​i​a​m​(𝒴⋆)Ξ²p​(1βˆ’ΞΈ)​T\displaystyle\tfrac{m(\bar{a}+\bar{d})^{2}}{2}\beta_{e}\sqrt{\theta T}+\tfrac{m(\bar{a}+\bar{d})^{2}}{2}\beta_{p}\sqrt{(1-\theta)T}+\tfrac{\bar{c}}{\underline{d}}\tfrac{\sqrt{\theta T}}{\beta_{e}}+\tfrac{\bar{c}}{\underline{d}}\tfrac{2\mathrm{diam}(\mathcal{Y}^{\star})}{\beta_{p}}\sqrt{(1-\theta)T}
=\displaystyle={} \big{[}\tfrac{m(\bar{a}+\bar{d})^{2}}{2}\beta_{e}+\tfrac{\bar{c}}{\underline{d}\beta_{e}}\big{]}\sqrt{\theta T}+\big{[}\tfrac{m(\bar{a}+\bar{d})^{2}}{2}\beta_{p}+\tfrac{2\mathrm{diam}(\mathcal{Y}^{\star})\bar{c}}{\underline{d}\beta_{p}}\big{]}\sqrt{(1-\theta)T}.

Taking Ξ²e=2m​(aΒ―+dΒ―)2β‹…cΒ―dΒ―\beta_{e}=\sqrt{\tfrac{2}{m(\bar{a}+\bar{d})^{2}}\cdot\tfrac{\bar{c}}{\underline{d}}} and Ξ²p=2m​(aΒ―+dΒ―)2β‹…2​d​i​a​m​(𝒴⋆)​cΒ―dΒ―\beta_{p}=\sqrt{\tfrac{2}{m(\bar{a}+\bar{d})^{2}}\cdot\tfrac{2\mathrm{diam}(\mathcal{Y}^{\star})\bar{c}}{\underline{d}}} to optimize the two trade-offs, we get

𝔼​[r​(𝐱^T)+v​(𝐱^T)]≀2​m​cΒ―2​d¯​(aΒ―+dΒ―)​θ​T+2​2​m​cΒ―2​d¯​(aΒ―+dΒ―)​diam​(𝒴⋆)​(1βˆ’ΞΈ)​T\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq 2\sqrt{\tfrac{m\bar{c}}{2\underline{d}}}(\bar{a}+\bar{d})\sqrt{\theta T}+2\sqrt{2}\sqrt{\tfrac{m\bar{c}}{2\underline{d}}}(\bar{a}+\bar{d})\sqrt{\mathrm{diam}(\mathcal{Y}^{\star})}\sqrt{(1-\theta)T}

With ΞΈ=2​d​i​a​m​(𝒴⋆)2​d​i​a​m​(𝒴⋆)+1\theta=\frac{2\mathrm{diam}(\mathcal{Y}^{\star})}{2\mathrm{diam}(\mathcal{Y}^{\star})+1}, we have

𝔼​[r​(𝐱^T)+v​(𝐱^T)]≀4​m​cΒ―2​d¯​2​d​i​a​m​(𝒴⋆)2​d​i​a​m​(𝒴⋆)+1​(aΒ―+dΒ―)​T.\mathbb{E}[r(\hat{\mathbf{x}}_{T})+v(\hat{\mathbf{x}}_{T})]\leq 4\sqrt{\tfrac{m\bar{c}}{2\underline{d}}}\sqrt{\tfrac{2\mathrm{diam}(\mathcal{Y}^{\star})}{2\mathrm{diam}(\mathcal{Y}^{\star})+1}}(\bar{a}+\bar{d})\sqrt{T}.

Since diam​(𝒴⋆)β‰₯0\mathrm{diam}(\mathcal{Y}^{\star})\geq 0, this completes the proof.
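
The choices of \beta_{e}, \beta_{p} and \theta above can be verified numerically for any instance constants: each bracket is of the form A\beta+B/\beta, minimized at \beta=\sqrt{B/A} with value 2\sqrt{AB}, and the stated \theta makes the two resulting \sqrt{T} coefficients equal. The following sketch uses hypothetical constants.

import numpy as np

# Hypothetical constants; check that beta_e, beta_p minimize A*beta + B/beta and that
# theta = 2*diam/(2*diam + 1) reproduces the coefficient in the final sqrt(T) bound.
m, a_bar, d_bar, d_low, c_bar, diam = 3, 1.0, 0.7, 0.5, 1.0, 0.4
A = m * (a_bar + d_bar) ** 2 / 2.0
B_e, B_p = c_bar / d_low, 2.0 * diam * c_bar / d_low

beta_e, beta_p = np.sqrt(B_e / A), np.sqrt(B_p / A)
theta = 2.0 * diam / (2.0 * diam + 1.0)
coeff = (A * beta_e + B_e / beta_e) * np.sqrt(theta) \
      + (A * beta_p + B_p / beta_p) * np.sqrt(1.0 - theta)
claimed = 4.0 * np.sqrt(m * c_bar / (2.0 * d_low)) \
          * np.sqrt(2.0 * diam / (2.0 * diam + 1.0)) * (a_bar + d_bar)
print(f"coefficient of sqrt(T): {coeff:.4f}  vs  claimed bound {claimed:.4f}")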

B.9 Removing additional log⁑T\log T when γ=2\gamma=2

When Ξ³=2\gamma=2, the dual error bound condition reduces to quadratic growth, and it is possible to remove the log⁑T\log T factor in the regret result. Recall that log⁑T\log T terms appear when bounding 𝔼​[dist​(𝐲Te+1,𝒴⋆)]\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})] and 𝔼​[dist​(𝐲T+1,𝒴⋆)2]\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{2}]. For Ξ³=2\gamma=2, using a tailored analysis, LemmaΒ B.3 guarantees

𝔼​[dist​(𝐲Te+1,𝒴⋆)]≀𝔼​[dist​(𝐲Te+1,𝒴⋆)2]=π’ͺ​(1T).\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})]\leq\sqrt{\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})^{2}]}=\mathcal{O}(\tfrac{1}{\sqrt[]{T}}).

Moreover, using LemmaΒ B.4, we can directly bound the expectation

𝔼​[dist​(𝐲T+1,𝒴⋆)2]=\displaystyle\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{2}]={} 𝔼​[𝔼​[dist​(𝐲T+1,𝒴⋆)2|𝐲Te+1]]\displaystyle\mathbb{E}\big{[}\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T+1},\mathcal{Y}^{\star})^{2}|\mathbf{y}^{T_{e}+1}]\big{]}
≀\displaystyle\leq{} 𝔼​[dist​(𝐲Te+1,𝒴⋆)2μ​α​T+m​(aΒ―+dΒ―)2μ​α]\displaystyle\mathbb{E}\big{[}\tfrac{\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})^{2}}{\mu\alpha T}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu}\alpha\big{]}
=\displaystyle={} 𝔼​[dist​(𝐲Te+1,𝒴⋆)2]μ​α​T+m​(aΒ―+dΒ―)2μ​α\displaystyle\tfrac{\mathbb{E}[\mathrm{dist}(\mathbf{y}^{T_{e}+1},\mathcal{Y}^{\star})^{2}]}{\mu\alpha T}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu}\alpha
≀\displaystyle\leq{} Ξ”2μ​α​T+m​(aΒ―+dΒ―)2μ​α.\displaystyle\tfrac{\Delta^{2}}{\mu\alpha T}+\tfrac{m(\bar{a}+\bar{d})^{2}}{\mu}\alpha.

Therefore, the \log T terms can be removed from the analysis.

B.10 Learning with unknown parameters

In practice, \gamma and \mu may be unknown. When \mu is unknown, one can run a parameter-free variant of the first-order method (Algorithm 5), which is slightly more involved. As for \gamma, in the finite support setting the LP polyhedral error bound always guarantees \gamma=1. In the continuous support setting, it suffices to know an upper bound on \gamma: if A4 holds for some \gamma>0, then for any \theta>0 and \mathbf{y}\in\mathcal{Y},

\displaystyle f(\mathbf{y})-f(\mathbf{y}^{\star})\geq{} \mu\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})^{\gamma}
\displaystyle={} \mu\tfrac{\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})^{\gamma+\theta}}{\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})^{\theta}}
\displaystyle\geq{} \tfrac{\mu}{\operatorname{diam}(\mathcal{Y})^{\theta}}\mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})^{\gamma+\theta},

where the last inequality uses \mathrm{dist}(\mathbf{y},\mathcal{Y}^{\star})\leq\operatorname{diam}(\mathcal{Y}) for \mathbf{y}\in\mathcal{Y}. Hence A4 also holds for any \gamma^{\prime}=\gamma+\theta>\gamma, with constant \mu/\operatorname{diam}(\mathcal{Y})^{\theta}.