
Learnability in Online Kernel Selection with Memory Constraint via Data-dependent Regret Analysis

Junfan Li, Shizhong Liao*
College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
{junfli,szliao}@tju.edu.cn
* Corresponding Author
Abstract

Online kernel selection is a fundamental problem of online kernel methods. In this paper, we study online kernel selection with memory constraint in which the memory of kernel selection and online prediction procedures is limited to a fixed budget. An essential question is what is the intrinsic relationship among online learnability, memory constraint, and data complexity? To answer the question, it is necessary to show the trade-offs between regret and memory constraint. Previous work gives a worst-case lower bound depending on the data size, and shows learning is impossible within a small memory constraint. In contrast, we present distinct results by offering data-dependent upper bounds that rely on two data complexities: kernel alignment and the cumulative losses of competitive hypothesis. We propose an algorithmic framework giving data-dependent upper bounds for two types of loss functions. For the hinge loss function, our algorithm achieves an expected upper bound depending on kernel alignment. For smooth loss functions, our algorithm achieves a high-probability upper bound depending on the cumulative losses of competitive hypothesis. We also prove a matching lower bound for smooth loss functions. Our results show that if the two data complexities are sub-linear, then learning is possible within a small memory constraint. Our algorithmic framework depends on a new buffer maintaining framework and a reduction from online kernel selection to prediction with expert advice. Finally, we empirically verify the prediction performance of our algorithms on benchmark datasets.

1 Introduction

Online kernel selection (OKS) aims to dynamically select a kernel function, or equivalently a reproducing kernel Hilbert space (RKHS), for online kernel learning algorithms. Compared to offline kernel selection, OKS poses more computational challenges, as both the kernel selection procedure and the prediction must be executed in real time. We can formulate it as a sequential decision problem. Let \mathcal{K}=\{\kappa_{i}\}^{K}_{i=1} contain K base kernel functions and \mathcal{H}_{i} be a dense subset of the RKHS induced by \kappa_{i}. At round t, an adversary selects an instance {\bm{x}}_{t}\in\mathbb{R}^{d}. Then a learner chooses a hypothesis f_{t}\in\cup^{K}_{i=1}\mathcal{H}_{i} and makes a prediction. The adversary reveals the true output y_{t}, and the learner suffers a loss \ell(f_{t}({\bm{x}}_{t}),y_{t}). We measure the performance of the learner by the regret, defined as

โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„‹i,Regโ€‹(f)=โˆ‘t=1Tโ„“โ€‹(ftโ€‹(๐’™t),yt)โˆ’โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt).\forall\kappa_{i}\in\mathcal{K},\quad\forall f\in\mathcal{H}_{i},\quad\mathrm{Reg}(f)=\sum^{T}_{t=1}\ell(f_{t}({\bm{x}}_{t}),y_{t})-\sum^{T}_{t=1}\ell(f({\bm{x}}_{t}),y_{t}).

An efficient algorithm should guarantee Regโ€‹(f)=oโ€‹(T)\mathrm{Reg}(f)=o(T). For convex loss functions, many algorithms [Sahoo2014Online, Foster2017Parameter, Liao2021High] achieve (or imply)

Regโ€‹(f)=O~โ€‹((โ€–fโ€–โ„‹iฮฑ+1)โ€‹Tโ€‹lnโกK),ฮฑโˆˆ{1,2}.\mathrm{Reg}(f)=\tilde{O}\left(\left(\|f\|^{\alpha}_{\mathcal{H}_{i}}+1\right)\sqrt{T\ln{K}}\right),\alpha\in\{1,2\}.

The algorithms adapt to the optimal, yet unknown hypothesis space within a small information-theoretical cost.

A major challenge in OKS is the so-called curse of dimensionality, that is, algorithms must store the previous t-1 instances at the t-th round. The O(t) memory cost is prohibitive for large-scale online learning problems. To address this issue, we limit the memory of algorithms to a budget of \mathcal{R} quanta. For convex loss functions, the worst-case regret is \Theta\left(\max\left\{\sqrt{T},\frac{T}{\sqrt{\mathcal{R}}}\right\}\right) [Li2022Worst], which establishes a trade-off between memory constraint and regret. To be specific, achieving an O(T^{\alpha}) regret requires \mathcal{R}=\Omega\left(T^{2(1-\alpha)}\right), \alpha\in\left[\frac{1}{2},1\right). Neither a constant nor a \Theta(\ln{T}) memory cost can guarantee sub-linear regret, so learning is impossible in any \mathcal{H}_{i}. We use the regret to define the learnability of a hypothesis space (see Definition 2). However, empirical results have shown that a small memory cost is enough to achieve good prediction performance [Dekel2008The, Zhao2012Fast, Zhang2018Online]. It seems that learning is possible in the optimal hypothesis space. To close the gap between learnability and regret, it is necessary to establish data-dependent regret bounds. To this end, we focus on the following two questions:

  1. Q1

    What are the new trade-offs between memory constraints and regret, that is, how do certain data-dependent regret bounds depend on โ„›\mathcal{R}, TT, and KK?

  2. Q2

    Sub-linear regret implies that some hypothesis spaces are learnable. Is it possible to achieve sub-linear regret within an O(\ln{T}) memory cost?

In this paper, we answer the two questions affirmatively. We first propose an algorithmic framework, and then apply it to two types of loss functions, obtaining two kinds of data-dependent regret bounds. To satisfy the memory constraint, we reduce it to a budget on the size of a buffer; our algorithmic framework uses the buffer to store a subset of the observed examples. The main results are summarized as follows.

  1. 1)

    For the hinge loss function, our algorithm enjoys an expected kernel alignment regret bound as follows (see Theorem 1 for detailed result),

    โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„i,๐”ผโ€‹[Regโ€‹(f)]=O~โ€‹(LTโ€‹(f)โ€‹lnโกK+โ„›K+Kโ„›โ€‹๐’œT,ฮบi),\forall\kappa_{i}\in\mathcal{K},\forall f\in\mathbb{H}_{i},\quad\mathbb{E}[\mathrm{Reg}(f)]=\tilde{O}\left(\sqrt{L_{T}(f)\ln{K}}+\frac{\sqrt{\mathcal{R}}}{\sqrt{K}}+\frac{\sqrt{K}}{\sqrt{\mathcal{R}}}\mathcal{A}_{T,\kappa_{i}}\right), (1)

    where ๐’œT,ฮบi\mathcal{A}_{T,\kappa_{i}} is called kernel alignment, โ„iโІโ„‹i\mathbb{H}_{i}\subseteq\mathcal{H}_{i} and

    LTโ€‹(f)=โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt).L_{T}(f)=\sum^{T}_{t=1}\ell(f({\bm{x}}_{t}),y_{t}). (2)
  2. 2)

    For smooth loss functions (see Assumption 3), our algorithm achieves a high-probability small-loss bound (see Theorem 3 for detailed result). โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„i\forall\kappa_{i}\in\mathcal{K},\forall f\in\mathbb{H}_{i}, with probability at least 1โˆ’ฮด1-\delta,

    Regโ€‹(f)=O~โ€‹(LTโ€‹(f)โ„›+LTโ€‹(f)โ€‹lnโกK+โ„›).\begin{split}\mathrm{Reg}(f)=\tilde{O}\left(\frac{L_{T}(f)}{\sqrt{\mathcal{R}}}+\sqrt{L_{T}(f)\ln{K}}+\sqrt{\mathcal{R}}\right).\end{split} (3)
  3. 3)

    For smooth loss functions, we prove a lower bound on the regret (see Theorem 4 for detailed result). Let K=1K=1. The regret of any algorithm that stores BB examples satisfies

    โˆƒfโˆˆโ„,๐”ผโ€‹[Regโ€‹(f)]=ฮฉโ€‹(Uโ‹…LTโ€‹(f)B),\exists f\in\mathbb{H},\quad\mathbb{E}\left[\mathrm{Reg}(f)\right]=\Omega\left(U\cdot\frac{L_{T}(f)}{\sqrt{B}}\right),

    in which U>0U>0 is a constant and bounds the norm of all hypotheses in โ„\mathbb{H}. The upper bound in (3) is optimal in terms of the dependence on LTโ€‹(f)L_{T}(f).

The data-dependent bounds in (1) and (3) improve on the bound O\left(\sqrt{T\ln{K}}+\|f\|^{2}_{\mathcal{H}_{i}}\max\{\sqrt{T},\frac{T}{\sqrt{\mathcal{R}}}\}\right) [Li2022Worst], given that \mathcal{A}_{T,\kappa_{i}}=O(T) and L_{T}(f)=O(T). If \kappa_{i} matches well with the data, then we expect \mathcal{A}_{T,\kappa_{i}}\ll T or L_{T}(f)\ll T. In the worst case, i.e., \mathcal{A}_{T,\kappa_{i}}=\Theta(T) or L_{T}(f)=\Theta(T), our bounds are the same as the previous result. We thus give new trade-offs between memory constraint and regret, which answers Q1. For online kernel selection, we only aim to adapt to the optimal kernel \kappa^{\ast}\in\mathcal{K}. If \kappa^{\ast} matches well with the data, then we expect \mathcal{A}_{T,\kappa^{\ast}}=o(T) and \min_{f\in\mathcal{H}_{\kappa^{\ast}}}L_{T}(f)=o(T). In this case, a \Theta(\ln{T}) memory cost is enough to achieve sub-linear regret, which answers Q2. Thus learning is possible in \mathbb{H}_{\kappa^{\ast}} within a small memory constraint.

Our algorithmic framework reduces OKS to prediction with expert advice and uses the well-known optimistic mirror descent framework [Chiang2012Online, Rakhlin2013Online]. We also propose a new buffer maintaining framework.

2 Related Work

Previous work has adopted buffer maintaining techniques and random features to develop algorithms for online kernel selection with memory constraint [Sahoo2014Online, Liao2021High, Zhang2018Online, Shen2019Random]. However, none of the existing results answers Q1 and Q2 well. The sketch-based online kernel selection algorithm [Zhang2018Online] enjoys an O(B\ln{T}) regularized regret (the loss function incorporates a regularizer) within a buffer of size B. The regret bound becomes O(\sqrt{T}\ln{T}) in the case of B=\Theta(\sqrt{T}). The Raker algorithm [Shen2019Random] achieves an O\left(\sqrt{T\ln{K}}+\frac{T}{\sqrt{D}}\right) regret (the original bound is O(U\sqrt{T\ln{K}}+\epsilon TU) and holds with probability 1-\Theta\left(\epsilon^{-2}\exp\left(-\frac{D}{4d+8}\epsilon^{2}\right)\right)) and suffers a space complexity of O((d+K)D), where D is the number of random features. This upper bound also shows a trade-off in the worst case. If D is a constant or D=\Theta(\ln{T}), then Raker cannot achieve a sub-linear regret bound, providing a negative answer to Q2.

Our work is also related to data-dependent regret bounds for online learning and especially online kernel learning. For online learning, various data-dependent regret bounds have been established, including small-loss bounds [Cesa-Bianchi2006Prediction, Lykouris2018Small, Lee2020Bias], variance bounds [Hazan2009Better, Hazan2010Extracting] and path-length bounds [Chiang2012Online, Steinhardt2014Adaptivity, Wei2018More]. The adaptive online learning framework [Foster2015Adaptive] can achieve data-dependent and model-dependent regret bounds, but does not induce a computationally efficient algorithm. For online kernel learning, it is harder to achieve data-dependent regret bounds, as the computational cost must also be balanced. For loss functions satisfying a specific smoothness condition (see Assumption 3), the OSKL algorithm [Zhang2013Online] achieves a small-loss bound. For loss functions with a curvature property, the PROS-N-KONS algorithm [Calandriello2017Efficient] and the PKAWV algorithm [Jezequel2019Efficient] achieve regret bounds depending on the effective dimension of the kernel matrix. None of these algorithms takes memory constraints into account, and thus they do not provide a trade-off between memory constraint and regret.

3 Problem Setup

Let โ„T:={(๐’™t,yt)}tโˆˆ[T]\mathcal{I}_{T}:=\{({\bm{x}}_{t},y_{t})\}_{t\in[T]} be a sequence of examples, where ๐’™tโˆˆ๐’ณโІโ„d,ytโˆˆ[โˆ’1,1]{\bm{x}}_{t}\in\mathcal{X}\subseteq\mathbb{R}^{d},y_{t}\in[-1,1] and [T]:={1,โ€ฆ,T}[T]:=\{1,\ldots,T\}. Let ฮบโ€‹(โ‹…,โ‹…):โ„dร—โ„dโ†’โ„\kappa(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R} be a positive semidefinite kernel function and โ„‹ฮบ\mathcal{H}_{\kappa} be a dense subset of the associated RKHS, such that, for any fโˆˆโ„‹ฮบf\in\mathcal{H}_{\kappa},

  1. (i)

    โŸจf,ฮบโ€‹(๐’™,โ‹…)โŸฉโ„‹ฮบ=fโ€‹(๐’™),โˆ€๐’™โˆˆ๐’ณ\langle f,\kappa({\bm{x}},\cdot)\rangle_{\mathcal{H}_{\kappa}}=f({\bm{x}}),\forall{\bm{x}}\in\mathcal{X},

  2. (ii)

    โ„‹ฮบ=spanโ€‹(ฮบโ€‹(๐’™t,โ‹…)|tโˆˆ[T])ยฏ\mathcal{H}_{\kappa}=\overline{\mathrm{span}(\kappa({\bm{x}}_{t},\cdot)|t\in[T])}.

We define โŸจโ‹…,โ‹…โŸฉโ„‹ฮบ\langle\cdot,\cdot\rangle_{\mathcal{H}_{\kappa}} as the inner product in โ„‹ฮบ\mathcal{H}_{\kappa}, which induces the norm โ€–fโ€–โ„‹ฮบ=โŸจf,fโŸฉโ„‹ฮบ\|f\|_{\mathcal{H}_{\kappa}}=\sqrt{\langle f,f\rangle_{\mathcal{H}_{\kappa}}}. For simplicity, we will omit the subscript โ„‹ฮบ\mathcal{H}_{\kappa} in the inner product. Let โ„“โ€‹(โ‹…,โ‹…):โ„ร—[โˆ’1,1]โ†’โ„\ell(\cdot,\cdot):\mathbb{R}\times[-1,1]\rightarrow\mathbb{R} be the loss function. Denote by

โ„ฌฯˆโ€‹(f,g)=ฯˆโ€‹(f)โˆ’ฯˆโ€‹(g)โˆ’โŸจโˆ‡ฯˆโ€‹(g),fโˆ’gโŸฉ,โˆ€f,gโˆˆโ„‹ฮบ,\mathcal{B}_{\psi}(f,g)=\psi(f)-\psi(g)-\langle\nabla\psi(g),f-g\rangle,~{}\forall f,g\in\mathcal{H}_{\kappa},

the Bregman divergence associated with a strongly convex regularizer ฯˆโ€‹(โ‹…):โ„‹ฮบโ†’โ„\psi(\cdot):\mathcal{H}_{\kappa}\rightarrow\mathbb{R}.

Let \mathcal{K}=\{\kappa_{i}\}^{K}_{i=1} contain K base kernels. If an oracle gave the optimal kernel \kappa^{\ast}\in\mathcal{K} for \mathcal{I}_{T}, then we could simply run an online kernel learning algorithm in \mathcal{H}_{\kappa^{\ast}}. Since \kappa^{\ast} is unknown, the learner aims to design a kernel selection algorithm that generates a sequence of hypotheses \{f_{t}\}^{T}_{t=1} competitive with any f\in\mathcal{H}_{\kappa^{\ast}}, that is, the learner hopes to guarantee \mathrm{Reg}(f)=o(T). Any f\in\mathcal{H}_{i} can be written as f=\sum^{T}_{t=1}a_{t}\kappa_{i}({\bm{x}}_{t},\cdot). Thus storing f_{t} requires storing \{({\bm{x}}_{\tau},y_{\tau})\}^{t-1}_{\tau=1}, incurring an O(t) memory cost. In this paper, we consider online kernel selection with memory constraint. Next we define the memory constraint and reduce it to the size of a buffer.

Definition 1 (Memory Budget [Li2022Worst]).

A memory budget of โ„›\mathcal{R} quanta is the maximal memory that any online kernel selection algorithm can use.

Assumption 1 ([Li2022Worst]).

\forall\kappa_{i}\in\mathcal{K}, there is a constant \alpha>0 such that any budgeted online kernel learning algorithm running in \mathcal{H}_{i} can maintain a buffer of size B\leq\alpha\mathcal{R} within a memory constraint of \mathcal{R} quanta. If the space complexity of the algorithm is linear in B, then equality holds.

Assumption 1 implies that there is no need to assume \mathcal{R}=\infty unless T=\infty, since the maximal useful value of \mathcal{R} is the one for which B=T. Budgeted online kernel learning algorithms are algorithms that operate on a subset of the observed examples, e.g., Forgetron [Dekel2008The] and BSGD [Wang2012Breakingecond], to name but a few. In Assumption 1, \alpha is independent of the kernel function, since the memory is used to store the examples and the coefficients.

Assumption 2.

There is a constant k1>0k_{1}>0, such that โˆ€ฮบiโˆˆ๐’ฆ\forall\kappa_{i}\in\mathcal{K} and โˆ€๐ฎโˆˆ๐’ณ\forall{\bm{u}}\in\mathcal{X}, ฮบiโ€‹(๐ฎ,๐ฎ)โˆˆ[k1,1]\kappa_{i}({\bm{u}},{\bm{u}})\in[k_{1},1].

We will define two types of data complexities and use them to bound the regret. The first one, called kernel alignment, is defined as follows:

๐’œT,ฮบi=โˆ‘t=1Tฮบiโ€‹(๐’™t,๐’™t)โˆ’1Tโ€‹๐’€Tโ€‹๐‘ฒiโ€‹๐’€,\mathcal{A}_{T,\kappa_{i}}=\sum^{T}_{t=1}\kappa_{i}({\bm{x}}_{t},{\bm{x}}_{t})-\frac{1}{T}{\bm{Y}}^{\mathrm{T}}{\bm{K}}_{i}{\bm{Y}},

where {\bm{Y}}=(y_{1},\ldots,y_{T})^{\mathrm{T}} and {\bm{K}}_{i} is the kernel matrix induced by \kappa_{i}. \mathcal{A}_{T,\kappa_{i}} quantifies how well {\bm{K}}_{i} matches the examples. If {\bm{K}}_{i}={\bm{Y}}{\bm{Y}}^{\mathrm{T}} is the ideal kernel matrix, then \mathcal{A}_{T,\kappa_{i}}=\Theta(1). The second one, called the small-loss, is defined by \min_{f\in\mathbb{H}_{i}}L_{T}(f), where L_{T}(f) follows (2) and

โ„i={fโˆˆโ„‹i:โ€–fโ€–โ„‹iโ‰คU}.\mathbb{H}_{i}=\left\{f\in\mathcal{H}_{i}:\|f\|_{\mathcal{H}_{i}}\leq U\right\}.

Both \mathcal{A}_{T,\kappa_{i}} and L_{T}(f) are independent of the algorithm.
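To make the definition concrete, the following is a minimal sketch (in Python, with illustrative names; not part of the paper's implementation) of how \mathcal{A}_{T,\kappa} could be computed from a precomputed kernel matrix.

```python
import numpy as np

def kernel_alignment(K_mat: np.ndarray, y: np.ndarray) -> float:
    """A_{T,kappa} = sum_t kappa(x_t, x_t) - (1/T) * Y^T K Y.

    K_mat : (T, T) kernel matrix induced by kappa on x_1, ..., x_T.
    y     : (T,) vector of outputs y_t in [-1, 1].
    """
    T = len(y)
    # trace(K_mat) equals sum_t kappa(x_t, x_t).
    return float(np.trace(K_mat) - (y @ K_mat @ y) / T)
```

For the ideal kernel matrix {\bm{K}}={\bm{Y}}{\bm{Y}}^{\mathrm{T}} with y_{t}\in\{-1,1\}, the trace is T and {\bm{Y}}^{\mathrm{T}}{\bm{K}}{\bm{Y}}=T^{2}, so the alignment is T-T^{2}/T=0, in line with the \Theta(1) claim above.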

Finally, we define online learnability, which is a variant of the learnability defined in [Rakhlin2010Online].

Definition 2 (Online Learnability).

Given โ„T\mathcal{I}_{T}, a hypothesis space โ„‹\mathcal{H} is said to be online learnable if limTโ†’โˆžRegโ€‹(f)T=0\lim_{T\rightarrow\infty}\frac{\mathrm{Reg}(f)}{T}=0, โˆ€fโˆˆโ„‹\forall f\in\mathcal{H}.

The definition is equivalent to \mathrm{Reg}(f)=o(T). We can also replace \mathrm{Reg}(f) with \mathbb{E}[\mathrm{Reg}(f)]. The learnability defined in [Rakhlin2010Online] holds for the entire sample space, i.e., the worst-case examples, while our definition only considers a fixed \mathcal{I}_{T}. In the worst case, the two notions of learnability are equivalent. Given that \mathcal{I}_{T} may not always be generated in the worst case, our notion of learnability can adapt to the hardness of \mathcal{I}_{T}.

4 Algorithmic Framework

We reduce OKS to prediction with expert advice (PEA) where ฮบi\kappa_{i} corresponds to the ii-th expert. The main idea of our algorithmic framework is summarized as follows: (i) generating {ft,iโˆˆโ„‹i}t=1T\{f_{t,i}\in\mathcal{H}_{i}\}^{T}_{t=1} for all iโˆˆ[K]i\in[K]; (ii) aggregating the predictions {ft,iโ€‹(๐’™t)}i=1K\{f_{t,i}({\bm{x}}_{t})\}^{K}_{i=1}. The challenge is how to control the size of {ft,i}i=1K\{f_{t,i}\}^{K}_{i=1}. To this end, we propose a new buffer maintaining framework containing an adaptive sampling and a removing process.

4.1 Adaptive Sampling

For all i\in[K], we use the optimistic mirror descent (OMD) framework [Chiang2012Online, Rakhlin2013Online] to generate \{f_{t,i}\}^{T}_{t=1}. OMD maintains two sequences of hypotheses, i.e., \{f_{t,i}\}^{T}_{t=1} and \{f^{\prime}_{t,i}\}^{T}_{t=1}, defined as follows,

ft,i=\displaystyle f_{t,i}= argโกminfโˆˆโ„‹i{โŸจf,โˆ‡^t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)},\displaystyle\mathop{\arg\min}_{f\in\mathcal{H}_{i}}\left\{\left\langle f,\hat{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,f^{\prime}_{t-1,i}\right)\right\}, (4)
ft,iโ€ฒ=\displaystyle f^{\prime}_{t,i}= argโกminfโˆˆโ„i{โŸจf,โˆ‡t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)},\displaystyle\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\nabla_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,f^{\prime}_{t-1,i}\right)\right\}, (5)

where โˆ‡t,i\nabla_{t,i} is the (sub)-gradient of โ„“โ€‹(ft,iโ€‹(๐’™t),yt)\ell(f_{t,i}({\bm{x}}_{t}),y_{t}), โˆ‡^t,i\hat{\nabla}_{t,i} is an optimistic estimator of โˆ‡t,i\nabla_{t,i} and

ฯˆiโ€‹(f)=\displaystyle\psi_{i}(f)= 12โ€‹ฮปiโ€‹โ€–fโ€–โ„‹i2,\displaystyle\frac{1}{2\lambda_{i}}\|f\|^{2}_{\mathcal{H}_{i}},
โˆ‡t,i=\displaystyle\nabla_{t,i}= โ„“โ€ฒโ€‹(ft,iโ€‹(๐’™t),yt)โ‹…ฮบiโ€‹(๐’™t,โ‹…).\displaystyle\ell^{\prime}(f_{t,i}({\bm{x}}_{t}),y_{t})\cdot\kappa_{i}({\bm{x}}_{t},\cdot).

Note that the hypothesis space in (4) and (5) is โ„‹i\mathcal{H}_{i} and โ„i\mathbb{H}_{i}, respectively. This trick [Li2023Improved] is critical to obtaining regret bounds depending on kernel alignment. At the beginning of round tt, we execute (4) and compute ft,iโ€‹(๐’™t)f_{t,i}({\bm{x}}_{t}). The key is to control the size of ft,iโ€ฒf^{\prime}_{t,i}. By Assumption 1, we construct a buffer denoted by SiS_{i} that stores the examples constructing ft,iโ€ฒf^{\prime}_{t,i}. Thus we just need to limit the size of SiS_{i}. To this end, we propose an adaptive sampling scheme. Let bt,ib_{t,i} be a Bernoulli random variable satisfying

โ„™โ€‹[bt,i=1]=โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹iฮฝZt,ฮฝโˆˆ{1,2},\mathbb{P}[b_{t,i}=1]=\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{\nu}_{\mathcal{H}_{i}}}{Z_{t}},\quad\nu\in\{1,2\}, (6)

where ZtZ_{t} is a normalizing constant. We further define

conโ€‹(aโ€‹(i)):=\displaystyle\mathrm{con}(a(i)):= โ€–ฮบiโ€‹(๐’™iโ€‹(st),โ‹…)โˆ’ฮบiโ€‹(๐’™t,โ‹…)โ€–โ„‹iโ‰คฮณt,i,\displaystyle\left\|\kappa_{i}({\bm{x}}_{i(s_{t})},\cdot)-\kappa_{i}({\bm{x}}_{t},\cdot)\right\|_{\mathcal{H}_{i}}\leq\gamma_{t,i},
๐’™iโ€‹(st)=\displaystyle{\bm{x}}_{i(s_{t})}= argโกmin๐’™ฯ„โˆˆSiโ€–ฮบiโ€‹(๐’™ฯ„,โ‹…)โˆ’ฮบiโ€‹(๐’™t,โ‹…)โ€–โ„‹i.\displaystyle\mathop{\arg\min}_{{\bm{x}}_{\tau}\in S_{i}}\left\|\kappa_{i}({\bm{x}}_{\tau},\cdot)-\kappa_{i}({\bm{x}}_{t},\cdot)\right\|_{\mathcal{H}_{i}}.

\mathrm{con}(a(i)) means that if there is an instance in S_{i} that is similar to {\bm{x}}_{t}, then we can use it as a proxy for {\bm{x}}_{t}. In this way, ({\bm{x}}_{t},y_{t}) will not be added into S_{i}.

If \nabla_{t,i}=0, then S_{i} remains unchanged and we execute (5). If \nabla_{t,i}\neq 0 and \mathrm{con}(a(i)) holds, then S_{i} remains unchanged and we replace (5) with (7).

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡iโ€‹(st),iโŸฉ+โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\nabla_{i(s_{t}),i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,f^{\prime}_{t-1,i}\right)\right\}. (7)

If \nabla_{t,i}\neq 0 and \neg\mathrm{con}(a(i)), then we replace (5) with (8).

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,f^{\prime}_{t-1,i}\right)\right\}. (8)

In this case, if bt,i=1b_{t,i}=1, then we add (๐’™t,yt)({\bm{x}}_{t},y_{t}) into SiS_{i}. In the above two equations, we define

โˆ‡iโ€‹(st),i=\displaystyle\nabla_{i(s_{t}),i}= โ„“โ€ฒโ€‹(ft,iโ€‹(๐’™t),yt)โ‹…ฮบiโ€‹(๐’™iโ€‹(st),โ‹…),\displaystyle\ell^{\prime}(f_{t,i}({\bm{x}}_{t}),y_{t})\cdot\kappa_{i}({\bm{x}}_{i(s_{t})},\cdot),
โˆ‡~t,i=\displaystyle\tilde{\nabla}_{t,i}= โˆ‡t,iโˆ’โˆ‡^t,iโ„™โ€‹[bt,i=1]โ‹…๐•€bt,i=1+โˆ‡^t,i.\displaystyle\frac{\nabla_{t,i}-\hat{\nabla}_{t,i}}{\mathbb{P}[b_{t,i}=1]}\cdot\mathbb{I}_{b_{t,i}=1}+\hat{\nabla}_{t,i}.
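As a concrete illustration of the sampling scheme (6) and the estimator \tilde{\nabla}_{t,i}, here is a minimal sketch; the normalizer Z_{t} is passed in because its concrete form depends on the loss function (see Section 5), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_sampling_step(norm_diff: float, Z: float, nu: int):
    """One draw of the Bernoulli variable b_{t,i} in (6).

    norm_diff : ||nabla_{t,i} - hat_nabla_{t,i}||_{H_i}
    Z         : normalizing constant Z_t
    nu        : 1 or 2
    Returns (b, scale) where `scale` multiplies (nabla - hat_nabla) in
    tilde_nabla_{t,i} = scale * (nabla - hat_nabla) + hat_nabla.
    """
    p = (norm_diff ** nu) / Z if Z > 0 else 0.0   # P[b_{t,i} = 1]
    b = bool(rng.random() < p)
    scale = 1.0 / p if (b and p > 0) else 0.0     # importance weight 1[b=1]/p
    return b, scale
```

Since \mathbb{E}[\mathbb{I}_{b_{t,i}=1}/\mathbb{P}[b_{t,i}=1]]=1, the resulting \tilde{\nabla}_{t,i} is an unbiased estimator of \nabla_{t,i}, and ({\bm{x}}_{t},y_{t}) is added to S_{i} only when b_{t,i}=1.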

4.2 Removing Half of the Examples

We allocate a memory budget \mathcal{R}_{i} for f^{\prime}_{t,i} and require \sum^{K}_{i=1}\mathcal{R}_{i}\leq\mathcal{R}. By Assumption 1, we just need to ensure |S_{i}|\leq\alpha\mathcal{R}_{i}. At any round t, if |S_{i}|=\alpha\mathcal{R}_{i} and b_{t,i}=1, then we remove half of the examples from S_{i} [Li2023Improved]. Let S_{i}=\{({\bm{x}}_{r_{j}},y_{r_{j}})\}^{\alpha\mathcal{R}_{i}}_{j=1}. We rewrite f^{\prime}_{t-1,i} as follows,

ftโˆ’1,iโ€ฒ=\displaystyle f^{\prime}_{t-1,i}= ftโˆ’1,iโ€ฒโ€‹(1)+ftโˆ’1,iโ€ฒโ€‹(2),\displaystyle f^{\prime}_{t-1,i}(1)+f^{\prime}_{t-1,i}(2),
ftโˆ’1,iโ€ฒโ€‹(1)=\displaystyle f^{\prime}_{t-1,i}(1)= โˆ‘j=1ฮฑโ€‹โ„›i/2ฮฒrjโ€‹ฮบiโ€‹(๐’™rj,โ‹…),\displaystyle\sum^{\alpha\mathcal{R}_{i}/2}_{j=1}\beta_{r_{j}}\kappa_{i}({\bm{x}}_{r_{j}},\cdot),
ftโˆ’1,iโ€ฒโ€‹(2)=\displaystyle f^{\prime}_{t-1,i}(2)= โˆ‘j=ฮฑโ€‹โ„›i/2+1ฮฑโ€‹โ„›iฮฒrjโ€‹ฮบiโ€‹(๐’™rj,โ‹…).\displaystyle\sum^{\alpha\mathcal{R}_{i}}_{j=\alpha\mathcal{R}_{i}/2+1}\beta_{r_{j}}\kappa_{i}({\bm{x}}_{r_{j}},\cdot).

Let otโˆˆ{1,2}o_{t}\in\{1,2\} and qtโˆˆ{1,2}โˆ–{ot}q_{t}\in\{1,2\}\setminus\{o_{t}\}. We will remove ftโˆ’1,iโ€ฒโ€‹(ot)f^{\prime}_{t-1,i}(o_{t}) from ftโˆ’1,iโ€ฒf^{\prime}_{t-1,i}, that is, the examples constructing ftโˆ’1,iโ€ฒโ€‹(ot)f^{\prime}_{t-1,i}(o_{t}) are removed from SiS_{i}. We further project ftโˆ’1,iโ€ฒโ€‹(qt)f^{\prime}_{t-1,i}(q_{t}) onto โ„i\mathbb{H}_{i} by

fยฏtโˆ’1,iโ€ฒโ€‹(qt)=argโกminfโˆˆโ„iโ€–fโˆ’ftโˆ’1,iโ€ฒโ€‹(qt)โ€–โ„‹2.\bar{f}^{\prime}_{t-1,i}(q_{t})=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\|f-f^{\prime}_{t-1,i}(q_{t})\right\|^{2}_{\mathcal{H}}.

Then we execute the second mirror descent and redefine (8) as follows

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(qt))}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,\bar{f}^{\prime}_{t-1,i}(q_{t})\right)\right\}. (9)

Our buffer maintaining framework always ensures |S_{i}|\leq\alpha\mathcal{R}_{i}. A natural alternative is the restart technique, which removes all of the examples in S_{i}, i.e., resets S_{i}=\emptyset. We will prove that removing half of the examples achieves better regret bounds.
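For concreteness, below is a minimal sketch of the halving step, written for the choice o_{t}=2 used in Section 5.1 (keep the first half of the buffer and drop the second); the coefficient representation f^{\prime}=\sum_{j}\beta_{j}\kappa({\bm{x}}_{r_{j}},\cdot) and the function names are illustrative. The projection onto \mathbb{H}_{i} reduces to rescaling the coefficients.

```python
import numpy as np

def halve_and_project(coeffs, examples, kernel, U):
    """Drop half of the buffered examples and project the kept expansion onto
    the ball {f : ||f||_{H_i} <= U}.

    coeffs   : list of beta_j in f' = sum_j beta_j * kappa(x_{r_j}, .)
    examples : buffered examples x_{r_j}, in the same order as coeffs
    kernel   : function kappa(x, x')
    U        : radius of the hypothesis ball H_i
    """
    half = len(coeffs) // 2
    kept_c, kept_x = list(coeffs[:half]), list(examples[:half])

    # ||f'(1)||^2 = sum_{j,l} beta_j beta_l kappa(x_j, x_l).
    G = np.array([[kernel(a, b) for b in kept_x] for a in kept_x])
    c = np.array(kept_c)
    norm = float(np.sqrt(max(c @ G @ c, 0.0)))

    if norm > U:                       # projection = rescaling of coefficients
        kept_c = [beta * U / norm for beta in kept_c]
    return kept_c, kept_x
```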

4.3 Reduction to PEA

Similar to previous online multi-kernel learning algorithms [Sahoo2014Online, Shen2019Random, Jin2010Online], we use the PEA framework [Cesa-Bianchi2006Prediction] to aggregate the K predictions. Let \Delta_{K-1} be the (K-1)-dimensional simplex. At each round t, we maintain a probability distribution {\bm{p}}_{t}\in\Delta_{K-1} over \{f_{t,i}\}^{K}_{i=1}, and output the prediction f_{t}({\bm{x}}_{t})=\sum^{K}_{i=1}p_{t,i}f_{t,i}({\bm{x}}_{t}) or \hat{y}_{t}=\mathrm{sign}(f_{t}({\bm{x}}_{t})). Following the multiplicative-weight update in Subsection 2.1 of [Cesa-Bianchi2006Prediction], {\bm{p}}_{t+1} is defined as follows

\begin{split}p_{t+1,i}&=\frac{w_{t+1,i}}{\sum^{K}_{j=1}w_{t+1,j}},\\ w_{t+1,i}&=\exp\left(-\eta_{t+1}\sum^{t}_{\tau=1}c_{\tau,i}\right),\\ \eta_{t+1}&=\frac{\sqrt{2\ln{K}}}{\sqrt{1+\sum^{t}_{\tau=1}\sum^{K}_{i=1}p_{\tau,i}c^{2}_{\tau,i}}},\quad\eta_{1}=\sqrt{2\ln{K}},\end{split} (10)

where c_{t,i}=g(f_{t,i}({\bm{x}}_{t}),y_{t}) and g(\cdot,\cdot):\mathbb{R}^{2}\rightarrow\mathbb{R}^{+}\cup\{0\} will be defined later.
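A minimal sketch of the update (10) is given below; `cum_losses` and `cum_var` carry the running sums \sum_{\tau\leq t}c_{\tau,i} and 1+\sum_{\tau\leq t}\sum_{i}p_{\tau,i}c^{2}_{\tau,i} (initialize them to zeros and 1.0), and subtracting the minimum cumulative loss before exponentiating is only a numerical-stability trick that leaves the normalized distribution unchanged.

```python
import numpy as np

def pea_update(cum_losses, cum_var, costs, p_t):
    """One step of the multiplicative-weight update in (10).

    cum_losses : (K,) array, sum_{tau < t} c_{tau,i}
    cum_var    : scalar, 1 + sum_{tau < t} sum_i p_{tau,i} c_{tau,i}^2
    costs      : (K,) array of c_{t,i} = g(f_{t,i}(x_t), y_t) >= 0
    p_t        : (K,) array, the distribution used at round t
    Returns the updated (cum_losses, cum_var, p_{t+1}).
    """
    K = len(costs)
    cum_losses = cum_losses + costs
    cum_var = cum_var + float(p_t @ (costs ** 2))
    eta = np.sqrt(2.0 * np.log(K)) / np.sqrt(cum_var)
    w = np.exp(-eta * (cum_losses - cum_losses.min()))
    return cum_losses, cum_var, w / w.sum()
```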

5 Applications

In this section, we apply the proposed algorithmic framework to two types of loss functions and derive two kinds of data-dependent regret bounds.

5.1 The Hinge Loss Function

We consider online binary classification tasks and let the loss function be the hinge loss. Thus \nabla_{t,i}=-y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)\cdot\mathbb{I}_{y_{t}f_{t,i}({\bm{x}}_{t})<1}. We use the Reservoir Sampling (RS) technique [Hazan2009Better, Vitter1985Random] to construct \hat{\nabla}_{t,i}. To this end, we create a new buffer V_{t} of size M\geq 1. We initialize \hat{\nabla}_{1,i}=0, and for t\geq 2, we define

โˆ‡^t,i=โˆ’1|Vt|โ€‹โˆ‘(๐’™,y)โˆˆVtyโ‹…ฮบiโ€‹(๐’™,โ‹…).\hat{\nabla}_{t,i}=-\frac{1}{|V_{t}|}\sum_{({\bm{x}},y)\in V_{t}}y\cdot\kappa_{i}({\bm{x}},\cdot).

RS works as follows. At the end of round t\geq 1, we add ({\bm{x}}_{t},y_{t}) into V_{t} with probability \min\left\{1,\frac{M}{t}\right\}. If |V_{t}|=M and the current example is to be added, then an old example is removed from V_{t} uniformly at random. We create another buffer S_{0} that stores all of the examples ever added into V_{t}. It is easy to prove that \mathbb{E}[|S_{0}|]\leq M(1+\ln{T}). Let \mathcal{R}_{0}=\frac{M(1+\ln{T})}{\alpha} and \mathcal{R}_{i}=\frac{\mathcal{R}-\mathcal{R}_{0}}{K}, where we require \mathcal{R}\geq 2\mathcal{R}_{0}. It is easy to verify that \sum^{K}_{i=0}\mathcal{R}_{i}=\mathcal{R}.
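A minimal sketch of the RS update described above (illustrative names; `V` and `S0` are Python lists):

```python
import random

def reservoir_update(V, S0, example, t, M):
    """Reservoir Sampling at the end of round t (1-indexed) with capacity M.

    V  : current reservoir V_t
    S0 : buffer recording every example ever added to the reservoir
    """
    if random.random() < min(1.0, M / t):    # add (x_t, y_t) with prob. min{1, M/t}
        if len(V) == M:
            V.pop(random.randrange(M))       # evict a uniformly random old example
        V.append(example)
        S0.append(example)                   # E[|S0|] <= M (1 + ln T)
    return V, S0
```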

We instantiate the sampling scheme (6) as follows,

ฮฝ=2,Zt=โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2+โ€–โˆ‡^t,iโ€–โ„‹i2.\nu=2,\quad Z_{t}=\left\|\nabla_{t,i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}+\left\|\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}.

The additional term \left\|\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}} in Z_{t} is important for controlling the number of removal operations, but it also increases the regret. The trick of using different hypothesis spaces in (4) and (5) yields a negative term in the regret bound, which cancels out this increase in the regret. Let o_{t}=2, i.e., we remove f^{\prime}_{t-1,i}(2) from f^{\prime}_{t-1,i}. Then (9) becomes

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(1))}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,\bar{f}^{\prime}_{t-1,i}(1)\right)\right\}. (11)

At the tt-th round, our algorithm outputs the prediction y^t=signโ€‹(ftโ€‹(๐’™t))\hat{y}_{t}=\mathrm{sign}(f_{t}({\bm{x}}_{t})). Let ๐’‘1{\bm{p}}_{1} be the uniform distribution and ๐’‘t+1{\bm{p}}_{t+1} follow (10) where ct,i=โ„“โ€‹(ft,iโ€‹(๐’™t),yt)c_{t,i}=\ell(f_{t,i}({\bm{x}}_{t}),y_{t}). We name this algorithm M-OMD-H (Memory bounded OMD for the Hinge loss) and present the pseudo-code in Algorithm 1.

Algorithm 1 M-OMD-H
0:ย ย {ฮปi}i=1K\{\lambda_{i}\}^{K}_{i=1}, {ฮทt}t=1T\{\eta_{t}\}^{T}_{t=1}, โ„›\mathcal{R}, UU.
0:ย ย f0,iโ€ฒ=โˆ‡^1,i=0f^{\prime}_{0,i}=\hat{\nabla}_{1,i}=0, Si=V=โˆ…S_{i}=V=\emptyset.
1:ย ย forย t=1,2,โ€ฆ,Tt=1,2,\ldots,Tย do
2:ย ย ย ย ย Receive ๐’™t{\bm{x}}_{t}โ€„
3:ย ย ย ย ย forย i=1,โ€ฆ,Ki=1,\ldots,Kย do
4:ย ย ย ย ย ย ย ย Find ๐’™iโ€‹(st){\bm{x}}_{i(s_{t})}
5:ย ย ย ย ย ย ย ย Compute โˆ‡^t,i=โˆ’1|V|โ€‹โˆ‘(๐’™,y)โˆˆVyโ€‹ฮบiโ€‹(๐’™,โ‹…)\hat{\nabla}_{t,i}=\frac{-1}{|V|}\sum_{({\bm{x}},y)\in V}y\kappa_{i}({\bm{x}},\cdot)
6:ย ย ย ย ย ย ย ย Compute ft,iโ€‹(๐’™t)f_{t,i}({\bm{x}}_{t}) according to (4)
7:ย ย ย ย ย endย for
8:ย ย ย ย ย Output y^t=signโ€‹(ftโ€‹(๐’™t))\hat{y}_{t}=\mathrm{sign}(f_{t}({\bm{x}}_{t})) and receive yty_{t}โ€„
9:ย ย ย ย ย forย i=1,โ€ฆ,Ki=1,\ldots,Kย do
10:ย ย ย ย ย ย ย ย ifย ytโ€‹ft,iโ€‹(๐’™t)<1y_{t}f_{t,i}({\bm{x}}_{t})<1ย then
11:ย ย ย ย ย ย ย ย ย ย ย ifย conโ€‹(aโ€‹(i))\mathrm{con}(a(i))ย then
12:ย ย ย ย ย ย ย ย ย ย ย ย ย ย Update ft,iโ€ฒf^{\prime}_{t,i} following (7)
13:ย ย ย ย ย ย ย ย ย ย ย else
14:ย ย ย ย ย ย ย ย ย ย ย ย ย ย Sample bt,ib_{t,i}
15:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ifย bt,i=1b_{t,i}=1 and |Si|โ‰คฮฑโ€‹โ„›iโˆ’1|S_{i}|\leq\alpha\mathcal{R}_{i}-1ย then
16:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย Update ft,iโ€ฒf^{\prime}_{t,i} following (8)
17:ย ย ย ย ย ย ย ย ย ย ย ย ย ย endย if
18:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ifย bt,i=1b_{t,i}=1ย and\mathrm{and}ย |Si|=ฮฑโ€‹โ„›i|S_{i}|=\alpha\mathcal{R}_{i}ย then
19:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย Compute fยฏtโˆ’1,iโ€ฒโ€‹(1)\bar{f}^{\prime}_{t-1,i}(1)
20:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย Update ft,iโ€ฒf^{\prime}_{t,i} following (11)
21:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย Remove the latest ฮฑโ€‹โ„›i2\frac{\alpha\mathcal{R}_{i}}{2} examples from SiS_{i}
22:ย ย ย ย ย ย ย ย ย ย ย ย ย ย endย if
23:ย ย ย ย ย ย ย ย ย ย ย ย ย ย Update Si=Siโˆช{(๐’™t,yt):bt,i=1}S_{i}=S_{i}\cup\{({\bm{x}}_{t},y_{t}):b_{t,i}=1\}
24:ย ย ย ย ย ย ย ย ย ย ย endย if
25:ย ย ย ย ย ย ย ย endย if
26:ย ย ย ย ย ย ย ย Compute ๐’‘t+1{\bm{p}}_{t+1} by (10)
27:ย ย ย ย ย ย ย ย Update reservoir VV and S0S_{0}
28:ย ย ย ย ย endย for
29:ย ย endย for
Lemma 1.

Let M>1 and \alpha\mathcal{R}:=B\geq 2M(1+\ln{T}). For any \mathcal{I}_{T}, the expected number of removal operations that M-OMD-H executes on S_{i} is at most \left\lceil\frac{4K\tilde{\mathcal{A}}_{T,\kappa_{i}}}{Bk_{1}}\right\rceil, in which

๐’œ~T,ฮบi=1+โˆ‘t=2Tโ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)|Vt|โ€–โ„‹i2.\tilde{\mathcal{A}}_{T,\kappa_{i}}=1+\sum^{T}_{t=2}\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\frac{\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)}{|V_{t}|}\right\|^{2}_{\mathcal{H}_{i}}.

Note that the number of removal operations depends on the kernel alignment, rather than on T.

Theorem 1.

Suppose โ„›\mathcal{R} satisfies the condition in Lemma 1. Let B>KB>K, U>1U>1, ฮปi=Uโ€‹K2โ€‹B\lambda_{i}=U\frac{\sqrt{K}}{\sqrt{2B}}, ฮทt\eta_{t} follow (10) and

ฮณt,i=โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i21+โˆ‘ฯ„โ‰คtโ€–โˆ‡ฯ„,iโˆ’โˆ‡^ฯ„,iโ€–โ„‹i2โ‹…๐•€โˆ‡ฯ„,iโ‰ 0.\gamma_{t,i}=\frac{\left\|\nabla_{t,i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}}{\sqrt{1+\sum_{\tau\leq t}\left\|\nabla_{\tau,i}-\hat{\nabla}_{\tau,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{\tau,i}\neq 0}}}. (12)

The expected regret of M-OMD-H satisfies,

\forall\kappa_{i}\in\mathcal{K},f\in\mathbb{H}_{i},\quad\mathbb{E}\left[\mathrm{Reg}(f)\right]=O\left(\sqrt{UL_{T}(f)\ln{K}}+\frac{U\sqrt{B}}{\sqrt{K}}+\frac{\sqrt{K}U\mathcal{A}_{T,\kappa_{i}}\ln{T}}{\sqrt{B}k_{1}}\right).

For any competitor f satisfying \|f\|_{\mathcal{H}_{i}}<1, we must have L_{T}(f)=\Theta(T), which induces a trivial upper bound. To address this issue, we must set U>1 and exclude such competitors. Theorem 1 reveals how the regret bound depends on K, \mathcal{R}, and \mathcal{A}_{T,\kappa_{i}}. The larger the memory budget is, the smaller the regret bound will be. Theorem 1 thus gives an answer to Q1. If the optimal kernel \kappa^{\ast}\in\mathcal{K} matches well with the examples, i.e., \mathcal{A}_{T,\kappa^{\ast}}=o(T), then it is possible to achieve an o(T) regret bound in the case of \mathcal{R}=\Theta(\ln{T}). Such a result gives an answer to Q2. By Definition 2, learning is possible in \mathbb{H}_{\kappa^{\ast}}. The information-theoretic cost of not knowing the optimal kernel is O\left(\sqrt{UL_{T}(f)\ln{K}}\right), which is a lower-order term.

Our regret bound also recovers state-of-the-art results. Let B=\Theta(\mathcal{A}_{T,\kappa_{i}}). Then we obtain an O\left(U\sqrt{K\mathcal{A}_{T,\kappa_{i}}}\right) expected bound. Previous work proved an O\left(U\sqrt{K}T^{\frac{1}{4}}\mathcal{A}^{\frac{1}{4}}_{T,\kappa_{i}}\right) high-probability bound [Liao2021High]. If \mathcal{A}_{T,\kappa_{i}}=O(T), then we achieve an expected bound of O\left(U\sqrt{K}\frac{T}{\sqrt{B}}\right), which matches the upper bound in [Li2022Worst]. If B=\Theta(T), then our regret bound matches the upper bounds in [Sahoo2014Online, Foster2017Parameter, Shen2019Random].

Next, we present an algorithm-dependent bound demonstrating that removing half of the examples can be more effective than the restart technique. We introduce two new notations as follows:

๐’ฅi=\displaystyle\mathcal{J}_{i}= {tโˆˆ[T]:|Si|=ฮฑโ€‹โ„›,bt,i=1},\displaystyle\{t\in[T]:|S_{i}|=\alpha\mathcal{R},b_{t,i}=1\},
ฮ›i=\displaystyle\Lambda_{i}= โˆ‘tโˆˆ๐’ฅi[โ€–fยฏtโˆ’1,iโ€ฒโ€‹(1)โˆ’fโ€–โ„‹i2โˆ’โ€–ftโˆ’1,iโ€ฒโˆ’fโ€–โ„‹i2].\displaystyle\sum_{t\in\mathcal{J}_{i}}\left[\left\|\bar{f}^{\prime}_{t-1,i}(1)-f\right\|^{2}_{\mathcal{H}_{i}}-\left\|f^{\prime}_{t-1,i}-f\right\|^{2}_{\mathcal{H}_{i}}\right].

There must be a constant \xi_{i}\in(0,4] such that \Lambda_{i}=\xi_{i}U^{2}|\mathcal{J}_{i}|. Recall that \bar{f}^{\prime}_{t-1,i}(1) is the initial hypothesis after a removal operation. Naturally, if \left(\bar{f}^{\prime}_{t-1,i}(1)-f\right) is close to \left(f^{\prime}_{t-1,i}-f\right), then \xi_{i}\ll 4. The restart technique sets \bar{f}^{\prime}_{t-1,i}(1)=0, implying \Lambda_{i}\leq U^{2}|\mathcal{J}_{i}|. In the worst case, our approach is slightly worse than the restart technique. If \xi_{i} is sufficiently small, then our approach is much better than the restart technique.

Theorem 2 (Algorithm-dependent Bound).

Suppose the conditions in Theorem 1 are satisfied. For each iโˆˆ[K]i\in[K], let ฮปi=2โ€‹U5โ€‹๐’œ~T,ฮบi\lambda_{i}=\frac{\sqrt{2}U}{\sqrt{5\tilde{\mathcal{A}}_{T,\kappa_{i}}}}. If ฮพiโ‹…|๐’ฅi|โ‰ค1\xi_{i}\cdot|\mathcal{J}_{i}|\leq 1 for all iโˆˆ[K]i\in[K], then the expected regret of M-OMD-H satisfies,

โˆ€ฮบiโˆˆ๐’ฆ,fโˆˆโ„i,๐”ผโ€‹[Regโ€‹(f)]=Oโ€‹(LTโ€‹(f)โ€‹lnโกK+Uโ€‹KBโ€‹๐’œT,ฮบiโ€‹lnโกT+Uโ€‹๐’œT,ฮบiโ€‹lnโกT).\forall\kappa_{i}\in\mathcal{K},f\in\mathbb{H}_{i},\quad\mathbb{E}\left[\mathrm{Reg}(f)\right]=O\left(\sqrt{L_{T}(f)\ln{K}}+\frac{UK}{B}\mathcal{A}_{T,\kappa_{i}}\ln{T}+U\sqrt{\mathcal{A}_{T,\kappa_{i}}\ln{T}}\right).

The dominant term in Theorem 2 is \tilde{O}\left(\frac{UK\mathcal{A}_{T,\kappa_{i}}}{B}\right), while the dominant term in Theorem 1 is \tilde{O}\left(\frac{UK\mathcal{A}_{T,\kappa_{i}}}{\sqrt{B}}\right). The restart technique only enjoys the regret bound given in Theorem 1. Thus removing half of the examples can be better than the restart technique.

5.2 Smooth Loss Functions

We first define the smooth loss functions.

Assumption 3 ([Zhang2013Online]).

Let G1G_{1} and G2G_{2} be positive constants. For any uu and yy, โ„“โ€‹(u,y)\ell(u,y) satisfies (i) |โ„“โ€ฒโ€‹(u,y)|โ‰คG1|\ell^{\prime}(u,y)|\leq G_{1}, (ii) |โ„“โ€ฒโ€‹(u,y)|โ‰คG2โ€‹โ„“โ€‹(u,y)|\ell^{\prime}(u,y)|\leq G_{2}\ell(u,y), where โ„“โ€ฒโ€‹(u,y)=dโ€‹โ„“โ€‹(u,y)dโ€‹u\ell^{\prime}(u,y)=\frac{\mathrm{d}\,\ell(u,y)}{\mathrm{d}\,u}.

The logistic loss function satisfies Assumption 3 with G_{2}=1. For the hinge loss function, M-OMD-H uniformly allocates the memory budget over the K kernels, which induces an O\left(\sqrt{K}\right) factor in the regret bound. For smooth loss functions, we propose a memory sharing scheme that avoids the O\left(\sqrt{K}\right) factor.

Recall that the memory is used to store the examples and the coefficients. The main challenge is how to share the examples. To this end, we keep only a single buffer S. We first rewrite f_{t}({\bm{x}}_{t}) as follows,

ftโ€‹(๐’™t)=โˆ‘i=1Kpt,iโ€‹ft,iโ€‹(๐’™t)=โˆ‘i=1Kpt,iโ€‹โˆ‘๐’™ฯ„โˆˆSiaฯ„,iโ€‹ฮบiโ€‹(๐’™ฯ„,๐’™t).f_{t}({\bm{x}}_{t})=\sum^{K}_{i=1}p_{t,i}f_{t,i}({\bm{x}}_{t})=\sum^{K}_{i=1}p_{t,i}\sum_{{\bm{x}}_{\tau}\in S_{i}}a_{\tau,i}\kappa_{i}({\bm{x}}_{\tau},{\bm{x}}_{t}).

We just need to ensure Si=SS_{i}=S for all iโˆˆ[K]i\in[K]. For each ft,if_{t,i}, we define a surrogate gradient โˆ‡t,i=โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…ฮบiโ€‹(๐’™t,โ‹…)\nabla_{t,i}=\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\kappa_{i}({\bm{x}}_{t},\cdot). Note that we use the first derivative โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t}), rather than โ„“โ€ฒโ€‹(ft,iโ€‹(๐’™t),yt)\ell^{\prime}(f_{t,i}({\bm{x}}_{t}),y_{t}). Let โˆ‡^t,i=0\hat{\nabla}_{t,i}=0 in (4). Then we obtain a single update rule as follows

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ft,i)}.f_{t+1,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\langle f,\nabla_{t,i}\rangle+\mathcal{B}_{\psi_{i}}(f,f_{t,i})\right\}.

To ensure Si=SS_{i}=S, we must change the sampling scheme (6). Let bt,i=btb_{t,i}=b_{t} and conโ€‹(aโ€‹(i))=conโ€‹(a)\mathrm{con}(a(i))=\mathrm{con}(a) for all iโˆˆ[K]i\in[K]. We define ฮฝ=1\nu=1 and

Zt=\displaystyle Z_{t}= โ€–โˆ‡t,iโ€–โ„‹i+G1โ€‹ฮบiโ€‹(๐’™t,๐’™t),\displaystyle\|\nabla_{t,i}\|_{\mathcal{H}_{i}}+G_{1}\sqrt{\kappa_{i}({\bm{x}}_{t},{\bm{x}}_{t})},
conโ€‹(a):=\displaystyle\mathrm{con}(a):= maxiโˆˆ[K]โกโ€–ฮบiโ€‹(๐’™iโ€‹(st),โ‹…)โˆ’ฮบiโ€‹(๐’™t,โ‹…)โ€–โ„‹iโ‰คฮณt.\displaystyle\max_{i\in[K]}\|\kappa_{i}({\bm{x}}_{i(s_{t})},\cdot)-\kappa_{i}({\bm{x}}_{t},\cdot)\|_{\mathcal{H}_{i}}\leq\gamma_{t}.

If conโ€‹(a)\mathrm{con}(a), then we replace (7) with (13).

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡iโ€‹(st),iโŸฉ+โ„ฌฯˆiโ€‹(f,ft,i)},โˆ‡iโ€‹(st),i=โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…ฮบiโ€‹(๐’™iโ€‹(st),โ‹…).\begin{split}f_{t+1,i}=&\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\nabla_{i(s_{t}),i}\right\rangle+\mathcal{B}_{\psi_{i}}(f,f_{t,i})\right\},\\ \nabla_{i(s_{t}),i}=&\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\kappa_{i}\left({\bm{x}}_{i(s_{t})},\cdot\right).\end{split} (13)

Otherwise, we replace (8) with (14).

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ft,i)}.f_{t+1,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}(f,f_{t,i})\right\}. (14)

If |S|=ฮฑโ€‹โ„›|S|=\alpha\mathcal{R} and bt=1b_{t}=1, then let ot=1o_{t}=1. We remove ft,iโ€‹(1)f_{t,i}(1) from ft,if_{t,i} and redefine (9) as follows

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏt,iโ€‹(2))}.f_{t+1,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,\bar{f}_{t,i}(2)\right)\right\}. (15)

Let ๐’‘1{\bm{p}}_{1} be the uniform distribution and ๐’‘t+1{\bm{p}}_{t+1} follow (10), in which ct,ic_{t,i} follows (16).

Definition 3.

At each round tt, โˆ€ฮบiโˆˆ๐’ฆ\forall\kappa_{i}\in\mathcal{K}, let

ct,i={โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…(ft,iโ€‹(๐’™t)โˆ’minjโˆˆ[K]โกft,jโ€‹(๐’™t)),ifโ€‹โ„“โ€ฒโ€‹(ft)>0,โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…(ft,iโ€‹(๐’™t)โˆ’maxjโˆˆ[K]โกft,jโ€‹(๐’™t)),otherwise.c_{t,i}=\left\{\begin{array}[]{ll}\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\left(f_{t,i}({\bm{x}}_{t})-\min_{j\in[K]}f_{t,j}({\bm{x}}_{t})\right),&~{}\mathrm{if}~{}\ell^{\prime}(f_{t})>0,\\ \ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\left(f_{t,i}({\bm{x}}_{t})-\max_{j\in[K]}f_{t,j}({\bm{x}}_{t})\right),&~{}\mathrm{otherwise}.\end{array}\right. (16)
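A minimal sketch of (16) (illustrative names): the shift by the minimum or maximum prediction keeps every c_{t,i} nonnegative, as required by the PEA losses fed into (10).

```python
import numpy as np

def surrogate_costs(preds, grad):
    """Compute c_{t,i} in (16).

    preds : (K,) array of expert predictions f_{t,i}(x_t)
    grad  : scalar l'(f_t(x_t), y_t)
    """
    anchor = preds.min() if grad > 0 else preds.max()
    return grad * (preds - anchor)   # elementwise, always >= 0
```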

We name this algorithm M-OMD-S (Memory bounded OMD for Smooth loss function) and present the pseudo-code in Algorithm 2.

Algorithm 2 M-OMD-S
0:ย ย {ฮปi}i=1K\{\lambda_{i}\}^{K}_{i=1}, {ฮทt}t=1T\{\eta_{t}\}^{T}_{t=1}, โ„›\mathcal{R}, UU.
0:ย ย S=โˆ…S=\emptyset, f1,i=0,iโˆˆ[K]f_{1,i}=0,i\in[K].
1:ย ย forย t=1,2,โ€ฆ,Tt=1,2,\ldots,Tย do
2:ย ย ย ย ย Receive ๐’™t{\bm{x}}_{t}
3:ย ย ย ย ย forย i=1,โ€ฆ,Ki=1,\ldots,Kย do
4:ย ย ย ย ย ย ย ย Find ๐’™iโ€‹(st){\bm{x}}_{i(s_{t})}
5:ย ย ย ย ย ย ย ย Compute ft,iโ€‹(๐’™t)f_{t,i}({\bm{x}}_{t})
6:ย ย ย ย ย endย for
7:ย ย ย ย ย Output ftโ€‹(๐’™t)f_{t}({\bm{x}}_{t}) or y^t=signโ€‹(ftโ€‹(๐’™t))\hat{y}_{t}=\mathrm{sign}(f_{t}({\bm{x}}_{t})) and receive yty_{t}
8:ย ย ย ย ย ifย conโ€‹(a)\mathrm{con}(a)ย then
9:ย ย ย ย ย ย ย ย forย i=1,โ€ฆ,Ki=1,\ldots,Kย do
10:ย ย ย ย ย ย ย ย ย ย ย Compute ft+1,if_{t+1,i} following (13)
11:ย ย ย ย ย ย ย ย endย for
12:ย ย ย ย ย else
13:ย ย ย ย ย ย ย ย Sample btb_{t}
14:ย ย ย ย ย ย ย ย ifย bt=1b_{t}=1 and |S|โ‰คฮฑโ€‹โ„›โˆ’1|S|\leq\alpha\mathcal{R}-1ย then
15:ย ย ย ย ย ย ย ย ย ย ย โˆ€iโˆˆ[K]\forall i\in[K], update ft+1,if_{t+1,i} following (14)
16:ย ย ย ย ย ย ย ย endย if
17:ย ย ย ย ย ย ย ย ifย bt=1b_{t}=1ย and\mathrm{and}ย |S|=ฮฑโ€‹โ„›|S|=\alpha\mathcal{R}ย then
18:ย ย ย ย ย ย ย ย ย ย ย Remove the first 12โ€‹ฮฑโ€‹โ„›\frac{1}{2}\alpha\mathcal{R} support vectors from SS
19:ย ย ย ย ย ย ย ย ย ย ย โˆ€iโˆˆ[K]\forall i\in[K], update ft+1,if_{t+1,i} following (15)
20:ย ย ย ย ย ย ย ย endย if
21:ย ย ย ย ย ย ย ย Update S=Sโˆช{(๐’™t,yt):bt=1}S=S\cup\{({\bm{x}}_{t},y_{t}):b_{t}=1\}
22:ย ย ย ย ย endย if
23:ย ย ย ย ย Compute ๐’‘t+1{\bm{p}}_{t+1} by (10)
24:ย ย endย for
Lemma 2.

Suppose \ell(\cdot,\cdot) satisfies Assumption 3. Let \delta\in(0,1) and \alpha\mathcal{R}:=B\geq 21\ln\frac{1}{\delta}. For any \mathcal{I}_{T}, with probability at least 1-\frac{T}{B}\Theta(\lceil\ln{T}\rceil)\delta, the number of removal operations that M-OMD-S executes on S is at most \left\lceil\frac{4G_{2}\hat{L}_{1:T}}{(B-\frac{4}{3}\ln\frac{1}{\delta})G_{1}}\right\rceil, in which

L^1:T=โˆ‘t=1Tโ„“โ€‹(ftโ€‹(๐’™t),yt).\hat{L}_{1:T}=\sum^{T}_{t=1}\ell(f_{t}({\bm{x}}_{t}),y_{t}).

It is important that the number of removal operations depends on the cumulative losses of the algorithm, rather than on T.

Theorem 3.

Suppose โ„›\mathcal{R} and โ„“โ€‹(โ‹…,โ‹…)\ell(\cdot,\cdot) satisfy Lemma 2. Let ฮดโˆˆ(0,1)\delta\in(0,1), Kโ‰คdK\leq d, ฮทt\eta_{t} follow (10),

ฮณt=\displaystyle\gamma_{t}= 2โ€‹lnโกK1+โˆ‘ฯ„=1t|โ„“โ€ฒโ€‹(fฯ„โ€‹(๐’™ฯ„),yฯ„)|,\displaystyle\frac{\sqrt{2\ln{K}}}{\sqrt{1+\sum^{t}_{\tau=1}|\ell^{\prime}(f_{\tau}({\bm{x}}_{\tau}),y_{\tau})|}}, (17)
Uโ‰ค\displaystyle U\leq 18โ€‹G2โ€‹Bโˆ’43โ€‹lnโก1ฮด,\displaystyle\frac{1}{8G_{2}}\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}},

and ฮปi=2โ€‹UG1โ€‹B\lambda_{i}=\frac{2U}{G_{1}\sqrt{B}} for all iโˆˆ[K]i\in[K]. With probability at least 1โˆ’3โ€‹Tฮฑโ€‹โ„›โ€‹ฮ˜โ€‹(โŒˆlnโกTโŒ‰)โ€‹ฮด1-\frac{3T}{\alpha\mathcal{R}}\Theta(\lceil\ln{T}\rceil)\delta, the regret of M-OMD-S satisfies

โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„i,\displaystyle\forall\kappa_{i}\in\mathcal{K},\forall f\in\mathbb{H}_{i}, Regโ€‹(f)โ‰ค\displaystyle\quad\mathrm{Reg}(f)\leq
24โ€‹Uโ€‹G2โ€‹LTโ€‹(f)Bโˆ’43โ€‹lnโก1ฮด+Uโ€‹G1โ€‹B+100โ€‹U2โ€‹G2โ€‹G1โ€‹lnโก1ฮด(1โˆ’ฮณ)2+10โ€‹U(1โˆ’ฮณ)32โ€‹G2โ€‹G1โ€‹lnโก1ฮดโ€‹LTโ€‹(f)+Uโ€‹G1โ€‹B4,\displaystyle\frac{24UG_{2}L_{T}(f)}{\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}}+UG_{1}\sqrt{B}+\frac{100U^{2}G_{2}G_{1}\ln\frac{1}{\delta}}{(1-\gamma)^{2}}+\frac{10U}{(1-\gamma)^{\frac{3}{2}}}\sqrt{G_{2}G_{1}\ln\frac{1}{\delta}}\sqrt{L_{T}(f)+\frac{UG_{1}\sqrt{B}}{4}},

where ฮณ=6โ€‹Uโ€‹G2Bโˆ’43โ€‹lnโก1ฮด\gamma=\frac{6UG_{2}}{\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}}.

In the case of B\ll T, the dominant term in the upper bound is O\left(\frac{L_{T}(f)}{\sqrt{B}}\right). Previous work proved an \Omega\left(\frac{T}{B}\right) lower bound [Zhang2013Online]. Thus our result establishes a new trade-off between memory constraint and regret, which gives an answer to Q1. If \min_{f\in\mathbb{H}_{\kappa^{\ast}}}L_{T}(f)=o(T), then a \Theta(\ln{T}) memory budget is enough to achieve a sub-linear regret, which answers Q2. By Definition 2, we also conclude that \mathbb{H}_{\kappa^{\ast}} is online learnable. The penalty term for not knowing the optimal kernel is O\left(U\sqrt{L_{T}(f)\ln{K}}\right), which depends on K only through the O\left(\sqrt{\ln{K}}\right) factor. In this case, online kernel selection is not much harder than online kernel learning.

Theorem 3 can also recover some state-of-the-art results. Let B=\Theta(T). We obtain an O\left(U\sqrt{T\ln{K}}\right) regret bound, which is the same as the regret bound in [Sahoo2014Online, Foster2017Parameter, Shen2019Random]. If K=1 and B=\Theta(L_{T}(f)), then we obtain an O\left(U\sqrt{L_{T}(f)}\right) regret bound, matching the regret bound in [Zhang2013Online]. Next we prove that M-OMD-S is optimal.

Theorem 4 (Lower Bound).

Let \ell(u,y)=\ln(1+\exp(-yu)), which satisfies Assumption 3, and \kappa({\bm{x}},{\bm{v}})=\langle{\bm{x}},{\bm{v}}\rangle^{p}, where p>0 is a constant. Let U<\frac{\sqrt{3B}}{5} and \mathbb{H}=\{f\in\mathcal{H}_{\kappa}:\|f\|_{\mathcal{H}}\leq U\}. There exists an \mathcal{I}_{T} such that the regret of any online kernel algorithm storing B examples satisfies

โˆƒfโˆˆโ„,๐”ผโ€‹[Regโ€‹(f)]=ฮฉโ€‹(Uโ‹…LTโ€‹(f)B).\exists f\in\mathbb{H},\quad\mathbb{E}\left[\mathrm{Reg}(f)\right]=\Omega\left(U\cdot\frac{L_{T}(f)}{\sqrt{B}}\right).

The expectation is with respect to the randomness of the algorithm and the environment.

Our lower bound improves the previous lower bound ฮฉโ€‹(TB)\Omega\left(\frac{T}{B}\right) [Zhang2013Online]. Next we show an algorithm-dependent upper bound. Let

๐’ฅ=\displaystyle\mathcal{J}= {tโˆˆ[T]:|S|=ฮฑโ€‹โ„›,bt=1},\displaystyle\left\{t\in[T]:|S|=\alpha\mathcal{R},b_{t}=1\right\},
ฮ›i=\displaystyle\Lambda_{i}= โˆ‘tโˆˆ๐’ฅ[โ€–ft,iโ€‹(2)โˆ’fโ€–โ„‹i2โˆ’โ€–ft,iโˆ’fโ€–โ„‹i2].\displaystyle\sum_{t\in\mathcal{J}}\left[\left\|f_{t,i}(2)-f\right\|^{2}_{\mathcal{H}_{i}}-\left\|f_{t,i}-f\right\|^{2}_{\mathcal{H}_{i}}\right].

There must be a constant ฮพiโˆˆ(0,4]\xi_{i}\in(0,4] such that ฮ›i=ฮพiโ€‹U2โ€‹|๐’ฅ|\Lambda_{i}=\xi_{i}U^{2}|\mathcal{J}|. It is natural that if (ft,iโ€‹(2)โˆ’f)(f_{t,i}(2)-f) is close to (ft,iโˆ’f)(f_{t,i}-f), then ฮพiโ‰ช4\xi_{i}\ll 4. The restart technique sets ft,iโ€‹(2)=0f_{t,i}(2)=0 implying ฮ›iโ‰คU2โ€‹|๐’ฅ|\Lambda_{i}\leq U^{2}|\mathcal{J}|.

Theorem 5 (Algorithm-dependent Bound).

Suppose the conditions in Theorem 3 are satisfied. Let ฮดโˆˆ(0,1)\delta\in(0,1) and Uโ‰คBโˆ’43โ€‹lnโก1ฮด32โ€‹G2U\leq\frac{B-\frac{4}{3}\ln\frac{1}{\delta}}{32G_{2}}. If ฮพi<1|๐’ฅ|\xi_{i}<\frac{1}{|\mathcal{J}|} for all iโˆˆ[K]i\in[K], then with probability at least 1โˆ’3โ€‹Tฮฑโ€‹โ„›โ€‹ฮ˜โ€‹(โŒˆlnโกTโŒ‰)โ€‹ฮด1-\frac{3T}{\alpha\mathcal{R}}\Theta(\lceil\ln{T}\rceil)\delta, the regret of M-OMD-S satisfies,

โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„i,Regโ€‹(f)โ‰ค32โ€‹Uโ€‹G2โ€‹LTโ€‹(f)Bโˆ’43โ€‹lnโก1ฮด+144โ€‹U2โ€‹G2โ€‹G1โ€‹lnโก1ฮด(1โˆ’ฮณ)2+12โ€‹U(1โˆ’ฮณ)32โ€‹G2โ€‹G1โ€‹LTโ€‹(f)โ€‹lnโก1ฮด,\displaystyle\forall\kappa_{i}\in\mathcal{K},\forall f\in\mathbb{H}_{i},\quad\mathrm{Reg}(f)\leq\frac{32UG_{2}L_{T}(f)}{B-\frac{4}{3}\ln\frac{1}{\delta}}+\frac{144U^{2}G_{2}G_{1}\ln\frac{1}{\delta}}{(1-\gamma)^{2}}+\frac{12U}{(1-\gamma)^{\frac{3}{2}}}\sqrt{G_{2}G_{1}L_{T}(f)\ln\frac{1}{\delta}},

where ฮณ=16โ€‹Uโ€‹G2Bโˆ’43โ€‹lnโก1ฮด\gamma=\frac{16UG_{2}}{B-\frac{4}{3}\ln\frac{1}{\delta}}.

The dominant term in Theorem 5 is O\left(\frac{UG_{2}L_{T}(f)}{B}\right), while the dominant term in Theorem 3 is O\left(\frac{UG_{2}L_{T}(f)}{\sqrt{B}}\right). The restart technique only enjoys the regret bound given in Theorem 3. Thus removing half of the examples can be better than the restart technique.

6 Experiments

In this section, we will verify the following two goals,

  1. G1

    Learning is possible in the optimal hypothesis space within a small memory budget.
    We aim to verify that the two data-dependent complexities are smaller than the worst-case complexity T, that is, \min_{\kappa_{i},i\in[K]}\mathcal{A}_{T,\kappa_{i}}\ll T and \min_{f\in\mathbb{H}_{i},i\in[K]}L_{T}(f)\ll T.

  2. G2

    Our algorithms perform better than the baseline algorithms under the same memory budget.

6.1 Experimental Setting

We implement M-OMD-S with the logistic loss function. Both M-OMD-H and M-OMD-S are suitable for online binary classification tasks. We adopt the Gaussian kernel \kappa({\bm{x}},{\bm{v}})=\exp(-\frac{\|{\bm{x}}-{\bm{v}}\|^{2}_{2}}{2\sigma^{2}}) and choose eight classification datasets. The information of the datasets is shown in Table 1. The w8a, ijcnn1, and cod-rna datasets are downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, Jul. 2024). The other datasets are downloaded from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.php, Jul. 2024). All algorithms are implemented in R on a Windows machine with a 2.8 GHz Core(TM) i7-1165G7 CPU. We execute each experiment ten times, each with a random permutation of the dataset.

Table 1: Datasets Used in Experiments
Datasets #sample # Feature Datasets #sample # Feature Datasets #sample # Feature
mushrooms 8,124 112 phishing 11,055 68 magic04 19,020 10
a9a 48,842 123 w8a 49,749 300 SUSY 50,000 18
ijcnn1 141,691 22 cod-rna 271,617 8 -

We compare our algorithms with two baseline algorithms.

  • โ€ข

    LKMBooks: OKS with memory constraint via randomly adding examples [Li2022Worst].

  • โ€ข

    Raker: online multi-kernel learning using random features [Shen2019Random].

Our algorithms improve the regret bounds of Raker and LKMBooks, and we expect that our algorithms also perform better. Although there are other online kernel selection algorithms, such as OKS [Yang2012Online], and online multi-kernel learning algorithms, such as OMKR [Sahoo2014Online] and OMKC [Hoi2013Online], we do not compare with them, since they do not limit the memory budget.

We select K=5 Gaussian kernel functions by setting \sigma=2^{[-2:2:6]}. For the baseline algorithms, we tune the hyper-parameters following the original papers. To be specific, LKMBooks has three hyper-parameters \nu\in(0,1), \lambda and \eta, where \lambda and \eta are learning rates. We reset \lambda=\frac{10}{\sqrt{1-\nu}T}\sqrt{(1+\nu)B} to improve its performance and tune \nu\in\{\frac{1}{2},\frac{1}{3},\frac{1}{4}\}. Raker has a learning rate \eta and a regularization parameter \lambda. We tune \eta=\frac{10^{-3:1:3}}{\sqrt{T}} and \lambda\in\{0.05,0.005,0.0005\}. For M-OMD-H and M-OMD-S, we set an aggressive value of U=\sqrt{B} following Theorem 2 and Theorem 5. We tune the learning rate \lambda_{i}=c\frac{U}{\sqrt{B}} where c\in\{2,1,0.5\}. The value of \eta_{t} follows (10). The value of \gamma_{t} follows (12) and (17). For M-OMD-H, we set M=10.
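For concreteness, the grid \sigma=2^{[-2:2:6]} corresponds to the five bandwidths 2^{-2},2^{0},2^{2},2^{4},2^{6}; a minimal sketch (illustrative names) of building the kernel set is:

```python
import numpy as np

sigmas = [2.0 ** e for e in range(-2, 7, 2)]          # 2^{-2}, 2^0, 2^2, 2^4, 2^6

def gaussian_kernel(x, v, sigma):
    return np.exp(-np.linalg.norm(x - v) ** 2 / (2.0 * sigma ** 2))

# K = 5 base kernels kappa_i(x, v), one per bandwidth.
kernels = [lambda x, v, s=s: gaussian_kernel(x, v, s) for s in sigmas]
```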

We always set D=400 in Raker, where D is the number of random features and plays a role similar to B. For LKMBooks and M-OMD-S, we always set the budget B=400. Note that M-OMD-H uniformly allocates the memory budget over the K hypotheses and the reservoir. Thus we separately set B=100 and B=400 for M-OMD-H, denoted by M-OMD-H (B=100) and M-OMD-H (B=400). Raker, LKMBooks, M-OMD-S, and M-OMD-H (B=100) use the same memory budget, while M-OMD-H (B=400) uses a larger one.

6.2 Experimental Results

We separately use the hinge loss function and the logistic loss function.

6.2.1 Hinge Loss Function

We first compare the AMR of all algorithms. Table 2 shows the results. On the whole, M-OMD-H (B=400) performs best on most of the datasets, while M-OMD-H (B=100) performs worst. The reason is that M-OMD-H uniformly allocates the memory budget over all kernels, thereby inducing an O\left(\sqrt{K}\right) factor in the regret bound. M-OMD-H (B=400) uses a larger memory budget and thereby performs better. The experimental results do not completely coincide with G2. It is left for future work to study whether splitting the memory budget across kernels can be avoided.

Table 2 also reports the average running time. It is natural that M-OMD-H (B=400)(B=400) requires the longest running time, since it uses a larger memory budget. The running time of the other three algorithms is similar.

Next we analyze the value of ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}. Computing ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} requires Oโ€‹(T2)O(T^{2}) time. For simplicity, we will define a proxy of ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}. Let

๐’œ^T,i:=โˆ‘ฯ„=1Tโ€–โˆ‡ฯ„,iโˆ’โˆ‡^ฯ„,iโ€–โ„‹i2โ‹…๐•€โˆ‡ฯ„,iโ‰ 0.\hat{\mathcal{A}}_{T,i}:=\sum^{T}_{\tau=1}\left\|\nabla_{\tau,i}-\hat{\nabla}_{\tau,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{\tau,i}\neq 0}.

To be specific, we use miniโˆˆ[K]โก๐’œ^T,i\min_{i\in[K]}\hat{\mathcal{A}}_{T,i} as a proxy of ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}. In fact, our regret bound depends on ๐’œ^T,i\hat{\mathcal{A}}_{T,i} which satisfies ๐’œ^T,i=O~โ€‹(๐’œT,ฮบi)\hat{\mathcal{A}}_{T,i}=\tilde{O}(\mathcal{A}_{T,\kappa_{i}}). To obtain a precise estimation of ๐’œT,ฮบi\mathcal{A}_{T,\kappa_{i}}, we again run M-OMD-H with M=30M=30 and B=400B=400.

Table 2 shows the results. We find that \mathcal{A}_{T,\kappa^{\ast}}<T on all datasets. In particular, \mathcal{A}_{T,\kappa^{\ast}}\ll T on the mushrooms, phishing, and w8a datasets. In the optimal hypothesis space, a small memory budget is enough to guarantee a small regret and thus learning is possible. The experimental results verify G1.

Table 2: Experimental Results Using the Hinge Loss Function
Algorithm mushrooms, T=8124T=8124 phishing, T=11055T=11055
AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}
Raker [Shen2019Random] 11.97 ยฑ\pm 0.93 400 1.57 - 9.13 ยฑ\pm 0.28 400 2.06 -
LKMBooks [Li2022Worst] 6.04 ยฑ\pm 0.62 400 1.64 - 12.81 ยฑ\pm 0.54 400 2.21 -
M-OMD-H 8.09 ยฑ\pm 2.51 100 5.62 - 19.41 ยฑ\pm 3.78 100 5.05 -
M-OMD-H 0.69 ยฑ\pm 0.07 400 10.97 220 8.87 ยฑ\pm 0.40 400 10.74 2849
Algorithm magic04, T=19020T=19020 a9a, T=48842T=48842
AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}
Raker [Shen2019Random] 32.14 ยฑ\pm 0.43 400 3.25 - 23.47 ยฑ\pm 0.23 400 9.45 -
LKMBooks [Li2022Worst] 23.66 ยฑ\pm 0.51 400 1.31 - 21.29 ยฑ\pm 1.36 400 10.15 -
M-OMD-H 29.08 ยฑ\pm 3.78 100 5.50 - 24.84 ยฑ\pm 1.43 100 31.39 -
M-OMD-H 22.75 ยฑ\pm 0.57 400 6.09 10422 19.88 ยฑ\pm 1.44 400 49.29 19393
Algorithm w8a, T=49749T=49749 SUSY, T=50000T=50000
AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}
Raker [Shen2019Random] 2.98 ยฑ\pm 0.00 400 11.62 - 29.62 ยฑ\pm 0.46 400 8.88 -
LKMBooks [Li2022Worst] 2.97 ยฑ\pm 0.00 400 23.52 - 29.37 ยฑ\pm 0.87 400 5.47 -
M-OMD-H 2.99 ยฑ\pm 0.10 100 43.20 - 36.21 ยฑ\pm 2.08 100 16.29 -
M-OMD-H 2.65 ยฑ\pm 0.12 400 84.80 4200 27.76 ยฑ\pm 1.04 400 29.45 28438
Algorithm ijcnn1, T=141691T=141691 cod-rna, T=271617T=271617
AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}
Raker [Shen2019Random] 9.49 ยฑ\pm 0.08 400 23.54 - 11.90 ยฑ\pm 0.31 400 53.32 -
LKMBooks [Li2022Worst] 9.58 ยฑ\pm 0.01 400 18.85 - 12.57 ยฑ\pm 0.26 400 19.17 -
M-OMD-H 9.98 ยฑ\pm 0.29 100 33.15 - 14.59 ยฑ\pm 1.47 100 70.33 -
M-OMD-H 9.57 ยฑ\pm 0.43 400 54.81 33626 12.46 ยฑ\pm 0.54 400 130.09 62259

6.2.2 Logistic Loss Function

We first compare the AMR of all algorithms. Table 3 reports the results. Overall, M-OMD-S performs best on most datasets. On the phishing, the SUSY, and the cod-rna datasets, Raker performs better than M-OMD-S. We have found that the current regret analysis of Raker is not tight and can be improved by utilizing Assumption 3. Thus it is reasonable that Raker performs similarly to M-OMD-S. Besides, M-OMD-S performs better than LKMBooks on all datasets. The experimental results coincide with the theoretical analysis, that is, M-OMD-S enjoys a better regret bound than LKMBooks. The experimental results verify G2. It is natural that the running time of all algorithms is similar, as they have the same time complexity.

Finally, we analyze the value of L^{\ast}. For simplicity, we use the cumulative losses of M-OMD-S, \hat{L}_{1:T}, as a proxy of L^{\ast}. Note that L^{\ast}\leq\hat{L}_{1:T}. Table 3 shows that L^{\ast}<T on all datasets. In particular, on the mushrooms, phishing, and w8a datasets, L^{\ast}\ll T. In the optimal hypothesis space, a small memory budget is enough to guarantee a small regret, and thus learning is possible. The experimental results verify G1.

Table 3: Experimental Results Using the Logistic Loss Function
Algorithm mushrooms, T=8124T=8124 phishing, T=11055T=11055
AMR B|DB|D Time (s) Lโˆ—L^{\ast} AMR B|DB|D Time (s) Lโˆ—L^{\ast}
Raker [Shen2019Random] 11.47 ยฑ\pm 0.75 400 1.64 - 9.07 ยฑ\pm 0.44 400 2.05 -
LKMBooks [Li2022Worst] 10.01 ยฑ\pm 1.44 400 2.12 - 15.86 ยฑ\pm 1.28 400 1.94 -
M-OMD-S 2.70 ยฑ\pm 0.32 400 1.74 160 10.38 ยฑ\pm 0.31 400 3.55 3012
Algorithm magic04, T=19020T=19020 a9a, T=48842T=48842
AMR B|DB|D Time (s) Lโˆ—L^{\ast} AMR B|DB|D Time (s) Lโˆ—L^{\ast}
Raker [Shen2019Random] 30.74 ยฑ\pm 1.11 400 3.56 - 23.13 ยฑ\pm 0.17 400 9.90 -
LKMBooks [Li2022Worst] 25.72 ยฑ\pm 0.64 400 1.30 - 22.38 ยฑ\pm 1.29 400 10.98 -
M-OMD-S 23.97 ยฑ\pm 0.23 400 1.81 10160 19.43 ยฑ\pm 0.16 400 10.41 20194
Algorithm w8a, T=49749T=49749 SUSY, T=50000T=50000
AMR B|DB|D Time (s) Lโˆ—L^{\ast} AMR B|DB|D Time (s) Lโˆ—L^{\ast}
Raker [Shen2019Random] 2.97 ยฑ\pm 0.00 400 12.06 - 28.62 ยฑ\pm 0.44 400 9.02 -
LKMBooks [Li2022Worst] 2.98 ยฑ\pm 0.01 400 6.10 - 30.72 ยฑ\pm 1.33 400 6.45 0
M-OMD-S 2.83 ยฑ\pm 0.03 400 21.44 5116 29.03 ยฑ\pm 0.16 400 8.24 28582
Algorithm ijcnn1, T=141691T=141691 cod-rna, T=271617T=271617
AMR B|DB|D Time (s) Lโˆ—L^{\ast} AMR B|DB|D Time (s) Lโˆ—L^{\ast}
Raker [Shen2019Random] 9.53 ยฑ\pm 0.03 400 24.94 - 12.27 ยฑ\pm 0.57 400 52.24 -
LKMBooks [Li2022Worst] 9.57 ยฑ\pm 0.00 400 16.95 - 13.91 ยฑ\pm 1.31 400 25.55 -
M-OMD-S 9.42 ยฑ\pm 0.05 400 19.33 39208 13.09 ยฑ\pm 0.04 400 31.20 94326

7 Conclusion

Learnability is an essential problem in online kernel selection with memory constraint. Our work gave positive results on learnability through data-dependent regret analysis, in contrast to previous negative results obtained from worst-case regret analysis. We characterized the regret bounds via the kernel alignment and the cumulative losses, and gave new trade-offs between regret and memory constraint. If there is a kernel function that matches the data well, i.e., the kernel alignment and the cumulative losses are sub-linear, then sub-linear regret can be achieved within a \Theta(\ln{T}) memory constraint and the corresponding hypothesis space is learnable. Data-dependent regret analysis provides a new perspective for studying learnability in online kernel selection with memory constraint.

Deploying machine learning models on devices with limited computational resources is necessary for practical applications. Our algorithms limit the memory budget to O(\ln{T}) and can naturally be deployed on devices with limited memory. Even if the deployed models are not kernel classifiers, our work can guide how to allocate memory resources properly. For instance, (i) the optimal value of the memory budget depends on the hardness of the problem; (ii) our second algorithm provides the idea of memory sharing.

There remain several important directions for future work. Firstly, it is interesting to study other data complexities beyond kernel alignment and small-loss, such as the effective dimension of the kernel matrix. Secondly, it is important to investigate whether the O\left(\sqrt{K}\right) factor in the regret bound of M-OMD-H can be eliminated.

Appendix

Appendix A.1 Proof of Lemma 1

Proof.

We analyze S_{i} for a fixed i\in[K]. Let the number of removal operations be J. Denote by B=\alpha\mathcal{R}, \mathcal{J}=\{t_{r},r\in[J]\}, T_{r}=\{t_{r-1}+1,\ldots,t_{r}\}, and t_{0}=0. For any t\in T_{r}, if \nabla_{t,i}\neq 0, \neg\mathrm{con}(a(i)), and b_{t,i}=1, then ({\bm{x}}_{t},y_{t}) will be added into S_{i}. For simplicity, we define a new notation \nu_{t,i} as follows,

ฮฝt,i=๐•€ytโ€‹ft,iโ€‹(๐’™t)<1โ‹…๐•€ยฌconโ€‹(aโ€‹(i))โ‹…bt,i.\nu_{t,i}=\mathbb{I}_{y_{t}f_{t,i}({\bm{x}}_{t})<1}\cdot\mathbb{I}_{\neg\mathrm{con}(a(i))}\cdot b_{t,i}.

At the end of the trt_{r}-th round, the following equation can be derived,

|Si|=|Siโ€‹(trโˆ’1+1)|+โˆ‘t=trโˆ’1+1trฮฝt,i=BK,|S_{i}|=|S_{i}(t_{r-1}+1)|+\sum^{t_{r}}_{t=t_{r-1}+1}\nu_{t,i}=\frac{B}{K},

where |S_{i}(t_{r-1}+1)| is defined as the initial size of S_{i}.
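For intuition about the epoch structure used in this counting argument, the following minimal sketch simulates how a single buffer S_{i} evolves: an example is added whenever \nu_{t,i}=1, and once |S_{i}| reaches B/K half of the buffer is removed, so each later epoch starts with B/(2K) examples. The synthetic indicator stream, the removal of the oldest half, and all names are illustrative assumptions, not part of the proof.

```python
import random

def simulate_buffer_epochs(signals, B, K):
    """Simulate the buffer S_i of the counting argument.

    `signals` is the stream of indicators nu_{t,i} in {0, 1}.  Returns the
    number of removal operations J and the rounds t_r at which epochs end."""
    cap = B // K                              # |S_i| is capped at B / K
    buffer, J, epoch_ends = [], 0, []
    for t, nu in enumerate(signals, start=1):
        if nu:
            buffer.append(t)
        if len(buffer) == cap:                # the current epoch ends at t_r = t
            J += 1
            epoch_ends.append(t)
            del buffer[: cap // 2]            # keep B/(2K) examples; dropping the
                                              # oldest half is an illustrative choice
    return J, epoch_ends

# Usage sketch: additions occur with probability 0.3 per round.
random.seed(0)
stream = [int(random.random() < 0.3) for _ in range(2000)]
J, epochs = simulate_buffer_epochs(stream, B=400, K=4)
print(f"J = {J} removals; first epochs end at rounds {epochs[:3]}")
```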

Let s_{r}=t_{r-1}+1 and assume that there is no budget. We will present an expected bound on \sum^{\bar{t}}_{t=s_{r}}\nu_{t,i} for any \bar{t}>s_{r}. In the first epoch, s_{1}=1 and |S_{i}(s_{1})|=0. Taking expectation w.r.t. b_{t,i} gives

๐”ผโ€‹[โˆ‘t=s1tยฏฮฝt,i]=\displaystyle\mathbb{E}\left[\sum^{\bar{t}}_{t=s_{1}}\nu_{t,i}\right]= โˆ‘t=s1tยฏโ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2โ‹…๐•€โˆ‡t,iโ‰ 0โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2+โ€–โˆ‡^t,iโ€–โ„‹i2\displaystyle\sum^{\bar{t}}_{t=s_{1}}\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{t,i}\neq 0}}{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}+\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}
โ‰ค\displaystyle\leq 2k1โ€‹(1+โˆ‘t=2tยฏโ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)|Vt|โ€–โ„‹i2)โŸ๐’œ~[s1,tยฏ],ฮบi\displaystyle\frac{2}{k_{1}}\underbrace{\left(1+\sum^{\bar{t}}_{t=2}\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\frac{\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)}{|V_{t}|}\right\|^{2}_{\mathcal{H}_{i}}\right)}_{\tilde{\mathcal{A}}_{[s_{1},\bar{t}],\kappa_{i}}}
=\displaystyle= 2k1โ€‹๐’œ~[s1,tยฏ],ฮบi,\displaystyle\frac{2}{k_{1}}\tilde{\mathcal{A}}_{[s_{1},\bar{t}],\kappa_{i}},

where we use the fact ฮบiโ€‹(๐’™t,๐’™t)โ‰ฅk1\kappa_{i}({\bm{x}}_{t},{\bm{x}}_{t})\geq k_{1}. Let t1t_{1} be the minimal tยฏ\bar{t} such that

2k1โ€‹๐’œ~[s1,t1],ฮบiโ‰ฅBK.\frac{2}{k_{1}}\tilde{\mathcal{A}}_{[s_{1},t_{1}],\kappa_{i}}\geq\frac{B}{K}. (A1)

The first epoch will end at t1t_{1} in expectation. We define ๐’œ~T1,ฮบi:=๐’œ~[s1,t1],ฮบi\tilde{\mathcal{A}}_{T_{1},\kappa_{i}}:=\tilde{\mathcal{A}}_{[s_{1},t_{1}],\kappa_{i}}.

Next we consider r\geq 2. It must hold that |S_{i}(s_{r})|=\frac{B}{2K}. Similar to the case r=1, we can obtain

๐”ผโ€‹[โˆ‘t=srtยฏฮฝt,i]โ‰ค2k1โ€‹โˆ‘t=srtยฏโ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)|Vt|โ€–โ„‹i2โŸ๐’œ~[sr,tยฏ],ฮบi=2k1โ€‹๐’œ~[sr,tยฏ],ฮบi.\mathbb{E}\left[\sum^{\bar{t}}_{t=s_{r}}\nu_{t,i}\right]\leq\frac{2}{k_{1}}\underbrace{\sum^{\bar{t}}_{t=s_{r}}\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\frac{\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)}{|V_{t}|}\right\|^{2}_{\mathcal{H}_{i}}}_{\tilde{\mathcal{A}}_{[s_{r},\bar{t}],\kappa_{i}}}=\frac{2}{k_{1}}\tilde{\mathcal{A}}_{[s_{r},\bar{t}],\kappa_{i}}.

Let trt_{r} be the minimal tยฏ\bar{t} such that

2k1โ€‹๐’œ~[sr,tยฏ],ฮบiโ‰ฅB2โ€‹K,\frac{2}{k_{1}}\tilde{\mathcal{A}}_{[s_{r},\bar{t}],\kappa_{i}}\geq\frac{B}{2K}, (A2)

Let \tilde{\mathcal{A}}_{T_{r},\kappa_{i}}:=\tilde{\mathcal{A}}_{[s_{r},t_{r}],\kappa_{i}}. Combining (A1) and (A2), and summing over r=1,\ldots,J yields

BK+Bโ€‹(Jโˆ’1)2โ€‹Kโ‰ค\displaystyle\frac{B}{K}+\frac{B(J-1)}{2K}\leq 2k1โ€‹๐’œ~T1,ฮบi+โˆ‘r=2J2k1โ€‹๐’œ~Tr,ฮบi\displaystyle\frac{2}{k_{1}}\tilde{\mathcal{A}}_{T_{1},\kappa_{i}}+\sum^{J}_{r=2}\frac{2}{k_{1}}\tilde{\mathcal{A}}_{T_{r},\kappa_{i}}
โ‰ค\displaystyle\leq 2k1โ€‹โˆ‘t=s1Tโ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)|Vt|โ€–โ„‹i2โŸ๐’œ~T,ฮบi\displaystyle\frac{2}{k_{1}}\underbrace{\sum^{T}_{t=s_{1}}\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\frac{\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)}{|V_{t}|}\right\|^{2}_{\mathcal{H}_{i}}}_{\tilde{\mathcal{A}}_{T,\kappa_{i}}}
โ‰ค\displaystyle\leq 2k1โ€‹๐’œ~T,ฮบi.\displaystyle\frac{2}{k_{1}}\tilde{\mathcal{A}}_{T,\kappa_{i}}.

Rearranging terms gives

Jโ‰ค4โ€‹Kโ€‹๐’œ~T,ฮบiBโ€‹k1โˆ’1โ‰ค4โ€‹Kโ€‹๐’œ~T,ฮบiBโ€‹k1.J\leq\frac{4K\tilde{\mathcal{A}}_{T,\kappa_{i}}}{Bk_{1}}-1\leq\frac{4K\tilde{\mathcal{A}}_{T,\kappa_{i}}}{Bk_{1}}. (A3)

Taking expectation w.r.t. the randomness of reservoir sampling gives

๐”ผโ€‹[J]โ‰ค4โ€‹KBโ€‹k1โ‹…๐”ผโ€‹[๐’œ~T,ฮบi]โ‰ค12โ€‹KBโ€‹k1โ€‹๐’œT,ฮบiโ‹…(1+lnโกTM)+32โ€‹KBโ€‹k1,\mathbb{E}[J]\leq\frac{4K}{Bk_{1}}\cdot\mathbb{E}[\tilde{\mathcal{A}}_{T,\kappa_{i}}]\leq\frac{12K}{Bk_{1}}\mathcal{A}_{T,\kappa_{i}}\cdot\left(1+\frac{\ln{T}}{M}\right)+\frac{32K}{Bk_{1}},

where the last inequality comes from Lemma A.1.1. Omitting the last constant term concludes the proof. โˆŽ

Lemma A.1.1.

The reservoir sampling guarantees

โˆ€iโˆˆ[K],๐”ผโ€‹[๐’œ~T,ฮบi]โ‰ค3โ€‹๐’œT,ฮบi+8+3โ€‹๐’œT,ฮบiโ€‹lnโกTM.\forall i\in[K],\quad\mathbb{E}\left[\tilde{\mathcal{A}}_{T,\kappa_{i}}\right]\leq 3\mathcal{A}_{T,\kappa_{i}}+8+\frac{3\mathcal{A}_{T,\kappa_{i}}\ln{T}}{M}.
Proof.

Let ฮผt,i=โˆ’1tโ€‹โˆ‘ฯ„=1tyฯ„โ€‹ฮบiโ€‹(๐’™ฯ„,โ‹…)\mu_{t,i}=-\frac{1}{t}\sum^{t}_{\tau=1}y_{\tau}\kappa_{i}({\bm{x}}_{\tau},\cdot) and ฯ„0=M\tau_{0}=M. For tโ‰คฯ„0t\leq\tau_{0}, it can be verified that

๐’œ~ฯ„0,ฮบi=\displaystyle\tilde{\mathcal{A}}_{\tau_{0},\kappa_{i}}= 1+โˆ‘t=2ฯ„0โ€–โˆ’ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’ฮผtโˆ’1,iโ€–โ„‹i2\displaystyle 1+\sum^{\tau_{0}}_{t=2}\left\|-y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\mu_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}
=\displaystyle= 1+โˆ‘t=2ฯ„0โ€–โˆ’ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’ฮผt,i+ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2\displaystyle 1+\sum^{\tau_{0}}_{t=2}\left\|-y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\mu_{t,i}+\mu_{t,i}-\mu_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}
โ‰ค\displaystyle\leq 1+2โ€‹๐’œ[2:ฯ„0],ฮบi+2โ€‹โˆ‘t=2ฯ„0โ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2,\displaystyle 1+2\mathcal{A}_{[2:\tau_{0}],\kappa_{i}}+2\sum^{\tau_{0}}_{t=2}\left\|\mu_{t,i}-\mu_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}},

where ฮผ0,i=0\mu_{0,i}=0. Let VtV_{t} be the reservoir at the beginning of round tt. Next we consider the case t>ฯ„0t>\tau_{0}.

๐’œ~[ฯ„0:T],ฮบi=\displaystyle\tilde{\mathcal{A}}_{[\tau_{0}:T],\kappa_{i}}= โˆ‘t=ฯ„0+1Tโ€–โˆ’ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’ฮผ~tโˆ’1,iโ€–โ„‹i2\displaystyle\sum^{T}_{t=\tau_{0}+1}\left\|-y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\tilde{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}
โ‰ค\displaystyle\leq โˆ‘t=ฯ„0+1T3โ€‹[โ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)+ฮผt,iโ€–โ„‹i2+โ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2+โ€–ฮผtโˆ’1,iโˆ’ฮผ~tโˆ’1,iโ€–โ„‹i2]\displaystyle\sum^{T}_{t=\tau_{0}+1}3\left[\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)+\mu_{t,i}\right\|^{2}_{\mathcal{H}_{i}}+\left\|\mu_{t,i}-{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}+\left\|{\mu}_{t-1,i}-\tilde{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}\right]
=\displaystyle= 3โ€‹๐’œ[ฯ„0:T],ฮบi+3โ€‹โˆ‘t=ฯ„0+1Tโ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2+3โ€‹โˆ‘t=ฯ„0+1Tโ€–ฮผtโˆ’1,i+1|Vt|โ€‹โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)โ€–โ„‹i2.\displaystyle 3\mathcal{A}_{[\tau_{0}:T],\kappa_{i}}+3\sum^{T}_{t=\tau_{0}+1}\left\|\mu_{t,i}-{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}+3\sum^{T}_{t=\tau_{0}+1}\left\|{\mu}_{t-1,i}+\frac{1}{|V_{t}|}\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)\right\|^{2}_{\mathcal{H}_{i}}.

Taking expectation w.r.t. the reservoir sampling yields

๐”ผโ€‹[๐’œ~T,ฮบi]\displaystyle\mathbb{E}[\tilde{\mathcal{A}}_{T,\kappa_{i}}]
=\displaystyle= ๐’œ~ฯ„0,ฮบi+๐”ผโ€‹[๐’œ~[ฯ„0:T],ฮบi]\displaystyle\tilde{\mathcal{A}}_{\tau_{0},\kappa_{i}}+\mathbb{E}[\tilde{\mathcal{A}}_{[\tau_{0}:T],\kappa_{i}}]
โ‰ค\displaystyle\leq 1+3โ€‹๐’œT,ฮบi+3โ€‹โˆ‘t=2Tโ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2+3โ€‹โˆ‘t=ฯ„0+1T๐”ผโ€‹[โ€–ฮผtโˆ’1,i+1|Vt|โ€‹โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)โ€–โ„‹i2]byโ€‹Lemmaโ€‹A.8.1\displaystyle 1+3\mathcal{A}_{T,\kappa_{i}}+3\sum^{T}_{t=2}\left\|\mu_{t,i}-{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}+3\sum^{T}_{t=\tau_{0}+1}\mathbb{E}\left[\left\|{\mu}_{t-1,i}+\frac{1}{|V_{t}|}\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)\right\|^{2}_{\mathcal{H}_{i}}\right]\quad\quad\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lem:JCST2022:variance_reservoir_estimator}
โ‰ค\displaystyle\leq 1+3โ€‹๐’œT,ฮบi+3โ€‹โˆ‘t=2Tโ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2+โˆ‘t=ฯ„0+1T3โ€‹๐’œtโˆ’1,ฮบi(tโˆ’1)โ€‹|Vt|\displaystyle 1+3\mathcal{A}_{T,\kappa_{i}}+3\sum^{T}_{t=2}\left\|\mu_{t,i}-{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}+\sum^{T}_{t=\tau_{0}+1}\frac{3\mathcal{A}_{t-1,\kappa_{i}}}{(t-1)|V_{t}|}
โ‰ค\displaystyle\leq 1+3โ€‹๐’œT,ฮบi+โˆ‘t=2T12t2+3โ€‹๐’œT,ฮบiโ€‹lnโกTM\displaystyle 1+3\mathcal{A}_{T,\kappa_{i}}+\sum^{T}_{t=2}\frac{12}{t^{2}}+\frac{3\mathcal{A}_{T,\kappa_{i}}\ln{T}}{M}
โ‰ค\displaystyle\leq 1+3โ€‹๐’œT,ฮบi+7+3โ€‹๐’œT,ฮบiโ€‹lnโกTM,\displaystyle 1+3\mathcal{A}_{T,\kappa_{i}}+7+\frac{3\mathcal{A}_{T,\kappa_{i}}\ln{T}}{M},

where |Vt|=M|V_{t}|=M for all tโ‰ฅฯ„0t\geq\tau_{0}. โˆŽ

Appendix A.2 Proof of Theorem 1

Proof.

By the convexity of the hinge loss function, we decompose the regret as follows

Regโ€‹(f)=\displaystyle\mathrm{Reg}(f)= โˆ‘t=1Tโ„“โ€‹(โˆ‘j=1Kpt,jโ€‹ft,jโ€‹(๐’™t),yt)โˆ’โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt)\displaystyle\sum^{T}_{t=1}\ell\left(\sum^{K}_{j=1}p_{t,j}f_{t,j}({\bm{x}}_{t}),y_{t}\right)-\sum^{T}_{t=1}\ell\left(f({\bm{x}}_{t}),y_{t}\right)
โ‰ค\displaystyle\leq โˆ‘t=1Tโˆ‘j=1Kpt,jโ€‹โ„“โ€‹(ft,jโ€‹(๐’™t),yt)โˆ’โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt)\displaystyle\sum^{T}_{t=1}\sum^{K}_{j=1}p_{t,j}\ell\left(f_{t,j}({\bm{x}}_{t}),y_{t}\right)-\sum^{T}_{t=1}\ell\left(f({\bm{x}}_{t}),y_{t}\right)
โ‰ค\displaystyle\leq โˆ‘t=1T[โˆ‘j=1Kpt,jโ€‹โ„“โ€‹(ft,jโ€‹(๐’™t),yt)โˆ’โ„“โ€‹(ft,iโ€‹(๐’™t),yt)]โŸ๐’ฏ1+โˆ‘tโˆˆET,i[โ„“โ€‹(ft,iโ€‹(๐’™t),yt)โˆ’โ„“โ€‹(fโ€‹(๐’™t),yt)]โŸ๐’ฏ2,\displaystyle\underbrace{\sum^{T}_{t=1}\left[\sum^{K}_{j=1}p_{t,j}\ell(f_{t,j}({\bm{x}}_{t}),y_{t})-\ell(f_{t,i}({\bm{x}}_{t}),y_{t})\right]}_{\mathcal{T}_{1}}+\underbrace{\sum_{t\in E_{T,i}}\left[\ell(f_{t,i}({\bm{x}}_{t}),y_{t})-\ell(f({\bm{x}}_{t}),y_{t})\right]}_{\mathcal{T}_{2}},

where ET,i={tโˆˆ[T],โˆ‡t,iโ‰ 0}E_{T,i}=\{t\in[T],\nabla_{t,i}\neq 0\}.

A.2.1 Analyzing ๐’ฏ1\mathcal{T}_{1}

The following analysis is the same as the proof of Theorem 3.1 in [Bubeck2012Regret]. Let c_{t,i}:=\ell(f_{t,i}({\bm{x}}_{t}),y_{t}). The probability update is as follows,

pt+1,i=wt+1,iโˆ‘j=1Kwt+1,j,wt+1,i=expโก(โˆ’ฮทt+1โ€‹โˆ‘ฯ„=1tcฯ„,i).p_{t+1,i}=\frac{w_{t+1,i}}{\sum^{K}_{j=1}w_{t+1,j}},\quad w_{t+1,i}=\exp\left(-\eta_{t+1}\sum^{t}_{\tau=1}c_{\tau,i}\right).
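For concreteness, the following minimal sketch implements this exponentially weighted update over K experts. The self-adaptive learning-rate schedule shown, \eta_{t}=\sqrt{2\ln K/(1+\sum_{\tau<t}\langle{\bm{p}}_{\tau},{\bm{c}}^{2}_{\tau}\rangle)}, is the one suggested by the bound derived below and is an illustrative assumption, as are the placeholder losses.

```python
import numpy as np

def weighted_average_forecaster(loss_matrix):
    """Exponentially weighted average over K experts with a time-varying
    learning rate, following p_{t+1,i} proportional to exp(-eta_{t+1} * C_{t,i}).

    `loss_matrix` has shape (T, K); entry (t, i) is c_{t,i}."""
    T, K = loss_matrix.shape
    cum_losses = np.zeros(K)          # C_{t,i} = sum_{tau <= t} c_{tau,i}
    second_moment = 0.0               # sum_{tau} <p_tau, c_tau^2>
    total_loss = 0.0
    p = np.full(K, 1.0 / K)           # p_1 is uniform
    for t in range(T):
        c = loss_matrix[t]
        total_loss += p @ c           # learner's loss with p_t
        second_moment += p @ (c ** 2)
        cum_losses += c
        # Self-adaptive learning rate eta_{t+1}; the schedule is illustrative.
        eta = np.sqrt(2.0 * np.log(K) / (1.0 + second_moment))
        w = np.exp(-eta * (cum_losses - cum_losses.min()))  # shift for stability
        p = w / w.sum()
    return total_loss, p

# Usage sketch with placeholder losses in [0, 1].
rng = np.random.default_rng(0)
losses = rng.random((1000, 5)) * np.array([0.3, 0.5, 0.7, 0.9, 1.0])
total, final_p = weighted_average_forecaster(losses)
print(f"learner loss {total:.1f} vs best expert {losses.sum(axis=0).min():.1f}")
```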

Similar to the analysis of Exp3 [Bubeck2012Regret], we define a potential function ฮ“tโ€‹(ฮทt)\Gamma_{t}(\eta_{t}) as follows,

ฮ“tโ€‹(ฮทt):=1ฮทtโ€‹lnโ€‹โˆ‘i=1Kpt,iโ€‹expโก(โˆ’ฮทtโ€‹ct,i)โ‰คโˆ’โˆ‘i=1Kpt,iโ€‹ct,i+12โ€‹ฮทtโ€‹โˆ‘i=1Kpt,iโ€‹ct,i2,\displaystyle\Gamma_{t}(\eta_{t}):=\frac{1}{\eta_{t}}\ln\sum^{K}_{i=1}p_{t,i}\exp(-\eta_{t}c_{t,i})\leq-\sum^{K}_{i=1}p_{t,i}c_{t,i}+\frac{1}{2}\eta_{t}\sum^{K}_{i=1}p_{t,i}c^{2}_{t,i},

where we use the following two inequalities

lnโกxโ‰คxโˆ’1,โˆ€x>0,expโก(โˆ’x)โ‰ค1โˆ’x+x22,โˆ€xโ‰ฅ0.\ln{x}\leq x-1,\forall x>0,\quad\exp(-x)\leq 1-x+\frac{x^{2}}{2},\forall x\geq 0.

Summing over tโˆˆ[T]t\in[T] yields

โˆ‘t=1Tฮ“tโ€‹(ฮทt)โ‰คโˆ’โˆ‘t=1TโŸจ๐’‘t,๐’„tโŸฉ+โˆ‘t=1Tโˆ‘i=1Kฮทt2โ€‹pt,iโ€‹ct,i2.\sum^{T}_{t=1}\Gamma_{t}(\eta_{t})\leq-\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle+\sum^{T}_{t=1}\sum^{K}_{i=1}\frac{\eta_{t}}{2}p_{t,i}c^{2}_{t,i}. (A4)

On the other hand, by the definition of pt,ip_{t,i}, we have

ฮ“tโ€‹(ฮทt)=\displaystyle\Gamma_{t}(\eta_{t})= 1ฮทtโ€‹lnโกโˆ‘i=1Kexpโก(โˆ’ฮทtโ€‹โˆ‘ฯ„=1tโˆ’1cฯ„,i)โ€‹expโก(โˆ’ฮทtโ€‹ct,i)โˆ‘j=1Kexpโก(โˆ’ฮทtโ€‹โˆ‘ฯ„=1tโˆ’1cฯ„,j)\displaystyle\frac{1}{\eta_{t}}\ln\frac{\sum^{K}_{i=1}\exp\left(-\eta_{t}\sum^{t-1}_{\tau=1}c_{\tau,i}\right)\exp(-\eta_{t}c_{t,i})}{\sum^{K}_{j=1}\exp\left(-\eta_{t}\sum^{t-1}_{\tau=1}c_{\tau,j}\right)}
=\displaystyle= 1ฮทtโ€‹lnโก1Kโ€‹โˆ‘i=1Kexpโก(โˆ’ฮทtโ€‹โˆ‘ฯ„=1tcฯ„,i)1Kโ€‹โˆ‘j=1Kexpโก(โˆ’ฮทtโ€‹โˆ‘ฯ„=1tโˆ’1cฯ„,j)\displaystyle\frac{1}{\eta_{t}}\ln\frac{\frac{1}{K}\sum^{K}_{i=1}\exp\left(-\eta_{t}\sum^{t}_{\tau=1}c_{\tau,i}\right)}{\frac{1}{K}\sum^{K}_{j=1}\exp\left(-\eta_{t}\sum^{t-1}_{\tau=1}c_{\tau,j}\right)}
=\displaystyle= ฮ“ยฏtโ€‹(ฮทt)โˆ’ฮ“ยฏtโˆ’1โ€‹(ฮทt),\displaystyle\bar{\Gamma}_{t}(\eta_{t})-\bar{\Gamma}_{t-1}(\eta_{t}),

where ฮ“ยฏtโ€‹(ฮท)=1ฮทโ€‹lnโก1Kโ€‹โˆ‘j=1Kexpโก(โˆ’ฮทโ€‹โˆ‘ฯ„=1tcฯ„,j).\bar{\Gamma}_{t}(\eta)=\frac{1}{\eta}\ln\frac{1}{K}\sum^{K}_{j=1}\exp\left(-\eta\sum^{t}_{\tau=1}c_{\tau,j}\right).
Without loss of generality, let ฮ“ยฏ0โ€‹(ฮท)=0\bar{\Gamma}_{0}(\eta)=0. Summing over t=1,โ€ฆ,Tt=1,\ldots,T yields

โˆ‘t=1Tฮ“tโ€‹(ฮทt)=ฮ“ยฏTโ€‹(ฮทT)โˆ’ฮ“ยฏ0โ€‹(ฮท1)+โˆ‘t=1Tโˆ’1[ฮ“ยฏtโ€‹(ฮทt)โˆ’ฮ“ยฏtโ€‹(ฮทt+1)],\sum^{T}_{t=1}\Gamma_{t}(\eta_{t})=\bar{\Gamma}_{T}(\eta_{T})-\bar{\Gamma}_{0}(\eta_{1})+\sum^{T-1}_{t=1}\left[\bar{\Gamma}_{t}(\eta_{t})-\bar{\Gamma}_{t}(\eta_{t+1})\right],

where ฮ“ยฏTโ€‹(ฮทT)โ‰ฅ1ฮทTโ€‹lnโก1Kโˆ’โˆ‘ฯ„=1Tcฯ„,i\bar{\Gamma}_{T}(\eta_{T})\geq\frac{1}{\eta_{T}}\ln\frac{1}{K}-\sum^{T}_{\tau=1}c_{\tau,i}. Combining with the upper bound (A4), we obtain

โˆ‘t=1TโŸจ๐’‘t,๐’„tโŸฉโˆ’โˆ‘ฯ„=1Tcฯ„,iโ‰ค1ฮทTโ€‹lnโกK+โˆ‘t=1Tโˆ’1[ฮ“ยฏtโ€‹(ฮทt+1)โˆ’ฮ“ยฏtโ€‹(ฮทt)]+โˆ‘t=1Tโˆ‘i=1Kฮทt2โ€‹pt,iโ€‹ct,i2.\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle-\sum^{T}_{\tau=1}c_{\tau,i}\leq\frac{1}{\eta_{T}}\ln{K}+\sum^{T-1}_{t=1}\left[\bar{\Gamma}_{t}(\eta_{t+1})-\bar{\Gamma}_{t}(\eta_{t})\right]+\sum^{T}_{t=1}\sum^{K}_{i=1}\frac{\eta_{t}}{2}p_{t,i}c^{2}_{t,i}.

For simplicity, let Cยฏt,j:=โˆ‘ฯ„=1tcฯ„,j\bar{C}_{t,j}:=\sum^{t}_{\tau=1}c_{\tau,j}. The first derivative of ฮ“ยฏtโ€‹(ฮท)\bar{\Gamma}_{t}(\eta) w.r.t. ฮท\eta is as follows

dโ€‹ฮ“ยฏtโ€‹(ฮท)dโ€‹ฮท=\displaystyle\frac{\mathrm{d}\,\bar{\Gamma}_{t}(\eta)}{\mathrm{d}\,\eta}= โˆ’lnโ€‹โˆ‘j=1Kexpโก(โˆ’ฮทโ€‹Cยฏt,j)Kฮท2โˆ’1Kโ€‹โˆ‘j=1KCยฏt,jโ€‹expโก(โˆ’ฮทโ€‹Cยฏt,j)ฮทKโ€‹โˆ‘j=1Kexpโก(โˆ’ฮทโ€‹Cยฏt,j)\displaystyle\frac{-\ln\sum^{K}_{j=1}\frac{\exp\left(-\eta\bar{C}_{t,j}\right)}{K}}{\eta^{2}}-\frac{\frac{1}{K}\sum^{K}_{j=1}\bar{C}_{t,j}\exp\left(-\eta\bar{C}_{t,j}\right)}{\frac{\eta}{K}\sum^{K}_{j=1}\exp\left(-\eta\bar{C}_{t,j}\right)}
=\displaystyle= 1ฮท2โ€‹KLโ€‹(p~t,1K)\displaystyle\frac{1}{\eta^{2}}\mathrm{KL}(\tilde{p}_{t},\frac{1}{K})
โ‰ฅ\displaystyle\geq 0\displaystyle 0

where p~t,j=expโก(โˆ’ฮทโ€‹Cยฏt,j)โˆ‘i=1Kexpโก(โˆ’ฮทโ€‹Cยฏt,i)\tilde{p}_{t,j}=\frac{\exp\left(-\eta\bar{C}_{t,j}\right)}{\sum^{K}_{i=1}\exp\left(-\eta\bar{C}_{t,i}\right)}. Since ฮทt+1โ‰คฮทt\eta_{t+1}\leq\eta_{t}, we have ฮ“ยฏtโ€‹(ฮทt+1)โ‰คฮ“ยฏtโ€‹(ฮทt)\bar{\Gamma}_{t}(\eta_{t+1})\leq\bar{\Gamma}_{t}(\eta_{t}). Combining all results, we have

โˆ‘t=1TโŸจ๐’‘t,๐’„tโŸฉโˆ’โˆ‘ฯ„=1Tcฯ„,i\displaystyle\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle-\sum^{T}_{\tau=1}c_{\tau,i}
โ‰ค\displaystyle\leq lnโกKฮทTโˆ’lnโกKฮท1+โˆ‘t=1Tโˆ‘i=1Kฮทt2โ€‹pt,iโ€‹ct,i2\displaystyle\frac{\ln{K}}{\eta_{T}}-\frac{\ln{K}}{\eta_{1}}+\sum^{T}_{t=1}\sum^{K}_{i=1}\frac{\eta_{t}}{2}p_{t,i}c^{2}_{t,i}
โ‰ค\displaystyle\leq lnโกK2โ‹…1+โˆ‘ฯ„=1Tโˆ’1โŸจ๐’‘ฯ„,๐’„ฯ„2โŸฉโˆ’lnโกK2+lnโกKโ€‹(2โ€‹โˆ‘ฯ„=1TโŸจ๐’‘ฯ„,๐’„ฯ„2โŸฉ+maxt,jโกct,j2)byโ€‹Lemmaโ€‹A.8.2\displaystyle\frac{\sqrt{\ln{K}}}{\sqrt{2}}\cdot\sqrt{1+\sum^{T-1}_{\tau=1}\langle{\bm{p}}_{\tau},{\bm{c}}^{2}_{\tau}\rangle}-\frac{\sqrt{\ln{K}}}{\sqrt{2}}+\sqrt{\ln{K}}\left(\sqrt{2\sum^{T}_{\tau=1}\langle{\bm{p}}_{\tau},{\bm{c}}^{2}_{\tau}\rangle}+\frac{\max_{t,j}c_{t,j}}{\sqrt{2}}\right)\quad\quad\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lem:JCST2022:analysis_second_moment_TV}
โ‰ฒ\displaystyle\lesssim 32โ€‹maxt,jโกct,jโ‹…โˆ‘ฯ„=1TโŸจ๐’‘ฯ„,๐’„ฯ„โŸฉโ€‹lnโกK.\displaystyle\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\cdot\sum^{T}_{\tau=1}\langle{\bm{p}}_{\tau},{\bm{c}}_{\tau}\rangle\ln{K}}. (A5)

Solving for โˆ‘t=1TโŸจ๐’‘t,๐’„tโŸฉ\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle gives

๐’ฏ1=โˆ‘t=1T[โŸจ๐’‘t,๐’„tโŸฉโˆ’ct,i]โ‰ค32โ€‹maxt,jโกct,jโ‹…โˆ‘ฯ„=1Tcฯ„,iโ€‹lnโกK+92โ€‹maxt,jโกct,jโ‹…lnโกK.\mathcal{T}_{1}=\sum^{T}_{t=1}[\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle-c_{t,i}]\leq\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\cdot\sum^{T}_{\tau=1}c_{\tau,i}\ln{K}}+\frac{9}{2}\max_{t,j}c_{t,j}\cdot\ln{K}. (A6)
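For the reader's convenience, the step from (A5) to (A6) is the standard self-bounding argument: if x-a\leq c\sqrt{x} with x,a,c\geq 0, then \sqrt{x}\leq\frac{c+\sqrt{c^{2}+4a}}{2}, and squaring together with \sqrt{c^{2}+4a}\leq c+2\sqrt{a} gives x\leq a+c\sqrt{a}+c^{2}. Applying this with x=\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle, a=\sum^{T}_{\tau=1}c_{\tau,i}, and c=\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\ln{K}} gives c\sqrt{a}=\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\cdot\sum^{T}_{\tau=1}c_{\tau,i}\ln{K}} and c^{2}=\frac{9}{2}\max_{t,j}c_{t,j}\ln{K}, which is exactly the right-hand side of (A6).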

A.2.2 Analyzing ๐’ฏ2\mathcal{T}_{2}

We decompose ET,iE_{T,i} as follows.

Ei=\displaystyle E_{i}= {tโˆˆET,i:conโ€‹(aโ€‹(i))},\displaystyle\{t\in E_{T,i}:\mathrm{con}(a(i))\},
๐’ฅi=\displaystyle\mathcal{J}_{i}= {tโˆˆET,i:|Si|=ฮฑโ€‹โ„›i,bt,i=1},\displaystyle\{t\in E_{T,i}:|S_{i}|=\alpha\mathcal{R}_{i},b_{t,i}=1\},
Eยฏi=\displaystyle\bar{E}_{i}= ET,iโˆ–(Eiโˆช๐’ฅi).\displaystyle E_{T,i}\setminus(E_{i}\cup\mathcal{J}_{i}).

We separately analyze the regret in EiE_{i}, ๐’ฅi\mathcal{J}_{i} and Eยฏi\bar{E}_{i}.
Case 1: regret in EiE_{i}
For any f\in\mathbb{H}_{i}, the convexity of the loss function gives

โ„“โ€‹(ft,iโ€‹(๐’™t),yt)โˆ’โ„“โ€‹(fโ€‹(๐’™t),yt)\displaystyle\ell(f_{t,i}({\bm{x}}_{t}),y_{t})-\ell(f({\bm{x}}_{t}),y_{t})
โ‰ค\displaystyle\leq โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle
=\displaystyle= โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡^t,iโŸฉโŸฮž1+โŸจft,iโ€ฒโˆ’f,โˆ‡iโ€‹(st),iโŸฉโŸฮž2+โŸจft,iโ€ฒโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡t,iโˆ’โˆ‡^t,iโŸฉ\displaystyle\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\hat{\nabla}_{t,i}\rangle}_{\Xi_{1}}+\underbrace{\langle f^{\prime}_{t,i}-f,\nabla_{i(s_{t}),i}\rangle}_{\Xi_{2}}+\langle f^{\prime}_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle+\langle f_{t,i}-f^{\prime}_{t,i},\nabla_{t,i}-\hat{\nabla}_{t,i}\rangle
=\displaystyle= ฮž1+ฮž2+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡iโ€‹(st),iโˆ’โˆ‡^t,iโŸฉ+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ\displaystyle\Xi_{1}+\Xi_{2}+\langle f_{t,i}-f^{\prime}_{t,i},\nabla_{i(s_{t}),i}-\hat{\nabla}_{t,i}\rangle+\langle f_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle
โ‰ค\displaystyle\leq [โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)]+โ€–ft,iโˆ’fโ€–โ‹…ฮณt,iโŸฮž3+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡iโ€‹(st),iโˆ’โˆ‡^t,iโŸฉโˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)โŸฮž4,\displaystyle\left[\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})\right]+\underbrace{\left\|f_{t,i}-f\right\|\cdot\gamma_{t,i}}_{\Xi_{3}}+\underbrace{\left\langle f_{t,i}-f^{\prime}_{t,i},\nabla_{i(s_{t}),i}-\hat{\nabla}_{t,i}\right\rangle-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})}_{\Xi_{4}},

where the standard analysis of OMD [Chiang2012Online] gives

ฮž1โ‰ค\displaystyle\Xi_{1}\leq โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)โˆ’โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ),\displaystyle\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})-\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i}),
ฮž2โ‰ค\displaystyle\Xi_{2}\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ftโˆ’1,iโ€ฒ).\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f^{\prime}_{t-1,i}).

Substituting into ฮณt,i\gamma_{t,i} and summing over tโˆˆEit\in E_{i} gives

โˆ‘tโˆˆEiฮž3โ‰ค\displaystyle\sum_{t\in E_{i}}\Xi_{3}\leq โˆ‘tโˆˆEimaxtโกโ€–ft,iโˆ’fโ€–โ„‹iโ‹…โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i21+โˆ‘ฯ„โ‰คtโ€–โˆ‡ฯ„,iโˆ’โˆ‡^ฯ„,iโ€–โ„‹i2โ‹…๐•€โˆ‡ฯ„,iโ‰ 0\displaystyle\sum_{t\in E_{i}}\frac{\max_{t}\|f_{t,i}-f\|_{\mathcal{H}_{i}}\cdot\left\|\nabla_{t,i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}}{\sqrt{1+\sum_{\tau\leq t}\left\|\nabla_{\tau,i}-\hat{\nabla}_{\tau,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{\tau,i}\neq 0}}}
โ‰ค\displaystyle\leq 2โ€‹(U+ฮปi)โ‹…โˆ‘tโˆˆEiโ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i21+โˆ‘ฯ„โ‰คtโ€–โˆ‡ฯ„,iโˆ’โˆ‡^ฯ„,iโ€–โ„‹i2โ‹…๐•€โˆ‡ฯ„,iโ‰ 0\displaystyle 2(U+\lambda_{i})\cdot\sum_{t\in E_{i}}\frac{\left\|\nabla_{t,i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}}{\sqrt{1+\sum_{\tau\leq t}\left\|\nabla_{\tau,i}-\hat{\nabla}_{\tau,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{\tau,i}\neq 0}}}
โ‰ค\displaystyle\leq 4โ€‹(U+ฮปi)โ€‹๐’œ~T,ฮบi,\displaystyle 4(U+\lambda_{i})\sqrt{\tilde{\mathcal{A}}_{T,\kappa_{i}}},

where โ€–ft,iโ€–โ„‹iโ‰คU+ฮปi\|f_{t,i}\|_{\mathcal{H}_{i}}\leq U+\lambda_{i}.
According to Lemma A.8.6, we can obtain

โˆ‘tโˆˆEiฮž4โ‰คฮปi2โ€‹โˆ‘tโˆˆEiโ€–โˆ‡iโ€‹(st),iโˆ’โˆ‡^t,iโ€–โ„‹i2โ‰ค2โ€‹ฮปiโ€‹๐’œ~T,ฮบi.\sum_{t\in E_{i}}\Xi_{4}\leq\frac{\lambda_{i}}{2}\sum_{t\in E_{i}}\left\|\nabla_{i(s_{t}),i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}\leq 2\lambda_{i}\tilde{\mathcal{A}}_{T,\kappa_{i}}.

Case 2: regret in Eยฏi\bar{E}_{i}
We decompose the instantaneous regret as follows,

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle
=\displaystyle= โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡^t,iโŸฉโŸฮž1+โŸจft,iโ€ฒโˆ’f,โˆ‡~t,iโŸฉโŸฮž2+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡~t,iโˆ’โˆ‡^t,iโŸฉโŸฮž3+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ\displaystyle\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\hat{\nabla}_{t,i}\rangle}_{\Xi_{1}}+\underbrace{\langle f^{\prime}_{t,i}-f,\tilde{\nabla}_{t,i}\rangle}_{\Xi_{2}}+\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\tilde{\nabla}_{t,i}-\hat{\nabla}_{t,i}\rangle}_{\Xi_{3}}+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+ฮž3โˆ’[โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)+โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ)]\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+\Xi_{3}-\left[\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})+\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i})\right]
=\displaystyle= โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+ฮž3โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)โˆ’ฮปi2โ€‹โ€–โˆ‡^t,iโ€–โ„‹i2\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+\Xi_{3}-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})-\frac{\lambda_{i}}{2}\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉโˆ’ฮปi2โ€‹โ€–โˆ‡~t,iโˆ’โˆ‡^t,iโ€–โ„‹i2โˆ’ฮปi2โ€‹โ€–โˆ‡^t,iโ€–โ„‹i2byโ€‹Lemmaโ€‹A.8.6\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle-\frac{\lambda_{i}}{2}\|\tilde{\nabla}_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}-\frac{\lambda_{i}}{2}\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lemma:JCST2022:property_of_OMD}
=\displaystyle= โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+ฮปi2โ€‹(โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2(โ„™โ€‹[bt,i=1])2โ€‹๐•€bt,i=1โˆ’โ€–โˆ‡^t,iโ€–โ„‹i2),\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+\frac{\lambda_{i}}{2}\left(\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t,i}=1])^{2}}\mathbb{I}_{b_{t,i}=1}-\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\right),

where ฮž1+ฮž2\Xi_{1}+\Xi_{2} follows the analysis in Case 1.
Case 3: regret in ๐’ฅi\mathcal{J}_{i}
Recall that the second mirror update is

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(1))}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\langle f,\tilde{\nabla}_{t,i}\rangle+\mathcal{B}_{\psi_{i}}(f,\bar{f}^{\prime}_{t-1,i}(1))\right\}.

We still decompose the instantaneous regret as follows

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ=โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡^t,iโŸฉโŸฮž1+โŸจft,iโ€ฒโˆ’f,โˆ‡~t,iโŸฉโŸฮž2+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡~t,iโˆ’โˆ‡^t,iโŸฉโŸฮž3+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ.\langle f_{t,i}-f,\nabla_{t,i}\rangle=\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\hat{\nabla}_{t,i}\rangle}_{\Xi_{1}}+\underbrace{\langle f^{\prime}_{t,i}-f,\tilde{\nabla}_{t,i}\rangle}_{\Xi_{2}}+\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\tilde{\nabla}_{t,i}-\hat{\nabla}_{t,i}\rangle}_{\Xi_{3}}+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle.

We reanalyze ฮž1\Xi_{1} and ฮž2\Xi_{2} as follows

ฮž1\displaystyle\Xi_{1} โ‰คโ„ฌฯˆiโ€‹(ft,iโ€ฒ,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)โˆ’โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ),\displaystyle\leq\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})-\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i}),
ฮž2\displaystyle\Xi_{2} โ‰คโ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(1))โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,fยฏtโˆ’1,iโ€ฒโ€‹(1)).\displaystyle\leq\mathcal{B}_{\psi_{i}}(f,\bar{f}^{\prime}_{t-1,i}(1))-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},\bar{f}^{\prime}_{t-1,i}(1)).

Then ฮž1+ฮž2+ฮž3\Xi_{1}+\Xi_{2}+\Xi_{3} can be further bounded as follows,

ฮž1+ฮž2+ฮž3โ‰ค\displaystyle\Xi_{1}+\Xi_{2}+\Xi_{3}\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+[โ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(1))โˆ’โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)]+\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\left[\mathcal{B}_{\psi_{i}}(f,\bar{f}^{\prime}_{t-1,i}(1))-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})\right]+
[โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,fยฏtโˆ’1,iโ€ฒโ€‹(1))]โˆ’[โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)+โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ)]+ฮž3.\displaystyle\left[\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},\bar{f}^{\prime}_{t-1,i}(1))\right]-\left[\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})+\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i})\right]+\Xi_{3}.

By Lemma A.8.6, we analyze the following term

ฮž3โˆ’[โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)+โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ)]\displaystyle\Xi_{3}-\left[\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})+\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i})\right]
โ‰ค\displaystyle\leq ฮปi2โ€‹[โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2(โ„™โ€‹[bt,i=1])2โ€‹๐•€bt,i=1โˆ’โ€–โˆ‡^t,iโ€–โ„‹i2]โˆ’12โ€‹ฮปiโ€‹โ€–ftโˆ’1,iโ€ฒโˆ’ft,iโ€ฒโ€–โ„‹i2+โŸจftโˆ’1,iโ€ฒโˆ’ft,iโ€ฒ,โˆ‡~t,iโŸฉ.\displaystyle\frac{\lambda_{i}}{2}\left[\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t,i}=1])^{2}}\mathbb{I}_{b_{t,i}=1}-\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\right]-\frac{1}{2\lambda_{i}}\|f^{\prime}_{t-1,i}-f^{\prime}_{t,i}\|^{2}_{\mathcal{H}_{i}}+\langle f^{\prime}_{t-1,i}-f^{\prime}_{t,i},\tilde{\nabla}_{t,i}\rangle.

Substituting into the instantaneous regret gives

โŸจft,iโˆ’f,โˆ‡t,iโŸฉโ‰ค\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+โŸจโˆ‡~t,i,ftโˆ’1,iโ€ฒโˆ’ft,iโ€ฒโŸฉ+\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle+\langle\tilde{\nabla}_{t,i},f^{\prime}_{t-1,i}-f^{\prime}_{t,i}\rangle+
โ€–fยฏtโˆ’1,iโ€ฒโ€‹(1)โˆ’fโ€–โ„‹i2โˆ’โ€–ftโˆ’1,iโ€ฒโˆ’fโ€–โ„‹i22โ€‹ฮปi+ฮปi2โ€‹โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2(โ„™โ€‹[bt,i=1])2โ€‹๐•€bt,i=1โˆ’ฮปi2โ€‹โ€–โˆ‡^t,iโ€–โ„‹i2.\displaystyle\frac{\|\bar{f}^{\prime}_{t-1,i}(1)-f\|^{2}_{\mathcal{H}_{i}}-\|f^{\prime}_{t-1,i}-f\|^{2}_{\mathcal{H}_{i}}}{2\lambda_{i}}+\frac{\lambda_{i}}{2}\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t,i}=1])^{2}}\mathbb{I}_{b_{t,i}=1}-\frac{\lambda_{i}}{2}\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}.

Combining all cases
Combining the above three cases, we obtain

๐’ฏ2โ‰ค\displaystyle\mathcal{T}_{2}\leq โˆ‘tโˆˆET,i[โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)]+4โ€‹(U+ฮปi)โ€‹๐’œ~T,ฮบi12+โˆ‘tโˆˆ๐’ฅi[โŸจโˆ‡~t,i,ftโˆ’1,iโ€ฒโˆ’ft,iโ€ฒโŸฉ+2โ€‹U2ฮปi]+\displaystyle\sum_{t\in E_{T,i}}\left[\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})\right]+4(U+\lambda_{i})\tilde{\mathcal{A}}^{\frac{1}{2}}_{T,\kappa_{i}}+\sum_{t\in\mathcal{J}_{i}}\left[\langle\tilde{\nabla}_{t,i},f^{\prime}_{t-1,i}-f^{\prime}_{t,i}\rangle+\frac{2U^{2}}{\lambda_{i}}\right]+
ฮปi2โ€‹โˆ‘tโˆˆEยฏiโˆช๐’ฅi[โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2(โ„™โ€‹[bt,i=1])2โ€‹๐•€bt,i=1โˆ’โ€–โˆ‡^t,iโ€–โ„‹i2]+โˆ‘tโˆˆEยฏiโˆช๐’ฅiโŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+2โ€‹ฮปiโ€‹๐’œ~T,ฮบi.\displaystyle\frac{\lambda_{i}}{2}\sum_{t\in\bar{E}_{i}\cup\mathcal{J}_{i}}\left[\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t,i}=1])^{2}}\mathbb{I}_{b_{t,i}=1}-\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\right]+\sum_{t\in\bar{E}_{i}\cup\mathcal{J}_{i}}\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle+2\lambda_{i}\tilde{\mathcal{A}}_{T,\kappa_{i}}.

Recall that \|f^{\prime}_{t,i}\|_{\mathcal{H}_{i}}\leq U and \|f\|_{\mathcal{H}_{i}}\leq U. Conditioned on b_{s_{r},i},\ldots,b_{t-1,i}, taking expectation w.r.t. b_{t,i} gives

๐”ผโ€‹[๐’ฏ2]โ‰คU22โ€‹ฮปi+(2โ€‹U+2โ€‹U2ฮปi)โ‹…J+5โ€‹ฮปi2โ€‹๐’œ~T,ฮบi+4โ€‹(U+ฮปi)โ€‹๐’œ~T,ฮบi.\mathbb{E}\left[\mathcal{T}_{2}\right]\leq\frac{U^{2}}{2\lambda_{i}}+\left(2U+\frac{2U^{2}}{\lambda_{i}}\right)\cdot J+\frac{5\lambda_{i}}{2}\tilde{\mathcal{A}}_{T,\kappa_{i}}+4(U+\lambda_{i})\sqrt{\tilde{\mathcal{A}}_{T,\kappa_{i}}}. (A7)

Let ฮปi=Kโ€‹U2โ€‹B\lambda_{i}=\frac{\sqrt{K}U}{2\sqrt{B}}. Assuming that Bโ‰ฅKB\geq K, we have ฮปiโ‰คU2\lambda_{i}\leq\frac{U}{2}. Then

๐”ผโ€‹[๐’ฏ2]=\displaystyle\mathbb{E}\left[\mathcal{T}_{2}\right]= Oโ€‹(Uโ€‹BK+Kโ€‹UBโ€‹k1โ€‹๐’œ~T,ฮบi+Uโ€‹๐’œ~T,ฮบi)byโ€‹(โ€‹A3โ€‹)\displaystyle O\left(\frac{U\sqrt{B}}{\sqrt{K}}+\frac{\sqrt{K}U}{\sqrt{B}k_{1}}\tilde{\mathcal{A}}_{T,\kappa_{i}}+U\sqrt{\tilde{\mathcal{A}}_{T,\kappa_{i}}}\right)\quad\quad\quad\mathrm{by}~{}\eqref{eq:JCST2022:M-OMD-H:J}
=\displaystyle= Oโ€‹(Uโ€‹BK+Kโ€‹Uโ€‹๐’œT,ฮบiโ€‹lnโกTBโ€‹k1),byโ€‹Lemmaโ€‹A.1.1\displaystyle O\left(\frac{U\sqrt{B}}{\sqrt{K}}+\frac{\sqrt{K}U\mathcal{A}_{T,\kappa_{i}}\ln{T}}{\sqrt{B}k_{1}}\right),\quad\quad\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lem:JCST2022:reservoir_estimator}

where we omit the lower order term.

A.2.3 Combining ๐’ฏ1\mathcal{T}_{1} and ๐’ฏ2\mathcal{T}_{2}

Combining ๐’ฏ1\mathcal{T}_{1} and ๐’ฏ2\mathcal{T}_{2}, and taking expectation w.r.t. the randomness of reservoir sampling gives

๐”ผโ€‹[Regโ€‹(f)]\displaystyle\mathbb{E}\left[\mathrm{Reg}(f)\right]
=\displaystyle= ๐”ผโ€‹[โˆ‘t=1Tโ„“โ€‹(ftโ€‹(๐’™t),yt)โˆ’โˆ‘t=1Tโ„“โ€‹(ft,iโ€‹(๐’™t),yt)]+๐”ผโ€‹[๐’ฏ2]\displaystyle\mathbb{E}\left[\sum^{T}_{t=1}\ell(f_{t}({\bm{x}}_{t}),y_{t})-\sum^{T}_{t=1}\ell(f_{t,i}({\bm{x}}_{t}),y_{t})\right]+\mathbb{E}\left[\mathcal{T}_{2}\right]
โ‰ค\displaystyle\leq 32โ€‹๐”ผโ€‹[maxt,jโกct,jโ‹…โˆ‘t=1Tโ„“โ€‹(ft,iโ€‹(๐’™t),yt)โ€‹lnโกK]+92โ€‹maxt,jโกct,jโ‹…lnโกK+๐”ผโ€‹[๐’ฏ2]byโ€‹(โ€‹A6โ€‹)\displaystyle\frac{3}{\sqrt{2}}\mathbb{E}\left[\sqrt{\max_{t,j}c_{t,j}\cdot\sum^{T}_{t=1}\ell(f_{t,i}({\bm{x}}_{t}),y_{t})\ln{K}}\right]+\frac{9}{2}\max_{t,j}c_{t,j}\cdot\ln{K}+\mathbb{E}\left[\mathcal{T}_{2}\right]\quad\quad\quad\mathrm{by}~{}\eqref{eq:JCST2022:kernel_alignment_bound:first_part_of_regret}
=\displaystyle= 32โ€‹๐”ผโ€‹[maxt,jโกct,jโ‹…(โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt)+๐”ผโ€‹[๐’ฏ2])โ€‹lnโกK]+92โ€‹maxt,jโกct,jโ‹…lnโกK+๐”ผโ€‹[๐’ฏ2]\displaystyle\frac{3}{\sqrt{2}}\mathbb{E}\left[\sqrt{\max_{t,j}c_{t,j}\cdot\left(\sum^{T}_{t=1}\ell(f({\bm{x}}_{t}),y_{t})+\mathbb{E}\left[\mathcal{T}_{2}\right]\right)\ln{K}}\right]+\frac{9}{2}\max_{t,j}c_{t,j}\cdot\ln{K}+\mathbb{E}\left[\mathcal{T}_{2}\right]
=\displaystyle= Oโ€‹(maxt,jโกct,jโ‹…LTโ€‹(f)โ€‹lnโกK+Uโ€‹BK+Kโ€‹Uโ€‹๐’œT,ฮบiโ€‹lnโกTBโ€‹k1+maxt,jโกct,jโ‹…lnโกK).\displaystyle O\left(\sqrt{\max_{t,j}c_{t,j}\cdot L_{T}(f)\ln{K}}+\frac{U\sqrt{B}}{\sqrt{K}}+\frac{\sqrt{K}U\mathcal{A}_{T,\kappa_{i}}\ln{T}}{\sqrt{B}k_{1}}+\max_{t,j}c_{t,j}\cdot\ln{K}\right).

For the hinge loss function, we have \max_{t,j}c_{t,j}=1+U. ∎

Appendix A.3 Proof of Theorem 2

Proof.

For simplicity, denote by

ฮ›i=โˆ‘tโˆˆ๐’ฅi[โ€–fยฏtโˆ’1,iโ€ฒโ€‹(1)โˆ’fโ€–โ„‹i2โˆ’โ€–ftโˆ’1,iโ€ฒโˆ’fโ€–โ„‹i2].\Lambda_{i}=\sum_{t\in\mathcal{J}_{i}}\left[\left\|\bar{f}^{\prime}_{t-1,i}(1)-f\right\|^{2}_{\mathcal{H}_{i}}-\left\|f^{\prime}_{t-1,i}-f\right\|^{2}_{\mathcal{H}_{i}}\right].

There must be a constant \xi_{i}\in(0,4] such that \Lambda_{i}\leq\xi_{i}U^{2}J. We will prove a better regret bound if \xi_{i} is small enough. Recall that (A3) gives an upper bound on J. If \xi_{i}\leq\frac{1}{J}, then we can rewrite (A7) as

๐’ฏ2โ‰คU22โ€‹ฮปi+2โ€‹Uโ€‹J+U22โ€‹ฮปi+5โ€‹ฮปi2โ€‹๐’œ~T,ฮบi+4โ€‹(U+ฮปi)โ€‹๐’œ~T,ฮบi.\displaystyle\mathcal{T}_{2}\leq\frac{U^{2}}{2\lambda_{i}}+2UJ+\frac{U^{2}}{2\lambda_{i}}+\frac{5\lambda_{i}}{2}\tilde{\mathcal{A}}_{T,\kappa_{i}}+4(U+\lambda_{i})\sqrt{\tilde{\mathcal{A}}_{T,\kappa_{i}}}.

Let ฮปi=2โ€‹U5โ€‹๐’œ~T,ฮบi\lambda_{i}=\frac{\sqrt{2}U}{\sqrt{5\tilde{\mathcal{A}}_{T,\kappa_{i}}}}. Taking expectation w.r.t. the reservoir sampling and using Lemma A.1.1 gives

๐”ผโ€‹[๐’ฏ2]=Oโ€‹(Uโ€‹KBโ€‹k1โ€‹๐’œT,ฮบiโ€‹lnโกT+Uโ€‹๐’œT,ฮบiโ€‹lnโกT),\mathbb{E}\left[\mathcal{T}_{2}\right]=O\left(\frac{UK}{Bk_{1}}\mathcal{A}_{T,\kappa_{i}}\ln{T}+U\sqrt{\mathcal{A}_{T,\kappa_{i}}\ln{T}}\right),

where we omit the lower order terms. Combining ๐’ฏ1\mathcal{T}_{1} and ๐’ฏ2\mathcal{T}_{2} gives

๐”ผโ€‹[Regโ€‹(f)]\displaystyle\mathbb{E}\left[\mathrm{Reg}(f)\right]
=\displaystyle= 32โ€‹๐”ผโ€‹[maxt,jโกct,jโ‹…(โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt)+๐”ผโ€‹[๐’ฏ2])โ€‹lnโกK]+92โ€‹maxt,jโกct,jโ‹…lnโกK+๐”ผโ€‹[๐’ฏ2]\displaystyle\frac{3}{\sqrt{2}}\mathbb{E}\left[\sqrt{\max_{t,j}c_{t,j}\cdot\left(\sum^{T}_{t=1}\ell(f({\bm{x}}_{t}),y_{t})+\mathbb{E}\left[\mathcal{T}_{2}\right]\right)\ln{K}}\right]+\frac{9}{2}\max_{t,j}c_{t,j}\cdot\ln{K}+\mathbb{E}\left[\mathcal{T}_{2}\right]
=\displaystyle= Oโ€‹(maxt,jโกct,jโ‹…LTโ€‹(f)โ€‹lnโกK+Uโ€‹KBโ€‹k1โ€‹๐’œT,ฮบiโ€‹lnโกT+Uโ€‹๐’œT,ฮบiโ€‹lnโกT+maxt,jโกct,jโ‹…lnโกK),\displaystyle O\left(\sqrt{\max_{t,j}c_{t,j}\cdot L_{T}(f)\ln{K}}+\frac{UK}{Bk_{1}}\mathcal{A}_{T,\kappa_{i}}\ln{T}+U\sqrt{\mathcal{A}_{T,\kappa_{i}}\ln{T}}+\max_{t,j}c_{t,j}\cdot\ln{K}\right),

which concludes the proof. โˆŽ

Appendix A.4 Proof of Lemma 2

Proof.

Recall the definitions of \mathcal{J} and T_{r} in Section A.1. For any t\in T_{r}, ({\bm{x}}_{t},y_{t}) will be added into S only if b_{t}=1. At the end of the t_{r}-th round, we have

|S|=B2โ€‹๐•€rโ‰ 1+โˆ‘t=trโˆ’1+1trbt=B.|S|=\frac{B}{2}\mathbb{I}_{r\neq 1}+\sum^{t_{r}}_{t=t_{r-1}+1}b_{t}=B.

We remove B2\frac{B}{2} examples from SS at the end of the trt_{r}-th round.

Assume that there is no budget. For any t_{0}>t_{r-1}+1, we will prove an upper bound on \sum^{t_{0}}_{t=t_{r-1}+1}b_{t}. Define a random variable X_{t} as follows,

Xt=btโˆ’โ„™โ€‹[bt=1],|Xt|โ‰ค1.X_{t}=b_{t}-\mathbb{P}[b_{t}=1],\quad|X_{t}|\leq 1.

Conditioned on b_{t_{r-1}+1},\ldots,b_{t-1}, we have \mathbb{E}_{b_{t}}[X_{t}]=0. Thus X_{t_{r-1}+1},\ldots,X_{t_{0}} form a bounded martingale difference sequence. Let \hat{L}_{a:b}:=\sum^{b}_{t=a}\ell(f_{t}({\bm{x}}_{t}),y_{t}) and \hat{L}_{1:T}\leq N. The sum of conditional variances satisfies

ฮฃ2โ‰คโˆ‘t=trโˆ’1+1t0|โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)||โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)|+G1โ‰คG2G1โ€‹L^trโˆ’1+1:t0,\Sigma^{2}\leq\sum^{t_{0}}_{t=t_{r-1}+1}\frac{|\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})|}{|\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})|+G_{1}}\leq\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{0}},

where the last inequality comes from Assumption 3. Since L^trโˆ’1+1:t0\hat{L}_{t_{r-1}+1:t_{0}} is a random variable, Lemma A.8.3 can give an upper bound on โˆ‘t=trโˆ’1+1t0bt\sum^{t_{0}}_{t=t_{r-1}+1}b_{t} with probability at least 1โˆ’2โ€‹โŒˆlogโกNโŒ‰โ€‹ฮด1-2\lceil\log{N}\rceil\delta. Let trt_{r} be the minimal t0t_{0} such that

G2G1โ€‹L^trโˆ’1+1:tr+23โ€‹lnโก1ฮด+2โ€‹G2G1โ€‹L^trโˆ’1+1:trโ€‹lnโก1ฮดโ‰ฅB2โ‹…๐•€rโ‰ฅ2+Bโ‹…๐•€r=1.\displaystyle\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{r}}+\frac{2}{3}\ln\frac{1}{\delta}+2\sqrt{\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{r}}\ln\frac{1}{\delta}}\geq\frac{B}{2}\cdot\mathbb{I}_{r\geq 2}+B\cdot\mathbb{I}_{r=1}.

The rr-th epoch will end at trt_{r}. Summing over rโˆˆ{1,โ€ฆ,J}r\in\{1,\ldots,J\}, with probability at least 1โˆ’2โ€‹Jโ€‹โŒˆlogโกNโŒ‰โ€‹ฮด1-2J\lceil\log{N}\rceil\delta,

โˆ‘r=1Jโˆ‘t=trโˆ’1+1trbtโ‰คโˆ‘r=1J(G2G1โ€‹L^trโˆ’1+1:tr+23โ€‹lnโก1ฮด+2โ€‹G2G1โ€‹L^trโˆ’1+1:trโ€‹lnโก1ฮด),\displaystyle\sum^{J}_{r=1}\sum^{t_{r}}_{t=t_{r-1}+1}b_{t}\leq\sum^{J}_{r=1}\left(\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{r}}+\frac{2}{3}\ln\frac{1}{\delta}+2\sqrt{\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{r}}\ln\frac{1}{\delta}}\right),

which, by the Cauchy-Schwarz inequality applied to the square-root terms, implies

B2+Jโ€‹B2โ‰คG2G1โ€‹L^1:T+23โ€‹Jโ€‹lnโก1ฮด+2โ€‹Jโ€‹G2G1โ€‹L^1:Tโ€‹lnโก1ฮด.\frac{B}{2}+\frac{JB}{2}\leq\frac{G_{2}}{G_{1}}\hat{L}_{1:T}+\frac{2}{3}J\ln\frac{1}{\delta}+2\sqrt{J\frac{G_{2}}{G_{1}}\hat{L}_{1:T}\ln\frac{1}{\delta}}.

Solving the above inequality yields,

Jโ‰ค\displaystyle J\leq 2โ€‹G2G1โ€‹L^1:TBโˆ’43โ€‹lnโก1ฮด+16โ€‹G2G1โ€‹L^1:T(Bโˆ’43โ€‹lnโก1ฮด)2โ€‹lnโก1ฮด+4โ€‹2(Bโˆ’43โ€‹lnโก1ฮด)32โ€‹G2G1โ€‹L^1:Tโ€‹lnโก1ฮด.\displaystyle\frac{2G_{2}}{G_{1}}\frac{\hat{L}_{1:T}}{B-\frac{4}{3}\ln\frac{1}{\delta}}+\frac{16G_{2}}{G_{1}}\frac{\hat{L}_{1:T}}{(B-\frac{4}{3}\ln\frac{1}{\delta})^{2}}\ln\frac{1}{\delta}+\frac{4\sqrt{2}}{(B-\frac{4}{3}\ln\frac{1}{\delta})^{\frac{3}{2}}}\frac{G_{2}}{G_{1}}\hat{L}_{1:T}\sqrt{\ln\frac{1}{\delta}}.

Let Bโ‰ฅ21โ€‹lnโก1ฮดB\geq 21\ln\frac{1}{\delta}. Simplifying the above result concludes the proof. โˆŽ

Appendix A.5 Proof of Theorem 3

Proof.

Let {\bm{p}}\in\Delta_{K-1} satisfy p_{i}=1. By the convexity of the loss function, we have

Regโ€‹(f)โ‰ค\displaystyle\mathrm{Reg}(f)\leq โˆ‘t=1TโŸจโ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt),ftโ€‹(๐’™t)โˆ’fโ€‹(๐’™t)โŸฉ\displaystyle\sum^{T}_{t=1}\langle\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t}),f_{t}({\bm{x}}_{t})-f({\bm{x}}_{t})\rangle
=\displaystyle= โˆ‘t=1TโŸจโ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt),โˆ‘i=1Kpt,iโ€‹ft,iโ€‹(๐’™t)โˆ’fโ€‹(๐’™t)โŸฉ\displaystyle\sum^{T}_{t=1}\left\langle\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t}),\sum^{K}_{i=1}p_{t,i}f_{t,i}({\bm{x}}_{t})-f({\bm{x}}_{t})\right\rangle
=\displaystyle= โˆ‘t=1TโŸจโ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt),โˆ‘i=1Kpt,iโ€‹ft,iโ€‹(๐’™t)โˆ’โˆ‘i=1Kpiโ€‹ft,iโ€‹(๐’™t)+โˆ‘i=1Kpiโ€‹ft,iโ€‹(๐’™t)โˆ’fโ€‹(๐’™t)โŸฉ\displaystyle\sum^{T}_{t=1}\left\langle\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t}),\sum^{K}_{i=1}p_{t,i}f_{t,i}({\bm{x}}_{t})-\sum^{K}_{i=1}p_{i}f_{t,i}({\bm{x}}_{t})+\sum^{K}_{i=1}p_{i}f_{t,i}({\bm{x}}_{t})-f({\bm{x}}_{t})\right\rangle
=\displaystyle= โˆ‘t=1Tโ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ€‹โˆ‘i=1K(pt,iโˆ’pi)โ€‹ft,iโ€‹(๐’™t)โŸ๐’ฏ1+โˆ‘t=1TโŸจโˆ‡t,i,ft,iโˆ’fโŸฉโŸ๐’ฏ2.\displaystyle\underbrace{\sum^{T}_{t=1}\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\sum^{K}_{i=1}(p_{t,i}-p_{i})f_{t,i}({\bm{x}}_{t})}_{\mathcal{T}_{1}}+\underbrace{\sum^{T}_{t=1}\langle\nabla_{t,i},f_{t,i}-f\rangle}_{\mathcal{T}_{2}}.

We first analyze ๐’ฏ1\mathcal{T}_{1}. We have

๐’ฏ1=\displaystyle\mathcal{T}_{1}= โˆ‘tโˆˆT1โˆ‘i=1K(pt,iโˆ’pi)โ‹…โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…(ft,iโ€‹(๐’™t)โˆ’minjโˆˆ[K]โกft,jโ€‹(๐’™t))+\displaystyle\sum_{t\in T^{1}}\sum^{K}_{i=1}(p_{t,i}-p_{i})\cdot\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\left(f_{t,i}({\bm{x}}_{t})-\min_{j\in[K]}f_{t,j}({\bm{x}}_{t})\right)+
โˆ‘tโˆˆT2โˆ‘i=1K(pt,iโˆ’pi)โ‹…โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…(ft,iโ€‹(๐’™t)โˆ’maxjโˆˆ[K]โกft,jโ€‹(๐’™t))\displaystyle\sum_{t\in T^{2}}\sum^{K}_{i=1}(p_{t,i}-p_{i})\cdot\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\left(f_{t,i}({\bm{x}}_{t})-\max_{j\in[K]}f_{t,j}({\bm{x}}_{t})\right)
=\displaystyle= โˆ‘t=1TโŸจ๐’‘tโˆ’๐’‘,๐’„tโŸฉbyโ€‹(โ€‹16โ€‹)\displaystyle\sum^{T}_{t=1}\langle{\bm{p}}_{t}-{\bm{p}},{\bm{c}}_{t}\rangle\quad\quad\quad\mathrm{by}~{}\eqref{eq:JCST2022:smooth_loss:unsigned_criterion}
โ‰ค\displaystyle\leq 32โ€‹maxt,jโกct,jโ€‹โˆ‘ฯ„=1TโŸจ๐’‘ฯ„,๐’„ฯ„โŸฉโ€‹lnโกKbyโ€‹(โ€‹A5โ€‹)\displaystyle\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\sum^{T}_{\tau=1}\langle{\bm{p}}_{\tau},{\bm{c}}_{\tau}\rangle\ln{K}}\quad\quad\quad\mathrm{by}~{}\eqref{eq:JCST2023_supp:initial_regret_PEA}
โ‰ค\displaystyle\leq 32โ€‹maxt,jโกct,jโ€‹โˆ‘ฯ„=1T|โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)|โ‹…2โ€‹Uโ€‹lnโกK\displaystyle\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\sum^{T}_{\tau=1}|\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})|\cdot 2U\ln{K}}
โ‰ค\displaystyle\leq 62UG2โ€‹G1โ€‹L^1:Tโ€‹lnโกK.byAssumption(3)\displaystyle\frac{6}{\sqrt{2}}U\sqrt{G_{2}G_{1}\hat{L}_{1:T}\ln{K}}.\quad\quad\quad\mathrm{by}~{}\mathrm{Assumption}~{}\eqref{ass:JCST2022:property_smooth_loss}

Next we analyze ๐’ฏ2\mathcal{T}_{2}. We decompose [T][T] as follows,

T1=\displaystyle T_{1}= {tโˆˆ[T]:conโ€‹(a)},\displaystyle\{t\in[T]:\mathrm{con}(a)\},
๐’ฅ=\displaystyle\mathcal{J}= {tโˆˆ[T]:|S|=ฮฑโ€‹โ„›,bt=1},\displaystyle\{t\in[T]:|S|=\alpha\mathcal{R},b_{t}=1\},
Tยฏ1=\displaystyle\bar{T}_{1}= [T]โˆ–(T1โˆช๐’ฅ).\displaystyle[T]\setminus(T_{1}\cup\mathcal{J}).

Case 1: regret in T1T_{1}
We decompose โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\langle f_{t,i}-f,\nabla_{t,i}\rangle as follows,

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle
=\displaystyle= โŸจft+1,iโˆ’f,โˆ‡iโ€‹(st),iโŸฉ+โŸจft+1,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ+โŸจft,iโˆ’ft+1,i,โˆ‡iโ€‹(st),i+โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ\displaystyle\langle f_{t+1,i}-f,\nabla_{i(s_{t}),i}\rangle+\langle f_{t+1,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle+\langle f_{t,i}-f_{t+1,i},\nabla_{i(s_{t}),i}+\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)โˆ’โ„ฌฯˆiโ€‹(ft+1,i,ft,i)+โŸจft,iโˆ’ft+1,i,โˆ‡iโ€‹(st),iโŸฉ+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})+\left\langle f_{t,i}-f_{t+1,i},\nabla_{i(s_{t}),i}\right\rangle+\left\langle f_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\right\rangle
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+ฮป2โ€‹โ€–โˆ‡iโ€‹(st),iโ€–โ„‹i2+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ,\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\frac{\lambda}{2}\|\nabla_{i(s_{t}),i}\|^{2}_{\mathcal{H}_{i}}+\left\langle f_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\right\rangle,

where the last inequality comes from Lemma A.8.6. Next we analyze the third term.

โˆ‘tโˆˆT1โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉโ‰ค\displaystyle\sum_{t\in T_{1}}\langle f_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle\leq 2โ€‹Uโ‹…โˆ‘tโˆˆT1|โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)|1+โˆ‘ฯ„โˆˆT1,ฯ„โ‰คt|โ„“โ€ฒโ€‹(fฯ„โ€‹(๐’™ฯ„),yฯ„)|\displaystyle 2U\cdot\sum_{t\in T_{1}}\frac{\left|\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\right|}{\sqrt{1+\sum_{\tau\in T_{1},\tau\leq t}\left|\ell^{\prime}(f_{\tau}({\bm{x}}_{\tau}),y_{\tau})\right|}}
โ‰ค\displaystyle\leq 4โ€‹Uโ€‹G2โ€‹L^1:T.\displaystyle 4U\sqrt{G_{2}\hat{L}_{1:T}}.

Case 2: regret in Tยฏ1\bar{T}_{1}
We use a different decomposition as follows

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle
=\displaystyle= โŸจft+1,iโˆ’f,โˆ‡~t,iโŸฉโŸฮž1+โŸจft+1,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉโŸฮž2+โŸจft,iโˆ’ft+1,i,โˆ‡t,iโŸฉโŸฮž3\displaystyle\underbrace{\langle f_{t+1,i}-f,\tilde{\nabla}_{t,i}\rangle}_{\Xi_{1}}+\underbrace{\langle f_{t+1,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle}_{\Xi_{2}}+\underbrace{\langle f_{t,i}-f_{t+1,i},\nabla_{t,i}\rangle}_{\Xi_{3}}
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)โˆ’โ„ฌฯˆiโ€‹(ft+1,i,ft,i)+โŸจft+1,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})+\langle f_{t+1,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+
โŸจft,iโˆ’ft+1,i,โˆ‡~t,iโŸฉ+โŸจft,iโˆ’ft+1,i,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ\displaystyle\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle+\langle f_{t,i}-f_{t+1,i},\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle
=\displaystyle= โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+โŸจft,iโˆ’ft+1,i,โˆ‡~t,iโŸฉโˆ’โ„ฌฯˆiโ€‹(ft+1,i,ft,i)\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle+\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+ฮปi2โ€‹โ€–โˆ‡~t,iโ€–โ„‹i2.\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle+\frac{\lambda_{i}}{2}\|\tilde{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}.

Case 3: regret in ๐’ฅ\mathcal{J}
We decompose \left\langle f_{t,i}-f,\nabla_{t,i}\right\rangle into three terms as in Case 2. The second mirror update is

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏt,iโ€‹(2))}.f_{t+1,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\langle f,\tilde{\nabla}_{t,i}\rangle+\mathcal{B}_{\psi_{i}}(f,\bar{f}_{t,i}(2))\right\}.

Similar to the analysis of Case 2, we obtain

ฮž1โ‰ค\displaystyle\Xi_{1}\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)โˆ’โ„ฌฯˆiโ€‹(ft+1,i,fยฏt,iโ€‹(2))+[โ„ฌฯˆiโ€‹(f,fยฏt,iโ€‹(2))โˆ’โ„ฌฯˆiโ€‹(f,ft,i)],\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})-\mathcal{B}_{\psi_{i}}(f_{t+1,i},\bar{f}_{t,i}(2))+[\mathcal{B}_{\psi_{i}}(f,\bar{f}_{t,i}(2))-\mathcal{B}_{\psi_{i}}(f,f_{t,i})],
ฮž3=\displaystyle\Xi_{3}= โŸจfยฏt,iโ€‹(2)โˆ’ft+1,i,โˆ‡~t,iโŸฉ+โŸจft,iโˆ’fยฏt,iโ€‹(2),โˆ‡~t,iโŸฉ+โŸจft,iโˆ’ft+1,i,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ.\displaystyle\langle\bar{f}_{t,i}(2)-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle+\langle f_{t,i}-\bar{f}_{t,i}(2),\tilde{\nabla}_{t,i}\rangle+\langle f_{t,i}-f_{t+1,i},\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle.

Combining ฮž1\Xi_{1}, ฮž2\Xi_{2} and ฮž3\Xi_{3} gives

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\left\langle f_{t,i}-f,\nabla_{t,i}\right\rangle
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+
โ„ฌฯˆiโ€‹(f,fยฏt,iโ€‹(2))โˆ’โ„ฌฯˆiโ€‹(f,ft,i)+โŸจft,iโˆ’fยฏt,iโ€‹(2),โˆ‡~t,iโŸฉโŸฮž4+โŸจfยฏt,iโ€‹(2)โˆ’ft+1,i,โˆ‡~t,iโŸฉโˆ’โ„ฌฯˆiโ€‹(ft+1,i,fยฏt,iโ€‹(2))\displaystyle\underbrace{\mathcal{B}_{\psi_{i}}(f,\bar{f}_{t,i}(2))-\mathcal{B}_{\psi_{i}}(f,f_{t,i})+\langle f_{t,i}-\bar{f}_{t,i}(2),\tilde{\nabla}_{t,i}\rangle}_{\Xi_{4}}+\langle\bar{f}_{t,i}(2)-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},\bar{f}_{t,i}(2)) (A8)
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+
2โ€‹U2ฮปi+4โ€‹Uโ€‹G1+โŸจfยฏt,iโ€‹(2)โˆ’ft+1,i,โˆ‡~t,iโŸฉโˆ’โ„ฌฯˆiโ€‹(ft+1,i,fยฏt,iโ€‹(2))byโ€‹Lemmaโ€‹A.8.6\displaystyle\frac{2U^{2}}{\lambda_{i}}+4UG_{1}+\langle\bar{f}_{t,i}(2)-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},\bar{f}_{t,i}(2))\quad\quad\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lemma:JCST2022:property_of_OMD}
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+2โ€‹U2ฮปi+4โ€‹Uโ€‹G1+ฮปi2โ€‹โ€–โˆ‡~t,iโ€–โ„‹i2.\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+\frac{2U^{2}}{\lambda_{i}}+4UG_{1}+\frac{\lambda_{i}}{2}\|\tilde{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}.

Combining the regret in T1T_{1}, ๐’ฅ\mathcal{J} and Tยฏ1\bar{T}_{1} gives

๐’ฏ2โ‰ค\displaystyle\mathcal{T}_{2}\leq 4โ€‹Uโ€‹G2โ€‹L^1:T+(2โ€‹U2ฮปi+4โ€‹Uโ€‹G1)โ€‹|๐’ฅ|+โˆ‘tโˆˆTยฏ1โˆช๐’ฅโŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉโŸฮž2,1+\displaystyle 4U\sqrt{G_{2}\hat{L}_{1:T}}+\left(\frac{2U^{2}}{\lambda_{i}}+4UG_{1}\right)|\mathcal{J}|+\underbrace{\sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle}_{\Xi_{2,1}}+
โˆ‘t=1T(โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i))+ฮปiโ€‹(12โ€‹โˆ‘tโˆˆTยฏ1โˆช๐’ฅโ€–โˆ‡~t,iโ€–โ„‹i2+โˆ‘tโˆˆT112โ€‹โ€–โˆ‡iโ€‹(st),iโ€–โ„‹i2)โŸฮž2,2\displaystyle\sum^{T}_{t=1}\left(\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})\right)+\lambda_{i}\underbrace{\left(\frac{1}{2}\sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\|\tilde{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}+\sum_{t\in T_{1}}\frac{1}{2}\|\nabla_{i(s_{t}),i}\|^{2}_{\mathcal{H}_{i}}\right)}_{\Xi_{2,2}}
โ‰ค\displaystyle\leq 4โ€‹Uโ€‹G2โ€‹L^1:T+(U22โ€‹ฮปi+4โ€‹Uโ€‹G1)โ€‹|๐’ฅ|+ฮž2,1+U22โ€‹ฮปi+ฮž2,2.\displaystyle 4U\sqrt{G_{2}\hat{L}_{1:T}}+\left(\frac{U^{2}}{2\lambda_{i}}+4UG_{1}\right)|\mathcal{J}|+\Xi_{2,1}+\frac{U^{2}}{2\lambda_{i}}+\Xi_{2,2}.

Lemma A.8.4 gives, with probability at least 1โˆ’ฮ˜โ€‹(โŒˆlnโกTโŒ‰)โ€‹ฮด1-\Theta(\lceil\ln{T}\rceil)\delta,

ฮž2,1โ‰ค\displaystyle\Xi_{2,1}\leq 43โ€‹Uโ€‹G1โ€‹lnโก1ฮด+2โ€‹Uโ€‹2โ€‹G2โ€‹G1โ€‹L^1:Tโ€‹lnโก1ฮด,\displaystyle\frac{4}{3}UG_{1}\ln\frac{1}{\delta}+2U\sqrt{2G_{2}G_{1}\hat{L}_{1:T}\ln\frac{1}{\delta}},
ฮž2,2โ‰ค\displaystyle\Xi_{2,2}\leq G1โ€‹G2โ€‹L^1:T+23โ€‹G12โ€‹lnโก1ฮด+2โ€‹G13โ€‹G2โ€‹L^1:Tโ€‹lnโก1ฮด.\displaystyle G_{1}G_{2}\hat{L}_{1:T}+\frac{2}{3}G^{2}_{1}\ln\frac{1}{\delta}+2\sqrt{G^{3}_{1}G_{2}\hat{L}_{1:T}\ln\frac{1}{\delta}}.

Let ฮปi=2โ€‹UBโ€‹G1\lambda_{i}=\frac{2U}{\sqrt{B}G_{1}}. Using Lemma 2 and combining ๐’ฏ1\mathcal{T}_{1} and ๐’ฏ2\mathcal{T}_{2} gives, with probability at least 1โˆ’ฮ˜โ€‹(โŒˆlnโกTโŒ‰)โ€‹ฮด1-\Theta(\lceil\ln{T}\rceil)\delta,

Regโ€‹(f)=\displaystyle\mathrm{Reg}(f)= L^1:Tโˆ’LTโ€‹(f)\displaystyle\hat{L}_{1:T}-L_{T}(f)
โ‰ค\displaystyle\leq ๐’ฏ1+๐’ฏ2\displaystyle\mathcal{T}_{1}+\mathcal{T}_{2}
โ‰ค\displaystyle\leq 10โ€‹Uโ€‹G2โ€‹G1โ€‹L^1:Tโ€‹lnโก1ฮด+Uโ€‹G1โ€‹B4+6โ€‹Uโ€‹G2โ€‹L^1:TBโˆ’43โ€‹lnโก1ฮด,\displaystyle 10U\sqrt{G_{2}G_{1}\hat{L}_{1:T}\ln\frac{1}{\delta}}+\frac{UG_{1}\sqrt{B}}{4}+\frac{6UG_{2}\hat{L}_{1:T}}{\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}},

where we omit the constant terms and the lower order terms. Let ฮณ=6โ€‹Uโ€‹G2Bโˆ’43โ€‹lnโก1ฮด\gamma=\frac{6UG_{2}}{\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}} and Uโ‰ค18โ€‹G2โ€‹Bโˆ’43โ€‹lnโก1ฮดU\leq\frac{1}{8G_{2}}\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}. Then 1โˆ’ฮณโ‰ฅ141-\gamma\geq\frac{1}{4}. Solving for L^1:T\hat{L}_{1:T} concludes the proof.

Finally, we explain why K\leq d must hold. The space complexity of M-OMD-S is O(KB+dB+K). According to Assumption 1, the coefficient \alpha only depends on d. If K\leq d, then the space complexity of M-OMD-S is O(dB). In this case, B=\Theta(\alpha\mathcal{R}). If K>d, then the space complexity is O(KB), and M-OMD-S must allocate the memory resource over the K hypotheses. For instance, if K=d^{\nu} with \nu>1, then B=\Theta(K^{\frac{1-\nu}{\nu}}\alpha\mathcal{R}). Thus the regret bound increases by a factor of order O(K^{\frac{\nu-1}{2\nu}}). ∎

Appendix A.6 Proof of Theorem 4

Proof.

Let \kappa({\bm{x}},{\bm{v}})=\langle{\bm{x}},{\bm{v}}\rangle^{p}. The adversary first constructs \mathcal{I}_{T}. For 1\leq t\leq 3B, let {\bm{x}}_{t}={\bm{e}}_{t}, where {\bm{e}}_{t} is the t-th standard basis vector in \mathbb{R}^{d}. Let y_{t}=1 if t is odd; otherwise, y_{t}=-1. For 3B+1\leq t\leq T, let ({\bm{x}}_{t},y_{t}) be drawn uniformly from \{({\bm{x}}_{\tau},y_{\tau})\}^{3B}_{\tau=1}.
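As a concrete illustration of this construction, the following minimal sketch generates the instance sequence \mathcal{I}_{T}; the dimension, the budget, and all names are illustrative assumptions, not part of the proof.

```python
import numpy as np

def build_adversarial_stream(T, B, d, seed=0):
    """Instance sequence of Theorem 4: x_t = e_t with alternating labels for
    t <= 3B, then (x_t, y_t) drawn uniformly from the first 3B pairs."""
    assert d >= 3 * B and T > 3 * B
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for t in range(1, 3 * B + 1):
        e_t = np.zeros(d)
        e_t[t - 1] = 1.0                      # standard basis vector e_t
        xs.append(e_t)
        ys.append(1.0 if t % 2 == 1 else -1.0)
    for _ in range(3 * B + 1, T + 1):
        j = rng.integers(3 * B)               # uniform resampling from the pool
        xs.append(xs[j])
        ys.append(ys[j])
    return np.array(xs), np.array(ys)

# Usage sketch with small illustrative values.
X, y = build_adversarial_stream(T=500, B=20, d=64)
print(X.shape, y[:6])
```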

We construct a competitor as follows,

fยฏโ„=U3โ€‹Bโ‹…โˆ‘t=13โ€‹Bytโ€‹ฮบโ€‹(๐’™t,โ‹…).\bar{f}_{\mathbb{H}}=\frac{U}{\sqrt{3B}}\cdot\sum^{3B}_{t=1}y_{t}\kappa({\bm{x}}_{t},\cdot).

It is easy to prove

L_{T}(\bar{f}_{\mathbb{H}})= T\cdot\ln\left(1+\exp\left(-\frac{U}{\sqrt{3B}}\right)\right),
\|\bar{f}_{\mathbb{H}}\|_{\mathcal{H}}= U.

Thus $\bar{f}_{\mathbb{H}}\in\mathbb{H}$.
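For completeness, a short verification of the two identities (our own computation; the loss here is the logistic loss $\ell(u,y)=\ln(1+\exp(-yu))$, as the displayed expressions indicate): since $\kappa({\bm{x}}_{s},{\bm{x}}_{t})=\langle{\bm{e}}_{s},{\bm{e}}_{t}\rangle^{p}=\mathbb{I}_{s=t}$ for $s,t\leq 3B$,

\bar{f}_{\mathbb{H}}({\bm{x}}_{t})=\frac{U}{\sqrt{3B}}y_{t},\qquad\|\bar{f}_{\mathbb{H}}\|^{2}_{\mathcal{H}}=\frac{U^{2}}{3B}\sum^{3B}_{t=1}\kappa({\bm{x}}_{t},{\bm{x}}_{t})=U^{2},

so every round (including the rounds $t>3B$, which replay the first $3B$ pairs) incurs loss $\ln(1+\exp(-\frac{U}{\sqrt{3B}}))$ for $\bar{f}_{\mathbb{H}}$.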
Let $\mathcal{A}$ be an algorithm storing at most $B$ examples. At the beginning of round $t$, let

f_{t}=\sum_{i\leq B}a^{(t)}_{i}\kappa({\bm{x}}^{(t)}_{i},\cdot)

be the hypothesis maintained by $\mathcal{A}$, where ${\bm{x}}^{(t)}_{i}\in\{{\bm{x}}_{1},\ldots,{\bm{x}}_{t-1}\}$. Moreover, it must hold that

\|f_{t}\|_{\mathcal{H}}=\sqrt{\sum_{i\leq B}|a^{(t)}_{i}|^{2}}\leq U. (A9)

For $t\leq 3B$, the instance ${\bm{x}}_{t}={\bm{e}}_{t}$ has not appeared before round $t$ and is orthogonal to every stored instance, so $f_{t}({\bm{x}}_{t})=0$ and $\ell(f_{t}({\bm{x}}_{t}),y_{t})=\ln(2)$; hence $\sum^{3B}_{t=1}\ell(f_{t}({\bm{x}}_{t}),y_{t})=3\ln(2)B$. For any $t\geq 3B+1$, the expected per-round loss satisfies

\mathbb{E}\left[\ell(f_{t}({\bm{x}}_{t}),y_{t})\right]\geq\frac{2}{3}\ln(2)+\frac{1}{3B}\sum_{i\leq B}\ln(1+\exp(-|a^{(t)}_{i}|)).

Note that $|a^{(t)}_{1}|,\ldots,|a^{(t)}_{B}|$ must satisfy (A9). By the method of Lagrange multipliers, the minimum of the right-hand side subject to (A9) is attained at $|a^{(t)}_{i}|=\frac{U}{\sqrt{B}}$ for all $i$. Then we have

\mathbb{E}\left[\ell(f_{t}({\bm{x}}_{t}),y_{t})\right]\geq\frac{2}{3}\ln(2)+\frac{\ln(1+\exp(-\frac{U}{\sqrt{B}}))}{3}.
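As a sanity check (our own alternative argument, not part of the original derivation), the same lower bound also follows from Jensen's inequality and the Cauchy–Schwarz inequality, since $g(a)=\ln(1+\exp(-a))$ is convex and non-increasing:

\frac{1}{B}\sum_{i\leq B}\ln(1+\exp(-|a^{(t)}_{i}|))\geq g\left(\frac{1}{B}\sum_{i\leq B}|a^{(t)}_{i}|\right)\geq g\left(\sqrt{\frac{1}{B}\sum_{i\leq B}|a^{(t)}_{i}|^{2}}\right)\geq g\left(\frac{U}{\sqrt{B}}\right),

where the last two steps use $\frac{1}{B}\sum_{i\leq B}|a^{(t)}_{i}|\leq\sqrt{\frac{1}{B}\sum_{i\leq B}|a^{(t)}_{i}|^{2}}\leq\frac{U}{\sqrt{B}}$ together with the monotonicity of $g$.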

It can be verified that

\forall 0<x\leq 0.2,\quad\ln(1+\exp(-x))\leq\ln(2)-0.45x.
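The inequality can be checked either by convexity of $x\mapsto\ln(1+\exp(-x))$ (the secant over $[0,0.2]$ has slope about $-0.47$) or numerically; a minimal numerical check (our own script, not part of the paper) is:

import numpy as np

# Check ln(1 + exp(-x)) <= ln(2) - 0.45 * x on a dense grid of (0, 0.2].
x = np.linspace(1e-6, 0.2, 200001)
lhs = np.log1p(np.exp(-x))          # ln(1 + exp(-x)), computed stably via log1p
rhs = np.log(2.0) - 0.45 * x
slack = rhs - lhs
print("min slack on (0, 0.2]:", slack.min())   # positive, so the inequality holds on the grid
assert (slack >= 0).all()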

Let $B<T$ and $U\leq\frac{1}{5}\sqrt{3B}$. The expected regret w.r.t. $\bar{f}_{\mathbb{H}}$ is lower bounded as follows

\mathbb{E}\left[\mathrm{Reg}(\bar{f}_{\mathbb{H}})\right]\geq 3\ln(2)B+(T-3B)\cdot\frac{\ln(1+\exp(-\frac{U}{\sqrt{B}}))}{3}+\frac{2}{3}\ln(2)\cdot(T-3B)-T\cdot\ln\left(1+\exp\left(-\frac{U}{\sqrt{3B}}\right)\right)
= \left(\frac{2}{3}T+B\right)\cdot\left(\ln(2)-\ln\left(1+\exp\left(-\frac{U}{\sqrt{3B}}\right)\right)\right)+\frac{1}{3}(T-3B)\ln\frac{1+\exp(-\frac{U}{\sqrt{B}})}{1+\exp(-\frac{U}{\sqrt{3B}})}
\geq \frac{\sqrt{3}}{10}\cdot\frac{UT}{\sqrt{B}}+\frac{1}{3}\left(\frac{\sqrt{3}}{3}-1\right)\cdot\frac{UT}{\sqrt{B}}.
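The last inequality can be recovered as follows (our reconstruction of the constants). For the first term, the displayed inequality above with $x=\frac{U}{\sqrt{3B}}\leq 0.2$ gives $\ln(2)-\ln(1+\exp(-\frac{U}{\sqrt{3B}}))\geq\frac{0.45U}{\sqrt{3B}}$, hence

\left(\frac{2}{3}T+B\right)\cdot\left(\ln(2)-\ln\left(1+\exp\left(-\frac{U}{\sqrt{3B}}\right)\right)\right)\geq\frac{2T}{3}\cdot\frac{0.45U}{\sqrt{3B}}=\frac{\sqrt{3}}{10}\cdot\frac{UT}{\sqrt{B}}.

For the second term, $x\mapsto\ln(1+\exp(-x))$ is $1$-Lipschitz, so

\frac{1}{3}(T-3B)\ln\frac{1+\exp(-\frac{U}{\sqrt{B}})}{1+\exp(-\frac{U}{\sqrt{3B}})}\geq-\frac{1}{3}(T-3B)\left(\frac{U}{\sqrt{B}}-\frac{U}{\sqrt{3B}}\right)\geq\frac{1}{3}\left(\frac{\sqrt{3}}{3}-1\right)\cdot\frac{UT}{\sqrt{B}}.

Since $\frac{\sqrt{3}}{10}+\frac{1}{3}(\frac{\sqrt{3}}{3}-1)\approx 0.032>0$, the expected regret is of order $\Omega(\frac{UT}{\sqrt{B}})$.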

It can be verified that $L_{T}(\bar{f}_{\mathbb{H}})=\Theta(T)$. Replacing $T$ with $\Theta(L_{T}(\bar{f}_{\mathbb{H}}))$ concludes the proof. ∎

Appendix A.7 Proof of Theorem 5

Proof.

There is a $\xi_{i}\in(0,4]$ such that

\Lambda_{i}=\sum_{t\in\mathcal{J}}\left[\left\|\bar{f}_{t,i}(2)-f\right\|^{2}_{\mathcal{H}_{i}}-\|f_{t,i}-f\|^{2}_{\mathcal{H}_{i}}\right]\leq\xi_{i}U^{2}|\mathcal{J}|.

We can bound $\Xi_{4}$ in (A8) as follows,

\sum_{t\in\mathcal{J}}\Xi_{4}\leq\left(\frac{\xi_{i}U^{2}}{2\lambda_{i}}+4UG_{1}\right)\cdot|\mathcal{J}|.

If $\xi_{i}\leq\frac{1}{|\mathcal{J}|}$, then $\frac{\xi_{i}U^{2}}{2\lambda_{i}}\cdot|\mathcal{J}|\leq\frac{U^{2}}{2\lambda_{i}}$. Let $\lambda_{i}=\frac{U}{\sqrt{G_{1}G_{2}\hat{L}_{1:T}}}$. In this way, we obtain a new upper bound on $\mathcal{T}_{2}$. Combining $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ gives

\hat{L}_{1:T}-L_{T}(f)\leq 12U\sqrt{G_{2}G_{1}\hat{L}_{1:T}\ln\frac{1}{\delta}}+\frac{16UG_{2}\hat{L}_{1:T}}{B-\frac{4}{3}\ln\frac{1}{\delta}}+4UG_{1}\ln\frac{1}{\delta}.

Let $\gamma=16UG_{2}(B-\frac{4}{3}\ln\frac{1}{\delta})^{-1}$ and $U<\frac{B-\frac{4}{3}\ln\frac{1}{\delta}}{32G_{2}}$. We have $\gamma\leq\frac{1}{2}$. Solving for $\hat{L}_{1:T}$ concludes the proof. ∎

Appendix A.8 Auxiliary Lemmas

Lemma A.8.1 ([Hazan2009Better]).

$\forall t>M$ and $\forall i\in[K]$, $\mathbb{E}[\|\hat{\nabla}_{t,i}-\mu_{t,i}\|^{2}_{\mathcal{H}_{i}}]\leq\frac{1}{t|V|}\mathcal{A}_{t,\kappa_{i}}$.

Lemma A.8.2.

Let $\eta_{t}$ follow (10) and ${\bm{p}}_{1}$ be the uniform distribution. Then

\sum^{T}_{t=1}\sum^{K}_{i=1}\frac{\eta_{t}}{2}p_{t,i}c^{2}_{t,i}\leq\sqrt{2\ln{K}}\sqrt{\sum^{T}_{\tau=1}\sum^{K}_{i=1}p_{\tau,i}c^{2}_{\tau,i}}+\frac{\sqrt{2\ln{K}}}{2}\max_{t,i}c_{t,i}.
Proof.

Let $\sigma_{\tau}=\sum^{K}_{i=1}p_{\tau,i}c^{2}_{\tau,i}$ and $\sigma_{0}=1$. We decompose the term as follows

lnโกK2โ‹…โˆ‘t=1Tฯƒt1+โˆ‘ฯ„=1tโˆ’1ฯƒฯ„=\displaystyle\frac{\sqrt{\ln{K}}}{\sqrt{2}}\cdot\sum^{T}_{t=1}\frac{\sigma_{t}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}= lnโกK2โ‹…โˆ‘t=1Tฯƒtโˆ‘ฯ„=0tโˆ’1ฯƒฯ„\displaystyle\frac{\sqrt{\ln{K}}}{\sqrt{2}}\cdot\sum^{T}_{t=1}\frac{\sigma_{t}}{\sqrt{\sum^{t-1}_{\tau=0}\sigma_{\tau}}}
=\displaystyle= lnโกK2โ‹…[ฯƒ1+โˆ‘t=2Tฯƒtโˆ’11+โˆ‘ฯ„=1tโˆ’1ฯƒฯ„+โˆ‘t=2Tฯƒtโˆ’ฯƒtโˆ’11+โˆ‘ฯ„=1tโˆ’1ฯƒฯ„].\displaystyle\frac{\sqrt{\ln{K}}}{\sqrt{2}}\cdot\left[\sigma_{1}+\sum^{T}_{t=2}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}+\sum^{T}_{t=2}\frac{\sigma_{t}-\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}\right].

We analyze the third term.

\sum^{T}_{t=2}\frac{\sigma_{t}-\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}= \frac{-\sigma_{1}}{\sqrt{1+\sigma_{1}}}+\frac{\sigma_{T}}{\sqrt{1+\sum^{T-1}_{\tau=1}\sigma_{\tau}}}+\sum^{T-1}_{t=2}\sigma_{t}\left[\frac{1}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}-\frac{1}{\sqrt{1+\sum^{t}_{\tau=1}\sigma_{\tau}}}\right]
\leq -\frac{\sigma_{1}}{\sqrt{1+\sigma_{1}}}+\max_{t=1,\ldots,T}\sigma_{t}\cdot\frac{1}{\sqrt{1+\sigma_{1}}}.

Now we analyze the second term.

\sum^{T}_{t=2}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}=\frac{\sigma_{1}}{\sqrt{1+\sigma_{1}}}+\sum^{T}_{t=3}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}.

For any $a>0$ and $b>0$, we have $2\sqrt{a}\sqrt{b}\leq a+b$. Let $a=1+\sum^{t-1}_{\tau=1}\sigma_{\tau}$ and $b=1+\sum^{t-2}_{\tau=1}\sigma_{\tau}$. Then we have

2\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}\cdot\sqrt{1+\sum^{t-2}_{\tau=1}\sigma_{\tau}}\leq 2\left(1+\sum^{t-1}_{\tau=1}\sigma_{\tau}\right)-\sigma_{t-1}.

Dividing by $\sqrt{a}$ and rearranging terms yields

\frac{1}{2}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}\leq\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}-\sqrt{1+\sum^{t-2}_{\tau=1}\sigma_{\tau}}.

Summing over $t=3,\ldots,T$, we obtain

\sum^{T}_{t=3}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}\leq 2\sqrt{1+\sum^{T-1}_{\tau=1}\sigma_{\tau}}-2\sqrt{1+\sigma_{1}}.

Combining the above bounds, we have

\sum^{T}_{t=1}\frac{\sigma_{t}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}\leq 2\sqrt{1+\sum^{T-1}_{\tau=1}\sigma_{\tau}}-2\sqrt{1+\sigma_{1}}+\max_{t}\sigma_{t}\cdot\frac{1}{\sqrt{1+\sigma_{1}}}+\sigma_{1}
\leq 2\sqrt{\sum^{T}_{\tau=1}\sigma_{\tau}}+\max_{t}\sigma_{t},

which concludes the proof. โˆŽ
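As a quick numerical sanity check of the final inequality (our own script; the assumption $\sigma_{t}\in[0,1]$ is made purely for illustration), one can run:

import numpy as np

rng = np.random.default_rng(0)
worst = -np.inf
for _ in range(1000):
    T = int(rng.integers(1, 200))
    sigma = rng.uniform(0.0, 1.0, size=T)                    # sigma_t, assumed in [0, 1] for this check
    prefix = np.concatenate(([0.0], np.cumsum(sigma)[:-1]))  # sum of sigma_tau for tau < t
    lhs = float(np.sum(sigma / np.sqrt(1.0 + prefix)))
    rhs = 2.0 * float(np.sqrt(sigma.sum())) + float(sigma.max())
    worst = max(worst, lhs - rhs)
print("largest (lhs - rhs) over random trials:", worst)      # expected to be negative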

Lemma A.8.3 (Improved Bernstein's inequality [Li2024On]).

Let $X_{1},\ldots,X_{n}$ be a bounded martingale difference sequence w.r.t. the filtration $\mathcal{F}=(\mathcal{F}_{k})_{1\leq k\leq n}$, with $|X_{k}|\leq a$. Let $Z_{t}=\sum^{t}_{k=1}X_{k}$ be the associated martingale. Denote the sum of the conditional variances by

\Sigma^{2}_{n}=\sum^{n}_{k=1}\mathbb{E}\left[X^{2}_{k}|\mathcal{F}_{k-1}\right]\leq v,

where $v\in[0,V]$ is a random variable and $V\geq 2$ is a constant. Then, for all constants $a>0$, with probability at least $1-2\lceil\log{V}\rceil\delta$,

\max_{t=1,\ldots,n}Z_{t}<\frac{2a}{3}\ln\frac{1}{\delta}+\sqrt{\frac{2}{V}\ln\frac{1}{\delta}}+2\sqrt{v\ln\frac{1}{\delta}}.
Lemma A.8.4.

With probability at least $1-\Theta(\lceil\ln{T}\rceil)\delta$,

\sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\frac{1}{(\mathbb{P}[b_{t}=1])^{2}}\left\|\nabla_{t,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{b_{t}=1}\leq \sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\frac{\left\|\nabla_{t,i}\right\|^{2}_{\mathcal{H}_{i}}}{\mathbb{P}[b_{t}=1]}+\frac{4G^{2}_{1}}{3}\ln\frac{1}{\delta}+4\sqrt{G^{3}_{1}G_{2}\hat{L}_{1:T}\ln\frac{1}{\delta}}.
Proof.

Define a random variable $X_{t}$ by

X_{t}=\frac{\|\nabla_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t}=1])^{2}}\mathbb{I}_{b_{t}=1}-\frac{\|\nabla_{t,i}\|^{2}_{\mathcal{H}_{i}}}{\mathbb{P}[b_{t}=1]},\quad|X_{t}|\leq 2G^{2}_{1}.

$\{X_{t}\}_{t\in\bar{T}_{1}\cup\mathcal{J}}$ forms a bounded martingale difference sequence w.r.t. $\{b_{\tau}\}^{t-1}_{\tau=1}$. The sum of the conditional variances is

\Sigma^{2}\leq\sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\mathbb{E}[X^{2}_{t}]\leq 4G^{3}_{1}G_{2}\hat{L}_{1:T}.

Using Lemma A.8.3 concludes the proof. โˆŽ

Lemma A.8.5.

Let $\Delta_{t}=\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle$. With probability at least $1-\Theta(\lceil\ln{T}\rceil)\delta$,

\sum_{t\in\bar{T}_{1}}\Delta_{t}\leq\frac{4UG_{1}}{3}\ln\frac{1}{\delta}+2U\sqrt{2G_{2}G_{1}\hat{L}_{1:T}\ln\frac{1}{\delta}}.
Proof.

The proof is similar to that of Lemma A.8.4. โˆŽ

Lemma A.8.6.

For each $i\in[K]$, let $\psi_{i}(f)=\frac{1}{2\lambda_{i}}\|f\|^{2}_{\mathcal{H}_{i}}$. Then $\mathcal{B}_{\psi_{i}}(f,g)=\frac{1}{2\lambda_{i}}\|f-g\|^{2}_{\mathcal{H}_{i}}$, and the solutions of (4) and (5) are as follows

f_{t,i}= f^{\prime}_{t-1,i}-\lambda_{i}\hat{\nabla}_{t,i},
f^{\prime}_{t,i}= \min\left\{1,\frac{U}{\left\|f^{\prime}_{t-1,i}-\lambda_{i}\nabla_{t,i}\right\|_{\mathcal{H}_{i}}}\right\}\cdot\left(f^{\prime}_{t-1,i}-\lambda_{i}\nabla_{t,i}\right).

Similarly, we can obtain the solution of (14)

f_{t+1,i}=\min\left\{1,\frac{U}{\left\|f_{t,i}-\lambda_{i}\tilde{\nabla}_{t,i}\right\|_{\mathcal{H}_{i}}}\right\}\cdot\left(f_{t,i}-\lambda_{i}\tilde{\nabla}_{t,i}\right).

Besides,

\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})\leq\frac{\lambda_{i}}{2}\left\|\tilde{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}.
Proof.

We can solve (4) and (5) by the method of Lagrange multipliers. Next we prove the last inequality.

\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})= \left\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\right\rangle-\frac{\|f_{t+1,i}-f_{t,i}\|^{2}_{\mathcal{H}_{i}}}{2\lambda_{i}}
= \frac{\lambda_{i}}{2}\left\|\tilde{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}-\frac{1}{2\lambda_{i}}\left\|f_{t+1,i}-f_{t,i}+\lambda_{i}\tilde{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}
\leq \frac{\lambda_{i}}{2}\left\|\tilde{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}},

which concludes the proof. โˆŽ
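As an aside (not part of the proofs), the update in Lemma A.8.6 is an ordinary gradient step followed by a projection onto the ball of radius $U$, which is straightforward to implement when hypotheses are represented by coefficients on stored examples. Below is a minimal, self-contained sketch; the Gaussian kernel, the logistic loss, and all function names are our illustrative assumptions, and the budget-maintenance part of the algorithm is omitted, so this is not the paper's M-OMD-S procedure.

import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 * bandwidth^2)); the kernel choice is ours, for illustration.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def omd_step(X_buf, alpha, x_t, y_t, lam, U, bandwidth=1.0):
    # One update of the form f <- Pi_U(f - lam * grad), cf. the projected step in Lemma A.8.6.
    # The hypothesis is f(.) = sum_j alpha_j k(x_j, .) with the x_j stored in X_buf.
    if len(alpha):
        f_xt = float(gaussian_kernel(x_t[None, :], X_buf, bandwidth) @ alpha)
    else:
        f_xt = 0.0
    g = -y_t / (1.0 + np.exp(y_t * f_xt))        # derivative of ln(1 + exp(-y f)) w.r.t. f, at f(x_t)
    # Gradient step: the gradient is g * k(x_t, .), i.e., one new coefficient on x_t.
    X_new = np.vstack([X_buf, x_t[None, :]]) if len(alpha) else x_t[None, :]
    a_new = np.append(alpha, -lam * g)
    # Projection onto {f : ||f||_H <= U}, using ||f||_H^2 = a^T K a.
    K = gaussian_kernel(X_new, X_new, bandwidth)
    norm = np.sqrt(max(float(a_new @ K @ a_new), 0.0))
    if norm > U:
        a_new *= U / norm
    return X_new, a_new

For example, starting from the zero hypothesis, X, a = omd_step(np.zeros((0, 2)), np.zeros(0), np.array([1.0, 0.0]), 1, lam=0.5, U=1.0) performs one update and returns the enlarged buffer together with the projected coefficients.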

References

  • [A1] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
  • [A2] C. Chiang, T. Yang, C. Lee, M. Mahdavi, C. Lu, R. Jin, and S. Zhu, “Online optimization with gradual variations,” in Proceedings of the 25th Annual Conference on Learning Theory, 2012, pp. 6.1–6.20.
  • [A3] E. Hazan and S. Kale, “Better algorithms for benign bandits,” in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009, pp. 38–47.
  • [A4] J. Li, Z. Xu, Z. Wu, and I. King, “On the necessity of collaboration in online model selection with decentralized data,” CoRR, vol. abs/2404.09494, 2024, https://arxiv.org/abs/2404.09494v3.