
Learnability in Online Kernel Selection with Memory Constraint via Data-dependent Regret Analysis

Junfan Li, Shizhong Liao*
College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
{junfli,szliao}@tju.edu.cn
* Corresponding Author
Abstract

Online kernel selection is a fundamental problem of online kernel methods. In this paper, we study online kernel selection with memory constraint in which the memory of kernel selection and online prediction procedures is limited to a fixed budget. An essential question is what is the intrinsic relationship among online learnability, memory constraint, and data complexity? To answer the question, it is necessary to show the trade-offs between regret and memory constraint. Previous work gives a worst-case lower bound depending on the data size, and shows learning is impossible within a small memory constraint. In contrast, we present distinct results by offering data-dependent upper bounds that rely on two data complexities: kernel alignment and the cumulative losses of competitive hypothesis. We propose an algorithmic framework giving data-dependent upper bounds for two types of loss functions. For the hinge loss function, our algorithm achieves an expected upper bound depending on kernel alignment. For smooth loss functions, our algorithm achieves a high-probability upper bound depending on the cumulative losses of competitive hypothesis. We also prove a matching lower bound for smooth loss functions. Our results show that if the two data complexities are sub-linear, then learning is possible within a small memory constraint. Our algorithmic framework depends on a new buffer maintaining framework and a reduction from online kernel selection to prediction with expert advice. Finally, we empirically verify the prediction performance of our algorithms on benchmark datasets.

1 Introduction

Online kernel selection (OKS) aims to dynamically select a kernel function, or equivalently a reproducing kernel Hilbert space (RKHS), for online kernel learning algorithms. Compared to offline kernel selection, OKS poses more computational challenges, as both the kernel selection procedure and the prediction must be executed in real time. We can formulate it as a sequential decision problem. Let \mathcal{K}=\{\kappa_{i}\}^{K}_{i=1} contain K base kernel functions and \mathcal{H}_{i} be a dense subset of the RKHS induced by \kappa_{i}. At round t, an adversary selects an instance {\bm{x}}_{t}\in\mathbb{R}^{d}. Then a learner chooses a hypothesis f_{t}\in\cup^{K}_{i=1}\mathcal{H}_{i} and makes a prediction. The adversary reveals the true output y_{t}, and the learner suffers a loss \ell(f_{t}({\bm{x}}_{t}),y_{t}). We measure the performance of the learner by the regret, defined as

โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„‹i,Regโ€‹(f)=โˆ‘t=1Tโ„“โ€‹(ftโ€‹(๐’™t),yt)โˆ’โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt).\forall\kappa_{i}\in\mathcal{K},\quad\forall f\in\mathcal{H}_{i},\quad\mathrm{Reg}(f)=\sum^{T}_{t=1}\ell(f_{t}({\bm{x}}_{t}),y_{t})-\sum^{T}_{t=1}\ell(f({\bm{x}}_{t}),y_{t}).

An efficient algorithm should guarantee Regโ€‹(f)=oโ€‹(T)\mathrm{Reg}(f)=o(T). For convex loss functions, many algorithms [Sahoo2014Online, Foster2017Parameter, Liao2021High] achieve (or imply)

Regโ€‹(f)=O~โ€‹((โ€–fโ€–โ„‹iฮฑ+1)โ€‹Tโ€‹lnโกK),ฮฑโˆˆ{1,2}.\mathrm{Reg}(f)=\tilde{O}\left(\left(\|f\|^{\alpha}_{\mathcal{H}_{i}}+1\right)\sqrt{T\ln{K}}\right),\alpha\in\{1,2\}.

The algorithms adapt to the optimal, yet unknown hypothesis space within a small information-theoretical cost.

A major challenge in OKS is the so-called curse of dimensionality, that is, algorithms must store the previous t-1 instances at the t-th round. The O(t) memory cost is prohibitive for large-scale online learning problems. To address this issue, we limit the memory of algorithms to a budget of \mathcal{R} quanta. For convex loss functions, the worst-case regret is \Theta\left(\max\left\{\sqrt{T},\frac{T}{\sqrt{\mathcal{R}}}\right\}\right) [Li2022Worst], which establishes a trade-off between memory constraint and regret. To be specific, achieving an O(T^{\alpha}) regret requires \mathcal{R}=\Omega\left(T^{2(1-\alpha)}\right), \alpha\in\left[\frac{1}{2},1\right). Neither a constant nor a \Theta(\ln{T}) memory cost can guarantee sub-linear regret, so learning is impossible in any \mathcal{H}_{i}. We use the regret to define the learnability of a hypothesis space (see Definition 2). However, empirical results have shown that a small memory cost is enough to achieve good prediction performance [Dekel2008The, Zhao2012Fast, Zhang2018Online]. It seems that learning is possible in the optimal hypothesis space. To close the gap between learnability and regret, it is necessary to establish data-dependent regret bounds. To this end, we focus on the following two questions:

  1. Q1

    What are the new trade-offs between memory constraints and regret, that is, how do certain data-dependent regret bounds depend on โ„›\mathcal{R}, TT, and KK?

  2. Q2

    Sub-linear regret implies that some hypothesis spaces are learnable. Is it possible to achieve sub-linear regret within an O(\ln{T}) memory cost?

In this paper, we answer the two questions affirmatively. We first propose an algorithmic framework, and then apply it to two types of loss functions, obtaining two kinds of data-dependent regret bounds. To satisfy the memory constraint, we reduce it to a budget on the size of a buffer; our algorithmic framework uses the buffer to store a subset of the observed examples. The main results are summarized as follows.

  1. 1)

    For the hinge loss function, our algorithm enjoys an expected kernel alignment regret bound as follows (see Theorem 1 for detailed result),

    โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„i,๐”ผโ€‹[Regโ€‹(f)]=O~โ€‹(LTโ€‹(f)โ€‹lnโกK+โ„›K+Kโ„›โ€‹๐’œT,ฮบi),\forall\kappa_{i}\in\mathcal{K},\forall f\in\mathbb{H}_{i},\quad\mathbb{E}[\mathrm{Reg}(f)]=\tilde{O}\left(\sqrt{L_{T}(f)\ln{K}}+\frac{\sqrt{\mathcal{R}}}{\sqrt{K}}+\frac{\sqrt{K}}{\sqrt{\mathcal{R}}}\mathcal{A}_{T,\kappa_{i}}\right), (1)

    where ๐’œT,ฮบi\mathcal{A}_{T,\kappa_{i}} is called kernel alignment, โ„iโІโ„‹i\mathbb{H}_{i}\subseteq\mathcal{H}_{i} and

    LTโ€‹(f)=โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt).L_{T}(f)=\sum^{T}_{t=1}\ell(f({\bm{x}}_{t}),y_{t}). (2)
  2. 2)

    For smooth loss functions (see Assumption 3), our algorithm achieves a high-probability small-loss bound (see Theorem 3 for detailed result). โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„i\forall\kappa_{i}\in\mathcal{K},\forall f\in\mathbb{H}_{i}, with probability at least 1โˆ’ฮด1-\delta,

    Regโ€‹(f)=O~โ€‹(LTโ€‹(f)โ„›+LTโ€‹(f)โ€‹lnโกK+โ„›).\begin{split}\mathrm{Reg}(f)=\tilde{O}\left(\frac{L_{T}(f)}{\sqrt{\mathcal{R}}}+\sqrt{L_{T}(f)\ln{K}}+\sqrt{\mathcal{R}}\right).\end{split} (3)
  3. 3)

    For smooth loss functions, we prove a lower bound on the regret (see Theorem 4 for detailed result). Let K=1K=1. The regret of any algorithm that stores BB examples satisfies

    โˆƒfโˆˆโ„,๐”ผโ€‹[Regโ€‹(f)]=ฮฉโ€‹(Uโ‹…LTโ€‹(f)B),\exists f\in\mathbb{H},\quad\mathbb{E}\left[\mathrm{Reg}(f)\right]=\Omega\left(U\cdot\frac{L_{T}(f)}{\sqrt{B}}\right),

    in which U>0U>0 is a constant and bounds the norm of all hypotheses in โ„\mathbb{H}. The upper bound in (3) is optimal in terms of the dependence on LTโ€‹(f)L_{T}(f).

The data-dependent bounds in (1) and (3) improve on the bound O\left(\sqrt{T\ln{K}}+\|f\|^{2}_{\mathcal{H}_{i}}\max\{\sqrt{T},\frac{T}{\sqrt{\mathcal{R}}}\}\right) [Li2022Worst], given that \mathcal{A}_{T,\kappa_{i}}=O(T) and L_{T}(f)=O(T). If \kappa_{i} matches well with the data, then we expect \mathcal{A}_{T,\kappa_{i}}\ll T or L_{T}(f)\ll T. In the worst case, i.e., \mathcal{A}_{T,\kappa_{i}}=\Theta(T) or L_{T}(f)=\Theta(T), our bounds are the same as the previous result. We thus give new trade-offs between memory constraint and regret, which answers Q1. For online kernel selection, we only aim to adapt to the optimal kernel \kappa^{\ast}\in\mathcal{K}. If \kappa^{\ast} matches well with the data, then we expect \mathcal{A}_{T,\kappa^{\ast}}=o(T) and \min_{f\in\mathcal{H}_{\kappa^{\ast}}}L_{T}(f)=o(T). In this case, a \Theta(\ln{T}) memory cost is enough to achieve sub-linear regret, which answers Q2. Thus learning is possible in \mathbb{H}_{\kappa^{\ast}} within a small memory constraint.

Our algorithmic framework reduces OKS to prediction with expert advice and uses the well-known optimistic mirror descent framework [Chiang2012Online, Rakhlin2013Online]. We also propose a new buffer maintaining framework.

2 Related Work

Previous work has adopted buffer maintaining techniques and random features to develop algorithms for online kernel selection with memory constraint [Sahoo2014Online, Liao2021High, Zhang2018Online, Shen2019Random]. However, none of the existing results answers Q1 and Q2 well. The sketch-based online kernel selection algorithm [Zhang2018Online] enjoys an O(B\ln{T}) regularized regret (the loss function incorporates a regularizer) within a buffer of size B. The regret bound becomes O(\sqrt{T}\ln{T}) in the case of B=\Theta(\sqrt{T}). The Raker algorithm [Shen2019Random] achieves an O\left(\sqrt{T\ln{K}}+\frac{T}{\sqrt{D}}\right) regret (the original bound is O(U\sqrt{T\ln{K}}+\epsilon TU) and holds with probability 1-\Theta\left(\epsilon^{-2}\exp\left(-\frac{D}{4d+8}\epsilon^{2}\right)\right)) and suffers a space complexity of O((d+K)D), where D is the number of random features. This upper bound also shows a trade-off in the worst case. If D is a constant or D=\Theta(\ln{T}), then Raker cannot achieve a sub-linear regret bound, providing a negative answer to Q2.

Our work is also related to data-dependent regret bounds for online learning and especially online kernel learning. For online learning, various data-dependent regret bounds have been established, including small-loss bounds [Cesa-Bianchi2006Prediction, Lykouris2018Small, Lee2020Bias], variance bounds [Hazan2009Better, Hazan2010Extracting] and path-length bounds [Chiang2012Online, Steinhardt2014Adaptivity, Wei2018More]. The adaptive online learning framework [Foster2015Adaptive] can achieve data-dependent and model-dependent regret bounds, but does not induce a computationally efficient algorithm. For online kernel learning, it is harder to achieve data-dependent regret bounds, as the computational cost must also be balanced. For loss functions satisfying a specific smoothness condition (see Assumption 3), the OSKL algorithm [Zhang2013Online] achieves a small-loss bound. For loss functions with a curvature property, the PROS-N-KONS algorithm [Calandriello2017Efficient] and the PKAWV algorithm [Jezequel2019Efficient] achieve regret bounds depending on the effective dimension of the kernel matrix. None of these algorithms takes memory constraints into account, and thus they do not provide a trade-off between memory constraint and regret.

3 Problem Setup

Let โ„T:={(๐’™t,yt)}tโˆˆ[T]\mathcal{I}_{T}:=\{({\bm{x}}_{t},y_{t})\}_{t\in[T]} be a sequence of examples, where ๐’™tโˆˆ๐’ณโІโ„d,ytโˆˆ[โˆ’1,1]{\bm{x}}_{t}\in\mathcal{X}\subseteq\mathbb{R}^{d},y_{t}\in[-1,1] and [T]:={1,โ€ฆ,T}[T]:=\{1,\ldots,T\}. Let ฮบโ€‹(โ‹…,โ‹…):โ„dร—โ„dโ†’โ„\kappa(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R} be a positive semidefinite kernel function and โ„‹ฮบ\mathcal{H}_{\kappa} be a dense subset of the associated RKHS, such that, for any fโˆˆโ„‹ฮบf\in\mathcal{H}_{\kappa},

  1. (i)

    โŸจf,ฮบโ€‹(๐’™,โ‹…)โŸฉโ„‹ฮบ=fโ€‹(๐’™),โˆ€๐’™โˆˆ๐’ณ\langle f,\kappa({\bm{x}},\cdot)\rangle_{\mathcal{H}_{\kappa}}=f({\bm{x}}),\forall{\bm{x}}\in\mathcal{X},

  2. (ii)

    โ„‹ฮบ=spanโ€‹(ฮบโ€‹(๐’™t,โ‹…)|tโˆˆ[T])ยฏ\mathcal{H}_{\kappa}=\overline{\mathrm{span}(\kappa({\bm{x}}_{t},\cdot)|t\in[T])}.

We define โŸจโ‹…,โ‹…โŸฉโ„‹ฮบ\langle\cdot,\cdot\rangle_{\mathcal{H}_{\kappa}} as the inner product in โ„‹ฮบ\mathcal{H}_{\kappa}, which induces the norm โ€–fโ€–โ„‹ฮบ=โŸจf,fโŸฉโ„‹ฮบ\|f\|_{\mathcal{H}_{\kappa}}=\sqrt{\langle f,f\rangle_{\mathcal{H}_{\kappa}}}. For simplicity, we will omit the subscript โ„‹ฮบ\mathcal{H}_{\kappa} in the inner product. Let โ„“โ€‹(โ‹…,โ‹…):โ„ร—[โˆ’1,1]โ†’โ„\ell(\cdot,\cdot):\mathbb{R}\times[-1,1]\rightarrow\mathbb{R} be the loss function. Denote by

โ„ฌฯˆโ€‹(f,g)=ฯˆโ€‹(f)โˆ’ฯˆโ€‹(g)โˆ’โŸจโˆ‡ฯˆโ€‹(g),fโˆ’gโŸฉ,โˆ€f,gโˆˆโ„‹ฮบ,\mathcal{B}_{\psi}(f,g)=\psi(f)-\psi(g)-\langle\nabla\psi(g),f-g\rangle,~{}\forall f,g\in\mathcal{H}_{\kappa},

the Bregman divergence associated with a strongly convex regularizer ฯˆโ€‹(โ‹…):โ„‹ฮบโ†’โ„\psi(\cdot):\mathcal{H}_{\kappa}\rightarrow\mathbb{R}.

Let \mathcal{K}=\{\kappa_{i}\}^{K}_{i=1} contain K base kernels. If an oracle gave the optimal kernel \kappa^{\ast}\in\mathcal{K} for \mathcal{I}_{T}, then we could simply run an online kernel learning algorithm in \mathcal{H}_{\kappa^{\ast}}. Since \kappa^{\ast} is unknown, the learner aims to design a kernel selection algorithm that generates a sequence of hypotheses \{f_{t}\}^{T}_{t=1} competitive with any f\in\mathcal{H}_{\kappa^{\ast}}, that is, the learner hopes to guarantee \mathrm{Reg}(f)=o(T). Any f\in\mathcal{H}_{i} can be written as f=\sum^{T}_{t=1}a_{t}\kappa_{i}({\bm{x}}_{t},\cdot). Thus storing f_{t} requires storing \{({\bm{x}}_{\tau},y_{\tau})\}^{t-1}_{\tau=1}, incurring an O(t) memory cost. In this paper, we consider online kernel selection with memory constraint. Next we define the memory constraint and reduce it to the size of a buffer.

Definition 1 (Memory Budget [Li2022Worst]).

A memory budget of โ„›\mathcal{R} quanta is the maximal memory that any online kernel selection algorithm can use.

Assumption 1 ([Li2022Worst]).

\forall\kappa_{i}\in\mathcal{K}, there is a constant \alpha>0 such that any budgeted online kernel learning algorithm running in \mathcal{H}_{i} can maintain a buffer of size B\leq\alpha\mathcal{R} within a memory constraint of \mathcal{R} quanta. If the space complexity of the algorithm is linear in B, then equality holds.

Assumption 1 implies that there is no need to assume \mathcal{R}=\infty unless T=\infty, since the maximal useful value of \mathcal{R} is the one for which B=T. Budgeted online kernel learning algorithms are algorithms that operate on a subset of the observed examples, e.g., Forgetron [Dekel2008The] and BSGD [Wang2012Breakingecond], to name but a few. In Assumption 1, \alpha is independent of the kernel function, since the memory is used to store the examples and the coefficients.

Assumption 2.

There is a constant k1>0k_{1}>0, such that โˆ€ฮบiโˆˆ๐’ฆ\forall\kappa_{i}\in\mathcal{K} and โˆ€๐ฎโˆˆ๐’ณ\forall{\bm{u}}\in\mathcal{X}, ฮบiโ€‹(๐ฎ,๐ฎ)โˆˆ[k1,1]\kappa_{i}({\bm{u}},{\bm{u}})\in[k_{1},1].

We will define two types of data complexities and use them to bound the regret. The first one, called kernel alignment, is defined as follows:

๐’œT,ฮบi=โˆ‘t=1Tฮบiโ€‹(๐’™t,๐’™t)โˆ’1Tโ€‹๐’€Tโ€‹๐‘ฒiโ€‹๐’€,\mathcal{A}_{T,\kappa_{i}}=\sum^{T}_{t=1}\kappa_{i}({\bm{x}}_{t},{\bm{x}}_{t})-\frac{1}{T}{\bm{Y}}^{\mathrm{T}}{\bm{K}}_{i}{\bm{Y}},

where {\bm{Y}}=(y_{1},\ldots,y_{T})^{\mathrm{T}} and {\bm{K}}_{i} is the kernel matrix induced by \kappa_{i}. \mathcal{A}_{T,\kappa_{i}} quantifies how well {\bm{K}}_{i} matches the examples. If {\bm{K}}_{i}={\bm{Y}}{\bm{Y}}^{\mathrm{T}} is the ideal kernel matrix, then \mathcal{A}_{T,\kappa_{i}}=\Theta(1). The second one, called the small-loss, is defined by \min_{f\in\mathbb{H}_{i}}L_{T}(f), where L_{T}(f) follows (2) and

โ„i={fโˆˆโ„‹i:โ€–fโ€–โ„‹iโ‰คU}.\mathbb{H}_{i}=\left\{f\in\mathcal{H}_{i}:\|f\|_{\mathcal{H}_{i}}\leq U\right\}.

Both \mathcal{A}_{T,\kappa_{i}} and L_{T}(f) are independent of the algorithm.
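To make the definition concrete, the following is a minimal sketch (in Python, with illustrative names; not part of the paper's implementation) of how \mathcal{A}_{T,\kappa} could be computed from a precomputed kernel matrix.

```python
import numpy as np

def kernel_alignment(K_mat: np.ndarray, y: np.ndarray) -> float:
    """A_{T,kappa} = sum_t kappa(x_t, x_t) - (1/T) * Y^T K Y.

    K_mat : (T, T) kernel matrix induced by kappa on x_1, ..., x_T.
    y     : (T,) vector of outputs y_t in [-1, 1].
    """
    T = len(y)
    # trace(K_mat) equals sum_t kappa(x_t, x_t).
    return float(np.trace(K_mat) - (y @ K_mat @ y) / T)
```

For the ideal kernel matrix {\bm{K}}={\bm{Y}}{\bm{Y}}^{\mathrm{T}} with y_{t}\in\{-1,1\}, the trace is T and {\bm{Y}}^{\mathrm{T}}{\bm{K}}{\bm{Y}}=T^{2}, so the alignment is T-T^{2}/T=0, in line with the \Theta(1) claim above.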

Finally, we define online learnability, which is a variant of the learnability defined in [Rakhlin2010Online].

Definition 2 (Online Learnability).

Given โ„T\mathcal{I}_{T}, a hypothesis space โ„‹\mathcal{H} is said to be online learnable if limTโ†’โˆžRegโ€‹(f)T=0\lim_{T\rightarrow\infty}\frac{\mathrm{Reg}(f)}{T}=0, โˆ€fโˆˆโ„‹\forall f\in\mathcal{H}.

The definition is equivalent to \mathrm{Reg}(f)=o(T). We can also replace \mathrm{Reg}(f) with \mathbb{E}[\mathrm{Reg}(f)]. The learnability defined in [Rakhlin2010Online] holds for the entire sample space, i.e., the worst-case examples, while our definition only considers a fixed \mathcal{I}_{T}. In the worst case, the two notions of learnability are equivalent. Given that \mathcal{I}_{T} may not always be generated in the worst case, our notion of learnability can adapt to the hardness of \mathcal{I}_{T}.

4 Algorithmic Framework

We reduce OKS to prediction with expert advice (PEA) where ฮบi\kappa_{i} corresponds to the ii-th expert. The main idea of our algorithmic framework is summarized as follows: (i) generating {ft,iโˆˆโ„‹i}t=1T\{f_{t,i}\in\mathcal{H}_{i}\}^{T}_{t=1} for all iโˆˆ[K]i\in[K]; (ii) aggregating the predictions {ft,iโ€‹(๐’™t)}i=1K\{f_{t,i}({\bm{x}}_{t})\}^{K}_{i=1}. The challenge is how to control the size of {ft,i}i=1K\{f_{t,i}\}^{K}_{i=1}. To this end, we propose a new buffer maintaining framework containing an adaptive sampling and a removing process.

4.1 Adaptive Sampling

For all i\in[K], we use the optimistic mirror descent (OMD) framework [Chiang2012Online, Rakhlin2013Online] to generate \{f_{t,i}\}^{T}_{t=1}. OMD maintains two sequences of hypotheses, i.e., \{f_{t,i}\}^{T}_{t=1} and \{f^{\prime}_{t,i}\}^{T}_{t=1}, defined as follows,

ft,i=\displaystyle f_{t,i}= argโกminfโˆˆโ„‹i{โŸจf,โˆ‡^t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)},\displaystyle\mathop{\arg\min}_{f\in\mathcal{H}_{i}}\left\{\left\langle f,\hat{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,f^{\prime}_{t-1,i}\right)\right\}, (4)
ft,iโ€ฒ=\displaystyle f^{\prime}_{t,i}= argโกminfโˆˆโ„i{โŸจf,โˆ‡t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)},\displaystyle\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\nabla_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,f^{\prime}_{t-1,i}\right)\right\}, (5)

where โˆ‡t,i\nabla_{t,i} is the (sub)-gradient of โ„“โ€‹(ft,iโ€‹(๐’™t),yt)\ell(f_{t,i}({\bm{x}}_{t}),y_{t}), โˆ‡^t,i\hat{\nabla}_{t,i} is an optimistic estimator of โˆ‡t,i\nabla_{t,i} and

ฯˆiโ€‹(f)=\displaystyle\psi_{i}(f)= 12โ€‹ฮปiโ€‹โ€–fโ€–โ„‹i2,\displaystyle\frac{1}{2\lambda_{i}}\|f\|^{2}_{\mathcal{H}_{i}},
โˆ‡t,i=\displaystyle\nabla_{t,i}= โ„“โ€ฒโ€‹(ft,iโ€‹(๐’™t),yt)โ‹…ฮบiโ€‹(๐’™t,โ‹…).\displaystyle\ell^{\prime}(f_{t,i}({\bm{x}}_{t}),y_{t})\cdot\kappa_{i}({\bm{x}}_{t},\cdot).

Note that the hypothesis space in (4) and (5) is โ„‹i\mathcal{H}_{i} and โ„i\mathbb{H}_{i}, respectively. This trick [Li2023Improved] is critical to obtaining regret bounds depending on kernel alignment. At the beginning of round tt, we execute (4) and compute ft,iโ€‹(๐’™t)f_{t,i}({\bm{x}}_{t}). The key is to control the size of ft,iโ€ฒf^{\prime}_{t,i}. By Assumption 1, we construct a buffer denoted by SiS_{i} that stores the examples constructing ft,iโ€ฒf^{\prime}_{t,i}. Thus we just need to limit the size of SiS_{i}. To this end, we propose an adaptive sampling scheme. Let bt,ib_{t,i} be a Bernoulli random variable satisfying

โ„™โ€‹[bt,i=1]=โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹iฮฝZt,ฮฝโˆˆ{1,2},\mathbb{P}[b_{t,i}=1]=\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{\nu}_{\mathcal{H}_{i}}}{Z_{t}},\quad\nu\in\{1,2\}, (6)

where ZtZ_{t} is a normalizing constant. We further define

conโ€‹(aโ€‹(i)):=\displaystyle\mathrm{con}(a(i)):= โ€–ฮบiโ€‹(๐’™iโ€‹(st),โ‹…)โˆ’ฮบiโ€‹(๐’™t,โ‹…)โ€–โ„‹iโ‰คฮณt,i,\displaystyle\left\|\kappa_{i}({\bm{x}}_{i(s_{t})},\cdot)-\kappa_{i}({\bm{x}}_{t},\cdot)\right\|_{\mathcal{H}_{i}}\leq\gamma_{t,i},
๐’™iโ€‹(st)=\displaystyle{\bm{x}}_{i(s_{t})}= argโกmin๐’™ฯ„โˆˆSiโ€–ฮบiโ€‹(๐’™ฯ„,โ‹…)โˆ’ฮบiโ€‹(๐’™t,โ‹…)โ€–โ„‹i.\displaystyle\mathop{\arg\min}_{{\bm{x}}_{\tau}\in S_{i}}\left\|\kappa_{i}({\bm{x}}_{\tau},\cdot)-\kappa_{i}({\bm{x}}_{t},\cdot)\right\|_{\mathcal{H}_{i}}.

\mathrm{con}(a(i)) means that if there is an instance in S_{i} that is similar to {\bm{x}}_{t}, then we can use it as a proxy for {\bm{x}}_{t}. In this way, ({\bm{x}}_{t},y_{t}) will not be added into S_{i}.

If \nabla_{t,i}=0, then S_{i} remains unchanged and we execute (5). If \nabla_{t,i}\neq 0 and \mathrm{con}(a(i)) holds, then S_{i} remains unchanged and we replace (5) with (7).

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡iโ€‹(st),iโŸฉ+โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\nabla_{i(s_{t}),i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,f^{\prime}_{t-1,i}\right)\right\}. (7)

If \nabla_{t,i}\neq 0 and \neg\mathrm{con}(a(i)), then we replace (5) with (8).

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,f^{\prime}_{t-1,i}\right)\right\}. (8)

In this case, if bt,i=1b_{t,i}=1, then we add (๐’™t,yt)({\bm{x}}_{t},y_{t}) into SiS_{i}. In the above two equations, we define

โˆ‡iโ€‹(st),i=\displaystyle\nabla_{i(s_{t}),i}= โ„“โ€ฒโ€‹(ft,iโ€‹(๐’™t),yt)โ‹…ฮบiโ€‹(๐’™iโ€‹(st),โ‹…),\displaystyle\ell^{\prime}(f_{t,i}({\bm{x}}_{t}),y_{t})\cdot\kappa_{i}({\bm{x}}_{i(s_{t})},\cdot),
โˆ‡~t,i=\displaystyle\tilde{\nabla}_{t,i}= โˆ‡t,iโˆ’โˆ‡^t,iโ„™โ€‹[bt,i=1]โ‹…๐•€bt,i=1+โˆ‡^t,i.\displaystyle\frac{\nabla_{t,i}-\hat{\nabla}_{t,i}}{\mathbb{P}[b_{t,i}=1]}\cdot\mathbb{I}_{b_{t,i}=1}+\hat{\nabla}_{t,i}.
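As a concrete illustration of the sampling scheme (6) and the estimator \tilde{\nabla}_{t,i}, here is a minimal sketch; the normalizer Z_{t} is passed in because its concrete form depends on the loss function (see Section 5), and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def adaptive_sampling_step(norm_diff: float, Z: float, nu: int):
    """One draw of the Bernoulli variable b_{t,i} in (6).

    norm_diff : ||nabla_{t,i} - hat_nabla_{t,i}||_{H_i}
    Z         : normalizing constant Z_t
    nu        : 1 or 2
    Returns (b, scale) where `scale` multiplies (nabla - hat_nabla) in
    tilde_nabla_{t,i} = scale * (nabla - hat_nabla) + hat_nabla.
    """
    p = (norm_diff ** nu) / Z if Z > 0 else 0.0   # P[b_{t,i} = 1]
    b = bool(rng.random() < p)
    scale = 1.0 / p if (b and p > 0) else 0.0     # importance weight 1[b=1]/p
    return b, scale
```

Since \mathbb{E}[\mathbb{I}_{b_{t,i}=1}/\mathbb{P}[b_{t,i}=1]]=1, the resulting \tilde{\nabla}_{t,i} is an unbiased estimator of \nabla_{t,i}, and ({\bm{x}}_{t},y_{t}) is added to S_{i} only when b_{t,i}=1.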

4.2 Removing Half of the Examples

We allocate a memory budget \mathcal{R}_{i} for f^{\prime}_{t,i} and require \sum^{K}_{i=1}\mathcal{R}_{i}\leq\mathcal{R}. By Assumption 1, we just need to ensure |S_{i}|\leq\alpha\mathcal{R}_{i}. At any round t, if |S_{i}|=\alpha\mathcal{R}_{i} and b_{t,i}=1, then we remove half of the examples from S_{i} [Li2023Improved]. Let S_{i}=\{({\bm{x}}_{r_{j}},y_{r_{j}})\}^{\alpha\mathcal{R}_{i}}_{j=1}. We rewrite f^{\prime}_{t-1,i} as follows,

ftโˆ’1,iโ€ฒ=\displaystyle f^{\prime}_{t-1,i}= ftโˆ’1,iโ€ฒโ€‹(1)+ftโˆ’1,iโ€ฒโ€‹(2),\displaystyle f^{\prime}_{t-1,i}(1)+f^{\prime}_{t-1,i}(2),
ftโˆ’1,iโ€ฒโ€‹(1)=\displaystyle f^{\prime}_{t-1,i}(1)= โˆ‘j=1ฮฑโ€‹โ„›i/2ฮฒrjโ€‹ฮบiโ€‹(๐’™rj,โ‹…),\displaystyle\sum^{\alpha\mathcal{R}_{i}/2}_{j=1}\beta_{r_{j}}\kappa_{i}({\bm{x}}_{r_{j}},\cdot),
ftโˆ’1,iโ€ฒโ€‹(2)=\displaystyle f^{\prime}_{t-1,i}(2)= โˆ‘j=ฮฑโ€‹โ„›i/2+1ฮฑโ€‹โ„›iฮฒrjโ€‹ฮบiโ€‹(๐’™rj,โ‹…).\displaystyle\sum^{\alpha\mathcal{R}_{i}}_{j=\alpha\mathcal{R}_{i}/2+1}\beta_{r_{j}}\kappa_{i}({\bm{x}}_{r_{j}},\cdot).

Let otโˆˆ{1,2}o_{t}\in\{1,2\} and qtโˆˆ{1,2}โˆ–{ot}q_{t}\in\{1,2\}\setminus\{o_{t}\}. We will remove ftโˆ’1,iโ€ฒโ€‹(ot)f^{\prime}_{t-1,i}(o_{t}) from ftโˆ’1,iโ€ฒf^{\prime}_{t-1,i}, that is, the examples constructing ftโˆ’1,iโ€ฒโ€‹(ot)f^{\prime}_{t-1,i}(o_{t}) are removed from SiS_{i}. We further project ftโˆ’1,iโ€ฒโ€‹(qt)f^{\prime}_{t-1,i}(q_{t}) onto โ„i\mathbb{H}_{i} by

fยฏtโˆ’1,iโ€ฒโ€‹(qt)=argโกminfโˆˆโ„iโ€–fโˆ’ftโˆ’1,iโ€ฒโ€‹(qt)โ€–โ„‹2.\bar{f}^{\prime}_{t-1,i}(q_{t})=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\|f-f^{\prime}_{t-1,i}(q_{t})\right\|^{2}_{\mathcal{H}}.

Then we execute the second mirror descent and redefine (8) as follows

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(qt))}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,\bar{f}^{\prime}_{t-1,i}(q_{t})\right)\right\}. (9)

Our buffer maintaining framework always ensures |S_{i}|\leq\alpha\mathcal{R}_{i}. A natural alternative is the restart technique, which removes all of the examples in S_{i}, i.e., resets S_{i}=\emptyset. We will prove that removing half of the examples achieves better regret bounds.
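For concreteness, below is a minimal sketch of the halving step, written for the choice o_{t}=2 used in Section 5.1 (keep the first half of the buffer and drop the second); the coefficient representation f^{\prime}=\sum_{j}\beta_{j}\kappa({\bm{x}}_{r_{j}},\cdot) and the function names are illustrative. The projection onto \mathbb{H}_{i} reduces to rescaling the coefficients.

```python
import numpy as np

def halve_and_project(coeffs, examples, kernel, U):
    """Drop half of the buffered examples and project the kept expansion onto
    the ball {f : ||f||_{H_i} <= U}.

    coeffs   : list of beta_j in f' = sum_j beta_j * kappa(x_{r_j}, .)
    examples : buffered examples x_{r_j}, in the same order as coeffs
    kernel   : function kappa(x, x')
    U        : radius of the hypothesis ball H_i
    """
    half = len(coeffs) // 2
    kept_c, kept_x = list(coeffs[:half]), list(examples[:half])

    # ||f'(1)||^2 = sum_{j,l} beta_j beta_l kappa(x_j, x_l).
    G = np.array([[kernel(a, b) for b in kept_x] for a in kept_x])
    c = np.array(kept_c)
    norm = float(np.sqrt(max(c @ G @ c, 0.0)))

    if norm > U:                       # projection = rescaling of coefficients
        kept_c = [beta * U / norm for beta in kept_c]
    return kept_c, kept_x
```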

4.3 Reduction to PEA

Similar to previous online multi-kernel learning algorithms [Sahoo2014Online, Shen2019Random, Jin2010Online], we use the PEA framework [Cesa-Bianchi2006Prediction] to aggregate the K predictions. Let \Delta_{K-1} be the (K-1)-dimensional simplex. At each round t, we maintain a probability distribution {\bm{p}}_{t}\in\Delta_{K-1} over \{f_{t,i}\}^{K}_{i=1}, and output the prediction f_{t}({\bm{x}}_{t})=\sum^{K}_{i=1}p_{t,i}f_{t,i}({\bm{x}}_{t}) or \hat{y}_{t}=\mathrm{sign}(f_{t}({\bm{x}}_{t})). Following the multiplicative-weight update in Subsection 2.1 of [Cesa-Bianchi2006Prediction], {\bm{p}}_{t+1} is defined as follows

\begin{split}p_{t+1,i}&=\frac{w_{t+1,i}}{\sum^{K}_{j=1}w_{t+1,j}},\\ w_{t+1,i}&=\exp\left(-\eta_{t+1}\sum^{t}_{\tau=1}c_{\tau,i}\right),\\ \eta_{t+1}&=\frac{\sqrt{2\ln{K}}}{\sqrt{1+\sum^{t}_{\tau=1}\sum^{K}_{i=1}p_{\tau,i}c^{2}_{\tau,i}}},\quad\eta_{1}=\sqrt{2\ln{K}},\end{split} (10)

where c_{t,i}=g(f_{t,i}({\bm{x}}_{t}),y_{t}) and g(\cdot,\cdot):\mathbb{R}^{2}\rightarrow\mathbb{R}^{+}\cup\{0\} will be defined later.
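A minimal sketch of the update (10) is given below; `cum_losses` and `cum_var` carry the running sums \sum_{\tau\leq t}c_{\tau,i} and 1+\sum_{\tau\leq t}\sum_{i}p_{\tau,i}c^{2}_{\tau,i} (initialize them to zeros and 1.0), and subtracting the minimum cumulative loss before exponentiating is only a numerical-stability trick that leaves the normalized distribution unchanged.

```python
import numpy as np

def pea_update(cum_losses, cum_var, costs, p_t):
    """One step of the multiplicative-weight update in (10).

    cum_losses : (K,) array, sum_{tau < t} c_{tau,i}
    cum_var    : scalar, 1 + sum_{tau < t} sum_i p_{tau,i} c_{tau,i}^2
    costs      : (K,) array of c_{t,i} = g(f_{t,i}(x_t), y_t) >= 0
    p_t        : (K,) array, the distribution used at round t
    Returns the updated (cum_losses, cum_var, p_{t+1}).
    """
    K = len(costs)
    cum_losses = cum_losses + costs
    cum_var = cum_var + float(p_t @ (costs ** 2))
    eta = np.sqrt(2.0 * np.log(K)) / np.sqrt(cum_var)
    w = np.exp(-eta * (cum_losses - cum_losses.min()))
    return cum_losses, cum_var, w / w.sum()
```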

5 Applications

In this section, we apply the proposed algorithmic framework to two types of loss functions and derive two kinds of data-dependent regret bounds.

5.1 The Hinge Loss Function

We consider online binary classification tasks and let the loss function be the hinge loss. Thus \nabla_{t,i}=-y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)\cdot\mathbb{I}_{y_{t}f_{t,i}({\bm{x}}_{t})<1}. We use the Reservoir Sampling (RS) technique [Hazan2009Better, Vitter1985Random] to construct \hat{\nabla}_{t,i}. To this end, we create a new buffer V_{t} of size M\geq 1. We initialize \hat{\nabla}_{1,i}=0, and for t\geq 2, we define

โˆ‡^t,i=โˆ’1|Vt|โ€‹โˆ‘(๐’™,y)โˆˆVtyโ‹…ฮบiโ€‹(๐’™,โ‹…).\hat{\nabla}_{t,i}=-\frac{1}{|V_{t}|}\sum_{({\bm{x}},y)\in V_{t}}y\cdot\kappa_{i}({\bm{x}},\cdot).

RS works as follows. At the end of round t\geq 1, we add ({\bm{x}}_{t},y_{t}) into V_{t} with probability \min\left\{1,\frac{M}{t}\right\}. If |V_{t}|=M and the current example is to be added, then an old example is removed from V_{t} uniformly at random. We create another buffer S_{0} that stores all of the examples ever added into V_{t}. It is easy to prove that \mathbb{E}[|S_{0}|]\leq M(1+\ln{T}). Let \mathcal{R}_{0}=\frac{M(1+\ln{T})}{\alpha} and \mathcal{R}_{i}=\frac{\mathcal{R}-\mathcal{R}_{0}}{K}, where we require \mathcal{R}\geq 2\mathcal{R}_{0}. It is easy to verify that \sum^{K}_{i=0}\mathcal{R}_{i}=\mathcal{R}.
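A minimal sketch of the RS update described above (illustrative names; `V` and `S0` are Python lists):

```python
import random

def reservoir_update(V, S0, example, t, M):
    """Reservoir Sampling at the end of round t (1-indexed) with capacity M.

    V  : current reservoir V_t
    S0 : buffer recording every example ever added to the reservoir
    """
    if random.random() < min(1.0, M / t):    # add (x_t, y_t) with prob. min{1, M/t}
        if len(V) == M:
            V.pop(random.randrange(M))       # evict a uniformly random old example
        V.append(example)
        S0.append(example)                   # E[|S0|] <= M (1 + ln T)
    return V, S0
```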

We instantiate the sampling scheme (6) as follows,

ฮฝ=2,Zt=โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2+โ€–โˆ‡^t,iโ€–โ„‹i2.\nu=2,\quad Z_{t}=\left\|\nabla_{t,i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}+\left\|\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}.

The additional term \left\|\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}} in Z_{t} is important for controlling the number of removal operations, but it also increases the regret. The trick of using different hypothesis spaces in (4) and (5) yields a negative term in the regret bound, which cancels out this increase in the regret. Let o_{t}=2, i.e., we remove f^{\prime}_{t-1,i}(2) from f^{\prime}_{t-1,i}. Then (9) becomes

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(1))}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,\bar{f}^{\prime}_{t-1,i}(1)\right)\right\}. (11)

At the tt-th round, our algorithm outputs the prediction y^t=signโ€‹(ftโ€‹(๐’™t))\hat{y}_{t}=\mathrm{sign}(f_{t}({\bm{x}}_{t})). Let ๐’‘1{\bm{p}}_{1} be the uniform distribution and ๐’‘t+1{\bm{p}}_{t+1} follow (10) where ct,i=โ„“โ€‹(ft,iโ€‹(๐’™t),yt)c_{t,i}=\ell(f_{t,i}({\bm{x}}_{t}),y_{t}). We name this algorithm M-OMD-H (Memory bounded OMD for the Hinge loss) and present the pseudo-code in Algorithm 1.

Algorithm 1 M-OMD-H
0:ย ย {ฮปi}i=1K\{\lambda_{i}\}^{K}_{i=1}, {ฮทt}t=1T\{\eta_{t}\}^{T}_{t=1}, โ„›\mathcal{R}, UU.
0:ย ย f0,iโ€ฒ=โˆ‡^1,i=0f^{\prime}_{0,i}=\hat{\nabla}_{1,i}=0, Si=V=โˆ…S_{i}=V=\emptyset.
1:ย ย forย t=1,2,โ€ฆ,Tt=1,2,\ldots,Tย do
2:ย ย ย ย ย Receive ๐’™t{\bm{x}}_{t}โ€„
3:ย ย ย ย ย forย i=1,โ€ฆ,Ki=1,\ldots,Kย do
4:ย ย ย ย ย ย ย ย Find ๐’™iโ€‹(st){\bm{x}}_{i(s_{t})}
5:ย ย ย ย ย ย ย ย Compute โˆ‡^t,i=โˆ’1|V|โ€‹โˆ‘(๐’™,y)โˆˆVyโ€‹ฮบiโ€‹(๐’™,โ‹…)\hat{\nabla}_{t,i}=\frac{-1}{|V|}\sum_{({\bm{x}},y)\in V}y\kappa_{i}({\bm{x}},\cdot)
6:ย ย ย ย ย ย ย ย Compute ft,iโ€‹(๐’™t)f_{t,i}({\bm{x}}_{t}) according to (4)
7:ย ย ย ย ย endย for
8:ย ย ย ย ย Output y^t=signโ€‹(ftโ€‹(๐’™t))\hat{y}_{t}=\mathrm{sign}(f_{t}({\bm{x}}_{t})) and receive yty_{t}โ€„
9:ย ย ย ย ย forย i=1,โ€ฆ,Ki=1,\ldots,Kย do
10:ย ย ย ย ย ย ย ย ifย ytโ€‹ft,iโ€‹(๐’™t)<1y_{t}f_{t,i}({\bm{x}}_{t})<1ย then
11:ย ย ย ย ย ย ย ย ย ย ย ifย conโ€‹(aโ€‹(i))\mathrm{con}(a(i))ย then
12:ย ย ย ย ย ย ย ย ย ย ย ย ย ย Update ft,iโ€ฒf^{\prime}_{t,i} following (7)
13:ย ย ย ย ย ย ย ย ย ย ย else
14:ย ย ย ย ย ย ย ย ย ย ย ย ย ย Sample bt,ib_{t,i}
15:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ifย bt,i=1b_{t,i}=1 and |Si|โ‰คฮฑโ€‹โ„›iโˆ’1|S_{i}|\leq\alpha\mathcal{R}_{i}-1ย then
16:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย Update ft,iโ€ฒf^{\prime}_{t,i} following (8)
17:ย ย ย ย ย ย ย ย ย ย ย ย ย ย endย if
18:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ifย bt,i=1b_{t,i}=1ย and\mathrm{and}ย |Si|=ฮฑโ€‹โ„›i|S_{i}|=\alpha\mathcal{R}_{i}ย then
19:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย Compute fยฏtโˆ’1,iโ€ฒโ€‹(1)\bar{f}^{\prime}_{t-1,i}(1)
20:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย Update ft,iโ€ฒf^{\prime}_{t,i} following (11)
21:ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย ย Remove the latest ฮฑโ€‹โ„›i2\frac{\alpha\mathcal{R}_{i}}{2} examples from SiS_{i}
22:ย ย ย ย ย ย ย ย ย ย ย ย ย ย endย if
23:ย ย ย ย ย ย ย ย ย ย ย ย ย ย Update Si=Siโˆช{(๐’™t,yt):bt,i=1}S_{i}=S_{i}\cup\{({\bm{x}}_{t},y_{t}):b_{t,i}=1\}
24:ย ย ย ย ย ย ย ย ย ย ย endย if
25:ย ย ย ย ย ย ย ย endย if
26:ย ย ย ย ย ย ย ย Compute ๐’‘t+1{\bm{p}}_{t+1} by (10)
27:ย ย ย ย ย ย ย ย Update reservoir VV and S0S_{0}
28:ย ย ย ย ย endย for
29:ย ย endย for
Lemma 1.

Let M>1 and \alpha\mathcal{R}:=B\geq 2M(1+\ln{T}). For any \mathcal{I}_{T}, the expected number of removal operations that M-OMD-H executes on S_{i} is at most \left\lceil\frac{4K\tilde{\mathcal{A}}_{T,\kappa_{i}}}{Bk_{1}}\right\rceil, in which

๐’œ~T,ฮบi=1+โˆ‘t=2Tโ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)|Vt|โ€–โ„‹i2.\tilde{\mathcal{A}}_{T,\kappa_{i}}=1+\sum^{T}_{t=2}\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\frac{\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)}{|V_{t}|}\right\|^{2}_{\mathcal{H}_{i}}.

Note that the number of removal operations depends on the kernel alignment, rather than on T.

Theorem 1.

Suppose โ„›\mathcal{R} satisfies the condition in Lemma 1. Let B>KB>K, U>1U>1, ฮปi=Uโ€‹K2โ€‹B\lambda_{i}=U\frac{\sqrt{K}}{\sqrt{2B}}, ฮทt\eta_{t} follow (10) and

ฮณt,i=โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i21+โˆ‘ฯ„โ‰คtโ€–โˆ‡ฯ„,iโˆ’โˆ‡^ฯ„,iโ€–โ„‹i2โ‹…๐•€โˆ‡ฯ„,iโ‰ 0.\gamma_{t,i}=\frac{\left\|\nabla_{t,i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}}{\sqrt{1+\sum_{\tau\leq t}\left\|\nabla_{\tau,i}-\hat{\nabla}_{\tau,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{\tau,i}\neq 0}}}. (12)

The expected regret of M-OMD-H satisfies,

\forall\kappa_{i}\in\mathcal{K},f\in\mathbb{H}_{i},\quad\mathbb{E}\left[\mathrm{Reg}(f)\right]=O\left(\sqrt{UL_{T}(f)\ln{K}}+\frac{U\sqrt{B}}{\sqrt{K}}+\frac{\sqrt{K}U\mathcal{A}_{T,\kappa_{i}}\ln{T}}{\sqrt{B}k_{1}}\right).

For any competitor f satisfying \|f\|_{\mathcal{H}_{i}}<1, we must have L_{T}(f)=\Theta(T), which induces a trivial upper bound. To address this issue, we must set U>1 and exclude such competitors. Theorem 1 reveals how the regret bound depends on K, \mathcal{R}, and \mathcal{A}_{T,\kappa_{i}}. The larger the memory budget is, the smaller the regret bound will be. Theorem 1 thus gives an answer to Q1. If the optimal kernel \kappa^{\ast}\in\mathcal{K} matches well with the examples, i.e., \mathcal{A}_{T,\kappa^{\ast}}=o(T), then it is possible to achieve an o(T) regret bound in the case of \mathcal{R}=\Theta(\ln{T}). Such a result gives an answer to Q2. By Definition 2, learning is possible in \mathbb{H}_{\kappa^{\ast}}. The information-theoretic cost of not knowing the optimal kernel is O\left(\sqrt{UL_{T}(f)\ln{K}}\right), which is a lower-order term.

Our regret bound also recovers state-of-the-art results. Let B=\Theta(\mathcal{A}_{T,\kappa_{i}}). Then we obtain an O\left(U\sqrt{K\mathcal{A}_{T,\kappa_{i}}}\right) expected bound. Previous work proved an O\left(U\sqrt{K}T^{\frac{1}{4}}\mathcal{A}^{\frac{1}{4}}_{T,\kappa_{i}}\right) high-probability bound [Liao2021High]. If \mathcal{A}_{T,\kappa_{i}}=O(T), then we achieve an expected bound of O\left(U\sqrt{K}\frac{T}{\sqrt{B}}\right), which matches the upper bound in [Li2022Worst]. If B=\Theta(T), then our regret bound matches the upper bounds in [Sahoo2014Online, Foster2017Parameter, Shen2019Random].

Next, we present an algorithm-dependent bound demonstrating that removing half of the examples can be more effective than the restart technique. We introduce two new notations as follows:

๐’ฅi=\displaystyle\mathcal{J}_{i}= {tโˆˆ[T]:|Si|=ฮฑโ€‹โ„›,bt,i=1},\displaystyle\{t\in[T]:|S_{i}|=\alpha\mathcal{R},b_{t,i}=1\},
ฮ›i=\displaystyle\Lambda_{i}= โˆ‘tโˆˆ๐’ฅi[โ€–fยฏtโˆ’1,iโ€ฒโ€‹(1)โˆ’fโ€–โ„‹i2โˆ’โ€–ftโˆ’1,iโ€ฒโˆ’fโ€–โ„‹i2].\displaystyle\sum_{t\in\mathcal{J}_{i}}\left[\left\|\bar{f}^{\prime}_{t-1,i}(1)-f\right\|^{2}_{\mathcal{H}_{i}}-\left\|f^{\prime}_{t-1,i}-f\right\|^{2}_{\mathcal{H}_{i}}\right].

There must be a constant \xi_{i}\in(0,4] such that \Lambda_{i}=\xi_{i}U^{2}|\mathcal{J}_{i}|. Recall that \bar{f}^{\prime}_{t-1,i}(1) is the initial hypothesis after a removal operation. Naturally, if \left(\bar{f}^{\prime}_{t-1,i}(1)-f\right) is close to \left(f^{\prime}_{t-1,i}-f\right), then \xi_{i}\ll 4. The restart technique sets \bar{f}^{\prime}_{t-1,i}(1)=0, implying \Lambda_{i}\leq U^{2}|\mathcal{J}_{i}|. In the worst case, our approach is slightly worse than the restart technique. If \xi_{i} is sufficiently small, then our approach is much better than the restart technique.

Theorem 2 (Algorithm-dependent Bound).

Suppose the conditions in Theorem 1 are satisfied. For each iโˆˆ[K]i\in[K], let ฮปi=2โ€‹U5โ€‹๐’œ~T,ฮบi\lambda_{i}=\frac{\sqrt{2}U}{\sqrt{5\tilde{\mathcal{A}}_{T,\kappa_{i}}}}. If ฮพiโ‹…|๐’ฅi|โ‰ค1\xi_{i}\cdot|\mathcal{J}_{i}|\leq 1 for all iโˆˆ[K]i\in[K], then the expected regret of M-OMD-H satisfies,

โˆ€ฮบiโˆˆ๐’ฆ,fโˆˆโ„i,๐”ผโ€‹[Regโ€‹(f)]=Oโ€‹(LTโ€‹(f)โ€‹lnโกK+Uโ€‹KBโ€‹๐’œT,ฮบiโ€‹lnโกT+Uโ€‹๐’œT,ฮบiโ€‹lnโกT).\forall\kappa_{i}\in\mathcal{K},f\in\mathbb{H}_{i},\quad\mathbb{E}\left[\mathrm{Reg}(f)\right]=O\left(\sqrt{L_{T}(f)\ln{K}}+\frac{UK}{B}\mathcal{A}_{T,\kappa_{i}}\ln{T}+U\sqrt{\mathcal{A}_{T,\kappa_{i}}\ln{T}}\right).

The dominant term in Theorem 2 is \tilde{O}\left(\frac{UK\mathcal{A}_{T,\kappa_{i}}}{B}\right), while the dominant term in Theorem 1 is \tilde{O}\left(\frac{UK\mathcal{A}_{T,\kappa_{i}}}{\sqrt{B}}\right). The restart technique only enjoys the regret bound given in Theorem 1. Thus removing half of the examples can be better than the restart technique.

5.2 Smooth Loss Functions

We first define the smooth loss functions.

Assumption 3 ([Zhang2013Online]).

Let G1G_{1} and G2G_{2} be positive constants. For any uu and yy, โ„“โ€‹(u,y)\ell(u,y) satisfies (i) |โ„“โ€ฒโ€‹(u,y)|โ‰คG1|\ell^{\prime}(u,y)|\leq G_{1}, (ii) |โ„“โ€ฒโ€‹(u,y)|โ‰คG2โ€‹โ„“โ€‹(u,y)|\ell^{\prime}(u,y)|\leq G_{2}\ell(u,y), where โ„“โ€ฒโ€‹(u,y)=dโ€‹โ„“โ€‹(u,y)dโ€‹u\ell^{\prime}(u,y)=\frac{\mathrm{d}\,\ell(u,y)}{\mathrm{d}\,u}.

The logistic loss function satisfies Assumption 3 with G_{2}=1. For the hinge loss function, M-OMD-H uniformly allocates the memory budget over the K kernels, which induces an O\left(\sqrt{K}\right) factor in the regret bound. For smooth loss functions, we propose a memory sharing scheme that avoids the O\left(\sqrt{K}\right) factor.

Recall that the memory is used to store the examples and the coefficients. The main challenge is how to share the examples. To this end, we keep only a single buffer S. We first rewrite f_{t}({\bm{x}}_{t}) as follows,

ftโ€‹(๐’™t)=โˆ‘i=1Kpt,iโ€‹ft,iโ€‹(๐’™t)=โˆ‘i=1Kpt,iโ€‹โˆ‘๐’™ฯ„โˆˆSiaฯ„,iโ€‹ฮบiโ€‹(๐’™ฯ„,๐’™t).f_{t}({\bm{x}}_{t})=\sum^{K}_{i=1}p_{t,i}f_{t,i}({\bm{x}}_{t})=\sum^{K}_{i=1}p_{t,i}\sum_{{\bm{x}}_{\tau}\in S_{i}}a_{\tau,i}\kappa_{i}({\bm{x}}_{\tau},{\bm{x}}_{t}).

We just need to ensure Si=SS_{i}=S for all iโˆˆ[K]i\in[K]. For each ft,if_{t,i}, we define a surrogate gradient โˆ‡t,i=โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…ฮบiโ€‹(๐’™t,โ‹…)\nabla_{t,i}=\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\kappa_{i}({\bm{x}}_{t},\cdot). Note that we use the first derivative โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t}), rather than โ„“โ€ฒโ€‹(ft,iโ€‹(๐’™t),yt)\ell^{\prime}(f_{t,i}({\bm{x}}_{t}),y_{t}). Let โˆ‡^t,i=0\hat{\nabla}_{t,i}=0 in (4). Then we obtain a single update rule as follows

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ft,i)}.f_{t+1,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\langle f,\nabla_{t,i}\rangle+\mathcal{B}_{\psi_{i}}(f,f_{t,i})\right\}.

To ensure Si=SS_{i}=S, we must change the sampling scheme (6). Let bt,i=btb_{t,i}=b_{t} and conโ€‹(aโ€‹(i))=conโ€‹(a)\mathrm{con}(a(i))=\mathrm{con}(a) for all iโˆˆ[K]i\in[K]. We define ฮฝ=1\nu=1 and

Zt=\displaystyle Z_{t}= โ€–โˆ‡t,iโ€–โ„‹i+G1โ€‹ฮบiโ€‹(๐’™t,๐’™t),\displaystyle\|\nabla_{t,i}\|_{\mathcal{H}_{i}}+G_{1}\sqrt{\kappa_{i}({\bm{x}}_{t},{\bm{x}}_{t})},
conโ€‹(a):=\displaystyle\mathrm{con}(a):= maxiโˆˆ[K]โกโ€–ฮบiโ€‹(๐’™iโ€‹(st),โ‹…)โˆ’ฮบiโ€‹(๐’™t,โ‹…)โ€–โ„‹iโ‰คฮณt.\displaystyle\max_{i\in[K]}\|\kappa_{i}({\bm{x}}_{i(s_{t})},\cdot)-\kappa_{i}({\bm{x}}_{t},\cdot)\|_{\mathcal{H}_{i}}\leq\gamma_{t}.

If conโ€‹(a)\mathrm{con}(a), then we replace (7) with (13).

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡iโ€‹(st),iโŸฉ+โ„ฌฯˆiโ€‹(f,ft,i)},โˆ‡iโ€‹(st),i=โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…ฮบiโ€‹(๐’™iโ€‹(st),โ‹…).\begin{split}f_{t+1,i}=&\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\nabla_{i(s_{t}),i}\right\rangle+\mathcal{B}_{\psi_{i}}(f,f_{t,i})\right\},\\ \nabla_{i(s_{t}),i}=&\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\kappa_{i}\left({\bm{x}}_{i(s_{t})},\cdot\right).\end{split} (13)

Otherwise, we replace (8) with (14).

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,ft,i)}.f_{t+1,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}(f,f_{t,i})\right\}. (14)

If |S|=ฮฑโ€‹โ„›|S|=\alpha\mathcal{R} and bt=1b_{t}=1, then let ot=1o_{t}=1. We remove ft,iโ€‹(1)f_{t,i}(1) from ft,if_{t,i} and redefine (9) as follows

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏt,iโ€‹(2))}.f_{t+1,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\left\langle f,\tilde{\nabla}_{t,i}\right\rangle+\mathcal{B}_{\psi_{i}}\left(f,\bar{f}_{t,i}(2)\right)\right\}. (15)

Let ๐’‘1{\bm{p}}_{1} be the uniform distribution and ๐’‘t+1{\bm{p}}_{t+1} follow (10), in which ct,ic_{t,i} follows (16).

Definition 3.

At each round tt, โˆ€ฮบiโˆˆ๐’ฆ\forall\kappa_{i}\in\mathcal{K}, let

ct,i={โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…(ft,iโ€‹(๐’™t)โˆ’minjโˆˆ[K]โกft,jโ€‹(๐’™t)),ifโ€‹โ„“โ€ฒโ€‹(ft)>0,โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…(ft,iโ€‹(๐’™t)โˆ’maxjโˆˆ[K]โกft,jโ€‹(๐’™t)),otherwise.c_{t,i}=\left\{\begin{array}[]{ll}\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\left(f_{t,i}({\bm{x}}_{t})-\min_{j\in[K]}f_{t,j}({\bm{x}}_{t})\right),&~{}\mathrm{if}~{}\ell^{\prime}(f_{t})>0,\\ \ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\left(f_{t,i}({\bm{x}}_{t})-\max_{j\in[K]}f_{t,j}({\bm{x}}_{t})\right),&~{}\mathrm{otherwise}.\end{array}\right. (16)
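A minimal sketch of (16) (illustrative names): the shift by the minimum or maximum prediction keeps every c_{t,i} nonnegative, as required by the PEA losses fed into (10).

```python
import numpy as np

def surrogate_costs(preds, grad):
    """Compute c_{t,i} in (16).

    preds : (K,) array of expert predictions f_{t,i}(x_t)
    grad  : scalar l'(f_t(x_t), y_t)
    """
    anchor = preds.min() if grad > 0 else preds.max()
    return grad * (preds - anchor)   # elementwise, always >= 0
```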

We name this algorithm M-OMD-S (Memory bounded OMD for Smooth loss function) and present the pseudo-code in Algorithm 2.

Algorithm 2 M-OMD-S
0:ย ย {ฮปi}i=1K\{\lambda_{i}\}^{K}_{i=1}, {ฮทt}t=1T\{\eta_{t}\}^{T}_{t=1}, โ„›\mathcal{R}, UU.
0:ย ย S=โˆ…S=\emptyset, f1,i=0,iโˆˆ[K]f_{1,i}=0,i\in[K].
1:ย ย forย t=1,2,โ€ฆ,Tt=1,2,\ldots,Tย do
2:ย ย ย ย ย Receive ๐’™t{\bm{x}}_{t}
3:ย ย ย ย ย forย i=1,โ€ฆ,Ki=1,\ldots,Kย do
4:ย ย ย ย ย ย ย ย Find ๐’™iโ€‹(st){\bm{x}}_{i(s_{t})}
5:ย ย ย ย ย ย ย ย Compute ft,iโ€‹(๐’™t)f_{t,i}({\bm{x}}_{t})
6:ย ย ย ย ย endย for
7:ย ย ย ย ย Output ftโ€‹(๐’™t)f_{t}({\bm{x}}_{t}) or y^t=signโ€‹(ftโ€‹(๐’™t))\hat{y}_{t}=\mathrm{sign}(f_{t}({\bm{x}}_{t})) and receive yty_{t}
8:ย ย ย ย ย ifย conโ€‹(a)\mathrm{con}(a)ย then
9:ย ย ย ย ย ย ย ย forย i=1,โ€ฆ,Ki=1,\ldots,Kย do
10:ย ย ย ย ย ย ย ย ย ย ย Compute ft+1,if_{t+1,i} following (13)
11:ย ย ย ย ย ย ย ย endย for
12:ย ย ย ย ย else
13:ย ย ย ย ย ย ย ย Sample btb_{t}
14:ย ย ย ย ย ย ย ย ifย bt=1b_{t}=1 and |S|โ‰คฮฑโ€‹โ„›โˆ’1|S|\leq\alpha\mathcal{R}-1ย then
15:ย ย ย ย ย ย ย ย ย ย ย โˆ€iโˆˆ[K]\forall i\in[K], update ft+1,if_{t+1,i} following (14)
16:ย ย ย ย ย ย ย ย endย if
17:ย ย ย ย ย ย ย ย ifย bt=1b_{t}=1ย and\mathrm{and}ย |S|=ฮฑโ€‹โ„›|S|=\alpha\mathcal{R}ย then
18:ย ย ย ย ย ย ย ย ย ย ย Remove the first 12โ€‹ฮฑโ€‹โ„›\frac{1}{2}\alpha\mathcal{R} support vectors from SS
19:ย ย ย ย ย ย ย ย ย ย ย โˆ€iโˆˆ[K]\forall i\in[K], update ft+1,if_{t+1,i} following (15)
20:ย ย ย ย ย ย ย ย endย if
21:ย ย ย ย ย ย ย ย Update S=Sโˆช{(๐’™t,yt):bt=1}S=S\cup\{({\bm{x}}_{t},y_{t}):b_{t}=1\}
22:ย ย ย ย ย endย if
23:ย ย ย ย ย Compute ๐’‘t+1{\bm{p}}_{t+1} by (10)
24:ย ย endย for
Lemma 2.

Suppose \ell(\cdot,\cdot) satisfies Assumption 3. Let \delta\in(0,1) and \alpha\mathcal{R}:=B\geq 21\ln\frac{1}{\delta}. For any \mathcal{I}_{T}, with probability at least 1-\frac{T}{B}\Theta(\lceil\ln{T}\rceil)\delta, the number of removal operations that M-OMD-S executes on S is at most \left\lceil\frac{4G_{2}\hat{L}_{1:T}}{(B-\frac{4}{3}\ln\frac{1}{\delta})G_{1}}\right\rceil, in which

L^1:T=โˆ‘t=1Tโ„“โ€‹(ftโ€‹(๐’™t),yt).\hat{L}_{1:T}=\sum^{T}_{t=1}\ell(f_{t}({\bm{x}}_{t}),y_{t}).

It is important that the number of removal operations depends on the cumulative losses of the algorithm, rather than on T.

Theorem 3.

Suppose โ„›\mathcal{R} and โ„“โ€‹(โ‹…,โ‹…)\ell(\cdot,\cdot) satisfy Lemma 2. Let ฮดโˆˆ(0,1)\delta\in(0,1), Kโ‰คdK\leq d, ฮทt\eta_{t} follow (10),

ฮณt=\displaystyle\gamma_{t}= 2โ€‹lnโกK1+โˆ‘ฯ„=1t|โ„“โ€ฒโ€‹(fฯ„โ€‹(๐’™ฯ„),yฯ„)|,\displaystyle\frac{\sqrt{2\ln{K}}}{\sqrt{1+\sum^{t}_{\tau=1}|\ell^{\prime}(f_{\tau}({\bm{x}}_{\tau}),y_{\tau})|}}, (17)
Uโ‰ค\displaystyle U\leq 18โ€‹G2โ€‹Bโˆ’43โ€‹lnโก1ฮด,\displaystyle\frac{1}{8G_{2}}\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}},

and ฮปi=2โ€‹UG1โ€‹B\lambda_{i}=\frac{2U}{G_{1}\sqrt{B}} for all iโˆˆ[K]i\in[K]. With probability at least 1โˆ’3โ€‹Tฮฑโ€‹โ„›โ€‹ฮ˜โ€‹(โŒˆlnโกTโŒ‰)โ€‹ฮด1-\frac{3T}{\alpha\mathcal{R}}\Theta(\lceil\ln{T}\rceil)\delta, the regret of M-OMD-S satisfies

โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„i,\displaystyle\forall\kappa_{i}\in\mathcal{K},\forall f\in\mathbb{H}_{i}, Regโ€‹(f)โ‰ค\displaystyle\quad\mathrm{Reg}(f)\leq
24โ€‹Uโ€‹G2โ€‹LTโ€‹(f)Bโˆ’43โ€‹lnโก1ฮด+Uโ€‹G1โ€‹B+100โ€‹U2โ€‹G2โ€‹G1โ€‹lnโก1ฮด(1โˆ’ฮณ)2+10โ€‹U(1โˆ’ฮณ)32โ€‹G2โ€‹G1โ€‹lnโก1ฮดโ€‹LTโ€‹(f)+Uโ€‹G1โ€‹B4,\displaystyle\frac{24UG_{2}L_{T}(f)}{\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}}+UG_{1}\sqrt{B}+\frac{100U^{2}G_{2}G_{1}\ln\frac{1}{\delta}}{(1-\gamma)^{2}}+\frac{10U}{(1-\gamma)^{\frac{3}{2}}}\sqrt{G_{2}G_{1}\ln\frac{1}{\delta}}\sqrt{L_{T}(f)+\frac{UG_{1}\sqrt{B}}{4}},

where ฮณ=6โ€‹Uโ€‹G2Bโˆ’43โ€‹lnโก1ฮด\gamma=\frac{6UG_{2}}{\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}}.

In the case of B\ll T, the dominant term in the upper bound is O\left(\frac{L_{T}(f)}{\sqrt{B}}\right). Previous work proved an \Omega\left(\frac{T}{B}\right) lower bound [Zhang2013Online]. Thus our result establishes a new trade-off between memory constraint and regret, which gives an answer to Q1. If \min_{f\in\mathbb{H}_{\kappa^{\ast}}}L_{T}(f)=o(T), then a \Theta(\ln{T}) memory budget is enough to achieve a sub-linear regret, which answers Q2. By Definition 2, we also conclude that \mathbb{H}_{\kappa^{\ast}} is online learnable. The penalty term for not knowing the optimal kernel is O\left(U\sqrt{L_{T}(f)\ln{K}}\right), which depends on K only through the O\left(\sqrt{\ln{K}}\right) factor. In this case, online kernel selection is not much harder than online kernel learning.

Theorem 3 can also recover some state-of-the-art results. Let B=\Theta(T). We obtain an O\left(U\sqrt{T\ln{K}}\right) regret bound, which is the same as the regret bound in [Sahoo2014Online, Foster2017Parameter, Shen2019Random]. If K=1 and B=\Theta(L_{T}(f)), then we obtain an O\left(U\sqrt{L_{T}(f)}\right) regret bound, matching the regret bound in [Zhang2013Online]. Next we prove that M-OMD-S is optimal.

Theorem 4 (Lower Bound).

Let \ell(u,y)=\ln(1+\exp(-yu)), which satisfies Assumption 3, and \kappa({\bm{x}},{\bm{v}})=\langle{\bm{x}},{\bm{v}}\rangle^{p}, where p>0 is a constant. Let U<\frac{\sqrt{3B}}{5} and \mathbb{H}=\{f\in\mathcal{H}_{\kappa}:\|f\|_{\mathcal{H}}\leq U\}. There exists an \mathcal{I}_{T} such that the regret of any online kernel algorithm storing B examples satisfies

โˆƒfโˆˆโ„,๐”ผโ€‹[Regโ€‹(f)]=ฮฉโ€‹(Uโ‹…LTโ€‹(f)B).\exists f\in\mathbb{H},\quad\mathbb{E}\left[\mathrm{Reg}(f)\right]=\Omega\left(U\cdot\frac{L_{T}(f)}{\sqrt{B}}\right).

The expectation is with respect to the randomness of the algorithm and the environment.

Our lower bound improves the previous lower bound ฮฉโ€‹(TB)\Omega\left(\frac{T}{B}\right) [Zhang2013Online]. Next we show an algorithm-dependent upper bound. Let

๐’ฅ=\displaystyle\mathcal{J}= {tโˆˆ[T]:|S|=ฮฑโ€‹โ„›,bt=1},\displaystyle\left\{t\in[T]:|S|=\alpha\mathcal{R},b_{t}=1\right\},
ฮ›i=\displaystyle\Lambda_{i}= โˆ‘tโˆˆ๐’ฅ[โ€–ft,iโ€‹(2)โˆ’fโ€–โ„‹i2โˆ’โ€–ft,iโˆ’fโ€–โ„‹i2].\displaystyle\sum_{t\in\mathcal{J}}\left[\left\|f_{t,i}(2)-f\right\|^{2}_{\mathcal{H}_{i}}-\left\|f_{t,i}-f\right\|^{2}_{\mathcal{H}_{i}}\right].

There must be a constant ฮพiโˆˆ(0,4]\xi_{i}\in(0,4] such that ฮ›i=ฮพiโ€‹U2โ€‹|๐’ฅ|\Lambda_{i}=\xi_{i}U^{2}|\mathcal{J}|. It is natural that if (ft,iโ€‹(2)โˆ’f)(f_{t,i}(2)-f) is close to (ft,iโˆ’f)(f_{t,i}-f), then ฮพiโ‰ช4\xi_{i}\ll 4. The restart technique sets ft,iโ€‹(2)=0f_{t,i}(2)=0 implying ฮ›iโ‰คU2โ€‹|๐’ฅ|\Lambda_{i}\leq U^{2}|\mathcal{J}|.

Theorem 5 (Algorithm-dependent Bound).

Suppose the conditions in Theorem 3 are satisfied. Let ฮดโˆˆ(0,1)\delta\in(0,1) and Uโ‰คBโˆ’43โ€‹lnโก1ฮด32โ€‹G2U\leq\frac{B-\frac{4}{3}\ln\frac{1}{\delta}}{32G_{2}}. If ฮพi<1|๐’ฅ|\xi_{i}<\frac{1}{|\mathcal{J}|} for all iโˆˆ[K]i\in[K], then with probability at least 1โˆ’3โ€‹Tฮฑโ€‹โ„›โ€‹ฮ˜โ€‹(โŒˆlnโกTโŒ‰)โ€‹ฮด1-\frac{3T}{\alpha\mathcal{R}}\Theta(\lceil\ln{T}\rceil)\delta, the regret of M-OMD-S satisfies,

โˆ€ฮบiโˆˆ๐’ฆ,โˆ€fโˆˆโ„i,Regโ€‹(f)โ‰ค32โ€‹Uโ€‹G2โ€‹LTโ€‹(f)Bโˆ’43โ€‹lnโก1ฮด+144โ€‹U2โ€‹G2โ€‹G1โ€‹lnโก1ฮด(1โˆ’ฮณ)2+12โ€‹U(1โˆ’ฮณ)32โ€‹G2โ€‹G1โ€‹LTโ€‹(f)โ€‹lnโก1ฮด,\displaystyle\forall\kappa_{i}\in\mathcal{K},\forall f\in\mathbb{H}_{i},\quad\mathrm{Reg}(f)\leq\frac{32UG_{2}L_{T}(f)}{B-\frac{4}{3}\ln\frac{1}{\delta}}+\frac{144U^{2}G_{2}G_{1}\ln\frac{1}{\delta}}{(1-\gamma)^{2}}+\frac{12U}{(1-\gamma)^{\frac{3}{2}}}\sqrt{G_{2}G_{1}L_{T}(f)\ln\frac{1}{\delta}},

where ฮณ=16โ€‹Uโ€‹G2Bโˆ’43โ€‹lnโก1ฮด\gamma=\frac{16UG_{2}}{B-\frac{4}{3}\ln\frac{1}{\delta}}.

The dominant term in Theorem 5 is O\left(\frac{UG_{2}L_{T}(f)}{B}\right), while the dominant term in Theorem 3 is O\left(\frac{UG_{2}L_{T}(f)}{\sqrt{B}}\right). The restart technique only enjoys the regret bound given in Theorem 3. Thus removing half of the examples can be better than the restart technique.

6 Experiments

In this section, we will verify the following two goals,

  1. G1

    Learning is possible in the optimal hypothesis space within a small memory budget.
    We aim to verify that the two data-dependent complexities are smaller than the worst-case complexity T, that is, \min_{\kappa_{i},i\in[K]}\mathcal{A}_{T,\kappa_{i}}\ll T and \min_{f\in\mathbb{H}_{i},i\in[K]}L_{T}(f)\ll T.

  2. G2

    Our algorithms perform better than the baseline algorithms under the same memory budget.

6.1 Experimental Setting

We implement M-OMD-S with the logistic loss function. Both M-OMD-H and M-OMD-S are suitable for online binary classification tasks. We adopt the Gaussian kernel \kappa({\bm{x}},{\bm{v}})=\exp(-\frac{\|{\bm{x}}-{\bm{v}}\|^{2}_{2}}{2\sigma^{2}}) and choose eight classification datasets. The information of the datasets is shown in Table 1. The w8a, ijcnn1, and cod-rna datasets are downloaded from the LIBSVM website (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/, Jul. 2024). The other datasets are downloaded from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.php, Jul. 2024). All algorithms are implemented in R on a Windows machine with a 2.8 GHz Core(TM) i7-1165G7 CPU. We execute each experiment ten times, each with a random permutation of the dataset.

Table 1: Datasets Used in Experiments
Datasets #sample # Feature Datasets #sample # Feature Datasets #sample # Feature
mushrooms 8,124 112 phishing 11,055 68 magic04 19,020 10
a9a 48,842 123 w8a 49,749 300 SUSY 50,000 18
ijcnn1 141,691 22 cod-rna 271,617 8 -

We compare our algorithms with two baseline algorithms.

  • โ€ข

    LKMBooks: OKS with memory constraint via randomly adding examples [Li2022Worst].

  • โ€ข

    Raker: online multi-kernel learning using random features [Shen2019Random].

Our algorithms improve the regret bounds of Raker and LKMBooks, and we expect that our algorithms also perform better. Although there are other online kernel selection algorithms, such as OKS [Yang2012Online], and online multi-kernel learning algorithms, such as OMKR [Sahoo2014Online] and OMKC [Hoi2013Online], we do not compare with them, since they do not limit the memory budget.

We select K=5 Gaussian kernel functions by setting \sigma=2^{[-2:2:6]}. For the baseline algorithms, we tune the hyper-parameters following the original papers. To be specific, LKMBooks has three hyper-parameters \nu\in(0,1), \lambda and \eta, where \lambda and \eta are learning rates. We reset \lambda=\frac{10}{\sqrt{1-\nu}T}\sqrt{(1+\nu)B} to improve its performance and tune \nu\in\{\frac{1}{2},\frac{1}{3},\frac{1}{4}\}. Raker has a learning rate \eta and a regularization parameter \lambda. We tune \eta=\frac{10^{-3:1:3}}{\sqrt{T}} and \lambda\in\{0.05,0.005,0.0005\}. For M-OMD-H and M-OMD-S, we set an aggressive value of U=\sqrt{B} following Theorem 2 and Theorem 5. We tune the learning rate \lambda_{i}=c\frac{U}{\sqrt{B}} where c\in\{2,1,0.5\}. The value of \eta_{t} follows (10). The value of \gamma_{t} follows (12) and (17). For M-OMD-H, we set M=10.
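For concreteness, the grid \sigma=2^{[-2:2:6]} corresponds to the five bandwidths 2^{-2},2^{0},2^{2},2^{4},2^{6}; a minimal sketch (illustrative names) of building the kernel set is:

```python
import numpy as np

sigmas = [2.0 ** e for e in range(-2, 7, 2)]          # 2^{-2}, 2^0, 2^2, 2^4, 2^6

def gaussian_kernel(x, v, sigma):
    return np.exp(-np.linalg.norm(x - v) ** 2 / (2.0 * sigma ** 2))

# K = 5 base kernels kappa_i(x, v), one per bandwidth.
kernels = [lambda x, v, s=s: gaussian_kernel(x, v, s) for s in sigmas]
```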

We always set D=400 in Raker, where D is the number of random features and plays a role similar to B. For LKMBooks and M-OMD-S, we always set the budget B=400. Note that M-OMD-H uniformly allocates the memory budget over the K hypotheses and the reservoir. Thus we separately set B=100 and B=400 for M-OMD-H, denoted by M-OMD-H (B=100) and M-OMD-H (B=400). Raker, LKMBooks, M-OMD-S, and M-OMD-H (B=100) use the same memory budget, while M-OMD-H (B=400) uses a larger one.

6.2 Experimental Results

We separately use the hinge loss function and the logistic loss function.

6.2.1 Hinge Loss Function

We first compare the AMR of all algorithms. Table 2 shows the results. On the whole, M-OMD-H (B=400) performs best on most of the datasets, while M-OMD-H (B=100) performs worst. The reason is that M-OMD-H uniformly allocates the memory budget over all kernels, thereby inducing an O\left(\sqrt{K}\right) factor in the regret bound. M-OMD-H (B=400) uses a larger memory budget and thereby performs better. The experimental results do not completely coincide with G2. It is left for future work to study whether splitting the memory budget across kernels can be avoided.

Table 2 also reports the average running time. It is natural that M-OMD-H (B=400)(B=400) requires the longest running time, since it uses a larger memory budget. The running time of the other three algorithms is similar.

Next we analyze the value of ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}. Computing ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} requires Oโ€‹(T2)O(T^{2}) time. For simplicity, we will define a proxy of ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}. Let

๐’œ^T,i:=โˆ‘ฯ„=1Tโ€–โˆ‡ฯ„,iโˆ’โˆ‡^ฯ„,iโ€–โ„‹i2โ‹…๐•€โˆ‡ฯ„,iโ‰ 0.\hat{\mathcal{A}}_{T,i}:=\sum^{T}_{\tau=1}\left\|\nabla_{\tau,i}-\hat{\nabla}_{\tau,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{\tau,i}\neq 0}.

To be specific, we use miniโˆˆ[K]โก๐’œ^T,i\min_{i\in[K]}\hat{\mathcal{A}}_{T,i} as a proxy of ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}. In fact, our regret bound depends on ๐’œ^T,i\hat{\mathcal{A}}_{T,i} which satisfies ๐’œ^T,i=O~โ€‹(๐’œT,ฮบi)\hat{\mathcal{A}}_{T,i}=\tilde{O}(\mathcal{A}_{T,\kappa_{i}}). To obtain a precise estimation of ๐’œT,ฮบi\mathcal{A}_{T,\kappa_{i}}, we again run M-OMD-H with M=30M=30 and B=400B=400.

Table 2 shows the results. We find that \mathcal{A}_{T,\kappa^{\ast}}<T on all datasets. In particular, \mathcal{A}_{T,\kappa^{\ast}}\ll T on the mushrooms, phishing, and w8a datasets. In the optimal hypothesis space, a small memory budget is enough to guarantee a small regret and thus learning is possible. The experimental results verify G1.

Table 2: Experimental Results Using the Hinge Loss Function
Algorithm mushrooms, T=8124T=8124 phishing, T=11055T=11055
AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}
Raker [Shen2019Random] 11.97 ยฑ\pm 0.93 400 1.57 - 9.13 ยฑ\pm 0.28 400 2.06 -
LKMBooks [Li2022Worst] 6.04 ยฑ\pm 0.62 400 1.64 - 12.81 ยฑ\pm 0.54 400 2.21 -
M-OMD-H 8.09 ยฑ\pm 2.51 100 5.62 - 19.41 ยฑ\pm 3.78 100 5.05 -
M-OMD-H 0.69 ยฑ\pm 0.07 400 10.97 220 8.87 ยฑ\pm 0.40 400 10.74 2849
Algorithm magic04, T=19020T=19020 a9a, T=48842T=48842
AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}
Raker [Shen2019Random] 32.14 ยฑ\pm 0.43 400 3.25 - 23.47 ยฑ\pm 0.23 400 9.45 -
LKMBooks [Li2022Worst] 23.66 ยฑ\pm 0.51 400 1.31 - 21.29 ยฑ\pm 1.36 400 10.15 -
M-OMD-H 29.08 ยฑ\pm 3.78 100 5.50 - 24.84 ยฑ\pm 1.43 100 31.39 -
M-OMD-H 22.75 ยฑ\pm 0.57 400 6.09 10422 19.88 ยฑ\pm 1.44 400 49.29 19393
Algorithm w8a, T=49749T=49749 SUSY, T=50000T=50000
AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}
Raker [Shen2019Random] 2.98 ยฑ\pm 0.00 400 11.62 - 29.62 ยฑ\pm 0.46 400 8.88 -
LKMBooks [Li2022Worst] 2.97 ยฑ\pm 0.00 400 23.52 - 29.37 ยฑ\pm 0.87 400 5.47 -
M-OMD-H 2.99 ยฑ\pm 0.10 100 43.20 - 36.21 ยฑ\pm 2.08 100 16.29 -
M-OMD-H 2.65 ยฑ\pm 0.12 400 84.80 4200 27.76 ยฑ\pm 1.04 400 29.45 28438
Algorithm ijcnn1, T=141691T=141691 cod-rna, T=271617T=271617
AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}} AMR B|DB|D Time (s) ๐’œT,ฮบโˆ—\mathcal{A}_{T,\kappa^{\ast}}
Raker [Shen2019Random] 9.49 ยฑ\pm 0.08 400 23.54 - 11.90 ยฑ\pm 0.31 400 53.32 -
LKMBooks [Li2022Worst] 9.58 ยฑ\pm 0.01 400 18.85 - 12.57 ยฑ\pm 0.26 400 19.17 -
M-OMD-H 9.98 ยฑ\pm 0.29 100 33.15 - 14.59 ยฑ\pm 1.47 100 70.33 -
M-OMD-H 9.57 ยฑ\pm 0.43 400 54.81 33626 12.46 ยฑ\pm 0.54 400 130.09 62259

6.2.2 Logistic Loss Function

We first compare the AMR of all algorithms. Table 3 reports the results. Overall, M-OMD-S performs best on most datasets. On the phishing, the SUSY, and the cod-rna datasets, Raker performs better than M-OMD-S. We have found that the current regret analysis of Raker is not tight and can be improved by utilizing Assumption 3. Thus it is reasonable that Raker performs similarly to M-OMD-S. Besides, M-OMD-S performs better than LKMBooks on all datasets. The experimental results coincide with the theoretical analysis, that is, M-OMD-S enjoys a better regret bound than LKMBooks. The experimental results verify G2. It is natural that the running time of all algorithms is similar, as they have the same time complexity.

Finally, we analyze the value of L^{\ast}. For simplicity, we use the cumulative losses of M-OMD-S, \hat{L}_{1:T}, as a proxy of L^{\ast}. Note that L^{\ast}\leq\hat{L}_{1:T}. Table 3 shows that L^{\ast}<T on all datasets. In particular, on the mushrooms, phishing, and w8a datasets, L^{\ast}\ll T. In the optimal hypothesis space, a small memory budget is enough to guarantee a small regret, and thus learning is possible. The experimental results verify G1.

Table 3: Experimental Results Using the Logistic Loss Function
Algorithm mushrooms, T=8124T=8124 phishing, T=11055T=11055
AMR B|DB|D Time (s) Lโˆ—L^{\ast} AMR B|DB|D Time (s) Lโˆ—L^{\ast}
Raker [Shen2019Random] 11.47 ยฑ\pm 0.75 400 1.64 - 9.07 ยฑ\pm 0.44 400 2.05 -
LKMBooks [Li2022Worst] 10.01 ยฑ\pm 1.44 400 2.12 - 15.86 ยฑ\pm 1.28 400 1.94 -
M-OMD-S 2.70 ยฑ\pm 0.32 400 1.74 160 10.38 ยฑ\pm 0.31 400 3.55 3012
Algorithm magic04, T=19020T=19020 a9a, T=48842T=48842
AMR B|DB|D Time (s) Lโˆ—L^{\ast} AMR B|DB|D Time (s) Lโˆ—L^{\ast}
Raker [Shen2019Random] 30.74 ยฑ\pm 1.11 400 3.56 - 23.13 ยฑ\pm 0.17 400 9.90 -
LKMBooks [Li2022Worst] 25.72 ยฑ\pm 0.64 400 1.30 - 22.38 ยฑ\pm 1.29 400 10.98 -
M-OMD-S 23.97 ยฑ\pm 0.23 400 1.81 10160 19.43 ยฑ\pm 0.16 400 10.41 20194
Algorithm w8a, T=49749T=49749 SUSY, T=50000T=50000
AMR B|DB|D Time (s) Lโˆ—L^{\ast} AMR B|DB|D Time (s) Lโˆ—L^{\ast}
Raker [Shen2019Random] 2.97 ยฑ\pm 0.00 400 12.06 - 28.62 ยฑ\pm 0.44 400 9.02 -
LKMBooks [Li2022Worst] 2.98 ยฑ\pm 0.01 400 6.10 - 30.72 ยฑ\pm 1.33 400 6.45 0
M-OMD-S 2.83 ยฑ\pm 0.03 400 21.44 5116 29.03 ยฑ\pm 0.16 400 8.24 28582
Algorithm ijcnn1, T=141691T=141691 cod-rna, T=271617T=271617
AMR B|DB|D Time (s) Lโˆ—L^{\ast} AMR B|DB|D Time (s) Lโˆ—L^{\ast}
Raker [Shen2019Random] 9.53 ยฑ\pm 0.03 400 24.94 - 12.27 ยฑ\pm 0.57 400 52.24 -
LKMBooks [Li2022Worst] 9.57 ยฑ\pm 0.00 400 16.95 - 13.91 ยฑ\pm 1.31 400 25.55 -
M-OMD-S 9.42 ยฑ\pm 0.05 400 19.33 39208 13.09 ยฑ\pm 0.04 400 31.20 94326

7 Conclusion

Learnability is an essential problem in online kernel selection with memory constraint. Our work gave positive results on learnability through data-dependent regret analysis, in contrast to previous negative results obtained from worst-case regret analysis. We characterized the regret bounds via the kernel alignment and the cumulative losses, and gave new trade-offs between regret and memory constraint. If there is a kernel function that matches the data well, i.e., the kernel alignment and the cumulative losses are sub-linear, then sub-linear regret can be achieved within a \Theta(\ln{T}) memory constraint and the corresponding hypothesis space is learnable. Data-dependent regret analysis provides a new perspective for studying learnability in online kernel selection with memory constraint.

Deploying machine learning models on devices with limited computational resources is necessary for practical applications. Our algorithms limit the memory budget to O(\ln{T}) and can naturally be deployed on devices with limited memory. Even if the deployed models are not kernel classifiers, our work can guide how to allocate memory resources properly. For instance, (i) the optimal value of the memory budget depends on the hardness of the problem; (ii) our second algorithm provides the idea of memory sharing.

There remain several important directions for future work. Firstly, it is interesting to study other data complexities beyond kernel alignment and small-loss, such as the effective dimension of the kernel matrix. Secondly, it is important to investigate whether the O\left(\sqrt{K}\right) factor in the regret bound of M-OMD-H can be eliminated.

Appendix

Appendix A.1 Proof of Lemma 1

Proof.

We analyze S_{i} for a fixed i\in[K]. Let the number of removal operations be J. Denote by B=\alpha\mathcal{R}, \mathcal{J}=\{t_{r},r\in[J]\}, T_{r}=\{t_{r-1}+1,\ldots,t_{r}\}, and t_{0}=0. For any t\in T_{r}, if \nabla_{t,i}\neq 0, \neg\mathrm{con}(a(i)), and b_{t,i}=1, then ({\bm{x}}_{t},y_{t}) will be added into S_{i}. For simplicity, we define a new notation \nu_{t,i} as follows,

ฮฝt,i=๐•€ytโ€‹ft,iโ€‹(๐’™t)<1โ‹…๐•€ยฌconโ€‹(aโ€‹(i))โ‹…bt,i.\nu_{t,i}=\mathbb{I}_{y_{t}f_{t,i}({\bm{x}}_{t})<1}\cdot\mathbb{I}_{\neg\mathrm{con}(a(i))}\cdot b_{t,i}.

At the end of the trt_{r}-th round, the following equation can be derived,

|Si|=|Siโ€‹(trโˆ’1+1)|+โˆ‘t=trโˆ’1+1trฮฝt,i=BK,|S_{i}|=|S_{i}(t_{r-1}+1)|+\sum^{t_{r}}_{t=t_{r-1}+1}\nu_{t,i}=\frac{B}{K},

where |S_{i}(t_{r-1}+1)| is defined as the initial size of S_{i}.
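For intuition about the epoch structure used in this counting argument, the following minimal sketch simulates how a single buffer S_{i} evolves: an example is added whenever \nu_{t,i}=1, and once |S_{i}| reaches B/K half of the buffer is removed, so each later epoch starts with B/(2K) examples. The synthetic indicator stream, the removal of the oldest half, and all names are illustrative assumptions, not part of the proof.

```python
import random

def simulate_buffer_epochs(signals, B, K):
    """Simulate the buffer S_i of the counting argument.

    `signals` is the stream of indicators nu_{t,i} in {0, 1}.  Returns the
    number of removal operations J and the rounds t_r at which epochs end."""
    cap = B // K                              # |S_i| is capped at B / K
    buffer, J, epoch_ends = [], 0, []
    for t, nu in enumerate(signals, start=1):
        if nu:
            buffer.append(t)
        if len(buffer) == cap:                # the current epoch ends at t_r = t
            J += 1
            epoch_ends.append(t)
            del buffer[: cap // 2]            # keep B/(2K) examples; dropping the
                                              # oldest half is an illustrative choice
    return J, epoch_ends

# Usage sketch: additions occur with probability 0.3 per round.
random.seed(0)
stream = [int(random.random() < 0.3) for _ in range(2000)]
J, epochs = simulate_buffer_epochs(stream, B=400, K=4)
print(f"J = {J} removals; first epochs end at rounds {epochs[:3]}")
```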

Let s_{r}=t_{r-1}+1 and assume that there is no budget. We will present an expected bound on \sum^{\bar{t}}_{t=s_{r}}\nu_{t,i} for any \bar{t}>s_{r}. In the first epoch, s_{1}=1 and |S_{i}(s_{1})|=0. Taking expectation w.r.t. b_{t,i} gives

๐”ผโ€‹[โˆ‘t=s1tยฏฮฝt,i]=\displaystyle\mathbb{E}\left[\sum^{\bar{t}}_{t=s_{1}}\nu_{t,i}\right]= โˆ‘t=s1tยฏโ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2โ‹…๐•€โˆ‡t,iโ‰ 0โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2+โ€–โˆ‡^t,iโ€–โ„‹i2\displaystyle\sum^{\bar{t}}_{t=s_{1}}\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{t,i}\neq 0}}{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}+\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}
โ‰ค\displaystyle\leq 2k1โ€‹(1+โˆ‘t=2tยฏโ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)|Vt|โ€–โ„‹i2)โŸ๐’œ~[s1,tยฏ],ฮบi\displaystyle\frac{2}{k_{1}}\underbrace{\left(1+\sum^{\bar{t}}_{t=2}\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\frac{\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)}{|V_{t}|}\right\|^{2}_{\mathcal{H}_{i}}\right)}_{\tilde{\mathcal{A}}_{[s_{1},\bar{t}],\kappa_{i}}}
=\displaystyle= 2k1โ€‹๐’œ~[s1,tยฏ],ฮบi,\displaystyle\frac{2}{k_{1}}\tilde{\mathcal{A}}_{[s_{1},\bar{t}],\kappa_{i}},

where we use the fact ฮบiโ€‹(๐’™t,๐’™t)โ‰ฅk1\kappa_{i}({\bm{x}}_{t},{\bm{x}}_{t})\geq k_{1}. Let t1t_{1} be the minimal tยฏ\bar{t} such that

2k1โ€‹๐’œ~[s1,t1],ฮบiโ‰ฅBK.\frac{2}{k_{1}}\tilde{\mathcal{A}}_{[s_{1},t_{1}],\kappa_{i}}\geq\frac{B}{K}. (A1)

The first epoch will end at t1t_{1} in expectation. We define ๐’œ~T1,ฮบi:=๐’œ~[s1,t1],ฮบi\tilde{\mathcal{A}}_{T_{1},\kappa_{i}}:=\tilde{\mathcal{A}}_{[s_{1},t_{1}],\kappa_{i}}.

Next we consider r\geq 2. It must hold that |S_{i}(s_{r})|=\frac{B}{2K}. Similar to the case r=1, we can obtain

๐”ผโ€‹[โˆ‘t=srtยฏฮฝt,i]โ‰ค2k1โ€‹โˆ‘t=srtยฏโ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)|Vt|โ€–โ„‹i2โŸ๐’œ~[sr,tยฏ],ฮบi=2k1โ€‹๐’œ~[sr,tยฏ],ฮบi.\mathbb{E}\left[\sum^{\bar{t}}_{t=s_{r}}\nu_{t,i}\right]\leq\frac{2}{k_{1}}\underbrace{\sum^{\bar{t}}_{t=s_{r}}\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\frac{\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)}{|V_{t}|}\right\|^{2}_{\mathcal{H}_{i}}}_{\tilde{\mathcal{A}}_{[s_{r},\bar{t}],\kappa_{i}}}=\frac{2}{k_{1}}\tilde{\mathcal{A}}_{[s_{r},\bar{t}],\kappa_{i}}.

Let trt_{r} be the minimal tยฏ\bar{t} such that

2k1โ€‹๐’œ~[sr,tยฏ],ฮบiโ‰ฅB2โ€‹K,\frac{2}{k_{1}}\tilde{\mathcal{A}}_{[s_{r},\bar{t}],\kappa_{i}}\geq\frac{B}{2K}, (A2)

Let \tilde{\mathcal{A}}_{T_{r},\kappa_{i}}:=\tilde{\mathcal{A}}_{[s_{r},t_{r}],\kappa_{i}}. Combining (A1) and (A2), and summing over r=1,\ldots,J yields

BK+Bโ€‹(Jโˆ’1)2โ€‹Kโ‰ค\displaystyle\frac{B}{K}+\frac{B(J-1)}{2K}\leq 2k1โ€‹๐’œ~T1,ฮบi+โˆ‘r=2J2k1โ€‹๐’œ~Tr,ฮบi\displaystyle\frac{2}{k_{1}}\tilde{\mathcal{A}}_{T_{1},\kappa_{i}}+\sum^{J}_{r=2}\frac{2}{k_{1}}\tilde{\mathcal{A}}_{T_{r},\kappa_{i}}
โ‰ค\displaystyle\leq 2k1โ€‹โˆ‘t=s1Tโ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)|Vt|โ€–โ„‹i2โŸ๐’œ~T,ฮบi\displaystyle\frac{2}{k_{1}}\underbrace{\sum^{T}_{t=s_{1}}\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\frac{\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)}{|V_{t}|}\right\|^{2}_{\mathcal{H}_{i}}}_{\tilde{\mathcal{A}}_{T,\kappa_{i}}}
โ‰ค\displaystyle\leq 2k1โ€‹๐’œ~T,ฮบi.\displaystyle\frac{2}{k_{1}}\tilde{\mathcal{A}}_{T,\kappa_{i}}.

Rearranging terms gives

Jโ‰ค4โ€‹Kโ€‹๐’œ~T,ฮบiBโ€‹k1โˆ’1โ‰ค4โ€‹Kโ€‹๐’œ~T,ฮบiBโ€‹k1.J\leq\frac{4K\tilde{\mathcal{A}}_{T,\kappa_{i}}}{Bk_{1}}-1\leq\frac{4K\tilde{\mathcal{A}}_{T,\kappa_{i}}}{Bk_{1}}. (A3)

Taking expectation w.r.t. the randomness of reservoir sampling gives

๐”ผโ€‹[J]โ‰ค4โ€‹KBโ€‹k1โ‹…๐”ผโ€‹[๐’œ~T,ฮบi]โ‰ค12โ€‹KBโ€‹k1โ€‹๐’œT,ฮบiโ‹…(1+lnโกTM)+32โ€‹KBโ€‹k1,\mathbb{E}[J]\leq\frac{4K}{Bk_{1}}\cdot\mathbb{E}[\tilde{\mathcal{A}}_{T,\kappa_{i}}]\leq\frac{12K}{Bk_{1}}\mathcal{A}_{T,\kappa_{i}}\cdot\left(1+\frac{\ln{T}}{M}\right)+\frac{32K}{Bk_{1}},

where the last inequality comes from Lemma A.1.1. Omitting the last constant term concludes the proof. โˆŽ

Lemma A.1.1.

The reservoir sampling guarantees

โˆ€iโˆˆ[K],๐”ผโ€‹[๐’œ~T,ฮบi]โ‰ค3โ€‹๐’œT,ฮบi+8+3โ€‹๐’œT,ฮบiโ€‹lnโกTM.\forall i\in[K],\quad\mathbb{E}\left[\tilde{\mathcal{A}}_{T,\kappa_{i}}\right]\leq 3\mathcal{A}_{T,\kappa_{i}}+8+\frac{3\mathcal{A}_{T,\kappa_{i}}\ln{T}}{M}.
Proof.

Let ฮผt,i=โˆ’1tโ€‹โˆ‘ฯ„=1tyฯ„โ€‹ฮบiโ€‹(๐’™ฯ„,โ‹…)\mu_{t,i}=-\frac{1}{t}\sum^{t}_{\tau=1}y_{\tau}\kappa_{i}({\bm{x}}_{\tau},\cdot) and ฯ„0=M\tau_{0}=M. For tโ‰คฯ„0t\leq\tau_{0}, it can be verified that

๐’œ~ฯ„0,ฮบi=\displaystyle\tilde{\mathcal{A}}_{\tau_{0},\kappa_{i}}= 1+โˆ‘t=2ฯ„0โ€–โˆ’ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’ฮผtโˆ’1,iโ€–โ„‹i2\displaystyle 1+\sum^{\tau_{0}}_{t=2}\left\|-y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\mu_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}
=\displaystyle= 1+โˆ‘t=2ฯ„0โ€–โˆ’ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’ฮผt,i+ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2\displaystyle 1+\sum^{\tau_{0}}_{t=2}\left\|-y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\mu_{t,i}+\mu_{t,i}-\mu_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}
โ‰ค\displaystyle\leq 1+2โ€‹๐’œ[2:ฯ„0],ฮบi+2โ€‹โˆ‘t=2ฯ„0โ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2,\displaystyle 1+2\mathcal{A}_{[2:\tau_{0}],\kappa_{i}}+2\sum^{\tau_{0}}_{t=2}\left\|\mu_{t,i}-\mu_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}},

where ฮผ0,i=0\mu_{0,i}=0. Let VtV_{t} be the reservoir at the beginning of round tt. Next we consider the case t>ฯ„0t>\tau_{0}.

๐’œ~[ฯ„0:T],ฮบi=\displaystyle\tilde{\mathcal{A}}_{[\tau_{0}:T],\kappa_{i}}= โˆ‘t=ฯ„0+1Tโ€–โˆ’ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)โˆ’ฮผ~tโˆ’1,iโ€–โ„‹i2\displaystyle\sum^{T}_{t=\tau_{0}+1}\left\|-y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)-\tilde{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}
โ‰ค\displaystyle\leq โˆ‘t=ฯ„0+1T3โ€‹[โ€–ytโ€‹ฮบiโ€‹(๐’™t,โ‹…)+ฮผt,iโ€–โ„‹i2+โ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2+โ€–ฮผtโˆ’1,iโˆ’ฮผ~tโˆ’1,iโ€–โ„‹i2]\displaystyle\sum^{T}_{t=\tau_{0}+1}3\left[\left\|y_{t}\kappa_{i}({\bm{x}}_{t},\cdot)+\mu_{t,i}\right\|^{2}_{\mathcal{H}_{i}}+\left\|\mu_{t,i}-{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}+\left\|{\mu}_{t-1,i}-\tilde{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}\right]
=\displaystyle= 3โ€‹๐’œ[ฯ„0:T],ฮบi+3โ€‹โˆ‘t=ฯ„0+1Tโ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2+3โ€‹โˆ‘t=ฯ„0+1Tโ€–ฮผtโˆ’1,i+1|Vt|โ€‹โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)โ€–โ„‹i2.\displaystyle 3\mathcal{A}_{[\tau_{0}:T],\kappa_{i}}+3\sum^{T}_{t=\tau_{0}+1}\left\|\mu_{t,i}-{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}+3\sum^{T}_{t=\tau_{0}+1}\left\|{\mu}_{t-1,i}+\frac{1}{|V_{t}|}\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)\right\|^{2}_{\mathcal{H}_{i}}.

Taking expectation w.r.t. the reservoir sampling yields

๐”ผโ€‹[๐’œ~T,ฮบi]\displaystyle\mathbb{E}[\tilde{\mathcal{A}}_{T,\kappa_{i}}]
=\displaystyle= ๐’œ~ฯ„0,ฮบi+๐”ผโ€‹[๐’œ~[ฯ„0:T],ฮบi]\displaystyle\tilde{\mathcal{A}}_{\tau_{0},\kappa_{i}}+\mathbb{E}[\tilde{\mathcal{A}}_{[\tau_{0}:T],\kappa_{i}}]
โ‰ค\displaystyle\leq 1+3โ€‹๐’œT,ฮบi+3โ€‹โˆ‘t=2Tโ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2+3โ€‹โˆ‘t=ฯ„0+1T๐”ผโ€‹[โ€–ฮผtโˆ’1,i+1|Vt|โ€‹โˆ‘(๐’™,y)โˆˆVtyโ€‹ฮบiโ€‹(๐’™,โ‹…)โ€–โ„‹i2]byโ€‹Lemmaโ€‹A.8.1\displaystyle 1+3\mathcal{A}_{T,\kappa_{i}}+3\sum^{T}_{t=2}\left\|\mu_{t,i}-{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}+3\sum^{T}_{t=\tau_{0}+1}\mathbb{E}\left[\left\|{\mu}_{t-1,i}+\frac{1}{|V_{t}|}\sum_{({\bm{x}},y)\in V_{t}}y\kappa_{i}({\bm{x}},\cdot)\right\|^{2}_{\mathcal{H}_{i}}\right]\quad\quad\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lem:JCST2022:variance_reservoir_estimator}
โ‰ค\displaystyle\leq 1+3โ€‹๐’œT,ฮบi+3โ€‹โˆ‘t=2Tโ€–ฮผt,iโˆ’ฮผtโˆ’1,iโ€–โ„‹i2+โˆ‘t=ฯ„0+1T3โ€‹๐’œtโˆ’1,ฮบi(tโˆ’1)โ€‹|Vt|\displaystyle 1+3\mathcal{A}_{T,\kappa_{i}}+3\sum^{T}_{t=2}\left\|\mu_{t,i}-{\mu}_{t-1,i}\right\|^{2}_{\mathcal{H}_{i}}+\sum^{T}_{t=\tau_{0}+1}\frac{3\mathcal{A}_{t-1,\kappa_{i}}}{(t-1)|V_{t}|}
โ‰ค\displaystyle\leq 1+3โ€‹๐’œT,ฮบi+โˆ‘t=2T12t2+3โ€‹๐’œT,ฮบiโ€‹lnโกTM\displaystyle 1+3\mathcal{A}_{T,\kappa_{i}}+\sum^{T}_{t=2}\frac{12}{t^{2}}+\frac{3\mathcal{A}_{T,\kappa_{i}}\ln{T}}{M}
โ‰ค\displaystyle\leq 1+3โ€‹๐’œT,ฮบi+7+3โ€‹๐’œT,ฮบiโ€‹lnโกTM,\displaystyle 1+3\mathcal{A}_{T,\kappa_{i}}+7+\frac{3\mathcal{A}_{T,\kappa_{i}}\ln{T}}{M},

where |Vt|=M|V_{t}|=M for all tโ‰ฅฯ„0t\geq\tau_{0}. โˆŽ

Appendix A.2 Proof of Theorem 1

Proof.

By the convexity of the hinge loss function, we decompose the regret as follows

Regโ€‹(f)=\displaystyle\mathrm{Reg}(f)= โˆ‘t=1Tโ„“โ€‹(โˆ‘j=1Kpt,jโ€‹ft,jโ€‹(๐’™t),yt)โˆ’โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt)\displaystyle\sum^{T}_{t=1}\ell\left(\sum^{K}_{j=1}p_{t,j}f_{t,j}({\bm{x}}_{t}),y_{t}\right)-\sum^{T}_{t=1}\ell\left(f({\bm{x}}_{t}),y_{t}\right)
โ‰ค\displaystyle\leq โˆ‘t=1Tโˆ‘j=1Kpt,jโ€‹โ„“โ€‹(ft,jโ€‹(๐’™t),yt)โˆ’โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt)\displaystyle\sum^{T}_{t=1}\sum^{K}_{j=1}p_{t,j}\ell\left(f_{t,j}({\bm{x}}_{t}),y_{t}\right)-\sum^{T}_{t=1}\ell\left(f({\bm{x}}_{t}),y_{t}\right)
โ‰ค\displaystyle\leq โˆ‘t=1T[โˆ‘j=1Kpt,jโ€‹โ„“โ€‹(ft,jโ€‹(๐’™t),yt)โˆ’โ„“โ€‹(ft,iโ€‹(๐’™t),yt)]โŸ๐’ฏ1+โˆ‘tโˆˆET,i[โ„“โ€‹(ft,iโ€‹(๐’™t),yt)โˆ’โ„“โ€‹(fโ€‹(๐’™t),yt)]โŸ๐’ฏ2,\displaystyle\underbrace{\sum^{T}_{t=1}\left[\sum^{K}_{j=1}p_{t,j}\ell(f_{t,j}({\bm{x}}_{t}),y_{t})-\ell(f_{t,i}({\bm{x}}_{t}),y_{t})\right]}_{\mathcal{T}_{1}}+\underbrace{\sum_{t\in E_{T,i}}\left[\ell(f_{t,i}({\bm{x}}_{t}),y_{t})-\ell(f({\bm{x}}_{t}),y_{t})\right]}_{\mathcal{T}_{2}},

where ET,i={tโˆˆ[T],โˆ‡t,iโ‰ 0}E_{T,i}=\{t\in[T],\nabla_{t,i}\neq 0\}.

A.2.1 Analyzing ๐’ฏ1\mathcal{T}_{1}

The following analysis is the same as the proof of Theorem 3.1 in [Bubeck2012Regret]. Let c_{t,i}:=\ell(f_{t,i}({\bm{x}}_{t}),y_{t}). The probability update is as follows,

pt+1,i=wt+1,iโˆ‘j=1Kwt+1,j,wt+1,i=expโก(โˆ’ฮทt+1โ€‹โˆ‘ฯ„=1tcฯ„,i).p_{t+1,i}=\frac{w_{t+1,i}}{\sum^{K}_{j=1}w_{t+1,j}},\quad w_{t+1,i}=\exp\left(-\eta_{t+1}\sum^{t}_{\tau=1}c_{\tau,i}\right).
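For concreteness, the following minimal sketch implements this exponentially weighted update over K experts. The self-adaptive learning-rate schedule shown, \eta_{t}=\sqrt{2\ln K/(1+\sum_{\tau<t}\langle{\bm{p}}_{\tau},{\bm{c}}^{2}_{\tau}\rangle)}, is the one suggested by the bound derived below and is an illustrative assumption, as are the placeholder losses.

```python
import numpy as np

def weighted_average_forecaster(loss_matrix):
    """Exponentially weighted average over K experts with a time-varying
    learning rate, following p_{t+1,i} proportional to exp(-eta_{t+1} * C_{t,i}).

    `loss_matrix` has shape (T, K); entry (t, i) is c_{t,i}."""
    T, K = loss_matrix.shape
    cum_losses = np.zeros(K)          # C_{t,i} = sum_{tau <= t} c_{tau,i}
    second_moment = 0.0               # sum_{tau} <p_tau, c_tau^2>
    total_loss = 0.0
    p = np.full(K, 1.0 / K)           # p_1 is uniform
    for t in range(T):
        c = loss_matrix[t]
        total_loss += p @ c           # learner's loss with p_t
        second_moment += p @ (c ** 2)
        cum_losses += c
        # Self-adaptive learning rate eta_{t+1}; the schedule is illustrative.
        eta = np.sqrt(2.0 * np.log(K) / (1.0 + second_moment))
        w = np.exp(-eta * (cum_losses - cum_losses.min()))  # shift for stability
        p = w / w.sum()
    return total_loss, p

# Usage sketch with placeholder losses in [0, 1].
rng = np.random.default_rng(0)
losses = rng.random((1000, 5)) * np.array([0.3, 0.5, 0.7, 0.9, 1.0])
total, final_p = weighted_average_forecaster(losses)
print(f"learner loss {total:.1f} vs best expert {losses.sum(axis=0).min():.1f}")
```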

Similar to the analysis of Exp3 [Bubeck2012Regret], we define a potential function ฮ“tโ€‹(ฮทt)\Gamma_{t}(\eta_{t}) as follows,

ฮ“tโ€‹(ฮทt):=1ฮทtโ€‹lnโ€‹โˆ‘i=1Kpt,iโ€‹expโก(โˆ’ฮทtโ€‹ct,i)โ‰คโˆ’โˆ‘i=1Kpt,iโ€‹ct,i+12โ€‹ฮทtโ€‹โˆ‘i=1Kpt,iโ€‹ct,i2,\displaystyle\Gamma_{t}(\eta_{t}):=\frac{1}{\eta_{t}}\ln\sum^{K}_{i=1}p_{t,i}\exp(-\eta_{t}c_{t,i})\leq-\sum^{K}_{i=1}p_{t,i}c_{t,i}+\frac{1}{2}\eta_{t}\sum^{K}_{i=1}p_{t,i}c^{2}_{t,i},

where we use the following two inequalities

lnโกxโ‰คxโˆ’1,โˆ€x>0,expโก(โˆ’x)โ‰ค1โˆ’x+x22,โˆ€xโ‰ฅ0.\ln{x}\leq x-1,\forall x>0,\quad\exp(-x)\leq 1-x+\frac{x^{2}}{2},\forall x\geq 0.

Summing over tโˆˆ[T]t\in[T] yields

โˆ‘t=1Tฮ“tโ€‹(ฮทt)โ‰คโˆ’โˆ‘t=1TโŸจ๐’‘t,๐’„tโŸฉ+โˆ‘t=1Tโˆ‘i=1Kฮทt2โ€‹pt,iโ€‹ct,i2.\sum^{T}_{t=1}\Gamma_{t}(\eta_{t})\leq-\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle+\sum^{T}_{t=1}\sum^{K}_{i=1}\frac{\eta_{t}}{2}p_{t,i}c^{2}_{t,i}. (A4)

On the other hand, by the definition of pt,ip_{t,i}, we have

ฮ“tโ€‹(ฮทt)=\displaystyle\Gamma_{t}(\eta_{t})= 1ฮทtโ€‹lnโกโˆ‘i=1Kexpโก(โˆ’ฮทtโ€‹โˆ‘ฯ„=1tโˆ’1cฯ„,i)โ€‹expโก(โˆ’ฮทtโ€‹ct,i)โˆ‘j=1Kexpโก(โˆ’ฮทtโ€‹โˆ‘ฯ„=1tโˆ’1cฯ„,j)\displaystyle\frac{1}{\eta_{t}}\ln\frac{\sum^{K}_{i=1}\exp\left(-\eta_{t}\sum^{t-1}_{\tau=1}c_{\tau,i}\right)\exp(-\eta_{t}c_{t,i})}{\sum^{K}_{j=1}\exp\left(-\eta_{t}\sum^{t-1}_{\tau=1}c_{\tau,j}\right)}
=\displaystyle= 1ฮทtโ€‹lnโก1Kโ€‹โˆ‘i=1Kexpโก(โˆ’ฮทtโ€‹โˆ‘ฯ„=1tcฯ„,i)1Kโ€‹โˆ‘j=1Kexpโก(โˆ’ฮทtโ€‹โˆ‘ฯ„=1tโˆ’1cฯ„,j)\displaystyle\frac{1}{\eta_{t}}\ln\frac{\frac{1}{K}\sum^{K}_{i=1}\exp\left(-\eta_{t}\sum^{t}_{\tau=1}c_{\tau,i}\right)}{\frac{1}{K}\sum^{K}_{j=1}\exp\left(-\eta_{t}\sum^{t-1}_{\tau=1}c_{\tau,j}\right)}
=\displaystyle= ฮ“ยฏtโ€‹(ฮทt)โˆ’ฮ“ยฏtโˆ’1โ€‹(ฮทt),\displaystyle\bar{\Gamma}_{t}(\eta_{t})-\bar{\Gamma}_{t-1}(\eta_{t}),

where ฮ“ยฏtโ€‹(ฮท)=1ฮทโ€‹lnโก1Kโ€‹โˆ‘j=1Kexpโก(โˆ’ฮทโ€‹โˆ‘ฯ„=1tcฯ„,j).\bar{\Gamma}_{t}(\eta)=\frac{1}{\eta}\ln\frac{1}{K}\sum^{K}_{j=1}\exp\left(-\eta\sum^{t}_{\tau=1}c_{\tau,j}\right).
Without loss of generality, let ฮ“ยฏ0โ€‹(ฮท)=0\bar{\Gamma}_{0}(\eta)=0. Summing over t=1,โ€ฆ,Tt=1,\ldots,T yields

โˆ‘t=1Tฮ“tโ€‹(ฮทt)=ฮ“ยฏTโ€‹(ฮทT)โˆ’ฮ“ยฏ0โ€‹(ฮท1)+โˆ‘t=1Tโˆ’1[ฮ“ยฏtโ€‹(ฮทt)โˆ’ฮ“ยฏtโ€‹(ฮทt+1)],\sum^{T}_{t=1}\Gamma_{t}(\eta_{t})=\bar{\Gamma}_{T}(\eta_{T})-\bar{\Gamma}_{0}(\eta_{1})+\sum^{T-1}_{t=1}\left[\bar{\Gamma}_{t}(\eta_{t})-\bar{\Gamma}_{t}(\eta_{t+1})\right],

where ฮ“ยฏTโ€‹(ฮทT)โ‰ฅ1ฮทTโ€‹lnโก1Kโˆ’โˆ‘ฯ„=1Tcฯ„,i\bar{\Gamma}_{T}(\eta_{T})\geq\frac{1}{\eta_{T}}\ln\frac{1}{K}-\sum^{T}_{\tau=1}c_{\tau,i}. Combining with the upper bound (A4), we obtain

โˆ‘t=1TโŸจ๐’‘t,๐’„tโŸฉโˆ’โˆ‘ฯ„=1Tcฯ„,iโ‰ค1ฮทTโ€‹lnโกK+โˆ‘t=1Tโˆ’1[ฮ“ยฏtโ€‹(ฮทt+1)โˆ’ฮ“ยฏtโ€‹(ฮทt)]+โˆ‘t=1Tโˆ‘i=1Kฮทt2โ€‹pt,iโ€‹ct,i2.\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle-\sum^{T}_{\tau=1}c_{\tau,i}\leq\frac{1}{\eta_{T}}\ln{K}+\sum^{T-1}_{t=1}\left[\bar{\Gamma}_{t}(\eta_{t+1})-\bar{\Gamma}_{t}(\eta_{t})\right]+\sum^{T}_{t=1}\sum^{K}_{i=1}\frac{\eta_{t}}{2}p_{t,i}c^{2}_{t,i}.

For simplicity, let Cยฏt,j:=โˆ‘ฯ„=1tcฯ„,j\bar{C}_{t,j}:=\sum^{t}_{\tau=1}c_{\tau,j}. The first derivative of ฮ“ยฏtโ€‹(ฮท)\bar{\Gamma}_{t}(\eta) w.r.t. ฮท\eta is as follows

dโ€‹ฮ“ยฏtโ€‹(ฮท)dโ€‹ฮท=\displaystyle\frac{\mathrm{d}\,\bar{\Gamma}_{t}(\eta)}{\mathrm{d}\,\eta}= โˆ’lnโ€‹โˆ‘j=1Kexpโก(โˆ’ฮทโ€‹Cยฏt,j)Kฮท2โˆ’1Kโ€‹โˆ‘j=1KCยฏt,jโ€‹expโก(โˆ’ฮทโ€‹Cยฏt,j)ฮทKโ€‹โˆ‘j=1Kexpโก(โˆ’ฮทโ€‹Cยฏt,j)\displaystyle\frac{-\ln\sum^{K}_{j=1}\frac{\exp\left(-\eta\bar{C}_{t,j}\right)}{K}}{\eta^{2}}-\frac{\frac{1}{K}\sum^{K}_{j=1}\bar{C}_{t,j}\exp\left(-\eta\bar{C}_{t,j}\right)}{\frac{\eta}{K}\sum^{K}_{j=1}\exp\left(-\eta\bar{C}_{t,j}\right)}
=\displaystyle= 1ฮท2โ€‹KLโ€‹(p~t,1K)\displaystyle\frac{1}{\eta^{2}}\mathrm{KL}(\tilde{p}_{t},\frac{1}{K})
โ‰ฅ\displaystyle\geq 0\displaystyle 0

where p~t,j=expโก(โˆ’ฮทโ€‹Cยฏt,j)โˆ‘i=1Kexpโก(โˆ’ฮทโ€‹Cยฏt,i)\tilde{p}_{t,j}=\frac{\exp\left(-\eta\bar{C}_{t,j}\right)}{\sum^{K}_{i=1}\exp\left(-\eta\bar{C}_{t,i}\right)}. Since ฮทt+1โ‰คฮทt\eta_{t+1}\leq\eta_{t}, we have ฮ“ยฏtโ€‹(ฮทt+1)โ‰คฮ“ยฏtโ€‹(ฮทt)\bar{\Gamma}_{t}(\eta_{t+1})\leq\bar{\Gamma}_{t}(\eta_{t}). Combining all results, we have

โˆ‘t=1TโŸจ๐’‘t,๐’„tโŸฉโˆ’โˆ‘ฯ„=1Tcฯ„,i\displaystyle\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle-\sum^{T}_{\tau=1}c_{\tau,i}
โ‰ค\displaystyle\leq lnโกKฮทTโˆ’lnโกKฮท1+โˆ‘t=1Tโˆ‘i=1Kฮทt2โ€‹pt,iโ€‹ct,i2\displaystyle\frac{\ln{K}}{\eta_{T}}-\frac{\ln{K}}{\eta_{1}}+\sum^{T}_{t=1}\sum^{K}_{i=1}\frac{\eta_{t}}{2}p_{t,i}c^{2}_{t,i}
โ‰ค\displaystyle\leq lnโกK2โ‹…1+โˆ‘ฯ„=1Tโˆ’1โŸจ๐’‘ฯ„,๐’„ฯ„2โŸฉโˆ’lnโกK2+lnโกKโ€‹(2โ€‹โˆ‘ฯ„=1TโŸจ๐’‘ฯ„,๐’„ฯ„2โŸฉ+maxt,jโกct,j2)byโ€‹Lemmaโ€‹A.8.2\displaystyle\frac{\sqrt{\ln{K}}}{\sqrt{2}}\cdot\sqrt{1+\sum^{T-1}_{\tau=1}\langle{\bm{p}}_{\tau},{\bm{c}}^{2}_{\tau}\rangle}-\frac{\sqrt{\ln{K}}}{\sqrt{2}}+\sqrt{\ln{K}}\left(\sqrt{2\sum^{T}_{\tau=1}\langle{\bm{p}}_{\tau},{\bm{c}}^{2}_{\tau}\rangle}+\frac{\max_{t,j}c_{t,j}}{\sqrt{2}}\right)\quad\quad\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lem:JCST2022:analysis_second_moment_TV}
โ‰ฒ\displaystyle\lesssim 32โ€‹maxt,jโกct,jโ‹…โˆ‘ฯ„=1TโŸจ๐’‘ฯ„,๐’„ฯ„โŸฉโ€‹lnโกK.\displaystyle\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\cdot\sum^{T}_{\tau=1}\langle{\bm{p}}_{\tau},{\bm{c}}_{\tau}\rangle\ln{K}}. (A5)

Solving for โˆ‘t=1TโŸจ๐’‘t,๐’„tโŸฉ\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle gives

๐’ฏ1=โˆ‘t=1T[โŸจ๐’‘t,๐’„tโŸฉโˆ’ct,i]โ‰ค32โ€‹maxt,jโกct,jโ‹…โˆ‘ฯ„=1Tcฯ„,iโ€‹lnโกK+92โ€‹maxt,jโกct,jโ‹…lnโกK.\mathcal{T}_{1}=\sum^{T}_{t=1}[\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle-c_{t,i}]\leq\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\cdot\sum^{T}_{\tau=1}c_{\tau,i}\ln{K}}+\frac{9}{2}\max_{t,j}c_{t,j}\cdot\ln{K}. (A6)
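For the reader's convenience, the step from (A5) to (A6) is the standard self-bounding argument: if x-a\leq c\sqrt{x} with x,a,c\geq 0, then \sqrt{x}\leq\frac{c+\sqrt{c^{2}+4a}}{2}, and squaring together with \sqrt{c^{2}+4a}\leq c+2\sqrt{a} gives x\leq a+c\sqrt{a}+c^{2}. Applying this with x=\sum^{T}_{t=1}\langle{\bm{p}}_{t},{\bm{c}}_{t}\rangle, a=\sum^{T}_{\tau=1}c_{\tau,i}, and c=\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\ln{K}} gives c\sqrt{a}=\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\cdot\sum^{T}_{\tau=1}c_{\tau,i}\ln{K}} and c^{2}=\frac{9}{2}\max_{t,j}c_{t,j}\ln{K}, which is exactly the right-hand side of (A6).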

A.2.2 Analyzing ๐’ฏ2\mathcal{T}_{2}

We decompose ET,iE_{T,i} as follows.

Ei=\displaystyle E_{i}= {tโˆˆET,i:conโ€‹(aโ€‹(i))},\displaystyle\{t\in E_{T,i}:\mathrm{con}(a(i))\},
๐’ฅi=\displaystyle\mathcal{J}_{i}= {tโˆˆET,i:|Si|=ฮฑโ€‹โ„›i,bt,i=1},\displaystyle\{t\in E_{T,i}:|S_{i}|=\alpha\mathcal{R}_{i},b_{t,i}=1\},
Eยฏi=\displaystyle\bar{E}_{i}= ET,iโˆ–(Eiโˆช๐’ฅi).\displaystyle E_{T,i}\setminus(E_{i}\cup\mathcal{J}_{i}).

We separately analyze the regret in EiE_{i}, ๐’ฅi\mathcal{J}_{i} and Eยฏi\bar{E}_{i}.
Case 1: regret in EiE_{i}
For any f\in\mathbb{H}_{i}, the convexity of the loss function gives

โ„“โ€‹(ft,iโ€‹(๐’™t),yt)โˆ’โ„“โ€‹(fโ€‹(๐’™t),yt)\displaystyle\ell(f_{t,i}({\bm{x}}_{t}),y_{t})-\ell(f({\bm{x}}_{t}),y_{t})
โ‰ค\displaystyle\leq โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle
=\displaystyle= โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡^t,iโŸฉโŸฮž1+โŸจft,iโ€ฒโˆ’f,โˆ‡iโ€‹(st),iโŸฉโŸฮž2+โŸจft,iโ€ฒโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡t,iโˆ’โˆ‡^t,iโŸฉ\displaystyle\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\hat{\nabla}_{t,i}\rangle}_{\Xi_{1}}+\underbrace{\langle f^{\prime}_{t,i}-f,\nabla_{i(s_{t}),i}\rangle}_{\Xi_{2}}+\langle f^{\prime}_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle+\langle f_{t,i}-f^{\prime}_{t,i},\nabla_{t,i}-\hat{\nabla}_{t,i}\rangle
=\displaystyle= ฮž1+ฮž2+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡iโ€‹(st),iโˆ’โˆ‡^t,iโŸฉ+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ\displaystyle\Xi_{1}+\Xi_{2}+\langle f_{t,i}-f^{\prime}_{t,i},\nabla_{i(s_{t}),i}-\hat{\nabla}_{t,i}\rangle+\langle f_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle
โ‰ค\displaystyle\leq [โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)]+โ€–ft,iโˆ’fโ€–โ‹…ฮณt,iโŸฮž3+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡iโ€‹(st),iโˆ’โˆ‡^t,iโŸฉโˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)โŸฮž4,\displaystyle\left[\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})\right]+\underbrace{\left\|f_{t,i}-f\right\|\cdot\gamma_{t,i}}_{\Xi_{3}}+\underbrace{\left\langle f_{t,i}-f^{\prime}_{t,i},\nabla_{i(s_{t}),i}-\hat{\nabla}_{t,i}\right\rangle-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})}_{\Xi_{4}},

where the standard analysis of OMD [Chiang2012Online] gives

ฮž1โ‰ค\displaystyle\Xi_{1}\leq โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)โˆ’โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ),\displaystyle\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})-\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i}),
ฮž2โ‰ค\displaystyle\Xi_{2}\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ftโˆ’1,iโ€ฒ).\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f^{\prime}_{t-1,i}).

Substituting into ฮณt,i\gamma_{t,i} and summing over tโˆˆEit\in E_{i} gives

โˆ‘tโˆˆEiฮž3โ‰ค\displaystyle\sum_{t\in E_{i}}\Xi_{3}\leq โˆ‘tโˆˆEimaxtโกโ€–ft,iโˆ’fโ€–โ„‹iโ‹…โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i21+โˆ‘ฯ„โ‰คtโ€–โˆ‡ฯ„,iโˆ’โˆ‡^ฯ„,iโ€–โ„‹i2โ‹…๐•€โˆ‡ฯ„,iโ‰ 0\displaystyle\sum_{t\in E_{i}}\frac{\max_{t}\|f_{t,i}-f\|_{\mathcal{H}_{i}}\cdot\left\|\nabla_{t,i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}}{\sqrt{1+\sum_{\tau\leq t}\left\|\nabla_{\tau,i}-\hat{\nabla}_{\tau,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{\tau,i}\neq 0}}}
โ‰ค\displaystyle\leq 2โ€‹(U+ฮปi)โ‹…โˆ‘tโˆˆEiโ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i21+โˆ‘ฯ„โ‰คtโ€–โˆ‡ฯ„,iโˆ’โˆ‡^ฯ„,iโ€–โ„‹i2โ‹…๐•€โˆ‡ฯ„,iโ‰ 0\displaystyle 2(U+\lambda_{i})\cdot\sum_{t\in E_{i}}\frac{\left\|\nabla_{t,i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}}{\sqrt{1+\sum_{\tau\leq t}\left\|\nabla_{\tau,i}-\hat{\nabla}_{\tau,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{\nabla_{\tau,i}\neq 0}}}
โ‰ค\displaystyle\leq 4โ€‹(U+ฮปi)โ€‹๐’œ~T,ฮบi,\displaystyle 4(U+\lambda_{i})\sqrt{\tilde{\mathcal{A}}_{T,\kappa_{i}}},

where โ€–ft,iโ€–โ„‹iโ‰คU+ฮปi\|f_{t,i}\|_{\mathcal{H}_{i}}\leq U+\lambda_{i}.
According to Lemma A.8.6, we can obtain

โˆ‘tโˆˆEiฮž4โ‰คฮปi2โ€‹โˆ‘tโˆˆEiโ€–โˆ‡iโ€‹(st),iโˆ’โˆ‡^t,iโ€–โ„‹i2โ‰ค2โ€‹ฮปiโ€‹๐’œ~T,ฮบi.\sum_{t\in E_{i}}\Xi_{4}\leq\frac{\lambda_{i}}{2}\sum_{t\in E_{i}}\left\|\nabla_{i(s_{t}),i}-\hat{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}\leq 2\lambda_{i}\tilde{\mathcal{A}}_{T,\kappa_{i}}.

Case 2: regret in Eยฏi\bar{E}_{i}
We decompose the instantaneous regret as follows,

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle
=\displaystyle= โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡^t,iโŸฉโŸฮž1+โŸจft,iโ€ฒโˆ’f,โˆ‡~t,iโŸฉโŸฮž2+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡~t,iโˆ’โˆ‡^t,iโŸฉโŸฮž3+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ\displaystyle\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\hat{\nabla}_{t,i}\rangle}_{\Xi_{1}}+\underbrace{\langle f^{\prime}_{t,i}-f,\tilde{\nabla}_{t,i}\rangle}_{\Xi_{2}}+\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\tilde{\nabla}_{t,i}-\hat{\nabla}_{t,i}\rangle}_{\Xi_{3}}+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+ฮž3โˆ’[โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)+โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ)]\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+\Xi_{3}-\left[\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})+\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i})\right]
=\displaystyle= โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+ฮž3โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)โˆ’ฮปi2โ€‹โ€–โˆ‡^t,iโ€–โ„‹i2\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+\Xi_{3}-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})-\frac{\lambda_{i}}{2}\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉโˆ’ฮปi2โ€‹โ€–โˆ‡~t,iโˆ’โˆ‡^t,iโ€–โ„‹i2โˆ’ฮปi2โ€‹โ€–โˆ‡^t,iโ€–โ„‹i2byโ€‹Lemmaโ€‹A.8.6\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle-\frac{\lambda_{i}}{2}\|\tilde{\nabla}_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}-\frac{\lambda_{i}}{2}\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lemma:JCST2022:property_of_OMD}
=\displaystyle= โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+ฮปi2โ€‹(โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2(โ„™โ€‹[bt,i=1])2โ€‹๐•€bt,i=1โˆ’โ€–โˆ‡^t,iโ€–โ„‹i2),\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+\frac{\lambda_{i}}{2}\left(\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t,i}=1])^{2}}\mathbb{I}_{b_{t,i}=1}-\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\right),

where ฮž1+ฮž2\Xi_{1}+\Xi_{2} follows the analysis in Case 1.
Case 3: regret in ๐’ฅi\mathcal{J}_{i}
Recall that the second mirror update is

ft,iโ€ฒ=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(1))}.f^{\prime}_{t,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\langle f,\tilde{\nabla}_{t,i}\rangle+\mathcal{B}_{\psi_{i}}(f,\bar{f}^{\prime}_{t-1,i}(1))\right\}.

We still decompose the instantaneous regret as follows

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ=โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡^t,iโŸฉโŸฮž1+โŸจft,iโ€ฒโˆ’f,โˆ‡~t,iโŸฉโŸฮž2+โŸจft,iโˆ’ft,iโ€ฒ,โˆ‡~t,iโˆ’โˆ‡^t,iโŸฉโŸฮž3+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ.\langle f_{t,i}-f,\nabla_{t,i}\rangle=\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\hat{\nabla}_{t,i}\rangle}_{\Xi_{1}}+\underbrace{\langle f^{\prime}_{t,i}-f,\tilde{\nabla}_{t,i}\rangle}_{\Xi_{2}}+\underbrace{\langle f_{t,i}-f^{\prime}_{t,i},\tilde{\nabla}_{t,i}-\hat{\nabla}_{t,i}\rangle}_{\Xi_{3}}+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle.

We reanalyze ฮž1\Xi_{1} and ฮž2\Xi_{2} as follows

ฮž1\displaystyle\Xi_{1} โ‰คโ„ฌฯˆiโ€‹(ft,iโ€ฒ,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)โˆ’โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ),\displaystyle\leq\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})-\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i}),
ฮž2\displaystyle\Xi_{2} โ‰คโ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(1))โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,fยฏtโˆ’1,iโ€ฒโ€‹(1)).\displaystyle\leq\mathcal{B}_{\psi_{i}}(f,\bar{f}^{\prime}_{t-1,i}(1))-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},\bar{f}^{\prime}_{t-1,i}(1)).

Then ฮž1+ฮž2+ฮž3\Xi_{1}+\Xi_{2}+\Xi_{3} can be further bounded as follows,

ฮž1+ฮž2+ฮž3โ‰ค\displaystyle\Xi_{1}+\Xi_{2}+\Xi_{3}\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+[โ„ฌฯˆiโ€‹(f,fยฏtโˆ’1,iโ€ฒโ€‹(1))โˆ’โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)]+\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\left[\mathcal{B}_{\psi_{i}}(f,\bar{f}^{\prime}_{t-1,i}(1))-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})\right]+
[โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(ft,iโ€ฒ,fยฏtโˆ’1,iโ€ฒโ€‹(1))]โˆ’[โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)+โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ)]+ฮž3.\displaystyle\left[\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},\bar{f}^{\prime}_{t-1,i}(1))\right]-\left[\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})+\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i})\right]+\Xi_{3}.

By Lemma A.8.6, we analyze the following term

ฮž3โˆ’[โ„ฌฯˆiโ€‹(ft,iโ€ฒ,ft,i)+โ„ฌฯˆiโ€‹(ft,i,ftโˆ’1,iโ€ฒ)]\displaystyle\Xi_{3}-\left[\mathcal{B}_{\psi_{i}}(f^{\prime}_{t,i},f_{t,i})+\mathcal{B}_{\psi_{i}}(f_{t,i},f^{\prime}_{t-1,i})\right]
โ‰ค\displaystyle\leq ฮปi2โ€‹[โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2(โ„™โ€‹[bt,i=1])2โ€‹๐•€bt,i=1โˆ’โ€–โˆ‡^t,iโ€–โ„‹i2]โˆ’12โ€‹ฮปiโ€‹โ€–ftโˆ’1,iโ€ฒโˆ’ft,iโ€ฒโ€–โ„‹i2+โŸจftโˆ’1,iโ€ฒโˆ’ft,iโ€ฒ,โˆ‡~t,iโŸฉ.\displaystyle\frac{\lambda_{i}}{2}\left[\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t,i}=1])^{2}}\mathbb{I}_{b_{t,i}=1}-\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\right]-\frac{1}{2\lambda_{i}}\|f^{\prime}_{t-1,i}-f^{\prime}_{t,i}\|^{2}_{\mathcal{H}_{i}}+\langle f^{\prime}_{t-1,i}-f^{\prime}_{t,i},\tilde{\nabla}_{t,i}\rangle.

Substituting into the instantaneous regret gives

โŸจft,iโˆ’f,โˆ‡t,iโŸฉโ‰ค\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle\leq โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+โŸจโˆ‡~t,i,ftโˆ’1,iโ€ฒโˆ’ft,iโ€ฒโŸฉ+\displaystyle\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle+\langle\tilde{\nabla}_{t,i},f^{\prime}_{t-1,i}-f^{\prime}_{t,i}\rangle+
โ€–fยฏtโˆ’1,iโ€ฒโ€‹(1)โˆ’fโ€–โ„‹i2โˆ’โ€–ftโˆ’1,iโ€ฒโˆ’fโ€–โ„‹i22โ€‹ฮปi+ฮปi2โ€‹โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2(โ„™โ€‹[bt,i=1])2โ€‹๐•€bt,i=1โˆ’ฮปi2โ€‹โ€–โˆ‡^t,iโ€–โ„‹i2.\displaystyle\frac{\|\bar{f}^{\prime}_{t-1,i}(1)-f\|^{2}_{\mathcal{H}_{i}}-\|f^{\prime}_{t-1,i}-f\|^{2}_{\mathcal{H}_{i}}}{2\lambda_{i}}+\frac{\lambda_{i}}{2}\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t,i}=1])^{2}}\mathbb{I}_{b_{t,i}=1}-\frac{\lambda_{i}}{2}\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}.

Combining all cases
Combining the above three cases, we obtain

๐’ฏ2โ‰ค\displaystyle\mathcal{T}_{2}\leq โˆ‘tโˆˆET,i[โ„ฌฯˆiโ€‹(f,ftโˆ’1,iโ€ฒ)โˆ’โ„ฌฯˆiโ€‹(f,ft,iโ€ฒ)]+4โ€‹(U+ฮปi)โ€‹๐’œ~T,ฮบi12+โˆ‘tโˆˆ๐’ฅi[โŸจโˆ‡~t,i,ftโˆ’1,iโ€ฒโˆ’ft,iโ€ฒโŸฉ+2โ€‹U2ฮปi]+\displaystyle\sum_{t\in E_{T,i}}\left[\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t-1,i})-\mathcal{B}_{\psi_{i}}(f,f^{\prime}_{t,i})\right]+4(U+\lambda_{i})\tilde{\mathcal{A}}^{\frac{1}{2}}_{T,\kappa_{i}}+\sum_{t\in\mathcal{J}_{i}}\left[\langle\tilde{\nabla}_{t,i},f^{\prime}_{t-1,i}-f^{\prime}_{t,i}\rangle+\frac{2U^{2}}{\lambda_{i}}\right]+
ฮปi2โ€‹โˆ‘tโˆˆEยฏiโˆช๐’ฅi[โ€–โˆ‡t,iโˆ’โˆ‡^t,iโ€–โ„‹i2(โ„™โ€‹[bt,i=1])2โ€‹๐•€bt,i=1โˆ’โ€–โˆ‡^t,iโ€–โ„‹i2]+โˆ‘tโˆˆEยฏiโˆช๐’ฅiโŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+2โ€‹ฮปiโ€‹๐’œ~T,ฮบi.\displaystyle\frac{\lambda_{i}}{2}\sum_{t\in\bar{E}_{i}\cup\mathcal{J}_{i}}\left[\frac{\|\nabla_{t,i}-\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t,i}=1])^{2}}\mathbb{I}_{b_{t,i}=1}-\|\hat{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}\right]+\sum_{t\in\bar{E}_{i}\cup\mathcal{J}_{i}}\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle+2\lambda_{i}\tilde{\mathcal{A}}_{T,\kappa_{i}}.

Recall that \|f^{\prime}_{t,i}\|_{\mathcal{H}_{i}}\leq U and \|f\|_{\mathcal{H}_{i}}\leq U. Conditioned on b_{s_{r},i},\ldots,b_{t-1,i}, taking expectation w.r.t. b_{t,i} gives

๐”ผโ€‹[๐’ฏ2]โ‰คU22โ€‹ฮปi+(2โ€‹U+2โ€‹U2ฮปi)โ‹…J+5โ€‹ฮปi2โ€‹๐’œ~T,ฮบi+4โ€‹(U+ฮปi)โ€‹๐’œ~T,ฮบi.\mathbb{E}\left[\mathcal{T}_{2}\right]\leq\frac{U^{2}}{2\lambda_{i}}+\left(2U+\frac{2U^{2}}{\lambda_{i}}\right)\cdot J+\frac{5\lambda_{i}}{2}\tilde{\mathcal{A}}_{T,\kappa_{i}}+4(U+\lambda_{i})\sqrt{\tilde{\mathcal{A}}_{T,\kappa_{i}}}. (A7)

Let ฮปi=Kโ€‹U2โ€‹B\lambda_{i}=\frac{\sqrt{K}U}{2\sqrt{B}}. Assuming that Bโ‰ฅKB\geq K, we have ฮปiโ‰คU2\lambda_{i}\leq\frac{U}{2}. Then

๐”ผโ€‹[๐’ฏ2]=\displaystyle\mathbb{E}\left[\mathcal{T}_{2}\right]= Oโ€‹(Uโ€‹BK+Kโ€‹UBโ€‹k1โ€‹๐’œ~T,ฮบi+Uโ€‹๐’œ~T,ฮบi)byโ€‹(โ€‹A3โ€‹)\displaystyle O\left(\frac{U\sqrt{B}}{\sqrt{K}}+\frac{\sqrt{K}U}{\sqrt{B}k_{1}}\tilde{\mathcal{A}}_{T,\kappa_{i}}+U\sqrt{\tilde{\mathcal{A}}_{T,\kappa_{i}}}\right)\quad\quad\quad\mathrm{by}~{}\eqref{eq:JCST2022:M-OMD-H:J}
=\displaystyle= Oโ€‹(Uโ€‹BK+Kโ€‹Uโ€‹๐’œT,ฮบiโ€‹lnโกTBโ€‹k1),byโ€‹Lemmaโ€‹A.1.1\displaystyle O\left(\frac{U\sqrt{B}}{\sqrt{K}}+\frac{\sqrt{K}U\mathcal{A}_{T,\kappa_{i}}\ln{T}}{\sqrt{B}k_{1}}\right),\quad\quad\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lem:JCST2022:reservoir_estimator}

where we omit the lower order term.

A.2.3 Combining ๐’ฏ1\mathcal{T}_{1} and ๐’ฏ2\mathcal{T}_{2}

Combining ๐’ฏ1\mathcal{T}_{1} and ๐’ฏ2\mathcal{T}_{2}, and taking expectation w.r.t. the randomness of reservoir sampling gives

๐”ผโ€‹[Regโ€‹(f)]\displaystyle\mathbb{E}\left[\mathrm{Reg}(f)\right]
=\displaystyle= ๐”ผโ€‹[โˆ‘t=1Tโ„“โ€‹(ftโ€‹(๐’™t),yt)โˆ’โˆ‘t=1Tโ„“โ€‹(ft,iโ€‹(๐’™t),yt)]+๐”ผโ€‹[๐’ฏ2]\displaystyle\mathbb{E}\left[\sum^{T}_{t=1}\ell(f_{t}({\bm{x}}_{t}),y_{t})-\sum^{T}_{t=1}\ell(f_{t,i}({\bm{x}}_{t}),y_{t})\right]+\mathbb{E}\left[\mathcal{T}_{2}\right]
โ‰ค\displaystyle\leq 32โ€‹๐”ผโ€‹[maxt,jโกct,jโ‹…โˆ‘t=1Tโ„“โ€‹(ft,iโ€‹(๐’™t),yt)โ€‹lnโกK]+92โ€‹maxt,jโกct,jโ‹…lnโกK+๐”ผโ€‹[๐’ฏ2]byโ€‹(โ€‹A6โ€‹)\displaystyle\frac{3}{\sqrt{2}}\mathbb{E}\left[\sqrt{\max_{t,j}c_{t,j}\cdot\sum^{T}_{t=1}\ell(f_{t,i}({\bm{x}}_{t}),y_{t})\ln{K}}\right]+\frac{9}{2}\max_{t,j}c_{t,j}\cdot\ln{K}+\mathbb{E}\left[\mathcal{T}_{2}\right]\quad\quad\quad\mathrm{by}~{}\eqref{eq:JCST2022:kernel_alignment_bound:first_part_of_regret}
=\displaystyle= 32โ€‹๐”ผโ€‹[maxt,jโกct,jโ‹…(โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt)+๐”ผโ€‹[๐’ฏ2])โ€‹lnโกK]+92โ€‹maxt,jโกct,jโ‹…lnโกK+๐”ผโ€‹[๐’ฏ2]\displaystyle\frac{3}{\sqrt{2}}\mathbb{E}\left[\sqrt{\max_{t,j}c_{t,j}\cdot\left(\sum^{T}_{t=1}\ell(f({\bm{x}}_{t}),y_{t})+\mathbb{E}\left[\mathcal{T}_{2}\right]\right)\ln{K}}\right]+\frac{9}{2}\max_{t,j}c_{t,j}\cdot\ln{K}+\mathbb{E}\left[\mathcal{T}_{2}\right]
=\displaystyle= Oโ€‹(maxt,jโกct,jโ‹…LTโ€‹(f)โ€‹lnโกK+Uโ€‹BK+Kโ€‹Uโ€‹๐’œT,ฮบiโ€‹lnโกTBโ€‹k1+maxt,jโกct,jโ‹…lnโกK).\displaystyle O\left(\sqrt{\max_{t,j}c_{t,j}\cdot L_{T}(f)\ln{K}}+\frac{U\sqrt{B}}{\sqrt{K}}+\frac{\sqrt{K}U\mathcal{A}_{T,\kappa_{i}}\ln{T}}{\sqrt{B}k_{1}}+\max_{t,j}c_{t,j}\cdot\ln{K}\right).

For the hinge loss function, we have \max_{t,j}c_{t,j}=1+U. ∎

Appendix A.3 Proof of Theorem 2

Proof.

For simplicity, denote by

ฮ›i=โˆ‘tโˆˆ๐’ฅi[โ€–fยฏtโˆ’1,iโ€ฒโ€‹(1)โˆ’fโ€–โ„‹i2โˆ’โ€–ftโˆ’1,iโ€ฒโˆ’fโ€–โ„‹i2].\Lambda_{i}=\sum_{t\in\mathcal{J}_{i}}\left[\left\|\bar{f}^{\prime}_{t-1,i}(1)-f\right\|^{2}_{\mathcal{H}_{i}}-\left\|f^{\prime}_{t-1,i}-f\right\|^{2}_{\mathcal{H}_{i}}\right].

There must be a constant \xi_{i}\in(0,4] such that \Lambda_{i}\leq\xi_{i}U^{2}J. We will prove a better regret bound if \xi_{i} is small enough. Recall that (A3) gives an upper bound on J. If \xi_{i}\leq\frac{1}{J}, then we can rewrite (A7) as

๐’ฏ2โ‰คU22โ€‹ฮปi+2โ€‹Uโ€‹J+U22โ€‹ฮปi+5โ€‹ฮปi2โ€‹๐’œ~T,ฮบi+4โ€‹(U+ฮปi)โ€‹๐’œ~T,ฮบi.\displaystyle\mathcal{T}_{2}\leq\frac{U^{2}}{2\lambda_{i}}+2UJ+\frac{U^{2}}{2\lambda_{i}}+\frac{5\lambda_{i}}{2}\tilde{\mathcal{A}}_{T,\kappa_{i}}+4(U+\lambda_{i})\sqrt{\tilde{\mathcal{A}}_{T,\kappa_{i}}}.

Let ฮปi=2โ€‹U5โ€‹๐’œ~T,ฮบi\lambda_{i}=\frac{\sqrt{2}U}{\sqrt{5\tilde{\mathcal{A}}_{T,\kappa_{i}}}}. Taking expectation w.r.t. the reservoir sampling and using Lemma A.1.1 gives

๐”ผโ€‹[๐’ฏ2]=Oโ€‹(Uโ€‹KBโ€‹k1โ€‹๐’œT,ฮบiโ€‹lnโกT+Uโ€‹๐’œT,ฮบiโ€‹lnโกT),\mathbb{E}\left[\mathcal{T}_{2}\right]=O\left(\frac{UK}{Bk_{1}}\mathcal{A}_{T,\kappa_{i}}\ln{T}+U\sqrt{\mathcal{A}_{T,\kappa_{i}}\ln{T}}\right),

where we omit the lower order terms. Combining ๐’ฏ1\mathcal{T}_{1} and ๐’ฏ2\mathcal{T}_{2} gives

๐”ผโ€‹[Regโ€‹(f)]\displaystyle\mathbb{E}\left[\mathrm{Reg}(f)\right]
=\displaystyle= 32โ€‹๐”ผโ€‹[maxt,jโกct,jโ‹…(โˆ‘t=1Tโ„“โ€‹(fโ€‹(๐’™t),yt)+๐”ผโ€‹[๐’ฏ2])โ€‹lnโกK]+92โ€‹maxt,jโกct,jโ‹…lnโกK+๐”ผโ€‹[๐’ฏ2]\displaystyle\frac{3}{\sqrt{2}}\mathbb{E}\left[\sqrt{\max_{t,j}c_{t,j}\cdot\left(\sum^{T}_{t=1}\ell(f({\bm{x}}_{t}),y_{t})+\mathbb{E}\left[\mathcal{T}_{2}\right]\right)\ln{K}}\right]+\frac{9}{2}\max_{t,j}c_{t,j}\cdot\ln{K}+\mathbb{E}\left[\mathcal{T}_{2}\right]
=\displaystyle= Oโ€‹(maxt,jโกct,jโ‹…LTโ€‹(f)โ€‹lnโกK+Uโ€‹KBโ€‹k1โ€‹๐’œT,ฮบiโ€‹lnโกT+Uโ€‹๐’œT,ฮบiโ€‹lnโกT+maxt,jโกct,jโ‹…lnโกK),\displaystyle O\left(\sqrt{\max_{t,j}c_{t,j}\cdot L_{T}(f)\ln{K}}+\frac{UK}{Bk_{1}}\mathcal{A}_{T,\kappa_{i}}\ln{T}+U\sqrt{\mathcal{A}_{T,\kappa_{i}}\ln{T}}+\max_{t,j}c_{t,j}\cdot\ln{K}\right),

which concludes the proof. โˆŽ

Appendix A.4 Proof of Lemma 2

Proof.

Recall the definitions of \mathcal{J} and T_{r} in Section A.1. For any t\in T_{r}, ({\bm{x}}_{t},y_{t}) will be added into S only if b_{t}=1. At the end of the t_{r}-th round, we have

|S|=B2โ€‹๐•€rโ‰ 1+โˆ‘t=trโˆ’1+1trbt=B.|S|=\frac{B}{2}\mathbb{I}_{r\neq 1}+\sum^{t_{r}}_{t=t_{r-1}+1}b_{t}=B.

We remove B2\frac{B}{2} examples from SS at the end of the trt_{r}-th round.

Assume that there is no budget. For any t_{0}>t_{r-1}+1, we will prove an upper bound on \sum^{t_{0}}_{t=t_{r-1}+1}b_{t}. Define a random variable X_{t} as follows,

Xt=btโˆ’โ„™โ€‹[bt=1],|Xt|โ‰ค1.X_{t}=b_{t}-\mathbb{P}[b_{t}=1],\quad|X_{t}|\leq 1.

Conditioned on b_{t_{r-1}+1},\ldots,b_{t-1}, we have \mathbb{E}_{b_{t}}[X_{t}]=0. Thus X_{t_{r-1}+1},\ldots,X_{t_{0}} form a bounded martingale difference sequence. Let \hat{L}_{a:b}:=\sum^{b}_{t=a}\ell(f_{t}({\bm{x}}_{t}),y_{t}) and \hat{L}_{1:T}\leq N. The sum of conditional variances satisfies

ฮฃ2โ‰คโˆ‘t=trโˆ’1+1t0|โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)||โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)|+G1โ‰คG2G1โ€‹L^trโˆ’1+1:t0,\Sigma^{2}\leq\sum^{t_{0}}_{t=t_{r-1}+1}\frac{|\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})|}{|\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})|+G_{1}}\leq\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{0}},

where the last inequality comes from Assumption 3. Since L^trโˆ’1+1:t0\hat{L}_{t_{r-1}+1:t_{0}} is a random variable, Lemma A.8.3 can give an upper bound on โˆ‘t=trโˆ’1+1t0bt\sum^{t_{0}}_{t=t_{r-1}+1}b_{t} with probability at least 1โˆ’2โ€‹โŒˆlogโกNโŒ‰โ€‹ฮด1-2\lceil\log{N}\rceil\delta. Let trt_{r} be the minimal t0t_{0} such that

G2G1โ€‹L^trโˆ’1+1:tr+23โ€‹lnโก1ฮด+2โ€‹G2G1โ€‹L^trโˆ’1+1:trโ€‹lnโก1ฮดโ‰ฅB2โ‹…๐•€rโ‰ฅ2+Bโ‹…๐•€r=1.\displaystyle\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{r}}+\frac{2}{3}\ln\frac{1}{\delta}+2\sqrt{\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{r}}\ln\frac{1}{\delta}}\geq\frac{B}{2}\cdot\mathbb{I}_{r\geq 2}+B\cdot\mathbb{I}_{r=1}.

The rr-th epoch will end at trt_{r}. Summing over rโˆˆ{1,โ€ฆ,J}r\in\{1,\ldots,J\}, with probability at least 1โˆ’2โ€‹Jโ€‹โŒˆlogโกNโŒ‰โ€‹ฮด1-2J\lceil\log{N}\rceil\delta,

โˆ‘r=1Jโˆ‘t=trโˆ’1+1trbtโ‰คโˆ‘r=1J(G2G1โ€‹L^trโˆ’1+1:tr+23โ€‹lnโก1ฮด+2โ€‹G2G1โ€‹L^trโˆ’1+1:trโ€‹lnโก1ฮด),\displaystyle\sum^{J}_{r=1}\sum^{t_{r}}_{t=t_{r-1}+1}b_{t}\leq\sum^{J}_{r=1}\left(\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{r}}+\frac{2}{3}\ln\frac{1}{\delta}+2\sqrt{\frac{G_{2}}{G_{1}}\hat{L}_{t_{r-1}+1:t_{r}}\ln\frac{1}{\delta}}\right),

which, by the Cauchy-Schwarz inequality applied to the square-root terms, implies

B2+Jโ€‹B2โ‰คG2G1โ€‹L^1:T+23โ€‹Jโ€‹lnโก1ฮด+2โ€‹Jโ€‹G2G1โ€‹L^1:Tโ€‹lnโก1ฮด.\frac{B}{2}+\frac{JB}{2}\leq\frac{G_{2}}{G_{1}}\hat{L}_{1:T}+\frac{2}{3}J\ln\frac{1}{\delta}+2\sqrt{J\frac{G_{2}}{G_{1}}\hat{L}_{1:T}\ln\frac{1}{\delta}}.

Solving the above inequality yields,

Jโ‰ค\displaystyle J\leq 2โ€‹G2G1โ€‹L^1:TBโˆ’43โ€‹lnโก1ฮด+16โ€‹G2G1โ€‹L^1:T(Bโˆ’43โ€‹lnโก1ฮด)2โ€‹lnโก1ฮด+4โ€‹2(Bโˆ’43โ€‹lnโก1ฮด)32โ€‹G2G1โ€‹L^1:Tโ€‹lnโก1ฮด.\displaystyle\frac{2G_{2}}{G_{1}}\frac{\hat{L}_{1:T}}{B-\frac{4}{3}\ln\frac{1}{\delta}}+\frac{16G_{2}}{G_{1}}\frac{\hat{L}_{1:T}}{(B-\frac{4}{3}\ln\frac{1}{\delta})^{2}}\ln\frac{1}{\delta}+\frac{4\sqrt{2}}{(B-\frac{4}{3}\ln\frac{1}{\delta})^{\frac{3}{2}}}\frac{G_{2}}{G_{1}}\hat{L}_{1:T}\sqrt{\ln\frac{1}{\delta}}.

Let Bโ‰ฅ21โ€‹lnโก1ฮดB\geq 21\ln\frac{1}{\delta}. Simplifying the above result concludes the proof. โˆŽ

Appendix A.5 Proof of Theorem 3

Proof.

Let {\bm{p}}\in\Delta_{K-1} satisfy p_{i}=1. By the convexity of the loss function, we have

Regโ€‹(f)โ‰ค\displaystyle\mathrm{Reg}(f)\leq โˆ‘t=1TโŸจโ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt),ftโ€‹(๐’™t)โˆ’fโ€‹(๐’™t)โŸฉ\displaystyle\sum^{T}_{t=1}\langle\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t}),f_{t}({\bm{x}}_{t})-f({\bm{x}}_{t})\rangle
=\displaystyle= โˆ‘t=1TโŸจโ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt),โˆ‘i=1Kpt,iโ€‹ft,iโ€‹(๐’™t)โˆ’fโ€‹(๐’™t)โŸฉ\displaystyle\sum^{T}_{t=1}\left\langle\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t}),\sum^{K}_{i=1}p_{t,i}f_{t,i}({\bm{x}}_{t})-f({\bm{x}}_{t})\right\rangle
=\displaystyle= โˆ‘t=1TโŸจโ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt),โˆ‘i=1Kpt,iโ€‹ft,iโ€‹(๐’™t)โˆ’โˆ‘i=1Kpiโ€‹ft,iโ€‹(๐’™t)+โˆ‘i=1Kpiโ€‹ft,iโ€‹(๐’™t)โˆ’fโ€‹(๐’™t)โŸฉ\displaystyle\sum^{T}_{t=1}\left\langle\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t}),\sum^{K}_{i=1}p_{t,i}f_{t,i}({\bm{x}}_{t})-\sum^{K}_{i=1}p_{i}f_{t,i}({\bm{x}}_{t})+\sum^{K}_{i=1}p_{i}f_{t,i}({\bm{x}}_{t})-f({\bm{x}}_{t})\right\rangle
=\displaystyle= โˆ‘t=1Tโ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ€‹โˆ‘i=1K(pt,iโˆ’pi)โ€‹ft,iโ€‹(๐’™t)โŸ๐’ฏ1+โˆ‘t=1TโŸจโˆ‡t,i,ft,iโˆ’fโŸฉโŸ๐’ฏ2.\displaystyle\underbrace{\sum^{T}_{t=1}\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\sum^{K}_{i=1}(p_{t,i}-p_{i})f_{t,i}({\bm{x}}_{t})}_{\mathcal{T}_{1}}+\underbrace{\sum^{T}_{t=1}\langle\nabla_{t,i},f_{t,i}-f\rangle}_{\mathcal{T}_{2}}.

We first analyze ๐’ฏ1\mathcal{T}_{1}. We have

๐’ฏ1=\displaystyle\mathcal{T}_{1}= โˆ‘tโˆˆT1โˆ‘i=1K(pt,iโˆ’pi)โ‹…โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…(ft,iโ€‹(๐’™t)โˆ’minjโˆˆ[K]โกft,jโ€‹(๐’™t))+\displaystyle\sum_{t\in T^{1}}\sum^{K}_{i=1}(p_{t,i}-p_{i})\cdot\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\left(f_{t,i}({\bm{x}}_{t})-\min_{j\in[K]}f_{t,j}({\bm{x}}_{t})\right)+
โˆ‘tโˆˆT2โˆ‘i=1K(pt,iโˆ’pi)โ‹…โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)โ‹…(ft,iโ€‹(๐’™t)โˆ’maxjโˆˆ[K]โกft,jโ€‹(๐’™t))\displaystyle\sum_{t\in T^{2}}\sum^{K}_{i=1}(p_{t,i}-p_{i})\cdot\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\cdot\left(f_{t,i}({\bm{x}}_{t})-\max_{j\in[K]}f_{t,j}({\bm{x}}_{t})\right)
=\displaystyle= โˆ‘t=1TโŸจ๐’‘tโˆ’๐’‘,๐’„tโŸฉbyโ€‹(โ€‹16โ€‹)\displaystyle\sum^{T}_{t=1}\langle{\bm{p}}_{t}-{\bm{p}},{\bm{c}}_{t}\rangle\quad\quad\quad\mathrm{by}~{}\eqref{eq:JCST2022:smooth_loss:unsigned_criterion}
โ‰ค\displaystyle\leq 32โ€‹maxt,jโกct,jโ€‹โˆ‘ฯ„=1TโŸจ๐’‘ฯ„,๐’„ฯ„โŸฉโ€‹lnโกKbyโ€‹(โ€‹A5โ€‹)\displaystyle\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\sum^{T}_{\tau=1}\langle{\bm{p}}_{\tau},{\bm{c}}_{\tau}\rangle\ln{K}}\quad\quad\quad\mathrm{by}~{}\eqref{eq:JCST2023_supp:initial_regret_PEA}
โ‰ค\displaystyle\leq 32โ€‹maxt,jโกct,jโ€‹โˆ‘ฯ„=1T|โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)|โ‹…2โ€‹Uโ€‹lnโกK\displaystyle\frac{3}{\sqrt{2}}\sqrt{\max_{t,j}c_{t,j}\sum^{T}_{\tau=1}|\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})|\cdot 2U\ln{K}}
โ‰ค\displaystyle\leq 62UG2โ€‹G1โ€‹L^1:Tโ€‹lnโกK.byAssumption(3)\displaystyle\frac{6}{\sqrt{2}}U\sqrt{G_{2}G_{1}\hat{L}_{1:T}\ln{K}}.\quad\quad\quad\mathrm{by}~{}\mathrm{Assumption}~{}\eqref{ass:JCST2022:property_smooth_loss}

Next we analyze ๐’ฏ2\mathcal{T}_{2}. We decompose [T][T] as follows,

T1=\displaystyle T_{1}= {tโˆˆ[T]:conโ€‹(a)},\displaystyle\{t\in[T]:\mathrm{con}(a)\},
๐’ฅ=\displaystyle\mathcal{J}= {tโˆˆ[T]:|S|=ฮฑโ€‹โ„›,bt=1},\displaystyle\{t\in[T]:|S|=\alpha\mathcal{R},b_{t}=1\},
Tยฏ1=\displaystyle\bar{T}_{1}= [T]โˆ–(T1โˆช๐’ฅ).\displaystyle[T]\setminus(T_{1}\cup\mathcal{J}).

Case 1: regret in T1T_{1}
We decompose โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\langle f_{t,i}-f,\nabla_{t,i}\rangle as follows,

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle
=\displaystyle= โŸจft+1,iโˆ’f,โˆ‡iโ€‹(st),iโŸฉ+โŸจft+1,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ+โŸจft,iโˆ’ft+1,i,โˆ‡iโ€‹(st),i+โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ\displaystyle\langle f_{t+1,i}-f,\nabla_{i(s_{t}),i}\rangle+\langle f_{t+1,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle+\langle f_{t,i}-f_{t+1,i},\nabla_{i(s_{t}),i}+\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)โˆ’โ„ฌฯˆiโ€‹(ft+1,i,ft,i)+โŸจft,iโˆ’ft+1,i,โˆ‡iโ€‹(st),iโŸฉ+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})+\left\langle f_{t,i}-f_{t+1,i},\nabla_{i(s_{t}),i}\right\rangle+\left\langle f_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\right\rangle
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+ฮป2โ€‹โ€–โˆ‡iโ€‹(st),iโ€–โ„‹i2+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉ,\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\frac{\lambda}{2}\|\nabla_{i(s_{t}),i}\|^{2}_{\mathcal{H}_{i}}+\left\langle f_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\right\rangle,

where the last inequality comes from Lemma A.8.6. Next we analyze the third term.

โˆ‘tโˆˆT1โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡iโ€‹(st),iโŸฉโ‰ค\displaystyle\sum_{t\in T_{1}}\langle f_{t,i}-f,\nabla_{t,i}-\nabla_{i(s_{t}),i}\rangle\leq 2โ€‹Uโ‹…โˆ‘tโˆˆT1|โ„“โ€ฒโ€‹(ftโ€‹(๐’™t),yt)|1+โˆ‘ฯ„โˆˆT1,ฯ„โ‰คt|โ„“โ€ฒโ€‹(fฯ„โ€‹(๐’™ฯ„),yฯ„)|\displaystyle 2U\cdot\sum_{t\in T_{1}}\frac{\left|\ell^{\prime}(f_{t}({\bm{x}}_{t}),y_{t})\right|}{\sqrt{1+\sum_{\tau\in T_{1},\tau\leq t}\left|\ell^{\prime}(f_{\tau}({\bm{x}}_{\tau}),y_{\tau})\right|}}
โ‰ค\displaystyle\leq 4โ€‹Uโ€‹G2โ€‹L^1:T.\displaystyle 4U\sqrt{G_{2}\hat{L}_{1:T}}.

Case 2: regret in Tยฏ1\bar{T}_{1}
We use a different decomposition as follows

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\langle f_{t,i}-f,\nabla_{t,i}\rangle
=\displaystyle= โŸจft+1,iโˆ’f,โˆ‡~t,iโŸฉโŸฮž1+โŸจft+1,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉโŸฮž2+โŸจft,iโˆ’ft+1,i,โˆ‡t,iโŸฉโŸฮž3\displaystyle\underbrace{\langle f_{t+1,i}-f,\tilde{\nabla}_{t,i}\rangle}_{\Xi_{1}}+\underbrace{\langle f_{t+1,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle}_{\Xi_{2}}+\underbrace{\langle f_{t,i}-f_{t+1,i},\nabla_{t,i}\rangle}_{\Xi_{3}}
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)โˆ’โ„ฌฯˆiโ€‹(ft+1,i,ft,i)+โŸจft+1,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})+\langle f_{t+1,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+
โŸจft,iโˆ’ft+1,i,โˆ‡~t,iโŸฉ+โŸจft,iโˆ’ft+1,i,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ\displaystyle\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle+\langle f_{t,i}-f_{t+1,i},\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle
=\displaystyle= โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+โŸจft,iโˆ’ft+1,i,โˆ‡~t,iโŸฉโˆ’โ„ฌฯˆiโ€‹(ft+1,i,ft,i)\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle+\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+ฮปi2โ€‹โ€–โˆ‡~t,iโ€–โ„‹i2.\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\left\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\right\rangle+\frac{\lambda_{i}}{2}\|\tilde{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}.

Case 3: regret in ๐’ฅ\mathcal{J}
We decompose \left\langle f_{t,i}-f,\nabla_{t,i}\right\rangle into three terms as in Case 2. The second mirror update is

ft+1,i=argโกminfโˆˆโ„i{โŸจf,โˆ‡~t,iโŸฉ+โ„ฌฯˆiโ€‹(f,fยฏt,iโ€‹(2))}.f_{t+1,i}=\mathop{\arg\min}_{f\in\mathbb{H}_{i}}\left\{\langle f,\tilde{\nabla}_{t,i}\rangle+\mathcal{B}_{\psi_{i}}(f,\bar{f}_{t,i}(2))\right\}.

Similar to the analysis of Case 2, we obtain

ฮž1โ‰ค\displaystyle\Xi_{1}\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)โˆ’โ„ฌฯˆiโ€‹(ft+1,i,fยฏt,iโ€‹(2))+[โ„ฌฯˆiโ€‹(f,fยฏt,iโ€‹(2))โˆ’โ„ฌฯˆiโ€‹(f,ft,i)],\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})-\mathcal{B}_{\psi_{i}}(f_{t+1,i},\bar{f}_{t,i}(2))+[\mathcal{B}_{\psi_{i}}(f,\bar{f}_{t,i}(2))-\mathcal{B}_{\psi_{i}}(f,f_{t,i})],
ฮž3=\displaystyle\Xi_{3}= โŸจfยฏt,iโ€‹(2)โˆ’ft+1,i,โˆ‡~t,iโŸฉ+โŸจft,iโˆ’fยฏt,iโ€‹(2),โˆ‡~t,iโŸฉ+โŸจft,iโˆ’ft+1,i,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ.\displaystyle\langle\bar{f}_{t,i}(2)-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle+\langle f_{t,i}-\bar{f}_{t,i}(2),\tilde{\nabla}_{t,i}\rangle+\langle f_{t,i}-f_{t+1,i},\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle.

Combining ฮž1\Xi_{1}, ฮž2\Xi_{2} and ฮž3\Xi_{3} gives

โŸจft,iโˆ’f,โˆ‡t,iโŸฉ\displaystyle\left\langle f_{t,i}-f,\nabla_{t,i}\right\rangle
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+
โ„ฌฯˆiโ€‹(f,fยฏt,iโ€‹(2))โˆ’โ„ฌฯˆiโ€‹(f,ft,i)+โŸจft,iโˆ’fยฏt,iโ€‹(2),โˆ‡~t,iโŸฉโŸฮž4+โŸจfยฏt,iโ€‹(2)โˆ’ft+1,i,โˆ‡~t,iโŸฉโˆ’โ„ฌฯˆiโ€‹(ft+1,i,fยฏt,iโ€‹(2))\displaystyle\underbrace{\mathcal{B}_{\psi_{i}}(f,\bar{f}_{t,i}(2))-\mathcal{B}_{\psi_{i}}(f,f_{t,i})+\langle f_{t,i}-\bar{f}_{t,i}(2),\tilde{\nabla}_{t,i}\rangle}_{\Xi_{4}}+\langle\bar{f}_{t,i}(2)-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},\bar{f}_{t,i}(2)) (A8)
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+
2โ€‹U2ฮปi+4โ€‹Uโ€‹G1+โŸจfยฏt,iโ€‹(2)โˆ’ft+1,i,โˆ‡~t,iโŸฉโˆ’โ„ฌฯˆiโ€‹(ft+1,i,fยฏt,iโ€‹(2))byโ€‹Lemmaโ€‹A.8.6\displaystyle\frac{2U^{2}}{\lambda_{i}}+4UG_{1}+\langle\bar{f}_{t,i}(2)-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},\bar{f}_{t,i}(2))\quad\quad\quad\mathrm{by}~{}\mathrm{Lemma}~{}\ref{lemma:JCST2022:property_of_OMD}
โ‰ค\displaystyle\leq โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i)+โŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉ+2โ€‹U2ฮปi+4โ€‹Uโ€‹G1+ฮปi2โ€‹โ€–โˆ‡~t,iโ€–โ„‹i2.\displaystyle\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})+\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle+\frac{2U^{2}}{\lambda_{i}}+4UG_{1}+\frac{\lambda_{i}}{2}\|\tilde{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}.

Combining the regret in T1T_{1}, ๐’ฅ\mathcal{J} and Tยฏ1\bar{T}_{1} gives

๐’ฏ2โ‰ค\displaystyle\mathcal{T}_{2}\leq 4โ€‹Uโ€‹G2โ€‹L^1:T+(2โ€‹U2ฮปi+4โ€‹Uโ€‹G1)โ€‹|๐’ฅ|+โˆ‘tโˆˆTยฏ1โˆช๐’ฅโŸจft,iโˆ’f,โˆ‡t,iโˆ’โˆ‡~t,iโŸฉโŸฮž2,1+\displaystyle 4U\sqrt{G_{2}\hat{L}_{1:T}}+\left(\frac{2U^{2}}{\lambda_{i}}+4UG_{1}\right)|\mathcal{J}|+\underbrace{\sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle}_{\Xi_{2,1}}+
โˆ‘t=1T(โ„ฌฯˆiโ€‹(f,ft,i)โˆ’โ„ฌฯˆiโ€‹(f,ft+1,i))+ฮปiโ€‹(12โ€‹โˆ‘tโˆˆTยฏ1โˆช๐’ฅโ€–โˆ‡~t,iโ€–โ„‹i2+โˆ‘tโˆˆT112โ€‹โ€–โˆ‡iโ€‹(st),iโ€–โ„‹i2)โŸฮž2,2\displaystyle\sum^{T}_{t=1}\left(\mathcal{B}_{\psi_{i}}(f,f_{t,i})-\mathcal{B}_{\psi_{i}}(f,f_{t+1,i})\right)+\lambda_{i}\underbrace{\left(\frac{1}{2}\sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\|\tilde{\nabla}_{t,i}\|^{2}_{\mathcal{H}_{i}}+\sum_{t\in T_{1}}\frac{1}{2}\|\nabla_{i(s_{t}),i}\|^{2}_{\mathcal{H}_{i}}\right)}_{\Xi_{2,2}}
โ‰ค\displaystyle\leq 4โ€‹Uโ€‹G2โ€‹L^1:T+(U22โ€‹ฮปi+4โ€‹Uโ€‹G1)โ€‹|๐’ฅ|+ฮž2,1+U22โ€‹ฮปi+ฮž2,2.\displaystyle 4U\sqrt{G_{2}\hat{L}_{1:T}}+\left(\frac{U^{2}}{2\lambda_{i}}+4UG_{1}\right)|\mathcal{J}|+\Xi_{2,1}+\frac{U^{2}}{2\lambda_{i}}+\Xi_{2,2}.

Lemma A.8.4 gives, with probability at least 1โˆ’ฮ˜โ€‹(โŒˆlnโกTโŒ‰)โ€‹ฮด1-\Theta(\lceil\ln{T}\rceil)\delta,

ฮž2,1โ‰ค\displaystyle\Xi_{2,1}\leq 43โ€‹Uโ€‹G1โ€‹lnโก1ฮด+2โ€‹Uโ€‹2โ€‹G2โ€‹G1โ€‹L^1:Tโ€‹lnโก1ฮด,\displaystyle\frac{4}{3}UG_{1}\ln\frac{1}{\delta}+2U\sqrt{2G_{2}G_{1}\hat{L}_{1:T}\ln\frac{1}{\delta}},
ฮž2,2โ‰ค\displaystyle\Xi_{2,2}\leq G1โ€‹G2โ€‹L^1:T+23โ€‹G12โ€‹lnโก1ฮด+2โ€‹G13โ€‹G2โ€‹L^1:Tโ€‹lnโก1ฮด.\displaystyle G_{1}G_{2}\hat{L}_{1:T}+\frac{2}{3}G^{2}_{1}\ln\frac{1}{\delta}+2\sqrt{G^{3}_{1}G_{2}\hat{L}_{1:T}\ln\frac{1}{\delta}}.

Let ฮปi=2โ€‹UBโ€‹G1\lambda_{i}=\frac{2U}{\sqrt{B}G_{1}}. Using Lemma 2 and combining ๐’ฏ1\mathcal{T}_{1} and ๐’ฏ2\mathcal{T}_{2} gives, with probability at least 1โˆ’ฮ˜โ€‹(โŒˆlnโกTโŒ‰)โ€‹ฮด1-\Theta(\lceil\ln{T}\rceil)\delta,

Regโ€‹(f)=\displaystyle\mathrm{Reg}(f)= L^1:Tโˆ’LTโ€‹(f)\displaystyle\hat{L}_{1:T}-L_{T}(f)
โ‰ค\displaystyle\leq ๐’ฏ1+๐’ฏ2\displaystyle\mathcal{T}_{1}+\mathcal{T}_{2}
โ‰ค\displaystyle\leq 10โ€‹Uโ€‹G2โ€‹G1โ€‹L^1:Tโ€‹lnโก1ฮด+Uโ€‹G1โ€‹B4+6โ€‹Uโ€‹G2โ€‹L^1:TBโˆ’43โ€‹lnโก1ฮด,\displaystyle 10U\sqrt{G_{2}G_{1}\hat{L}_{1:T}\ln\frac{1}{\delta}}+\frac{UG_{1}\sqrt{B}}{4}+\frac{6UG_{2}\hat{L}_{1:T}}{\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}},

where we omit the constant terms and the lower order terms. Let ฮณ=6โ€‹Uโ€‹G2Bโˆ’43โ€‹lnโก1ฮด\gamma=\frac{6UG_{2}}{\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}} and Uโ‰ค18โ€‹G2โ€‹Bโˆ’43โ€‹lnโก1ฮดU\leq\frac{1}{8G_{2}}\sqrt{B-\frac{4}{3}\ln\frac{1}{\delta}}. Then 1โˆ’ฮณโ‰ฅ141-\gamma\geq\frac{1}{4}. Solving for L^1:T\hat{L}_{1:T} concludes the proof.

Finally, we explain why K\leq d must hold. The space complexity of M-OMD-S is O(KB+dB+K). According to Assumption 1, the coefficient \alpha only depends on d. If K\leq d, then the space complexity of M-OMD-S is O(dB). In this case, B=\Theta(\alpha\mathcal{R}). If K>d, then the space complexity is O(KB), and M-OMD-S must allocate the memory resource over the K hypotheses. For instance, if K=d^{\nu} with \nu>1, then B=\Theta(K^{\frac{1-\nu}{\nu}}\alpha\mathcal{R}). Thus the regret bound increases by a factor of order O(K^{\frac{\nu-1}{2\nu}}). ∎

Appendix A.6 Proof of Theorem 4

Proof.

Let \kappa({\bm{x}},{\bm{v}})=\langle{\bm{x}},{\bm{v}}\rangle^{p}. The adversary first constructs \mathcal{I}_{T}. For 1\leq t\leq 3B, let {\bm{x}}_{t}={\bm{e}}_{t}, where {\bm{e}}_{t} is the t-th standard basis vector in \mathbb{R}^{d}. Let y_{t}=1 if t is odd; otherwise, y_{t}=-1. For 3B+1\leq t\leq T, let ({\bm{x}}_{t},y_{t}) be drawn uniformly from \{({\bm{x}}_{\tau},y_{\tau})\}^{3B}_{\tau=1}.
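As a concrete illustration of this construction, the following minimal sketch generates the instance sequence \mathcal{I}_{T}; the dimension, the budget, and all names are illustrative assumptions, not part of the proof.

```python
import numpy as np

def build_adversarial_stream(T, B, d, seed=0):
    """Instance sequence of Theorem 4: x_t = e_t with alternating labels for
    t <= 3B, then (x_t, y_t) drawn uniformly from the first 3B pairs."""
    assert d >= 3 * B and T > 3 * B
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for t in range(1, 3 * B + 1):
        e_t = np.zeros(d)
        e_t[t - 1] = 1.0                      # standard basis vector e_t
        xs.append(e_t)
        ys.append(1.0 if t % 2 == 1 else -1.0)
    for _ in range(3 * B + 1, T + 1):
        j = rng.integers(3 * B)               # uniform resampling from the pool
        xs.append(xs[j])
        ys.append(ys[j])
    return np.array(xs), np.array(ys)

# Usage sketch with small illustrative values.
X, y = build_adversarial_stream(T=500, B=20, d=64)
print(X.shape, y[:6])
```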

We construct a competitor as follows,

fยฏโ„=U3โ€‹Bโ‹…โˆ‘t=13โ€‹Bytโ€‹ฮบโ€‹(๐’™t,โ‹…).\bar{f}_{\mathbb{H}}=\frac{U}{\sqrt{3B}}\cdot\sum^{3B}_{t=1}y_{t}\kappa({\bm{x}}_{t},\cdot).

It is easy to prove

L_{T}(\bar{f}_{\mathbb{H}})= T\cdot\ln\left(1+\exp\left(-\frac{U}{\sqrt{3B}}\right)\right),
\|\bar{f}_{\mathbb{H}}\|_{\mathcal{H}}= U.

Thus $\bar{f}_{\mathbb{H}}\in\mathbb{H}$.
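For completeness, a short verification of the two identities (our own computation; the loss here is the logistic loss $\ell(u,y)=\ln(1+\exp(-yu))$, as the displayed expressions indicate): since $\kappa({\bm{x}}_{s},{\bm{x}}_{t})=\langle{\bm{e}}_{s},{\bm{e}}_{t}\rangle^{p}=\mathbb{I}_{s=t}$ for $s,t\leq 3B$,

\bar{f}_{\mathbb{H}}({\bm{x}}_{t})=\frac{U}{\sqrt{3B}}y_{t},\qquad\|\bar{f}_{\mathbb{H}}\|^{2}_{\mathcal{H}}=\frac{U^{2}}{3B}\sum^{3B}_{t=1}\kappa({\bm{x}}_{t},{\bm{x}}_{t})=U^{2},

so every round (including the rounds $t>3B$, which replay the first $3B$ pairs) incurs loss $\ln(1+\exp(-\frac{U}{\sqrt{3B}}))$ for $\bar{f}_{\mathbb{H}}$.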
Let $\mathcal{A}$ be an algorithm storing at most $B$ examples. At the beginning of round $t$, let

f_{t}=\sum_{i\leq B}a^{(t)}_{i}\kappa({\bm{x}}^{(t)}_{i},\cdot)

be the hypothesis maintained by $\mathcal{A}$, where ${\bm{x}}^{(t)}_{i}\in\{{\bm{x}}_{1},\ldots,{\bm{x}}_{t-1}\}$. Moreover, it must hold that

\|f_{t}\|_{\mathcal{H}}=\sqrt{\sum_{i\leq B}|a^{(t)}_{i}|^{2}}\leq U. (A9)

For $t\leq 3B$, the instance ${\bm{x}}_{t}={\bm{e}}_{t}$ has not appeared before round $t$ and is orthogonal to every stored instance, so $f_{t}({\bm{x}}_{t})=0$ and $\ell(f_{t}({\bm{x}}_{t}),y_{t})=\ln(2)$; hence $\sum^{3B}_{t=1}\ell(f_{t}({\bm{x}}_{t}),y_{t})=3\ln(2)B$. For any $t\geq 3B+1$, the expected per-round loss satisfies

\mathbb{E}\left[\ell(f_{t}({\bm{x}}_{t}),y_{t})\right]\geq\frac{2}{3}\ln(2)+\frac{1}{3B}\sum_{i\leq B}\ln(1+\exp(-|a^{(t)}_{i}|)).

Note that $|a^{(t)}_{1}|,\ldots,|a^{(t)}_{B}|$ must satisfy (A9). By the method of Lagrange multipliers, the minimum of the right-hand side subject to (A9) is attained at $|a^{(t)}_{i}|=\frac{U}{\sqrt{B}}$ for all $i$. Then we have

\mathbb{E}\left[\ell(f_{t}({\bm{x}}_{t}),y_{t})\right]\geq\frac{2}{3}\ln(2)+\frac{\ln(1+\exp(-\frac{U}{\sqrt{B}}))}{3}.
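As a sanity check (our own alternative argument, not part of the original derivation), the same lower bound also follows from Jensen's inequality and the Cauchy–Schwarz inequality, since $g(a)=\ln(1+\exp(-a))$ is convex and non-increasing:

\frac{1}{B}\sum_{i\leq B}\ln(1+\exp(-|a^{(t)}_{i}|))\geq g\left(\frac{1}{B}\sum_{i\leq B}|a^{(t)}_{i}|\right)\geq g\left(\sqrt{\frac{1}{B}\sum_{i\leq B}|a^{(t)}_{i}|^{2}}\right)\geq g\left(\frac{U}{\sqrt{B}}\right),

where the last two steps use $\frac{1}{B}\sum_{i\leq B}|a^{(t)}_{i}|\leq\sqrt{\frac{1}{B}\sum_{i\leq B}|a^{(t)}_{i}|^{2}}\leq\frac{U}{\sqrt{B}}$ together with the monotonicity of $g$.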

It can be verified that

\forall 0<x\leq 0.2,\quad\ln(1+\exp(-x))\leq\ln(2)-0.45x.
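The inequality can be checked either by convexity of $x\mapsto\ln(1+\exp(-x))$ (the secant over $[0,0.2]$ has slope about $-0.47$) or numerically; a minimal numerical check (our own script, not part of the paper) is:

import numpy as np

# Check ln(1 + exp(-x)) <= ln(2) - 0.45 * x on a dense grid of (0, 0.2].
x = np.linspace(1e-6, 0.2, 200001)
lhs = np.log1p(np.exp(-x))          # ln(1 + exp(-x)), computed stably via log1p
rhs = np.log(2.0) - 0.45 * x
slack = rhs - lhs
print("min slack on (0, 0.2]:", slack.min())   # positive, so the inequality holds on the grid
assert (slack >= 0).all()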

Let $B<T$ and $U\leq\frac{1}{5}\sqrt{3B}$. The expected regret w.r.t. $\bar{f}_{\mathbb{H}}$ is lower bounded as follows

\mathbb{E}\left[\mathrm{Reg}(\bar{f}_{\mathbb{H}})\right]\geq 3\ln(2)B+(T-3B)\cdot\frac{\ln(1+\exp(-\frac{U}{\sqrt{B}}))}{3}+\frac{2}{3}\ln(2)\cdot(T-3B)-T\cdot\ln\left(1+\exp\left(-\frac{U}{\sqrt{3B}}\right)\right)
= \left(\frac{2}{3}T+B\right)\cdot\left(\ln(2)-\ln\left(1+\exp\left(-\frac{U}{\sqrt{3B}}\right)\right)\right)+\frac{1}{3}(T-3B)\ln\frac{1+\exp(-\frac{U}{\sqrt{B}})}{1+\exp(-\frac{U}{\sqrt{3B}})}
\geq \frac{\sqrt{3}}{10}\cdot\frac{UT}{\sqrt{B}}+\frac{1}{3}\left(\frac{\sqrt{3}}{3}-1\right)\cdot\frac{UT}{\sqrt{B}}.
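The last inequality can be recovered as follows (our reconstruction of the constants). For the first term, the displayed inequality above with $x=\frac{U}{\sqrt{3B}}\leq 0.2$ gives $\ln(2)-\ln(1+\exp(-\frac{U}{\sqrt{3B}}))\geq\frac{0.45U}{\sqrt{3B}}$, hence

\left(\frac{2}{3}T+B\right)\cdot\left(\ln(2)-\ln\left(1+\exp\left(-\frac{U}{\sqrt{3B}}\right)\right)\right)\geq\frac{2T}{3}\cdot\frac{0.45U}{\sqrt{3B}}=\frac{\sqrt{3}}{10}\cdot\frac{UT}{\sqrt{B}}.

For the second term, $x\mapsto\ln(1+\exp(-x))$ is $1$-Lipschitz, so

\frac{1}{3}(T-3B)\ln\frac{1+\exp(-\frac{U}{\sqrt{B}})}{1+\exp(-\frac{U}{\sqrt{3B}})}\geq-\frac{1}{3}(T-3B)\left(\frac{U}{\sqrt{B}}-\frac{U}{\sqrt{3B}}\right)\geq\frac{1}{3}\left(\frac{\sqrt{3}}{3}-1\right)\cdot\frac{UT}{\sqrt{B}}.

Since $\frac{\sqrt{3}}{10}+\frac{1}{3}(\frac{\sqrt{3}}{3}-1)\approx 0.032>0$, the expected regret is of order $\Omega(\frac{UT}{\sqrt{B}})$.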

It can be verified that $L_{T}(\bar{f}_{\mathbb{H}})=\Theta(T)$. Replacing $T$ with $\Theta(L_{T}(\bar{f}_{\mathbb{H}}))$ concludes the proof. ∎

Appendix A.7 Proof of Theorem 5

Proof.

There is a $\xi_{i}\in(0,4]$ such that

\Lambda_{i}=\sum_{t\in\mathcal{J}}\left[\left\|\bar{f}_{t,i}(2)-f\right\|^{2}_{\mathcal{H}_{i}}-\|f_{t,i}-f\|^{2}_{\mathcal{H}_{i}}\right]\leq\xi_{i}U^{2}|\mathcal{J}|.

We can bound $\Xi_{4}$ in (A8) as follows,

\sum_{t\in\mathcal{J}}\Xi_{4}\leq\left(\frac{\xi_{i}U^{2}}{2\lambda_{i}}+4UG_{1}\right)\cdot|\mathcal{J}|.

If $\xi_{i}\leq\frac{1}{|\mathcal{J}|}$, then $\frac{\xi_{i}U^{2}}{2\lambda_{i}}\cdot|\mathcal{J}|\leq\frac{U^{2}}{2\lambda_{i}}$. Let $\lambda_{i}=\frac{U}{\sqrt{G_{1}G_{2}\hat{L}_{1:T}}}$. In this way, we obtain a new upper bound on $\mathcal{T}_{2}$. Combining $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$ gives

\hat{L}_{1:T}-L_{T}(f)\leq 12U\sqrt{G_{2}G_{1}\hat{L}_{1:T}\ln\frac{1}{\delta}}+\frac{16UG_{2}\hat{L}_{1:T}}{B-\frac{4}{3}\ln\frac{1}{\delta}}+4UG_{1}\ln\frac{1}{\delta}.

Let $\gamma=16UG_{2}(B-\frac{4}{3}\ln\frac{1}{\delta})^{-1}$ and $U<\frac{B-\frac{4}{3}\ln\frac{1}{\delta}}{32G_{2}}$. We have $\gamma\leq\frac{1}{2}$. Solving for $\hat{L}_{1:T}$ concludes the proof. ∎

Appendix A.8 Auxiliary Lemmas

Lemma A.8.1 ([Hazan2009Better]).

$\forall t>M$ and $\forall i\in[K]$, $\mathbb{E}[\|\hat{\nabla}_{t,i}-\mu_{t,i}\|^{2}_{\mathcal{H}_{i}}]\leq\frac{1}{t|V|}\mathcal{A}_{t,\kappa_{i}}$.

Lemma A.8.2.

Let $\eta_{t}$ follow (10) and ${\bm{p}}_{1}$ be the uniform distribution. Then

\sum^{T}_{t=1}\sum^{K}_{i=1}\frac{\eta_{t}}{2}p_{t,i}c^{2}_{t,i}\leq\sqrt{2\ln{K}}\sqrt{\sum^{T}_{\tau=1}\sum^{K}_{i=1}p_{\tau,i}c^{2}_{\tau,i}}+\frac{\sqrt{2\ln{K}}}{2}\max_{t,i}c_{t,i}.
Proof.

Let $\sigma_{\tau}=\sum^{K}_{i=1}p_{\tau,i}c^{2}_{\tau,i}$ and $\sigma_{0}=1$. We decompose the term as follows

lnโกK2โ‹…โˆ‘t=1Tฯƒt1+โˆ‘ฯ„=1tโˆ’1ฯƒฯ„=\displaystyle\frac{\sqrt{\ln{K}}}{\sqrt{2}}\cdot\sum^{T}_{t=1}\frac{\sigma_{t}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}= lnโกK2โ‹…โˆ‘t=1Tฯƒtโˆ‘ฯ„=0tโˆ’1ฯƒฯ„\displaystyle\frac{\sqrt{\ln{K}}}{\sqrt{2}}\cdot\sum^{T}_{t=1}\frac{\sigma_{t}}{\sqrt{\sum^{t-1}_{\tau=0}\sigma_{\tau}}}
=\displaystyle= lnโกK2โ‹…[ฯƒ1+โˆ‘t=2Tฯƒtโˆ’11+โˆ‘ฯ„=1tโˆ’1ฯƒฯ„+โˆ‘t=2Tฯƒtโˆ’ฯƒtโˆ’11+โˆ‘ฯ„=1tโˆ’1ฯƒฯ„].\displaystyle\frac{\sqrt{\ln{K}}}{\sqrt{2}}\cdot\left[\sigma_{1}+\sum^{T}_{t=2}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}+\sum^{T}_{t=2}\frac{\sigma_{t}-\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}\right].

We analyze the third term.

\sum^{T}_{t=2}\frac{\sigma_{t}-\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}= \frac{-\sigma_{1}}{\sqrt{1+\sigma_{1}}}+\frac{\sigma_{T}}{\sqrt{1+\sum^{T-1}_{\tau=1}\sigma_{\tau}}}+\sum^{T-1}_{t=2}\sigma_{t}\left[\frac{1}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}-\frac{1}{\sqrt{1+\sum^{t}_{\tau=1}\sigma_{\tau}}}\right]
\leq -\frac{\sigma_{1}}{\sqrt{1+\sigma_{1}}}+\max_{t=1,\ldots,T}\sigma_{t}\cdot\frac{1}{\sqrt{1+\sigma_{1}}}.

Now we analyze the second term.

\sum^{T}_{t=2}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}=\frac{\sigma_{1}}{\sqrt{1+\sigma_{1}}}+\sum^{T}_{t=3}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}.

For any $a>0$ and $b>0$, we have $2\sqrt{a}\sqrt{b}\leq a+b$. Let $a=1+\sum^{t-1}_{\tau=1}\sigma_{\tau}$ and $b=1+\sum^{t-2}_{\tau=1}\sigma_{\tau}$. Then we have

2\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}\cdot\sqrt{1+\sum^{t-2}_{\tau=1}\sigma_{\tau}}\leq 2\left(1+\sum^{t-1}_{\tau=1}\sigma_{\tau}\right)-\sigma_{t-1}.

Dividing by $\sqrt{a}$ and rearranging terms yields

\frac{1}{2}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}\leq\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}-\sqrt{1+\sum^{t-2}_{\tau=1}\sigma_{\tau}}.

Summing over $t=3,\ldots,T$, we obtain

\sum^{T}_{t=3}\frac{\sigma_{t-1}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}\leq 2\sqrt{1+\sum^{T-1}_{\tau=1}\sigma_{\tau}}-2\sqrt{1+\sigma_{1}}.

Combining the above bounds, we have

\sum^{T}_{t=1}\frac{\sigma_{t}}{\sqrt{1+\sum^{t-1}_{\tau=1}\sigma_{\tau}}}\leq 2\sqrt{1+\sum^{T-1}_{\tau=1}\sigma_{\tau}}-2\sqrt{1+\sigma_{1}}+\max_{t}\sigma_{t}\cdot\frac{1}{\sqrt{1+\sigma_{1}}}+\sigma_{1}
\leq 2\sqrt{\sum^{T}_{\tau=1}\sigma_{\tau}}+\max_{t}\sigma_{t},

which concludes the proof. โˆŽ
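As a quick numerical sanity check of the final inequality (our own script; the assumption $\sigma_{t}\in[0,1]$ is made purely for illustration), one can run:

import numpy as np

rng = np.random.default_rng(0)
worst = -np.inf
for _ in range(1000):
    T = int(rng.integers(1, 200))
    sigma = rng.uniform(0.0, 1.0, size=T)                    # sigma_t, assumed in [0, 1] for this check
    prefix = np.concatenate(([0.0], np.cumsum(sigma)[:-1]))  # sum of sigma_tau for tau < t
    lhs = float(np.sum(sigma / np.sqrt(1.0 + prefix)))
    rhs = 2.0 * float(np.sqrt(sigma.sum())) + float(sigma.max())
    worst = max(worst, lhs - rhs)
print("largest (lhs - rhs) over random trials:", worst)      # expected to be negative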

Lemma A.8.3 (Improved Bernstein's inequality [Li2024On]).

Let $X_{1},\ldots,X_{n}$ be a bounded martingale difference sequence w.r.t. the filtration $\mathcal{F}=(\mathcal{F}_{k})_{1\leq k\leq n}$, with $|X_{k}|\leq a$. Let $Z_{t}=\sum^{t}_{k=1}X_{k}$ be the associated martingale. Denote the sum of the conditional variances by

\Sigma^{2}_{n}=\sum^{n}_{k=1}\mathbb{E}\left[X^{2}_{k}|\mathcal{F}_{k-1}\right]\leq v,

where $v\in[0,V]$ is a random variable and $V\geq 2$ is a constant. Then, for all constants $a>0$, with probability at least $1-2\lceil\log{V}\rceil\delta$,

\max_{t=1,\ldots,n}Z_{t}<\frac{2a}{3}\ln\frac{1}{\delta}+\sqrt{\frac{2}{V}\ln\frac{1}{\delta}}+2\sqrt{v\ln\frac{1}{\delta}}.
Lemma A.8.4.

With probability at least $1-\Theta(\lceil\ln{T}\rceil)\delta$,

\sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\frac{1}{(\mathbb{P}[b_{t}=1])^{2}}\left\|\nabla_{t,i}\right\|^{2}_{\mathcal{H}_{i}}\cdot\mathbb{I}_{b_{t}=1}\leq \sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\frac{\left\|\nabla_{t,i}\right\|^{2}_{\mathcal{H}_{i}}}{\mathbb{P}[b_{t}=1]}+\frac{4G^{2}_{1}}{3}\ln\frac{1}{\delta}+4\sqrt{G^{3}_{1}G_{2}\hat{L}_{1:T}\ln\frac{1}{\delta}}.
Proof.

Define a random variable $X_{t}$ by

X_{t}=\frac{\|\nabla_{t,i}\|^{2}_{\mathcal{H}_{i}}}{(\mathbb{P}[b_{t}=1])^{2}}\mathbb{I}_{b_{t}=1}-\frac{\|\nabla_{t,i}\|^{2}_{\mathcal{H}_{i}}}{\mathbb{P}[b_{t}=1]},\quad|X_{t}|\leq 2G^{2}_{1}.

$\{X_{t}\}_{t\in\bar{T}_{1}\cup\mathcal{J}}$ forms a bounded martingale difference sequence w.r.t. $\{b_{\tau}\}^{t-1}_{\tau=1}$. The sum of the conditional variances is

\Sigma^{2}\leq\sum_{t\in\bar{T}_{1}\cup\mathcal{J}}\mathbb{E}[X^{2}_{t}]\leq 4G^{3}_{1}G_{2}\hat{L}_{1:T}.

Using Lemma A.8.3 concludes the proof. โˆŽ

Lemma A.8.5.

Let $\Delta_{t}=\langle f_{t,i}-f,\nabla_{t,i}-\tilde{\nabla}_{t,i}\rangle$. With probability at least $1-\Theta(\lceil\ln{T}\rceil)\delta$,

\sum_{t\in\bar{T}_{1}}\Delta_{t}\leq\frac{4UG_{1}}{3}\ln\frac{1}{\delta}+2U\sqrt{2G_{2}G_{1}\hat{L}_{1:T}\ln\frac{1}{\delta}}.
Proof.

The proof is similar to that of Lemma A.8.4. โˆŽ

Lemma A.8.6.

For each $i\in[K]$, let $\psi_{i}(f)=\frac{1}{2\lambda_{i}}\|f\|^{2}_{\mathcal{H}_{i}}$. Then $\mathcal{B}_{\psi_{i}}(f,g)=\frac{1}{2\lambda_{i}}\|f-g\|^{2}_{\mathcal{H}_{i}}$, and the solutions of (4) and (5) are as follows

f_{t,i}= f^{\prime}_{t-1,i}-\lambda_{i}\hat{\nabla}_{t,i},
f^{\prime}_{t,i}= \min\left\{1,\frac{U}{\left\|f^{\prime}_{t-1,i}-\lambda_{i}\nabla_{t,i}\right\|_{\mathcal{H}_{i}}}\right\}\cdot\left(f^{\prime}_{t-1,i}-\lambda_{i}\nabla_{t,i}\right).

Similarly, we can obtain the solution of (14)

f_{t+1,i}=\min\left\{1,\frac{U}{\left\|f_{t,i}-\lambda_{i}\tilde{\nabla}_{t,i}\right\|_{\mathcal{H}_{i}}}\right\}\cdot\left(f_{t,i}-\lambda_{i}\tilde{\nabla}_{t,i}\right).

Besides,

\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})\leq\frac{\lambda_{i}}{2}\left\|\tilde{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}.
Proof.

We can solve (4) and (5) by the method of Lagrange multipliers. Next we prove the last inequality.

\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\rangle-\mathcal{B}_{\psi_{i}}(f_{t+1,i},f_{t,i})= \left\langle f_{t,i}-f_{t+1,i},\tilde{\nabla}_{t,i}\right\rangle-\frac{\|f_{t+1,i}-f_{t,i}\|^{2}_{\mathcal{H}_{i}}}{2\lambda_{i}}
= \frac{\lambda_{i}}{2}\left\|\tilde{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}-\frac{1}{2\lambda_{i}}\left\|f_{t+1,i}-f_{t,i}+\lambda_{i}\tilde{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}}
\leq \frac{\lambda_{i}}{2}\left\|\tilde{\nabla}_{t,i}\right\|^{2}_{\mathcal{H}_{i}},

which concludes the proof. โˆŽ
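As an aside (not part of the proofs), the update in Lemma A.8.6 is an ordinary gradient step followed by a projection onto the ball of radius $U$, which is straightforward to implement when hypotheses are represented by coefficients on stored examples. Below is a minimal, self-contained sketch; the Gaussian kernel, the logistic loss, and all function names are our illustrative assumptions, and the budget-maintenance part of the algorithm is omitted, so this is not the paper's M-OMD-S procedure.

import numpy as np

def gaussian_kernel(X, Z, bandwidth=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 * bandwidth^2)); the kernel choice is ours, for illustration.
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def omd_step(X_buf, alpha, x_t, y_t, lam, U, bandwidth=1.0):
    # One update of the form f <- Pi_U(f - lam * grad), cf. the projected step in Lemma A.8.6.
    # The hypothesis is f(.) = sum_j alpha_j k(x_j, .) with the x_j stored in X_buf.
    if len(alpha):
        f_xt = float(gaussian_kernel(x_t[None, :], X_buf, bandwidth) @ alpha)
    else:
        f_xt = 0.0
    g = -y_t / (1.0 + np.exp(y_t * f_xt))        # derivative of ln(1 + exp(-y f)) w.r.t. f, at f(x_t)
    # Gradient step: the gradient is g * k(x_t, .), i.e., one new coefficient on x_t.
    X_new = np.vstack([X_buf, x_t[None, :]]) if len(alpha) else x_t[None, :]
    a_new = np.append(alpha, -lam * g)
    # Projection onto {f : ||f||_H <= U}, using ||f||_H^2 = a^T K a.
    K = gaussian_kernel(X_new, X_new, bandwidth)
    norm = np.sqrt(max(float(a_new @ K @ a_new), 0.0))
    if norm > U:
        a_new *= U / norm
    return X_new, a_new

For example, starting from the zero hypothesis, X, a = omd_step(np.zeros((0, 2)), np.zeros(0), np.array([1.0, 0.0]), 1, lam=0.5, U=1.0) performs one update and returns the enlarged buffer together with the projected coefficients.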

References

  • [A1] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Foundations and Trends® in Machine Learning, vol. 5, no. 1, pp. 1–122, 2012.
  • [A2] C. Chiang, T. Yang, C. Lee, M. Mahdavi, C. Lu, R. Jin, and S. Zhu, “Online optimization with gradual variations,” in Proceedings of the 25th Annual Conference on Learning Theory, 2012, pp. 6.1–6.20.
  • [A3] E. Hazan and S. Kale, “Better algorithms for benign bandits,” in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2009, pp. 38–47.
  • [A4] J. Li, Z. Xu, Z. Wu, and I. King, “On the necessity of collaboration in online model selection with decentralized data,” CoRR, vol. abs/2404.09494, 2024, https://arxiv.org/abs/2404.09494v3.