
Contextual Bandits for Unbounded Context Distributions

Puning Zhao    Rongfei Fan    Shaowei Wang    Li Shen    Qixin Zhang    Zong Ke    Tianhang Zheng
Abstract

The nonparametric contextual bandit is an important model of sequential decision making. Under the $\alpha$-Tsybakov margin condition, existing research has established a regret bound of $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ for bounded supports. However, the optimal regret with unbounded contexts has not been analyzed. The challenge in solving contextual bandit problems with unbounded support is to achieve both the exploration-exploitation tradeoff and the bias-variance tradeoff simultaneously. In this paper, we solve the nonparametric contextual bandit problem with unbounded contexts. We propose two nearest neighbor methods combined with UCB exploration. The first method uses a fixed $k$. Our analysis shows that this method achieves minimax optimal regret under a weak margin condition and relatively light-tailed context distributions. The second method uses an adaptive $k$. By a proper data-driven selection of $k$, this method achieves an expected regret of $\tilde{O}\left(T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}+T^{1-\beta}\right)$, in which $\beta$ is a parameter describing the tail strength. This bound matches the minimax lower bound up to logarithmic factors, indicating that the second method is approximately optimal.


1 Introduction

The multi-armed bandit (Robbins, 1952; Lai & Robbins, 1985) is an important sequential decision problem that has been extensively studied (Agrawal, 1995; Auer et al., 2002; Garivier & Cappé, 2011). In many practical applications such as recommender systems and information retrieval in healthcare and finance (Bouneffouf et al., 2020), decision problems are usually modeled as contextual bandits (Woodroofe, 1979), in which the reward depends on some side information, called contexts. At the $t$-th iteration, the decision maker observes the context $\mathbf{X}_t$, and then pulls an arm $A_t\in\mathcal{A}$ based on $\mathbf{X}_t$ and the previous trajectory $(\mathbf{X}_i,A_i)$, $i=1,\ldots,t-1$. Many studies assume linear rewards (Abbasi-Yadkori et al., 2011; Bastani & Bayati, 2020; Bastani et al., 2021; Qian et al., 2023; Langford & Zhang, 2007; Dudik et al., 2011; Chu et al., 2011; Li et al., 2010), which is restrictive and may not fit practical scenarios well. Consequently, in recent years, nonparametric contextual bandits, which make no parametric assumption about the reward functions, have received significant attention (Perchet & Rigollet, 2013; Guan & Jiang, 2018; Gur et al., 2022; Blanchard et al., 2023; Suk & Kpotufe, 2023; Suk, 2024; Cai et al., 2024).

Despite significant progress on nonparametric contextual bandits, existing studies focus only on the case with bounded contexts, and the probability density functions (pdf) of the contexts are required to be bounded away from zero. However, many practical applications often involve unbounded contexts, such as healthcare (Durand et al., 2018), dynamic pricing (Misra et al., 2019) and recommender systems (Zhou et al., 2017). In particular, the contexts may follow a heavy-tailed distribution (Zangerle & Bauer, 2022), which is significantly different from bounded contexts. Therefore, to bridge the gap between theoretical studies and practical applications of contextual bandits, an in-depth theoretical study of unbounded contexts is crucially needed. Compared with bounded contexts, a heavy-tailed context distribution requires the learning method to be adaptive to the pdf of the contexts, in order to balance the bias and variance of the estimation of the reward functions. On the other hand, compared with existing works on nonparametric classification and regression with independently and identically distributed (i.i.d.) data, bandit problems require us to achieve a good balance between exploration and exploitation, thus the learning method needs to be adaptive to the suboptimality gap of the reward functions. Therefore, the main challenge of solving nonparametric contextual bandit problems with unbounded contexts is to achieve both the bias-variance tradeoff and the exploration-exploitation tradeoff using a single algorithm.

Method | Expected regret (bounded context) | Expected regret (heavy-tailed context)
UCBogram (Rigollet & Zeevi, 2010) | $\tilde{O}\left(T^{1-\min\left\{\frac{\alpha+1}{d+2},\frac{2}{d+2}\right\}}\right)$ | None
ABSE (Perchet & Rigollet, 2013) | $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ | None
SACB (Gur et al., 2022) | $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ | None
kNN-UCB (Guan & Jiang, 2018) | $\tilde{O}\left(T^{\frac{d+1}{d+2}}\right)$ | None
kNN-UCB (Reeve et al., 2018) | $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ | None
kNN-UCB (this work)* | $\tilde{O}\left(T^{\max\left\{1-\frac{\alpha+1}{d+2},\frac{2}{\alpha+3}\right\}}\right)$ | $\tilde{O}\left(T^{1-\frac{\beta\min(d,\alpha+1)}{\min(d-1,\alpha)+\max(1,d\beta)+2\beta}}\right)$
Adaptive kNN-UCB (this work) | $\tilde{O}\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ | $\tilde{O}\left(T^{1-\min\left\{\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta},\beta\right\}}\right)$
Minimax lower bound | $\Omega\left(T^{1-\frac{\alpha+1}{d+2}}\right)$ | $\Omega\left(T^{1-\min\left\{\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta},\beta\right\}}\right)$

*Although the name kNN-UCB is the same as in (Guan & Jiang, 2018) and (Reeve et al., 2018), the UCB calculations differ among these methods. See details in the "Nearest Neighbor Method with Fixed $k$" section.

Table 1: Comparison of the $T$-step expected cumulative regrets of learning algorithms for nonparametric contextual bandits with Lipschitz reward functions under the $\alpha$-Tsybakov margin condition and tail parameter $\beta$ (Assumptions 1(a) and 3(a)).

In this paper, we solve the nonparametric contextual bandit problem with heavy-tailed contexts. To begin with, we derive a minimax lower bound that characterizes the theoretical limits of contextual bandit learning. We then propose a relatively simple method that uses a fixed $k$ combined with upper confidence bound (UCB) exploration and derive a bound on its expected regret. Even for bounded contexts, our method improves over an existing nearest neighbor method (Guan & Jiang, 2018) for a large margin parameter $\alpha$, since our method uses an improved UCB calculation that is more adaptive to the suboptimality gap of the reward functions. Despite such progress, there is still some gap between the regret bound and the minimax lower bound, indicating room for further improvement. To close this gap, we further propose a new adaptive nearest neighbor approach, which selects $k$ adaptively based on the density of samples and the suboptimality gap of the reward functions. Our analysis shows that the regret bound of this new method matches the minimax lower bound up to logarithmic factors, indicating that this method is approximately minimax optimal.

The general guidelines of our adaptive kNN method are summarized as follows. Firstly, with a higher context pdf, we use a larger $k$, and vice versa. Secondly, given a specific context, if the value of an action is far from optimal (i.e. the suboptimality gap is large), then we use a smaller $k$, and vice versa. Such a choice of $k$ achieves a good tradeoff between estimation bias and variance. With a lower pdf or a larger suboptimality gap, the samples are relatively sparse. As a result, a large bias may arise due to large kNN distances. Therefore, we use a smaller $k$ to control the bias. On the contrary, with a higher pdf or a smaller suboptimality gap, the samples are dense and thus we can use a larger $k$ to reduce the variance. Note that the pdf and the suboptimality gap are unknown to the learner. Therefore, we design a method such that the value of $k$ is selected in a data-driven manner, based on the density of existing samples.

1.1 Contribution

The contributions of this paper are summarized as follows.

  • We derive the minimax lower bound of nonparametric contextual bandits with heavy-tailed context distributions.

  • We propose a simple kNN method with UCB exploration. The regret bound matches the minimax lower bound with small $\alpha$ and large $\beta$.

  • We propose an adaptive kNN method, such that $k$ is selected based on previous steps. The regret bound matches the minimax lower bound under all parameter regimes.

Our results and the comparison with related works are summarized in Table 1. To the best of our knowledge, our work is the first attempt to handle heavy-tailed context distributions in contextual bandit problems. In particular, our newly proposed adaptive kNN method achieves the minimax lower bound for the first time. The proofs of all theoretical results in the paper are shown in the supplementary material.

2 Related Work

In this section, we briefly review related work on contextual bandits and nearest neighbor methods.

Nonparametric contextual bandits with bounded contexts. (Yang & Zhu, 2002) first introduced the nonparametric contextual bandit problem, proposed an $\epsilon$-greedy approach, and proved its consistency. (Rigollet & Zeevi, 2010) derived a minimax lower bound on the regret, and showed that this bound is achievable by a UCB method. (Perchet & Rigollet, 2013) proposed Adaptively Binned Successive Elimination (ABSE), which adapts to the unknown margin parameter. (Qian & Yang, 2016) proposed a kernel estimation method. (Hu et al., 2020) analyzed the nonparametric bandit problem under a general Hölder smoothness assumption. (Gur et al., 2022) proposed Smoothness-Adaptive Contextual Bandits (SACB), which is adaptive to the smoothness parameter. (Slivkins, 2014; Suk & Kpotufe, 2023; Akhavan et al., 2024; Ghosh et al., 2024; Komiyama et al., 2024; Suk, 2024) analyzed the problem of dynamic regret. Furthermore, (Wanigasekara & Yu, 2019; Locatelli & Carpentier, 2018; Krishnamurthy et al., 2020; Zhu et al., 2022) discussed the case with continuous actions.

Nearest neighbor methods. Nearest neighbor classification has been analyzed in (Chaudhuri & Dasgupta, 2014; Döring et al., 2018) for bounded support of the features. (Gadat et al., 2016; Kpotufe, 2011; Cannings et al., 2020; Zhao & Lai, 2021b, a) proposed adaptive nearest neighbor methods for heavy-tailed feature distributions. (Guan & Jiang, 2018) proposed the kNN-UCB method for contextual bandits and proved a regret bound of $\tilde{O}(T^{\frac{1+d}{2+d}})$.

Compared with existing methods on nonparametric contextual bandits, for unbounded contexts, the methods need to adapt to different density levels of the contexts and achieve a better bias-variance tradeoff in the estimation of the reward functions. Moreover, existing works on nonparametric classification cannot be easily extended here, since the samples are no longer i.i.d. and we now need to bound the regret instead of the estimation error. These factors introduce new technical difficulties in the theoretical analysis. In this work, to address these challenges, we design new algorithms that are adaptive to both the pdf and the suboptimality of the reward functions and provide a corresponding theoretical analysis.

3 Preliminaries

Denote $\mathcal{X}$ as the space of contexts, and $\mathcal{A}$ as the space of actions. Throughout this paper, we discuss the case with infinite $\mathcal{X}$ and finite $\mathcal{A}$. At the $t$-th step, the context $\mathbf{X}_t$ is a random variable drawn from a distribution with probability density function (pdf) $f$. Then the agent takes action $A_t\in\mathcal{A}$ and receives reward $Y_t$:

Y_{t}=\eta_{A_{t}}(\mathbf{X}_{t})+W_{t},   (1)

in which $\eta_a(\mathbf{x})$ for $a\in\mathcal{A}$ and $\mathbf{x}\in\mathcal{X}$ is an unknown expected reward function, and $W_t$ denotes the noise, with $\mathbb{E}[W_t|\mathbf{X}_{1:t},A_{1:t}]=0$.

Throughout this paper, define

\eta^{*}(\mathbf{x})=\max_{a}\eta_{a}(\mathbf{x})   (2)

as the maximum expected reward of context $\mathbf{x}$. For any suboptimal action $a$, $\eta_a(\mathbf{x})<\eta^{*}(\mathbf{x})$. Correspondingly, $\eta^{*}(\mathbf{x})-\eta_a(\mathbf{x})$ is called the suboptimality gap.

The performance of an algorithm is evaluated by the expected regret

R=\mathbb{E}\left[\sum_{t=1}^{T}\left(\eta^{*}(\mathbf{X}_{t})-\eta_{A_{t}}(\mathbf{X}_{t})\right)\right].   (3)

We then present the assumptions needed for the analysis. To begin with, we state some basic conditions in Assumption 1.

Assumption 1.

There exist constants $C_\alpha$, $\sigma$, $L$, such that

(a) (Tsybakov margin condition) For some $\alpha\leq d$, for all $a\in\mathcal{A}$ and $u>0$, $\text{P}(0<\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})<u)\leq C_{\alpha}u^{\alpha}$;

(b) $W_t$ is subgaussian with parameter $\sigma^2$, i.e. $\mathbb{E}[e^{\lambda W_{t}}]\leq e^{\frac{1}{2}\lambda^{2}\sigma^{2}}$ for all $\lambda$;

(c) For all $a$, $\eta_a$ is Lipschitz with constant $L$, i.e. for any $\mathbf{x}$ and $\mathbf{x}^{\prime}$, $|\eta_{a}(\mathbf{x})-\eta_{a}(\mathbf{x}^{\prime})|\leq L\left\lVert\mathbf{x}-\mathbf{x}^{\prime}\right\rVert$.

Now we comment on these assumptions. (a) is the Tsybakov margin condition, which was first introduced in (Audibert & Tsybakov, 2007) for classification problems, and then used in contextual bandit problems (Perchet & Rigollet, 2013). Note that $\text{P}(0<\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})<u)\leq 1$ always holds, thus for any $\eta_a$, (a) holds with $C_\alpha=1$ and $\alpha=0$. Therefore, this assumption is nontrivial only if it holds with some $\alpha>0$. Moreover, we only consider the case with $\alpha\leq d$ here. If $\alpha>d$, then an arm is either always or never optimal, thus it is easy to achieve logarithmic regret (see (Perchet & Rigollet, 2013), Proposition 3.1). An additional remark is that in (Perchet & Rigollet, 2013; Reeve et al., 2018), the margin assumption is $\text{P}(0<\eta^{*}(\mathbf{X})-\eta_{s}(\mathbf{X})<u)\lesssim u^{\alpha}$, in which $\eta_s(\mathbf{x})$ is the second largest value among $\{\eta_a(\mathbf{x})\,|\,a\in\mathcal{A}\}$. Our assumption (a) is slightly weaker than existing ones, since we only impose margin conditions on the suboptimality gap $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$ for each $a$ separately, instead of on the minimum suboptimality gap. In (b), following existing works (Reeve et al., 2018), we assume that the noise has light tails. (c) is a common assumption in the literature on nonparametric estimation (Mai & Johansson, 2021). It is possible to extend this work to a more general Hölder smoothness assumption by using adaptive nearest neighbor weights (Cannings et al., 2020). In this paper, we focus only on Lipschitz continuity for convenience.
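To make Assumption 1(a) concrete, consider the following worked check (ours, using the uniform-context setting that also appears in Section 7.1): two actions with $\eta_1(x)=x$, $\eta_2(x)=-x$ and $X$ uniform on $[-1,1]$. The suboptimal action at $x$ has gap $\eta^{*}(x)-\eta_{a}(x)=2|x|$, so

\text{P}\left(0<\eta^{*}(X)-\eta_{a}(X)<u\right)=\text{P}\left(0<|X|<\tfrac{u}{2}\right)=\min\left(\tfrac{u}{2},1\right)\leq\tfrac{1}{2}u,

and Assumption 1(a) holds with $\alpha=1$ and $C_\alpha=1/2$.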

Assumption 2 is designed for the case that the contexts have bounded support.

Assumption 2.

$f(\mathbf{x})\geq c$ for all $\mathbf{x}\in\mathcal{X}$ for some constant $c>0$, in which $f$ is the pdf of the contexts.

In Assumption 2, the pdf $f$ is required to be bounded away from zero, which is also assumed in (Perchet & Rigollet, 2013; Guan & Jiang, 2018; Reeve et al., 2018). Note that even for estimation with i.i.d. data, this assumption is common (Audibert & Tsybakov, 2007; Döring et al., 2018; Gao et al., 2018).

We then state the assumptions used for heavy-tailed distributions.

Assumption 3.

(a) For any $u>0$, $\text{P}(f(\mathbf{X})\leq u)\leq C_{\beta}u^{\beta}$ for some constants $C_\beta$ and $\beta$;

(b) The differences of the reward functions among all actions are bounded, i.e. $\sup_{\mathbf{x}\in\mathcal{X}}(\eta^{*}(\mathbf{x})-\min_{a}\eta_{a}(\mathbf{x}))\leq M$ for some constant $M$.

(a) is a common tail assumption in nonparametric statistics, which has been made in (Gadat et al., 2016; Zhao & Lai, 2021b). $\beta$ describes the tail strength: a smaller $\beta$ indicates that the context distribution has heavier tails, and vice versa. To further illustrate this assumption, we show several examples.

Example 1.

If $f$ has bounded support $\mathcal{X}$, then Assumption 3(a) holds with $C_\beta=V(\mathcal{X})$ and $\beta=1$, in which $V(\mathcal{X})$ is the volume of the support set $\mathcal{X}$.

Example 2.

If $f$ has a bounded $p$-th moment, i.e. $\mathbb{E}[\left\lVert\mathbf{X}\right\rVert^{p}]<\infty$, then for all $\beta<p/(p+d)$, there exists a constant $C_\beta$ such that Assumption 3(a) holds. In particular, for subgaussian or subexponential contexts, Assumption 3(a) holds for all $\beta<1$.

Proof.

The analysis of these examples and other related discussions are shown in the supplementary material. ∎
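As a further concrete instance (this check is ours and is not part of the original proof), the standard Cauchy distribution used later in Section 7.1 satisfies Assumption 3(a) with $\beta=1/2$: its pdf is $f(x)=\frac{1}{\pi(1+x^{2})}$, so for small $u$,

\text{P}(f(X)\leq u)=\text{P}\left(|X|\geq\sqrt{\tfrac{1}{\pi u}-1}\right)=\frac{2}{\pi}\arctan\frac{1}{\sqrt{\frac{1}{\pi u}-1}}\leq\frac{2}{\pi}\cdot\frac{1}{\sqrt{\frac{1}{\pi u}-1}}\lesssim\sqrt{u},

and $\text{P}(f(X)\leq u)\leq 1\lesssim\sqrt{u}$ otherwise. Similar direct calculations give the values $\beta=0.8$ for the $t_4$ distribution and $\beta=1$ for the Gaussian distribution quoted in Section 7.1.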

It is worth mentioning that although the growth rate of the regret is affected by the value of $\beta$, our proposed algorithms, including both the fixed and adaptive methods, do not require knowing $\beta$.

(b) restricts the suboptimality gap of each action. This is not necessary if the support is bounded. However, with unbounded support and without assumption (b), $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$ can grow as $f(\mathbf{x})$ decreases, such that the regret of a suboptimal action is large and the identification of the best action is hard.

Finally, we clarify notations as follows. Throughout this paper, $\left\lVert\cdot\right\rVert$ denotes the $\ell_2$ norm. $a\lesssim b$ denotes $a\leq Cb$ for some constant $C$, which may depend on the constants in Assumption 2. The notation $\gtrsim$ is defined conversely.

4 Minimax Analysis

In this section, we show the minimax lower bound, which characterizes the theoretical limit of the regrets of contextual bandits. Throughout this section, denote $\pi:\mathcal{X}\times\mathcal{X}^{t-1}\times\mathbb{R}^{t-1}\rightarrow\mathcal{A}$ as the policy, such that each action is selected according to policy $\pi$. To be more precise,

A_{t}=\pi(\mathbf{X}_{t};\mathbf{X}_{1:t-1},Y_{1:t-1}),   (4)

which indicates that the action $A_t$ at time $t$ depends on the current context and the records of contexts and rewards in the previous $t-1$ steps.

The minimax lower bound for the case with bounded support has been shown in Theorem 4.1 in (Rigollet & Zeevi, 2010). For completeness and notation consistency, we state the results below and provide a simplified proof.

Theorem 1.

((Rigollet & Zeevi, 2010), Theorem 4.1) Denote $\mathcal{F}_A$ as the set of pairs $(f,\eta)$ that satisfy Assumptions 1 and 2 (which means that the contexts have bounded support). Then

\inf_{\pi}\sup_{(f,\eta)\in\mathcal{F}_{A}}R\gtrsim T^{1-\frac{1+\alpha}{d+2}}.   (5)

We then show the minimax regret bound for unbounded support, which is a new result that has not been obtained before.

Theorem 2.

Denote $\mathcal{F}_B$ as the set of pairs $(f,\eta)$ that satisfy Assumptions 1 and 3 (which means that the contexts have unbounded support). Then

\inf_{\pi}\sup_{(f,\eta)\in\mathcal{F}_{B}}R\gtrsim T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}+T^{1-\beta}.   (6)

From the results above, with $\beta\rightarrow\infty$, (6) reduces to (5).

Proof.

(Outline) For bounded support, we derive the lower bound on the regret by first analyzing the minimax optimal number of suboptimal actions. Define

S=\mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}(\eta_{A_{t}}(\mathbf{X}_{t})<\eta^{*}(\mathbf{X}_{t}))\right].   (7)

$S$ can be lower bounded using standard tools in nonparametric statistics (Tsybakov, 2009), which construct multiple hypotheses and bound the minimum error probability. As shown in (Rigollet & Zeevi, 2010), the lower bound of $S$ can then be transformed into a lower bound of $R$.

The minimax analysis becomes more complex with unbounded support. Firstly, the heavy-tailed context distribution requires a different hypothesis construction. Secondly, the transformation from the lower bound of $S$ to that of $R$ does not yield tight lower bounds. We design new approaches that construct a set of candidate functions $\eta$ and derive lower bounds of $R$ directly. ∎

For bounded context support, regret comes mainly from the region with $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\lesssim T^{-1/(d+2)}$ (which is the classical rate for nonparametric estimation (Tsybakov, 2009)), in which the identification of the best action is not guaranteed to be correct. However, with heavy-tailed contexts, regret may also come from the tail, i.e. the region with small $f(\mathbf{x})$, where the number of samples around $\mathbf{x}$ is not enough to yield a reliable identification of the best action. For heavy-tailed cases, i.e. when $\beta$ is small, the regret caused by the tail region may dominate. This also explains why we need to use different techniques to derive the minimax lower bound for heavy-tailed contexts.

In the remainder of this paper, we say that a method is nearly minimax optimal if the dependence of its expected regret on $T$ matches (5) or (6). Following conventions in existing works (Rigollet & Zeevi, 2010; Perchet & Rigollet, 2013; Hu et al., 2020; Gur et al., 2022), the minimax lower bounds are currently derived for contextual bandit problems with only two actions, thus we do not consider the minimax optimality of regrets with respect to the number of actions $|\mathcal{A}|$.

5 Nearest Neighbor Method with Fixed $k$

To begin with, we propose and analyze a simple nearest neighbor method with a fixed $k$. We make the following definitions first.

Denote $n_a(t)=|\{i<t\,|\,A_i=a\}|$ as the number of steps with action $a$ before time step $t$. Let $\mathcal{N}_t(\mathbf{x},a)$ be the set of the $k$ nearest neighbors of $\mathbf{x}$ among $\{i<t\,|\,A_i=a\}$. Define

\rho_{a,t}(\mathbf{x})=\max_{i\in\mathcal{N}_{t}(\mathbf{x},a)}\left\lVert\mathbf{X}_{i}-\mathbf{x}\right\rVert   (8)

as the $k$ nearest neighbor distance, i.e. the distance from $\mathbf{x}$ to its $k$-th nearest neighbor among all previous steps with action $a$.

With the above notations, we describe the fixed-$k$ nearest neighbor method as follows. If $n_a(t)\geq k$, then

\hat{\eta}_{a,t}(\mathbf{x})=\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}Y_{i}+b+L\rho_{a,t}(\mathbf{x}),   (9)

in which $b$ has a fixed value

b=\sqrt{\frac{2\sigma^{2}}{k}\ln(dT^{2d+2}|\mathcal{A}|)}.   (10)

If $n_a(t)<k$, then

\hat{\eta}_{a,t}(\mathbf{x})=\infty.   (11)

Here we explain our design. If $n_a(t)\geq k$, then it is possible to give a UCB estimate of $\eta_a(\mathbf{x})$, shown in (9). $L\rho_{a,t}(\mathbf{x})$ bounds the estimation bias, while $b$ is an upper bound of the error caused by random noise that holds with high probability. In Lemma 5 in Appendix E, we show that $\hat{\eta}_{a,t}(\mathbf{x})$ is a valid UCB estimate of $\eta_a(\mathbf{x})$, i.e. $\hat{\eta}_{a,t}(\mathbf{x})\geq\eta_a(\mathbf{x})$ holds with high probability. If $n_a(t)<k$, then it is impossible to give a UCB estimate. In this case, we simply set $\hat{\eta}_{a,t}(\mathbf{x})$ to be infinite.

Finally, the algorithm selects the action $A_t$ with the maximum UCB value:

A_{t}=\underset{a}{\arg\max}\,\hat{\eta}_{a,t}(\mathbf{X}_{t}).   (12)

According to (11), as long as an action has not been taken at least $k$ times, the UCB estimate of $\eta_a$ will be infinite. Note that the selection rule (12) ensures that actions with infinite UCB values will be taken first. Therefore, the first $k|\mathcal{A}|$ steps are used for pure exploration. In this stage, the agent takes each action $a$ for $k$ times. After $k|\mathcal{A}|$ steps, the UCB values for all $\mathbf{x}\in\mathcal{X}$ and $a\in\mathcal{A}$ become finite. From then on, at each step, the action with the maximum UCB value specified in (9) is selected.

Algorithm 1 Nearest neighbor with fixed $k$ and UCB exploration
  for $t=1,\ldots,T$ do
     Receive context $\mathbf{X}_t$;
     for $a\in\mathcal{A}$ do
        Calculate $n_a(t)=|\{i<t\,|\,A_i=a\}|$;
        if $n_a(t)\geq k$ then
           Calculate $\hat{\eta}_{a,t}(\mathbf{X}_t)$ using (9);
        else
           Let $\hat{\eta}_{a,t}(\mathbf{X}_t)=\infty$;
        end if
     end for
     $A_t=\arg\max_{a}\hat{\eta}_{a,t}(\mathbf{X}_t)$;
     Pull $A_t$;
  end for
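For concreteness, the following minimal Python sketch (ours; the data layout, variable names, and the reward interface are assumptions rather than part of the paper) instantiates the rules (9)-(12):

import numpy as np

def knn_ucb_fixed_k(contexts, pull, n_actions, k, L, sigma, T):
    # contexts: array of shape (T, d); pull(t, a) returns the noisy reward of arm a at step t.
    d = contexts.shape[1]
    # noise half-width b from (10), computed in log-space to avoid overflow
    b = np.sqrt(2.0 * sigma**2 / k * ((2 * d + 2) * np.log(T) + np.log(d * n_actions)))
    hist_x = [[] for _ in range(n_actions)]  # contexts at which each arm was pulled
    hist_y = [[] for _ in range(n_actions)]  # corresponding observed rewards
    actions = []
    for t in range(T):
        x = contexts[t]
        ucb = np.full(n_actions, np.inf)     # (11): infinite UCB until an arm has k samples
        for a in range(n_actions):
            if len(hist_y[a]) >= k:
                dist = np.linalg.norm(np.asarray(hist_x[a]) - x, axis=1)
                nn = np.argsort(dist)[:k]    # indices of the k nearest previous pulls of arm a
                rho = dist[nn].max()         # k-NN distance rho_{a,t}(x) from (8)
                ucb[a] = np.asarray(hist_y[a])[nn].mean() + b + L * rho  # UCB (9)
        a_t = int(np.argmax(ucb))            # selection rule (12)
        hist_x[a_t].append(x)
        hist_y[a_t].append(pull(t, a_t))
        actions.append(a_t)
    return actions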

The procedures above are summarized in Algorithm 1. Compared with (Guan & Jiang, 2018), our method constructs the UCB differently. In (Guan & Jiang, 2018), the UCB is $\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}Y_{i}+\sigma(T_{a}(t-1))$, in which $\sigma(T_a(t-1))$ is uniform among all $\mathbf{x}$ for a fixed action $a$.

Therefore, the method in (Guan & Jiang, 2018) is not adaptive to the suboptimality gap $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$. On the contrary, our method has a term $L\rho_{a,t}(\mathbf{x})$ that varies for different $\mathbf{x}$, and thus adapts better to the suboptimality gap. The bound on the regret is shown in Theorem 3.

Theorem 3.

Under Assumptions 1 and 2, the regret of the simple nearest neighbor method with UCB exploration is bounded as follows:

(1) If $d>\alpha+1$, then with $k\sim T^{\frac{2}{d+2}}$,

R\lesssim T^{1-\frac{\alpha+1}{d+2}}|\mathcal{A}|\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|);   (13)

(2) If $d\leq\alpha+1$, then with $k\sim T^{\frac{2}{\alpha+3}}$,

R\lesssim T^{\frac{2}{\alpha+3}}|\mathcal{A}|\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|).   (14)

We compare our result with (Guan & Jiang, 2018), which proposes a similar nearest neighbor method. The analysis in (Guan & Jiang, 2018) does not make the Tsybakov margin assumption (Assumption 1(a)), and the regret bound is $\tilde{O}(T^{\frac{d+1}{d+2}})$. Without any restriction on $\eta$, Assumption 1(a) holds with $C_\alpha=1$ and $\alpha=0$, under which (13) reduces to $\tilde{O}(T^{\frac{d+1}{d+2}})$. Therefore, our result matches (Guan & Jiang, 2018) with $\alpha=0$. If $\alpha>0$, which indicates that a small suboptimality gap only happens with small probability, then the regret of the method in (Guan & Jiang, 2018) is still $\tilde{O}(T^{\frac{d+1}{d+2}})$, while our result improves it to $\tilde{O}(T^{1-\frac{\alpha+1}{d+2}})$. As discussed earlier, compared with (Guan & Jiang, 2018), our method improves the UCB calculation in (9), and is thus more adaptive to the suboptimality gap $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$. With $\alpha>0$, our method achieves smaller regret due to a better tradeoff between exploration and exploitation.

Compared with the minimax lower bound shown in Theorem 1, it can be found that the kNN method with fixed $k$ is not completely optimal. With $d>\alpha+1$, the upper bound matches the lower bound derived in Theorem 1. However, with $d\leq\alpha+1$, the regret is significantly higher than the minimax lower bound, indicating that there is room for further improvement.

We then analyze the performance for heavy-tailed context distribution. The result is shown in the following theorem.

Theorem 4.

Under Assumptions 1 and 3, the regret of the simple nearest neighbor method with UCB exploration is bounded as follows:

R\lesssim T^{1-\frac{\beta\min(d,\alpha+1)}{\min(d-1,\alpha)+\max(1,d\beta)+2\beta}}|\mathcal{A}|\ln^{\frac{1}{2}\max(d,\alpha+1)}(dT^{2d+2}|\mathcal{A}|).   (15)

From (15), there are two phase transitions. The first one is at $d=\alpha+1$, while the second one is at $d\beta=1$. Intuitively, the phase transitions occur because the regret is dominated by different regions depending on the values of $\alpha$ and $\beta$. Compared with the minimax lower bound shown in Theorem 2, it can be found that the kNN method with fixed $k$ achieves nearly minimax optimal regret up to logarithmic factors if $d>\alpha+1$ and $\beta>1/d$; otherwise the regret bound is suboptimal. Here we provide an intuition for why the kNN method with fixed $k$ achieves suboptimal regrets. In the region where the context pdf $f(\mathbf{x})$ is low, or the suboptimality gap $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$ is large, the samples with action $a$ are relatively sparse. In this case, with a fixed $k$, the nearest neighbor distances are too large, resulting in a large estimation bias. On the contrary, if $f(\mathbf{x})$ is high or $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$ is small, then samples with action $a$ are relatively dense, thus the bias is small, and we can increase $k$ to achieve a better bias-variance tradeoff. Therefore, if $k$ is fixed throughout the support set, then the algorithm estimates the reward function $\eta_a(\mathbf{x})$ in an inefficient way, resulting in suboptimal regrets. Apart from the suboptimal regret, another drawback is that with $d\leq\alpha+1$, the optimal selection of $k$ depends on the margin parameter $\alpha$, which is usually unknown in practice. In the next section, we propose an adaptive nearest neighbor method to address these issues.

6 Nearest Neighbor Method with Adaptive $k$

In the previous section, we have shown that the standard kNN method with a fixed $k$ is suboptimal with $d\leq\alpha+1$ or $\beta\leq 1/d$. The intuition is that the standard nearest neighbor method does not adjust $k$ based on the pdf and the suboptimality gap. In this section, we propose an adaptive nearest neighbor approach. To achieve a good exploration-exploitation tradeoff and bias-variance tradeoff, $k$ needs to be smaller for a small pdf $f(\mathbf{x})$ or a large suboptimality gap $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$, and vice versa. However, as both $f(\mathbf{x})$ and $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$ are unknown to the learner, we need to decide $k$ based entirely on existing samples. The guideline of our design is that, given a context $\mathbf{X}_t$ at time $t$, we use a large $k$ if previous samples are relatively dense around $\mathbf{X}_t$, and vice versa. To be more precise, for all $\mathbf{x}\in\mathcal{X}$, let

k_{a,t}(\mathbf{x})=\max\left\{j\,\Big|\,L\rho_{a,t,j}(\mathbf{x})\leq\sqrt{\frac{\ln T}{j}}\right\},   (16)

in which $\rho_{a,t,j}(\mathbf{x})$ is the distance from $\mathbf{x}$ to its $j$-th nearest neighbor among the existing samples with action $a$, i.e. $\{i<t\,|\,A_i=a\}$. Such a selection of $k$ makes the bias term $L\rho_{a,t,j}(\mathbf{x})$ match the variance term $\sqrt{\ln T/j}$, thus (16) achieves a good tradeoff between bias and variance. The exploration-exploitation tradeoff is also desirable, as $\rho_{a,t,j}(\mathbf{x})$ is large when $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$ is large, which yields a smaller $k$. Note that (16) can be calculated only if $L\rho_{a,t,1}(\mathbf{x})\leq\sqrt{\ln T}$, which means that the $1$-nearest neighbor distance cannot be too large. At some time step $t$, for some action $a$, if there are no existing samples, or $\mathbf{X}_t$ is more than $\sqrt{\ln T}/L$ away from all existing samples $\mathbf{X}_1,\ldots,\mathbf{X}_{t-1}$, then we simply let the UCB estimate be infinite, i.e. $\hat{\eta}_{a,t}(\mathbf{x})=\infty$. Otherwise, we calculate the upper confidence bound as follows:

\hat{\eta}_{a,t}(\mathbf{x})=\frac{1}{k_{a,t}(\mathbf{x})}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}Y_{i}+b_{a,t}(\mathbf{x})+L\rho_{a,t}(\mathbf{x}),   (17)

in which $\mathcal{N}_t(\mathbf{x},a)$ is the set of the $k_{a,t}(\mathbf{x})$ nearest neighbors of $\mathbf{x}$ among $\{i<t\,|\,A_i=a\}$, $\rho_{a,t}(\mathbf{x})$ is the corresponding $k_{a,t}(\mathbf{x})$ nearest neighbor distance of $\mathbf{x}$, i.e. $\rho_{a,t}(\mathbf{x})=\rho_{a,t,k_{a,t}(\mathbf{x})}(\mathbf{x})$, and

b_{a,t}(\mathbf{x})=\sqrt{\frac{2\sigma^{2}}{k_{a,t}(\mathbf{x})}\ln(dT^{2d+3}|\mathcal{A}|)}.   (18)

Similar to the fixed-$k$ nearest neighbor method, the last two terms in (17) cover the uncertainty of the reward function estimation. The term $b_{a,t}(\mathbf{x})$ gives a high probability bound on the random error, and $L\rho_{a,t}(\mathbf{x})$ bounds the bias. With the UCB calculation in (17), $\hat{\eta}_{a,t}(\mathbf{x})$ is an upper bound of $\eta_a(\mathbf{x})$ that holds with high probability, so that exploration and exploitation can be balanced well. The complete description of the newly proposed adaptive nearest neighbor method is shown in Algorithm 2.

Algorithm 2 Adaptive nearest neighbor with UCB exploration
  for $t=1,\ldots,T$ do
     Receive context $\mathbf{X}_t$;
     for $a\in\mathcal{A}$ do
        if $L\rho_{a,t,1}(\mathbf{X}_t)>\sqrt{\ln T}$ then
           $\hat{\eta}_{a,t}(\mathbf{X}_t)=\infty$;
        else
           Calculate $k_{a,t}(\mathbf{X}_t)$ using (16);
           Calculate $\hat{\eta}_{a,t}(\mathbf{X}_t)$ using (17);
        end if
     end for
     $A_t=\arg\max_{a}\hat{\eta}_{a,t}(\mathbf{X}_t)$;
     Pull $A_t$;
  end for
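The per-arm UCB computation of Algorithm 2 can be sketched as follows (our illustration, with the same assumed data layout as the sketch of Algorithm 1; the outer loop that pulls the arm with the largest UCB is unchanged):

import numpy as np

def adaptive_knn_ucb(x, arm_x, arm_y, L, sigma, T, d, n_actions):
    # arm_x, arm_y: contexts and rewards of previous pulls of this arm.
    if len(arm_y) == 0:
        return np.inf                                     # no samples: force exploration
    dist_all = np.linalg.norm(np.asarray(arm_x) - x, axis=1)
    order = np.argsort(dist_all)
    dist = dist_all[order]                                # sorted NN distances rho_{a,t,j}(x)
    if L * dist[0] > np.sqrt(np.log(T)):                  # 1-NN too far: infinite UCB
        return np.inf
    j = np.arange(1, len(dist) + 1)
    k = int(j[L * dist <= np.sqrt(np.log(T) / j)].max())  # adaptive k from (16)
    b = np.sqrt(2.0 * sigma**2 / k * ((2 * d + 3) * np.log(T) + np.log(d * n_actions)))  # (18)
    return float(np.asarray(arm_y)[order[:k]].mean() + b + L * dist[k - 1])              # (17)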

We then analyze the regret of the adaptive method for both bounded and unbounded supports of contexts.

Theorem 5.

Under Assumptions 1 and 2, the regret of the adaptive nearest neighbor method with UCB exploration is bounded by

R\lesssim T|\mathcal{A}|\left(\frac{T}{\ln T}\right)^{-\frac{1+\alpha}{d+2}}.   (19)

By comparing Theorem 5 with Theorem 3, it can be found that for the case with bounded support, the adaptive method improves over the fixed-$k$ nearest neighbor method. From the minimax bound in Theorem 1, the fixed-$k$ method is only optimal for $d\geq\alpha+1$, while the adaptive method is also optimal for $d<\alpha+1$, up to logarithmic factors. An intuitive explanation is that with a large $\alpha$, the suboptimality gap $\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})$ is small only in a small region, and the exploration-exploitation tradeoff becomes harder, thus the advantage of the adaptive method over the fixed one becomes more obvious.

We then analyze the performance of the adaptive nearest neighbor method for heavy-tailed distributions. The result is shown in the following theorem.

Theorem 6.

Under Assumptions 1 and 3, the regret of the adaptive nearest neighbor method with UCB exploration is bounded by

R\lesssim\begin{cases}T^{1-\min\left\{\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta},\beta\right\}}|\mathcal{A}|\ln T&\text{if }\beta\neq\frac{1}{d+2}\\ T^{\frac{d+1}{d+2}}|\mathcal{A}|\ln^{2}T&\text{if }\beta=\frac{1}{d+2}.\end{cases}   (22)

Compared with the minimax lower bound shown in Theorem 2, it can be found that our method achieves nearly minimax optimal regret up to logarithmic factors. Regarding this result, we have some additional remarks.

Remark 1.

It can be found that with $\beta\rightarrow\infty$, the regret bound in (22) reduces to (19). As discussed earlier, the case where the contexts have bounded support and $f$ is bounded away from zero can be viewed as a special case with $\beta\rightarrow\infty$.

Remark 2.

In (Zhao & Lai, 2021b), it is shown that the optimal rate of the excess risk of nonparametric classification is $\tilde{O}\left(N^{-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}\right)$. (The analysis in (Zhao & Lai, 2021b) is under a general smoothness assumption with parameter $p$; $p=1$ corresponds to the Lipschitz assumption, i.e. Assumption 1(c) in this paper, so here we state the bounds of (Zhao & Lai, 2021b) with $p=1$.) From (22), the average regret over all $T$ steps is $\tilde{O}\left(T^{-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}\right)$, which has the same rate as in the nonparametric classification problem.

7 Numerical Experiments

To begin with, to validate our theoretical analysis, we run experiments on synthesized data. We then move on to experiments with the MNIST dataset (LeCun, 1998).

7.1 Synthesized Data

Figure 1: Comparison of cumulative regrets of different methods for one dimensional distributions. (a) Uniform distribution. (b) Gaussian distribution. (c) $t_4$ distribution. (d) Cauchy distribution.

To begin with, we conduct experiments with $d=1$. In each experiment, we run $T=1{,}000$ steps and compare the performance of the adaptive nearest neighbor method with the UCBogram (Rigollet & Zeevi, 2010), ABSE (Perchet & Rigollet, 2013), and the fixed-$k$ nearest neighbor method. For a fair comparison, for UCBogram and ABSE, we try different numbers of bins and only pick the one with the best performance. The results are shown in Figure 1. In (a), (b), (c), and (d), the contexts follow the uniform distribution on $[-1,1]$, the standard Gaussian distribution, the $t_4$ distribution, and the Cauchy distribution, respectively. The uniform distribution is an example of a distribution with bounded support. The Gaussian, $t_4$, and Cauchy distributions satisfy the tail assumption (Assumption 3(a)) with $\beta=1$, $0.8$, and $0.5$, respectively. In each experiment, there are two actions. For the uniform and Gaussian distributions, we have $\eta_1(x)=x$ and $\eta_2(x)=-x$. For the $t_4$ and Cauchy distributions, since they are heavy-tailed, to ensure that Assumption 3(b) is satisfied, we do not use linear reward functions. Instead, we let $\eta_1(x)=\sin(x)$ and $\eta_2(x)=\cos(x)$. To make the comparison more reliable, the values in each curve in Figure 1 are averaged over $m=100$ random and independent trials.
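A sketch of the one-dimensional environments used above (ours; the noise level is an assumption, since the paper only requires subgaussian noise):

import numpy as np

rng = np.random.default_rng(0)

# Context samplers for the four settings of Figure 1.
samplers = {
    "uniform": lambda n: rng.uniform(-1.0, 1.0, n),
    "gaussian": lambda n: rng.standard_normal(n),
    "t4": lambda n: rng.standard_t(df=4, size=n),
    "cauchy": lambda n: rng.standard_cauchy(n),
}

def expected_rewards(name, x):
    # Expected rewards of the two arms, as described above.
    if name in ("uniform", "gaussian"):
        return np.array([x, -x])                  # eta_1(x) = x,      eta_2(x) = -x
    return np.array([np.sin(x), np.cos(x)])       # eta_1(x) = sin(x), eta_2(x) = cos(x)

def pull(name, x, a, noise_std=0.1):              # noise_std is our assumption
    return expected_rewards(name, x)[a] + noise_std * rng.standard_normal()

def instant_regret(name, x, a):
    eta = expected_rewards(name, x)
    return eta.max() - eta[a]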

We then run experiments for two dimensional distributions. In these experiments, the context distributions are Cartesian products of two one dimensional distributions. The two dimensional Gaussian distribution still satisfies Assumption 3(a) with $\beta=1$, and the two dimensional $t_4$ and Cauchy distributions satisfy Assumption 3(a) with $\beta=2/3$ and $1/3$, respectively, which are lower than in the one dimensional case. The results are shown in Figure 2.

Figure 2: Comparison of cumulative regrets of different methods for two dimensional distributions. (a) Uniform distribution. (b) Gaussian distribution. (c) $t_4$ distribution. (d) Cauchy distribution.

From these experiments, it can be observed that the adaptive nearest neighbor method significantly outperforms the other baselines.

7.2 Real Data

Now we run experiments using the MNIST dataset (LeCun, 1998), which contains $60{,}000$ images of handwritten digits of size $28\times 28$. Following the settings in (Guan & Jiang, 2018), the images are regarded as contexts, and there are $10$ actions, from $0$ to $9$. The reward is $1$ if the selected action equals the true label, and $0$ otherwise. The results are shown in Figure 3. Image data have high dimensionality but low intrinsic dimensionality. Compared with bin splitting based methods (Rigollet & Zeevi, 2010; Perchet & Rigollet, 2013), nearest neighbor methods are more adaptive to the local intrinsic dimension (Kpotufe, 2011). Therefore, in this experiment, we do not compare with the bin splitting based methods.
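The bandit feedback in this experiment can be written as the following small sketch (ours; the image preprocessing shown in the comments is an assumption):

def mnist_bandit_reward(label, action):
    # Reward is 1 iff the pulled action (a digit from 0 to 9) equals the true label;
    # only this scalar, not the label itself, is revealed to the learner.
    return 1.0 if action == label else 0.0

# One interaction step (flattening/scaling of the 28x28 images is assumed):
#   x_t = images[t].reshape(-1) / 255.0   # context = pixel vector
#   a_t = policy(x_t)                     # e.g. the adaptive kNN-UCB of Algorithm 2
#   y_t = mnist_bandit_reward(labels[t], a_t)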

Figure 3: Cumulative regrets for the MNIST dataset.

From Figure 3, the adaptive kNN method performs better than the standard kNN method with various values of $k$.

8 Conclusion

This paper analyzes the contextual bandit problem while allowing the context distribution to be heavy-tailed. To begin with, we have derived the minimax lower bound on the expected cumulative regret. We then show that the expected cumulative regret of the fixed-$k$ nearest neighbor method is suboptimal compared with the minimax lower bound. To close the gap, we have proposed an adaptive nearest neighbor approach, which significantly improves the performance, and its bound on the expected regret matches the minimax lower bound up to logarithmic factors. Finally, we have conducted numerical experiments to validate our results.

In the future, this work can be extended in the following ways. Firstly, following existing analysis in (Gur et al., 2022), it may be meaningful to design a smoothness adaptive method that can handle any Hölder smoothness parameters. Secondly, it is worth extending current work to handle dynamic regret functions. Finally, the theories and methods developed in this paper can be extended to more complicated tasks, such as reinforcement learning (Zhao & Lai, 2024).

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. Advances in Neural Information Processing Systems, 24, 2011.
  • Agrawal (1995) Agrawal, R. Sample mean based index policies by O(log n) regret for the multi-armed bandit problem. Advances in applied probability, 27(4):1054–1078, 1995.
  • Akhavan et al. (2024) Akhavan, A., Lounici, K., Pontil, M., and Tsybakov, A. B. Contextual continuum bandits: Static versus dynamic regret. arXiv preprint arXiv:2406.05714, 2024.
  • Audibert & Tsybakov (2007) Audibert, J.-Y. and Tsybakov, A. B. Fast learning rates for plug-in classifiers. The Annals of Statistics, pp.  608–633, 2007.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47:235–256, 2002.
  • Bastani & Bayati (2020) Bastani, H. and Bayati, M. Online decision making with high-dimensional covariates. Operations Research, 68(1):276–294, 2020.
  • Bastani et al. (2021) Bastani, H., Bayati, M., and Khosravi, K. Mostly exploration-free algorithms for contextual bandits. Management Science, 67(3):1329–1349, 2021.
  • Blanchard et al. (2023) Blanchard, M., Hanneke, S., and Jaillet, P. Adversarial rewards in universal learning for contextual bandits. arXiv preprint arXiv:2302.07186, 2023.
  • Bouneffouf et al. (2020) Bouneffouf, D., Rish, I., and Aggarwal, C. Survey on applications of multi-armed and contextual bandits. In 2020 IEEE Congress on Evolutionary Computation (CEC), pp. 1–8, 2020.
  • Cai et al. (2024) Cai, C., Cai, T. T., and Li, H. Transfer learning for contextual multi-armed bandits. The Annals of Statistics, 52(1):207–232, 2024.
  • Cannings et al. (2020) Cannings, T. I., Berrett, T. B., and Samworth, R. J. Local nearest neighbour classification with applications to semi-supervised learning. The Annals of Statistics, 48(3):1789–1814, 2020.
  • Chaudhuri & Dasgupta (2014) Chaudhuri, K. and Dasgupta, S. Rates of convergence for nearest neighbor classification. In Advances in Neural Information Processing Systems, volume 27, 2014.
  • Chu et al. (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp.  208–214, 2011.
  • Döring et al. (2018) Döring, M., Györfi, L., and Walk, H. Rate of convergence of $k$-nearest-neighbor classification rule. Journal of Machine Learning Research, 18(227):1–16, 2018.
  • Dudik et al. (2011) Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. arXiv preprint arXiv:1106.2369, 2011.
  • Durand et al. (2018) Durand, A., Achilleos, C., Iacovides, D., Strati, K., Mitsis, G. D., and Pineau, J. Contextual bandits for adapting treatment in a mouse model of de novo carcinogenesis. In Machine learning for healthcare conference, pp.  67–82, 2018.
  • Fedotov et al. (2003) Fedotov, A. A., Harremoës, P., and Topsoe, F. Refinements of pinsker’s inequality. IEEE Transactions on Information Theory, 49(6):1491–1498, 2003.
  • Gadat et al. (2016) Gadat, S., Klein, T., and Marteau, C. Classification in general finite dimensional spaces with the k nearest neighbor rule. The Annals of Statistics, pp.  982–1009, 2016.
  • Gao et al. (2018) Gao, W., Oh, S., and Viswanath, P. Demystifying fixed $k$-nearest neighbor information estimators. IEEE Transactions on Information Theory, 64(8):5629–5661, 2018.
  • Garivier & Cappé (2011) Garivier, A. and Cappé, O. The kl-ucb algorithm for bounded stochastic bandits and beyond. In Proceedings of the 24th Annual Conference on Learning Theory, pp.  359–376, 2011.
  • Ghosh et al. (2024) Ghosh, A., Sankararaman, A., Ramchandran, K., Javidi, T., and Mazumdar, A. Competing bandits in non-stationary matching markets. IEEE Transactions on Information Theory, 2024.
  • Guan & Jiang (2018) Guan, M. and Jiang, H. Nonparametric stochastic contextual bandits. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Gur et al. (2022) Gur, Y., Momeni, A., and Wager, S. Smoothness-adaptive contextual bandits. Operations Research, 70(6):3198–3216, 2022.
  • Hu et al. (2020) Hu, Y., Kallus, N., and Mao, X. Smooth contextual bandits: Bridging the parametric and non-differentiable regret regimes. In Conference on Learning Theory, pp.  2007–2010, 2020.
  • Jiang (2019) Jiang, H. Non-asymptotic uniform rates of consistency for k-nn regression. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  3999–4006, 2019.
  • Komiyama et al. (2024) Komiyama, J., Fouché, E., and Honda, J. Finite-time analysis of globally nonstationary multi-armed bandits. Journal of Machine Learning Research, 25(112):1–56, 2024.
  • Kpotufe (2011) Kpotufe, S. k-nn regression adapts to local intrinsic dimension. In Advances in Neural Information Processing Systems, pp. 729–737, 2011.
  • Krishnamurthy et al. (2020) Krishnamurthy, A., Langford, J., Slivkins, A., and Zhang, C. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. Journal of Machine Learning Research, 21(137):1–45, 2020.
  • Lai & Robbins (1985) Lai, T. L. and Robbins, H. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Langford & Zhang (2007) Langford, J. and Zhang, T. The epoch-greedy algorithm for multi-armed bandits with side information. Advances in Neural Information Processing Systems, 20, 2007.
  • LeCun (1998) LeCun, Y. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
  • Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp.  661–670, 2010.
  • Locatelli & Carpentier (2018) Locatelli, A. and Carpentier, A. Adaptivity to smoothness in x-armed bandits. In Conference on Learning Theory, pp.  1463–1492, 2018.
  • Mai & Johansson (2021) Mai, V. V. and Johansson, M. Stability and convergence of stochastic gradient clipping: Beyond lipschitz continuity and smoothness. In International Conference on Machine Learning, pp. 7325–7335, 2021.
  • Misra et al. (2019) Misra, K., Schwartz, E. M., and Abernethy, J. Dynamic online pricing with incomplete information using multiarmed bandit experiments. Marketing Science, 38(2):226–252, 2019.
  • Perchet & Rigollet (2013) Perchet, V. and Rigollet, P. The multi-armed bandit problem with covariates. The Annals of Statistics, 41(2):693–721, 2013.
  • Qian & Yang (2016) Qian, W. and Yang, Y. Kernel estimation and model combination in a bandit problem with covariates. Journal of Machine Learning Research, 17(149):1–37, 2016.
  • Qian et al. (2023) Qian, W., Ing, C.-K., and Liu, J. Adaptive algorithm for multi-armed bandit problem with high-dimensional covariates. Journal of the American Statistical Association, pp.  1–13, 2023.
  • Reeve et al. (2018) Reeve, H., Mellor, J., and Brown, G. The k-nearest neighbour ucb algorithm for multi-armed bandits with covariates. In Algorithmic Learning Theory, pp.  725–752, 2018.
  • Rigollet & Zeevi (2010) Rigollet, P. and Zeevi, A. Nonparametric bandits with covariates. 23th Annual Conference on Learning Theory, pp.  54, 2010.
  • Robbins (1952) Robbins, H. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
  • Slivkins (2014) Slivkins, A. Contextual bandits with similarity information. Journal of Machine Learning Research, 15:2533–2568, 2014.
  • Suk (2024) Suk, J. Adaptive smooth non-stationary bandits. arXiv preprint arXiv:2407.08654, 2024.
  • Suk & Kpotufe (2023) Suk, J. and Kpotufe, S. Tracking most significant shifts in nonparametric contextual bandits. Advances in Neural Information Processing Systems, 36:6202–6241, 2023.
  • Tsybakov (2009) Tsybakov, A. B. Introduction to Nonparametric Estimation. 2009.
  • Wanigasekara & Yu (2019) Wanigasekara, N. and Yu, C. L. Nonparametric contextual bandits in an unknown metric space. Advances in Neural Information Processing Systems, 32:14684–14694, 2019.
  • Woodroofe (1979) Woodroofe, M. A one-armed bandit problem with a concomitant variable. Journal of the American Statistical Association, 74(368):799–806, 1979.
  • Yang & Zhu (2002) Yang, Y. and Zhu, D. Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. The Annals of Statistics, 30(1):100–121, 2002.
  • Zangerle & Bauer (2022) Zangerle, E. and Bauer, C. Evaluating recommender systems: survey and framework. ACM Computing Surveys, 55(8):1–38, 2022.
  • Zhao & Lai (2021a) Zhao, P. and Lai, L. Efficient classification with adaptive knn. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  11007–11014, 2021a.
  • Zhao & Lai (2021b) Zhao, P. and Lai, L. Minimax rate optimal adaptive nearest neighbor classification and regression. IEEE Transactions on Information Theory, 67(5):3155–3182, 2021b.
  • Zhao & Lai (2022) Zhao, P. and Lai, L. Analysis of knn density estimation. IEEE Transactions on Information Theory, 68(12):7971–7995, 2022.
  • Zhao & Lai (2024) Zhao, P. and Lai, L. Minimax optimal q learning with nearest neighbors. IEEE Transactions on Information Theory, 2024.
  • Zhao & Wan (2024) Zhao, P. and Wan, Z. Robust nonparametric regression under poisoning attack. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.  17007–17015, 2024.
  • Zhou et al. (2017) Zhou, Q., Zhang, X., Xu, J., and Liang, B. Large-scale bandit approaches for recommender systems. In Neural Information Processing: 24th International Conference, ICONIP 2017, Guangzhou, China, November 14-18, 2017, Proceedings, Part I 24, pp.  811–821. Springer, 2017.
  • Zhu et al. (2022) Zhu, Y., Foster, D. J., Langford, J., and Mineiro, P. Contextual bandits with large action spaces: Made practical. In International Conference on Machine Learning, pp. 27428–27453. PMLR, 2022.

Appendix A Examples of Heavy-tailed Distributions

This section explains Examples 1 and 2 in the paper. For Example 1,

\text{P}(f(\mathbf{X})<t)=\int_{\mathcal{X}}f(\mathbf{x})\mathbf{1}(f(\mathbf{x})<t)\,d\mathbf{x}\leq tV(\mathcal{X}).   (23)

For Example 2, from Hölder’s inequality,

\int f^{1-\beta}(\mathbf{x})\,d\mathbf{x}=\int f^{1-\beta}(\mathbf{x})(1+\left\lVert\mathbf{x}\right\rVert^{\gamma})\frac{1}{1+\left\lVert\mathbf{x}\right\rVert^{\gamma}}\,d\mathbf{x}\leq\left(\int f(\mathbf{x})(1+\left\lVert\mathbf{x}\right\rVert^{\gamma})^{\frac{1}{1-\beta}}\,d\mathbf{x}\right)^{1-\beta}\left(\int(1+\left\lVert\mathbf{x}\right\rVert^{\gamma})^{-\frac{1}{\beta}}\,d\mathbf{x}\right)^{\beta}.

Let $\gamma=p(1-\beta)$; then $\left(\int f(\mathbf{x})(1+\left\lVert\mathbf{x}\right\rVert^{\gamma})^{\frac{1}{1-\beta}}\,d\mathbf{x}\right)^{1-\beta}<\infty$. If $\beta<p/(p+d)$, then $\gamma/\beta>d$, thus $\left(\int(1+\left\lVert\mathbf{x}\right\rVert^{\gamma})^{-\frac{1}{\beta}}\,d\mathbf{x}\right)^{\beta}<\infty$. Hence $\int f^{1-\beta}(\mathbf{x})\,d\mathbf{x}<\infty$, and

\text{P}(f(\mathbf{X})<t)=\text{P}(f^{-\beta}(\mathbf{X})>t^{-\beta})\leq t^{\beta}\mathbb{E}[f^{-\beta}(\mathbf{X})]=t^{\beta}\int f^{1-\beta}(\mathbf{x})\,d\mathbf{x}.   (25)

Therefore, for all $\beta<p/(p+d)$, Assumption 3(a) holds with some finite $C_\beta$.

For subgaussian or subexponential random variables, $\mathbb{E}[\left\lVert\mathbf{X}\right\rVert^{p}]<\infty$ holds for any $p$, thus Assumption 3(a) holds for $\beta$ arbitrarily close to $1$.

Appendix B Expected Sample Density

In this section, we define the expected sample density, which is then used in the later analysis. Throughout this section, denote $x(j)$ as the value of the $j$-th component of the vector $\mathbf{x}$.

Definition 1.

(expected sample density) $q_a:\mathcal{X}\rightarrow\mathbb{R}$ is defined as the function such that for all $S\subseteq\mathcal{X}$,

\mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}(\mathbf{X}_{t}\in S,A_{t}=a)\right]=\int_{S}q_{a}(\mathbf{x})\,d\mathbf{x}.   (26)

To show the existence of $q_a$, define

Q_{a}(\mathbf{x})=\mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}(\mathbf{X}_{t}\in\{u\,|\,u(1)\leq x(1),\ldots,u(d)\leq x(d)\},A_{t}=a)\right].   (27)

Then let

q_{a}(\mathbf{x})=\left.\frac{\partial^{d}Q_{a}}{\partial x(1)\cdots\partial x(d)}\right|_{\mathbf{x}},   (28)

and then (26) is satisfied for all $S\subseteq\mathcal{X}$.

Then we show the following basic lemmas.

Lemma 1.

Regardless of $\eta$, $q_a$ satisfies

q_{a}(\mathbf{x})\leq Tf(\mathbf{x})   (29)

for almost all $\mathbf{x}\in\mathcal{X}$.

Proof.

Note that $\text{P}(\mathbf{X}_{t}\in S)=\int_{S}f(\mathbf{x})\,d\mathbf{x}$. Therefore, for every set $S$,

\mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}(\mathbf{X}_{t}\in S,A_{t}=a)\right]\leq T\int_{S}f(\mathbf{x})\,d\mathbf{x}.   (30)

From (26) and (30), $\int_{S}q_{a}(\mathbf{x})\,d\mathbf{x}\leq T\int_{S}f(\mathbf{x})\,d\mathbf{x}$ for all $S$. Therefore $q_{a}(\mathbf{x})\leq Tf(\mathbf{x})$ for almost all $\mathbf{x}\in\mathcal{X}$. ∎

Lemma 2.

$R=\sum_{a\in\mathcal{A}}R_{a}$, in which $R_a$ is defined as

R_{a}:=\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\,d\mathbf{x}.   (31)

Proof.

R=\mathbb{E}\left[\sum_{t=1}^{T}\left(\eta^{*}(\mathbf{X}_{t})-\eta_{A_{t}}(\mathbf{X}_{t})\right)\right]=\sum_{a\in\mathcal{A}}\mathbb{E}\left[\sum_{t=1}^{T}(\eta^{*}(\mathbf{X}_{t})-\eta_{a}(\mathbf{X}_{t}))\mathbf{1}(A_{t}=a)\right]=\sum_{a\in\mathcal{A}}\int_{\mathcal{X}}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\right)q_{a}(\mathbf{x})\,d\mathbf{x}.   (32)

The proof is complete. ∎

Appendix C Proof of Theorem 1

Recall that

R=\mathbb{E}\left[\sum_{t=1}^{T}(\eta^{*}(\mathbf{X}_{t})-\eta_{A_{t}}(\mathbf{X}_{t}))\right].   (33)

Now we define

S=\mathbb{E}\left[\sum_{t=1}^{T}\mathbf{1}(\eta_{A_{t}}(\mathbf{X}_{t})<\eta^{*}(\mathbf{X}_{t}))\right]   (34)

as the expected number of steps with suboptimal actions.

The following lemma characterizes the relationship between $S$ and $R$.

Lemma 3.

There exists a constant $C_0$ such that

R\geq C_{0}S^{\frac{\alpha+1}{\alpha}}T^{-\frac{1}{\alpha}}.   (35)
Proof.

The proof of Lemma 3 follows the proof of Lemma 3.1 in (Rigollet & Zeevi, 2010). For completeness and consistency of notations, we show the proof in Appendix I.10. ∎

From now on, we restrict attention to the case with only two actions, such that 𝒜={1,1}\mathcal{A}=\{1,-1\}. Construct BB disjoint balls with centers 𝐜1,,𝐜B\mathbf{c}_{1},\ldots,\mathbf{c}_{B} and radius hh. Let

f(𝐱)=j=1B𝟏(𝐱Bj),\displaystyle f(\mathbf{x})=\sum_{j=1}^{B}\mathbf{1}(\mathbf{x}\in B_{j}), (36)

in which Bj={𝐱|𝐱𝐜jh}B_{j}=\{\mathbf{x}^{\prime}|\left\lVert\mathbf{x}^{\prime}-\mathbf{c}_{j}\right\rVert\leq h\} is the jj-th ball. To ensure that the pdf defined above is normalized (i.e. f(𝐱)𝑑𝐱=1\int f(\mathbf{x})d\mathbf{x}=1), BB and hh need to satisfy

Bvdhd=1,\displaystyle Bv_{d}h^{d}=1, (37)

in which vdv_{d} is the volume of dd dimensional unit ball.

Let \eta_{1}(\mathbf{x})=\eta(\mathbf{x}) and \eta_{-1}(\mathbf{x})=0, with

η(𝐱)=j=1Kvjh𝟏(𝐱B(cj,h)),\displaystyle\eta(\mathbf{x})=\sum_{j=1}^{K}v_{j}h\mathbf{1}(\mathbf{x}\in B(c_{j},h)), (38)

in which vj{1,1}v_{j}\in\{-1,1\} for j=1,,Kj=1,\ldots,K. To satisfy the margin assumption (Assumption 1(a)), note that

P(0<|η(𝐗)|t){Kvdhdifth0ift<h.\displaystyle\text{P}(0<|\eta(\mathbf{X})|\leq t)\leq\left\{\begin{array}[]{ccc}Kv_{d}h^{d}&\text{if}&t\geq h\\ 0&\text{if}&t<h.\end{array}\right. (41)

Note that for any suboptimal action aa, η(𝐱)ηa(𝐱)=|η(𝐱)|\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})=|\eta(\mathbf{x})|. Assumption 1(a) requires that P(0<|η(𝐗)|t)Cαtα\text{P}(0<|\eta(\mathbf{X})|\leq t)\leq C_{\alpha}t^{\alpha}. Therefore, it suffices to ensure that

Kvdhd=Cαhα.\displaystyle Kv_{d}h^{d}=C_{\alpha}h^{\alpha}. (42)
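As a sanity check on this construction (implicit in the argument), (37) and (42) give K=C_{\alpha}h^{\alpha-d}/v_{d} and B=1/(v_{d}h^{d}), so K/B=C_{\alpha}h^{\alpha}, which is at most 1 once hh is sufficiently small; hence only a C_{\alpha}h^{\alpha} fraction of the BB balls carries a nonzero gap and the construction is well defined.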

Then

S\displaystyle S =\displaystyle= j=1Kt=1TP(𝐗tBj,Ata(𝐗t))\displaystyle\sum_{j=1}^{K}\sum_{t=1}^{T}\text{P}(\mathbf{X}_{t}\in B_{j},A_{t}\neq a^{*}(\mathbf{X}_{t})) (43)
\displaystyle\geq j=1Kt=1TBjf(𝐱)P(Ata(𝐱)|𝐗t=x)𝑑𝐱\displaystyle\sum_{j=1}^{K}\sum_{t=1}^{T}\int_{B_{j}}f(\mathbf{x})\text{P}(A_{t}\neq a^{*}(\mathbf{x})|\mathbf{X}_{t}=x)d\mathbf{x}
=\displaystyle= j=1Kt=1TBjf(𝐱)P(Atvj|𝐗t=x)𝑑𝐱\displaystyle\sum_{j=1}^{K}\sum_{t=1}^{T}\int_{B_{j}}f(\mathbf{x})\text{P}(A_{t}\neq v_{j}|\mathbf{X}_{t}=x)d\mathbf{x}
=\sum_{j=1}^{K}\sum_{t=1}^{T}\mathbb{E}\left[\int_{B_{j}}f(\mathbf{x})\mathbf{1}(\pi(x;\mathbf{X}_{1:t-1},Y_{1:t-1})\neq v_{j})d\mathbf{x}\right].

Define

v^j(t)=sign(Bjf(𝐱)π(x;𝐗1:t1,Y1:t1)𝑑𝐱).\displaystyle\hat{v}_{j}(t)=\operatorname{sign}\left(\int_{B_{j}}f(\mathbf{x})\pi(x;\mathbf{X}_{1:t-1},Y_{1:t-1})d\mathbf{x}\right). (44)

Then

\int_{B_{j}}f(\mathbf{x})\mathbf{1}\left(\pi(x;\mathbf{X}_{1:t-1},Y_{1:t-1})=\hat{v}_{j}(t)\right)d\mathbf{x}\geq\int_{B_{j}}f(\mathbf{x})\mathbf{1}\left(\pi(x;\mathbf{X}_{1:t-1},Y_{1:t-1})=-\hat{v}_{j}(t)\right)d\mathbf{x}. (45)

Since

\int_{B_{j}}f(\mathbf{x})\mathbf{1}\left(\pi(x;\mathbf{X}_{1:t-1},Y_{1:t-1})=\hat{v}_{j}(t)\right)d\mathbf{x}+\int_{B_{j}}f(\mathbf{x})\mathbf{1}\left(\pi(x;\mathbf{X}_{1:t-1},Y_{1:t-1})=-\hat{v}_{j}(t)\right)d\mathbf{x}
=Bjf(𝐱)𝑑𝐱,\displaystyle=\int_{B_{j}}f(\mathbf{x})d\mathbf{x}, (46)

we have

Bjf(𝐱)𝟏(π(x;𝐗1:t1,Y1:t1)=v^j(t))𝑑𝐱12Bjf(𝐱)𝑑𝐱.\displaystyle\int_{B_{j}}f(\mathbf{x})\mathbf{1}\left(\pi(x;\mathbf{X}_{1:t-1},Y_{1:t-1})=\hat{v}_{j}(t)\right)d\mathbf{x}\geq\frac{1}{2}\int_{B_{j}}f(\mathbf{x})d\mathbf{x}. (47)

If v^j(t)vj\hat{v}_{j}(t)\neq v_{j}, then

Bjf(𝐱)𝟏(π(x;𝐗1:t1,Y1:t1)vj)𝑑𝐱12Bjf(𝐱)𝑑𝐱.\displaystyle\int_{B_{j}}f(\mathbf{x})\mathbf{1}\left(\pi(x;\mathbf{X}_{1:t-1},Y_{1:t-1})\neq v_{j}\right)d\mathbf{x}\geq\frac{1}{2}\int_{B_{j}}f(\mathbf{x})d\mathbf{x}. (48)

Therefore, from (43),

S\displaystyle S \displaystyle\geq j=1Kt=1T12P(v^j(t)vj)Bjf(𝐱)𝑑𝐱\displaystyle\sum_{j=1}^{K}\sum_{t=1}^{T}\frac{1}{2}\text{P}(\hat{v}_{j}(t)\neq v_{j})\int_{B_{j}}f(\mathbf{x})d\mathbf{x} (49)
\displaystyle\geq j=1Kt=1T12vdhdP(v^j(t)vj).\displaystyle\sum_{j=1}^{K}\sum_{t=1}^{T}\frac{1}{2}v_{d}h^{d}\text{P}(\hat{v}_{j}(t)\neq v_{j}).

Note that the error probability of hypothesis testing between distributions pp and qq is at least (1𝕋𝕍(p,q))/2(1-\mathbb{TV}(p,q))/2, in which 𝕋𝕍\mathbb{TV} denotes the total variation distance. Let (V1,,VK)(V_{1},\ldots,V_{K}) be a vector of KK random variables taking values in {1,1}K\{-1,1\}^{K} uniformly at random. In other words, P(Vj=1)=P(Vj=1)=1/2\text{P}(V_{j}=1)=\text{P}(V_{j}=-1)=1/2, and VjV_{j} for different jj are i.i.d. Denote XY|Vj=vj\mathbb{P}_{XY|V_{j}=v_{j}} as the distribution of XX and YY conditional on Vj=vjV_{j}=v_{j}. Moreover, XY|Vj=vjt1\mathbb{P}_{XY|V_{j}=v_{j}}^{t-1} denotes the joint distribution of the first t1t-1 samples of XX and YY conditional on Vj=vjV_{j}=v_{j}. Then

P(v^j(t)vj)\displaystyle\text{P}(\hat{v}_{j}(t)\neq v_{j}) \displaystyle\geq 12(1𝕋𝕍(XY|vj=1t1||XY|Vj=1t1))\displaystyle\frac{1}{2}\left(1-\mathbb{TV}\left(\mathbb{P}_{XY|v_{j}=1}^{t-1}||\mathbb{P}_{XY|V_{j}=-1}^{t-1}\right)\right) (50)
\displaystyle\geq 12(112D(XY|Vj=1t1||XY|Vj=1t1)),\displaystyle\frac{1}{2}\left(1-\sqrt{\frac{1}{2}D\left(\mathbb{P}_{XY|V_{j}=1}^{t-1}||\mathbb{P}_{XY|V_{j}=-1}^{t-1}\right)}\right),

in which the second step uses Pinsker’s inequality (Fedotov et al., 2003), and D(p||q)D(p||q) denotes the Kullback-Leibler (KL) divergence between distributions pp and qq. Note that the KL divergence between the conditional distribution is bounded by

D(\mathbb{P}_{Y|X,V_{j}=1}||\mathbb{P}_{Y|X,V_{j}=-1})\leq\frac{1}{2}(\eta_{1}(\mathbf{x})-\eta_{-1}(\mathbf{x}))^{2}=\frac{1}{2}\eta^{2}(\mathbf{x})=\frac{1}{2}h^{2} (51)

for 𝐱Bj\mathbf{x}\in B_{j}. Therefore

D(XY|Vj=1||XY|Vj=1)\displaystyle D(\mathbb{P}_{XY|V_{j}=1}||\mathbb{P}_{XY|V_{j}=-1}) =\displaystyle= f(𝐱)D(Y|X,Vj=1||Y|X,Vj=1)d𝐱\displaystyle\int f(\mathbf{x})D(\mathbb{P}_{Y|X,V_{j}=1}||\mathbb{P}_{Y|X,V_{j}=-1})d\mathbf{x} (52)
\displaystyle\leq Bj12h2𝑑𝐱\displaystyle\int_{B_{j}}\frac{1}{2}h^{2}d\mathbf{x}
=\displaystyle= 12vdhd+2.\displaystyle\frac{1}{2}v_{d}h^{d+2}.

Hence, from (50),

P(v^j(t)vj)\displaystyle\text{P}(\hat{v}_{j}(t)\neq v_{j}) \displaystyle\geq 12(114(t1)vdhd+2)\displaystyle\frac{1}{2}\left(1-\sqrt{\frac{1}{4}(t-1)v_{d}h^{d+2}}\right) (53)
\displaystyle\geq 12(112Tvdhd+2).\displaystyle\frac{1}{2}\left(1-\frac{1}{2}\sqrt{Tv_{d}h^{d+2}}\right).

Recall (49),

S\displaystyle S \displaystyle\geq 14j=1Kt=1Tvdhd(112Tvdhd+2)\displaystyle\frac{1}{4}\sum_{j=1}^{K}\sum_{t=1}^{T}v_{d}h^{d}\left(1-\frac{1}{2}\sqrt{Tv_{d}h^{d+2}}\right) (54)
=\displaystyle= 14KTvdhd(112Tvdhd+2)\displaystyle\frac{1}{4}KTv_{d}h^{d}\left(1-\frac{1}{2}\sqrt{Tv_{d}h^{d+2}}\right)
=\displaystyle= 14CαThα(112Tvdhd+2),\displaystyle\frac{1}{4}C_{\alpha}Th^{\alpha}\left(1-\frac{1}{2}\sqrt{Tv_{d}h^{d+2}}\right),

in which the last step comes from (42).

Let hT1d+2h\sim T^{-\frac{1}{d+2}}, with the constant factor chosen small enough that Tv_{d}h^{d+2}\leq 1, so that the factor in parentheses in (54) is at least 1/2; then

ST1αd+2.\displaystyle S\gtrsim T^{1-\frac{\alpha}{d+2}}. (55)

From Lemma 3,

R\gtrsim T^{\left(1-\frac{\alpha}{d+2}\right)\frac{\alpha+1}{\alpha}}T^{-\frac{1}{\alpha}}\sim T^{1-\frac{\alpha+1}{d+2}}. (56)
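For instance (a purely illustrative instantiation), with d=2 and \alpha=1 this lower bound reads R\gtrsim T^{1-\frac{2}{4}}=T^{1/2}.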

Appendix D Proof of Theorem 2

In this section, we derive the minimax lower bound of the expected regret with unbounded support. Recall that in the case with bounded support of contexts (proof of Theorem 1 in Appendix C), we construct BB disjoint balls with pdf f(𝐱)=1f(\mathbf{x})=1 on each ball. Now, for the case with unbounded support, the context distribution has a tail region on which the pdf f(𝐱)f(\mathbf{x}) is small. Therefore, we modify the construction of the balls as follows. We now construct B+1B+1 disjoint balls with centers 𝐜0,,𝐜B\mathbf{c}_{0},\ldots,\mathbf{c}_{B}, such that

B0\displaystyle B_{0} =\displaystyle= {𝐱|𝐱𝐜0r0},\displaystyle\{\mathbf{x}^{\prime}|\left\lVert\mathbf{x}^{\prime}-\mathbf{c}_{0}\right\rVert\leq r_{0}\}, (57)
Bj\displaystyle B_{j} =\displaystyle= {𝐱|𝐱𝐜jh}.\displaystyle\{\mathbf{x}^{\prime}|\left\lVert\mathbf{x}^{\prime}-\mathbf{c}_{j}\right\rVert\leq h\}. (58)

Let

η(𝐱)=j=1Kvjh𝟏(𝐱B(𝐜j,h)),\displaystyle\eta(\mathbf{x})=\sum_{j=1}^{K}v_{j}h\mathbf{1}(\mathbf{x}\in B(\mathbf{c}_{j},h)), (59)

in which v_{j}\in\{-1,1\} is unknown, and

f(𝐱)=𝟏(𝐱B0)+j=1Bm𝟏(𝐱Bj),\displaystyle f(\mathbf{x})=\mathbf{1}(\mathbf{x}\in B_{0})+\sum_{j=1}^{B}m\mathbf{1}(\mathbf{x}\in B_{j}), (60)

in which m1m\ll 1 will be determined later. Here we construct one ball B0B_{0} that represents the center region, which contains most of the probability mass, as well as BB balls that represent the tail region. For simplicity, we let η(𝐱)=0\eta(\mathbf{x})=0 on the largest ball B0B_{0}, so that only the tail balls carry nonzero suboptimality gaps.

To satisfy the margin condition (i.e. Assumption 1(a)), note that now

P(0<|η(𝐗)|<t)\displaystyle\text{P}(0<|\eta(\mathbf{X})|<t) \displaystyle\leq {mKvdhdift>h0ifth.\displaystyle\left\{\begin{array}[]{ccc}mKv_{d}h^{d}&\text{if}&t>h\\ 0&\text{if}&t\leq h.\end{array}\right. (63)

The right hand side of (63) can not exceed CαtαC_{\alpha}t^{\alpha}, which requires

mKvdhdCαhα.\displaystyle mKv_{d}h^{d}\leq C_{\alpha}h^{\alpha}. (64)

Moreover, to satisfy the tail assumption (Assumption 3(a)), note that

P(f(𝐗)<t)\displaystyle\text{P}(f(\mathbf{X})<t) \displaystyle\leq {mKvdhdift>m0iftm.\displaystyle\left\{\begin{array}[]{ccc}mKv_{d}h^{d}&\text{if}&t>m\\ 0&\text{if}&t\leq m.\end{array}\right. (67)

The right hand side of (67) can not exceed CβtβC_{\beta}t^{\beta}, which requires

mKvdhdCβmβ.\displaystyle mKv_{d}h^{d}\lesssim C_{\beta}m^{\beta}. (68)

Following (49), SS can be lower bounded by

S\displaystyle S \displaystyle\geq j=1Kt=1T12P(v^j(t)vj)Bjf(𝐱)𝑑𝐱\displaystyle\sum_{j=1}^{K}\sum_{t=1}^{T}\frac{1}{2}\text{P}(\hat{v}_{j}(t)\neq v_{j})\int_{B_{j}}f(\mathbf{x})d\mathbf{x} (69)
=\displaystyle= j=1Kt=1T12mvdhdP(v^j(t)vj)\displaystyle\sum_{j=1}^{K}\sum_{t=1}^{T}\frac{1}{2}mv_{d}h^{d}\text{P}(\hat{v}_{j}(t)\neq v_{j})
\displaystyle\geq 14j=1Kt=1Tmvdhd(112D(XY|Vj=1||XY|Vj=1))\displaystyle\frac{1}{4}\sum_{j=1}^{K}\sum_{t=1}^{T}mv_{d}h^{d}\left(1-\sqrt{\frac{1}{2}D(\mathbb{P}_{XY|V_{j}=1}||\mathbb{P}_{XY|V_{j}=-1})}\right)
\displaystyle\geq 14j=1Kt=1Tmvdhd(112Tmvdhd+2).\displaystyle\frac{1}{4}\sum_{j=1}^{K}\sum_{t=1}^{T}mv_{d}h^{d}\left(1-\frac{1}{2}\sqrt{Tmv_{d}h^{d+2}}\right).

From (69), we pick mm and hh to ensure that

Tmvdhd+2<12.\displaystyle Tmv_{d}h^{d+2}<\frac{1}{2}. (70)

Then under three conditions (64), (68) and (70),

SKTmhd.\displaystyle S\gtrsim KTmh^{d}. (71)

It remains to determine the value of mm, hh and KK based on these three conditions. Let

h(Tm)1d+2,\displaystyle h\sim(Tm)^{-\frac{1}{d+2}}, (72)
mTαα+β(d+2),\displaystyle m\sim T^{-\frac{\alpha}{\alpha+\beta(d+2)}}, (73)

and

Khαd/m,\displaystyle K\sim h^{\alpha-d}/m, (74)

then

SThαT1αβα+β(d+2).\displaystyle S\gtrsim Th^{\alpha}\sim T^{1-\frac{\alpha\beta}{\alpha+\beta(d+2)}}. (75)
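For completeness, the exponent in (75) can be verified as follows: with the choices (72) and (73), Tm\sim T^{1-\frac{\alpha}{\alpha+\beta(d+2)}}=T^{\frac{\beta(d+2)}{\alpha+\beta(d+2)}}, so Th^{\alpha}\sim T(Tm)^{-\frac{\alpha}{d+2}}\sim T\cdot T^{-\frac{\alpha\beta}{\alpha+\beta(d+2)}}=T^{1-\frac{\alpha\beta}{\alpha+\beta(d+2)}}.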

Based on Lemma 3,

RS1+ααT1αT1(α+1)βα+β(d+2).\displaystyle R\gtrsim S^{\frac{1+\alpha}{\alpha}}T^{-\frac{1}{\alpha}}\sim T^{1-\frac{(\alpha+1)\beta}{\alpha+\beta(d+2)}}. (76)

It remains to show that RT1βR\gtrsim T^{1-\beta}. Letting h1h\sim 1, KT1βK\sim T^{1-\beta} and m1/Tm\sim 1/T, the conditions (64), (68) and (70) are still satisfied. In this case,

ST1β.\displaystyle S\gtrsim T^{1-\beta}. (77)

A direct transformation using Lemma 3 yields a suboptimal bound. Intuitively, in the case with heavy tails (i.e. β\beta is small), the regret mainly occurs at the tail of the context distribution. Therefore, we bound the expected regret again.

R\displaystyle R =\displaystyle= 𝔼[t=1T(η(𝐗t)ηAt(𝐗t))]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}(\eta^{*}(\mathbf{X}_{t})-\eta_{A_{t}}(\mathbf{X}_{t}))\right] (78)
=(a)\displaystyle\overset{(a)}{=} 𝔼[t=1Th𝟏(ηAt(𝐗t)<η(𝐗t))]\displaystyle\mathbb{E}\left[\sum_{t=1}^{T}h\mathbf{1}\left(\eta_{A_{t}}(\mathbf{X}_{t})<\eta^{*}(\mathbf{X}_{t})\right)\right]
=(b)\displaystyle\overset{(b)}{=} ST1β.\displaystyle S\gtrsim T^{1-\beta}.

(a) comes from the construction of η\eta in (59). (b) holds since we set h=1h=1 here.

Combining (76) and (78),

RT1(α+1)βα+β(d+2)+T1β.\displaystyle R\gtrsim T^{1-\frac{(\alpha+1)\beta}{\alpha+\beta(d+2)}}+T^{1-\beta}. (79)
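As a sanity check (not part of the formal argument), when \beta\rightarrow\infty the exponent \frac{(\alpha+1)\beta}{\alpha+\beta(d+2)} tends to \frac{\alpha+1}{d+2} and the term T^{1-\beta} becomes negligible, so (79) recovers the bounded-support lower bound T^{1-\frac{\alpha+1}{d+2}} derived in Appendix C.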

Appendix E Proof of Theorem 3

To begin with, we show the following lemma.

Lemma 4.

For all u>0u>0,

P(supx,a|1ki𝒩t(𝐱,a)Wi|>u)dT2d|𝒜|eku22σ2.\displaystyle\text{P}\left(\underset{x,a}{\sup}\left|\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}\right|>u\right)\leq dT^{2d}|\mathcal{A}|e^{-\frac{ku^{2}}{2\sigma^{2}}}. (80)
Proof.

The proof is shown in Appendix I.1. ∎

From Lemma 4 and the definition of bb in (10),

P(supx,a|1ki𝒩t(𝐱,a)Wi|>b)1T2.\displaystyle\text{P}\left(\underset{x,a}{\sup}\left|\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}\right|>b\right)\leq\frac{1}{T^{2}}. (81)

Therefore, with probability 11/T1-1/T, for all 𝐱𝒳\mathbf{x}\in\mathcal{X}, a𝒜a\in\mathcal{A} and t=1,,Tt=1,\ldots,T, |i𝒩t(𝐱,a)Wi|/kb|\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}|/k\leq b. Denote EE as the event such that |i𝒩t(𝐱,a)Wi|/kb|\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}|/k\leq b, x,a,t\forall x,a,t, then

P(E)\displaystyle\text{P}(E) =\displaystyle= P(t=1T{supx,a|1ki𝒩t(𝐱,a)Wi|b})\displaystyle\text{P}\left(\cap_{t=1}^{T}\left\{\sup_{x,a}\left|\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}\right|\leq b\right\}\right) (82)
=1-\text{P}\left(\cup_{t=1}^{T}\left\{\sup_{x,a}\left|\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}\right|>b\right\}\right)
\displaystyle\geq 1T1T211T.\displaystyle 1-T\frac{1}{T^{2}}\geq 1-\frac{1}{T}.

Recall the calculation of the UCB in (9). Based on Lemma 4, we now show some properties of this UCB.

Lemma 5.

Under EE, if |{i<t|Ai=a}|k|\{i<t|A_{i}=a\}|\geq k, then

\eta_{a}(\mathbf{x})\leq\hat{\eta}_{a,t}(\mathbf{x})\leq\eta_{a}(\mathbf{x})+2b+2L\rho_{a,t}(\mathbf{x}). (83)
Proof.

The proof is shown in Appendix I.2. ∎

We then bound the number of steps with suboptimal action aa. Define

n(x,a,r):=t=1T𝟏(𝐗t𝐱<r,At=a).\displaystyle n(x,a,r):=\sum_{t=1}^{T}\mathbf{1}\left(\left\lVert\mathbf{X}_{t}-\mathbf{x}\right\rVert<r,A_{t}=a\right). (84)

Then the following lemma holds.

Lemma 6.

Under EE, for any 𝐱𝒳\mathbf{x}\in\mathcal{X}, a𝒜a\in\mathcal{A}, if η(𝐱)ηa(𝐱)>2b\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>2b, define

ra(𝐱)=η(𝐱)ηa(𝐱)2b6L,\displaystyle r_{a}(\mathbf{x})=\frac{\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})-2b}{6L}, (85)

then

n(x,a,ra(𝐱))k.\displaystyle n(x,a,r_{a}(\mathbf{x}))\leq k. (86)
Proof.

The proof is shown in Appendix I.3. ∎

From Lemma 6, the expectation of n(x,a,ra(𝐱))n(x,a,r_{a}(\mathbf{x})) can be bounded as follows.

\mathbb{E}[n(x,a,r_{a}(\mathbf{x}))]\leq\text{P}(E)\mathbb{E}[n(x,a,r_{a}(\mathbf{x}))|E]+\text{P}(E^{c})T (87)
\displaystyle\leq k+1,\displaystyle k+1,

in which the first step holds since even if EE does not hold, the number of steps in n(x,a,ra(𝐱))n(x,a,r_{a}(\mathbf{x})) is no more than the total sample size TT. The second step uses (82). From (87) and the definition of expected sample density in (26),

B(x,ra(𝐱))qa(𝐮)𝑑uk+1.\displaystyle\int_{B(x,r_{a}(\mathbf{x}))}q_{a}(\mathbf{u})du\leq k+1. (88)

Inequality (88) bounds the average value of qaq_{a} over a neighborhood of 𝐱\mathbf{x}. However, it does not bound qa(𝐱)q_{a}(\mathbf{x}) directly. To bound RaR_{a}, we introduce a new random variable 𝐙\mathbf{Z}, with pdf

g(𝐳)=1MZ[(η(𝐳)ηa(𝐳))ϵ]d,\displaystyle g(\mathbf{z})=\frac{1}{M_{Z}\left[(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\vee\epsilon\right]^{d}}, (89)

in which ϵ=4b\epsilon=4b, with bb defined in (10). MZM_{Z} is the constant for normalization. We then bound RaR_{a} defined in (31). RaR_{a} can be split into two terms:

Ra\displaystyle R_{a} =\displaystyle= 𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ)𝑑𝐱\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon)d\mathbf{x} (90)
+𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ)𝑑𝐱.\displaystyle+\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon)d\mathbf{x}.

To begin with, we bound the first term in (90). We show the following lemma.

Lemma 7.

There exists a constant C1C_{1}, such that

𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ)𝑑𝐱C1MZ𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u],\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon)d\mathbf{x}\leq C_{1}M_{Z}\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right],
(91)

in which ϵ=4b\epsilon=4b.

Proof.

The proof of Lemma 7 is shown in Appendix I.4. ∎

Now we bound the right hand side of (91). We show the following lemma.

Lemma 8.
𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]{kMZcϵα+1difd>α+1kMZcln1ϵifd=α+1kMZcifd<α+1,\displaystyle\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right]\lesssim\left\{\begin{array}[]{ccc}\frac{k}{M_{Z}c}\epsilon^{\alpha+1-d}&\text{if}&d>\alpha+1\\ \frac{k}{M_{Z}c}\ln\frac{1}{\epsilon}&\text{if}&d=\alpha+1\\ \frac{k}{M_{Z}c}&\text{if}&d<\alpha+1,\end{array}\right. (95)

in which cc is the lower bound on the pdf of the contexts, which comes from Assumption 2.

Proof.

The proof of Lemma 8 is shown in Appendix I.5. ∎

From Lemma 7 and 8,

𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ)𝑑𝐱{kϵα+1difd>α+1kln1ϵifd=α+1kifd<α+1.\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon)d\mathbf{x}\lesssim\left\{\begin{array}[]{ccc}k\epsilon^{\alpha+1-d}&\text{if}&d>\alpha+1\\ k\ln\frac{1}{\epsilon}&\text{if}&d=\alpha+1\\ k&\text{if}&d<\alpha+1.\end{array}\right. (99)

Now we bound the second term in (90). From Lemma 1, qa(𝐱)Tf(𝐱)q_{a}(\mathbf{x})\leq Tf(\mathbf{x}) for almost all 𝐱𝒳\mathbf{x}\in\mathcal{X}. Thus

(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ)𝑑𝐱\displaystyle\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon)d\mathbf{x} \displaystyle\leq T(η(𝐱)ηa(𝐱))f(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ)𝑑𝐱\displaystyle T\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))f(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon)d\mathbf{x} (100)
\displaystyle\leq TϵP(η(𝐗)ηa(𝐗)ϵ)\displaystyle T\epsilon\text{P}\left(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})\leq\epsilon\right)
\displaystyle\leq CαTϵα+1.\displaystyle C_{\alpha}T\epsilon^{\alpha+1}.

Therefore, from (90), (99) and (100),

Ra{Tϵα+1+kϵα+1difd>α+1Tϵα+1+kln1ϵifd=α+1Tϵα+1+kifd<α+1.\displaystyle R_{a}\lesssim\left\{\begin{array}[]{ccc}T\epsilon^{\alpha+1}+k\epsilon^{\alpha+1-d}&\text{if}&d>\alpha+1\\ T\epsilon^{\alpha+1}+k\ln\frac{1}{\epsilon}&\text{if}&d=\alpha+1\\ T\epsilon^{\alpha+1}+k&\text{if}&d<\alpha+1.\end{array}\right. (104)

Recall that

ϵ=4b=42σ2kln(dT2d+2|𝒜|).\displaystyle\epsilon=4b=4\sqrt{\frac{2\sigma^{2}}{k}\ln(dT^{2d+2}|\mathcal{A}|)}. (105)

If d>α+1d>\alpha+1, let kT2d+2k\sim T^{\frac{2}{d+2}}, then

ϵT1d+2ln(dT2d+2|𝒜|),\displaystyle\epsilon\sim T^{-\frac{1}{d+2}}\sqrt{\ln(dT^{2d+2}|\mathcal{A}|)}, (106)

and

RaT1α+1d+2lnα+12(dT2d+2|𝒜|).\displaystyle R_{a}\lesssim T^{1-\frac{\alpha+1}{d+2}}\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|). (107)
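For intuition (a heuristic check that ignores the logarithmic factors), the choice k\sim T^{\frac{2}{d+2}} balances the two terms in (104): since \epsilon\sim k^{-1/2}, we have T\epsilon^{\alpha+1}\sim Tk^{-\frac{\alpha+1}{2}} and k\epsilon^{\alpha+1-d}\sim k^{\frac{d+1-\alpha}{2}}; equating them yields T=k^{\frac{d+2}{2}}, i.e. k\sim T^{\frac{2}{d+2}}, under which both terms are of order T^{1-\frac{\alpha+1}{d+2}}.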

If dα+1d\leq\alpha+1, let kT2α+3k\sim T^{\frac{2}{\alpha+3}}, then

ϵT1α+3ln(dT2d+2|𝒜|),\displaystyle\epsilon\sim T^{-\frac{1}{\alpha+3}}\sqrt{\ln(dT^{2d+2}|\mathcal{A}|)}, (108)

and

R_{a}\lesssim T^{\frac{2}{\alpha+3}}\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|). (109)

Theorem 3 can then be proved using R=a𝒜RaR=\sum_{a\in\mathcal{A}}R_{a} as stated in Lemma 2.

Appendix F Proof of Theorem 4

Recall the expression of regret shown in Lemma 2. We decompose RaR_{a} as follows.

Ra\displaystyle R_{a} =\displaystyle= 𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ)𝑑𝐱\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon)d\mathbf{x} (110)
+𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ,f(𝐱)>kTϵd)𝑑𝐱\displaystyle+\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon,f(\mathbf{x})>\frac{k}{T\epsilon^{d}}\right)d\mathbf{x}
+𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ,kT<f(𝐱)kTϵd)𝑑𝐱\displaystyle+\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon,\frac{k}{T}<f(\mathbf{x})\leq\frac{k}{T\epsilon^{d}}\right)d\mathbf{x}
+𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ,f(𝐱)kT)𝑑𝐱\displaystyle+\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon,f(\mathbf{x})\leq\frac{k}{T}\right)d\mathbf{x}
:=\displaystyle:= I1+I2+I3+I4,\displaystyle I_{1}+I_{2}+I_{3}+I_{4},

in which ϵ\epsilon is the same as in the proof of Theorem 3 in Appendix E, i.e. ϵ=4b\epsilon=4b.

Bound of I1I_{1}. From Lemma 1, q_{a}(\mathbf{x})\leq Tf(\mathbf{x}) for almost all 𝐱𝒳\mathbf{x}\in\mathcal{X}. Hence

I1\displaystyle I_{1} \displaystyle\leq T𝒳(η(𝐱)ηa(𝐱))𝟏(η(𝐱)ηa(𝐱)ϵ)𝑑𝐱\displaystyle T\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon)d\mathbf{x} (111)
\displaystyle\leq TϵP(η(𝐗)ηa(𝐗)ϵ)\displaystyle T\epsilon\text{P}(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})\leq\epsilon)
\displaystyle\leq CαTϵ1+α.\displaystyle C_{\alpha}T\epsilon^{1+\alpha}.

Bound of I2I_{2}. The regret over the high density region can be bounded similarly to the regret in the case where the pdf is bounded away from zero. Following the proof of Theorem 3 in Appendix E, define

g(𝐳)=1MZ[(η(𝐳)ηa(𝐳))ϵ]d𝟏(f(𝐳)>kTϵd).\displaystyle g(\mathbf{z})=\frac{1}{M_{Z}\left[(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\vee\epsilon\right]^{d}}\mathbf{1}\left(f(\mathbf{z})>\frac{k}{T\epsilon^{d}}\right). (112)

Similar to Lemma 7,

I_{2}\lesssim M_{Z}\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right]. (113)

Similar to Lemma 8, we now replace cc with k/(Tϵd)k/(T\epsilon^{d}). Then

\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right]\lesssim\left\{\begin{array}[]{ccc}\frac{T\epsilon^{d}}{M_{Z}}\epsilon^{\alpha+1-d}&\text{if}&d>\alpha+1\\ \frac{T\epsilon^{d}}{M_{Z}}\ln\frac{1}{\epsilon}&\text{if}&d=\alpha+1\\ \frac{T\epsilon^{d}}{M_{Z}}&\text{if}&d<\alpha+1.\end{array}\right. (117)

Therefore

I2{Tϵdϵα+1difd>α+1Tϵdln1ϵifd=α+1Tϵdifd<α+1.\displaystyle I_{2}\lesssim\left\{\begin{array}[]{ccc}T\epsilon^{d}\epsilon^{\alpha+1-d}&\text{if}&d>\alpha+1\\ T\epsilon^{d}\ln\frac{1}{\epsilon}&\text{if}&d=\alpha+1\\ T\epsilon^{d}&\text{if}&d<\alpha+1.\end{array}\right. (121)

Bound of I3I_{3}. Here we introduce the following lemma.

Lemma 9.

(Restated from Lemma 6 in (Zhao & Lai, 2021b)) For any 0<a<b0<a<b,

𝔼[fp(𝐗)𝟏(af(𝐗)<b)]{bβpifp>βlnbaifp=βaβpifp<β.\displaystyle\mathbb{E}[f^{-p}(\mathbf{X})\mathbf{1}(a\leq f(\mathbf{X})<b)]\lesssim\left\{\begin{array}[]{ccc}b^{\beta-p}&\text{if}&p>\beta\\ \ln\frac{b}{a}&\text{if}&p=\beta\\ a^{\beta-p}&\text{if}&p<\beta.\end{array}\right. (125)
Proof.

The proof of Lemma 9 can follow that of Lemma 6 in (Zhao & Lai, 2021b). For completeness, we show the proof in Appendix I.9. ∎

Based on Lemma 9, I3I_{3} can be bounded by

I_{3}=\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon,\frac{k}{T}<f(\mathbf{x})\leq\frac{k}{T\epsilon^{d}}\right)d\mathbf{x} (128)
\displaystyle\lesssim 𝒳(kTf(𝐱))1dTf(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ,kT<f(𝐱)kTϵd)𝑑𝐱\displaystyle\int_{\mathcal{X}}\left(\frac{k}{Tf(\mathbf{x})}\right)^{\frac{1}{d}}Tf(\mathbf{x})\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon,\frac{k}{T}<f(\mathbf{x})\leq\frac{k}{T\epsilon^{d}}\right)d\mathbf{x}
\leq T\left(\frac{k}{T}\right)^{\frac{1}{d}}\mathbb{E}\left[f^{-\frac{1}{d}}(\mathbf{X})\mathbf{1}\left(\frac{k}{T}<f(\mathbf{X})<\frac{k}{T\epsilon^{d}}\right)\right]
\displaystyle\lesssim {T(kT)βifβ<1dT(kT)1dln1ϵifβ=1dT(kT)βϵ1dβifβ>1d.\displaystyle\left\{\begin{array}[]{ccc}T\left(\frac{k}{T}\right)^{\beta}&\text{if}&\beta<\frac{1}{d}\\ T\left(\frac{k}{T}\right)^{\frac{1}{d}}\ln\frac{1}{\epsilon}&\text{if}&\beta=\frac{1}{d}\\ T\left(\frac{k}{T}\right)^{\beta}\epsilon^{1-d\beta}&\text{if}&\beta>\frac{1}{d}.\end{array}\right.

Bound of I4I_{4}.

I4\displaystyle I_{4} \displaystyle\leq TMf(𝐱)𝟏(f(𝐱)kT)𝑑𝐱\displaystyle TM\int f(\mathbf{x})\mathbf{1}\left(f(\mathbf{x})\leq\frac{k}{T}\right)d\mathbf{x} (129)
\displaystyle\leq TMP(f(𝐗)kT)\displaystyle TM\text{P}(f(\mathbf{X})\leq\frac{k}{T})
\displaystyle\lesssim T(kT)β.\displaystyle T\left(\frac{k}{T}\right)^{\beta}.

Now we bound RaR_{a} by selecting kk to minimize the sum of I1I_{1}, I2I_{2}, I3I_{3}, I4I_{4}. Recall that ϵ=4b\epsilon=4b, in which bb is defined in (10), thus ϵln(dT2d+2|𝒜|)/k\epsilon\sim\sqrt{\ln(dT^{2d+2}|\mathcal{A}|)/k}.

(1) If d>α+1d>\alpha+1 and β>1/d\beta>1/d, then with kT2βα+(d+2)βk\sim T^{\frac{2\beta}{\alpha+(d+2)\beta}},

R\displaystyle R \displaystyle\lesssim T(ϵ1+α+(kT)βϵ1dβ)\displaystyle T\left(\epsilon^{1+\alpha}+\left(\frac{k}{T}\right)^{\beta}\epsilon^{1-d\beta}\right) (130)
\displaystyle\sim T1β(α+1)α+(d+2)βlnα+12(dT2d+2|𝒜|).\displaystyle T^{1-\frac{\beta(\alpha+1)}{\alpha+(d+2)\beta}}\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|).
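As a supplementary check of case (1), with \epsilon\sim k^{-1/2} and k\sim T^{\frac{2\beta}{\alpha+(d+2)\beta}}, the two terms in the first line of (130) are indeed of the same order: T\epsilon^{1+\alpha}\sim Tk^{-\frac{1+\alpha}{2}}\sim T^{1-\frac{\beta(\alpha+1)}{\alpha+(d+2)\beta}}, and T\left(\frac{k}{T}\right)^{\beta}\epsilon^{1-d\beta}\sim T^{1-\beta}k^{\frac{(d+2)\beta-1}{2}}\sim T^{1-\frac{\beta(\alpha+1)}{\alpha+(d+2)\beta}}.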

(2) If dα+1d\leq\alpha+1 and β>1/d\beta>1/d, then with kT2βd1+β(d+2)k\sim T^{\frac{2\beta}{d-1+\beta(d+2)}},

R\displaystyle R \displaystyle\lesssim T(ϵd+(kT)βϵ1dβ)\displaystyle T\left(\epsilon^{d}+\left(\frac{k}{T}\right)^{\beta}\epsilon^{1-d\beta}\right) (131)
\displaystyle\sim T1βdd1+(d+2)βlnd2(dT2d+2|𝒜|).\displaystyle T^{1-\frac{\beta d}{d-1+(d+2)\beta}}\ln^{\frac{d}{2}}(dT^{2d+2}|\mathcal{A}|).

(3) If d>α+1d>\alpha+1 and β1/d\beta\leq 1/d, then with kT2β1+α+2βk\sim T^{\frac{2\beta}{1+\alpha+2\beta}},

R\lesssim T\epsilon^{1+\alpha}+T\left(\frac{k}{T}\right)^{\beta} (132)
\displaystyle\sim T1β(α+1)1+α+2βlnα+12(dT2d+2|𝒜|).\displaystyle T^{1-\frac{\beta(\alpha+1)}{1+\alpha+2\beta}}\ln^{\frac{\alpha+1}{2}}(dT^{2d+2}|\mathcal{A}|).

(4) If dα+1d\leq\alpha+1 and β1/d\beta\leq 1/d, then with kT2βd+2βk\sim T^{\frac{2\beta}{d+2\beta}},

R\lesssim T\epsilon^{d}+T\left(\frac{k}{T}\right)^{\beta} (133)
\displaystyle\sim T1βdd+2βlnd2(dT2d+2|𝒜|).\displaystyle T^{1-\frac{\beta d}{d+2\beta}}\ln^{\frac{d}{2}}(dT^{2d+2}|\mathcal{A}|).

Combining all these cases, we conclude that

RT1βmin(d,α+1)min(d1,α)+max(1,dβ)+2β|𝒜|ln12max(d,α+1)(dT2d+2|𝒜|).\displaystyle R\lesssim T^{1-\frac{\beta\min(d,\alpha+1)}{\min(d-1,\alpha)+\max(1,d\beta)+2\beta}}|\mathcal{A}|\ln^{\frac{1}{2}\max(d,\alpha+1)}(dT^{2d+2}|\mathcal{A}|). (134)
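As a consistency check, in case (1) (d>\alpha+1 and \beta>1/d) we have \min(d,\alpha+1)=\alpha+1, \min(d-1,\alpha)=\alpha and \max(1,d\beta)=d\beta, so the exponent in (134) equals 1-\frac{\beta(\alpha+1)}{\alpha+(d+2)\beta}, matching (130); the other three cases can be verified in the same way.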

The proof of Theorem 4 is complete.

Appendix G Proof of Theorem 5

To begin with, similar to Lemma 4 and Lemma 5, we show the following lemmas.

Lemma 10.
P(supx,a,k|1ki𝒩t,k(𝐱,a)Wi|>u)dT2d+1|𝒜|eu22σ2,\displaystyle\text{P}\left(\underset{x,a,k}{\sup}\left|\frac{1}{\sqrt{k}}\sum_{i\in\mathcal{N}_{t,k}(\mathbf{x},a)}W_{i}\right|>u\right)\leq dT^{2d+1}|\mathcal{A}|e^{-\frac{u^{2}}{2\sigma^{2}}}, (135)

with \mathcal{N}_{t,k}(\mathbf{x},a) being the set of the kk nearest neighbors of 𝐱\mathbf{x} among {𝐗i|i<t,Ai=a}\{\mathbf{X}_{i}|i<t,A_{i}=a\}.

Proof.

From Lemma 4,

P(supx,a|1ki𝒩t,k(𝐱,a)Wi|>u)dT2d|𝒜|eu22σ2.\displaystyle\text{P}\left(\underset{x,a}{\sup}\left|\frac{1}{\sqrt{k}}\sum_{i\in\mathcal{N}_{t,k}(\mathbf{x},a)}W_{i}\right|>u\right)\leq dT^{2d}|\mathcal{A}|e^{-\frac{u^{2}}{2\sigma^{2}}}. (136)

Lemma 10 can then be proved by taking a union bound over all kk. ∎

Lemma 11.

Define event EE, such that E=1E=1 if

|1ki𝒩t,k(𝐱,a)Wi|2σ2ln(dT2d+3|𝒜|)\displaystyle\left|\frac{1}{\sqrt{k}}\sum_{i\in\mathcal{N}_{t,k}(\mathbf{x},a)}W_{i}\right|\leq\sqrt{2\sigma^{2}\ln(dT^{2d+3}|\mathcal{A}|)} (137)

for all x,a,k,tx,a,k,t, then P(E)11/T\text{P}(E)\geq 1-1/T. Moreover, under EE,

ηa(𝐱)η^a,t(𝐱)ηa(𝐱)+2ba,t(𝐱)+2Lρa,t(𝐱).\displaystyle\eta_{a}(\mathbf{x})\leq\hat{\eta}_{a,t}(\mathbf{x})\leq\eta_{a}(\mathbf{x})+2b_{a,t}(\mathbf{x})+2L\rho_{a,t}(\mathbf{x}). (138)
Proof.

The proof is similar to the proof of Lemma 5.

|η^a,t(𝐱)(ηa(𝐱)+ba,t(𝐱)+Lρa,t(𝐱))|\displaystyle|\hat{\eta}_{a,t}(\mathbf{x})-(\eta_{a}(\mathbf{x})+b_{a,t}(\mathbf{x})+L\rho_{a,t}(\mathbf{x}))| (139)
\displaystyle\leq |1ka,t(𝐱)i𝒩t(𝐱,a)(Yiηa(𝐱))|\displaystyle\left|\frac{1}{k_{a,t}(\mathbf{x})}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}(Y_{i}-\eta_{a}(\mathbf{x}))\right|
\displaystyle\leq |1ka,t(𝐱)i𝒩t(𝐱,a)(Yiηa(𝐗i))|+1ka,t(𝐱)i𝒩t(𝐱,a)|ηa(𝐗i)ηa(𝐱)|\displaystyle\left|\frac{1}{k_{a,t}(\mathbf{x})}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}(Y_{i}-\eta_{a}(\mathbf{X}_{i}))\right|+\frac{1}{k_{a,t}(\mathbf{x})}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}|\eta_{a}(\mathbf{X}_{i})-\eta_{a}(\mathbf{x})|
\displaystyle\leq Lρa,t(𝐱)+ba,t(𝐱).\displaystyle L\rho_{a,t}(\mathbf{x})+b_{a,t}(\mathbf{x}).

With these preparations, we then bound the number of steps around each 𝐱\mathbf{x} in the next lemma, which is crucially different from Lemma 6. Here we keep the definition n(x,a,r):=t=1T𝟏(𝐗t𝐱<r,At=a)n(x,a,r):=\sum_{t=1}^{T}\mathbf{1}\left(\left\lVert\mathbf{X}_{t}-\mathbf{x}\right\rVert<r,A_{t}=a\right) the same as in (84), but change the definitions of rar_{a} and nan_{a} as follows.

Lemma 12.

Define

ra(𝐱)=12LC1(η(𝐱)ηa(𝐱)),\displaystyle r_{a}(\mathbf{x})=\frac{1}{2L\sqrt{C_{1}}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})), (140)

and

na(𝐱)=C1lnT(η(𝐱)ηa(𝐱))2,\displaystyle n_{a}(\mathbf{x})=\frac{C_{1}\ln T}{(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))^{2}}, (141)

in which

C1=max{4,32σ2(2d+3+ln(d|𝒜|))}.\displaystyle C_{1}=\max\{4,32\sigma^{2}(2d+3+\ln(d|\mathcal{A}|))\}. (142)

Then under EE,

n(x,a,ra(𝐱))na(𝐱).\displaystyle n(x,a,r_{a}(\mathbf{x}))\leq n_{a}(\mathbf{x}). (143)
Proof.

The proof of Lemma 12 is shown in Appendix I.6. ∎

From Lemma 12,

𝔼[n(x,a,ra(𝐱))]\displaystyle\mathbb{E}[n(x,a,r_{a}(\mathbf{x}))] \displaystyle\leq P(E)𝔼[n(x,a,ra(𝐱))|E]+P(Ec)𝔼[n(x,a,ra(𝐱))|Ec]\displaystyle\text{P}(E)\mathbb{E}[n(x,a,r_{a}(\mathbf{x}))|E]+\text{P}(E^{c})\mathbb{E}[n(x,a,r_{a}(\mathbf{x}))|E^{c}] (144)
\displaystyle\leq na(𝐱)+1.\displaystyle n_{a}(\mathbf{x})+1.

From the definition of qaq_{a} in (26),

B(x,ra(𝐱))qa(𝐮)𝑑una(𝐱)+1.\displaystyle\int_{B(x,r_{a}(\mathbf{x}))}q_{a}(\mathbf{u})du\leq n_{a}(\mathbf{x})+1. (145)

Now we bound RaR_{a}. Similar to (89), let the random variable 𝐙\mathbf{Z} follow a distribution with pdf gg:

g(𝐳)=1MZ[(η(𝐳)ηa(𝐳))ϵ]d.\displaystyle g(\mathbf{z})=\frac{1}{M_{Z}[(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\vee\epsilon]^{d}}. (146)

The difference from the case with fixed kk is that in (89), ϵ=4b\epsilon=4b. However, now ba,t(𝐱)b_{a,t}(\mathbf{x}) varies with 𝐱\mathbf{x}, thus we do not determine ϵ\epsilon based on bb. Instead, for the adaptive nearest neighbor method, ϵ\epsilon will be determined after we obtain the final bound on RaR_{a}.

We show the following lemma.

Lemma 13.

There exists a constant C2C_{2}, such that

𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ)𝑑𝐱C2MZ𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u].\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon)d\mathbf{x}\leq C_{2}M_{Z}\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right]. (147)
Proof.

The proof of Lemma 13 is shown in Appendix I.7. ∎

We then bound the right hand side of (147).

Lemma 14.
𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]1MZ(ϵαd1lnT+Tϵ1+α).\displaystyle\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right]\lesssim\frac{1}{M_{Z}}\left(\epsilon^{\alpha-d-1}\ln T+T\epsilon^{1+\alpha}\right). (148)
Proof.

The proof of Lemma 14 is shown in Appendix I.8. ∎

From Lemma 13 and 14,

𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ)𝑑𝐱ϵαd1lnT+Tϵ1+α.\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon)d\mathbf{x}\lesssim\epsilon^{\alpha-d-1}\ln T+T\epsilon^{1+\alpha}. (149)

Moreover,

𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ)𝑑𝐱\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon)d\mathbf{x} \displaystyle\leq Tϵf(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ)𝑑𝐱\displaystyle T\epsilon\int f(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon)d\mathbf{x} (150)
\displaystyle\lesssim Tϵ1+α.\displaystyle T\epsilon^{1+\alpha}.

Therefore

Raϵαd1lnT+Tϵ1+α.\displaystyle R_{a}\lesssim\epsilon^{\alpha-d-1}\ln T+T\epsilon^{1+\alpha}. (151)

Note that we have not specified the value of ϵ\epsilon earlier, so in (151), ϵ\epsilon can take any value. We can then select ϵ\epsilon to minimize the right hand side of (151): let

ϵ(lnTT)1d+2,\displaystyle\epsilon\sim\left(\frac{\ln T}{T}\right)^{\frac{1}{d+2}}, (152)

then

RaT(TlnT)1+αd+2.\displaystyle R_{a}\lesssim T\left(\frac{T}{\ln T}\right)^{-\frac{1+\alpha}{d+2}}. (153)
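For completeness, the choice (152) equalizes the two terms in (151): \epsilon^{\alpha-d-1}\ln T=T\epsilon^{1+\alpha} is equivalent to \epsilon^{-(d+2)}\ln T=T, i.e. \epsilon=(\ln T/T)^{\frac{1}{d+2}}, under which both terms are of order T\left(\frac{T}{\ln T}\right)^{-\frac{1+\alpha}{d+2}}.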

The overall regret can then be bounded by summation over aa:

RT|𝒜|(TlnT)1+αd+2.\displaystyle R\lesssim T|\mathcal{A}|\left(\frac{T}{\ln T}\right)^{-\frac{1+\alpha}{d+2}}. (154)

Appendix H Proof of Theorem 6

Define

g(\mathbf{z})=\frac{1}{M_{Z}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))^{d}}\mathbf{1}\left(f(\mathbf{z})\geq\frac{1}{T},\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})>\epsilon(\mathbf{z})\right), (155)

in which

ϵ(𝐱)=(Tf(𝐱))1d+2,\displaystyle\epsilon(\mathbf{x})=(Tf(\mathbf{x}))^{-\frac{1}{d+2}}, (156)

and MZM_{Z} is the normalization constant, which ensures that g(𝐳)𝑑z=1\int g(\mathbf{z})dz=1. Let 𝐙\mathbf{Z} be a random variable with pdf gg. We then bound RaR_{a} for the case with unbounded support on the contexts, under Assumption 2 and 3.

Ra\displaystyle R_{a} =\displaystyle= 𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝑑𝐱\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})d\mathbf{x} (157)
=\displaystyle= 𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ(𝐱),f(𝐱)1T)𝑑𝐱\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon(\mathbf{x}),f(\mathbf{x})\geq\frac{1}{T}\right)d\mathbf{x}
+𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ(𝐱),f(𝐱)1T)𝑑𝐱\displaystyle+\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon(\mathbf{x}),f(\mathbf{x})\geq\frac{1}{T}\right)d\mathbf{x}
+𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(f(𝐱)<1T)𝑑𝐱\displaystyle+\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(f(\mathbf{x})<\frac{1}{T}\right)d\mathbf{x}
:=\displaystyle:= I1+I2+I3.\displaystyle I_{1}+I_{2}+I_{3}.

Now we bound three terms in (157) separately.

Bound of I1I_{1}. Following Lemma 7 and 13, it can be shown that there exists a constant C3C_{3} such that

𝒳(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)>ϵ(𝐱))𝑑𝐱C3MZ𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u].\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})>\epsilon(\mathbf{x}))d\mathbf{x}\leq C_{3}M_{Z}\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right]. (158)

The right hand side of (158) can be bounded as follows.

𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]\displaystyle\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right] (159)
\displaystyle\leq 32𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐙)ηa(𝐙))𝑑u]\displaystyle\frac{3}{2}\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z}))du\right]
\displaystyle\leq 32𝔼[(na(𝐙)+1)(η(𝐙)ηa(𝐙))]\displaystyle\frac{3}{2}\mathbb{E}\left[(n_{a}(\mathbf{Z})+1)(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z}))\right]
\displaystyle\leq 32MZ(na(𝐳)+1)(η(𝐳)ηa(𝐳))g(𝐳)𝑑z\displaystyle\frac{3}{2M_{Z}}\int(n_{a}(\mathbf{z})+1)(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))g(\mathbf{z})dz
\leq\frac{3}{2M_{Z}}\int\left(\frac{C_{1}\ln T}{\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})}+\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})\right)\frac{1}{(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))^{d}}\mathbf{1}\left(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})\geq\epsilon(\mathbf{z}),f(\mathbf{z})\geq\frac{1}{T}\right)dz
\lesssim\frac{\ln T}{M_{Z}}\int(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))^{-(d+1)}\mathbf{1}\left(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})\geq\epsilon(\mathbf{z}),f(\mathbf{z})\geq\frac{1}{T}\right)dz.

Hence

I_{1}\lesssim\ln T\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))^{-(d+1)}\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\geq\epsilon(\mathbf{x}),f(\mathbf{x})\geq\frac{1}{T}\right)d\mathbf{x}. (160)

Let c\in\mathbb{R} be a constant that will be determined later.

(η(𝐱)ηa(𝐱))(d+1)𝟏(η(𝐱)ηa(𝐱)ϵ(𝐱),f(𝐱)c)𝑑𝐱\displaystyle\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))^{-(d+1)}\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\geq\epsilon(\mathbf{x}),f(\mathbf{x})\geq c)d\mathbf{x} (161)
\displaystyle\leq 1c(η(𝐱)ηa(𝐱))(d+1)𝟏(η(𝐱)ηa(𝐱)(Tc)1d+2)f(𝐱)𝑑𝐱\displaystyle\frac{1}{c}\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))^{-(d+1)}\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\geq(Tc)^{-\frac{1}{d+2}}\right)f(\mathbf{x})d\mathbf{x}
=\displaystyle= 1c𝔼[(η(𝐗)ηa(𝐗))(d+1)𝟏(η(𝐗)ηa(𝐗)(Tc)1d+2)]\displaystyle\frac{1}{c}\mathbb{E}\left[(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X}))^{-(d+1)}\mathbf{1}\left(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})\geq(Tc)^{-\frac{1}{d+2}}\right)\right]
\displaystyle\leq 1c(Tc)d+1d+2(1αd+1)\displaystyle\frac{1}{c}(Tc)^{\frac{d+1}{d+2}\left(1-\frac{\alpha}{d+1}\right)}
=\displaystyle= T(Tc)α+1d+2.\displaystyle T(Tc)^{-\frac{\alpha+1}{d+2}}.

To bound the integral over the remaining region, i.e. 1/Tf(𝐱)<c1/T\leq f(\mathbf{x})<c, we use Lemma 9.

\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))^{-(d+1)}\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\geq\epsilon(\mathbf{x}),\frac{1}{T}\leq f(\mathbf{x})<c\right)d\mathbf{x} (164)
\displaystyle\leq ϵ(d+1)(𝐱)𝟏(1Tf(𝐱)<c)𝑑𝐱\displaystyle\int\epsilon^{-(d+1)}(\mathbf{x})\mathbf{1}\left(\frac{1}{T}\leq f(\mathbf{x})<c\right)d\mathbf{x}
=\displaystyle= Td+1d+2fd+1d+2(𝐱)𝟏(1Tf(𝐱)<c)𝑑𝐱\displaystyle T^{\frac{d+1}{d+2}}\int f^{\frac{d+1}{d+2}}(\mathbf{x})\mathbf{1}\left(\frac{1}{T}\leq f(\mathbf{x})<c\right)d\mathbf{x}
=\displaystyle= Td+1d+2𝔼[f1d+2(𝐗)𝟏(1Tf(𝐗)<c)]\displaystyle T^{\frac{d+1}{d+2}}\mathbb{E}\left[f^{-\frac{1}{d+2}}(\mathbf{X})\mathbf{1}\left(\frac{1}{T}\leq f(\mathbf{X})<c\right)\right]
\displaystyle\lesssim {Td+1d+2(T1d+2β+cβ1d+2)ifβ1d+2Td+1d+2ln(Tc)ifβ=1d+2.\displaystyle\left\{\begin{array}[]{ccc}T^{\frac{d+1}{d+2}}\left(T^{\frac{1}{d+2}-\beta}+c^{\beta-\frac{1}{d+2}}\right)&\text{if}&\beta\neq\frac{1}{d+2}\\ T^{\frac{d+1}{d+2}}\ln(Tc)&\text{if}&\beta=\frac{1}{d+2}.\end{array}\right.

From (161) and (164),

I1{TlnT[(Tc)α+1d+2+T1d+2cβ1d+2+Tβ]ifβ1d+2TlnT[(Tc)α+1d+2+T1d+2ln(Tc)]ifβ=1d+2.\displaystyle I_{1}\lesssim\left\{\begin{array}[]{ccc}T\ln T\left[(Tc)^{-\frac{\alpha+1}{d+2}}+T^{-\frac{1}{d+2}}c^{\beta-\frac{1}{d+2}}+T^{-\beta}\right]&\text{if}&\beta\neq\frac{1}{d+2}\\ T\ln T\left[(Tc)^{-\frac{\alpha+1}{d+2}}+T^{-\frac{1}{d+2}}\ln(Tc)\right]&\text{if}&\beta=\frac{1}{d+2}.\end{array}\right. (167)

To minimize (167), let

c=Tαα+(d+2)β,\displaystyle c=T^{-\frac{\alpha}{\alpha+(d+2)\beta}}, (168)

then

I1{T1(α+1)βα+(d+2)βlnT+T1βlnTifβ1d+2Td+1d+2ln2Tifβ=1d+2.\displaystyle I_{1}\lesssim\left\{\begin{array}[]{ccc}T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}\ln T+T^{1-\beta}\ln T&\text{if}&\beta\neq\frac{1}{d+2}\\ T^{\frac{d+1}{d+2}}\ln^{2}T&\text{if}&\beta=\frac{1}{d+2}.\end{array}\right. (171)
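For completeness, the choice (168) balances the first two bracketed terms in (167): Tc=T^{\frac{(d+2)\beta}{\alpha+(d+2)\beta}}, so (Tc)^{-\frac{\alpha+1}{d+2}}=T^{-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}, while T^{-\frac{1}{d+2}}c^{\beta-\frac{1}{d+2}}=T^{-\frac{1}{d+2}-\frac{\alpha(\beta-\frac{1}{d+2})}{\alpha+(d+2)\beta}}=T^{-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}; together with the T^{-\beta} term this yields (171).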

Bound of I2I_{2}. We still discuss f(𝐱)cf(\mathbf{x})\geq c and 1/Tf(𝐱)<c1/T\leq f(\mathbf{x})<c separately. For f(𝐱)cf(\mathbf{x})\geq c,

(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ(𝐱),f(𝐱)c)𝑑𝐱\displaystyle\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon(\mathbf{x}),f(\mathbf{x})\geq c)d\mathbf{x} (172)
\displaystyle\leq T(η(𝐱)ηa(𝐱))𝟏(η(𝐱)ηa(𝐱)(Tc)1d+2)f(𝐱)𝑑𝐱\displaystyle T\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq(Tc)^{-\frac{1}{d+2}}\right)f(\mathbf{x})d\mathbf{x}
\displaystyle\leq T(Tc)1d+2P(η(𝐗)ηa(𝐗)(Tc)1d+2)\displaystyle T(Tc)^{-\frac{1}{d+2}}\text{P}\left(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})\leq(Tc)^{-\frac{1}{d+2}}\right)
\displaystyle\lesssim T(Tc)1+αd+2.\displaystyle T(Tc)^{-\frac{1+\alpha}{d+2}}.

For 1/Tf(𝐱)<c1/T\leq f(\mathbf{x})<c,

(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(η(𝐱)ηa(𝐱)ϵ(𝐱),1Tf(𝐱)<c)𝑑𝐱\displaystyle\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq\epsilon(\mathbf{x}),\frac{1}{T}\leq f(\mathbf{x})<c\right)d\mathbf{x} (175)
\displaystyle\leq Tϵ(𝐱)f(𝐱)𝟏(1Tf(𝐱)<c)𝑑𝐱\displaystyle T\int\epsilon(\mathbf{x})f(\mathbf{x})\mathbf{1}\left(\frac{1}{T}\leq f(\mathbf{x})<c\right)d\mathbf{x}
=\displaystyle= Td+1d+2fd+1d+2(𝐱)𝟏(1Tf(𝐱)<c)𝑑𝐱\displaystyle T^{\frac{d+1}{d+2}}\int f^{\frac{d+1}{d+2}}(\mathbf{x})\mathbf{1}\left(\frac{1}{T}\leq f(\mathbf{x})<c\right)d\mathbf{x}
\displaystyle\lesssim {Td+1d+2(T1d+2β+cβ1d+2)ifβ1d+2Td+1d+2ln(Tc)ifβ=1d+2.\displaystyle\left\{\begin{array}[]{ccc}T^{\frac{d+1}{d+2}}(T^{\frac{1}{d+2}-\beta}+c^{\beta-\frac{1}{d+2}})&\text{if}&\beta\neq\frac{1}{d+2}\\ T^{\frac{d+1}{d+2}}\ln(Tc)&\text{if}&\beta=\frac{1}{d+2}.\end{array}\right.

Similar to I1I_{1}, pick c=Tα/(α+(d+2)β)c=T^{-\alpha/(\alpha+(d+2)\beta)}.

Bound of I3I_{3}. From Assumption 3(b), η(𝐱)ηa(𝐱)M\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})\leq M. Moreover, from Lemma 1, q_{a}(\mathbf{x})\leq Tf(\mathbf{x}) for almost all 𝐱𝒳\mathbf{x}\in\mathcal{X}. Hence

(η(𝐱)ηa(𝐱))qa(𝐱)𝟏(f(𝐱)<1T)𝑑𝐱MTf(𝐱)𝟏(f(𝐱)<1T)𝑑𝐱T1β.\displaystyle\int(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))q_{a}(\mathbf{x})\mathbf{1}\left(f(\mathbf{x})<\frac{1}{T}\right)d\mathbf{x}\leq MT\int f(\mathbf{x})\mathbf{1}\left(f(\mathbf{x})<\frac{1}{T}\right)d\mathbf{x}\lesssim T^{1-\beta}. (176)

Combining I1I_{1}, I2I_{2} and I3I_{3},

Ra{T1(α+1)βα+(d+2)βlnT+T1βlnTifβ1d+2Td+1d+2ln2Tifβ=1d+2.\displaystyle R_{a}\lesssim\left\{\begin{array}[]{ccc}T^{1-\frac{(\alpha+1)\beta}{\alpha+(d+2)\beta}}\ln T+T^{1-\beta}\ln T&\text{if}&\beta\neq\frac{1}{d+2}\\ T^{\frac{d+1}{d+2}}\ln^{2}T&\text{if}&\beta=\frac{1}{d+2}.\end{array}\right. (179)

Theorem 6 can then be proved by R=a𝒜RaR=\sum_{a\in\mathcal{A}}R_{a}.

Appendix I Proof of Lemmas

I.1 Proof of Lemma 4

From Assumption 2(c), WiW_{i} is subgaussian with parameter σ2\sigma^{2}. Therefore for any fixed set I{1,,T}I\subset\{1,\ldots,T\} with |I|=k|I|=k,

𝔼[exp(λiIWi)]exp(k2λ2σ2),\displaystyle\mathbb{E}\left[\exp\left(\lambda\sum_{i\in I}W_{i}\right)\right]\leq\exp\left(\frac{k}{2}\lambda^{2}\sigma^{2}\right), (180)

and

P(1kiIWi>u)\displaystyle\text{P}\left(\frac{1}{k}\sum_{i\in I}W_{i}>u\right) \displaystyle\leq infλeλu𝔼[exp(λkiIWi)]\displaystyle\inf_{\lambda}e^{-\lambda u}\mathbb{E}\left[\exp\left(\frac{\lambda}{k}\sum_{i\in I}W_{i}\right)\right] (181)
\displaystyle\leq infλeλuexp(λ2σ22k)\displaystyle\inf_{\lambda}e^{-\lambda u}\exp\left(\frac{\lambda^{2}\sigma^{2}}{2k}\right)
=\displaystyle= exp(ku22σ2).\displaystyle\exp\left(-\frac{ku^{2}}{2\sigma^{2}}\right).
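For clarity, the infimum in (181) is attained at \lambda=ku/\sigma^{2}, since -\lambda u+\frac{\lambda^{2}\sigma^{2}}{2k} is minimized at this value and then equals -\frac{ku^{2}}{\sigma^{2}}+\frac{ku^{2}}{2\sigma^{2}}=-\frac{ku^{2}}{2\sigma^{2}}.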

Now we need to give a union bound. (The construction of hyperplanes follows the proof of Lemma 3 in (Jiang, 2019) and Appendix H.5 in (Zhao & Wan, 2024).) Let AijA_{ij} be the (d-1)-dimensional hyperplane that bisects the segment between 𝐗i\mathbf{X}_{i} and 𝐗j\mathbf{X}_{j}. Then the number of planes is at most Np=T(T1)/2N_{p}=T(T-1)/2. Note that NpN_{p} planes divide a dd-dimensional space into at most Nr=j=0d(Npj)N_{r}=\sum_{j=0}^{d}\binom{N_{p}}{j} regions. Therefore

Nrj=0d(12T(T1)j)d(12T(T1))d<dT2d.\displaystyle N_{r}\leq\sum_{j=0}^{d}\binom{\frac{1}{2}T(T-1)}{j}\leq d\left(\frac{1}{2}T(T-1)\right)^{d}<dT^{2d}. (182)

The kk nearest neighbors are the same for all 𝐱\mathbf{x} within a region. Combining with the action space 𝒜\mathcal{A}, there are at most Nr|𝒜|N_{r}|\mathcal{A}| distinct neighbor sets \mathcal{N}_{t}(\mathbf{x},a). Hence

|{𝒩t(𝐱,a)|𝐱𝒳,a𝒜}|dT2d|𝒜|.\displaystyle|\{\mathcal{N}_{t}(\mathbf{x},a)|\mathbf{x}\in\mathcal{X},a\in\mathcal{A}\}|\leq dT^{2d}|\mathcal{A}|. (183)

Therefore

P(𝐱𝒳a𝒜{1k|i𝒩t(𝐱,a)Wi|>u})dT2d|𝒜|eku22σ2.\displaystyle\text{P}\left(\cup_{\mathbf{x}\in\mathcal{X}}\cup_{a\in\mathcal{A}}\left\{\frac{1}{k}\left|\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}\right|>u\right\}\right)\leq dT^{2d}|\mathcal{A}|e^{-\frac{ku^{2}}{2\sigma^{2}}}. (184)

The proof is complete.

I.2 Proof of Lemma 5

From (9), with |{i<t|Ai=a}|k|\{i<t|A_{i}=a\}|\geq k,

η^a,t(𝐱)\displaystyle\hat{\eta}_{a,t}(\mathbf{x}) =\displaystyle= 1ki𝒩t(𝐱,a)Yi+b+Lρa,t(𝐱)\displaystyle\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}Y_{i}+b+L\rho_{a,t}(\mathbf{x}) (185)
=\displaystyle= 1ki𝒩t(𝐱,a)ηa(𝐗i)+1ki𝒩t(𝐱,a)Wi+b+Lρa,t(𝐱).\displaystyle\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}\eta_{a}(\mathbf{X}_{i})+\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}+b+L\rho_{a,t}(\mathbf{x}).

Hence

|η^a,t(𝐱)(ηa(𝐱)+b+Lρa,t(𝐱))|\displaystyle|\hat{\eta}_{a,t}(\mathbf{x})-(\eta_{a}(\mathbf{x})+b+L\rho_{a,t}(\mathbf{x}))| \displaystyle\leq 1ki𝒩t(𝐱,a)|ηa(𝐗i)ηa(𝐱)|+|1ki𝒩t(𝐱,a)Wi|\displaystyle\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}|\eta_{a}(\mathbf{X}_{i})-\eta_{a}(\mathbf{x})|+\left|\frac{1}{k}\sum_{i\in\mathcal{N}_{t}(\mathbf{x},a)}W_{i}\right| (186)
\displaystyle\leq Lρa,t(𝐱)+b,\displaystyle L\rho_{a,t}(\mathbf{x})+b,

which comes from Assumption 2(d) and Lemma 4. Lemma 5 can then be proved using (186).

I.3 Proof of Lemma 6

We prove Lemma 6 by contradiction. If n(x,a,ra(𝐱))>kn(x,a,r_{a}(\mathbf{x}))>k, then let

t=max{τ|𝐗τ𝐱ra(𝐱),Aτ=a}\displaystyle t=\max\{\tau|\left\lVert\mathbf{X}_{\tau}-\mathbf{x}\right\rVert\leq r_{a}(\mathbf{x}),A_{\tau}=a\} (187)

be the last step falling in B(x,ra(𝐱))B(x,r_{a}(\mathbf{x})) with action aa. Then B(x,r_{a}(\mathbf{x}))\subseteq B(\mathbf{X}_{t},2r_{a}(\mathbf{x})), and thus there are at least kk previous samples with action aa in B(\mathbf{X}_{t},2r_{a}(\mathbf{x})). Therefore,

\rho_{a,t}(\mathbf{X}_{t})<2r_{a}(\mathbf{x}) (188)

Denote

a(𝐱)=argmax𝑎ηa(𝐱)\displaystyle a^{*}(\mathbf{x})=\underset{a}{\arg\max}\eta_{a}(\mathbf{x}) (189)

as the best action at context 𝐱\mathbf{x}. At=aA_{t}=a is selected only if the UCB of action aa is not less than the UCB of action a(𝐱)a^{*}(\mathbf{x}), i.e.

η^a,t(𝐗t)η^a(𝐱),t(𝐗t).\displaystyle\hat{\eta}_{a,t}(\mathbf{X}_{t})\geq\hat{\eta}_{a^{*}(\mathbf{x}),t}(\mathbf{X}_{t}). (190)

From Lemma 5,

η^a,t(𝐗t)ηa(𝐗t)+2b+2Lρa,t(𝐗t),\displaystyle\hat{\eta}_{a,t}(\mathbf{X}_{t})\leq\eta_{a}(\mathbf{X}_{t})+2b+2L\rho_{a,t}(\mathbf{X}_{t}), (191)

and

η^a(𝐱),t(𝐗t)ηa(𝐱)(𝐗t)=η(𝐗t).\displaystyle\hat{\eta}_{a^{*}(\mathbf{x}),t}(\mathbf{X}_{t})\geq\eta_{a^{*}(\mathbf{x})}(\mathbf{X}_{t})=\eta^{*}(\mathbf{X}_{t}). (192)

From (190), (191) and (192),

ηa(𝐗t)+2b+2Lρa,t(𝐗t)η(𝐗t),\displaystyle\eta_{a}(\mathbf{X}_{t})+2b+2L\rho_{a,t}(\mathbf{X}_{t})\geq\eta^{*}(\mathbf{X}_{t}), (193)

which yields

ρa,t(𝐗t)\displaystyle\rho_{a,t}(\mathbf{X}_{t}) \displaystyle\geq η(𝐗t)ηa(𝐗t)2b2L\displaystyle\frac{\eta^{*}(\mathbf{X}_{t})-\eta_{a}(\mathbf{X}_{t})-2b}{2L} (194)
\displaystyle\geq η(𝐱)ηa(𝐱)2b2Lra(𝐱)2L\displaystyle\frac{\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})-2b-2Lr_{a}(\mathbf{x})}{2L}
=\displaystyle= 2ra(𝐱),\displaystyle 2r_{a}(\mathbf{x}),

in which the last step comes from the definition of rar_{a} in (85). Note that (194) contradicts (188). Therefore n(x,a,ra(𝐱))kn(x,a,r_{a}(\mathbf{x}))\leq k. The proof of Lemma 6 is complete.

I.4 Proof of Lemma 7

𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]\displaystyle\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right] (195)
=\displaystyle= 𝒳g(𝐳)[B(z,ra(𝐳))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]𝑑z\displaystyle\int_{\mathcal{X}}g(\mathbf{z})\left[\int_{B(z,r_{a}(\mathbf{z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right]dz
(a)\displaystyle\overset{(a)}{\geq} 𝒳B(u,34ra(𝐮))g(𝐳)qa(𝐮)(η(𝐮)ηa(𝐮))𝑑z𝑑u\displaystyle\int_{\mathcal{X}}\int_{B(u,\frac{3}{4}r_{a}(\mathbf{u}))}g(\mathbf{z})q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))dzdu
\displaystyle\geq 𝒳[infzu3ra(𝐮)/4g(𝐳)(34)drad(𝐮)]qa(𝐮)(η(𝐮)ηa(𝐮))𝑑z𝑑u\displaystyle\int_{\mathcal{X}}\left[\underset{\left\lVert z-u\right\rVert\leq 3r_{a}(\mathbf{u})/4}{\inf}g(\mathbf{z})\cdot\left(\frac{3}{4}\right)^{d}r_{a}^{d}(\mathbf{u})\right]q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))dzdu
(b)\displaystyle\overset{(b)}{\geq} (34)d(45)d𝒳g(𝐮)rad(𝐮)qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u\displaystyle\left(\frac{3}{4}\right)^{d}\left(\frac{4}{5}\right)^{d}\int_{\mathcal{X}}g(\mathbf{u})r_{a}^{d}(\mathbf{u})q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du
=(c)\displaystyle\overset{(c)}{=} (35)d1MZ𝒳1[(η(𝐮)ηa(𝐮))ϵ]d(η(𝐮)ηa(𝐮)2b6L)dqa(𝐮)(η(𝐮)ηa(𝐮))𝑑u\displaystyle\left(\frac{3}{5}\right)^{d}\frac{1}{M_{Z}}\int_{\mathcal{X}}\frac{1}{[(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))\vee\epsilon]^{d}}\left(\frac{\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})-2b}{6L}\right)^{d}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du
(d)\displaystyle\overset{(d)}{\geq} (35)d1MZ𝒳𝟏(η(𝐮)ηa(𝐮)>ϵ)1(η(𝐮)ηa(𝐮))d(η(𝐮)ηa(𝐮)12L)dqa(𝐮)(η(𝐮)ηa(𝐮))𝑑u\displaystyle\left(\frac{3}{5}\right)^{d}\frac{1}{M_{Z}}\int_{\mathcal{X}}\mathbf{1}(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})>\epsilon)\frac{1}{(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))^{d}}\left(\frac{\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})}{12L}\right)^{d}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du
=\displaystyle= (35)d1(12L)dMZ𝒳qa(𝐮)(η(𝐮)ηa(𝐮))𝟏(η(𝐮)ηa(𝐮)>ϵ)𝑑u.\displaystyle\left(\frac{3}{5}\right)^{d}\frac{1}{(12L)^{d}M_{Z}}\int_{\mathcal{X}}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))\mathbf{1}(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})>\epsilon)du.

Based on (195), Lemma 7 holds with

C1=(53)d(12L)d.\displaystyle C_{1}=\left(\frac{5}{3}\right)^{d}(12L)^{d}. (196)

Now we explain some key steps in (195).

In (a), the order of integration is swapped. Note that if uB(z,ra(𝐳))u\in B(z,r_{a}(\mathbf{z})), then uzra(𝐳)\left\lVert u-z\right\rVert\leq r_{a}(\mathbf{z}). From Assumption 2(d), |ηa(𝐮)ηa(𝐳)|Lra(𝐳)|\eta_{a}(\mathbf{u})-\eta_{a}(\mathbf{z})|\leq Lr_{a}(\mathbf{z}). Then from (85),

ra(𝐮)\displaystyle r_{a}(\mathbf{u}) =\displaystyle= η(𝐮)ηa(𝐮)2b6L\displaystyle\frac{\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})-2b}{6L} (197)
\displaystyle\leq η(𝐳)ηa(𝐳)+2Lra(𝐳)2b6L\displaystyle\frac{\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})+2Lr_{a}(\mathbf{z})-2b}{6L}
=\displaystyle= ra(𝐳)+13ra(𝐳)\displaystyle r_{a}(\mathbf{z})+\frac{1}{3}r_{a}(\mathbf{z})
=\displaystyle= 43ra(𝐳),\displaystyle\frac{4}{3}r_{a}(\mathbf{z}),

thus zu3ra(𝐮)/4\left\lVert z-u\right\rVert\leq 3r_{a}(\mathbf{u})/4 implies uzra(𝐳)\left\lVert u-z\right\rVert\leq r_{a}(\mathbf{z}). Therefore (a) holds.

For (b) in (195), note that for zu3ra(𝐮)/4\left\lVert z-u\right\rVert\leq 3r_{a}(\mathbf{u})/4, using Assumption 2(d) again,

η(𝐳)ηa(𝐳)η(𝐮)ηa(𝐮)+L3ra(𝐮)4+L3ra(𝐮)4=η(𝐮)ηa(𝐮)+32Lra(𝐮).\displaystyle\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})\leq\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})+L\frac{3r_{a}(\mathbf{u})}{4}+L\frac{3r_{a}(\mathbf{u})}{4}=\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})+\frac{3}{2}Lr_{a}(\mathbf{u}). (198)

Then

g(𝐳)g(𝐮)\displaystyle\frac{g(\mathbf{z})}{g(\mathbf{u})} =\displaystyle= [(η(𝐮)ηa(𝐮))ϵ]d[(η(𝐳)ηa(𝐳))ϵ]d\displaystyle\frac{[(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))\vee\epsilon]^{d}}{[(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\vee\epsilon]^{d}} (199)
\displaystyle\geq [(η(𝐮)ηa(𝐮))ϵ]d[(η(𝐮)ηa(𝐮)+32Lra(𝐮))ϵ]d\displaystyle\frac{[(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))\vee\epsilon]^{d}}{\left[(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})+\frac{3}{2}Lr_{a}(\mathbf{u}))\vee\epsilon\right]^{d}}
\displaystyle\geq (45)d,\displaystyle\left(\frac{4}{5}\right)^{d},

in which the last step comes from the definition of rar_{a} in (85).

(c) uses (85) and the definition of gg in (89).

For (d), recall the statement of Lemma 7, ϵ=4b\epsilon=4b. Therefore, if η(𝐮)ηa(𝐮)>ϵ\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})>\epsilon, then η(𝐮)ηa(𝐮)2b>(η(𝐮)ηa(𝐮))/2\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})-2b>(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))/2.

I.5 Proof of Lemma 8

𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]\displaystyle\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right] (200)
(a)\displaystyle\overset{(a)}{\leq} 43𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐳)ηa(𝐳))𝑑u]\displaystyle\frac{4}{3}\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))du\right]
(b)\displaystyle\overset{(b)}{\leq} 43(k+1)𝔼[(η(𝐙)ηa(𝐙))]\displaystyle\frac{4}{3}(k+1)\mathbb{E}\left[(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z}))\right]
=\displaystyle= 43(k+1)𝒳(η(𝐳)ηa(𝐳))g(𝐳)𝑑z\displaystyle\frac{4}{3}(k+1)\int_{\mathcal{X}}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))g(\mathbf{z})dz
=\displaystyle= 4(k+1)3MZ𝒳(η(𝐳)ηa(𝐳))1[(η(𝐳)ηa(𝐳))ϵ]d𝑑z\displaystyle\frac{4(k+1)}{3M_{Z}}\int_{\mathcal{X}}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\frac{1}{[(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\vee\epsilon]^{d}}dz
=\displaystyle= 4(k+1)3MZ[𝒳(η(𝐳)ηa(𝐳))(d1)𝟏(η(𝐳)ηa(𝐳)>ϵ)dz\displaystyle\frac{4(k+1)}{3M_{Z}}\left[\int_{\mathcal{X}}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))^{-(d-1)}\mathbf{1}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})>\epsilon)dz\right.
+1ϵd𝒳(η(𝐳)ηa(𝐳))𝟏(η(𝐳)ηa(𝐳)<ϵ)dz].\displaystyle\left.+\frac{1}{\epsilon^{d}}\int_{\mathcal{X}}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\mathbf{1}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})<\epsilon)dz\right].

For (a), from Assumption 2(d),

η(𝐮)ηa(𝐮)\displaystyle\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}) \displaystyle\leq η(𝐳)ηa(𝐳)+2Lra(𝐳)\displaystyle\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})+2Lr_{a}(\mathbf{z}) (201)
=\displaystyle= η(𝐳)ηa(𝐳)+2Lη(𝐳)ηa(𝐳)2b6L\displaystyle\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})+2L\frac{\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})-2b}{6L}
\displaystyle\leq 43(η(𝐳)ηa(𝐳)).\displaystyle\frac{4}{3}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})).

(b) comes from (88).

The first term in the bracket in (200) can be bounded by

𝒳(η(𝐳)ηa(𝐳))(d1)𝟏(η(𝐳)ηa(𝐳)>ϵ)𝑑z\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))^{-(d-1)}\mathbf{1}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})>\epsilon)dz (202)
(a)\displaystyle\overset{(a)}{\leq} 1c𝒳(η(𝐳)ηa(𝐳))(d1)𝟏(η(𝐳)ηa(𝐳)>ϵ)f(𝐳)𝑑z\displaystyle\frac{1}{c}\int_{\mathcal{X}}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))^{-(d-1)}\mathbf{1}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})>\epsilon)f(\mathbf{z})dz
=(b)\displaystyle\overset{(b)}{=} 1c𝔼[(η(𝐗)ηa(𝐗))(d1)𝟏(η(𝐗)ηa(𝐗)>ϵ)]\displaystyle\frac{1}{c}\mathbb{E}\left[(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X}))^{-(d-1)}\mathbf{1}(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})>\epsilon)\right]
=\displaystyle= 1c0P(ϵ<η(𝐗)ηa(𝐗)<t1d1)𝑑t\displaystyle\frac{1}{c}\int_{0}^{\infty}\text{P}(\epsilon<\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})<t^{-\frac{1}{d-1}})dt
\displaystyle\leq 1c0ϵ(d1)P(η(𝐗)ηa(𝐗)<t1d1)𝑑t.\displaystyle\frac{1}{c}\int_{0}^{\epsilon^{-(d-1)}}\text{P}(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})<t^{-\frac{1}{d-1}})dt.

(a) comes from Assumption 2, which requires that f(𝐱)cf(\mathbf{x})\geq c all over the support. In (b), the random variable XX follows a distribution with pdf ff.

If d>α+1d>\alpha+1, then from Assumption 2(b),

(202)Cαc0ϵ(d1)tαd1𝑑t=Cα(d1)c(d1α)ϵα+1d.\displaystyle\eqref{eq:psmallt}\leq\frac{C_{\alpha}}{c}\int_{0}^{\epsilon^{-(d-1)}}t^{-\frac{\alpha}{d-1}}dt=\frac{C_{\alpha}(d-1)}{c(d-1-\alpha)}\epsilon^{\alpha+1-d}. (203)

If d=α+1d=\alpha+1, then

(202)1c01𝑑t+Cαc1ϵ(d1)tαd1𝑑t=1c+Cα(d1)cln1ϵ.\displaystyle\eqref{eq:psmallt}\leq\frac{1}{c}\int_{0}^{1}dt+\frac{C_{\alpha}}{c}\int_{1}^{\epsilon^{-(d-1)}}t^{-\frac{\alpha}{d-1}}dt=\frac{1}{c}+\frac{C_{\alpha}(d-1)}{c}\ln\frac{1}{\epsilon}. (204)

If d<α+1d<\alpha+1, then

(202)1c01𝑑t+Cαc1ϵ(d1)tαd1𝑑t1c+Cα(d1)c(α+1d).\displaystyle\eqref{eq:psmallt}\leq\frac{1}{c}\int_{0}^{1}dt+\frac{C_{\alpha}}{c}\int_{1}^{\epsilon^{-(d-1)}}t^{-\frac{\alpha}{d-1}}dt\leq\frac{1}{c}+\frac{C_{\alpha}(d-1)}{c(\alpha+1-d)}. (205)

Now it remains to bound the second term in (200):

𝒳(η(𝐳)ηa(𝐳))𝟏(η(𝐳)ηa(𝐳)<ϵ)𝑑z\displaystyle\int_{\mathcal{X}}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\mathbf{1}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})<\epsilon)dz \displaystyle\leq 1c𝔼[(η(𝐗)ηa(𝐗))𝟏(η(𝐗)ηa(𝐗)<ϵ)]\displaystyle\frac{1}{c}\mathbb{E}[(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X}))\mathbf{1}(\eta^{*}(\mathbf{X})-\eta_{a}(\mathbf{X})<\epsilon)] (206)
\displaystyle\leq Cαcϵα+1.\displaystyle\frac{C_{\alpha}}{c}\epsilon^{\alpha+1}.

Therefore, from (200), (203), (204), (205) and (206),

𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]{kMZcϵα+1difd>α+1kMZcln1ϵifd=α+1kMZcifd<α+1.\displaystyle\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right]\lesssim\left\{\begin{array}[]{ccc}\frac{k}{M_{Z}c}\epsilon^{\alpha+1-d}&\text{if}&d>\alpha+1\\ \frac{k}{M_{Z}c}\ln\frac{1}{\epsilon}&\text{if}&d=\alpha+1\\ \frac{k}{M_{Z}c}&\text{if}&d<\alpha+1.\end{array}\right. (210)

I.6 Proof of Lemma 12

We prove Lemma 12 by contradiction. Suppose that n(x,a,r_{a}(\mathbf{x}))>n_{a}(\mathbf{x}). Let t be the last time step at which action a is taken and the context falls in B(x,r_{a}(\mathbf{x})), i.e.

t:=max{j|𝐗j𝐱ra(𝐱),Aj=a}.\displaystyle t:=\max\left\{j|\left\lVert\mathbf{X}_{j}-\mathbf{x}\right\rVert\leq r_{a}(\mathbf{x}),A_{j}=a\right\}. (211)

We first show that \rho_{a,t}(\mathbf{x})\leq 2r_{a}(\mathbf{x}). From (211) and the condition n(x,a,r_{a}(\mathbf{x}))>n_{a}(\mathbf{x}), before time step t there are already at least n_{a}(\mathbf{x}) samples with action a in B(x,r_{a}(\mathbf{x})). Note that \left\lVert\mathbf{X}_{t}-\mathbf{x}\right\rVert\leq r_{a}(\mathbf{x}), thus B(x,r_{a}(\mathbf{x}))\subseteq B(\mathbf{X}_{t},2r_{a}(\mathbf{x})). Therefore, there are already at least n_{a}(\mathbf{x}) samples with action a in B(\mathbf{X}_{t},2r_{a}(\mathbf{x})). Recall that \rho_{a,t}(\mathbf{x})=\rho_{a,t,k_{a,t}(\mathbf{x})}(\mathbf{x}). If \rho_{a,t}(\mathbf{x})>2r_{a}(\mathbf{x}), then k_{a,t}(\mathbf{x})>n_{a}(\mathbf{x}). From (16),

Lρa,t(𝐱)=Lρa,t,ka,t(𝐱)(𝐱)lnTka,t(𝐱)lnTna(𝐱)=2Lra(𝐱),\displaystyle L\rho_{a,t}(\mathbf{x})=L\rho_{a,t,k_{a,t}(\mathbf{x})}(\mathbf{x})\leq\sqrt{\frac{\ln T}{k_{a,t}(\mathbf{x})}}\leq\sqrt{\frac{\ln T}{n_{a}(\mathbf{x})}}=2Lr_{a}(\mathbf{x}), (212)

which contradicts the assumption \rho_{a,t}(\mathbf{x})>2r_{a}(\mathbf{x}). Therefore \rho_{a,t}(\mathbf{x})\leq 2r_{a}(\mathbf{x}).

From Lemma 11, under EE,

η^a,t(𝐱)ηa(𝐱)+22σ2na(𝐱)ln(dT2d+3|𝒜|)+2Lra(𝐱).\displaystyle\hat{\eta}_{a,t}(\mathbf{x})\leq\eta_{a}(\mathbf{x})+2\sqrt{\frac{2\sigma^{2}}{n_{a}(\mathbf{x})}\ln(dT^{2d+3}|\mathcal{A}|)}+2Lr_{a}(\mathbf{x}). (213)

Since action aa is selected at time tt, from Lemma 11,

η^a,t(𝐱)η^a(𝐱),t(𝐱)η(𝐱).\displaystyle\hat{\eta}_{a,t}(\mathbf{x})\geq\hat{\eta}_{a^{*}(\mathbf{x}),t}(\mathbf{x})\geq\eta^{*}(\mathbf{x}). (214)

Combining (213) and (214) yields

22σ2na(𝐱)ln(dT2d+3|𝒜|)+2Lra(𝐱)η(𝐱)ηa(𝐱).\displaystyle 2\sqrt{\frac{2\sigma^{2}}{n_{a}(\mathbf{x})}\ln(dT^{2d+3}|\mathcal{A}|)}+2Lr_{a}(\mathbf{x})\geq\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}). (215)

We now derive an inequality that contradicts (215). From (141) and (142),

22σ2na(𝐱)ln(dT2d+3|𝒜|)\displaystyle 2\sqrt{\frac{2\sigma^{2}}{n_{a}(\mathbf{x})}\ln(dT^{2d+3}|\mathcal{A}|)} =\displaystyle= 22σ2C1lnTln(dT2d+3|𝒜|)(η(𝐱)ηa(𝐱))\displaystyle 2\sqrt{\frac{2\sigma^{2}}{C_{1}\ln T}\ln(dT^{2d+3}|\mathcal{A}|)}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})) (216)
\displaystyle\leq 12ln(dT2d+3|𝒜|)(2d+3+ln(d|𝒜|))lnT(η(𝐱)ηa(𝐱))\displaystyle\frac{1}{2}\sqrt{\frac{\ln(dT^{2d+3}|\mathcal{A}|)}{(2d+3+\ln(d|\mathcal{A}|))\ln T}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))
<\displaystyle< 12(η(𝐱)ηa(𝐱)).\displaystyle\frac{1}{2}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})).

From (140),

2Lra(𝐱)=1C1(η(𝐱)ηa(𝐱))12(η(𝐱)ηa(𝐱)).\displaystyle 2Lr_{a}(\mathbf{x})=\frac{1}{\sqrt{C_{1}}}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}))\leq\frac{1}{2}(\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x})). (217)

From (216) and (217),

22σ2na(𝐱)ln(dT2d+3|𝒜|)+2Lra(𝐱)<η(𝐱)ηa(𝐱).\displaystyle 2\sqrt{\frac{2\sigma^{2}}{n_{a}(\mathbf{x})}\ln(dT^{2d+3}|\mathcal{A}|)}+2Lr_{a}(\mathbf{x})<\eta^{*}(\mathbf{x})-\eta_{a}(\mathbf{x}). (218)

(218) contradicts (215). Hence

n(x,a,ra(𝐱))na(𝐱).\displaystyle n(x,a,r_{a}(\mathbf{x}))\leq n_{a}(\mathbf{x}). (219)
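
For intuition about the quantities appearing in (212)-(215), the sketch below (added by the editor; it is not the paper's pseudocode, and all names and arguments are hypothetical placeholders) illustrates one plausible reading of the adaptive choice of k satisfying L\rho_{k}\leq\sqrt{\ln T/k} as in (16), together with a UCB-style index combining a sample mean, a noise term of the form appearing in (213) (up to constant factors), and a bias term proportional to the k-NN radius.

```python
# Editor-added illustration, not the paper's algorithm; names are hypothetical.
import numpy as np

def adaptive_k(sorted_dists, L, T):
    """Return the largest k with L * rho_k <= sqrt(ln T / k), where sorted_dists[k-1]
    is the distance from the query context to its k-th nearest previous sample with
    the same action; defaults to k = 1 if no k qualifies (a simplification)."""
    k_sel = 1
    for k, rho_k in enumerate(sorted_dists, start=1):
        if L * rho_k <= np.sqrt(np.log(T) / k):
            k_sel = k
    return k_sel

def ucb_index(rewards_of_k_neighbors, rho, sigma, L, T, d, num_arms):
    """Sample mean of the k nearest rewards, plus a noise term
    sqrt(2*sigma^2/k * ln(d*T^{2d+3}*|A|)) and a bias term L*rho."""
    k = len(rewards_of_k_neighbors)
    noise = np.sqrt(2 * sigma**2 / k * np.log(d * T ** (2 * d + 3) * num_arms))
    return np.mean(rewards_of_k_neighbors) + noise + L * rho

# e.g. adaptive_k([0.05, 0.12, 0.30, 0.70], L=3.0, T=1000) returns 3
```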

I.7 Proof of Lemma 13

𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]\displaystyle\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right] (220)
(a)\displaystyle\overset{(a)}{\geq} 𝒳B(u,2ra(𝐮)/3)g(𝐳)qa(𝐮)(η(𝐮)ηa(𝐮))𝑑z𝑑u\displaystyle\int_{\mathcal{X}}\int_{B(u,2r_{a}(\mathbf{u})/3)}g(\mathbf{z})q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))dzdu
\displaystyle\geq 𝒳(infzu2ra(𝐮)/3g(𝐳))(23)drad(𝐮)qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u\displaystyle\int_{\mathcal{X}}\left(\underset{\left\lVert z-u\right\rVert\leq 2r_{a}(\mathbf{u})/3}{\inf}g(\mathbf{z})\right)\left(\frac{2}{3}\right)^{d}r_{a}^{d}(\mathbf{u})q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du
(b)\displaystyle\overset{(b)}{\geq} (23)d(34)d𝒳g(𝐮)rad(𝐮)qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u\displaystyle\left(\frac{2}{3}\right)^{d}\left(\frac{3}{4}\right)^{d}\int_{\mathcal{X}}g(\mathbf{u})r_{a}^{d}(\mathbf{u})q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du
\displaystyle= \frac{1}{2^{d}M_{Z}}\int_{\mathcal{X}}\frac{1}{[(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))\vee\epsilon]^{d}}r_{a}^{d}(\mathbf{u})q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du
\displaystyle\geq \frac{1}{2^{d}M_{Z}}\int_{\mathcal{X}}\mathbf{1}(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})>\epsilon)\frac{1}{(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))^{d}}\frac{(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))^{d}}{(4L)^{d}}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du
\displaystyle\geq \frac{1}{2^{3d}L^{d}M_{Z}}\int_{\mathcal{X}}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))\mathbf{1}(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})>\epsilon)du.

For (a), if uzra(𝐳)\left\lVert u-z\right\rVert\leq r_{a}(\mathbf{z}), then from the definition of rar_{a} in (140),

ra(𝐮)ra(𝐳)=η(𝐮)ηa(𝐮)η(𝐳)ηa(𝐳)η(𝐳)ηa(𝐳)+2Lra(𝐳)η(𝐳)ηa(𝐳)=1+1C132.\displaystyle\frac{r_{a}(\mathbf{u})}{r_{a}(\mathbf{z})}=\frac{\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})}{\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})}\leq\frac{\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})+2Lr_{a}(\mathbf{z})}{\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})}=1+\frac{1}{\sqrt{C_{1}}}\leq\frac{3}{2}. (221)

For (b),

g(𝐳)g(𝐮)\displaystyle\frac{g(\mathbf{z})}{g(\mathbf{u})} =\displaystyle= [(η(𝐮)ηa(𝐮))ϵ]d[(η(𝐳)ηa(𝐳))ϵ]d\displaystyle\frac{[(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))\vee\epsilon]^{d}}{[(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\vee\epsilon]^{d}} (222)
\displaystyle\geq [(η(𝐮)ηa(𝐮))ϵ]d[(η(𝐮)ηa(𝐮)+43Lra(𝐮))ϵ]d\displaystyle\frac{[(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))\vee\epsilon]^{d}}{\left[(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u})+\frac{4}{3}Lr_{a}(\mathbf{u}))\vee\epsilon\right]^{d}}
\displaystyle\geq (34)d.\displaystyle\left(\frac{3}{4}\right)^{d}.
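
The constant bookkeeping in the chain of (220) can be checked directly (an arithmetic check added by the editor for illustration; the values of d and L are arbitrary):

```python
# Editor-added arithmetic check (not in the paper): the constants collected in (220),
# namely (2/3)^d * (3/4)^d = 2^{-d} and 2^{-d} * (4L)^{-d} = 2^{-3d} * L^{-d}.
d, L = 4, 2.5
print((2 / 3) ** d * (3 / 4) ** d, 2.0 ** (-d))                      # both 0.0625
print(2.0 ** (-d) * (4 * L) ** (-d), 2.0 ** (-3 * d) * L ** (-d))    # both 6.25e-06
```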

I.8 Proof of Lemma 14

𝔼[B(𝐙,ra(𝐙))qa(𝐮)(η(𝐮)ηa(𝐮))𝑑u]\displaystyle\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}))du\right] (223)
\displaystyle\overset{(a)}{\leq} \frac{3}{2}\mathbb{E}\left[\int_{B(\mathbf{Z},r_{a}(\mathbf{Z}))}q_{a}(\mathbf{u})(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z}))du\right]
\displaystyle\leq 32𝔼[((na(𝐙)+1)(Tf(𝐙)rad(𝐙)))(η(𝐙)ηa(𝐙))]\displaystyle\frac{3}{2}\mathbb{E}\left[((n_{a}(\mathbf{Z})+1)\wedge(Tf(\mathbf{Z})r_{a}^{d}(\mathbf{Z})))(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z}))\right]
=\displaystyle= 32((na(𝐳)+1)(Tf(𝐳)rad(𝐳)))(η(𝐳)ηa(𝐳))1MZ[(η(𝐳)ηa(𝐳))ϵ]d𝑑z\displaystyle\frac{3}{2}\int((n_{a}(\mathbf{z})+1)\wedge(Tf(\mathbf{z})r_{a}^{d}(\mathbf{z})))(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\frac{1}{M_{Z}[(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\vee\epsilon]^{d}}dz
\displaystyle\leq \frac{3}{2M_{Z}}\left[\int\left(\frac{C_{1}\ln T}{\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})}+\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})\right)\frac{1}{(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))^{d}}\mathbf{1}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})>\epsilon)dz\right.
+Tf(𝐳)rad(𝐳)(η(𝐳)ηa(𝐳))1ϵd𝟏(η(𝐳)ηa(𝐳)ϵ)dz]\displaystyle\left.+\int Tf(\mathbf{z})r_{a}^{d}(\mathbf{z})(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))\frac{1}{\epsilon^{d}}\mathbf{1}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})\leq\epsilon)dz\right]
\displaystyle\lesssim 1MZ𝔼[(η(𝐙)ηa(𝐙))(d+1)𝟏(η(𝐙)ηa(𝐙)>ϵ)]lnT\displaystyle\frac{1}{M_{Z}}\mathbb{E}\left[(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z}))^{-(d+1)}\mathbf{1}(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z})>\epsilon)\right]\ln T
+Tϵd𝔼[(η(𝐙)ηa(𝐙))d+1𝟏(η(𝐙)ηa(𝐙)ϵ)]\displaystyle+\frac{T}{\epsilon^{d}}\mathbb{E}\left[(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z}))^{d+1}\mathbf{1}(\eta^{*}(\mathbf{Z})-\eta_{a}(\mathbf{Z})\leq\epsilon)\right]
\displaystyle\lesssim 1MZ(ϵαd1lnT+Tϵ1+α).\displaystyle\frac{1}{M_{Z}}\left(\epsilon^{\alpha-d-1}\ln T+T\epsilon^{1+\alpha}\right).

For (a),

η(𝐮)ηa(𝐮)\displaystyle\eta^{*}(\mathbf{u})-\eta_{a}(\mathbf{u}) \displaystyle\leq η(𝐳)ηa(𝐳)+2Lra(𝐳)\displaystyle\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})+2Lr_{a}(\mathbf{z}) (224)
\displaystyle\leq η(𝐳)ηa(𝐳)+1C1(η(𝐳)ηa(𝐳))\displaystyle\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})+\frac{1}{\sqrt{C_{1}}}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z}))
\displaystyle\leq 32(η(𝐳)ηa(𝐳)).\displaystyle\frac{3}{2}(\eta^{*}(\mathbf{z})-\eta_{a}(\mathbf{z})).
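
As a quick numerical illustration of the final bound in (223) (added by the editor; the values of \alpha, d and T are arbitrary), the two terms \epsilon^{\alpha-d-1}\ln T and T\epsilon^{1+\alpha} balance, up to constants, at \epsilon\approx(\ln T/T)^{1/(d+2)}:

```python
# Editor-added illustration (not in the paper): balancing the two terms of the bound
# eps^{alpha-d-1} * ln T + T * eps^{1+alpha} in (223) at eps ~ (ln T / T)^{1/(d+2)}.
import numpy as np

alpha, d, T = 1.0, 2, 10**6
eps_grid = np.logspace(-4, 0, 100_000)
bound = eps_grid ** (alpha - d - 1) * np.log(T) + T * eps_grid ** (1 + alpha)
eps_balanced = (np.log(T) / T) ** (1 / (d + 2))
print(eps_grid[bound.argmin()], eps_balanced)   # both are approximately 0.061
```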

I.9 Proof of Lemma 9

𝔼[fp(𝐗)𝟏(af(𝐗)<b)]\displaystyle\mathbb{E}[f^{-p}(\mathbf{X})\mathbf{1}(a\leq f(\mathbf{X})<b)] =\displaystyle= 0P(f(𝐗)<t1p,af(𝐗)<b)𝑑t\displaystyle\int_{0}^{\infty}\text{P}(f(\mathbf{X})<t^{-\frac{1}{p}},a\leq f(\mathbf{X})<b)dt (225)
\displaystyle\leq \int_{0}^{b^{-p}}\text{P}(f(\mathbf{X})<b)dt+\int_{b^{-p}}^{a^{-p}}\text{P}(f(\mathbf{X})<t^{-\frac{1}{p}})dt
\displaystyle\leq Cβbβp+Cβbpaptβp𝑑t.\displaystyle C_{\beta}b^{\beta-p}+C_{\beta}\int_{b^{-p}}^{a^{-p}}t^{-\frac{\beta}{p}}dt.

If p>βp>\beta, i.e. β/p<1\beta/p<1, then

(225)Cβbβp+Cβ1β/p(ap)1βpaβp.\displaystyle\eqref{eq:star}\leq C_{\beta}b^{\beta-p}+\frac{C_{\beta}}{1-\beta/p}(a^{-p})^{1-\frac{\beta}{p}}\sim a^{\beta-p}. (226)

If p<βp<\beta, then

(225)Cβbβp+Cβbptβp𝑑t=Cβbβp+Cββ/p1bβpbβp.\displaystyle\eqref{eq:star}\leq C_{\beta}b^{\beta-p}+C_{\beta}\int_{b^{-p}}^{\infty}t^{-\frac{\beta}{p}}dt=C_{\beta}b^{\beta-p}+\frac{C_{\beta}}{\beta/p-1}b^{\beta-p}\sim b^{\beta-p}. (227)

If p=βp=\beta, then

(225)Cβbβp+Cβlnapbplnba.\displaystyle\eqref{eq:star}\leq C_{\beta}b^{\beta-p}+C_{\beta}\ln\frac{a^{-p}}{b^{-p}}\sim\ln\frac{b}{a}. (228)
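
As a sanity check of (225)-(226) (added by the editor for illustration; the distribution and parameter values are chosen only as an example), take \mathbf{X}\sim\text{Exp}(1), so that f(x)=e^{-x} and \text{P}(f(\mathbf{X})<t)=t for t\in(0,1], i.e. \beta=1 and C_{\beta}=1; then \mathbb{E}[f^{-p}(\mathbf{X})\mathbf{1}(a\leq f(\mathbf{X})<b)]=(a^{1-p}-b^{1-p})/(p-1) for p\neq 1, which is of order a^{\beta-p} when p>\beta, as (226) predicts.

```python
# Editor-added Monte Carlo check (not in the paper) of the scaling in (226).
import numpy as np

rng = np.random.default_rng(0)
p, a, b = 2.0, 0.01, 0.5                  # arbitrary example values with p > beta = 1
x = rng.exponential(size=10**6)
fx = np.exp(-x)                           # f(X) for the Exp(1) density f(x) = e^{-x}
monte_carlo = np.mean(fx ** (-p) * ((fx >= a) & (fx < b)))
exact = (a ** (1 - p) - b ** (1 - p)) / (p - 1)
print(monte_carlo, exact, a ** (1 - p))   # Monte Carlo estimate, exact value, order a^{beta-p}
```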

I.10 Proof of Lemma 3

Our proof follows the proof of Lemma 3.1 in (Rigollet & Zeevi, 2010). Define

U=t=1T(η(𝐗t)ηAt(𝐗t)),\displaystyle U=\sum_{t=1}^{T}(\eta^{*}(\mathbf{X}_{t})-\eta_{A_{t}}(\mathbf{X}_{t})), (229)

and

V=t=1T𝟏(ηAt(𝐗t)<η(𝐗t)).\displaystyle V=\sum_{t=1}^{T}\mathbf{1}(\eta_{A_{t}}(\mathbf{X}_{t})<\eta^{*}(\mathbf{X}_{t})). (230)

Then R=𝔼[U]R=\mathbb{E}[U] and S=𝔼[V]S=\mathbb{E}[V]. For any δ>0\delta>0,

U\displaystyle U \displaystyle\geq δt=1T𝟏(ηAt(𝐗t)<η(𝐗t))𝟏(|η(𝐗t)ηAt(𝐗t)|>δ)\displaystyle\delta\sum_{t=1}^{T}\mathbf{1}(\eta_{A_{t}}(\mathbf{X}_{t})<\eta^{*}(\mathbf{X}_{t}))\mathbf{1}(|\eta^{*}(\mathbf{X}_{t})-\eta_{A_{t}}(\mathbf{X}_{t})|>\delta) (231)
\displaystyle\geq δ[Vt=1T𝟏(Ata(𝐗t),|η(𝐗t)ηAt(𝐗t)|δ)].\displaystyle\delta\left[V-\sum_{t=1}^{T}\mathbf{1}\left(A_{t}\neq a^{*}(\mathbf{X}_{t}),|\eta^{*}(\mathbf{X}_{t})-\eta_{A_{t}}(\mathbf{X}_{t})|\leq\delta\right)\right].

Taking expectations on both sides, we have

RδSTCαδα+1.\displaystyle R\geq\delta S-TC_{\alpha}\delta^{\alpha+1}. (232)

We now choose \delta to maximize the right-hand side of (232). Setting its derivative with respect to \delta to zero, we take

δ=(S(α+1)TCα)1α.\displaystyle\delta=\left(\frac{S}{(\alpha+1)TC_{\alpha}}\right)^{\frac{1}{\alpha}}. (233)

Then

R\displaystyle R \displaystyle\geq S(S(α+1)TCα)1αTCα(S(α+1)TCα)α+1α\displaystyle S\left(\frac{S}{(\alpha+1)TC_{\alpha}}\right)^{\frac{1}{\alpha}}-TC_{\alpha}\left(\frac{S}{(\alpha+1)TC_{\alpha}}\right)^{\frac{\alpha+1}{\alpha}} (234)
=\displaystyle= αSα+1(S(α+1)TCα)1α\displaystyle\frac{\alpha S}{\alpha+1}\left(\frac{S}{(\alpha+1)TC_{\alpha}}\right)^{\frac{1}{\alpha}}
=\displaystyle= C0Sα+1αT1α.\displaystyle C_{0}S^{\frac{\alpha+1}{\alpha}}T^{-\frac{1}{\alpha}}.

The proof is complete.
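
As a numerical check of this optimization step (added by the editor for illustration; the parameter values are arbitrary), the choice of \delta in (233) indeed maximizes the right-hand side of (232), and the maximum matches (234):

```python
# Editor-added numerical check (not in the paper): delta from (233) maximizes
# delta*S - T*C_alpha*delta^{alpha+1}, and the maximum equals
# alpha*S/(alpha+1) * (S/((alpha+1)*T*C_alpha))^{1/alpha} as in (234).
import numpy as np

alpha, T, C_alpha, S = 1.5, 10**4, 1.0, 200.0
delta_star = (S / ((alpha + 1) * T * C_alpha)) ** (1 / alpha)
delta_grid = np.linspace(1e-6, 1.0, 100_000)
lower_bound = delta_grid * S - T * C_alpha * delta_grid ** (alpha + 1)
closed_form = alpha * S / (alpha + 1) * (S / ((alpha + 1) * T * C_alpha)) ** (1 / alpha)
# maximizer on the grid vs (233), and grid maximum vs (234)
print(delta_star, delta_grid[lower_bound.argmax()], lower_bound.max(), closed_form)
```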