
Approximation Theory Based Methods for RKHS Bandits

Sho Takemori, Masahiro Sato
sho.takemori.py@fujifilm.com, takemorisho@gmail.com
Abstract

The RKHS bandit problem (also called the kernelized multi-armed bandit problem) is an online optimization problem of non-linear functions with noisy feedback. Although the problem has been extensively studied, there are unsatisfactory results for some problems compared to the well-studied linear bandit case. Specifically, there is no general algorithm for the adversarial RKHS bandit problem. In addition, the high computational complexity of existing algorithms hinders practical application. We address these issues by considering a novel amalgamation of approximation theory and the misspecified linear bandit problem. Using an approximation method, we propose efficient algorithms for the stochastic RKHS bandit problem and the first general algorithm for the adversarial RKHS bandit problem. Furthermore, we empirically show that one of our proposed methods has cumulative regret comparable to IGP-UCB with a much shorter running time.

1 Introduction

The RKHS bandit problem (also called the kernelized multi-armed bandit problem) is an online optimization problem of non-linear functions with noisy feedback. Srinivas et al. (2010) studied a multi-armed bandit problem where the reward function belongs to the reproducing kernel Hilbert space (RKHS) associated with a kernel. In this paper, we call this problem the (stochastic) RKHS bandit problem. Although the problem has been studied extensively, some issues are not completely solved yet. In this paper, we mainly focus on two issues: the non-existence of general algorithms for the adversarial RKHS bandit problem and the high computational complexity of stochastic RKHS bandit algorithms.

First, as a non-linear generalization of the classical adversarial linear bandit problem, Chatterji et al. (2019) proposed the adversarial RKHS bandit problem, where a learner interacts with an arbitrary sequence of functions from the RKHS with bounded norms. However, they only consider the kernel loss, i.e., a loss function of the form x\mapsto K(x,x_{0}), where x_{0} is a fixed point. Considering that functions in the RKHS can be represented as infinite linear combinations of such functions, the kernel loss is a very special function in the RKHS. Therefore, there are no algorithms for the adversarial RKHS bandit problem with general loss (or reward) functions.

Next, we discuss the efficiency of existing methods for the stochastic RKHS bandit problem. We note that most existing methods achieve their regret guarantees at the cost of high computational complexity. For example, IGP-UCB (Chowdhury & Gopalan, 2017) requires a matrix-vector multiplication of size t for each arm at each round t=1,\dots,T. Therefore, its total computational complexity up to round T is O(|\mathcal{A}|T^{3}), where \mathcal{A} is the set of arms. To address the issue, Calandriello et al. (2020) proposed BBKB and proved that its total computational complexity is \widetilde{O}(|\mathcal{A}|T\gamma_{T}^{2}+\gamma_{T}^{4}), where \mathcal{A}\subset\Omega is the set of arms, \Omega is a subset of a Euclidean space \mathbb{R}^{d}, and \gamma_{T} is the maximum information gain (Srinivas et al., 2010). If the kernel is a squared exponential kernel, then since \gamma_{T}=\widetilde{O}(\log^{d}(T)) (Srinivas et al., 2010), BBKB's computational complexity is nearly linear in T, ignoring the polylogarithmic factor. (In this paper, we use the \widetilde{O} notation to ignore \log^{c}(T) factors, where c is a universal constant.) However, the coefficient |\mathcal{A}| in the first term is large in general.

In this paper, we address these two issues by considering a novel amalgamation of approximation theory (Wendland, 2004) and the misspecified linear bandit problem (Lattimore et al., 2020). That is, we approximately reduce the RKHS bandit problem to the well-studied linear bandit problem; because of the approximation error, the model is a misspecified linear model. Ordinary approximation methods (such as Random Fourier Features or Nyström embedding) basically aim to approximate the kernel K(x,y) by an inner product of finite dimensional vectors. However, to reduce the RKHS bandit problem to the linear bandit problem, we want to approximate a function f in the RKHS \mathcal{H}_{K}\left(\Omega\right) by a function \phi in a finite dimensional subspace so that \|f-\phi\|_{L^{\infty}(\Omega)} is small. Since the usual approximation methods are not appropriate for this purpose, we utilize a method developed in the approximation theory literature, called the P-greedy algorithm (De Marchi et al., 2005), that minimizes the L^{\infty} error. More precisely, we shall show that any function f in the RKHS is approximately equal (in the L^{\infty} norm) to a linear combination of D_{q,\alpha}(T) functions, where q,\alpha>0 are parameters and D_{q,\alpha}(T) is the number of functions (or equivalently points) returned by the P-greedy algorithm (Algorithm 1) with admissible error \mathfrak{e}=\frac{\alpha}{T^{q}}. If K is sufficiently smooth, D_{q,\alpha}(T) is much smaller than T and |\mathcal{A}|. By this approximation, we can tackle the original RKHS bandit problem by applying an algorithm for the misspecified linear bandit problem.

Contributions

To state our contributions, we introduce terminology for kernels. In this paper, we consider two types of kernels: kernels with infinite smoothness and kernels with finite smoothness with smoothness parameter \nu (we provide precise definitions in §4). Examples of the former include Rational Quadratic (RQ) and Squared Exponential (SE) kernels, and examples of the latter include the Matérn kernels with parameter \nu. The latter type also includes a general class of kernels that belong to C^{2\nu}(\Omega\times\Omega) with \nu\in\frac{1}{2}\mathbb{Z}_{>0} and satisfy some additional conditions. Let D_{q,\alpha}(T)\in\mathbb{Z}_{>0} be as before. Then, in §4, we shall show that D_{q,\alpha}(T)=O\left((q\log T-\log(\alpha))^{d}\right) if K has infinite smoothness and D_{q,\alpha}(T)=O\left(\alpha^{-d/\nu}T^{dq/\nu}\right) if K has finite smoothness. Our contributions are stated as follows:

  1. We apply an approximation method that has not previously been applied to the RKHS bandit problem and reduce the problem to the well-studied (misspecified) linear bandit problem. This novel reduction method has the potential to tackle issues other than the ones we deal with in this paper.

  2. We propose APG-EXP3 for the adversarial RKHS bandit problem, where APG stands for an Approximation theory based method using P-Greedy. We prove that its expected cumulative regret is upper bounded by \widetilde{O}\left(\sqrt{TD_{1,\alpha}(T)\log\left(|\mathcal{A}|\right)}\right), where \alpha=\log(|\mathcal{A}|). To the best of our knowledge, this is the first method for the adversarial RKHS bandit problem with general reward functions.

  3. We propose a method for the stochastic RKHS bandit problem called APG-PE and prove that its cumulative regret is \widetilde{O}\left(\sqrt{TD_{1/2,\alpha}(T)\log\left(\frac{|\mathcal{A}|}{\delta}\right)}\right) with probability at least 1-\delta, and that its total computational complexity is \widetilde{O}\left((|\mathcal{A}|+T)D_{1/2,\alpha}^{2}(T)\right). We note that this total computational complexity is generally much better than that of the state-of-the-art result \widetilde{O}(|\mathcal{A}|T\gamma_{T}^{2}+\gamma_{T}^{4}) (Calandriello et al., 2020).

  4. We propose APG-UCB as an approximation of IGP-UCB, provide an upper bound on its cumulative regret when q\geq 1/2, and prove that its total computational complexity is O(|\mathcal{A}|TD_{q,\alpha}^{2}(T)).

    If we take the parameter q so that q>3/2, then we shall show that R_{\text{APG-UCB}}(T) is upper bounded by 4\beta^{\text{IGP-UCB}}_{T}\sqrt{\gamma_{T}T}+O(\sqrt{T\gamma_{T}}T^{(3/2-q)/2}+\gamma_{T}T^{1-q}), where we define \beta^{\text{IGP-UCB}}_{T} in §6. Since the upper bound for the cumulative regret of IGP-UCB is also given as 4\beta^{\text{IGP-UCB}}_{T}\sqrt{\gamma_{T}(T+2)}, APG-UCB has asymptotically the same regret upper bound as IGP-UCB in this case. If the kernel has infinite smoothness or finite smoothness with sufficiently large \nu (i.e., \nu>3d/2), then this method is more efficient than IGP-UCB, whose computational complexity is O(|\mathcal{A}|T^{3}).

  5. In synthetic environments, we empirically show that APG-UCB has almost the same cumulative regret as IGP-UCB and a much shorter running time.

2 Related Work

First, we review previous works on the adversarial RKHS bandit problem. There are almost no existing results for this problem except for Chatterji et al. (2019). They also used an approximation method, but theirs can handle only a limited case; therefore, there are no existing algorithms for the adversarial RKHS bandit problem with general reward functions. Next, we review existing results for the stochastic RKHS bandit problem. Srinivas et al. (2010) studied a multi-armed bandit problem where the reward function is assumed to be sampled from a Gaussian process or to belong to an RKHS. Chowdhury & Gopalan (2017) improved the result of Srinivas et al. (2010) in the RKHS setting and proposed two methods called IGP-UCB and GP-TS. Valko et al. (2013) considered a stochastic RKHS bandit problem where the arm set \mathcal{A} is finite and fixed, proposed a method called SupKernelUCB, and proved a regret upper bound \widetilde{O}(\sqrt{T\gamma_{T}\log^{3}(|\mathcal{A}|T/\delta)}). To address the computational inefficiency in the stochastic RKHS bandit problem, Mutny & Krause (2018) proposed Thompson Sampling and UCB-type algorithms using an approximation method called Quadrature Fourier Features, which improves on Random Fourier Features (Rahimi & Recht, 2008). They proved that the total computational complexity of their methods is \widetilde{O}(|\mathcal{A}|T\gamma_{T}^{2}). However, their methods can be applied only to a very special class of kernels; for example, among the three examples introduced in §3, only SE kernels satisfy their assumption unless d=1. Our methods work for general symmetric positive definite kernels with enough smoothness. Calandriello et al. (2020) proposed a method called BBKB and proved that its regret is upper bounded by 55\tilde{C}^{3}R_{\text{GP-UCB}}(T) with \tilde{C}>1 and that its total computational complexity is \widetilde{O}(|\mathcal{A}|T\gamma_{T}^{2}+\gamma_{T}^{4}). Here we use the maximum information gain instead of the effective dimension, since they have the same order up to polylogarithmic factors (Calandriello et al., 2019). If the kernel is an SE kernel, ignoring polylogarithmic factors, their computational complexity is linear in T. However, unlike APG-PE, the leading term generally incurs the large coefficient |\mathcal{A}|. Finally, we note that we construct APG-PE from PHASED ELIMINATION (Lattimore et al., 2020), an algorithm for the stochastic misspecified linear bandit problem; PE stands for PHASED ELIMINATION.

3 Problem Formulation

Let \Omega be a non-empty subset of a Euclidean space \mathbb{R}^{d} and K:\Omega\times\Omega\rightarrow\mathbb{R} be a symmetric, positive definite kernel on \Omega, i.e., K(x,y)=K(y,x) for all x,y\in\Omega, and for any pairwise distinct points \{x_{1},\dots,x_{n}\}\subseteq\Omega, the kernel matrix (K(x_{i},x_{j}))_{1\leq i,j\leq n} is positive definite. Examples of such kernels are the Rational Quadratic (RQ), Squared Exponential (SE), and Matérn kernels defined as K_{\mathrm{RQ}}(x,y):=\left(1+\frac{s^{2}}{2\mu l^{2}}\right)^{-\mu}, K_{\mathrm{SE}}(x,y):=\exp\left(-\frac{s^{2}}{2l^{2}}\right), and K_{\mathrm{Mat\acute{e}rn}}^{(\nu)}(x,y):=\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{s\sqrt{2\nu}}{l}\right)^{\nu}K_{\nu}\left(\frac{s\sqrt{2\nu}}{l}\right), where s=\|x-y\|_{2}, l>0, \mu>d/2, and \nu>0 are parameters, and K_{\nu} is the modified Bessel function of the second kind. As in the previous work Chowdhury & Gopalan (2017), we normalize the kernel K so that K(x,x)\leq 1 for all x\in\Omega; the three examples above satisfy K(x,x)=1 for any x. We denote by \mathcal{H}_{K}\left(\Omega\right) the RKHS corresponding to the kernel K, which we shall review briefly in §4, and assume that f\in\mathcal{H}_{K}\left(\Omega\right) has bounded norm, i.e., \|f\|_{\mathcal{H}_{K}\left(\Omega\right)}\leq B. In this paper, we consider the following multi-armed bandit problem with time interval T and arm set \mathcal{A}\subseteq\Omega. First, we formulate the stochastic RKHS bandit problem. In each round t=1,2,\dots,T, a learner selects an arm x_{t}\in\mathcal{A} and observes a noisy reward y_{t}=f(x_{t})+\varepsilon_{t}. Here we assume that the noise process \{\varepsilon_{t}\}_{t\geq 1} is conditionally R-sub-Gaussian with respect to a filtration \{\mathcal{F}_{t}\}_{t=1,2,\dots}, i.e., \mathbf{E}\left[\exp(\xi\varepsilon_{t})\mid\mathcal{F}_{t}\right]\leq\exp(\xi^{2}R^{2}/2) for all t\geq 1 and \xi\in\mathbb{R}. We also assume that x_{t} is \mathcal{F}_{t}-measurable and y_{t} is \mathcal{F}_{t+1}-measurable. The objective of the learner is to maximize the cumulative reward \sum_{t=1}^{T}f(x_{t}), and the regret is defined by R(T):=\sup_{x\in\mathcal{A}}\sum_{t=1}^{T}\left(f(x)-f(x_{t})\right). In the adversarial (or non-stochastic) RKHS bandit problem, we assume a sequence f_{t}\in\mathcal{H}_{K}\left(\Omega\right) with \|f_{t}\|_{\mathcal{H}_{K}\left(\Omega\right)}\leq B for t=1,\dots,T is given. In each round t=1,\dots,T, a learner selects an arm x_{t}\in\mathcal{A} and observes a reward f_{t}(x_{t}). The learner's objective is to minimize the cumulative regret R(T):=\sup_{x\in\mathcal{A}}\sum_{t=1}^{T}f_{t}(x)-\sum_{t=1}^{T}f_{t}(x_{t}). In this paper we only consider an oblivious adversary, i.e., we assume the adversary chooses the sequence \{f_{t}\}_{t=1}^{T} before the game starts.
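For concreteness, the following is a minimal Python sketch of the three kernels above (the function names and default parameter values are our own illustrative choices, not the ones used in §8; scipy.special.kv is the modified Bessel function of the second kind):

```python
import numpy as np
from scipy.special import gamma, kv

def k_rq(x, y, l=1.0, mu=2.0):
    # Rational Quadratic kernel; s^2 is the squared Euclidean distance.
    s2 = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return (1.0 + s2 / (2.0 * mu * l ** 2)) ** (-mu)

def k_se(x, y, l=1.0):
    # Squared Exponential kernel.
    s2 = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return np.exp(-s2 / (2.0 * l ** 2))

def k_matern(x, y, l=1.0, nu=2.5):
    # Matern kernel with smoothness parameter nu.
    s = np.sqrt(np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2))
    if s == 0.0:
        return 1.0  # limit as s -> 0; all three kernels satisfy K(x, x) = 1
    z = np.sqrt(2.0 * nu) * s / l
    return (2.0 ** (1.0 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
```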

4 Results from Approximation Theory

In this section, we introduce important results from approximation theory. For an introduction to this subject, we refer the reader to the monograph of Wendland (2004). We first briefly review basic properties of the RKHS and introduce classical results on the convergence rate of the power function, which are required for the proof of Theorem 6. Then, we introduce the P-greedy algorithm and its convergence rate in Theorem 6, which generalizes an existing result of Santin & Haasdonk (2017).

4.1 Reproducing Kernel Hilbert Space

Let F(\Omega):=\{f:\Omega\rightarrow\mathbb{R}\} be the real vector space of \mathbb{R}-valued functions on \Omega. Then, there exists a unique real Hilbert space \left(\mathcal{H}_{K}\left(\Omega\right),\langle\cdot,\cdot\rangle_{\mathcal{H}_{K}\left(\Omega\right)}\right) with \mathcal{H}_{K}\left(\Omega\right)\subseteq F(\Omega) satisfying the following two properties: (i) K(\cdot,x)\in\mathcal{H}_{K}\left(\Omega\right) for all x\in\Omega; (ii) \langle f,K(\cdot,x)\rangle_{\mathcal{H}_{K}\left(\Omega\right)}=f(x) for all f\in\mathcal{H}_{K}\left(\Omega\right) and x\in\Omega. Because of the second property, K is called the reproducing kernel and \mathcal{H}_{K}\left(\Omega\right) is called the reproducing kernel Hilbert space (RKHS).

For a subset \Omega^{\prime}\subseteq\Omega, we denote by V(\Omega^{\prime}) the vector subspace of \mathcal{H}_{K}\left(\Omega\right) spanned by \{K(\cdot,x)\mid x\in\Omega^{\prime}\}. We define an inner product on V(\Omega^{\prime}) as follows: for f=\sum_{i\in I}a_{i}K(\cdot,x_{i}) and g=\sum_{j\in I}b_{j}K(\cdot,x_{j}) with |I|<\infty, we define \langle f,g\rangle:=\sum_{i,j\in I}a_{i}b_{j}K(x_{i},x_{j}). Since K is symmetric and positive definite, V(\Omega^{\prime}) becomes a pre-Hilbert space with this inner product. It is known that the RKHS \mathcal{H}_{K}\left(\Omega\right) is isomorphic to the completion of V(\Omega). Therefore, for each f\in\mathcal{H}_{K}\left(\Omega\right), there exist a sequence \{x_{n}\}_{n=1}^{\infty}\subseteq\Omega and real numbers \{a_{n}\}_{n=1}^{\infty} such that f=\sum_{n=1}^{\infty}a_{n}K(\cdot,x_{n}). Here the convergence is with respect to the norm of \mathcal{H}_{K}\left(\Omega\right), and because of a special property of the RKHS, it is also pointwise convergence.

4.2 Power Function and its Convergence Rate

Since for any f\in\mathcal{H}_{K}\left(\Omega\right) there exists a sequence of finite sums \sum_{n=1}^{N}a_{n}K(\cdot,x_{n}) that converges to f, it is natural to consider the error between f and such a finite sum. A natural notion capturing this error for any f\in\mathcal{H}_{K}\left(\Omega\right) is the power function, defined as follows. For a finite subset of points X=\{x_{n}\}_{n=1}^{N}\subseteq\Omega, we denote by \Pi_{V(X)}:\mathcal{H}_{K}\left(\Omega\right)\rightarrow V(X) the orthogonal projection onto V(X). We note that \Pi_{V(X)}f is characterized as the interpolant of f, i.e., \Pi_{V(X)}f is the unique function g\in V(X) satisfying g(x)=f(x) for all x\in X. Then the power function P_{V(X)}:\Omega\rightarrow\mathbb{R}_{\geq 0} is defined as:

P_{V(X)}(x)=\sup_{f\in\mathcal{H}_{K}\left(\Omega\right)\setminus\{0\}}\frac{|f(x)-(\Pi_{V(X)}f)(x)|}{\|f\|_{\mathcal{H}_{K}\left(\Omega\right)}}.

By definition, we have

\left|f(x)-\left(\Pi_{V(X)}f\right)(x)\right|\leq\|f\|_{\mathcal{H}_{K}\left(\Omega\right)}P_{V(X)}(x)

for any f\in\mathcal{H}_{K}\left(\Omega\right) and x\in\Omega.
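As a side note, the power function admits the standard closed form P_{V(X)}^{2}(x)=K(x,x)-k_{X}(x)^{\mathrm{T}}K_{XX}^{-1}k_{X}(x), where K_{XX}=(K(x_{i},x_{j}))_{i,j} and k_{X}(x)=(K(x_{n},x))_{n}. The following sketch evaluates it numerically (the helper name is our own):

```python
import numpy as np

def power_function(K, X, x):
    # P_{V(X)}^2(x) = K(x, x) - k_X(x)^T K_XX^{-1} k_X(x)
    KXX = np.array([[K(a, b) for b in X] for a in X])  # kernel matrix of X
    kx = np.array([K(a, x) for a in X])
    p2 = K(x, x) - kx @ np.linalg.solve(KXX, kx)
    return np.sqrt(max(p2, 0.0))  # clip round-off; p2 is nonnegative in theory
```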

Since the power function P_{V(X)} represents how well the space V(X) approximates any function in \mathcal{H}_{K}\left(\Omega\right) with a bounded norm, it is intuitively clear that the value of P_{V(X)} is small if X is a “fine” discretization of \Omega. The fineness of a finite subset X=\{x_{1},\dots,x_{N}\}\subseteq\Omega can be evaluated by the fill distance h_{X,\Omega} of X, defined as \sup_{x\in\Omega}\min_{1\leq n\leq N}\|x-x_{n}\|_{2}. We introduce classical results of two kinds on the convergence rate of the power function as h_{X,\Omega}\rightarrow 0: polynomial decay and exponential decay. (More general results, including the case of conditionally positive definite kernels and derivatives of functions in the RKHS, are proved in (Wendland, 2004, Chapter 11).) Before introducing the results, we define the smoothness of kernels.

Definition 1.
  (i) We say (K,\Omega) has finite smoothness with a smoothness parameter \nu\in\frac{1}{2}\mathbb{Z}_{>0} (by abuse of notation, omitting \Omega, we also say “K has finite smoothness”) if \Omega is bounded and satisfies an interior cone condition (see the remark below), and either of the following conditions holds: (a) K\in C^{2\nu}(\Omega^{\iota}\times\Omega^{\iota}) and all the derivatives of K of order 2\nu are bounded on \Omega\times\Omega, where \Omega^{\iota} denotes the interior of \Omega; (b) there exists \Phi:\mathbb{R}^{d}\rightarrow\mathbb{R} such that K(x,y)=\Phi(x-y), \nu+d/2\in\mathbb{Z}, \Phi has a continuous Fourier transform \hat{\Phi}, and \hat{\Phi}(x)=\Theta((1+\|x\|_{2}^{2})^{-(\nu+d/2)}) as \|x\|_{2}\rightarrow\infty.

  (ii) We say (K,\Omega) has infinite smoothness if \Omega is a d-dimensional cube \{x\in\mathbb{R}^{d}:|x-a_{0}|_{\infty}\leq r_{0}\}, K(x,y)=\phi(\|x-y\|_{2}) with a function \phi:\mathbb{R}_{\geq 0}\rightarrow\mathbb{R}, and there exist a positive integer l_{0} and a constant M>0 such that \varphi(r):=\phi(\sqrt{r}) satisfies |\frac{d^{l}\varphi}{dr^{l}}(r)|\leq l!M^{l} for any l\geq l_{0} and r\in\mathbb{R}_{\geq 0}.

Remark 2.

(i) The results introduced in this subsection depend on local polynomial reproduction on \Omega, and such results are hopeless if \Omega is a general bounded set (Wendland, 2004). The interior cone condition is a mild condition that ensures such results; for example, it is satisfied if \Omega is a cube \{x:|x-a|_{\infty}\leq r\} or a ball \{x:|x-a|_{2}\leq r\}. (ii) Since \hat{\Phi}(x)=c\,(1+\|x\|_{2}^{2})^{-\nu-d/2} with c>0 for \Phi(x)=\|x\|_{2}^{\nu}K_{\nu}(\|x\|_{2}), Matérn kernels K_{\mathrm{Mat\acute{e}rn}}^{(\nu)} have finite smoothness with smoothness parameter \nu. In addition, it can be shown that the RQ and SE kernels have infinite smoothness.

Theorem 3 (Wu & Schaback (1993), Wendland (2004) Theorem 11.13).

We assume (K,\Omega) has finite smoothness with smoothness parameter \nu. Then there exist constants C>0 and h_{0}>0, depending only on \nu,d,K, and \Omega, such that \|P_{V(X)}\|_{L^{\infty}\left(\Omega\right)}\leq Ch_{X,\Omega}^{\nu} for any X\subseteq\Omega with h_{X,\Omega}\leq h_{0}.

One can apply this result to RQ and SE kernels for any \nu>0, but a stronger result holds for these kernels.

Theorem 4 (Madych & Nelson (1992), Wendland (2004) Theorem 11.22).

Let \Omega\subset\mathbb{R}^{d} be a cube and assume K has infinite smoothness. Then, there exist constants C_{1},C_{2},h_{0}>0 depending only on d,\Omega, and K such that

\|P_{V(X)}\|_{L^{\infty}(\Omega)}\leq C_{1}\exp\left(-C_{2}/h_{X,\Omega}\right),

for any finite subset X\subseteq\Omega with h_{X,\Omega}\leq h_{0}.

Remark 5.

(i) The assumption on \Omega can be relaxed, i.e., the set \Omega need not be a cube; see Madych & Nelson (1992) for details. (ii) In the case of SE kernels, a stronger result holds: for sufficiently small h_{X,\Omega}, we have \|P_{V(X)}\|_{L^{\infty}(\Omega)}\leq C_{1}\exp\left(C_{2}\log(h_{X,\Omega})/h_{X,\Omega}\right).

4.3 PP-greedy Algorithm and its Convergence Rate

Algorithm 1 Construction of Newton basis with the P-greedy algorithm (cf. Pazouki & Schaback (2011))
  Input: kernel K, admissible error \mathfrak{e}>0, a subset of points \widehat{\Omega}\subseteq\Omega.
  Output: A subset of points X_{m}\subseteq\widehat{\Omega} and the Newton basis N_{1},\dots,N_{m} of V(X_{m}).
  \xi_{1}:=\operatorname{argmax}_{x\in\widehat{\Omega}}K(x,x).
  N_{1}(x):=\frac{K(x,\xi_{1})}{\sqrt{K(\xi_{1},\xi_{1})}}.
  for m=1,2,3,\dots do
     P_{m}^{2}(x):=K(x,x)-\sum_{k=1}^{m}N_{k}^{2}(x).
     if \max_{x\in\widehat{\Omega}}P_{m}^{2}(x)<\mathfrak{e}^{2} then
        return \{\xi_{1},\dots,\xi_{m}\} and \{N_{1},\dots,N_{m}\}.
     end if
     \xi_{m+1}:=\operatorname{argmax}_{x\in\widehat{\Omega}}P_{m}^{2}(x).
     u(x):=K(x,\xi_{m+1})-\sum_{k=1}^{m}N_{k}(\xi_{m+1})N_{k}(x),\quad N_{m+1}(x):=u(x)/\sqrt{P_{m}^{2}(\xi_{m+1})}.
  end for

In a typical application, for a given discretization \widehat{\Omega}\subseteq\Omega and function f\in\mathcal{H}_{K}\left(\Omega\right), we want to find a finite subset X=\{\xi_{1},\dots,\xi_{m}\}\subseteq\widehat{\Omega} with |X|\ll|\widehat{\Omega}| such that f is close to an element of V(X). Several greedy algorithms have been proposed to solve this problem (De Marchi et al., 2005; Schaback & Wendland, 2000; Müller, 2009). Among them, the P-greedy algorithm (De Marchi et al., 2005) is the most suitable for our purpose, since its point selection depends only on K and \widehat{\Omega}, not on the function f, which is unknown to the learner in the bandit setting.

The P-greedy algorithm first selects a point \xi_{1}\in\widehat{\Omega} maximizing P_{V(\emptyset)}(x)=K(x,x), and after selecting points X_{n-1}=\{\xi_{1},\dots,\xi_{n-1}\}, it selects \xi_{n} by \xi_{n}=\operatorname{argmax}_{x\in\widehat{\Omega}}P_{V(X_{n-1})}(x). Following Pazouki & Schaback (2011), we introduce in Algorithm 1 (a variant of) the P-greedy algorithm that simultaneously computes the Newton basis (Müller & Schaback, 2009). If \widehat{\Omega} is finite, this algorithm outputs the Newton basis N_{1},\dots,N_{m} at the cost of O(|\widehat{\Omega}|m^{2}) time using O(|\widehat{\Omega}|m) space. The Newton basis \{N_{1},\dots,N_{m}\} is the Gram–Schmidt orthonormalization of the basis \{K(\cdot,\xi_{1}),\dots,K(\cdot,\xi_{m})\}. Because of orthonormality, the following equality holds (Santin & Haasdonk, 2017, Lemma 5): P_{V(X)}^{2}(x)=K(x,x)-\sum_{i=1}^{m}N_{i}^{2}(x), where X=\{\xi_{1},\dots,\xi_{m}\}. Although Algorithm 1 seemingly differs from the P-greedy algorithm described above, this formula shows that the two algorithms are identical.
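A minimal Python sketch of Algorithm 1, assuming a finite candidate set \widehat{\Omega} given as an array of points and a kernel function K as in the sketch of §3 (the name p_greedy and the array-based interface are our own choices):

```python
import numpy as np

def p_greedy(K, candidates, err):
    """P-greedy point selection with Newton basis (a sketch of Algorithm 1).

    candidates: (n, d) array representing the finite set Omega-hat.
    Returns the indices of the selected points xi_1, ..., xi_m and an
    (n, m) array N whose k-th column is N_{k+1} evaluated on candidates.
    """
    n = len(candidates)
    p2 = np.array([K(x, x) for x in candidates])  # P_0^2(x) = K(x, x)
    selected, basis = [], []
    while p2.max() >= err ** 2 and len(selected) < n:  # stop once max P_m^2 < err^2
        j = int(np.argmax(p2))   # xi_{m+1} maximizes the current power function
        u = np.array([K(x, candidates[j]) for x in candidates])
        for Nk in basis:         # u = K(., xi_{m+1}) - sum_k N_k(xi_{m+1}) N_k
            u = u - Nk[j] * Nk
        Nm = u / np.sqrt(p2[j])  # normalize by P_m(xi_{m+1})
        selected.append(j)
        basis.append(Nm)
        p2 = p2 - Nm ** 2        # P_{m+1}^2 = P_m^2 - N_{m+1}^2
    return selected, (np.column_stack(basis) if basis else np.zeros((n, 0)))
```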

The following theorem is essentially due to Santin & Haasdonk (2017); we provide a more general version.

Theorem 6 (Santin & Haasdonk (2017)).

Let K:\Omega\times\Omega\rightarrow\mathbb{R} be a symmetric positive definite kernel. Suppose that the P-greedy algorithm applied to \widehat{\Omega}\subseteq\Omega with error \mathfrak{e} gives n_{\mathfrak{e}} points X\subseteq\widehat{\Omega} with |X|=n_{\mathfrak{e}}. Then the following statements hold:

(i) Suppose (K,\Omega) has finite smoothness with smoothness parameter \nu>0. Then there exists a constant \hat{C}>0 depending only on d,\nu,K, and \Omega such that \|P_{V(X)}\|_{L^{\infty}\left(\widehat{\Omega}\right)}<\hat{C}n_{\mathfrak{e}}^{-\nu/d}.

(ii) Suppose (K,\Omega) has infinite smoothness. Then there exist constants \hat{C}_{1},\hat{C}_{2}>0 depending only on d,K, and \Omega such that \|P_{V(X)}\|_{L^{\infty}\left(\widehat{\Omega}\right)}<\hat{C}_{1}\exp\left(-\hat{C}_{2}n_{\mathfrak{e}}^{1/d}\right).

The statements of the theorem are non-trivial in two respects. First, by Theorems 3 and 4, if n\in\mathbb{Z}_{>0} is sufficiently large, there exists a subset X\subseteq\Omega with |X|=n that achieves the same convergence rate as above (e.g., X a uniform mesh of \Omega). The theorem assures that the same convergence rate is achieved by the points selected by the P-greedy algorithm. Second, it also assures that the same result holds even if the P-greedy algorithm is applied only to a subset \widehat{\Omega}\subseteq\Omega.

In the finite smoothness case, Santin & Haasdonk (2017) only considered the case where \mathcal{H}_{K}\left(\Omega\right) is norm equivalent to a Sobolev space, which is in turn norm equivalent to the RKHS associated with a Matérn kernel. One can prove Theorem 6 from Theorems 3 and 4 and (DeVore et al., 2013, Corollary 3.3) by the same argument as in Santin & Haasdonk (2017).

For later use, we provide a restatement of Theorem 6 as follows.

Corollary 7.

Let \alpha,q>0 be parameters and denote by D=D_{q,\alpha}(T) the number of points returned by the P-greedy algorithm with error \mathfrak{e}=\alpha/T^{q}.

(i) Suppose K has finite smoothness with smoothness parameter \nu>0. Then D_{q,\alpha}(T)=O\left(\alpha^{-d/\nu}T^{dq/\nu}\right).

(ii) Suppose K has infinite smoothness. Then D_{q,\alpha}(T)=O\left((q\log T-\log(\alpha))^{d}\right).
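As a usage sketch (reusing k_se and p_greedy from the earlier sketches), one can observe the polylogarithmic growth of D_{q,\alpha}(T) predicted by (ii) for an SE kernel on a grid discretization of [0,1]:

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 500).reshape(-1, 1)  # discretized arm set in [0, 1]
for T in (10, 100, 1000, 10000):
    # admissible error e = alpha / T^q with q = 1/2, alpha = 1
    sel, _ = p_greedy(k_se, grid, err=1.0 / np.sqrt(T))
    print(T, len(sel))  # D_{1/2,1}(T) grows roughly like log T for d = 1
```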

5 Misspecified Linear Bandit Problem

Since we can approximate f\in\mathcal{H}_{K}\left(\Omega\right) by an element of V(X), where V(X) is a finite dimensional subspace of the RKHS, we study a linear bandit problem in which the linear model is misspecified, i.e., the misspecified linear bandit problem (Lattimore et al., 2020; Lattimore & Szepesvári, 2020). In this section, we introduce several algorithms for the stochastic and adversarial misspecified linear bandit problems. It turns out that such algorithms can be constructed by modifying (or even without modifying) algorithms for the linear bandit problem. We provide proofs of the results in this section in the supplementary material.

First, we provide a formulation of the stochastic misspecified linear bandit problem suitable for our purpose. Let \mathcal{A} be a set and suppose that there exists a map x\mapsto\widetilde{x} from \mathcal{A} to the unit ball \{\xi\in\mathbb{R}^{D}:\|\xi\|_{2}\leq 1\} of a Euclidean space. In each round t=1,2,\dots,T, a learner selects an action x_{t}\in\mathcal{A} and the environment reveals a noisy reward y_{t}=g(x_{t})+\varepsilon_{t}, where g(x):=\langle\theta,\widetilde{x}\rangle+\omega(x), \theta\in\mathbb{R}^{D}, and \omega(x) is a bias term satisfying \sup_{x\in\mathcal{A}}|\omega(x)|\leq\epsilon, where \epsilon>0 is known to the learner. We also assume that there exists B>0 such that \sup_{x\in\mathcal{A}}|g(x)|\leq B and \|\theta\|_{2}\leq B. As before, \{\varepsilon_{t}\}_{t\geq 1} is conditionally R-sub-Gaussian w.r.t. a filtration \{\mathcal{F}_{t}\}_{t\geq 1}, and we assume that \widetilde{x}_{t} is \mathcal{F}_{t}-measurable and y_{t} is \mathcal{F}_{t+1}-measurable. The regret is defined as R(T):=\sum_{t=1}^{T}\left(\sup_{x\in\mathcal{A}}g(x)-g(x_{t})\right). We can formulate the adversarial misspecified linear bandit problem in a similar way. Let \{g_{t}\}_{t=1}^{T} be a sequence of functions on \mathcal{A} with g_{t}(x)=\langle\theta_{t},\widetilde{x}\rangle+\omega_{t}(x), \theta_{t}\in\mathbb{R}^{D}, and \sup_{x\in\mathcal{A}}|\omega_{t}(x)|\leq\epsilon, where the map x\mapsto\widetilde{x} is as before. We also assume that there exists B>0 such that \sup_{x\in\mathcal{A}}|g_{t}(x)|\leq B and \|\theta_{t}\|_{2}\leq B. In each round t=1,\dots,T, the learner selects an arm x_{t}\in\mathcal{A} and observes a reward g_{t}(x_{t}). The cumulative regret is defined as R(T)=\sup_{x\in\mathcal{A}}\sum_{t=1}^{T}g_{t}(x)-\sum_{t=1}^{T}g_{t}(x_{t}).

First, we introduce a modification of LinUCB (Abbasi-Yadkori et al., 2011). To do this, we prepare notation for the stochastic linear bandit problem. Let \lambda>0 and \delta>0 be parameters. We define A_{t}:=\lambda 1_{D}+\sum_{s=1}^{t}\widetilde{x}_{s}\widetilde{x}_{s}^{\mathrm{T}}, b_{t}:=\sum_{s=1}^{t}y_{s}\widetilde{x}_{s}, and \hat{\theta}_{t}:=A_{t}^{-1}b_{t}. Here, 1_{D} is the identity matrix of size D. For x\in\mathbb{R}^{D}, we define the Mahalanobis norm as \|x\|_{A_{t}^{-1}}:=\sqrt{x^{\mathrm{T}}A_{t}^{-1}x} and define \beta_{t} as

\beta_{t}:=\beta(A_{t},\delta,\lambda):=R\sqrt{\log\frac{\det\lambda^{-1}A_{t}}{\delta^{2}}}+\sqrt{\lambda}B.

We note that by the proof of (Abbasi-Yadkori et al., 2011, Lemma 11), the computational complexity of updating \beta_{t} is O(D^{2}) at each round.

Lattimore et al. (2020) (see the appendix of the arXiv version) considered a modification of LinUCB which, in round t+1, selects x\in\mathcal{A} maximizing the (modified) UCB \langle\hat{\theta}_{t},\widetilde{x}\rangle+\beta_{t}\|\widetilde{x}\|_{A_{t}^{-1}}+\epsilon\sum_{s=1}^{t}|\widetilde{x}^{\mathrm{T}}A_{t}^{-1}\widetilde{x}_{s}|, and proved that the regret of the algorithm is upper bounded by O(D\sqrt{T}\log(T)+\epsilon T\sqrt{D\log(T)}). However, computing this value requires O(t) time for each arm x\in\mathcal{A}. Therefore, at the cost of an additional \sqrt{D} factor in the second term of the regret upper bound above, we consider another upper confidence bound that can be computed efficiently. In round t+1, our modified UCB-type algorithm selects an arm x\in\mathcal{A} maximizing the modified UCB \langle\hat{\theta}_{t},\widetilde{x}\rangle+\|\widetilde{x}\|_{A_{t}^{-1}}\left(\beta_{t}+\epsilon\psi_{t}\right), where \psi_{t} is defined as \sum_{s=1}^{t}\|\widetilde{x}_{s}\|_{A_{s-1}^{-1}}. By storing \psi_{t} in each round, the complexity of computing this value is O(D^{2}) for each x\in\mathcal{A}, and, as is well known, one can update A_{t}^{-1} in O(D^{2}) time using the Sherman–Morrison formula. By a standard argument, we can prove the following regret bound.
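The O(D^{2}) bookkeeping mentioned above can be sketched as follows: a rank-one Sherman–Morrison update of A_{t}^{-1} that also returns \|\widetilde{x}_{s}\|_{A_{s-1}^{-1}}, the term accumulated in \psi_{t} (the helper name is our own):

```python
import numpy as np

def rank_one_update(A_inv, x):
    # Given A^{-1} and a new feature x, return the updated (A + x x^T)^{-1}
    # (Sherman-Morrison) and ||x||_{A^{-1}} under the *old* matrix, i.e. the
    # term added to psi_t; both operations cost O(D^2).
    Ax = A_inv @ x
    xAx = float(x @ Ax)
    A_inv_new = A_inv - np.outer(Ax, Ax) / (1.0 + xAx)
    return A_inv_new, np.sqrt(xAx)
```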

Proposition 8.

Let notation and assumptions be as above. We further assume that \lambda\geq 1. Then with probability at least 1-\delta, the regret R(T) of the modified UCB algorithm satisfies R(T)\leq 2\beta_{T}\sqrt{T}\sqrt{2\log\det(\lambda^{-1}A_{T})}+2\epsilon T+4\epsilon T\log(\det(\lambda^{-1}A_{T})). In particular, we have

R(T)=\widetilde{O}\left(\sqrt{DT\log(1/\delta)}+D\sqrt{T}+\epsilon DT\right).

In the supplementary material, we also introduce a modification of Thompson Sampling.

The regret upper bound provided above does not depend on the arm set \mathcal{A}. Moreover, the same results hold even if the arm set changes over time (with a minor modification of the definition of regret). On the other hand, several authors (Lattimore et al., 2020; Auer, 2002; Valko et al., 2013) studied algorithms whose regret depends on the cardinality of the arm set in the stochastic linear or RKHS setting. In some rounds, such algorithms eliminate arms that are, with high probability, suboptimal, so the arm set must remain the same over time. Generally, these algorithms are more complicated than LinUCB or Thompson Sampling. However, Lattimore et al. (2020) recently proposed a simple yet sophisticated algorithm called PHASED ELIMINATION, based on the Kiefer–Wolfowitz theorem. Furthermore, they showed that it works well for the stochastic misspecified linear bandit problem without modification. More precisely, they proved the following result.

Theorem 9 (Lattimore et al. (2020); Lattimore & Szepesvári (2020)).

Let R(T) be the regret PHASED ELIMINATION incurs for the stochastic misspecified linear bandit problem. We further assume that \{\varepsilon_{t}\} is an independent R-sub-Gaussian sequence. Then, with probability at least 1-\delta, we have

R(T)=O\left(\sqrt{DT\log\left(\frac{|\mathcal{A}|\log(T)}{\delta}\right)}+\epsilon\sqrt{D}T\log(T)\right).

Moreover, the total computational complexity up to round T is O(D^{2}|\mathcal{A}|\log\log(D)\log(T)+TD^{2}).

Remark 10.

Although they provided an upper bound on the expected regret, it is not difficult to see that their proof gives a high-probability regret upper bound.

Next, we show that EXP3 for adversarial linear bandits (cf. Lattimore & Szepesvári, 2020) works for the adversarial misspecified linear bandit problem without modification. We introduce notation for EXP3. Let \eta>0 be a learning rate, \gamma an exploration parameter, and \pi_{\mathrm{exp}} an exploration distribution over \mathcal{A}. For a distribution \pi on \mathcal{A}, we define a matrix Q(\pi):=\sum_{x\in\mathcal{A}}\pi(x)\widetilde{x}\widetilde{x}^{\mathrm{T}}. We also put \phi_{t}:=g_{t}(x_{t})Q_{t}^{-1}\widetilde{x}_{t} and \phi_{t}(x):=\langle\phi_{t},\widetilde{x}\rangle for x\in\mathcal{A}, where the matrix Q_{t} is defined below. We define a distribution q_{t} over \mathcal{A} by q_{t}(x)\propto\exp(\eta\sum_{s=1}^{t-1}\phi_{s}(x)) and a distribution p_{t} by p_{t}(x)=\gamma\pi_{\mathrm{exp}}(x)+(1-\gamma)q_{t}(x) for x\in\mathcal{A}. The matrix Q_{t} is defined as Q(p_{t}). We put \Gamma(\pi_{\mathrm{exp}}):=\sup_{x\in\mathcal{A}}\widetilde{x}^{\mathrm{T}}Q(\pi_{\mathrm{exp}})^{-1}\widetilde{x}.
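A sketch of the resulting EXP3 loop with the notation above (the interface, in particular the reward callback, is our own illustrative choice; we assume \{\widetilde{x}\mid x\in\mathcal{A}\} spans \mathbb{R}^{D} so that Q_{t} is invertible):

```python
import numpy as np

def exp3_misspecified(features, reward, T, eta, gamma, pi_exp, rng):
    """EXP3 for (misspecified) linear bandits.

    features: (|A|, D) array whose rows are the vectors x-tilde.
    reward(t, i): returns g_t(x_i) for the arm i chosen in round t.
    pi_exp: exploration distribution over arms (length-|A| array).
    """
    n_arms = features.shape[0]
    S = np.zeros(n_arms)                        # running sums sum_s phi_s(x)
    for t in range(T):
        w = np.exp(eta * (S - S.max()))         # q_t up to normalization (stable)
        q = w / w.sum()
        p = gamma * pi_exp + (1.0 - gamma) * q  # p_t
        i = rng.choice(n_arms, p=p)
        y = reward(t, i)                        # observe g_t(x_t)
        Q = features.T @ (p[:, None] * features)    # Q_t = Q(p_t)
        phi = y * np.linalg.solve(Q, features[i])   # phi_t = g_t(x_t) Q_t^{-1} x~_t
        S += features @ phi                     # accumulate phi_t(x) = <phi_t, x~>
    return S
```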

Proposition 11.

We assume that \{\widetilde{x}\mid x\in\mathcal{A}\} spans \mathbb{R}^{D}. We also assume \pi_{\mathrm{exp}} satisfies \Gamma(\pi_{\mathrm{exp}})\leq D and we take \gamma=B\Gamma(\pi_{\mathrm{exp}})\eta. Then, applying EXP3 to the adversarial misspecified linear bandit problem, we have the following upper bound for the expected regret:

\mathbf{E}\left[R(T)\right]\leq 2\epsilon T+eB^{2}\eta DT+\frac{2\epsilon T}{B\eta}+\frac{\log|\mathcal{A}|}{\eta}.
Remark 12.

By the Kiefer–Wolfowitz theorem, there exists an exploration distribution \pi_{\mathrm{exp}} such that \Gamma(\pi_{\mathrm{exp}})\leq D.

6 Main Results

Using the results from approximation theory explained in §4 and the algorithms for the misspecified linear bandit problem in §5, we provide several algorithms for the stochastic and adversarial RKHS bandit problems. We provide proofs of the results in this section in the supplementary material.

Let N_{1},\dots,N_{D} be the Newton basis returned by Algorithm 1 with \mathfrak{e}=\frac{\alpha}{T^{q}}, where q,\alpha>0, and \widehat{\Omega}=\mathcal{A}. Then, by the orthonormality of the Newton basis and the definition of the power function, for any f\in\mathcal{H}_{K}\left(\Omega\right) and x\in\Omega, we have

|f(x)-\langle\theta_{f},\widetilde{x}\rangle|\leq\|f\|_{\mathcal{H}_{K}\left(\Omega\right)}P_{V(X)}(x),

where \theta_{f}=\left(\langle f,N_{i}\rangle\right)_{1\leq i\leq D}\in\mathbb{R}^{D} and \widetilde{x}=\left(N_{i}(x)\right)_{1\leq i\leq D}\in\mathbb{R}^{D}. Therefore, if f is the objective function of an RKHS bandit problem, we can regard f as a misspecified linear model and apply algorithms for the misspecified linear bandit problem to solve the original RKHS bandit problem.
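This bound can be checked numerically. For f=\sum_{j}c_{j}K(\cdot,z_{j}) with centers z_{j} in the arm set, the reproducing property gives \theta_{f}=(\sum_{j}c_{j}N_{i}(z_{j}))_{i}, so the approximation error on the arms is at most \|f\|_{\mathcal{H}_{K}\left(\Omega\right)}\mathfrak{e}. A sketch (reusing k_se, p_greedy, and grid from the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(0)
sel, N = p_greedy(k_se, grid, err=1e-3)       # Newton basis on the arm set
centers = rng.choice(len(grid), size=5, replace=False)
c = rng.normal(size=5)
# f(x) = sum_j c_j K(x, z_j), evaluated on all arms
f_vals = sum(c[j] * np.array([k_se(z, x) for x in grid])
             for j, z in enumerate(grid[centers]))
theta_f = N[centers].T @ c                    # <f, N_i> = sum_j c_j N_i(z_j)
G = np.array([[k_se(a, b) for b in grid[centers]] for a in grid[centers]])
f_norm = np.sqrt(c @ G @ c)                   # ||f||_{H_K}
print(np.max(np.abs(f_vals - N @ theta_f)) <= f_norm * 1e-3)  # True (up to round-off)
```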

In this section, we reduce the RKHS bandit problem to the misspecified linear bandit problem via the map x\mapsto\widetilde{x} and apply the modified LinUCB, PHASED ELIMINATION, and EXP3 to the reduced problem. We call these algorithms APG-UCB, APG-PE, and APG-EXP3, respectively; APG-UCB is displayed in Algorithm 2. We denote by D_{q,\alpha}(T)=D the number of points returned by Algorithm 1 with \mathfrak{e}=\frac{\alpha}{T^{q}}. By the results in §4, we have an upper bound on D_{q,\alpha}(T) (Corollary 7).

Algorithm 2 Approximated RKHS Bandit Algorithm of UCB type (APG-UCB)
  Input: Time interval T, admissible error \mathfrak{e}=\frac{\alpha}{T^{q}}, \lambda,R,B,\delta.
  Using Algorithm 1, compute the Newton basis N_{1},\dots,N_{D} with admissible error \mathfrak{e} and \widehat{\Omega}=\mathcal{A}, and put \epsilon=B\mathfrak{e}.
  for x\in\mathcal{A} do
     \widetilde{x}:=[N_{1}(x),N_{2}(x),\dots,N_{D}(x)]^{\mathrm{T}}\in\mathbb{R}^{D}.
  end for
  for t=0,1,\dots,T-1 do
     A_{t}:=\lambda 1_{D}+\sum_{s=1}^{t}\widetilde{x}_{s}\widetilde{x}_{s}^{\mathrm{T}},\quad b_{t}:=\sum_{s=1}^{t}y_{s}\widetilde{x}_{s}.
     \hat{\theta}_{t}:=A_{t}^{-1}b_{t},\quad \psi_{t}:=\sum_{s=1}^{t}\|\widetilde{x}_{s}\|_{A_{s-1}^{-1}}.
     x_{t+1}:=\operatorname{argmax}_{x\in\mathcal{A}}\left\{\langle\widetilde{x},\hat{\theta}_{t}\rangle+\|\widetilde{x}\|_{A_{t}^{-1}}\left(\beta_{t}+\epsilon\psi_{t}\right)\right\}.
     Select x_{t+1} and observe y_{t+1}.
  end for
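A condensed Python sketch of Algorithm 2, reusing p_greedy from §4.3 (the interface and names are our own): \log\det(\lambda^{-1}A_{t}), which is needed for \beta_{t}, is maintained via the matrix determinant lemma, and A_{t}^{-1} via the Sherman–Morrison formula, so each round costs O(|\mathcal{A}|D^{2}), matching the total complexity claimed in Theorem 13 below.

```python
import numpy as np

def apg_ucb(K, arms, T, err, lam, R, B, delta, reward):
    # reward(i) returns a noisy observation of f(arms[i]).
    _, N = p_greedy(K, arms, err)        # rows of N are the vectors x~
    D = N.shape[1]
    eps = B * err                        # misspecification level epsilon
    A_inv = np.eye(D) / lam              # A_0^{-1}
    logdet = 0.0                         # log det(lam^{-1} A_t), 0 at t = 0
    b = np.zeros(D)
    psi = 0.0
    chosen = []
    for _ in range(T):
        theta = A_inv @ b                # theta-hat_t
        beta = R * np.sqrt(logdet + 2.0 * np.log(1.0 / delta)) + np.sqrt(lam) * B
        width = np.sqrt(np.einsum("id,de,ie->i", N, A_inv, N))  # ||x~||_{A_t^{-1}}
        i = int(np.argmax(N @ theta + width * (beta + eps * psi)))
        x, y = N[i], reward(i)
        Ax = A_inv @ x
        xAx = float(x @ Ax)
        psi += np.sqrt(xAx)              # add ||x~_s||_{A_{s-1}^{-1}} to psi_t
        A_inv -= np.outer(Ax, Ax) / (1.0 + xAx)  # Sherman-Morrison update
        logdet += np.log1p(xAx)          # matrix determinant lemma
        b += y * x
        chosen.append(i)
    return chosen
```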

First, we state the results for APG-UCB.

Theorem 13.

We denote by R_{\text{APG-UCB}}(T) the regret that Algorithm 2 incurs for the stochastic RKHS bandit problem up to time step T, and assume that \lambda\geq 1 and q\geq 1/2. Then with probability at least 1-\delta, R_{\text{APG-UCB}}(T) is given as

\widetilde{O}\left(\sqrt{TD_{q,\alpha}(T)\log(1/\delta)}+D_{q,\alpha}(T)\sqrt{T}\right)

and the total computational complexity of the algorithm is O(|\mathcal{A}|TD_{q,\alpha}^{2}(T)).

The admissible error \mathfrak{e} balances computational complexity against regret. However, this trade-off is not apparent from Theorem 13. The following theorem provides another regret upper bound for APG-UCB; it states that if we take a smaller error \mathfrak{e}, then the upper bound is almost the same as that of IGP-UCB.

Theorem 14.

We assume \lambda=1 and take the parameter q of APG-UCB so that q>3/2. We define \beta^{\text{IGP-UCB}}_{T} as B+R\sqrt{2(\gamma_{T}+1+\log(1/\delta))}. Then with probability at least 1-\delta, we have R_{\text{APG-UCB}}(T)\leq b(T), where b(T) is given as 4\beta^{\text{IGP-UCB}}_{T}\sqrt{\gamma_{T}T}+O(\sqrt{T\gamma_{T}}T^{(3/2-q)/2}+\gamma_{T}T^{1-q}).

Remark 15.

Since the main term of b(T) is 4\beta^{\text{IGP-UCB}}_{T}\sqrt{\gamma_{T}T} and, by the proof in (Chowdhury & Gopalan, 2017), IGP-UCB has the regret upper bound 4\beta^{\text{IGP-UCB}}_{T}\sqrt{\gamma_{T}(T+2)}, APG-UCB has asymptotically the same regret upper bound as IGP-UCB if we take a small error \mathfrak{e}. We note that if \nu is sufficiently large compared to d (this is always the case if the kernel has infinite smoothness), then APG-UCB is more efficient than IGP-UCB. We note that for any choice of parameters, the regret upper bound of BBKB is 55\tilde{C}^{3}R_{\text{GP-UCB}}(T), where \tilde{C}\geq 1.

Next, we state the results for APG-PE.

Theorem 16.

We denote by R_{\text{APG-PE}}(T) the regret that APG-PE with q=1/2 incurs for the stochastic RKHS bandit problem up to time step T. We further assume that \{\varepsilon_{t}\} is an independent R-sub-Gaussian sequence. Then with probability at least 1-\delta, we have R_{\text{APG-PE}}(T)=\widetilde{O}\left(\sqrt{TD_{1/2,\alpha}(T)\log\left(\frac{|\mathcal{A}|}{\delta}\right)}\right), and its total computational complexity is \widetilde{O}\left((|\mathcal{A}|+T)D_{1/2,\alpha}^{2}(T)\right).

Finally, we state a result for the adversarial RKHS bandit problem.

Theorem 17.

We denote by R_{\text{APG-EXP3}}(T) the cumulative regret that APG-EXP3 with \alpha=\log(|\mathcal{A}|) and q=1 incurs for the adversarial RKHS bandit problem up to time step T. Then with appropriate choices of the learning rate \eta and the exploration distribution, the expected regret \mathbf{E}\left[R_{\text{APG-EXP3}}(T)\right] is given as \widetilde{O}\left(\sqrt{TD_{1,\alpha}(T)\log\left(|\mathcal{A}|\right)}\right).


Figure 1: Normalized Cumulative Regret for RQ kernels.

Figure 2: Normalized Cumulative Regret for SE kernels.

7 Discussion

So far, we have emphasized the advantages of our methods; in this section, we discuss their limitations. Here, we focus on Theorem 13 with q=1/2 and Theorem 16. Since we see no such limitations when the kernel has infinite smoothness, in this section we assume the kernel is a Matérn kernel. In our theoretical results, D_{q,\alpha}(T) plays a role similar to that of the information gain in the theoretical result of BBKB. If the kernel is a Matérn kernel with parameter \nu, then, by recent results on the information gain (Vakili et al., 2021), we have \gamma_{T}=\widetilde{O}(T^{d/(d+2\nu)}), which is nearly optimal in view of the lower bound of Scarlett et al. (2017) and is slightly better than the upper bound of D_{1/2,\alpha}(T). Therefore, in this case the regret upper bound \widetilde{O}(\sqrt{T}T^{d/(2\nu)}) of Theorem 13 is slightly worse than the regret upper bound \widetilde{O}(\sqrt{T}T^{d/(d+2\nu)}) of BBKB. Similarly, SupKernelUCB has a nearly optimal regret upper bound for Matérn kernels, whereas the regret upper bound of APG-PE is slightly worse in that case.

The inferiority of our method in the Matérn kernel case might be counter-intuitive, since the convergence rate of the power function for Matérn kernels is also optimal (cf. Schaback (1995)) and Theorem 9 cannot be improved (Lattimore et al., 2020). We explain why a combination of optimal results leads to a non-optimal result. The results on the information gain depend on the eigenvalue decay of the Mercer operator, rather than on the decay of the power function in the L^{\infty}-norm as in this study. However, these two notions are closely related. By n-width theory (Pinkus, 2012, Chapter IV, Corollary 2.6), eigenvalue decay corresponds to the decay of the power function in the L^{2}-norm (or, more precisely, the Kolmogorov n-width). The decay in the L^{2}-norm is derived from that in the L^{\infty}-norm. If the kernel is a Matérn kernel, using a localization technique called Duchon's trick (Wendland, 1997), it is possible to obtain a faster decay in the L^{p}-norm than in the L^{\infty}-norm for p<\infty. Since the norm relevant to the misspecified linear bandit problem is the L^{\infty}-norm rather than the L^{2}-norm, we took the approach proposed in this paper.

8 Experiments

In this section, we empirically verify our theoretical results. We compare APG-UCB to IGP-UCB Chowdhury & Gopalan (2017) in terms of cumulative regret and running time for RQ and SE kernels in synthetic environments.

8.1 Environments

We assume the action set is a discretization of the cube [0,1]^{d} for d=1,2,3, and we take \mathcal{A} so that |\mathcal{A}| is about 1000. More precisely, we define \mathcal{A} as \{i/m_{d}\mid i=0,1,\dots,m_{d}-1\}^{d}, where m_{1}=1000,m_{2}=30,m_{3}=10. We randomly construct reward functions f\in\mathcal{H}_{K}\left(\Omega\right) with \|f\|_{\mathcal{H}_{K}\left(\Omega\right)}=1 as follows. We randomly select points \xi_{i} (for 1\leq i\leq m) from \mathcal{A} until either m=300 or \|P_{V(\{\xi_{1},\dots,\xi_{m}\})}\|_{L^{\infty}(\mathcal{A})}<10^{-4}, and compute an orthonormal basis \{\varphi_{1},\dots,\varphi_{m}\} of V(\{\xi_{1},\dots,\xi_{m}\}). Then, we define f=\sum_{i=1}^{m}a_{i}\varphi_{i}, where [a_{1},\dots,a_{m}]\in\mathbb{R}^{m} is a random vector with unit norm. We take l=0.3\sqrt{d} for the RQ kernel and l=0.2\sqrt{d} for the SE kernel, because the diameter of the d-dimensional cube is \sqrt{d}. For each kernel, we generate 10 reward functions as above and evaluate our proposed method and the existing algorithm for time interval T=5000, reporting the mean cumulative regret and total running time over these 10 environments. We normalize cumulative regret so that the normalized cumulative regret of the uniform random policy corresponds to the line through the origin with slope 1 in the figures. For simplicity, we assume the kernel, B, and R are known to the algorithms. For the other parameters, we use the theoretically suggested values for both APG-UCB and IGP-UCB. Computations were run on an Intel Xeon E5-2630 v4 processor with 128 GB RAM. In the supplementary material, we explain the experimental setting in more detail and provide additional experimental results.
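A sketch of this reward construction (simplified: we fix the number of centers instead of monitoring the power function; the Cholesky factor of the Gram matrix orthonormalizes the kernel translates, so unit-norm coefficients give \|f\|_{\mathcal{H}_{K}\left(\Omega\right)}=1):

```python
import numpy as np

def random_reward(K, arms, m=300, rng=None):
    # Pick m random centers, orthonormalize the kernel translates via a
    # Cholesky factor of the Gram matrix, and draw unit-norm coefficients.
    rng = rng or np.random.default_rng()
    Z = arms[rng.choice(len(arms), size=m, replace=False)]
    G = np.array([[K(a, b) for b in Z] for a in Z])
    L = np.linalg.cholesky(G + 1e-12 * np.eye(m))  # jitter for stability
    a = rng.normal(size=m)
    a /= np.linalg.norm(a)
    c = np.linalg.solve(L.T, a)  # f = a^T (L^{-1} k_Z) = c^T k_Z, ||f|| = ||a|| = 1
    return lambda x: c @ np.array([K(z, x) for z in Z])
```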

8.2 Results

We show the results for normalized cumulative regret in Figures 1 and 2. As suggested by the theoretical results, the cumulative regret of these algorithms grows as \widetilde{O}(\sqrt{T}). Although the convergence rate of the power function for SE kernels is slightly faster than that for RQ kernels (by the remark after Theorem 4), the empirical results for RQ and SE kernels are similar. In both cases, APG-UCB has almost the same cumulative regret as IGP-UCB.

We also show the mean total running time in Table 1, where we abbreviate APG-UCB as APG and IGP-UCB as IGP. For all dimensions, it took about five to six thousand seconds for IGP-UCB to complete an experiment in one environment. As the table and figures show, the running time of our method is much shorter than that of IGP-UCB, while its regret is almost the same.

Table 1: Total Running Time (in seconds).
APG(RQ) IGP(RQ) APG(SE) IGP(SE)
d=1 4.2e-01 5.7e+03 4.0e-01 5.7e+03
d=2 2.7e+00 5.1e+03 2.9e+00 5.1e+03
d=3 3.0e+01 5.7e+03 4.3e+01 5.7e+03

9 Conclusion

By reducing the RKHS bandit problem to the misspecified linear bandit problem, we provide the first general algorithm for the adversarial RKHS bandit problem and several efficient algorithms for the stochastic RKHS bandit problem. We provide cumulative regret upper bounds for them and empirically verify our theoretical results.

10 Acknowledgements

We would like to thank the anonymous reviewers for suggestions that improved the paper. We also thank Janmajay Singh and Takafumi J. Suzuki for valuable comments on a preliminary version of the manuscript.

References

  • Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
  • Agrawal & Goyal (2013) Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pp.  127–135, 2013.
  • Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Bubeck & Cesa-Bianchi (2012) Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
  • Calandriello et al. (2019) Calandriello, D., Carratino, L., Lazaric, A., Valko, M., and Rosasco, L. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, pp.  533–557. PMLR, 2019.
  • Calandriello et al. (2020) Calandriello, D., Carratino, L., Valko, M., Lazaric, A., and Rosasco, L. Near-linear time Gaussian process optimization with adaptive batching and resparsification. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • Chatterji et al. (2019) Chatterji, N., Pacchiano, A., and Bartlett, P. Online learning with kernel losses. In International Conference on Machine Learning, pp. 971–980. PMLR, 2019.
  • Chowdhury & Gopalan (2017) Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pp.  844–853, 2017.
  • De Marchi et al. (2005) De Marchi, S., Schaback, R., and Wendland, H. Near-optimal data-independent point locations for radial basis function interpolation. Advances in Computational Mathematics, 23(3):317–330, 2005.
  • DeVore et al. (2013) DeVore, R., Petrova, G., and Wojtaszczyk, P. Greedy algorithms for reduced bases in Banach spaces. Constructive Approximation, 37(3):455–466, 2013.
  • Hoffman et al. (1953) Hoffman, A., Wielandt, H., et al. The variation of the spectrum of a normal matrix. Duke Mathematical Journal, 20(1):37–39, 1953.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.
  • Lattimore et al. (2020) Lattimore, T., Szepesvari, C., and Weisz, G. Learning with good feature representations in bandits and in rl with a generative model. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • Madych & Nelson (1992) Madych, W. and Nelson, S. Bounds on multivariate polynomials and exponential error estimates for multiquadric interpolation. Journal of Approximation Theory, 70(1):94–114, 1992.
  • Müller (2009) Müller, S. Komplexität und Stabilität von kernbasierten Rekonstruktionsmethoden. PhD thesis, Fakultät für Mathematik und Informatik, Georg-August-Universität Göttingen, 2009.
  • Müller & Schaback (2009) Müller, S. and Schaback, R. A Newton basis for kernel spaces. Journal of Approximation Theory, 161(2):645–655, 2009.
  • Mutny & Krause (2018) Mutny, M. and Krause, A. Efficient high dimensional bayesian optimization with additivity and quadrature fourier features. In Advances in Neural Information Processing Systems, pp. 9005–9016, 2018.
  • Pazouki & Schaback (2011) Pazouki, M. and Schaback, R. Bases for kernel-based spaces. Journal of Computational and Applied Mathematics, 236(4):575–588, 2011.
  • Pinkus (2012) Pinkus, A. N-widths in Approximation Theory, volume 7. Springer Science & Business Media, 2012.
  • Rahimi & Recht (2008) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008.
  • Santin & Haasdonk (2017) Santin, G. and Haasdonk, B. Convergence rate of the data-independent P-greedy algorithm in kernel-based approximation. Dolomites Research Notes on Approximation, 10 (Special Issue), 2017.
  • Scarlett et al. (2017) Scarlett, J., Bogunovic, I., and Cevher, V. Lower bounds on regret for noisy Gaussian process bandit optimization. In Conference on Learning Theory, pp.  1723–1742, 2017.
  • Schaback (1995) Schaback, R. Error estimates and condition numbers for radial basis function interpolation. Advances in Computational Mathematics, 3(3):251–264, 1995.
  • Schaback & Wendland (2000) Schaback, R. and Wendland, H. Adaptive greedy techniques for approximate solution of large RBF systems. Numerical Algorithms, 24(3):239–254, 2000.
  • Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pp.  1015–1022, 2010.
  • Vakili et al. (2021) Vakili, S., Khezeli, K., and Picheny, V. On information gain and regret bounds in Gaussian process bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
  • Valko et al. (2013) Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013.
  • Wendland (1997) Wendland, H. Sobolev-type error estimates for interpolation by radial basis functions. Surface fitting and multiresolution methods, pp.  337–344, 1997.
  • Wendland (2004) Wendland, H. Scattered data approximation, volume 17. Cambridge University Press, 2004.
  • Wu & Schaback (1993) Wu, Z.-m. and Schaback, R. Local error estimates for radial basis function interpolation of scattered data. IMA Journal of Numerical Analysis, 13(1):13–27, 1993.

Appendix

In this appendix, we provide some results on Thompson Sampling based algorithms in §A, proofs of the results in §B, and the detailed experimental setting and additional experimental results in §C.

Appendix A Additional Results for Thompson Sampling

A.1 Misspecified Linear Bandit Problem

We consider a modification of Thompson Sampling (Agrawal & Goyal, 2013). In round t+1, we sample \mu_{t} from the multivariate normal distribution \mathcal{N}\left(\hat{\theta}_{t},\left(\beta(A_{t},\delta/2,\lambda)+\epsilon\psi_{t}\right)^{2}A_{t}^{-1}\right), and the modified algorithm selects x\in\mathcal{A} maximizing \langle\mu_{t},\widetilde{x}\rangle. Then the following result holds.
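One round of this modified Thompson Sampling as a sketch (names are our own; features holds the vectors \widetilde{x} for all arms, and beta is \beta(A_{t},\delta/2,\lambda)):

```python
import numpy as np

def ts_select(features, theta_hat, A_inv, beta, eps, psi, rng):
    # Sample mu_t ~ N(theta-hat_t, (beta + eps*psi_t)^2 A_t^{-1}) and pick
    # the arm maximizing <mu_t, x~>.
    cov = (beta + eps * psi) ** 2 * A_inv
    mu = rng.multivariate_normal(theta_hat, cov)
    return int(np.argmax(features @ mu))
```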

Proposition 18.

We assume that λ1\lambda\geq 1. Then, with probability at least 1δ1-\delta, the modified Thompson Sampling algorithm incurs regret upper bounded by

O~(log(|𝒜|){DT+DTlog(1/δ)+log(1/δ)T+(DT+TDlog(1/δ))ϵ}).\widetilde{O}\left(\sqrt{\log(|\mathcal{A}|)}\left\{D\sqrt{T}+\sqrt{DT\log(1/\delta)}+\log(1/\delta)\sqrt{T}+\left(DT+T\sqrt{D\log(1/\delta)}\right)\epsilon\right\}\right).

We provide a proof of the proposition in §B.4.

A.2 Thompson Sampling for the Stochastic RKHS Bandit Problem

We provide a result on a Thompson Sampling based algorithm for the stochastic RKHS bandit problem.

Theorem 19.

We reduce the RKHS bandit problem to the misspecified linear bandit problem and apply the modified Thompson Sampling introduced above with admissible error 𝔢=αTq\mathfrak{e}=\frac{\alpha}{T^{q}}, where q1/2q\geq 1/2. We denote by RAPG-TS(T)R_{\text{APG-TS}}(T) its regret and assume that λ1\lambda\geq 1. Then, with probability at least 1δ1-\delta, RAPG-TS(T)R_{\text{APG-TS}}(T) is upper bounded by

O~(log(|𝒜|)(Dq,α(T)T+Dq,α(T)Tlog(1/δ)+log(1/δ)T)).\widetilde{O}\left(\sqrt{\log(|\mathcal{A}|)}\left(D_{q,\alpha}(T)\sqrt{T}+\sqrt{D_{q,\alpha}(T)T\log(1/\delta)}+\log(1/\delta)\sqrt{T}\right)\right).

The total computational complexity of the algorithm is given as O(|𝒜|TDq,α2(T)+TDq,α3(T))O(|\mathcal{A}|TD_{q,\alpha}^{2}(T)+TD_{q,\alpha}^{3}(T)).

We provide a proof of the theorem in §B.5.

Appendix B Proofs

We provide the proofs omitted in the main article and in §A.

B.1 Proof of Corollary 7

For completeness, we provide a proof of Corollary 7.

Proof.

For simplicity, we consider only the infinite smoothness case. We use the same notation as in Theorem 6 and Algorithm 1. Denote by DD the number of points returned by the algorithm with error 𝔢=α/Tq\mathfrak{e}=\alpha/T^{q}. Since the statement of the corollary is obvious if D=1D=1, we assume D>1D>1. Because the condition maxxΩ^Pm(x)<α/Tq\max_{x\in\widehat{\Omega}}P_{m}(x)<\alpha/T^{q} is satisfied only when mDm\geq D, we have α/TqmaxxΩ^PD1(x)\alpha/T^{q}\leq\max_{x\in\widehat{\Omega}}P_{D-1}(x). If we run the algorithm with error 𝔢=maxxΩ^PD1(x)+ϵ\mathfrak{e}=\max_{x\in\widehat{\Omega}}P_{D-1}(x)+\epsilon with sufficiently small ϵ>0\epsilon>0, then the algorithm returns D1D-1 points. Therefore by the theorem and the inequality above, we have

α/TqmaxxΩ^PD1(x)<C1^exp(C2^(D1)1/d).\alpha/T^{q}\leq\max_{x\in\widehat{\Omega}}P_{D-1}(x)<\hat{C_{1}}\exp\left(-\hat{C_{2}}(D-1)^{1/d}\right).

Ignoring constants other than α,T,q\alpha,T,q, we have the assertion of the corollary: taking logarithms in the last display gives $\hat{C_{2}}(D-1)^{1/d}\leq\log(\hat{C_{1}}T^{q}/\alpha)$, i.e., $D\leq 1+\hat{C_{2}}^{-d}\log^{d}(\hat{C_{1}}T^{q}/\alpha)$. ∎

B.2 Proof of Proposition 8

For symmetric matrices P,Qn×nP,Q\in\mathbb{R}^{n\times n}, we write PQP\geq Q if and only if PQP-Q is positive semi-definite, i.e., xT(PQ)x0x^{T}(P-Q)x\geq 0 for all xnx\in\mathbb{R}^{n}. For completeness, we prove the following elementary lemma.

Lemma 20.

Let P,Qn×nP,Q\in\mathbb{R}^{n\times n} be symmetric matrices of size nn and assume that 0<PQ0<P\leq Q. Then we have Q1P1Q^{-1}\leq P^{-1}.

Proof.

It is enough to prove the statement for UTPUU^{\mathrm{T}}PU and UTQUU^{\mathrm{T}}QU for some UGLn()U\in\mathrm{GL}_{n}(\mathbb{R}), where GLn()\mathrm{GL}_{n}(\mathbb{R}) is the general linear group of size nn. Since PP is positive definite, using Cholesky decomposition, one can prove that there exists UGLn()U\in\mathrm{GL}_{n}(\mathbb{R}) such that UTPU=1nU^{\mathrm{T}}PU=1_{n} and UTQU=ΛU^{\mathrm{T}}QU=\Lambda is a diagonal matrix. Then, the assumption implies that every diagonal entry of Λ\Lambda is greater than or equal to 11. Now, the statement is obvious. ∎
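As a quick numerical illustration of Lemma 20 (a sanity check only, not part of the proof), one can draw random symmetric matrices with 0<PQ0<P\leq Q and verify that P1Q1P^{-1}-Q^{-1} is positive semi-definite:

```python
# Numerical check of Lemma 20: 0 < P <= Q implies Q^{-1} <= P^{-1}.
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4))
P = M @ M.T + np.eye(4)        # P is positive definite
N = rng.normal(size=(4, 4))
Q = P + N @ N.T                # Q >= P by construction
diff = np.linalg.inv(P) - np.linalg.inv(Q)
print(np.linalg.eigvalsh(diff).min() >= -1e-10)  # True: P^{-1} - Q^{-1} >= 0
```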

Next, we prove that x~,θt+x~At1(βt+ϵψt)\langle\widetilde{x},\theta_{t}\rangle+\|\widetilde{x}\|_{A_{t}^{-1}}(\beta_{t}+\epsilon\psi_{t}) is a UCB up to a constant.

Lemma 21.

We assume λ1\lambda\geq 1. Then, with probability at least 1δ1-\delta, we have

|x~,θtx~,θ|x~At1(βt+ϵψt),|\langle\widetilde{x},\theta_{t}\rangle-\langle\widetilde{x},\theta\rangle|\leq\|\widetilde{x}\|_{A_{t}^{-1}}(\beta_{t}+\epsilon\psi_{t}),

for any tt and x~D\widetilde{x}\in\mathbb{R}^{D}.

Proof.

By the proof of (Abbasi-Yadkori et al., 2011, Theorem 2), we have

|x~,θtx~,θ|\displaystyle|\langle\widetilde{x},\theta_{t}\rangle-\langle\widetilde{x},\theta\rangle| x~At1(s=1tx~s(εs+ω(xs))At1+λ1/2θ)\displaystyle\leq\|\widetilde{x}\|_{A_{t}^{-1}}\left(\left\|\sum_{s=1}^{t}\widetilde{x}_{s}(\varepsilon_{s}+\omega(x_{s}))\right\|_{A_{t}^{-1}}+\lambda^{1/2}\|\theta\|\right)
x~At1(s=1tx~sεsAt1+λ1/2θ+ϵs=1tx~sAt1).\displaystyle\leq\|\widetilde{x}\|_{A_{t}^{-1}}\left(\left\|\sum_{s=1}^{t}\widetilde{x}_{s}\varepsilon_{s}\right\|_{A_{t}^{-1}}+\lambda^{1/2}\|\theta\|+\epsilon\sum_{s=1}^{t}\|\widetilde{x}_{s}\|_{A_{t}^{-1}}\right).

By the self-normalized concentration inequality (Abbasi-Yadkori et al., 2011), with probability at least 1δ1-\delta, we have

|x~,θtx~,θ|x~At1(βt+ϵs=1tx~sAt1).|\langle\widetilde{x},\theta_{t}\rangle-\langle\widetilde{x},\theta\rangle|\leq\|\widetilde{x}\|_{A_{t}^{-1}}\left(\beta_{t}+\epsilon\sum_{s=1}^{t}\|\widetilde{x}_{s}\|_{A_{t}^{-1}}\right).

Since As1AtA_{s-1}\leq A_{t} for any sts\leq t, by Lemma 20, we have s=1tx~sAt1ψt\sum_{s=1}^{t}\|\widetilde{x}_{s}\|_{A_{t}^{-1}}\leq\psi_{t}. This completes the proof. ∎

Proof of Proposition 8.

We assume λ1\lambda\geq 1. Let x:=argmaxx𝒜f(x)x^{*}:=\operatorname{argmax}_{x\in\mathcal{A}}f(x) and (xt)t\left(x_{t}\right)_{t} be a sequence of arms selected by the algorithm. Denote by EE the event on which the inequality in Lemma 21 holds for all tt and x~\widetilde{x}. Then on event EE, we have

f(x)f(xt)\displaystyle f(x^{*})-f(x_{t}) 2ϵ+x~,θx~t,θ\displaystyle\leq 2\epsilon+\langle\widetilde{x}^{*},\theta\rangle-\langle\widetilde{x}_{t},\theta\rangle
2ϵ+x~,θt+x~At1(βt+ϵψt)x~t,θ\displaystyle\leq 2\epsilon+\langle\widetilde{x}^{*},\theta_{t}\rangle+\|\widetilde{x}^{*}\|_{A_{t}^{-1}}(\beta_{t}+\epsilon\psi_{t})-\langle\widetilde{x}_{t},\theta\rangle
2ϵ+x~t,θt+x~tAt1(βt+ϵψt)(x~t,θtx~tAt1(βt+ϵψt))\displaystyle\leq 2\epsilon+\langle\widetilde{x}_{t},\theta_{t}\rangle+\|\widetilde{x}_{t}\|_{A_{t}^{-1}}(\beta_{t}+\epsilon\psi_{t})-\left(\langle\widetilde{x}_{t},\theta_{t}\rangle-\|\widetilde{x}_{t}\|_{A_{t}^{-1}}(\beta_{t}+\epsilon\psi_{t})\right)
=2ϵ+2x~tAt1(βt+ϵψt).\displaystyle=2\epsilon+2\|\widetilde{x}_{t}\|_{A_{t}^{-1}}(\beta_{t}+\epsilon\psi_{t}).

Therefore, on event EE,

R(T)\displaystyle R(T) 2ϵT+2βTt=1Tx~tAt11+2ϵt=1Tx~tAt11ψt\displaystyle\leq 2\epsilon T+2\beta_{T}\sum_{t=1}^{T}\|\widetilde{x}_{t}\|_{A_{t-1}^{-1}}+2\epsilon\sum_{t=1}^{T}\|\widetilde{x}_{t}\|_{A_{t-1}^{-1}}\psi_{t}
2ϵT+2βTTt=1Tx~tAt112+2ϵψTt=1Tx~tAt11\displaystyle\leq 2\epsilon T+2\beta_{T}\sqrt{T}\sqrt{\sum_{t=1}^{T}\|\widetilde{x}_{t}\|^{2}_{A_{t-1}^{-1}}}+2\epsilon\psi_{T}\sum_{t=1}^{T}\|\widetilde{x}_{t}\|_{A_{t-1}^{-1}}
=2ϵT+2βTTt=1Tx~tAt112+2ϵ(t=1Tx~tAt11)2\displaystyle=2\epsilon T+2\beta_{T}\sqrt{T}\sqrt{\sum_{t=1}^{T}\|\widetilde{x}_{t}\|^{2}_{A_{t-1}^{-1}}}+2\epsilon\left(\sum_{t=1}^{T}\|\widetilde{x}_{t}\|_{A_{t-1}^{-1}}\right)^{2}
2ϵT+2βTTt=1Tx~tAt112+2ϵT(t=1Tx~tAt112).\displaystyle\leq 2\epsilon T+2\beta_{T}\sqrt{T}\sqrt{\sum_{t=1}^{T}\|\widetilde{x}_{t}\|^{2}_{A_{t-1}^{-1}}}+2\epsilon T\left(\sum_{t=1}^{T}\|\widetilde{x}_{t}\|^{2}_{A_{t-1}^{-1}}\right).

By assumptions, we have x~At11x~A01=λ1/2x~21\|\widetilde{x}\|_{A_{t-1}^{-1}}\leq\|\widetilde{x}\|_{A_{0}^{-1}}=\lambda^{-1/2}\|\widetilde{x}\|_{2}\leq 1 for any x𝒜x\in\mathcal{A}. Therefore, by (Abbasi-Yadkori et al., 2011, Lemma 11), the following inequalities hold:

t=1Tx~tAt1122log(det(λ1AT))2Dlog(1+TλD),βtRDlog(1+TλD)+2log(1/δ)+λB.\sum_{t=1}^{T}\|\widetilde{x}_{t}\|^{2}_{A_{t-1}^{-1}}\leq 2\log(\det(\lambda^{-1}A_{T}))\leq 2D\log\left(1+\frac{T}{\lambda D}\right),\quad\beta_{t}\leq R\sqrt{D\log\left(1+\frac{T}{\lambda D}\right)+2\log(1/\delta)}+\sqrt{\lambda}B. (1)

Thus, on event EE, we have

R(T)\displaystyle R(T) 2βTT2log(det(λ1AT))+2ϵT+4ϵTlog(det(λ1AT))\displaystyle\leq 2\beta_{T}\sqrt{T}\sqrt{2\log(\det(\lambda^{-1}A_{T}))}+2\epsilon T+4\epsilon T\log(\det(\lambda^{-1}A_{T}))
=O~(ϵT+(D+log(1/δ))DT+ϵDT)\displaystyle=\widetilde{O}\left(\epsilon T+(\sqrt{D}+\sqrt{\log(1/\delta)})\sqrt{DT}+\epsilon DT\right)
=O~(DT+DTlog(1/δ)+ϵDT).\displaystyle=\widetilde{O}\left(D\sqrt{T}+\sqrt{DT\log(1/\delta)}+\epsilon DT\right). ∎

B.3 Proof of Proposition 11

This proposition can be proved by adapting the standard proof of the adversarial linear bandit problem (Lattimore & Szepesvári, 2020; Bubeck & Cesa-Bianchi, 2012). We recall notation for the adversarial misspecified linear bandit problem and EXP3. Let 𝒜xx~{ξD:ξ1}\mathcal{A}\ni x\mapsto\widetilde{x}\in\{\xi\in\mathbb{R}^{D}:\|\xi\|\leq 1\} be a map, {gt}t=1T\{g_{t}\}_{t=1}^{T} be a sequence of reward functions on 𝒜\mathcal{A} such that gt(x)=θt,x~+ωt(x)g_{t}(x)=\langle\theta_{t},\widetilde{x}\rangle+\omega_{t}(x) for x𝒜x\in\mathcal{A}, where θtD\theta_{t}\in\mathbb{R}^{D}, supx𝒜|gt(x)|,θtB\sup_{x\in\mathcal{A}}|g_{t}(x)|,\|\theta_{t}\|\leq B, and supx𝒜|ωt(x)|ϵ\sup_{x\in\mathcal{A}}|\omega_{t}(x)|\leq\epsilon.

Let γ(0,1)\gamma\in(0,1) be an exploration parameter, η>0\eta>0 a learning rate, and πexp\pi_{\mathrm{exp}} an exploration distribution over 𝒜\mathcal{A}. For a distribution π\pi over 𝒜\mathcal{A}, we put Q(π)=x𝒜π(x)x~x~TQ(\pi)=\sum_{x\in\mathcal{A}}\pi(x)\widetilde{x}\widetilde{x}^{\mathrm{T}}. We define ϕt=gt(xt)Qt1x~t\phi_{t}=g_{t}(x_{t})Q_{t}^{-1}\widetilde{x}_{t} and ϕt(x)=ϕt,x~\phi_{t}(x)=\langle\phi_{t},\widetilde{x}\rangle for x𝒜x\in\mathcal{A}, where the matrix QtQ_{t} can be computed from the past observations at round tt and is defined later. Let qtq_{t} be a distribution over 𝒜\mathcal{A} such that qt(x)exp(ηs=1t1ϕs(x))q_{t}(x)\propto\exp\left(\eta\sum_{s=1}^{t-1}\phi_{s}(x)\right) and put pt(x)=γπexp(x)+(1γ)qt(x)p_{t}(x)=\gamma\pi_{\mathrm{exp}}(x)+(1-\gamma)q_{t}(x) for x𝒜x\in\mathcal{A}. We assume that Q(πexp)Q(\pi_{\mathrm{exp}}) is non-singular and define Qt=Q(pt)Q_{t}=Q(p_{t}). For a distribution π\pi over 𝒜\mathcal{A}, we define Γ(π)=supx𝒜x~TQ(π)1x~\Gamma(\pi)=\sup_{x\in\mathcal{A}}\widetilde{x}^{\mathrm{T}}Q(\pi)^{-1}\widetilde{x} and in this section we assume Γ(πexp)D\Gamma(\pi_{\mathrm{exp}})\leq D.
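For concreteness, the following is a minimal sketch of the EXP3-type procedure just described, under illustrative assumptions: πexp\pi_{\mathrm{exp}} is uniform, the feature vectors span D\mathbb{R}^{D} (so Q(πexp)Q(\pi_{\mathrm{exp}}) is non-singular), and reward(t, arm) is a stand-in oracle returning gt(xt)g_{t}(x_{t}); all names are ours, not the paper's implementation.

```python
# A sketch of EXP3 for the misspecified linear bandit (illustrative only).
import numpy as np

def exp3_misspecified(features, reward, T, eta, B, rng):
    n, D = features.shape
    pi_exp = np.full(n, 1.0 / n)                          # uniform exploration dist.
    Q_exp = features.T @ (pi_exp[:, None] * features)     # Q(pi_exp)
    Gamma = max(x @ np.linalg.solve(Q_exp, x) for x in features)
    gamma = eta * B * Gamma                               # as in Lemma 25
    # (eta is assumed small enough that gamma < 1)
    scores = np.zeros(n)                                  # eta * sum_s phi_s(x)
    for t in range(T):
        q = np.exp(scores - scores.max()); q /= q.sum()   # q_t (stable softmax)
        p = gamma * pi_exp + (1.0 - gamma) * q            # p_t
        Qt = features.T @ (p[:, None] * features)         # Q_t = Q(p_t)
        arm = rng.choice(n, p=p)                          # x_t ~ p_t
        g = reward(t, arm)                                # observe g_t(x_t)
        phi = g * np.linalg.solve(Qt, features[arm])      # phi_t = g_t(x_t) Q_t^{-1} x~_t
        scores += eta * (features @ phi)                  # accumulate phi_t(x)
    return scores
```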

Let x=argmaxx𝒜t=1Tgt(x)x_{*}=\operatorname{argmax}_{x\in\mathcal{A}}\sum_{t=1}^{T}g_{t}(x) be an optimal arm; the regret is defined as R(T)=t=1Tgt(x)t=1Tgt(xt)R(T)=\sum_{t=1}^{T}g_{t}(x_{*})-\sum_{t=1}^{T}g_{t}(x_{t}). We have

𝐄[R(T)]\displaystyle\mathbf{E}\left[R(T)\right] =𝐄[t=1T(θt,x~θt,x~t)]+𝐄[t=1T(ωt(x)ωt(xt))]\displaystyle=\mathbf{E}\left[\sum_{t=1}^{T}\left(\langle\theta_{t},\widetilde{x}_{*}\rangle-\langle\theta_{t},\widetilde{x}_{t}\rangle\right)\right]+\mathbf{E}\left[\sum_{t=1}^{T}\left(\omega_{t}(x_{*})-\omega_{t}(x_{t})\right)\right]
2ϵT+𝐄[t=1T(θt,x~θt,x~t)].\displaystyle\leq 2\epsilon T+\mathbf{E}\left[\sum_{t=1}^{T}\left(\langle\theta_{t},\widetilde{x}_{*}\rangle-\langle\theta_{t},\widetilde{x}_{t}\rangle\right)\right]. (2)

We denote by t1\mathcal{H}_{t-1} the sigma field generated by x1,,xt1x_{1},\dots,x_{t-1} and by 𝐄t1\mathbf{E}_{t-1} the conditional expectation conditioned on t1\mathcal{H}_{t-1}. We note that pt(x),qt(x)p_{t}(x),q_{t}(x) for x𝒜x\in\mathcal{A} and QtQ_{t} are t1\mathcal{H}_{t-1}-measurable but ϕt\phi_{t} is not. Then we have

𝐄[t=1Tθt,x~t]=𝐄[t=1T𝐄t1[θt,x~t]]=𝐄[t=1Tx𝒜pt(x)θt,x~]\displaystyle\mathbf{E}\left[\sum_{t=1}^{T}\langle\theta_{t},\widetilde{x}_{t}\rangle\right]=\mathbf{E}\left[\sum_{t=1}^{T}\mathbf{E}_{t-1}\left[\langle\theta_{t},\widetilde{x}_{t}\rangle\right]\right]=\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}p_{t}(x){\langle\theta_{t},\widetilde{x}\rangle}\right]
=γ𝐄[t=1Tx𝒜πexp(x)θt,x~]+(1γ)𝐄[t=1Tx𝒜qt(x)θt,x~]\displaystyle=\gamma\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}{\pi_{\mathrm{exp}}(x)\langle\theta_{t},\widetilde{x}\rangle}\right]+(1-\gamma)\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}{q_{t}(x)\langle\theta_{t},\widetilde{x}\rangle}\right]
γBT+(1γ)S.\displaystyle\geq-\gamma BT+(1-\gamma)S. (3)

Here we used |θt,x~|θtx~B|\langle\theta_{t},\widetilde{x}\rangle|\leq\|\theta_{t}\|\|\widetilde{x}\|\leq B and SS is defined as 𝐄[t=1Tx𝒜qt(x)θt,x~]\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}{q_{t}(x)\langle\theta_{t},\widetilde{x}\rangle}\right]. Since t=1Tθt,x~γBT+(1γ)t=1Tθt,x~\sum_{t=1}^{T}\langle\theta_{t},\widetilde{x}_{*}\rangle\leq\gamma BT+(1-\gamma)\sum_{t=1}^{T}\langle\theta_{t},\widetilde{x}_{*}\rangle, by inequalities (2), (3), we have

𝐄[R(T)]2ϵT+2γBT+(1γ)(t=1Tθt,x~S).\mathbf{E}\left[R(T)\right]\leq 2\epsilon T+2\gamma BT+(1-\gamma)\left(\sum_{t=1}^{T}\langle\theta_{t},\widetilde{x}_{*}\rangle-S\right). (4)

We decompose S=S1+S2S=S_{1}+S_{2}, where

S1=𝐄[t=1Tx𝒜qt(x)ϕt,x~],S2=𝐄[t=1Tx𝒜qt(x)θtϕt,x~].S_{1}=\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}{q_{t}(x)\langle\phi_{t},\widetilde{x}\rangle}\right],\quad S_{2}=\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}{q_{t}(x)\langle\theta_{t}-\phi_{t},\widetilde{x}\rangle}\right].

First, we bound |S2||S_{2}|. To do this, we prove the following lemma.

Lemma 22.

For any x𝒜x\in\mathcal{A}, the following inequality holds:

|𝐄t1[ϕtθt,x~]|ϵΓ(πexp)γ.\left|\mathbf{E}_{t-1}\left[\langle\phi_{t}-\theta_{t},\widetilde{x}\rangle\right]\right|\leq\frac{\epsilon\Gamma(\pi_{\mathrm{exp}})}{\gamma}.

In particular, we have |𝐄[ϕtθt,x~]|ϵΓ(πexp)γ\left|\mathbf{E}\left[\langle\phi_{t}-\theta_{t},\widetilde{x}\rangle\right]\right|\leq\frac{\epsilon\Gamma(\pi_{\mathrm{exp}})}{\gamma}.

Proof.

We note that by conditioning on t1\mathcal{H}_{t-1}, randomness comes only from xtx_{t}. By definition of ϕt\phi_{t}, we have

𝐄t1[ϕt,x~]\displaystyle\mathbf{E}_{t-1}\left[\langle\phi_{t},\widetilde{x}\rangle\right] =𝐄t1[(θt,x~t+ωt(xt))Qt1x~t,x~]\displaystyle=\mathbf{E}_{t-1}\left[\langle\left(\langle\theta_{t},\widetilde{x}_{t}\rangle+\omega_{t}(x_{t})\right)Q_{t}^{-1}\widetilde{x}_{t},\widetilde{x}\rangle\right]
=𝐄t1[x~TQt1x~tx~tTθt]+𝐄t1[ωt(xt)x~TQt1x~t]\displaystyle=\mathbf{E}_{t-1}\left[\widetilde{x}^{\mathrm{T}}Q_{t}^{-1}\widetilde{x}_{t}\widetilde{x}_{t}^{\mathrm{T}}\theta_{t}\right]+\mathbf{E}_{t-1}\left[\omega_{t}(x_{t})\widetilde{x}^{\mathrm{T}}Q_{t}^{-1}\widetilde{x}_{t}\right]
=θt,x~+𝐄t1[ωt(xt)x~TQt1x~t].\displaystyle=\langle\theta_{t},\widetilde{x}\rangle+\mathbf{E}_{t-1}\left[\omega_{t}(x_{t})\widetilde{x}^{\mathrm{T}}Q_{t}^{-1}\widetilde{x}_{t}\right].

Therefore,

|𝐄t1[ϕtθt,x~]|ϵ𝐄t1[x~tQt1x~Qt1]ϵγ𝐄t1[x~tQ(πexp)1x~Q(πexp)1]ϵΓ(πexp)γ.\displaystyle\left|\mathbf{E}_{t-1}\left[\langle\phi_{t}-\theta_{t},\widetilde{x}\rangle\right]\right|\leq\epsilon\mathbf{E}_{t-1}\left[\|\widetilde{x}_{t}\|_{Q_{t}^{-1}}\|\widetilde{x}\|_{Q_{t}^{-1}}\right]\leq\frac{\epsilon}{\gamma}\mathbf{E}_{t-1}\left[\|\widetilde{x}_{t}\|_{Q(\pi_{\mathrm{exp}})^{-1}}\|\widetilde{x}\|_{Q(\pi_{\mathrm{exp}})^{-1}}\right]\leq\frac{\epsilon\Gamma(\pi_{\mathrm{exp}})}{\gamma}.

Here in the second inequality, we use γQ(πexp)Qt\gamma Q(\pi_{\mathrm{exp}})\leq Q_{t} and the last inequality follows from the definition of Γ(πexp)\Gamma(\pi_{\mathrm{exp}}). The second assertion follows from

|𝐄[ϕtθt,x~]|𝐄[|𝐄t1[ϕtθt,x~]|]ϵΓ(πexp)γ.\left|\mathbf{E}\left[\langle\phi_{t}-\theta_{t},\widetilde{x}\rangle\right]\right|\leq\mathbf{E}\left[\left|\mathbf{E}_{t-1}\left[\langle\phi_{t}-\theta_{t},\widetilde{x}\rangle\right]\right|\right]\leq\frac{\epsilon\Gamma(\pi_{\mathrm{exp}})}{\gamma}. ∎

By this lemma, we can bound S2S_{2} as follows.

Lemma 23.

The following inequality holds:

|S2|ϵTΓ(πexp)γ.|S_{2}|\leq\frac{\epsilon T\Gamma(\pi_{\mathrm{exp}})}{\gamma}.
Proof.

Since qt(x)q_{t}(x) is t1\mathcal{H}_{t-1}-measurable for any x𝒜x\in\mathcal{A}, we have

S2=𝐄[t=1Tx𝒜qt(x)𝐄t1[θtϕt,x~]].\displaystyle S_{2}=\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}q_{t}(x)\mathbf{E}_{t-1}\left[\langle\theta_{t}-\phi_{t},\widetilde{x}\rangle\right]\right].

Therefore, we have

|S2|𝐄[t=1Tx𝒜qt(x)|𝐄t1[θtϕt,x~]|]ϵTΓ(πexp)γ.\displaystyle|S_{2}|\leq\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}q_{t}(x)\left|\mathbf{E}_{t-1}\left[\langle\theta_{t}-\phi_{t},\widetilde{x}\rangle\right]\right|\right]\leq\frac{\epsilon T\Gamma(\pi_{\mathrm{exp}})}{\gamma}.

Here we used Lemma 22 in the last inequality. ∎

Next, we introduce the following elementary lemma (cf. Chatterji et al. (2019, Lemma 49)).

Lemma 24.

Let η>0\eta>0 and XX be a random variable. We assume that ηX1\eta X\leq 1 almost surely. Then we have

𝐄[X]1ηlog(𝐄[exp(ηX)])(e2)η𝐄[X2].\mathbf{E}\left[X\right]\geq\frac{1}{\eta}\log\left(\mathbf{E}\left[\exp(\eta X)\right]\right)-(e-2)\eta\mathbf{E}\left[X^{2}\right].
Proof.

By log(x)x1\log(x)\leq x-1 for x>0x>0 and exp(y)1+y+(e2)y2\exp(y)\leq 1+y+(e-2)y^{2} for y1y\leq 1, we have

log𝐄[exp(ηX)]𝐄[exp(ηX)]1𝐄[ηX+(e2)η2X2].\displaystyle\log\mathbf{E}\left[\exp(\eta X)\right]\leq\mathbf{E}\left[\exp(\eta X)\right]-1\leq\mathbf{E}\left[\eta X+(e-2)\eta^{2}X^{2}\right].

Dividing by η\eta and rearranging yields the assertion. ∎

To apply the lemma with X=ϕt,x~X=\langle\phi_{t},\widetilde{x}\rangle and 𝐄=𝐄xqt\mathbf{E}=\mathbf{E}_{x\sim q_{t}}, we prove the following:

Lemma 25.

Let x𝒜x\in\mathcal{A} and assume that γ=ηBΓ(πexp)\gamma=\eta B\Gamma(\pi_{\mathrm{exp}}). Then, we have η|ϕt,x~|1\eta|\langle\phi_{t},\widetilde{x}\rangle|\leq 1.

Proof.

By definition of ϕt\phi_{t}, we have

η|gt(xt)x~tTQt1x~|ηBx~tQt1x~Qt1ηBΓ(πexp)γ.\displaystyle\eta|g_{t}(x_{t})\widetilde{x}_{t}^{\mathrm{T}}Q_{t}^{-1}\widetilde{x}|\leq\eta B\|\widetilde{x}_{t}\|_{Q_{t}^{-1}}\|\widetilde{x}\|_{Q_{t}^{-1}}\leq\frac{\eta B\Gamma(\pi_{\mathrm{exp}})}{\gamma}.

Here, in the last inequality, we use γQ(πexp)Qt\gamma Q(\pi_{\mathrm{exp}})\leq Q_{t} and the definition of Γ(πexp)\Gamma(\pi_{\mathrm{exp}}). ∎

By Lemma 24 and Lemma 25, we obtain the following.

S1U1η(e2)ηU2,S_{1}\geq\frac{U_{1}}{\eta}-(e-2)\eta U_{2}, (5)

where U1U_{1} and U2U_{2} are given as

U1=𝐄[t=1Tlog(x𝒜qt(x)exp(ηϕt,x~))],U2=𝐄[t=1Tx𝒜qt(x)ϕt,x~2].\displaystyle U_{1}=\mathbf{E}\left[\sum_{t=1}^{T}\log\left(\sum_{x\in\mathcal{A}}{q_{t}(x)\exp\left(\eta\langle\phi_{t},\widetilde{x}\rangle\right)}\right)\right],\quad U_{2}=\mathbf{E}\left[\sum_{t=1}^{T}\sum_{x\in\mathcal{A}}{q_{t}(x)\langle\phi_{t},\widetilde{x}\rangle^{2}}\right].

We bound |U2||U_{2}| as follows.

Lemma 26.

The following inequality holds:

|U2|B2DT1γ.|U_{2}|\leq\frac{B^{2}DT}{1-\gamma}.
Proof.

By definition of ϕt\phi_{t}, we have

x𝒜qt(x)ϕt,x~2B2x𝒜qt(x)x~tTQt1x~x~TQt1x~t=B2x~tTQt1Q(qt)Qt1x~t\displaystyle\sum_{x\in\mathcal{A}}q_{t}(x)\langle\phi_{t},\widetilde{x}\rangle^{2}\leq B^{2}\sum_{x\in\mathcal{A}}q_{t}(x)\widetilde{x}_{t}^{\mathrm{T}}Q_{t}^{-1}\widetilde{x}\widetilde{x}^{\mathrm{T}}Q_{t}^{-1}\widetilde{x}_{t}=B^{2}\widetilde{x}_{t}^{\mathrm{T}}Q_{t}^{-1}Q(q_{t})Q_{t}^{-1}\widetilde{x}_{t}
B21γx~tTQt1x~t.\displaystyle\leq\frac{B^{2}}{1-\gamma}\widetilde{x}_{t}^{\mathrm{T}}Q_{t}^{-1}\widetilde{x}_{t}.

Here the last inequality follows from (1γ)Q(qt)Qt(1-\gamma)Q(q_{t})\leq Q_{t}. Therefore,

𝐄[x𝒜qt(x)ϕt,x~2]B21γ𝐄[x~tTQt1x~t]=B21γ𝐄[Tr(x~tx~tTQt1)]\displaystyle\mathbf{E}\left[\sum_{x\in\mathcal{A}}q_{t}(x)\langle\phi_{t},\widetilde{x}\rangle^{2}\right]\leq\frac{B^{2}}{1-\gamma}\mathbf{E}\left[\widetilde{x}_{t}^{\mathrm{T}}Q_{t}^{-1}\widetilde{x}_{t}\right]=\frac{B^{2}}{1-\gamma}\mathbf{E}\left[\operatorname{Tr}\left(\widetilde{x}_{t}\widetilde{x}_{t}^{\mathrm{T}}Q_{t}^{-1}\right)\right]
=B21γ𝐄[Tr(QtQt1)]=B2D1γ.\displaystyle=\frac{B^{2}}{1-\gamma}\mathbf{E}\left[\operatorname{Tr}\left(Q_{t}Q_{t}^{-1}\right)\right]=\frac{B^{2}D}{1-\gamma}.

Here the second equality follows from the fact that QtQ_{t} is t1\mathcal{H}_{t-1}-measurable and the linearity of the trace. The assertion of the lemma follows from this. ∎

Next, we give a lower bound for U1U_{1}.

Lemma 27.

Let x0𝒜x_{0}\in\mathcal{A} be any element. Then the following inequality holds:

U1η𝐄[t=1Tϕt,x~0]log(|𝒜|).U_{1}\geq\eta\mathbf{E}\left[\sum_{t=1}^{T}\langle\phi_{t},\widetilde{x}_{0}\rangle\right]-\log(|\mathcal{A}|).
Proof.

By definition of qtq_{t}, we have

U1\displaystyle U_{1} =𝐄[t=1T{log(x𝒜exp(ηs=1tϕs,x~))log(x𝒜exp(ηs=1t1ϕs,x~))}]\displaystyle=\mathbf{E}\left[\sum_{t=1}^{T}\left\{\log\left(\sum_{x\in\mathcal{A}}\exp\left(\eta\sum_{s=1}^{t}\langle\phi_{s},\widetilde{x}\rangle\right)\right)-\log\left(\sum_{x\in\mathcal{A}}\exp\left(\eta\sum_{s=1}^{t-1}\langle\phi_{s},\widetilde{x}\rangle\right)\right)\right\}\right]
=𝐄[log(x𝒜exp(ηs=1Tϕs,x~))]log|𝒜|\displaystyle=\mathbf{E}\left[\log\left(\sum_{x\in\mathcal{A}}\exp\left(\eta\sum_{s=1}^{T}\langle\phi_{s},\widetilde{x}\rangle\right)\right)\right]-\log|\mathcal{A}|
η𝐄[t=1Tϕt,x~0]log|𝒜|.\displaystyle\geq\eta\mathbf{E}\left[\sum_{t=1}^{T}\langle\phi_{t},\widetilde{x}_{0}\rangle\right]-\log|\mathcal{A}|.

Here the last inequality follows by keeping only the term x=x0x=x_{0} in the sum. ∎

Proof of Proposition 11.

We assume γ=ηBΓ(πexp)\gamma=\eta B\Gamma(\pi_{\mathrm{exp}}). By (4), we have

𝐄[R(T)]2ϵT+2γBT+(1γ)𝐄[t=1Tθt,x~](1γ)S1+(1γ)|S2|\displaystyle\mathbf{E}\left[R(T)\right]\leq 2\epsilon T+2\gamma BT+(1-\gamma)\mathbf{E}\left[\sum_{t=1}^{T}\langle\theta_{t},\widetilde{x}_{*}\rangle\right]-(1-\gamma)S_{1}+(1-\gamma)|S_{2}|

By inequality (5), Lemma 23, and Lemma 27 with x0=xx_{0}=x_{*}, we have

𝐄[R(T)]2ϵT+2γBT+(1γ)𝐄[t=1Tθtϕt,x~]+1γηlog|𝒜|+(e2)η(1γ)U2+ϵTΓ(πexp)γ\displaystyle\mathbf{E}\left[R(T)\right]\leq 2\epsilon T+2\gamma BT+(1-\gamma)\mathbf{E}\left[\sum_{t=1}^{T}\langle\theta_{t}-\phi_{t},\widetilde{x}_{*}\rangle\right]+\frac{1-\gamma}{\eta}\log|\mathcal{A}|+(e-2)\eta(1-\gamma)U_{2}+\frac{\epsilon T\Gamma(\pi_{\mathrm{exp}})}{\gamma}
2ϵT+2γBT+ϵTΓ(πexp)(1γ)γ+log|𝒜|η+(e2)ηB2DT+ϵTΓ(πexp)γ\displaystyle\leq 2\epsilon T+2\gamma BT+\frac{\epsilon T\Gamma(\pi_{\mathrm{exp}})(1-\gamma)}{\gamma}+\frac{\log|\mathcal{A}|}{\eta}+(e-2)\eta B^{2}DT+\frac{\epsilon T\Gamma(\pi_{\mathrm{exp}})}{\gamma}
2ϵT+2γBT+log|𝒜|η+(e2)ηB2DT+2ϵTΓ(πexp)γ.\displaystyle\leq 2\epsilon T+2\gamma BT+\frac{\log|\mathcal{A}|}{\eta}+(e-2)\eta B^{2}DT+\frac{2\epsilon T\Gamma(\pi_{\mathrm{exp}})}{\gamma}.

Here in the second inequality, we used Lemma 22 and Lemma 26. By γ=ηBΓ(πexp)\gamma=\eta B\Gamma(\pi_{\mathrm{exp}}) and Γ(πexp)D\Gamma(\pi_{\mathrm{exp}})\leq D, we have

𝐄[R(T)]2ϵT+2B2ηTΓ(πexp)+2ϵTBη+log|𝒜|η+(e2)B2ηDT\displaystyle\mathbf{E}\left[R(T)\right]\leq 2\epsilon T+2B^{2}\eta T\Gamma(\pi_{\mathrm{exp}})+\frac{2\epsilon T}{B\eta}+\frac{\log|\mathcal{A}|}{\eta}+(e-2)B^{2}\eta DT
2ϵT+eB2ηDT+2ϵTBη+log|𝒜|η.\displaystyle\leq 2\epsilon T+eB^{2}\eta DT+\frac{2\epsilon T}{B\eta}+\frac{\log|\mathcal{A}|}{\eta}. ∎

B.4 Proof of Proposition 18

We assume λ1\lambda\geq 1. This can be proved by modifying the proof of (Agrawal & Goyal, 2013). Since most of their arguments are directly applicable to our case, we omit proofs of some lemmas. Let (Ψ,Pr,𝒢)(\Psi,\operatorname{Pr},\mathcal{G}) be the probability space on which all random variables considered here are defined, where 𝒢2Ψ\mathcal{G}\subset 2^{\Psi} is a σ\sigma-algebra on Ψ\Psi. We put x:=argmaxx𝒜g(x)x^{*}:=\operatorname{argmax}_{x\in\mathcal{A}}g(x) and for t=1,2,,Tt=1,2,\dots,T and x𝒜x\in\mathcal{A}, we put Δ(x):=x~,θx~,θ\Delta(x):=\langle\widetilde{x}^{*},\theta\rangle-\langle\widetilde{x},\theta\rangle. We also put vt:=lt:=β(At1,δ/(2T),λ)+ϵψt1v_{t}:=l_{t}:=\beta(A_{t-1},\delta/(2T),\lambda)+\epsilon\psi_{t-1} and gt:=4log(|𝒜|t)vt+ltg_{t}:=\sqrt{4\log(|\mathcal{A}|t)}v_{t}+l_{t}. In each round tt, μt1\mu_{t-1} is sampled from the multivariate normal distribution 𝒩(θt1,vt2At11)\mathcal{N}\left(\theta_{t-1},v_{t}^{2}A_{t-1}^{-1}\right). For t=1,,Tt=1,\dots,T, we define EtE_{t} by

Et:={ψΨ:|x~,θt1x~,θ|ltx~At11,x𝒜},E_{t}:=\left\{\psi\in\Psi:|\langle\widetilde{x},\theta_{t-1}\rangle-\langle\widetilde{x},\theta\rangle|\leq l_{t}\|\widetilde{x}\|_{A_{t-1}^{-1}},\quad\forall x\in\mathcal{A}\right\},

and define event EtE^{\prime}_{t} by

Et:={ψΨ:|x~,μt1x~,θt1|4log(|𝒜|t)vtx~At11,x𝒜}.E_{t}^{\prime}:=\left\{\psi\in\Psi:|\langle\widetilde{x},\mu_{t-1}\rangle-\langle\widetilde{x},\theta_{t-1}\rangle|\leq\sqrt{4\log(|\mathcal{A}|t)}v_{t}\|\widetilde{x}\|_{A_{t-1}^{-1}},\quad\forall x\in\mathcal{A}\right\}.

For an event GG, we denote by 1G1_{G} the corresponding indicator function. Then by assumptions, we see that Ett1E_{t}\in\mathcal{F}_{t-1}, i.e., 1Et1_{E_{t}} is t1\mathcal{F}_{t-1}-measurable. For a random variable XX on Ψ\Psi, we say “on event EtE_{t}, the conditional expectation (or conditional probability) 𝐄[Xt1]\mathbf{E}\left[X\mid\mathcal{F}_{t-1}\right] satisfies a property” if and only if 1Et𝐄[Xt1]=𝐄[1EtXt1]1_{E_{t}}\mathbf{E}\left[X\mid\mathcal{F}_{t-1}\right]=\mathbf{E}\left[1_{E_{t}}X\mid\mathcal{F}_{t-1}\right] satisfies the property for almost all ψΨ\psi\in\Psi.

Then by Lemma 21 and the proof of (Agrawal & Goyal, 2013, Lemma 1), we have

Lemma 28.

Pr(Et)1δ2T\operatorname{Pr}(E_{t})\geq 1-\frac{\delta}{2T} and Pr(Ett1)11/t2\operatorname{Pr}(E^{\prime}_{t}\mid\mathcal{F}_{t-1})\geq 1-1/t^{2} for all tt.

We note that the proof of (Agrawal & Goyal, 2013, Lemma 2) works if ltvtl_{t}\leq v_{t}, i.e., we have the following lemma:

Lemma 29.

On event EtE_{t}, we have

Pr(μt1,x~>θ,x~t1)p,\operatorname{Pr}\left(\langle\mu_{t-1},\widetilde{x}^{*}\rangle>\langle\theta,\widetilde{x}^{*}\rangle\mid\mathcal{F}_{t-1}\right)\geq p,

where p=14eπp=\frac{1}{4e\sqrt{\pi}}.

The main differences between our proof and theirs lie in the definitions of lt,vt,xl_{t},v_{t},x^{*}, and Δ(x)\Delta(x) (they define Δ(x)\Delta(x) as supy𝒜θ,y~θ,x~\sup_{y\in\mathcal{A}}\langle\theta,\widetilde{y}\rangle-\langle\theta,\widetilde{x}\rangle and we consider x=argmaxg(x)x^{*}=\operatorname{argmax}g(x) instead of argmaxxx~,θ\operatorname{argmax}_{x}\langle\widetilde{x},\theta\rangle). However, it can be verified that these differences do not matter in the arguments of Lemmas 3 and 4 in (Agrawal & Goyal, 2013). In fact, we can prove the following lemma in a similar way to the proof of (Agrawal & Goyal, 2013, Lemma 3).

Lemma 30.

We define C(t)C(t) by {x𝒜:Δ(x)>gtx~At11}\{x\in\mathcal{A}:\Delta(x)>g_{t}\|\widetilde{x}\|_{A_{t-1}^{-1}}\}. On event EtE_{t}, we have

Pr(xtC(t)t1)p1t2.\operatorname{Pr}(x_{t}\not\in C(t)\mid\mathcal{F}_{t-1})\geq p-\frac{1}{t^{2}}.

Here pp is given in Lemma 29.

Proof.

Because the algorithm selects x𝒜x\in\mathcal{A} that maximizes x~,μt1\langle\widetilde{x},\mu_{t-1}\rangle, if x~,μt1>x~,μt1\langle\widetilde{x}^{*},\mu_{t-1}\rangle>\langle\widetilde{x},\mu_{t-1}\rangle for all xC(t)x\in C(t), then we have xtC(t)x_{t}\not\in C(t). Therefore, we have

Pr(xtC(t)t1)Pr(x~,μt1>x~,μt1,xC(t)t1).\operatorname{Pr}(x_{t}\not\in C(t)\mid\mathcal{F}_{t-1})\geq\operatorname{Pr}\left(\langle\widetilde{x}^{*},\mu_{t-1}\rangle>\langle\widetilde{x},\mu_{t-1}\rangle,\forall x\in C(t)\mid\mathcal{F}_{t-1}\right). (6)

By definitions of C(t),EtC(t),E_{t}, and EtE_{t}^{\prime}, on event EtEtE_{t}\cap E_{t}^{\prime}, we have x~,μt1x~,θ+gtx~At11<x~,θ\langle\widetilde{x},\mu_{t-1}\rangle\leq\langle\widetilde{x},\theta\rangle+g_{t}\|\widetilde{x}\|_{A_{t-1}^{-1}}<\langle\widetilde{x}^{*},\theta\rangle for all xC(t)x\in C(t). Therefore, on EtEtE_{t}\cap E_{t}^{\prime}, if x~,μt1>x~,θ\langle\widetilde{x}^{*},\mu_{t-1}\rangle>\langle\widetilde{x}^{*},\theta\rangle, we have x~,μt1>x~,μt1\langle\widetilde{x}^{*},\mu_{t-1}\rangle>\langle\widetilde{x},\mu_{t-1}\rangle for all xC(t)x\in C(t). Thus we obtain the following inequalities:

Pr(x~,μt1>x~,μt1,xC(t)t1)\displaystyle\operatorname{Pr}(\langle\widetilde{x}^{*},\mu_{t-1}\rangle>\langle\widetilde{x},\mu_{t-1}\rangle,\quad\forall x\in C(t)\mid\mathcal{F}_{t-1})
Pr(x~,μt1>x~,θt1)Pr((Et)ct1)\displaystyle\geq\operatorname{Pr}(\langle\widetilde{x}^{*},\mu_{t-1}\rangle>\langle\widetilde{x}^{*},\theta\rangle\mid\mathcal{F}_{t-1})-\operatorname{Pr}((E_{t}^{\prime})^{c}\mid\mathcal{F}_{t-1})
p1/t2.\displaystyle\geq p-1/t^{2}.

Here (Et)c(E_{t}^{\prime})^{c} is the complement of EtE_{t}^{\prime} and we used Lemmas 28, 29 in the last inequality. By inequality (6), we have our assertion. ∎

We can also prove the following lemma in a similar way to the proof of (Agrawal & Goyal, 2013, Lemma 4).

Lemma 31.

On event EtE_{t}, we have

𝐄[Δ(xt)t1]c1gt𝐄[x~tAt11t1]+c2gtt2,\mathbf{E}\left[\Delta(x_{t})\mid\mathcal{F}_{t-1}\right]\leq c_{1}g_{t}\mathbf{E}\left[\|\widetilde{x}_{t}\|_{A_{t-1}^{-1}}\mid\mathcal{F}_{t-1}\right]+\frac{c_{2}g_{t}}{t^{2}},

where c1c_{1} and c2c_{2} are universal constants.

For t=1,2,,Tt=1,2,\dots,T, define random variables XtX_{t} and YtY_{t} by

Xt:=Δ(xt)1Etc1gtx~tAt11c2gtt2,Yt:=s=1tXs.X_{t}:=\Delta(x_{t})1_{E_{t}}-c_{1}g_{t}\|\widetilde{x}_{t}\|_{A_{t-1}^{-1}}-\frac{c_{2}g_{t}}{t^{2}},\quad Y_{t}:=\sum_{s=1}^{t}X_{s}.

From Lemma 31, we can prove the following lemma.

Lemma 32.

The process {Yt}t=0,,T\{Y_{t}\}_{t=0,\dots,T} is a super-martingale process w.r.t. the filtration {t}t\{\mathcal{F}_{t}\}_{t}.

Proof of Proposition 18.

By Lemma 32 and |Xt|2(B+ϵ)+(c1+c2)gt|X_{t}|\leq 2(B+\epsilon)+(c_{1}+c_{2})g_{t} (for all tt), applying the Azuma-Hoeffding inequality, we see that there exists an event GG with Pr(G)1δ/2\operatorname{Pr}(G)\geq 1-\delta/2 such that on GG, the following inequality holds:

t=1TΔ(xt)1Ett=1Tc1gtx~tAt11+t=1Tc2gt/t2+(4T(B+ϵ)2+2(c1+c2)2t=1Tgt2)log(2/δ).\sum_{t=1}^{T}\Delta(x_{t})1_{E_{t}}\leq\sum_{t=1}^{T}c_{1}g_{t}\|\widetilde{x}_{t}\|_{A_{t-1}^{-1}}+\sum_{t=1}^{T}c_{2}g_{t}/t^{2}+\sqrt{\left(4T(B+\epsilon)^{2}+2(c_{1}+c_{2})^{2}\sum_{t=1}^{T}g_{t}^{2}\right)\log(2/\delta)}.

Since gtgTg_{t}\leq g_{T} for any tt, on the event GG, we have

t=1TΔ(xt)1Etc1gTTt=1Tx~tAt112+c2gTπ26+T(4(B+ϵ)2+2(c1+c2)2gT2)log(2/δ).\displaystyle\sum_{t=1}^{T}\Delta(x_{t})1_{E_{t}}\leq c_{1}g_{T}\sqrt{T}\sqrt{\sum_{t=1}^{T}\|\widetilde{x}_{t}\|^{2}_{A_{t-1}^{-1}}}+c_{2}g_{T}\frac{\pi^{2}}{6}+\sqrt{T}\sqrt{\left(4(B+\epsilon)^{2}+2(c_{1}+c_{2})^{2}g_{T}^{2}\right)\log(2/\delta)}.

By inequalities (1), we have

t=1Tx~tAt112\displaystyle\sqrt{\sum_{t=1}^{T}\|\widetilde{x}_{t}\|^{2}_{A_{t-1}^{-1}}} =O~(D),gT=O~(log(|𝒜|)vT)=O~(log(|𝒜|)(D+log(1/δ)+ϵψT)).=\widetilde{O}(\sqrt{D}),\quad g_{T}=\widetilde{O}(\sqrt{\log(|\mathcal{A}|)}v_{T})=\widetilde{O}\left(\sqrt{\log(|\mathcal{A}|)}(\sqrt{D}+\sqrt{\log(1/\delta)}+\epsilon\psi_{T})\right).

Since ψT=s=1Tx~sAs11Ts=1Tx~sAs112=O~(DT)\psi_{T}=\sum_{s=1}^{T}\|\widetilde{x}_{s}\|_{A_{s-1}^{-1}}\leq\sqrt{T}\sqrt{\sum_{s=1}^{T}\|\widetilde{x}_{s}\|^{2}_{A_{s-1}^{-1}}}=\widetilde{O}(\sqrt{DT}), we obtain

gT=O~(log(|𝒜|)(D+log(1/δ)+ϵDT)).g_{T}=\widetilde{O}\left(\sqrt{\log(|\mathcal{A}|)}(\sqrt{D}+\sqrt{\log(1/\delta)}+\epsilon\sqrt{DT})\right).

Therefore, on the event GG, we have

t=1TΔ(xt)1Et=O~(log(|𝒜|){DT+DTlog(1/δ)+log(1/δ)T+(DT+TDlog(1/δ))ϵ}).\sum_{t=1}^{T}\Delta(x_{t})1_{E_{t}}=\widetilde{O}\left(\sqrt{\log(|\mathcal{A}|)}\left\{D\sqrt{T}+\sqrt{DT\log(1/\delta)}+\log(1/\delta)\sqrt{T}+\left(DT+T\sqrt{D\log(1/\delta)}\right)\epsilon\right\}\right).

Therefore, on event t=1TEtG\bigcap_{t=1}^{T}E_{t}\cap G, we can upper bound the regret as follows:

R(T)\displaystyle R(T) =t=1T{g(x)g(xt)}2ϵT+t=1TΔ(xt)1Et\displaystyle=\sum_{t=1}^{T}\{g(x^{*})-g(x_{t})\}\leq 2\epsilon T+\sum_{t=1}^{T}\Delta(x_{t})1_{E_{t}}
=O~(log(|𝒜|){DT+DTlog(1/δ)+log(1/δ)T+(DT+TDlog(1/δ))ϵ}).\displaystyle=\widetilde{O}\left(\sqrt{\log(|\mathcal{A}|)}\left\{D\sqrt{T}+\sqrt{DT\log(1/\delta)}+\log(1/\delta)\sqrt{T}+\left(DT+T\sqrt{D\log(1/\delta)}\right)\epsilon\right\}\right).

Since Pr(t=1TEtG)1δ\operatorname{Pr}(\bigcap_{t=1}^{T}E_{t}\cap G)\geq 1-\delta, we have the assertion of the proposition. ∎

B.5 Proof of Theorem 13

Since Theorems 16, 19 can be proved in a similar way, we only provide proof of Theorem 13.

Let {ξ1,,ξD}\{\xi_{1},\dots,\xi_{D}\} and N1,,NDN_{1},\dots,N_{D} be the points and the Newton basis returned by Algorithm 2 with 𝔢=αTq\mathfrak{e}=\frac{\alpha}{T^{q}}, where D=Dq,α(T)D=D_{q,\alpha}(T) and q1/2q\geq 1/2.

We verify that the assumptions of the (stochastic) misspecified linear bandit problem hold, i.e., we show that there exists θD\theta\in\mathbb{R}^{D} such that the following conditions are satisfied for x~=[N1(x),,ND(x)]T\widetilde{x}=[N_{1}(x),\dots,N_{D}(x)]^{\mathrm{T}} and θ\theta:

  1. x~21\|\widetilde{x}\|_{2}\leq 1.
  2. If xx is an 𝒜\mathcal{A}-valued random variable and t\mathcal{F}_{t}-measurable, then x~\widetilde{x} is t\mathcal{F}_{t}-measurable.
  3. θ2B\|\theta\|_{2}\leq B.
  4. supx𝒜|f(x)θ,x~|<ϵ\sup_{x\in\mathcal{A}}|f(x)-\langle\theta,\widetilde{x}\rangle|<\epsilon, where ϵ=αB/Tq\epsilon=\alpha B/T^{q}.

We put XD:={ξ1,,ξD}X_{D}:=\{\xi_{1},\dots,\xi_{D}\}. Then by definition, the Newton basis N1,,NDN_{1},\dots,N_{D} is a basis of V(XD)V(X_{D}). We define θ1,,θD\theta_{1},\dots,\theta_{D}\in\mathbb{R} by ΠV(XD)f=i=1DθiNi\Pi_{V(X_{D})}f=\sum_{i=1}^{D}\theta_{i}N_{i} and put θ=[θ1,,θD]T\theta=[\theta_{1},\dots,\theta_{D}]^{\mathrm{T}}. Since the Newton basis is an orthonormal basis of V(XD)V(X_{D}), we have

θ2=i=1DθiNiK(Ω)=ΠV(XD)fK(Ω)fK(Ω)B.\|\theta\|_{2}=\left\|\sum_{i=1}^{D}\theta_{i}N_{i}\right\|_{\mathcal{H}_{K}\left(\Omega\right)}=\left\|\Pi_{V(X_{D})}f\right\|_{\mathcal{H}_{K}\left(\Omega\right)}\leq\|f\|_{\mathcal{H}_{K}\left(\Omega\right)}\leq B.

By the orthonormality, we have PV(XD)2(x)=K(x,x)i=1DNi2(x)P_{V(X_{D})}^{2}(x)=K(x,x)-\sum_{i=1}^{D}N_{i}^{2}(x) (cf. Santin & Haasdonk (2017, Lemma 5)). Then by assumption, we have x~22=i=1DNi2(x)=K(x,x)PV(XD)2(x)1\|\widetilde{x}\|_{2}^{2}=\sum_{i=1}^{D}N_{i}^{2}(x)=K(x,x)-P_{V(X_{D})}^{2}(x)\leq 1. Since NkN_{k} for k=1,,Dk=1,\dots,D is a linear combination of K(,ξ1),,K(,ξD)K(\cdot,\xi_{1}),\dots,K(\cdot,\xi_{D}) and KK is continuous, xx~x\mapsto\widetilde{x} is continuous. Therefore, x~\widetilde{x} is t\mathcal{F}_{t}-measurable if xx is t\mathcal{F}_{t}-measurable. By definition of the PP-greedy algorithm, we have supx𝒜PV(XD)(x)<αTq\sup_{x\in\mathcal{A}}P_{V(X_{D})}(x)<\frac{\alpha}{T^{q}}. By this inequality and the definition of the power function, the following inequality holds:

supx𝒜|f(x)θ,x~|=supx𝒜|f(x)(ΠV(XD)f)(x)|fK(Ω)αTqαBTq.\sup_{x\in\mathcal{A}}|f(x)-\langle\theta,\widetilde{x}\rangle|=\sup_{x\in\mathcal{A}}|f(x)-\left(\Pi_{V(X_{D})}f\right)(x)|\leq\|f\|_{\mathcal{H}_{K}\left(\Omega\right)}\frac{\alpha}{T^{q}}\leq\frac{\alpha B}{T^{q}}.

Thus, one can apply the results on the misspecified linear bandit problem with ϵ=αBTq\epsilon=\frac{\alpha B}{T^{q}}. By applying Proposition 8, with probability at least 1δ1-\delta, the regret is upper bounded as follows:

RAPG-UCB(T)=O~(TDq,α(T)log(1/δ)+Dq,α(T)T).R_{\text{APG-UCB}}(T)=\widetilde{O}\left(\sqrt{TD_{q,\alpha}(T)\log(1/\delta)}+D_{q,\alpha}(T)\sqrt{T}\right).

Since computing the Newton basis requires O(|𝒜|D2)O(|\mathcal{A}|D^{2}) time and the total complexity of the modified LinUCB is given as O(|𝒜|D2T)O(|\mathcal{A}|D^{2}T), we have the assertion of Theorem 13. ∎
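To illustrate the construction used in this proof, the following is a sketch of the PP-greedy algorithm with the Newton basis update (cf. Pazouki & Schaback, 2011), assuming a finite candidate set and a vectorized kernel function; the names and the example kernel are ours, not the paper's implementation. The last line checks the uniform kernel approximation property established in Lemma 33 (§B.7).

```python
# A sketch of P-greedy with the Newton basis (illustrative, names are ours).
import numpy as np

def p_greedy(kernel, Omega_hat, err):
    """Return selected points and features x -> [N_1(x), ..., N_D(x)]."""
    P2 = np.diag(kernel(Omega_hat, Omega_hat)).copy()   # P^2(x) = K(x, x)
    V, idx = [], []                                     # Newton basis values
    while P2.max() >= err ** 2:                         # stop when max P(x) < err
        j = int(np.argmax(P2)); idx.append(j)           # next point = argmax of P
        col = kernel(Omega_hat, Omega_hat[j][None]).ravel()  # K(., xi_j)
        for v in V:
            col -= v[j] * v                             # subtract projection
        col /= np.sqrt(P2[j])                           # normalize: new N_m
        P2 = np.maximum(P2 - col ** 2, 0.0)             # update power function
        V.append(col)
    return Omega_hat[idx], np.array(V).T                # points, (n, D) features

def se_kernel(X, Y, l=0.3):                             # example kernel (ours)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * l ** 2))

Omega = np.random.default_rng(2).uniform(size=(200, 2))
pts, feats = p_greedy(se_kernel, Omega, err=1e-2)
# Lemma 33: sup_{x,y} |K(x, y) - <x~, y~>| <= err on the candidate set.
print(np.abs(se_kernel(Omega, Omega) - feats @ feats.T).max() <= 1e-2)
```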

B.6 Proof of Theorem 17

For simplicity, by normalization, we assume B=1B=1. We denote by RAPG-EXP3(T)R_{\text{APG-EXP3{}}}(T) the cumulative regret that APG-EXP3 with q=1q=1 and α=log(|𝒜|)\alpha=\log(|\mathcal{A}|) incurs up to time step TT. We can reduce the adversarial RKHS bandit problem to the adversarial misspecified linear bandit problem as in §B.5. To apply Proposition 11, we need to prove that {x~|x𝒜}\{\widetilde{x}|x\in\mathcal{A}\} spans D\mathbb{R}^{D}. We denote by X={X1,,XD}X=\{X_{1},\dots,X_{D}\} the points returned by the PP-greedy algorithm. Then, since N1,,NDN_{1},\dots,N_{D} is a basis of V(X)V(X) and KK is positive definite, rank(Ni(x))1iD,x𝒜=rank(K(Xi,x))1iD,x𝒜=D\mathrm{rank}\hskip 2.0pt(N_{i}(x))_{1\leq i\leq D,x\in\mathcal{A}}=\mathrm{rank}\hskip 2.0pt(K(X_{i},x))_{1\leq i\leq D,x\in\mathcal{A}}=D. Therefore, {x~|x𝒜}\{\widetilde{x}|x\in\mathcal{A}\} spans D\mathbb{R}^{D}.

By Proposition 11, we have

𝐄[RAPG-EXP3(T)]2ϵT+eηDT+2ϵTη+log(|𝒜|)η,\mathbf{E}\left[R_{\text{APG-EXP3{}}}(T)\right]\leq 2\epsilon T+e\eta DT+\frac{2\epsilon T}{\eta}+\frac{\log(|\mathcal{A}|)}{\eta},

where ϵ=log(|𝒜|)T\epsilon=\frac{\log(|\mathcal{A}|)}{T} and D=D1,log(|𝒜|)(T)D=D_{1,\log(|\mathcal{A}|)}(T). Thus we have 𝐄[RAPG-EXP3(T)]2log(|𝒜|)+eηDT+3log(|𝒜|)η.\mathbf{E}\left[R_{\text{APG-EXP3{}}}(T)\right]\leq 2\log(|\mathcal{A}|)+e\eta DT+\frac{3\log(|\mathcal{A}|)}{\eta}. By taking η=log(|𝒜|)DT\eta=\sqrt{\frac{\log(|\mathcal{A}|)}{DT}}, we have the assertion of the theorem.
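Concretely, this choice of η\eta gives eηDT+3log(|𝒜|)η=(e+3)DTlog(|𝒜|)e\eta DT+\frac{3\log(|\mathcal{A}|)}{\eta}=(e+3)\sqrt{DT\log(|\mathcal{A}|)}, so that 𝐄[RAPG-EXP3(T)]2log(|𝒜|)+(e+3)DTlog(|𝒜|)\mathbf{E}\left[R_{\text{APG-EXP3}}(T)\right]\leq 2\log(|\mathcal{A}|)+(e+3)\sqrt{DT\log(|\mathcal{A}|)} with D=D1,log(|𝒜|)(T)D=D_{1,\log(|\mathcal{A}|)}(T). ∎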

B.7 Proof of Theorem 14

First, we prove that the PP-greedy algorithm (Algorithm 1) also gives a uniform kernel approximation.

Lemma 33.

Let N1,,NDN_{1},\dots,N_{D} be the Newton basis returned by the PP-greedy algorithm (Algorithm 1) with error 𝔢\mathfrak{e} and Ω^=𝒜\widehat{\Omega}=\mathcal{A}. For x𝒜x\in\mathcal{A}, we put x~:=[N1(x),,ND(x)]T\widetilde{x}:=[N_{1}(x),\dots,N_{D}(x)]^{\mathrm{T}}. Then, we have supx,y𝒜|K(x,y)x~,y~|𝔢.\sup_{x,y\in\mathcal{A}}|K(x,y)-\langle\widetilde{x},\widetilde{y}\rangle|\leq\mathfrak{e}.

Proof.

We denote by XX the points returned by the PP-greedy algorithm. Then, by definition of the power function, we have

|h(x)(ΠV(X)h)(x)|hK(Ω)𝔢,\left|h(x)-\left(\Pi_{V(X)}h\right)(x)\right|\leq\|h\|_{\mathcal{H}_{K}\left(\Omega\right)}\mathfrak{e},

for any hK(Ω)h\in\mathcal{H}_{K}\left(\Omega\right) and x𝒜x\in\mathcal{A}. We take arbitrary y𝒜y\in\mathcal{A} and take h=K(,y)h=K(\cdot,y). Since N1,,NDN_{1},\dots,N_{D} is an orthonormal basis of V(X)V(X), we have

(ΠV(X)h)(x)=i=1Dh,NiK(Ω)Ni(x)=i=1DNi(y)Ni(x)=x~,y~.\displaystyle\left(\Pi_{V(X)}h\right)(x)=\sum_{i=1}^{D}\langle h,N_{i}\rangle_{\mathcal{H}_{K}\left(\Omega\right)}N_{i}(x)=\sum_{i=1}^{D}N_{i}(y)N_{i}(x)=\langle\widetilde{x},\widetilde{y}\rangle.

Here, in the second equality, we used the reproducing property. Since hK(Ω)1\|h\|_{\mathcal{H}_{K}\left(\Omega\right)}\leq 1 and x,yx,y are arbitrary, we have our assertion. ∎

Next, we introduce the following classical result on matrix eigenvalues.

Lemma 34 (a special case of the Wielandt-Hoffman theorem Hoffman et al. (1953)).

Let A,Bn×nA,B\in\mathbb{R}^{n\times n} be symmetric matrices. Denote by a1ana_{1}\leq\dots\leq a_{n} and b1bnb_{1}\leq\dots\leq b_{n} the eigenvalues of AA and BB, respectively. Then, we have i=1n|aibi|2ABF2,\sum_{i=1}^{n}|a_{i}-b_{i}|^{2}\leq\|A-B\|_{F}^{2}, where F\|\cdot\|_{F} denotes the Frobenius norm.
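As a quick numerical sanity check of Lemma 34 (illustration only, not part of the argument), one can compare the sorted eigenvalues of two random symmetric matrices with their Frobenius distance:

```python
# Numerical check of the Wielandt-Hoffman bound (illustration only).
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(6, 6)); A = (A + A.T) / 2     # random symmetric A
B = rng.normal(size=(6, 6)); B = (B + B.T) / 2     # random symmetric B
a = np.linalg.eigvalsh(A)                          # eigenvalues, ascending
b = np.linalg.eigvalsh(B)
print(np.sum((a - b) ** 2) <= np.sum((A - B) ** 2))  # sum |a_i - b_i|^2 <= ||A-B||_F^2
```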

By these lemmas, we can prove that logdet(λ1AT)\log\det\left(\lambda^{-1}A_{T}\right) is an approximation of the maximum information gain.

Lemma 35.

Suppose we apply APG-UCB with admissible error 𝔢\mathfrak{e} to the stochastic RKHS bandit problem. Then the following inequality holds:

logdet(λ1AT)2γT+𝔢T3/2λ.\log\det\left(\lambda^{-1}A_{T}\right)\leq 2\gamma_{T}+\frac{\mathfrak{e}T^{3/2}}{\lambda}.
Proof.

We define a T×TT\times T matrix K~T\widetilde{K}_{T} as (x~i,x~j)1i,jT(\langle\widetilde{x}_{i},\widetilde{x}_{j}\rangle)_{1\leq i,j\leq T}. Since for any matrix Xn×mX\in\mathbb{R}^{n\times m}, det(1n+XXT)=det(1m+XTX)\det(1_{n}+XX^{\mathrm{T}})=\det(1_{m}+X^{\mathrm{T}}X) holds, we have det(λ1AT)=det(1T+λ1K~T)\det(\lambda^{-1}A_{T})=\det\left(1_{T}+\lambda^{-1}\widetilde{K}_{T}\right). We denote by ρ1ρT\rho_{1}\leq\dots\leq\rho_{T} the eigenvalues of 1T+λ1KT1_{T}+\lambda^{-1}K_{T} and ρ~1ρ~T\widetilde{\rho}_{1}\leq\dots\leq\widetilde{\rho}_{T} those of 1T+λ1K~T1_{T}+\lambda^{-1}\widetilde{K}_{T}. Then by the Wielandt-Hoffman theorem (Lemma 34), we have

i=1T(ρiρ~i)2λ1KTK~TFλ1𝔢T,\sqrt{\sum_{i=1}^{T}(\rho_{i}-\widetilde{\rho}_{i})^{2}}\leq\lambda^{-1}\|K_{T}-\widetilde{K}_{T}\|_{F}\leq\lambda^{-1}\mathfrak{e}T, (7)

where the last inequality follows from Lemma 33. Thus, we have

logdet(λ1AT)\displaystyle\log\det\left(\lambda^{-1}A_{T}\right) =logdet(1T+λ1K~T)=i=1Tlog(ρ~i)=i=1Tlog(ρi)+i=1Tlog(ρ~i/ρi)\displaystyle=\log\det\left(1_{T}+\lambda^{-1}\widetilde{K}_{T}\right)=\sum_{i=1}^{T}\log(\widetilde{\rho}_{i})=\sum_{i=1}^{T}\log(\rho_{i})+\sum_{i=1}^{T}\log(\widetilde{\rho}_{i}/\rho_{i})
logdet(1T+λ1KT)+i=1Tρ~iρiρi\displaystyle\leq\log\det(1_{T}+\lambda^{-1}K_{T})+\sum_{i=1}^{T}\frac{\widetilde{\rho}_{i}-\rho_{i}}{\rho_{i}}
logdet(1T+λ1KT)+i=1T|ρ~iρi|\displaystyle\leq\log\det(1_{T}+\lambda^{-1}K_{T})+\sum_{i=1}^{T}|\widetilde{\rho}_{i}-\rho_{i}|
logdet(1T+λ1KT)+𝔢T3/2λ.\displaystyle\leq\log\det(1_{T}+\lambda^{-1}K_{T})+\frac{\mathfrak{e}T^{3/2}}{\lambda}.

Here in the second inequality, we used ρi1\rho_{i}\geq 1 and in the third inequality, we used inequality (7) and the Cauchy-Schwarz inequality. Noting that logdet(1T+λ1KT)2γT\log\det(1_{T}+\lambda^{-1}K_{T})\leq 2\gamma_{T} (Chowdhury & Gopalan, 2017), we have our assertion. ∎

We provide a more precise result than Theorem 14. We can prove the following by Proposition 8.

Proposition 36.

We assume that λ1log(det(λ1AT))2γT+δT\lambda^{-1}\log\left(\det(\lambda^{-1}A_{T})\right)\leq 2\gamma_{T}+\delta_{T}, where δT=O(Taq)\delta_{T}=O(T^{a-q}) with aa\in\mathbb{R} and qq is the parameter of APG-UCB. We also assume that δT=O(γT)\delta_{T}=O(\gamma_{T}) and λ=1\lambda=1. Then with probability at least 1δ1-\delta, the cumulative regret of APG-UCB is upper bounded by a function b(T)b(T), where b(T)b(T) is given as

b(T)=4βTIGP-UCBγTT+O(TγTT(aq)/2+γTT1q),b(T)=4\beta^{\text{IGP-UCB}}_{T}\sqrt{\gamma_{T}T}+O(\sqrt{T\gamma_{T}}T^{(a-q)/2}+\gamma_{T}T^{1-q}), (8)

where βTIGP-UCB\beta^{\text{IGP-UCB}}_{T} is defined by B+R2(γT+1+log(1/δ))B+R\sqrt{2(\gamma_{T}+1+\log(1/\delta))}.

Remark 37.

We note that the cumulative regret of IGP-UCB is upper bounded by 4βTIGP-UCBγT(T+2)4\beta^{\text{IGP-UCB}}_{T}\sqrt{\gamma_{T}(T+2)} by the proof in (Chowdhury & Gopalan, 2017).

If q>max(a,1/2)q>\max(a,1/2), then the first term 4βTIGP-UCBγTT4\beta^{\text{IGP-UCB}}_{T}\sqrt{\gamma_{T}T} in (8) is the main term of b(T)b(T). By Lemma 35, we can take a=3/2a=3/2. Thus, we have the assertion of Theorem 14.

Appendix C Supplement to the Experiments

C.1 Experimental Setting

For each reward function ff, we add independent Gaussian noise with mean 0 and standard deviation 0.2fL1(𝒜)0.2\cdot\|f\|_{L^{1}(\mathcal{A})}. We use the L1L^{1}-norm because even if we normalize ff so that fK(Ω)=1\|f\|_{\mathcal{H}_{K}\left(\Omega\right)}=1, the values of the function ff can be small. As for the parameters of the kernels, we take μ=2d\mu=2d for the RQ kernel because the condition μ=Ω(d)\mu=\Omega(d) is required for positive definiteness. We take l=0.3dl=0.3\sqrt{d} for the RQ kernel and l=0.2dl=0.2\sqrt{d} for the SE kernel because the diameter of the dd-dimensional unit cube is d\sqrt{d}. As for the parameters of the algorithms, we take B=1,δ=103B=1,\delta=10^{-3} and R=0.2(i=110fiL1(𝒜)/10)R=0.2\cdot\left(\sum_{i=1}^{10}\|f_{i}\|_{L^{1}(\mathcal{A})}/10\right) for both algorithms, where f1,,f10f_{1},\dots,f_{10} are the reward functions used for the experiment. We take λ=1,α=5103,q=1/2\lambda=1,\alpha=5\cdot 10^{-3},q=1/2 for APG-UCB and λ=1+2/T\lambda=1+2/T for IGP-UCB.
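For reference, the following transcribes the kernels with the stated parameters as we read them; this is an illustrative sketch under the assumption that the SE kernel has the standard form and the RQ kernel has the common parameterization k(r)=(1+r2/(2μl2))μk(r)=(1+r^{2}/(2\mu l^{2}))^{-\mu}. If the paper's RQ parameterization differs, only that formula changes.

```python
# Illustrative transcription of the experiment's kernels (assumed forms).
import numpy as np

def sq_dists(X, Y):
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def se_kernel(X, Y, d):
    l = 0.2 * np.sqrt(d)                        # l = 0.2 sqrt(d) for SE
    return np.exp(-sq_dists(X, Y) / (2 * l ** 2))

def rq_kernel(X, Y, d):
    l, mu = 0.3 * np.sqrt(d), 2 * d             # l = 0.3 sqrt(d), mu = 2d for RQ
    return (1.0 + sq_dists(X, Y) / (2 * mu * l ** 2)) ** (-mu)
```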

Since the exact value of the maximum information gain is not known, we modify IGP-UCB as follows when computing its UCB. Using the notation of (Chowdhury & Gopalan, 2017), IGP-UCB selects an arm xx maximizing μt1(x)+βtσt1(x)\mu_{t-1}(x)+\beta_{t}\sigma_{t-1}(x), where βt=B+R2(γt1+1+log(1/δ))\beta_{t}=B+R\sqrt{2(\gamma_{t-1}+1+\log(1/\delta))}. Since the exact value of γt1\gamma_{t-1} is not known, we use 12lndet(I+λ1Kt1)\frac{1}{2}\ln\det(I+\lambda^{-1}K_{t-1}) instead of γt1\gamma_{t-1}. From their proof, it is easy to see that this modification of IGP-UCB has the same guarantee for the regret upper bound as that of IGP-UCB. In addition, by lndet(I+λ1Kt)=s=1tlog(1+λ1σs12(xs))\ln\det(I+\lambda^{-1}K_{t})=\sum_{s=1}^{t}\log(1+\lambda^{-1}\sigma_{s-1}^{2}(x_{s})), one can update lndet(I+λ1Kt)\ln\det(I+\lambda^{-1}K_{t}) in O(t2)O(t^{2}) time at each round if Kt1K_{t}^{-1} is known. To compute the inverse Kt1K_{t}^{-1} of the regularized kernel matrix, we used the Schur complement of the matrix.
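The bookkeeping just described can be sketched as follows, with names of our choosing: we maintain Mt=(Kt+λI)1M_{t}=(K_{t}+\lambda I)^{-1} via the Schur-complement block-inverse formula when a new row and column are appended, and accumulate lndet(I+λ1Kt)\ln\det(I+\lambda^{-1}K_{t}) through the identity above. This is a sketch under our own conventions, not the authors' Rust implementation.

```python
# Incremental (K_t + lam I)^{-1} via the Schur complement, plus the running
# ln det(I + K_t / lam) = sum_s log(1 + sigma_{s-1}^2(x_s) / lam). Sketch only.
import numpy as np

class IncrementalGP:
    def __init__(self, lam):
        self.lam, self.M, self.logdet = lam, None, 0.0

    def sigma2(self, k_vec, k_xx):
        """Posterior variance sigma^2(x) given k_vec = K(X_t, x), k_xx = K(x, x)."""
        if self.M is None:
            return k_xx
        return k_xx - k_vec @ self.M @ k_vec

    def update(self, k_vec, k_xx):
        s2 = self.sigma2(k_vec, k_xx)
        self.logdet += np.log(1.0 + s2 / self.lam)        # incremental log det
        if self.M is None:
            self.M = np.array([[1.0 / (k_xx + self.lam)]])
            return
        b = self.M @ k_vec                                # (K_t + lam I)^{-1} k
        c = 1.0 / (k_xx + self.lam - k_vec @ b)           # inverse Schur complement
        self.M = np.block([[self.M + c * np.outer(b, b), -c * b[:, None]],
                           [-c * b[None, :], np.array([[c]])]])

# Check against a direct computation on a toy kernel matrix.
rng = np.random.default_rng(4)
X = rng.uniform(size=(30, 2))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1))
gp = IncrementalGP(lam=1.0)
for t in range(30):
    gp.update(K[:t, t], K[t, t])
print(np.isclose(gp.logdet, np.linalg.slogdet(np.eye(30) + K)[1]))  # True
```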

Computations were run on an Intel Xeon E5-2630 v4 processor with 128 GB RAM. We computed the UCB for each arm in parallel for both algorithms. For matrix-vector multiplication, we used the efficient implementation of the dot product provided in https://github.com/dimforge/nalgebra/blob/dev/src/base/blas.rs.

C.2 Additional Experimental Results

As shown in the main article and §B.7, the error 𝔢\mathfrak{e} balances the computational complexity and the cumulative regret, i.e., if 𝔢\mathfrak{e} is smaller, then the cumulative regret is smaller, but the computational complexity becomes larger. In this subsection, we provide additional experimental results obtained by changing α\alpha with fixed q=1/2q=1/2. We also show results for more complicated reward functions, i.e., l=0.2dl=0.2\sqrt{d} for RQ kernels (μ\mu is the same) and l=0.1dl=0.1\sqrt{d} for SE kernels.

In Table 2, we show the number of points returned by the PP-greedy algorithms for the RQ and SE kernels.

Table 2: The Number of Points Returned by the PP-greedy Algorithm with 𝔢=5103T.\mathfrak{e}=\frac{5\cdot 10^{-3}}{\sqrt{T}}.
RQ (l=0.3dl=0.3\sqrt{d}) SE (l=0.2dl=0.2\sqrt{d}) RQ (l=0.2dl=0.2\sqrt{d}) SE (l=0.1dl=0.1\sqrt{d})
d=1d=1 18 15 23 25
d=2d=2 105 108 188 283
d=3d=3 376 457 725 994

In Figures 3, 4 and Tables 3, 4, we show the dependence on the parameter α\alpha. In these figures, we denote APG-UCB with parameter α\alpha by APG-UCB(α\alpha).

In Figures 5, 6 and Tables 5, 6, we also show the dependence on the parameter α\alpha for more complicated functions.

Figure 3: Normalized Cumulative Regret for RQ kernels with l=0.3dl=0.3\sqrt{d}.
Figure 4: Normalized Cumulative Regret for RQ kernels with l=0.2dl=0.2\sqrt{d}.
Figure 5: Normalized Cumulative Regret for SE kernels with l=0.2dl=0.2\sqrt{d}.
Figure 6: Normalized Cumulative Regret for SE kernels with l=0.1dl=0.1\sqrt{d}.
Table 3: Total Running Time for RQ Kernels with l=0.3dl=0.3\sqrt{d}.
APG-UCB(5e-2) APG-UCB(1e-2) APG-UCB(5e-3)
d = 1 (RQ) 3.91e-01 4.06e-01 4.23e-01
d = 2 (RQ) 1.36e+00 2.39e+00 2.76e+00
d = 3 (RQ) 1.19e+01 2.40e+01 2.98e+01
Table 4: Total Running Time for SE Kernels with l=0.2dl=0.2\sqrt{d}.
APG-UCB(5e-2) APG-UCB(1e-2) APG-UCB(5e-3)
d = 1 (SE) 3.84e-01 4.04e-01 4.02e-01
d = 2 (SE) 1.69e+00 2.59e+00 2.89e+00
d = 3 (SE) 2.13e+01 3.51e+01 4.30e+01
Table 5: Total Running Time for RQ Kernels with l=0.2dl=0.2\sqrt{d}.
APG-UCB(5e-2) APG-UCB(1e-2) APG-UCB(5e-3)
d = 1 (RQ) 4.49e-01 4.84e-01 4.96e-01
d = 2 (RQ) 3.84e+00 6.01e+00 7.39e+00
d = 3 (RQ) 4.87e+01 8.76e+01 1.07e+02
Table 6: Total Running Time for SE Kernels with l=0.1dl=0.1\sqrt{d}.
APG-UCB(5e-2) APG-UCB(1e-2) APG-UCB(5e-3)
d = 1 (SE) 4.72e-01 4.88e-01 5.08e-01
d = 2 (SE) 9.59e+00 1.40e+01 1.61e+01
d = 3 (SE) 1.77e+02 2.02e+02 2.02e+02