Approximation Theory Based Methods for RKHS Bandits
Abstract
The RKHS bandit problem (also called kernelized multi-armed bandit problem) is an online optimization problem of non-linear functions with noisy feedback. Although the problem has been extensively studied, there are unsatisfactory results for some problems compared to the well-studied linear bandit case. Specifically, there is no general algorithm for the adversarial RKHS bandit problem. In addition, high computational complexity of existing algorithms hinders practical application. We address these issues by considering a novel amalgamation of approximation theory and the misspecified linear bandit problem. Using an approximation method, we propose efficient algorithms for the stochastic RKHS bandit problem and the first general algorithm for the adversarial RKHS bandit problem. Furthermore, we empirically show that one of our proposed methods has comparable cumulative regret to IGP-UCB and its running time is much shorter.
1 Introduction
The RKHS bandit problem (also called the kernelized multi-armed bandit problem) is an online optimization problem of non-linear functions with noisy feedback. Srinivas et al. (2010) studied a multi-armed bandit problem where the reward function belongs to the reproducing kernel Hilbert space (RKHS) associated with a kernel. In this paper, we call this problem the (stochastic) RKHS bandit problem. Although the problem has been studied extensively, some issues are not completely solved yet. In this paper, we mainly focus on two issues: the non-existence of general algorithms for the adversarial RKHS bandit problem and the high computational complexity of stochastic RKHS bandit algorithms.
First, as a non-linear generalization of the classical adversarial linear bandit problem, Chatterji et al. (2019) proposed the adversarial RKHS bandit problem, where a learner interacts with a sequence of arbitrary functions from the RKHS with bounded norms. However, they only consider the kernel loss, i.e., a loss function of the form $k(\cdot, z_t)$, where $z_t \in \mathcal{X}$ is a fixed point. Considering that functions in the RKHS can be represented as infinite linear combinations of such functions, a kernel loss is a very special function in the RKHS. Therefore, there are no algorithms for the adversarial RKHS bandit problem with general loss (or reward) functions.
Next, we discuss the efficiency of existing methods for the stochastic RKHS bandit problem. We note that most of the existing methods achieve regret guarantees at the cost of high computational complexity. For example, IGP-UCB (Chowdhury & Gopalan, 2017) requires a matrix-vector multiplication with a matrix of size $t \times t$ for each arm at each round $t$. Therefore, the total computational complexity up to round $T$ is given as $O(|\mathcal{A}| T^3)$, where $\mathcal{A}$ is the set of arms. To address this issue, Calandriello et al. (2020) proposed BBKB and proved that its total computational complexity is given as $\tilde{O}(|\mathcal{A}| T \gamma_T^2)$, where $\mathcal{A} \subseteq \mathcal{X}$ is the set of arms, $\mathcal{X}$ is a subset of a Euclidean space $\mathbb{R}^d$, and $\gamma_T$ is the maximum information gain (Srinivas et al., 2010). If the kernel is a squared exponential kernel, then since $\gamma_T = O(\log^{d+1} T)$ (Srinivas et al., 2010), ignoring the polylogarithmic factor, BBKB's computational complexity is nearly linear in $T$ (in this paper, we use the $\tilde{O}$ notation to ignore a factor of $\log^{c} T$, where $c$ is a universal constant). However, the coefficient $\gamma_T^2$ of the term linear in $T$ is large in general.
In this paper, we address these two issues by considering a novel amalgamation of approximation theory (Wendland, 2004) and the misspecified linear bandit problem (Lattimore et al., 2020). That is, we approximately reduce the RKHS bandit problem to the well-studied linear bandit problem. Here, because of an approximation error, the model is a misspecified linear model. Ordinary approximation methods (such as Random Fourier Features or Nyström embedding) basically aim to approximate the kernel $k(x, y)$ by an inner product of finite dimensional vectors. However, to reduce the RKHS bandit problem to the linear bandit problem, we want to approximate a function $f$ in the RKHS by a function $\hat{f}$ in a finite dimensional subspace so that the sup norm $\sup_{x \in \mathcal{X}} |f(x) - \hat{f}(x)|$ is small. Since the usual approximation methods are not appropriate for this purpose, in this paper, we utilize a method developed in the approximation theory literature called the $P$-greedy algorithm (De Marchi et al., 2005) to minimize the error. More precisely, we shall show that any function in the RKHS with norm at most $B$ is approximately equal (in terms of the sup norm) to a linear combination of the functions $k(\cdot, x_1), \dots, k(\cdot, x_{N(\epsilon)})$, where $\epsilon, B > 0$ are parameters and $N(\epsilon)$ is the number of functions (or equivalently points) returned by the $P$-greedy algorithm (Algorithm 1) with admissible error $\epsilon$. If the kernel is sufficiently smooth, $N(\epsilon)$ is much smaller than $T$ for the choices of $\epsilon$ used in our algorithms. By this approximation, we can tackle the original RKHS bandit problem by applying an algorithm for the misspecified linear bandit problem.
Contributions
To state the contributions, we introduce some terminology for kernels. In this paper, we consider two types of kernels: kernels with infinite smoothness and those with finite smoothness with smoothness parameter $\nu$ (we provide a precise definition in §4). Examples of the former include Rational Quadratic (RQ) and Squared Exponential (SE) kernels, and those of the latter include the Matérn kernels with parameter $\nu$. The latter type also includes a general class of kernels that are of class $C^{2\nu}$ with $\nu \in \mathbb{Z}_{>0}$ and satisfy some additional conditions. Let $N(\epsilon)$ be as before. Then, in §4, we shall show that $N(\epsilon) = O(\log^{d}(1/\epsilon))$ if $k$ has infinite smoothness and $N(\epsilon) = O(\epsilon^{-d/\nu})$ if $k$ has finite smoothness. Our contributions are stated as follows:
1. We apply an approximation method that has not previously been applied to the RKHS bandit problem and reduce the problem to the well-studied (misspecified) linear bandit problem. This novel reduction method has the potential to tackle issues other than the ones we deal with in this paper.
2. We propose APG-EXP3 for the adversarial RKHS bandit problem, where APG stands for an Approximation theory based method using the $P$-Greedy algorithm. We prove that its expected cumulative regret is upper bounded by $\tilde{O}\big(B\sqrt{TN\log|\mathcal{A}|}\big)$, where $N = N(1/\sqrt{T})$. To the best of our knowledge, this is the first method for the adversarial RKHS bandit problem with general reward functions.
3. We propose a method for the stochastic RKHS bandit problem called APG-PE and prove that its cumulative regret is $\tilde{O}\big((B+1)\sqrt{TN\log|\mathcal{A}|}\big)$ with probability at least $1 - \delta$, where $N = N(1/\sqrt{T})$, and we bound its total computational complexity. We note that the total computational complexity is generally much better than that of the state-of-the-art result (Calandriello et al., 2020).
4. We propose APG-UCB as an approximation of IGP-UCB, provide an upper bound on its cumulative regret for a given admissible error $\epsilon$, and prove that its total computational complexity is given as $O(|\mathcal{A}| T N^2)$, where $N = N(\epsilon)$.
If we take the parameter $\epsilon$ so that $\epsilon \le T^{-3/4}$, then we shall show that the regret is upper bounded by $F(T)$, where we define $F(T)$ in §6. Since the upper bound for the cumulative regret of IGP-UCB is also given by an expression of the same form, APG-UCB has asymptotically the same regret upper bound as that of IGP-UCB in this case. If the kernel has infinite smoothness or finite smoothness with sufficiently large $\nu$ (i.e., $N$ is sufficiently small compared to $T$), then this method is more efficient than IGP-UCB, whose computational complexity is $O(|\mathcal{A}| T^3)$.
5. In synthetic environments, we empirically show that APG-UCB has almost the same cumulative regret as that of IGP-UCB and that its running time is much shorter.
2 Related Work
First, we review previous works on the adversarial RKHS bandit problem. There are almost no existing results for the adversarial RKHS bandit problem except for (Chatterji et al., 2019). They also used an approximation method to solve the problem, but their approximation method can handle only a limited case. Therefore, there are no existing algorithms for the adversarial RKHS bandit problem with general reward functions. Next, we review existing results for the stochastic RKHS bandit problem. Srinivas et al. (2010) studied a multi-armed bandit problem where the reward function is assumed to be sampled from a Gaussian process or to belong to an RKHS. Chowdhury & Gopalan (2017) improved the result of Srinivas et al. (2010) in the RKHS setting and proposed two methods called IGP-UCB and GP-TS. Valko et al. (2013) considered a stochastic RKHS bandit problem where the arm set is finite and fixed, proposed a method called SupKernelUCB, and proved a regret upper bound $\tilde{O}(\sqrt{T\tilde{d}})$, where $\tilde{d}$ is the effective dimension. To address the computational inefficiency in the stochastic RKHS bandit problem, Mutny & Krause (2018) proposed Thompson Sampling and UCB-type algorithms using an approximation method called Quadrature Fourier Features, an improved variant of Random Fourier Features (Rahimi & Recht, 2008), and provided an upper bound on the total computational complexity of their methods. However, their methods can be applied only to a very special class of kernels. For example, among the three examples introduced in §3, only SE kernels satisfy their assumption unless $d = 1$. Our methods work for general symmetric positive definite kernels with enough smoothness. Calandriello et al. (2020) proposed a method called BBKB and proved that its regret is upper bounded by $\tilde{O}(\sqrt{T}\gamma_T)$ with appropriate parameters and that its total computational complexity is given as $\tilde{O}(|\mathcal{A}| T \gamma_T^2)$. Here we use the maximum information gain $\gamma_T$ instead of the effective dimension, since they have the same order up to polylogarithmic factors (Calandriello et al., 2019). If the kernel is an SE kernel, ignoring polylogarithmic factors, their computational complexity is linear in $T$. However, the $\gamma_T^2$ factor generally incurs a large coefficient in the term linear in $T$, unlike APG-PE. Finally, we note that we construct APG-PE from PHASED ELIMINATION (Lattimore et al., 2020), an algorithm for the stochastic misspecified linear bandit problem; PE stands for PHASED ELIMINATION.
3 Problem Formulation
Let $\mathcal{X}$ be a non-empty subset of a Euclidean space $\mathbb{R}^d$ and $k$ be a symmetric, positive definite kernel on $\mathcal{X}$, i.e., $k(x, y) = k(y, x)$ for all $x, y \in \mathcal{X}$, and for pairwise distinct points $x_1, \dots, x_n \in \mathcal{X}$, the kernel matrix $(k(x_i, x_j))_{1 \le i, j \le n}$ is positive definite. Examples of such kernels are the Rational Quadratic (RQ), Squared Exponential (SE), and Matérn kernels, defined as

$$k_{\mathrm{RQ}}(x, y) = \left(1 + \frac{\|x - y\|_2^2}{2\alpha l^2}\right)^{-\alpha}, \qquad k_{\mathrm{SE}}(x, y) = \exp\left(-\frac{\|x - y\|_2^2}{2 l^2}\right),$$

and

$$k_{\mathrm{Mat}}(x, y) = \frac{2^{1 - \nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,\|x - y\|_2}{l}\right)^{\nu} K_{\nu}\left(\frac{\sqrt{2\nu}\,\|x - y\|_2}{l}\right),$$

where $\alpha, \nu > 0$ and $l > 0$ are parameters and $K_{\nu}$ is the modified Bessel function of the second kind. As in the previous work (Chowdhury & Gopalan, 2017), we normalize the kernel so that $k(x, x) \le 1$ for all $x \in \mathcal{X}$. We note that the above three examples satisfy $k(x, x) = 1$ for any $x \in \mathcal{X}$. We denote by $\mathcal{H}_k(\mathcal{X})$ the RKHS corresponding to the kernel $k$, which we shall review briefly in §4, and assume that the reward function $f$ has bounded norm, i.e., $\|f\|_{\mathcal{H}_k} \le B$. In this paper, we consider the following multi-armed bandit problem with time interval $T$ and arm set $\mathcal{A} \subseteq \mathcal{X}$. First, we formulate the stochastic RKHS bandit problem. In each round $t = 1, \dots, T$, a learner selects an arm $a_t \in \mathcal{A}$ and observes a noisy reward $y_t = f(a_t) + \xi_t$. Here we assume that the noise process $\{\xi_t\}_t$ is conditionally $R$-sub-Gaussian with respect to a filtration $\{\mathcal{F}_t\}_t$, i.e., $\mathbb{E}[\exp(\lambda \xi_t) \mid \mathcal{F}_{t-1}] \le \exp(\lambda^2 R^2 / 2)$ for all $\lambda \in \mathbb{R}$ and $t$. We also assume that $a_t$ is $\mathcal{F}_{t-1}$-measurable and $\xi_t$ is $\mathcal{F}_t$-measurable. The objective of the learner is to maximize the cumulative reward $\sum_{t=1}^{T} f(a_t)$, and the regret is defined by $R_T = \sum_{t=1}^{T} \left(\max_{a \in \mathcal{A}} f(a) - f(a_t)\right)$. In the adversarial (or non-stochastic) RKHS bandit problem, we assume a sequence $f_1, \dots, f_T \in \mathcal{H}_k(\mathcal{X})$ with $\|f_t\|_{\mathcal{H}_k} \le B$ for $t = 1, \dots, T$ is given. In each round $t$, a learner selects an arm $a_t \in \mathcal{A}$ and observes a reward $f_t(a_t)$. The learner's objective is to minimize the cumulative regret $R_T = \max_{a \in \mathcal{A}} \sum_{t=1}^{T} f_t(a) - \sum_{t=1}^{T} f_t(a_t)$. In this paper, we only consider an oblivious adversary, i.e., we assume the adversary chooses the sequence $f_1, \dots, f_T$ before the game starts.
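For concreteness, the following is a minimal Python sketch of the three kernels, assuming the standard parameterizations above (scale parameter `l`, exponent `alpha` for RQ, smoothness `nu` for Matérn); all three satisfy $k(x, x) = 1$.

```python
import numpy as np
from scipy.special import kv, gamma

def rq_kernel(x, y, alpha=1.0, l=1.0):
    """Rational Quadratic: (1 + ||x - y||^2 / (2 alpha l^2))^(-alpha)."""
    r2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return (1.0 + r2 / (2.0 * alpha * l ** 2)) ** (-alpha)

def se_kernel(x, y, l=1.0):
    """Squared Exponential: exp(-||x - y||^2 / (2 l^2))."""
    r2 = np.sum((np.asarray(x) - np.asarray(y)) ** 2)
    return np.exp(-r2 / (2.0 * l ** 2))

def matern_kernel(x, y, nu=2.5, l=1.0):
    """Matern kernel with smoothness nu; K_nu is the modified Bessel function."""
    r = np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2))
    r = max(r, 1e-12)  # avoid 0 * inf at r = 0; the limit is k(x, x) = 1
    z = np.sqrt(2.0 * nu) * r / l
    return (2.0 ** (1.0 - nu) / gamma(nu)) * (z ** nu) * kv(nu, z)
```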
4 Results from Approximation Theory
In this section, we introduce important results from approximation theory. For an introduction to this subject, we refer to the monograph by Wendland (2004). We first briefly review basic properties of the RKHS and introduce classical results regarding the convergence rate of the power function, which are required for the proof of Theorem 6. Then, we introduce the $P$-greedy algorithm and its convergence rate in Theorem 6, which generalizes the existing result of Santin & Haasdonk (2017).
4.1 Reproducing Kernel Hilbert Space
Let $\mathbb{R}^{\mathcal{X}}$ be the real vector space of $\mathbb{R}$-valued functions on $\mathcal{X}$. Then, there exists a unique real Hilbert space $\mathcal{H}_k(\mathcal{X}) \subseteq \mathbb{R}^{\mathcal{X}}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ satisfying the following two properties: (i) $k(\cdot, x) \in \mathcal{H}_k(\mathcal{X})$ for all $x \in \mathcal{X}$; (ii) $\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x)$ for all $f \in \mathcal{H}_k(\mathcal{X})$ and $x \in \mathcal{X}$. Because of the second property, the kernel $k$ is called a reproducing kernel and $\mathcal{H}_k(\mathcal{X})$ is called the reproducing kernel Hilbert space (RKHS).
For a subset $X \subseteq \mathcal{X}$, we denote by $V(X)$ the vector subspace of $\mathbb{R}^{\mathcal{X}}$ spanned by $\{k(\cdot, x) : x \in X\}$. We define an inner product on $V(X)$ as follows. For $f = \sum_{i} a_i k(\cdot, x_i)$ and $g = \sum_{j} b_j k(\cdot, y_j)$ with $x_i, y_j \in X$, we define $\langle f, g \rangle = \sum_{i, j} a_i b_j k(x_i, y_j)$. Since $k$ is symmetric and positive definite, $V(X)$ becomes a pre-Hilbert space with this inner product. It is known that the RKHS $\mathcal{H}_k(\mathcal{X})$ is isomorphic to the completion of $V(\mathcal{X})$. Therefore, for each $f \in \mathcal{H}_k(\mathcal{X})$, there exist a sequence of points $x_1, x_2, \dots \in \mathcal{X}$ and real numbers $a_1, a_2, \dots$ such that $f = \sum_{i=1}^{\infty} a_i k(\cdot, x_i)$. Here the convergence is with respect to the norm of $\mathcal{H}_k(\mathcal{X})$ and, because of a special property of the RKHS, it is also pointwise convergence.
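Concretely, the pre-Hilbert inner product of two finite combinations reduces to a quadratic form in a kernel matrix; a small sketch (the function names are ours, and `kernel` is any symmetric positive definite kernel such as those in §3):

```python
import numpy as np

def rkhs_inner(a, X, b, Y, kernel):
    """<f, g> for f = sum_i a_i k(., X[i]) and g = sum_j b_j k(., Y[j])."""
    K = np.array([[kernel(x, y) for y in Y] for x in X])  # cross kernel matrix
    return a @ K @ b

def rkhs_norm(a, X, kernel):
    """||f||_H = sqrt(a^T K a) for a finite combination f = sum_i a_i k(., X[i])."""
    return np.sqrt(rkhs_inner(a, X, a, X, kernel))
```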
4.2 Power Function and its Convergence Rate
Since for any $f \in \mathcal{H}_k(\mathcal{X})$ there exists a sequence of finite sums that converges to $f$, it is natural to consider the error between $f$ and a finite sum. A natural notion to capture the error for any $f$ is the power function defined as below. For a finite subset of points $X = \{x_1, \dots, x_n\} \subseteq \mathcal{X}$, we denote by $\Pi_X$ the orthogonal projection onto $V(X)$. We note that the function $\Pi_X f$ is characterized as the interpolant of $f$, i.e., $\Pi_X f$ is the unique function in $V(X)$ satisfying $\Pi_X f(x_i) = f(x_i)$ for all $x_i \in X$. Then the power function is defined as:

$$P_X(x) = \sup_{\|f\|_{\mathcal{H}_k} \le 1} |f(x) - \Pi_X f(x)|.$$

By definition, we have

$$|f(x) - \Pi_X f(x)| \le P_X(x)\, \|f\|_{\mathcal{H}_k}$$

for any $f \in \mathcal{H}_k(\mathcal{X})$ and $x \in \mathcal{X}$.
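For a finite $X$, the power function admits the closed form $P_X(x)^2 = k(x, x) - k_X(x)^\top K_X^{-1} k_X(x)$, where $K_X = (k(x_i, x_j))_{i,j}$ and $k_X(x) = (k(x, x_i))_i$; this follows from the interpolation characterization of $\Pi_X$. A short sketch (the helper names are ours):

```python
import numpy as np

def power_function(x, X, kernel):
    """P_X(x) via P_X(x)^2 = k(x, x) - k_X(x)^T K_X^{-1} k_X(x)."""
    K = np.array([[kernel(u, v) for v in X] for u in X])
    kx = np.array([kernel(x, u) for u in X])
    # Solve rather than explicitly inverting K for numerical stability.
    p2 = kernel(x, x) - kx @ np.linalg.solve(K, kx)
    return np.sqrt(max(p2, 0.0))
```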
Since the power function represents how well the space $V(X)$ approximates any function in $\mathcal{H}_k(\mathcal{X})$ with a bounded norm, it is intuitively clear that the value of the power function is small if $X$ is a "fine" discretization of $\mathcal{X}$. The fineness of a finite subset $X \subseteq \mathcal{X}$ can be evaluated by the fill distance of $X$, defined as $h_{X, \mathcal{X}} = \sup_{x \in \mathcal{X}} \min_{x' \in X} \|x - x'\|_2$. We introduce classical results on the convergence rate of the power function as $h_{X, \mathcal{X}} \to 0$. We introduce two kinds of these results: polynomial decay and exponential decay. (We note that more general results, including the case of conditionally positive definite kernels and derivatives of functions in the RKHS, are proved in (Wendland, 2004, Chapter 11).) Before introducing the results, we define the smoothness of kernels.
Definition 1.
(i) We say $k$ has finite smoothness with a smoothness parameter $\nu > 0$ (by abuse of notation, omitting $\nu$, we also say "$k$ has finite smoothness") if $\mathcal{X}$ is bounded and satisfies an interior cone condition (see the remark below), and $k$ satisfies either the following condition (a) or (b): (a) $\nu \in \mathbb{Z}_{>0}$, $k \in C^{2\nu}(\mathcal{X}^{\circ} \times \mathcal{X}^{\circ})$, and all the differentials of $k$ of order $2\nu$ are bounded on $\mathcal{X}^{\circ} \times \mathcal{X}^{\circ}$. Here $\mathcal{X}^{\circ}$ denotes the interior. (b) There exists $\Phi$ such that $k(x, y) = \Phi(x - y)$, $\Phi \in L^1(\mathbb{R}^d) \cap C(\mathbb{R}^d)$, $\Phi$ has a continuous Fourier transform $\hat{\Phi}$, and $\hat{\Phi}(\omega) = \Theta(\|\omega\|^{-2\nu - d})$ as $\|\omega\| \to \infty$.
(ii) We say $k$ has infinite smoothness if $\mathcal{X}$ is a $d$-dimensional cube, $k(x, y) = \Phi(x - y)$ with a function $\Phi \in C^{\infty}(\mathbb{R}^d)$, and there exist a positive integer $l_0$ and a constant $c > 0$ such that the derivatives of $\Phi$ of each order $l \ge l_0$ satisfy the growth condition required in (Wendland, 2004, Theorem 11.22) at every point.
Remark 2.
(i) The results introduced in this subsection depend on local polynomial reproduction on $\mathcal{X}$, and such a result is hopeless if $\mathcal{X}$ is a general bounded set (Wendland, 2004). The interior cone condition is a mild condition that ensures such results. For example, if $\mathcal{X}$ is a cube or a ball, then this condition is satisfied. (ii) Since the Matérn kernel with parameter $\nu$ satisfies $k_{\mathrm{Mat}}(x, y) = \Phi(x - y)$ with $\hat{\Phi}(\omega) = \Theta(\|\omega\|^{-2\nu - d})$, Matérn kernels have finite smoothness with smoothness parameter $\nu$. In addition, it can be shown that the RQ and SE kernels have infinite smoothness.
Theorem 3 (Wu & Schaback (1993), Wendland (2004) Theorem 11.13).
We assume $k$ has finite smoothness with smoothness parameter $\nu$. Then there exist constants $h_0 > 0$ and $C > 0$ that depend only on $\mathcal{X}$, $k$, and $\nu$ such that $P_X(x) \le C h_{X, \mathcal{X}}^{\nu}$ for any $X$ with $h_{X, \mathcal{X}} \le h_0$.
One can apply this result to RQ and SE kernels for any $\nu > 0$, but a stronger result holds for these kernels.
Theorem 4 (Madych & Nelson (1992), Wendland (2004) Theorem 11.22).
Let $\mathcal{X}$ be a cube and assume $k$ has infinite smoothness. Then, there exist constants $c, h_0 > 0$ depending only on $\mathcal{X}$ and $k$ such that

$$P_X(x) \le \exp\left(-c / h_{X, \mathcal{X}}\right)$$

for any finite subset $X$ with $h_{X, \mathcal{X}} \le h_0$.
Remark 5.
(i) The assumption on $\mathcal{X}$ can be relaxed, i.e., the set is not necessarily a cube. See Madych & Nelson (1992) for details. (ii) In the case of SE kernels, a stronger result holds. More precisely, for sufficiently small $h_{X, \mathcal{X}}$, $P_X(x) \le \exp\left(-c\, |\log h_{X, \mathcal{X}}| / h_{X, \mathcal{X}}\right)$ holds.
4.3 P-greedy Algorithm and its Convergence Rate
In a typical application, for a given discretization $X_0 \subseteq \mathcal{X}$ and a function $f \in \mathcal{H}_k(\mathcal{X})$, we want to find a finite subset $X \subseteq X_0$ with $|X| \ll |X_0|$ so that $f$ is close to an element of $V(X)$. Several greedy algorithms have been proposed to solve this problem (De Marchi et al., 2005; Schaback & Wendland, 2000; Müller, 2009). Among them, the $P$-greedy algorithm (De Marchi et al., 2005) is the most suitable for our purpose, since its point selection depends only on the kernel and $X_0$ but not on the function $f$, which is unknown to the learner in the bandit setting.
The $P$-greedy algorithm first selects a point $x_1$ maximizing $P_{\emptyset}(x) = \sqrt{k(x, x)}$ and, after selecting points $x_1, \dots, x_n$, it selects $x_{n+1}$ by maximizing $P_{X_n}(x)$, where $X_n = \{x_1, \dots, x_n\}$. Following Pazouki & Schaback (2011), we introduce (a variant of) the $P$-greedy algorithm that simultaneously computes the Newton basis (Müller & Schaback, 2009) in Algorithm 1. If the input set $X_0$ is finite, this algorithm outputs the Newton basis at the cost of $O(|X_0| N^2)$ time using $O(|X_0| N)$ space, where $N$ is the number of selected points. The Newton basis $v_1, \dots, v_n$ is the Gram–Schmidt orthonormalization of the basis $k(\cdot, x_1), \dots, k(\cdot, x_n)$. Because of the orthonormality, the following equality holds (Santin & Haasdonk, 2017, Lemma 5):

$$P_{X_n}(x)^2 = k(x, x) - \sum_{j=1}^{n} v_j(x)^2,$$

where $X_n = \{x_1, \dots, x_n\}$. Although Algorithm 1 seemingly differs from the $P$-greedy algorithm described above, using this formula, we can see that the two algorithms are identical.
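The following is a minimal sketch of this variant of the $P$-greedy algorithm, assuming the kernel is supplied via the two helper callbacks `kernel_diag` and `kernel_col` (our names); the array `p` maintains the squared power function values on the discretization, so the update `p -= v**2` is exactly the identity quoted above.

```python
import numpy as np

def p_greedy(X0, kernel_diag, kernel_col, eps):
    """P-greedy selection with Newton basis (a sketch of Algorithm 1).

    X0          : (m, d) array of candidate points (the discretization).
    kernel_diag : X0 -> (k(x_i, x_i))_i.
    kernel_col  : (X0, j) -> (k(x_i, x_j))_i, the j-th kernel column.
    eps         : admissible error; stop once max_x P(x) <= eps.
    Returns the selected indices and V with V[i, n] = v_{n+1}(X0[i]).
    """
    m = X0.shape[0]
    p = kernel_diag(X0).astype(float)  # p[i] = P_{X_n}(x_i)^2, initially k(x_i, x_i)
    V = np.zeros((m, 0))
    idx = []
    while len(idx) < m:
        j = int(np.argmax(p))
        if np.sqrt(max(p[j], 0.0)) <= eps:
            break
        # Next Newton basis function: orthogonalize k(., x_j) against v_1, ..., v_n.
        u = kernel_col(X0, j) - V @ V[j, :]
        v = u / np.sqrt(p[j])
        V = np.hstack([V, v[:, None]])
        idx.append(j)
        p = p - v ** 2                 # P_{X_{n+1}}(x)^2 = P_{X_n}(x)^2 - v(x)^2
    return idx, V
```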
The following theorem is essentially due to Santin & Haasdonk (2017); we provide a more general result.
Theorem 6 (Santin & Haasdonk (2017)).
Let $k$ be a symmetric positive definite kernel. Suppose that the $P$-greedy algorithm applied to $\mathcal{X}$ with error $\epsilon = 0$ gives points $x_1, x_2, \dots$, and put $X_n = \{x_1, \dots, x_n\}$. Then the following statements hold:
(i) Suppose $k$ has finite smoothness with smoothness parameter $\nu$. Then, there exists a constant $C$ depending only on $\mathcal{X}$, $k$, and $\nu$ such that

$$\sup_{x \in \mathcal{X}} P_{X_n}(x) \le C n^{-\nu/d}.$$

(ii) Suppose $k$ has infinite smoothness. Then there exist constants $C, c > 0$ depending only on $\mathcal{X}$ and $k$ such that $\sup_{x \in \mathcal{X}} P_{X_n}(x) \le C \exp\left(-c\, n^{1/d}\right)$.
The statements of the theorem are non-trivial in two respects. First, by Theorems 3 and 4, if $n$ is sufficiently large, there exists a subset $X \subseteq \mathcal{X}$ with $|X| = n$ that gives the same convergence rate as above (e.g., $X$ is a uniform mesh of $\mathcal{X}$). The theorem assures that the same convergence rate is achieved by the points selected by the $P$-greedy algorithm. Secondly, it also assures that the same result holds even if the $P$-greedy algorithm is applied to a subset $X_0 \subseteq \mathcal{X}$.
If the kernel has finite smoothness, Santin & Haasdonk (2017) only considered the case when the RKHS is norm equivalent to a Sobolev space, which is also norm equivalent to the RKHS associated with a Matérn kernel. One can prove Theorem 6 from Theorems 3, 4 and (DeVore et al., 2013, Corollary 3.3) by the same argument as Santin & Haasdonk (2017).
For later use, we provide a restatement of Theorem 6 as follows.
Corollary 7.
Let $\epsilon \in (0, 1]$ be a parameter and denote by $N(\epsilon)$ the number of points returned by the $P$-greedy algorithm with error $\epsilon$.
(i) Suppose $k$ has finite smoothness with smoothness parameter $\nu$. Then $N(\epsilon) = O(\epsilon^{-d/\nu})$.
(ii) Suppose $k$ has infinite smoothness. Then $N(\epsilon) = O(\log^{d}(1/\epsilon))$.
5 Misspecified Linear Bandit Problem
Since we can approximate $f \in \mathcal{H}_k(\mathcal{X})$ by an element of a finite dimensional subspace of the RKHS, we study a linear bandit problem where the linear model is misspecified, i.e., the misspecified linear bandit problem (Lattimore et al., 2020; Lattimore & Szepesvári, 2020). In this section, we introduce several algorithms for the stochastic and adversarial misspecified linear bandit problems. It turns out that such algorithms can be constructed by modifying (or even without modifying) algorithms for the linear bandit problem. We provide the proofs for this section in the supplementary material.
First, we provide a formulation of the stochastic misspecified linear bandit problem suitable for our purpose. Let $\mathcal{A}$ be a set and suppose that there exists a map $\varphi$ from $\mathcal{A}$ to the unit ball of a Euclidean space $\mathbb{R}^N$. In each round $t$, a learner selects an action $a_t \in \mathcal{A}$ and the environment reveals a noisy reward $y_t = g(a_t) + \xi_t$, where $g(a) = \langle \theta, \varphi(a) \rangle + \eta(a)$, $\theta \in \mathbb{R}^N$, $\eta$ is a biased noise that satisfies $\|\eta\|_\infty \le \epsilon$, and $\epsilon$ is known to the learner. We also assume that there exists $B > 0$ such that $\|\theta\|_2 \le B$ and $\|g\|_\infty \le B$. As before, $\{\xi_t\}_t$ is conditionally $R$-sub-Gaussian w.r.t. a filtration $\{\mathcal{F}_t\}_t$ and we assume that $a_t$ is $\mathcal{F}_{t-1}$-measurable and $\xi_t$ is $\mathcal{F}_t$-measurable. The regret is defined as $R_T = \sum_{t=1}^{T}\left(\max_{a \in \mathcal{A}} g(a) - g(a_t)\right)$. We can formulate the adversarial misspecified linear bandit problem in a similar way. Let $g_1, \dots, g_T$ be a sequence of functions on $\mathcal{A}$ with $g_t(a) = \langle \theta_t, \varphi(a) \rangle + \eta_t(a)$, $\theta_t \in \mathbb{R}^N$, and $\|\eta_t\|_\infty \le \epsilon$, where the map $\varphi$ is as before. We also assume that there exists $B > 0$ such that $\|\theta_t\|_2 \le B$ and $\|g_t\|_\infty \le B$. In each round $t$, the learner selects an arm $a_t \in \mathcal{A}$ and observes a reward $g_t(a_t)$. The cumulative regret is defined as $R_T = \max_{a \in \mathcal{A}} \sum_{t=1}^{T} g_t(a) - \sum_{t=1}^{T} g_t(a_t)$.
First, we introduce a modification of LinUCB (Abbasi-Yadkori et al., 2011). To do this, we prepare notation for the stochastic linear bandit problem. Let $\lambda > 0$ and $\delta \in (0, 1)$ be parameters. We define $V_t = \lambda I_N + \sum_{s=1}^{t} \varphi(a_s)\varphi(a_s)^\top$, $\hat{\theta}_t = V_t^{-1} \sum_{s=1}^{t} y_s \varphi(a_s)$, and $\beta_t = R\sqrt{2\log\left(\det(V_t)^{1/2} \det(\lambda I_N)^{-1/2} / \delta\right)} + \sqrt{\lambda} B$. Here, $I_N$ is the identity matrix of size $N$. For $x \in \mathbb{R}^N$ and a positive definite matrix $A$ of size $N$, we define the Mahalanobis norm as $\|x\|_A = \sqrt{x^\top A x}$ and define $\mathrm{UCB}_t(a)$ as $\langle \hat{\theta}_{t-1}, \varphi(a) \rangle + \beta_{t-1} \|\varphi(a)\|_{V_{t-1}^{-1}}$.
We note that, by the proof of (Abbasi-Yadkori et al., 2011, Lemma 11), the computational complexity for updating $\beta_t$ is $O(N^2)$ at each round.
Lattimore et al. (2020) (see the appendix of its arXiv version) considered a modification of LinUCB which selects the arm maximizing a (modified) UCB in the $t$th round and proved that the regret of the algorithm is upper bounded by $\tilde{O}(N\sqrt{T} + \epsilon\sqrt{N}\,T)$. However, computing their upper confidence bound requires $O(tN)$ time for each arm at round $t$. Therefore, accepting an additional factor in the second term of the regret upper bound above, we consider another upper confidence bound which can be computed efficiently. In the $t$th round, our modification of the UCB-type algorithm selects the arm maximizing the modified UCB $\widetilde{\mathrm{UCB}}_t(a)$, which is defined as $\widetilde{\mathrm{UCB}}_t(a) = \langle \hat{\theta}_{t-1}, \varphi(a) \rangle + (\beta_{t-1} + \epsilon\sqrt{t}) \|\varphi(a)\|_{V_{t-1}^{-1}}$. Then, by storing $V_{t-1}^{-1}$ in each round, the complexity for computing this value is $O(N^2)$ for each $a$ and, as is well known, one can update $V_t^{-1}$ in $O(N^2)$ time using the Sherman–Morrison formula. By the standard argument, we can prove the following regret bound.
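A minimal sketch of this modified UCB algorithm follows; the $\epsilon\sqrt{t}$ inflation of the confidence width and the form of $\beta_t$ are our reconstruction of the discussion above, not a verbatim transcription of Algorithm 2.

```python
import numpy as np

class MisspecifiedLinUCB:
    """Sketch of the modified LinUCB for the misspecified linear bandit."""

    def __init__(self, N, lam, B, R, eps):
        self.Vinv = np.eye(N) / lam    # V_t^{-1}, maintained directly
        self.b = np.zeros(N)           # sum_s y_s * phi(a_s)
        self.theta = np.zeros(N)       # hat{theta}_t
        self.eps, self.t = eps, 0

    def ucb(self, phi, beta):
        # O(N^2) per arm: Mahalanobis width via the stored inverse.
        width = np.sqrt(phi @ self.Vinv @ phi)
        return self.theta @ phi + (beta + self.eps * np.sqrt(self.t + 1)) * width

    def update(self, phi, y):
        # Sherman-Morrison: (V + phi phi^T)^{-1} in O(N^2).
        Vp = self.Vinv @ phi
        self.Vinv -= np.outer(Vp, Vp) / (1.0 + phi @ Vp)
        self.b += y * phi
        self.theta = self.Vinv @ self.b
        self.t += 1
```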
Proposition 8.
Let the notation and assumptions be as above. We further assume that $\lambda \ge 1$. Then with probability at least $1 - \delta$, the regret of the modified UCB algorithm satisfies

$$R_T = O\left(\left(\beta_T + \epsilon\sqrt{T}\right)\sqrt{TN\log\left(1 + T/(\lambda N)\right)}\right).$$

In particular, we have $R_T = \tilde{O}\left(N\sqrt{T} + \epsilon\sqrt{N}\,T\right)$.
In the supplementary material, we also introduce a modification of Thompson Sampling.
The regret upper bound provided above does not depend on the arm set $\mathcal{A}$. Moreover, the same results hold even if the arm set changes over time steps (with a minor modification of the definition of the regret). On the other hand, several authors (Lattimore et al., 2020; Auer, 2002; Valko et al., 2013) studied algorithms whose regret depends on the cardinality of the arm set in the stochastic linear or RKHS setting. In some rounds, such algorithms eliminate arms that are supposed to be non-optimal with high probability, and therefore the arm set should remain the same over time. Generally, these algorithms are more complicated than LinUCB or Thompson Sampling. However, recently, Lattimore et al. (2020) proposed a simple yet sophisticated algorithm called PHASED ELIMINATION using the Kiefer–Wolfowitz theorem. Furthermore, they showed that it works well for the stochastic misspecified linear bandit problem without modification. More precisely, they proved the following result.
Theorem 9 (Lattimore et al. (2020); Lattimore & Szepesvári (2020)).
Let $R_T$ be the regret PHASED ELIMINATION incurs for the stochastic misspecified linear bandit problem. We further assume that $\{\xi_t\}_t$ is an independent $R$-sub-Gaussian sequence. Then, with probability at least $1 - \delta$, we have

$$R_T = O\left(\sqrt{TN\log(|\mathcal{A}| T / \delta)} + \epsilon\sqrt{N}\, T \log T\right).$$

Moreover, they also bounded the total computational complexity up to round $T$.
Remark 10.
Although they provided an upper bound for the expected regret, it is not difficult to see that their proof gives a high-probability regret upper bound.
Next, we show that EXP3 for adversarial linear bandits (c.f. Lattimore & Szepesvári (2020)) works for the adversarial misspecified linear bandit problem without modification. We introduce notation for EXP3. Let $\eta > 0$ be a learning rate, $\gamma \in (0, 1)$ an exploration parameter, and $\pi$ an exploration distribution over $\mathcal{A}$. For a distribution $p$ on $\mathcal{A}$, we define a matrix $Q(p) = \sum_{a \in \mathcal{A}} p(a) \varphi(a)\varphi(a)^\top$. We also put $\hat{S}_0(a) = 0$ and $\hat{S}_t(a) = \hat{S}_{t-1}(a) + \hat{g}_t(a)$ for $t \ge 1$, where $\hat{g}_t(a) = \varphi(a)^\top Q_t^{-1} \varphi(a_t) y_t$ and the matrix $Q_t$ is defined later. We define a distribution $q_t$ over $\mathcal{A}$ by $q_t(a) \propto \exp(\eta \hat{S}_{t-1}(a))$ and a distribution $p_t$ by $p_t(a) = (1 - \gamma) q_t(a) + \gamma \pi(a)$ for $a \in \mathcal{A}$. The matrix $Q_t$ is defined as $Q_t = Q(p_t)$. We put $\hat{\theta}_t = Q_t^{-1}\varphi(a_t) y_t$, so that $\hat{g}_t(a) = \langle \varphi(a), \hat{\theta}_t \rangle$.
Proposition 11.
We assume that $\{\varphi(a) : a \in \mathcal{A}\}$ spans $\mathbb{R}^N$. We also assume that $\pi$ satisfies $\max_{a \in \mathcal{A}} \|\varphi(a)\|^2_{Q(\pi)^{-1}} \le N$ and we take $\gamma = \eta N$. Then, applying EXP3 to the adversarial misspecified linear bandit problem, we have the following upper bound for the expected regret:

$$\mathbb{E}[R_T] = O\left(\sqrt{NT\log|\mathcal{A}|} + \epsilon\sqrt{N}\,T\right).$$
Remark 12.
By the Kiefer–Wolfowitz theorem, there exists an exploration distribution $\pi$ such that $\max_{a \in \mathcal{A}} \|\varphi(a)\|^2_{Q(\pi)^{-1}} \le N$.
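Such an exploration distribution can be computed approximately with Frank–Wolfe iterations on the log-determinant objective; a short sketch (the iteration count, initialization, and tolerance are assumptions):

```python
import numpy as np

def optimal_design(Phi, n_iter=1000):
    """Frank-Wolfe for the D-optimal design on the rows of Phi (|A| x N).
    At the optimum, max_a ||phi(a)||_{Q(pi)^{-1}}^2 = N (Kiefer-Wolfowitz)."""
    A, N = Phi.shape
    pi = np.full(A, 1.0 / A)
    for _ in range(n_iter):
        Q = Phi.T @ (Phi * pi[:, None])           # Q(pi) = sum_a pi(a) phi phi^T
        g = np.einsum('an,nm,am->a', Phi, np.linalg.inv(Q), Phi)  # ||phi||^2_{Q^-1}
        a = int(np.argmax(g))
        if g[a] <= N * (1.0 + 1e-3):
            break
        step = (g[a] / N - 1.0) / (g[a] - 1.0)    # closed-form line search
        pi = (1.0 - step) * pi
        pi[a] += step
    return pi
```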
6 Main Results
Using the results from approximation theory explained in §4 and the algorithms for the misspecified linear bandit problem in §5, we provide several algorithms for the stochastic and adversarial RKHS bandit problems. We provide the proofs of the results in this section in the supplementary material.
Let $v_1, \dots, v_N$ be the Newton basis returned by Algorithm 1 with admissible error $\epsilon$, and put $\varphi(x) = (v_1(x), \dots, v_N(x))^\top$. Then, by the orthonormality of the Newton basis and the definition of the power function, for any $f \in \mathcal{H}_k(\mathcal{X})$ with $\|f\|_{\mathcal{H}_k} \le B$ and any $x \in \mathcal{X}$, we have

$$f(x) = \langle \theta, \varphi(x) \rangle + \eta(x), \qquad \|\eta\|_\infty \le B\epsilon,$$

where $\theta = (\langle f, v_1 \rangle_{\mathcal{H}_k}, \dots, \langle f, v_N \rangle_{\mathcal{H}_k})^\top$ and $\|\theta\|_2 \le B$. Therefore, if $f$ is an objective function of an RKHS bandit problem, we can regard it as a misspecified linear model and apply algorithms for the misspecified linear bandit problem to solve the original RKHS bandit problem.
In this section, we reduce the RKHS bandit problem to the misspecified linear bandit problem by the map $\varphi$ and apply the modified LinUCB, PHASED ELIMINATION, and EXP3 to the reduced problem. We call these algorithms APG-UCB, APG-PE, and APG-EXP3, respectively; APG-UCB is displayed in Algorithm 2. We denote by $N = N(\epsilon)$ the number of points returned by Algorithm 1 with admissible error $\epsilon$. By the results in §4, we have an upper bound on $N(\epsilon)$ (Corollary 7).
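Putting the pieces together, the reduction maps each arm to its Newton-basis feature vector and runs a misspecified linear bandit algorithm on top. A sketch of APG-UCB assuming the helpers sketched earlier (`p_greedy`, `MisspecifiedLinUCB`) and a hypothetical environment callback `pull`; the form of `beta` repeats our earlier reconstruction:

```python
import numpy as np

def apg_ucb(X0, kernel_diag, kernel_col, B, R, lam, eps, T, pull, delta=0.05):
    """APG-UCB sketch: P-greedy features + modified LinUCB.
    `pull(a)` is a hypothetical callback returning the noisy reward of arm a."""
    idx, V = p_greedy(X0, kernel_diag, kernel_col, eps)  # phi(x_i) = V[i]
    N = V.shape[1]
    agent = MisspecifiedLinUCB(N, lam, B, R, B * eps)    # misspecification B * eps
    for t in range(T):
        beta = R * np.sqrt(N * np.log((1 + t / lam) / delta)) + np.sqrt(lam) * B
        a = int(np.argmax([agent.ucb(V[i], beta) for i in range(len(X0))]))
        agent.update(V[a], pull(a))
```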
First, we state the results for APG-UCB.
Theorem 13.
We denote by $R_T$ the regret that Algorithm 2 incurs for the stochastic RKHS bandit problem up to time step $T$ and assume that $\lambda \ge 1$ and $\|f\|_{\mathcal{H}_k} \le B$. Then with probability at least $1 - \delta$, $R_T$ is given as

$$R_T = \tilde{O}\left(N\sqrt{T} + B\epsilon\sqrt{N}\,T\right), \qquad N = N(\epsilon),$$

and the total computational complexity of the algorithm is given as $O(|\mathcal{A}| T N^2)$.
The admissible error $\epsilon$ balances the computational complexity and the regret. However, this trade-off is not clear from Theorem 13 alone. The following theorem provides another upper bound for APG-UCB; it states that if we take a smaller error $\epsilon$, then the regret upper bound of APG-UCB is almost the same as that of IGP-UCB.
Theorem 14.
We assume $\lambda \ge 1$ and take the parameter $\epsilon$ of APG-UCB so that $\epsilon \le T^{-3/4}$. We define $N$ as $N(\epsilon)$. Then with probability at least $1 - \delta$, we have $R_T \le F(T)$, where $F(T)$ is given as

$$F(T) = O\left(\sqrt{T \hat{\gamma}_{NT}}\left(B + \sqrt{\hat{\gamma}_{NT} + \log(1/\delta)}\right)\right),$$

and $\hat{\gamma}_{NT}$ is an approximation of the maximum information gain defined through the Newton basis features (see Lemma 35 in the supplementary material).
Remark 15.
Since the main term of $F(T)$ is $\sqrt{T \hat{\gamma}_{NT}}\left(B + \sqrt{\hat{\gamma}_{NT} + \log(1/\delta)}\right)$ and, by the proof in (Chowdhury & Gopalan, 2017), IGP-UCB has the regret upper bound $O\left(\sqrt{T\gamma_T}\left(B + \sqrt{\gamma_T + \log(1/\delta)}\right)\right)$, APG-UCB has asymptotically the same regret upper bound as IGP-UCB if we take a small error $\epsilon$. We note that if $T$ is sufficiently large compared to $N$ (this is always the case if the kernel has infinite smoothness), then APG-UCB is more efficient than IGP-UCB. We note that for any choice of parameters, the regret upper bound of BBKB is given as $\tilde{O}(\sqrt{T}\gamma_T)$, where $\gamma_T$ is the maximum information gain.
Next, we state the results for APG-PE.
Theorem 16.
We denote by $R_T$ the regret that APG-PE with $\epsilon = 1/\sqrt{T}$ incurs for the stochastic RKHS bandit problem up to time step $T$. We further assume that $\{\xi_t\}_t$ is an independent $R$-sub-Gaussian sequence. Then with probability at least $1 - \delta$, we have $R_T = \tilde{O}\big((B+1)\sqrt{TN\log|\mathcal{A}|}\big)$ with $N = N(1/\sqrt{T})$, and the total computational complexity is that of PHASED ELIMINATION with $N$-dimensional features plus the $O(|\mathcal{A}| N^2)$ cost of computing the Newton basis.
Finally, we state a result for the adversarial RKHS bandit problem.
Theorem 17.
We denote by $R_T$ the cumulative regret that APG-EXP3 with $\epsilon = 1/\sqrt{T}$ and $N = N(\epsilon)$ incurs for the adversarial RKHS bandit problem up to time step $T$. Then, with appropriate choices of the learning rate and exploration distribution, the expected regret is given as $\mathbb{E}[R_T] = \tilde{O}\big(B\sqrt{NT\log|\mathcal{A}|}\big)$.
7 Discussion
So far, we have emphasized the advantages of our methods. In this section, we discuss their limitations. Here, we focus on Theorem 13 with $\epsilon = 1/\sqrt{T}$ and Theorem 16. Since we do not see limitations if the kernel has infinite smoothness, in this section we assume the kernel is a Matérn kernel. In our theoretical results, $N = N(1/\sqrt{T})$ plays a similar role to the maximum information gain $\gamma_T$ in the theoretical result of BBKB. If the kernel is a Matérn kernel with parameter $\nu$, then, by recent results on the information gain (Vakili et al., 2021), we have $\gamma_T = \tilde{O}\big(T^{d/(2\nu + d)}\big)$, which is nearly optimal by (Scarlett et al., 2017) and is slightly better than the upper bound $N(1/\sqrt{T}) = O\big(T^{d/(2\nu)}\big)$. Therefore, in this case the regret upper bound of Theorem 13 is slightly worse than the regret upper bound of BBKB. In addition, SupKernelUCB similarly has a nearly optimal regret upper bound if the kernel is a Matérn kernel, but the regret upper bound of APG-PE is slightly worse in that case.
The inferiority of our method in the Matérn kernel case might be counter-intuitive, since it is also proved that the convergence rate of the power function for the Matérn kernel is optimal (c.f. Schaback (1995)) and Theorem 9 cannot be improved (Lattimore et al., 2020). We explain why a combination of optimal results leads to a non-optimal result. The results on the information gain depend on the eigenvalue decay of the Mercer operator rather than the decay of the power function in the $L^\infty$-norm as in this study. However, these two notions are closely related. From the $n$-width theory (Pinkus, 2012, Chapter IV, Corollary 2.6), the eigenvalue decay corresponds to the decay of the power function in the $L^2$-norm (or more precisely the Kolmogorov $n$-width). The decay in the $L^2$-norm is derived from that in the $L^\infty$-norm. If the kernel is a Matérn kernel, using a localization trick called Duchon's trick (Wendland, 1997), it is possible to give a faster decay in the $L^2$-norm than that in the $L^\infty$-norm. Since the norm relevant to the misspecified bandit problem is not the $L^2$-norm but the $L^\infty$-norm, we took the approach proposed in this paper.
8 Experiments
In this section, we empirically verify our theoretical results. We compare APG-UCB to IGP-UCB (Chowdhury & Gopalan, 2017) in terms of cumulative regret and running time for RQ and SE kernels in synthetic environments.
8.1 Environments
We assume the action set $\mathcal{A}$ is a discretization of the cube $[0, 1]^d$ for $d = 1, 2, 3$. We take the discretization so that the number of arms $|\mathcal{A}|$ is roughly the same for each $d$; more precisely, we define $\mathcal{A}$ as a uniform grid on the cube. We randomly construct reward functions $f$ with $\|f\|_{\mathcal{H}_k} = B$ as follows. We randomly select points $z_1, z_2, \dots$ from $\mathcal{A}$ until a prescribed number of points is selected and compute an orthonormal basis $u_1, u_2, \dots$ of the span of $k(\cdot, z_1), k(\cdot, z_2), \dots$. Then, we define $f = B \sum_i w_i u_i$, where $w$ is a random vector with unit norm. We take the scale parameter proportional to $\sqrt{d}$ for the RQ kernel and for the SE kernel, because the diameter of the $d$-dimensional cube is $\sqrt{d}$. For each kernel, we generate reward functions as above and evaluate our proposed method and the existing algorithm over the time interval $T$ in terms of mean cumulative regret and total running time. We compute the mean of the cumulative regret and the running time over these 10 environments. We normalize the cumulative regret so that the normalized cumulative regret of the uniform random policy corresponds to the line through the origin with slope 1 in the figures. For simplicity, we assume the kernel, $B$, and $R$ are known to the algorithms. For the other parameters, we use the theoretically suggested ones for both APG-UCB and IGP-UCB. Computation is done on an Intel Xeon E5-2630 v4 processor with 128 GB RAM. In the supplementary material, we explain the experimental setting in more detail and provide additional experimental results.
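The construction of the random reward functions can be sketched as follows; the Cholesky-based orthonormalization and the stopping rule are our assumptions, but the resulting $f$ satisfies $\|f\|_{\mathcal{H}_k} = B$ by the orthonormality of the basis.

```python
import numpy as np

def random_reward(A, kernel, B, n_centers, rng):
    """Sample f = B * sum_i w_i u_i, where u_1, ..., u_n is an orthonormal
    basis of span{k(., z_i)} for randomly chosen centers z_i (a sketch)."""
    centers = rng.choice(len(A), size=n_centers, replace=False)
    Z = A[centers]
    K = np.array([[kernel(x, y) for y in Z] for x in Z])
    # Columns of L^{-T} give coefficients of an orthonormal basis of the span.
    L = np.linalg.cholesky(K)
    C = np.linalg.inv(L).T
    w = rng.standard_normal(n_centers)
    w /= np.linalg.norm(w)            # unit coefficient vector => ||f||_H = B
    coef = B * (C @ w)                # f(x) = sum_i coef_i k(x, z_i)
    return Z, coef
```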
8.2 Results
We show the results for normalized cumulative regret in Figures 1 and 2. As suggested by the theoretical results, the growth of the cumulative regret of these algorithms is of order $\sqrt{T}$ ignoring a polylogarithmic factor. Although the convergence rate of the power function of SE kernels is slightly faster than that of RQ kernels (by the remark after Theorem 4), the empirical results for RQ kernels and SE kernels are similar. In both cases, APG-UCB has almost the same cumulative regret as that of IGP-UCB.
We also show the (mean) total running time in Table 1, where we abbreviate APG-UCB as APG and IGP-UCB as IGP. For all dimensions, it took about five to six thousand seconds for IGP-UCB to complete an experiment for one environment. As shown by the table and figures, the running time of our method is much shorter than that of IGP-UCB, while it has almost the same regret as IGP-UCB.
Table 1: Mean total running time in seconds (APG = APG-UCB, IGP = IGP-UCB).

| | APG(RQ) | IGP(RQ) | APG(SE) | IGP(SE) |
|---|---|---|---|---|
| d = 1 | 4.2e-01 | 5.7e+03 | 4.0e-01 | 5.7e+03 |
| d = 2 | 2.7e+00 | 5.1e+03 | 2.9e+00 | 5.1e+03 |
| d = 3 | 3.0e+01 | 5.7e+03 | 4.3e+01 | 5.7e+03 |
9 Conclusion
By reducing the RKHS bandit problem to the misspecified linear bandit problem, we provide the first general algorithm for the adversarial RKHS bandit problem and several efficient algorithms for the stochastic RKHS bandit problem. We provide cumulative regret upper bounds for them and empirically verify our theoretical results.
10 Acknowledgement
We would like to thank the anonymous reviewers for suggestions that improved the paper. We also would like to thank Janmajay Singh and Takafumi J. Suzuki for valuable comments on a preliminary version of the manuscript.
References
- Abbasi-Yadkori et al. (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. Improved algorithms for linear stochastic bandits. In Advances in Neural Information Processing Systems, pp. 2312–2320, 2011.
- Agrawal & Goyal (2013) Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, pp. 127–135, 2013.
- Auer (2002) Auer, P. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
- Bubeck & Cesa-Bianchi (2012) Bubeck, S. and Cesa-Bianchi, N. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
- Calandriello et al. (2019) Calandriello, D., Carratino, L., Lazaric, A., Valko, M., and Rosasco, L. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, pp. 533–557. PMLR, 2019.
- Calandriello et al. (2020) Calandriello, D., Carratino, L., Valko, M., Lazaric, A., and Rosasco, L. Near-linear time gaussian process optimization with adaptive batching and resparsification. In Proceedings of the 37th International Conference on Machine Learning, 2020.
- Chatterji et al. (2019) Chatterji, N., Pacchiano, A., and Bartlett, P. Online learning with kernel losses. In International Conference on Machine Learning, pp. 971–980. PMLR, 2019.
- Chowdhury & Gopalan (2017) Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In Proceedings of the 34th International Conference on Machine Learning, pp. 844–853, 2017. supplementary material.
- De Marchi et al. (2005) De Marchi, S., Schaback, R., and Wendland, H. Near-optimal data-independent point locations for radial basis function interpolation. Advances in Computational Mathematics, 23(3):317–330, 2005.
- DeVore et al. (2013) DeVore, R., Petrova, G., and Wojtaszczyk, P. Greedy algorithms for reduced bases in banach spaces. Constructive Approximation, 37(3):455–466, 2013.
- Hoffman et al. (1953) Hoffman, A., Wielandt, H., et al. The variation of the spectrum of a normal matrix. Duke Mathematical Journal, 20(1):37–39, 1953.
- Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge University Press, 2020.
- Lattimore et al. (2020) Lattimore, T., Szepesvari, C., and Weisz, G. Learning with good feature representations in bandits and in rl with a generative model. In Proceedings of the 37th International Conference on Machine Learning, 2020.
- Madych & Nelson (1992) Madych, W. and Nelson, S. Bounds on multivariate polynomials and exponential error estimates for multiquadric interpolation. Journal of Approximation Theory, 70(1):94–114, 1992.
- Müller (2009) Müller, S. Komplexität und Stabilität von kernbasierten Rekonstruktionsmethoden. PhD thesis, Fakultät für Mathematik und Informatik, Georg-August-Universität Göttingen, 2009.
- Müller & Schaback (2009) Müller, S. and Schaback, R. A newton basis for kernel spaces. Journal of Approximation Theory, 161(2):645–655, 2009.
- Mutny & Krause (2018) Mutny, M. and Krause, A. Efficient high dimensional bayesian optimization with additivity and quadrature fourier features. In Advances in Neural Information Processing Systems, pp. 9005–9016, 2018.
- Pazouki & Schaback (2011) Pazouki, M. and Schaback, R. Bases for kernel-based spaces. Journal of Computational and Applied Mathematics, 236(4):575–588, 2011.
- Pinkus (2012) Pinkus, A. N-widths in Approximation Theory, volume 7. Springer Science & Business Media, 2012.
- Rahimi & Recht (2008) Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pp. 1177–1184, 2008.
- Santin & Haasdonk (2017) Santin, G. and Haasdonk, B. Convergence rate of the data-independent P-greedy algorithm in kernel-based approximation. Dolomites Research Notes on Approximation, 10 (Special Issue), 2017.
- Scarlett et al. (2017) Scarlett, J., Bogunovic, I., and Cevher, V. Lower bounds on regret for noisy gaussian process bandit optimization. In Conference on Learning Theory, pp. 1723–1742, 2017.
- Schaback (1995) Schaback, R. Error estimates and condition numbers for radial basis function interpolation. Advances in Computational Mathematics, 3(3):251–264, 1995.
- Schaback & Wendland (2000) Schaback, R. and Wendland, H. Adaptive greedy techniques for approximate solution of large rbf systems. Numerical Algorithms, 24(3):239–254, 2000.
- Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pp. 1015–1022, 2010.
- Valko et al. (2013) Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence, 2013.
- Wendland (1997) Wendland, H. Sobolev-type error estimates for interpolation by radial basis functions. Surface fitting and multiresolution methods, pp. 337–344, 1997.
- Wendland (2004) Wendland, H. Scattered data approximation, volume 17. Cambridge University Press, 2004.
- Wu & Schaback (1993) Wu, Z.-m. and Schaback, R. Local error estimates for radial basis function interpolation of scattered data. IMA Journal of Numerical Analysis, 13(1):13–27, 1993.
Appendix
Appendix A Additional Results for Thompson Sampling
A.1 Misspecified Linear Bandit Problem
We consider a modification of Thompson Sampling (Agrawal & Goyal, 2013). In the $t$th round, we sample $\tilde{\theta}_t$ from the multivariate normal distribution $\mathcal{N}(\hat{\theta}_{t-1}, v_t^2 V_{t-1}^{-1})$, and the modified algorithm selects the arm $a_t$ that maximizes $\langle \tilde{\theta}_t, \varphi(a) \rangle$. Then the following result holds.
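A sketch of the per-round sampling step follows; the scale $v_t$ of the posterior covariance is left as a parameter, since its exact value follows the modified analysis.

```python
import numpy as np

def ts_select(Phi, theta_hat, Vinv, v_t, rng):
    """One round of the modified Thompson Sampling (sketch): sample
    theta_tilde ~ N(theta_hat, v_t^2 * V^{-1}) and play the maximizer."""
    theta_tilde = rng.multivariate_normal(theta_hat, (v_t ** 2) * Vinv)
    return int(np.argmax(Phi @ theta_tilde))
```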
Proposition 18.
We assume that $\lambda \ge 1$. Then, with probability at least $1 - \delta$, the regret of the modified Thompson Sampling algorithm is upper bounded by

$$R_T = \tilde{O}\left(N^{3/2}\sqrt{T} + \epsilon\sqrt{N}\,T\right).$$
We provide proof of the proposition in §B.4.
A.2 Thompson Sampling for the Stochastic RKHS Bandit Problem
We provide a result on a Thompson Sample based algorithm for the stochastic RKHS bandit problem.
Theorem 19.
We reduce the RKHS bandit problem to the misspecified linear bandit problem and apply the modified Thompson Sampling introduced above with admissible error $\epsilon$, where $N = N(\epsilon)$. We denote by $R_T$ its regret and assume that $\lambda \ge 1$. Then with probability at least $1 - \delta$, $R_T$ is upper bounded by

$$R_T = \tilde{O}\left(N^{3/2}\sqrt{T} + B\epsilon\sqrt{N}\,T\right).$$

The total computational complexity of the algorithm is given as $O(|\mathcal{A}| T N^2)$.
We provide proof of the theorem in §B.5.
Appendix B Proofs
We provide omitted proofs in the main article and §A.
B.1 Proof of Corollary 7
For completeness, we provide a proof of corollary 7.
Proof.
For simplicity, we consider only the infinite smoothness case. We use the same notation as in Theorem 6 and Algorithm 1. Denote by $N = N(\epsilon)$ the number of points returned by the algorithm with error $\epsilon$. Since the statement of the corollary is obvious if $N = 1$, we assume $N \ge 2$. Because the stopping condition is not satisfied at the $(N-1)$st point, we have $\epsilon < \|P_{X_{N-1}}\|_\infty$. If we run the algorithm with a sufficiently small error $\epsilon' < \epsilon$, then the algorithm returns at least $N$ points. Therefore, by the theorem and the inequality above, we have

$$\epsilon < \|P_{X_{N-1}}\|_\infty \le C \exp\left(-c (N-1)^{1/d}\right).$$

Ignoring constants other than $\epsilon$, we have $N = O(\log^{d}(1/\epsilon))$, which is the assertion of the corollary. ∎
B.2 Proof of Proposition 8
For symmetric matrices $A, B$ of size $N$, we write $A \preceq B$ if and only if $B - A$ is positive semi-definite, i.e., $x^\top (B - A) x \ge 0$ for all $x \in \mathbb{R}^N$. For completeness, we prove the following elementary lemma.
Lemma 20.
Let $A, B$ be symmetric positive definite matrices of size $N$ and assume that $A \preceq B$. Then we have $\det A \le \det B$.
Proof.
It is enough to prove the statement for $P^\top A P$ and $P^\top B P$ for some $P \in \mathrm{GL}_N(\mathbb{R})$, where $\mathrm{GL}_N(\mathbb{R})$ is the general linear group of degree $N$. Since $A$ is positive definite, using the Cholesky decomposition, one can prove that there exists $P \in \mathrm{GL}_N(\mathbb{R})$ such that $P^\top A P = I_N$ and $P^\top B P$ is a diagonal matrix. Then, the assumption implies that every diagonal entry of $P^\top B P$ is greater than or equal to $1$. Now, the statement is obvious. ∎
Next, we prove that $\widetilde{\mathrm{UCB}}_t$ is an upper confidence bound up to a constant.
Lemma 21.
We assume $\lambda \ge 1$. Then, with probability at least $1 - \delta$, we have

$$\left|g(a) - \langle \hat{\theta}_{t-1}, \varphi(a) \rangle\right| \le \left(\beta_{t-1} + \epsilon\sqrt{t}\right)\|\varphi(a)\|_{V_{t-1}^{-1}} + \epsilon$$

for any $a \in \mathcal{A}$ and $t \ge 1$.
Proof.
Proof of Proposition 8.
We assume $\lambda \ge 1$. Let $a^* \in \arg\max_{a \in \mathcal{A}} g(a)$ and let $a_1, \dots, a_T$ be the sequence of arms selected by the algorithm. Denote by $E$ the event on which the inequality in Lemma 21 holds for all $a$ and $t$. Then on the event $E$, we have

$$g(a^*) - g(a_t) \le 2\left(\beta_{t-1} + \epsilon\sqrt{t}\right)\|\varphi(a_t)\|_{V_{t-1}^{-1}} + 2\epsilon.$$

Therefore, on the event $E$, the regret is bounded by the sum of these instantaneous bounds. By the assumptions, we have $\|\varphi(a_t)\|_2 \le 1$ for any $t$. Therefore, by (Abbasi-Yadkori et al., 2011, Lemma 11), the following inequalities hold:

$$\sum_{t=1}^{T} \|\varphi(a_t)\|^2_{V_{t-1}^{-1}} \le 2\log\frac{\det V_T}{\det(\lambda I_N)} \le 2N\log\left(1 + \frac{T}{\lambda N}\right). \tag{1}$$
Thus, on the event $E$, by the Cauchy–Schwarz inequality and inequality (1), we have

$$R_T \le 2\left(\beta_T + \epsilon\sqrt{T}\right)\sqrt{2TN\log\left(1 + T/(\lambda N)\right)} + 2\epsilon T.$$

∎
B.3 Proof of Proposition 11
This proposition can be proved by adapting the standard proof for the adversarial linear bandit problem (Lattimore & Szepesvári, 2020; Bubeck & Cesa-Bianchi, 2012). We recall the notation for the adversarial misspecified linear bandit problem and EXP3. Let $\varphi: \mathcal{A} \to \mathbb{R}^N$ be a map and $g_1, \dots, g_T$ be a sequence of reward functions on $\mathcal{A}$ such that $g_t(a) = \langle \theta_t, \varphi(a) \rangle + \eta_t(a)$ for $t = 1, \dots, T$, where $\|\theta_t\|_2 \le B$, $\|\eta_t\|_\infty \le \epsilon$, and $\sup_{a} \|\varphi(a)\|_2 \le 1$.
Let $\gamma \in (0, 1)$ be an exploration parameter, $\eta > 0$ a learning rate, and $\pi$ an exploration distribution over $\mathcal{A}$. For a distribution $p$ over $\mathcal{A}$, we put $Q(p) = \sum_{a \in \mathcal{A}} p(a)\varphi(a)\varphi(a)^\top$. We define $\hat{S}_0(a) = 0$ and $\hat{S}_t(a) = \hat{S}_{t-1}(a) + \hat{g}_t(a)$ for $t \ge 1$, where the matrix $Q_t$ can be computed from the past observations at round $t$ and $\hat{g}_t$ is defined later. Let $q_t$ be a distribution over $\mathcal{A}$ such that $q_t(a) \propto \exp(\eta \hat{S}_{t-1}(a))$ and put $p_t(a) = (1 - \gamma) q_t(a) + \gamma \pi(a)$ for $a \in \mathcal{A}$. We assume that $Q_t = Q(p_t)$ is non-singular and define $\hat{g}_t(a) = \varphi(a)^\top Q_t^{-1}\varphi(a_t) y_t$. In this section, we assume $\gamma = \eta N$.
Let $a^*$ be an optimal arm and define the regret as $R_T = \mathbb{E}\left[\sum_{t=1}^{T}\left(g_t(a^*) - g_t(a_t)\right)\right]$. We have
(2) |
We denote by $\mathcal{F}_t$ the sigma field generated by $a_1, \dots, a_t$ and by $\mathbb{E}_t$ the conditional expectation conditioned on $\mathcal{F}_{t-1}$. We note that $q_t$, $p_t$, and $Q_t$ are $\mathcal{F}_{t-1}$-measurable but $a_t$ is not. Then we have
(3) |
Here we used and is defined as . Since , by inequalities (2), (3), we have
(4) |
We decompose , where
First, we bound . To do this, we prove the following lemma.
Lemma 22.
For any , the following inequality holds:
In particular, we have .
Proof.
We note that by conditioning on , randomness comes only from . By definition of , we have
Therefore,
Here in the second inequality, we use and the last inequality follows from the definition of . The second assertion follows from
∎
By this lemma, we can bound as follows.
Lemma 23.
The following inequality holds:
Proof.
Since is -measurable for any , we have
Therefore, we have
Here we used Lemma 22 in the last inequality. ∎
Next, we introduce the following elementary lemma (c.f. Chatterji et al. (2019, Lemma 49)).
Lemma 24.
Let and be a random variable. We assume that almost surely. Then we have
Proof.
By for and for , we have
∎
To apply the lemma with and , we prove the following:
Lemma 25.
Let and assume that . Then, we have .
Proof.
By definition of , we have
Here, in the last inequality, we use and the definition of . ∎
Lemma 26.
The following inequality holds:
Proof.
By definition of , we have
Here the last inequality follows from . Therefore,
Here the second equality follows from the fact that is -measurable and the linearity of the trace. The assertion of the lemma follows from this. ∎
Next, we give a lower bound for .
Lemma 27.
Let be any element. Then the following inequality holds:
Proof.
By definition of , we have
∎
B.4 Proof of Proposition 18
We assume . This can be proved by modifying the proof of (Agrawal & Goyal, 2013). Since most of their arguments are directly applicable to our case, we omit proofs of some lemmas. Let be the probability space on which all random variables considered here are defined, where is a -algebra on . We put and for and , we put . We also put and . In each round , is sampled from the multivariate normal distribution . For , we define by
and define event by
For an event , we denote by the corresponding indicator function. Then by assumptions, we see that , i.e., is -measurable. For a random variable on , we say “on event , the conditional expectation (or conditional probability) satisfies a property” if and only if satisfies the property for almost all .
Lemma 28.
and for all .
We note that the proof of (Agrawal & Goyal, 2013, Lemma 2) works if , i.e., we have the following lemma:
Lemma 29.
On event , we have
where .
The main differences between our proof and theirs lie in the definitions of , and (they define as and we consider instead of ). However, it can be verified that these differences do not matter in the arguments of Lemmas 3 and 4 in (Agrawal & Goyal, 2013). In fact, we can prove the following lemma in a similar way to the proof of (Agrawal & Goyal, 2013, Lemma 3).
Lemma 30.
Proof.
Because the algorithm selects that maximizes , if for all , then we have . Therefore, we have
(6) |
By definitions of , and , on event , we have for all . Therefore, on , if , we have for all . Thus we obtain the following inequalities:
Here is the complement of and we used Lemmas 28, 29 in the last inequality. By inequality (6), we have our assertion. ∎
We can also prove the following lemma in a similar way to the proof of (Agrawal & Goyal, 2013, Lemma 4).
Lemma 31.
On event , we have
where and are universal constants.
Lemma 32.
The process is a super-martingale process w.r.t. the filtration .
Proof of Proposition 18.
By Lemma 32 and (for all ), applying Azuma-Hoeffding inequality, we see that there exists an event with such that on , the following inequality holds:
Since for any , on the event , we have
By inequalities (1), we have
Since , we obtain
Therefore, on the event , we have
Therefore, on event , we can upper bound the regret as follows:
Since , we have the assertion of the proposition. ∎
B.5 Proof of Theorem 13
Let $x_1, \dots, x_N$ and $v_1, \dots, v_N$ be the sequence of points and the Newton basis computed in Algorithm 2 with admissible error $\epsilon$, where $N = N(\epsilon)$.
We verify that the assumptions of the (stochastic) misspecified linear bandit problem hold, i.e., we show that there exists $\theta \in \mathbb{R}^N$ such that the following conditions are satisfied for $\varphi = (v_1, \dots, v_N)^\top: \mathcal{X} \to \mathbb{R}^N$ and the reward function $f$:
1. $\sup_{x \in \mathcal{X}} \|\varphi(x)\|_2 \le 1$.
2. If $a$ is an $\mathcal{X}$-valued random variable and is $\mathcal{G}$-measurable for a sigma field $\mathcal{G}$, then $\varphi(a)$ is $\mathcal{G}$-measurable.
3. $\|\theta\|_2 \le B$.
4. $\|\eta\|_\infty \le B\epsilon$, where $\eta = f - \sum_{i=1}^{N} \theta_i v_i$.
We put $X = \{x_1, \dots, x_N\}$. Then by definition, the Newton basis is a basis of $V(X)$. We define $\theta \in \mathbb{R}^N$ by $\theta_i = \langle f, v_i \rangle_{\mathcal{H}_k}$ and put $\hat{f} = \sum_{i=1}^{N} \theta_i v_i$. Since the Newton basis is an orthonormal basis of $V(X)$, we have

$$\|\theta\|_2 = \|\hat{f}\|_{\mathcal{H}_k} \le \|f\|_{\mathcal{H}_k} \le B.$$

By the orthonormality, we have $\sum_{i=1}^{N} v_i(x)^2 \le k(x, x)$ (c.f. Santin & Haasdonk (2017, Lemma 5)). Then by the normalization assumption, we have $\|\varphi(x)\|_2 \le 1$. Since $v_i$ for $1 \le i \le N$ is a linear combination of $k(\cdot, x_1), \dots, k(\cdot, x_N)$ and $k$ is continuous, $v_i$ is continuous. Therefore, $\varphi(a)$ is $\mathcal{G}$-measurable if $a$ is $\mathcal{G}$-measurable. By definition of the P-greedy algorithm, we have $\sup_{x \in \mathcal{X}} P_X(x) \le \epsilon$. By this inequality and the definition of the power function, the following inequality holds:

$$\|\eta\|_\infty = \sup_{x \in \mathcal{X}} |f(x) - \hat{f}(x)| \le \epsilon \|f\|_{\mathcal{H}_k} \le B\epsilon.$$
Thus, one can apply the results for the misspecified linear bandit problem with misspecification level $B\epsilon$. By applying Proposition 8, with probability at least $1 - \delta$, the regret is upper bounded as follows:

$$R_T = \tilde{O}\left(N\sqrt{T} + B\epsilon\sqrt{N}\,T\right).$$

Since computing the Newton basis requires $O(|\mathcal{A}| N^2)$ time and the total complexity of the modified LinUCB is $O(|\mathcal{A}| T N^2)$, we have the assertion of Theorem 13.
B.6 Proof of Theorem 17
For simplicity, by normalization, we assume $B = 1$. We denote by $R_T$ the cumulative regret that APG-EXP3 with admissible error $\epsilon$ and $\gamma = \eta N$ incurs up to time step $T$. We can reduce the adversarial RKHS bandit problem to the adversarial misspecified linear bandit problem as in §B.5. To apply Proposition 11, we need to prove that $\{\varphi(a)\}_{a \in \mathcal{A}}$ spans $\mathbb{R}^N$. We denote by $x_1, \dots, x_N$ the points returned by the $P$-greedy algorithm. Then, since $k(\cdot, x_1), \dots, k(\cdot, x_N)$ is a basis of $V(X)$ and the kernel matrix $(k(x_i, x_j))_{i,j}$ is positive definite, the matrix $(v_i(x_j))_{i,j}$ is non-singular. Therefore, $\{\varphi(a)\}_{a \in \mathcal{A}}$ spans $\mathbb{R}^N$.
By Proposition 11, we have
where the misspecification level is $\epsilon$ (by §B.5 with $B = 1$). Thus we have $\mathbb{E}[R_T] = O\left(\sqrt{NT\log|\mathcal{A}|} + \epsilon\sqrt{N}\,T\right)$. By taking $\epsilon = 1/\sqrt{T}$, we have the assertion of the theorem.
B.7 Proof of Theorem 14
First, we prove that the $P$-greedy algorithm (Algorithm 1) also gives a uniform approximation of the kernel.
Lemma 33.
Let $v_1, \dots, v_N$ be the Newton basis returned by the $P$-greedy algorithm (Algorithm 1) with error $\epsilon$ and let $X = \{x_1, \dots, x_N\}$ be the selected points. For $x, y \in \mathcal{X}$, we put $\tilde{k}(x, y) = \sum_{i=1}^{N} v_i(x) v_i(y)$. Then, we have

$$\sup_{x, y \in \mathcal{X}} |k(x, y) - \tilde{k}(x, y)| \le \epsilon^2.$$
Proof.
We denote by $X$ the points returned by the $P$-greedy algorithm. Then, by the definition of the power function, we have

$$|g(x) - \Pi_X g(x)| \le P_X(x)\,\|g\|_{\mathcal{H}_k} \le \epsilon\,\|g\|_{\mathcal{H}_k}$$

for any $g \in \mathcal{H}_k(\mathcal{X})$ and $x \in \mathcal{X}$. We take arbitrary $x, y \in \mathcal{X}$ and take $g = k(\cdot, y) - \Pi_X k(\cdot, y)$. Since $v_1, \dots, v_N$ is an orthonormal basis of $V(X)$, we have

$$k(x, y) - \tilde{k}(x, y) = \langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}_k} - \langle \Pi_X k(\cdot, x), \Pi_X k(\cdot, y) \rangle_{\mathcal{H}_k} = g(x).$$

Here, in the second equality, we used the reproducing property. Since $\|g\|_{\mathcal{H}_k} = P_X(y) \le \epsilon$, we obtain $|k(x, y) - \tilde{k}(x, y)| \le \epsilon^2$. Since $x$ and $y$ are arbitrary, we have our assertion. ∎
Next, we introduce the following classical result on matrix eigenvalues.
Lemma 34 (a special case of the Wielandt–Hoffman theorem, Hoffman et al. (1953)).
Let $A, B$ be symmetric matrices of size $n$. Denote by $\lambda_1 \ge \cdots \ge \lambda_n$ and $\mu_1 \ge \cdots \ge \mu_n$ the eigenvalues of $A$ and $B$, respectively. Then, we have $\sum_{i=1}^{n} (\lambda_i - \mu_i)^2 \le \|A - B\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm.
By these lemmas, we can prove that $\hat{\gamma}_{NT}$ is an approximation of the maximum information gain.
Lemma 35.
We apply APG-UCB with admissible error $\epsilon$ to the stochastic RKHS bandit problem; then the following inequality holds:

$$\hat{\gamma}_{NT} \le \gamma_T + \frac{T^{3/2}\epsilon^2}{2\lambda}.$$
Proof.
We define a matrix $\tilde{K}_T$ as $(\tilde{k}(a_i, a_j))_{1 \le i, j \le T}$ and let $K_T = (k(a_i, a_j))_{1 \le i, j \le T}$. Since $\|M\|_F^2 = \sum_{i,j} M_{ij}^2$ holds for any matrix $M$, we have $\|K_T - \tilde{K}_T\|_F \le T\epsilon^2$ by Lemma 33. We denote by $\lambda_1 \ge \cdots \ge \lambda_T$ the eigenvalues of $K_T$ and by $\mu_1 \ge \cdots \ge \mu_T$ those of $\tilde{K}_T$. Then by the Wielandt–Hoffman theorem (Lemma 34), we have

$$\sum_{i=1}^{T} (\lambda_i - \mu_i)^2 \le \|K_T - \tilde{K}_T\|_F^2 \le T^2\epsilon^4, \tag{7}$$

where the last inequality follows from Lemma 33. Thus, we have

$$\hat{\gamma}_{NT} - \frac{1}{2}\log\det\left(I_T + \lambda^{-1} K_T\right) = \frac{1}{2}\sum_{i=1}^{T}\log\frac{\lambda + \mu_i}{\lambda + \lambda_i} \le \frac{1}{2\lambda}\sum_{i=1}^{T}|\mu_i - \lambda_i| \le \frac{T^{3/2}\epsilon^2}{2\lambda}.$$

Here, in the second inequality, we used $\log(1 + x) \le x$, and in the third inequality, we used inequality (7) and the Cauchy–Schwarz inequality. Noting that $\frac{1}{2}\log\det(I_T + \lambda^{-1} K_T) \le \gamma_T$ (Chowdhury & Gopalan, 2017), we have our assertion. ∎
Proposition 36.
We assume that $\epsilon \le T^{-3/4}$, where $\epsilon$ is the parameter of APG-UCB, and put $N = N(\epsilon)$. We also assume that $\lambda \ge 1$ and $\|f\|_{\mathcal{H}_k} \le B$. Then with probability at least $1 - \delta$, the cumulative regret of APG-UCB is upper bounded by a function $F(T)$, where $F(T)$ is given as

$$F(T) = O\left(\sqrt{T\hat{\gamma}_{NT}}\left(B + \sqrt{\hat{\gamma}_{NT} + \log(1/\delta)}\right)\right), \tag{8}$$

where $\hat{\gamma}_{NT}$ is defined by $\hat{\gamma}_{NT} = \frac{1}{2}\log\det\left(I_N + \lambda^{-1}\sum_{t=1}^{T}\varphi(a_t)\varphi(a_t)^\top\right)$.
Remark 37.
We note that the cumulative regret of IGP-UCB is upper bounded by $O\left(\sqrt{T\gamma_T}\left(B + \sqrt{\gamma_T + \log(1/\delta)}\right)\right)$ by the proof in (Chowdhury & Gopalan, 2017).
Appendix C Supplement to the Experiments
C.1 Experimental Setting
For each reward function $f$, we add independent Gaussian noise of mean $0$ and standard deviation proportional to $\|f\|_\infty$. We use the $\infty$-norm because, even if we normalize $f$ so that $\|f\|_{\mathcal{H}_k} = B$, the values of the function can be small. As for the parameters of the kernels, we take the exponent parameter of the RQ kernel so that the condition required for positive definiteness is satisfied. We take the scale parameter proportional to $\sqrt{d}$ if the kernel is the RQ kernel and the SE kernel, respectively, because the diameter of the $d$-dimensional cube is $\sqrt{d}$. As for the parameters of the algorithms, we take the same $B$ and $R$ for both algorithms, where $B$ is computed from the reward functions used for the experiment. We take the admissible error $\epsilon$ for APG-UCB and the regularization parameter for IGP-UCB as suggested by the theory.
Since the exact value of the maximum information gain is not known, when computing the UCB for IGP-UCB, we modify IGP-UCB as follows. Using the notation of (Chowdhury & Gopalan, 2017), IGP-UCB selects an arm maximizing $\mu_{t-1}(x) + \beta_t \sigma_{t-1}(x)$, where $\beta_t = B + R\sqrt{2(\gamma_{t-1} + 1 + \ln(1/\delta))}$. Since the exact value of $\gamma_{t-1}$ is not known, we use $\frac{1}{2}\log\det(I_{t-1} + \lambda^{-1}K_{t-1})$ instead of $\gamma_{t-1}$. From their proof, it is easy to see that this modification of IGP-UCB has the same guarantee for the regret upper bound as that of IGP-UCB. In addition, by the matrix determinant lemma, one can update $\log\det(I_{t-1} + \lambda^{-1}K_{t-1})$ in $O(t^2)$ time at each round if the inverse of the regularized kernel matrix is known. To compute the inverse of the regularized kernel matrix, we used a Schur complement of the matrix.
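The incremental inverse update via the Schur complement can be sketched as follows; given the inverse of the current regularized kernel matrix, appending one arm costs $O(t^2)$ time.

```python
import numpy as np

def block_inverse_update(Minv, b, c):
    """Given Minv = (K + lam*I)^{-1}, return the inverse of the matrix enlarged
    by one column b and diagonal entry c, via the Schur complement."""
    u = Minv @ b
    s = c - b @ u                  # Schur complement (scalar, > 0 for PD input)
    t = Minv.shape[0]
    new = np.zeros((t + 1, t + 1))
    new[:t, :t] = Minv + np.outer(u, u) / s
    new[:t, t] = -u / s
    new[t, :t] = -u / s
    new[t, t] = 1.0 / s
    return new
```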
Computation was done on an Intel Xeon E5-2630 v4 processor with 128 GB RAM. We computed the UCB for each arm in parallel for both algorithms. For matrix-vector multiplication, we used the efficient implementation of the dot product provided in https://github.com/dimforge/nalgebra/blob/dev/src/base/blas.rs.
C.2 Additional Experimental Results
As shown in the main article and §B.7, the error $\epsilon$ balances the computational complexity and the cumulative regret, i.e., if $\epsilon$ is smaller, then the cumulative regret is smaller, but the computational complexity becomes larger. In this subsection, we provide additional experimental results obtained by changing $\epsilon$ with the other settings fixed. We also show results for more complicated reward functions, i.e., functions generated with a smaller scale parameter for the RQ kernels (the construction is otherwise the same) and for the SE kernels.
In Table 2, we show the number of points returned by the $P$-greedy algorithm for the RQ and SE kernels.
Table 2: Number of points returned by the $P$-greedy algorithm.

| | RQ | SE | RQ (complicated) | SE (complicated) |
|---|---|---|---|---|
| d = 1 | 18 | 15 | 23 | 25 |
| d = 2 | 105 | 108 | 188 | 283 |
| d = 3 | 376 | 457 | 725 | 994 |
In Figures 3, 4 and Tables 3, 4, we show the dependence on the parameter $\epsilon$. In these figures, we denote APG-UCB with parameter $\epsilon$ by APG-UCB($\epsilon$).
In Figures 5, 6 and Tables 5, 6, we also show the dependence on the parameter $\epsilon$ for the more complicated functions.
[Figures 3-6: normalized cumulative regret of APG-UCB($\epsilon$) for the RQ and SE kernels; see Tables 3-6 for the corresponding running times.]
Table 3: Total running time [s], RQ kernel.

| | APG-UCB(5e-2) | APG-UCB(1e-2) | APG-UCB(5e-3) |
|---|---|---|---|
| d = 1 (RQ) | 3.91e-01 | 4.06e-01 | 4.23e-01 |
| d = 2 (RQ) | 1.36e+00 | 2.39e+00 | 2.76e+00 |
| d = 3 (RQ) | 1.19e+01 | 2.40e+01 | 2.98e+01 |
Table 4: Total running time [s], SE kernel.

| | APG-UCB(5e-2) | APG-UCB(1e-2) | APG-UCB(5e-3) |
|---|---|---|---|
| d = 1 (SE) | 3.84e-01 | 4.04e-01 | 4.02e-01 |
| d = 2 (SE) | 1.69e+00 | 2.59e+00 | 2.89e+00 |
| d = 3 (SE) | 2.13e+01 | 3.51e+01 | 4.30e+01 |
Table 5: Total running time [s], RQ kernel with the more complicated reward functions.

| | APG-UCB(5e-2) | APG-UCB(1e-2) | APG-UCB(5e-3) |
|---|---|---|---|
| d = 1 (RQ) | 4.49e-01 | 4.84e-01 | 4.96e-01 |
| d = 2 (RQ) | 3.84e+00 | 6.01e+00 | 7.39e+00 |
| d = 3 (RQ) | 4.87e+01 | 8.76e+01 | 1.07e+02 |
Table 6: Total running time [s], SE kernel with the more complicated reward functions.

| | APG-UCB(5e-2) | APG-UCB(1e-2) | APG-UCB(5e-3) |
|---|---|---|---|
| d = 1 (SE) | 4.72e-01 | 4.88e-01 | 5.08e-01 |
| d = 2 (SE) | 9.59e+00 | 1.40e+01 | 1.61e+01 |
| d = 3 (SE) | 1.77e+02 | 2.02e+02 | 2.02e+02 |