
Batch Bayesian optimisation via density-ratio estimation with guarantees

Rafael Oliveira¹,²
rafael.oliveira@sydney.edu.au
Louis C. Tiao³
louis.tiao@sydney.edu.au
Fabio Ramos³,⁴
fabio.ramos@sydney.edu.au

¹Brain and Mind Centre, the University of Sydney, Australia
²ARC Training Centre in Data Analytics for Resources and Environments, Australia
³School of Computer Science, the University of Sydney, Australia
⁴NVIDIA, USA
Corresponding author.
Abstract

Bayesian optimisation (BO) algorithms have shown remarkable success in applications involving expensive black-box functions. Traditionally, BO has been framed as a sequential decision-making process which estimates the utility of query points via an acquisition function and a prior over functions, such as a Gaussian process. Recently, however, a reformulation of BO via density-ratio estimation (BORE) allowed reinterpreting the acquisition function as a probabilistic binary classifier, removing the need for an explicit prior over functions and increasing scalability. In this paper, we present a theoretical analysis of BORE's regret and an extension of the algorithm with improved uncertainty estimates. We also show that BORE can be naturally extended to a batch optimisation setting by recasting the problem as approximate Bayesian inference. The resulting algorithms come equipped with theoretical performance guarantees and are assessed against other batch and sequential BO baselines in a series of experiments.

1 Introduction

Bayesian optimisation (BO) algorithms provide flexible black-box optimisers for problems involving functions which are noisy or expensive to evaluate [1]. Typical BO approaches place a probabilistic model over the objective function which is updated with every new observation in a sequential decision-making process. Most methods are based on Gaussian process (GP) surrogates [2], which provide closed-form analytic expressions for the model's posterior distribution and allow for a number of theoretical performance guarantees [3, 4, 5]. However, GP surrogates have a number of limitations, such as not scaling easily to high-dimensional domains, having high computational complexity, and requiring a careful choice of covariance function and hyper-parameters [2]. Non-GP-based BO methods have also been proposed in the literature, such as methods based on neural network [6, 7] and random forest [8] regression models.

As an alternative to improving the model, Tiao et al. [9] focus on the acquisition function, which in BO frameworks guides the search by taking the model's predictions into account. They show that one can derive the acquisition function directly, without an explicit prior over functions, by reinterpreting the expected improvement [4, 1] via a density-ratio estimation problem. Under this perspective, the acquisition function can be derived as a classification model, which can be represented by flexible parametric models, such as deep neural networks, and efficiently trained via stochastic gradient descent. The resulting method, called Bayesian optimisation via density-ratio estimation (BORE), is then shown to outperform a variety of traditional GP-based and non-GP baselines.

Despite the significant performance gains, BORE has only been applied to a sequential setting, and not much is known about the method's theoretical guarantees. Batch BO methods have the potential to speed up optimisation in settings where multiple queries to the objective function can be evaluated simultaneously [10, 11, 12, 13]. Given BORE's flexibility to employ models which scale to large datasets, it is natural to ask whether it can be readily extended to the batch setting in a computationally efficient way.

In this paper, we extend the BORE framework to the batch setting and analyse its theoretical performance. To derive theoretical guarantees, we first show that the original BORE can be improved by accounting for uncertainty in the classifier’s predictions. We then propose a novel method, called BORE++, which uses an upper confidence bound over the classifier’s predictions as its acquisition function. The method comes equipped with guarantees in the probabilistic least-squares setting. We provide extensions for both BORE and BORE++ to the batch setting. Lastly, we present experimental results demonstrating the performance of the proposed algorithms in practical optimisation problems.

2 Background

We consider a global optimisation problem over a compact search space 𝒳d{\mathcal{{X}}}\subset\mathbb{{R}}^{d} of the form:

𝐱argmin𝐱𝒳f(𝐱),{\boldsymbol{\mathbf{x}}}^{*}\in\operatornamewithlimits{argmin}_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}f({\boldsymbol{\mathbf{x}}})\,, (1)

where f:𝒳f:{\mathcal{{X}}}\to\mathbb{{R}} is assumed to be a black-box objective function, i.e., we have no access to gradients nor analytic formulations of it. In addition, we are only allowed to run up to TT rounds of function evaluations, where we might collect single points or batches of observations yt:=f(𝐱t)+ϵty_{t}:=f({\boldsymbol{\mathbf{x}}}_{t})+\epsilon_{t}, which are corrupted by additive noise ϵt\epsilon_{t}, for t{1,,T}t\in\{1,\dots,T\}.

2.1 Bayesian optimisation

Bayesian optimisation (BO) algorithms approach the problem in Equation 1 via sequential decision making [1]. At each iteration, BO selects a query point by maximising an acquisition function aa:

𝐱targmax𝐱𝒳a(𝐱|𝒟t1){\boldsymbol{\mathbf{x}}}_{t}\in\operatornamewithlimits{argmax}_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}a({\boldsymbol{\mathbf{x}}}|{\mathcal{{D}}}_{t-1}) (2)

The acquisition function encodes the information provided by the observations collected so far, 𝒟_{t−1} := {(𝐱_i, y_i)}_{i=1}^{t−1}, using a probabilistic model over f, typically a Gaussian process (GP) [2], conditioned on the data. After collecting an observation y_t, the dataset is updated with the new query-observation pair, 𝒟_t := 𝒟_{t−1} ∪ {(𝐱_t, y_t)}. This process repeats for a given number of iterations T.

2.2 Bayesian optimisation via density-ratio estimation (BORE)

The expected improvement (EI) [4, 14] is a popular acquisition function in the BO literature and the basis for many BO algorithms. At each iteration t ≥ 1, one can define τ := min_{i<t} y_i as an incumbent target. EI is then defined as:

aEI(𝐱|𝒟t1):=𝔼[max{0,τf(𝐱)}|𝒟t1].a_{\operatorname{EI}}({\boldsymbol{\mathbf{x}}}|{\mathcal{{D}}}_{t-1}):=\mathbb{E}[\max\left\{0,\tau-f({\boldsymbol{\mathbf{x}}})\right\}|{\mathcal{{D}}}_{t-1}]~{}. (3)

In the case of a GP model, where f | 𝒟_{t−1} ∼ 𝒢𝒫(μ_{t−1}, k_{t−1}), the EI is available in closed form as a function of the GP posterior. However, the EI may also be reformulated without the need for an explicit prior.

Under mild assumptions, Bergstra et al. [15] showed that the EI can be formulated as a density ratio between two probability distributions. Let (𝐱):=p(𝐱|yτ)\ell({\boldsymbol{\mathbf{x}}}):=p({\boldsymbol{\mathbf{x}}}|y\leq\tau) represent the probability density over 𝐱𝒳{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}} conditioned on the observation yy being below a threshold τ\tau\in\mathbb{{R}}. Conversely, let g(𝐱):=p(𝐱|y>τ)g({\boldsymbol{\mathbf{x}}}):=p({\boldsymbol{\mathbf{x}}}|y>\tau). For γ[0,1]\gamma\in[0,1], the γ\gamma-relative density ratio between these two densities is:

ρ_γ(𝐱) := ℓ(𝐱) / (γ ℓ(𝐱) + (1−γ) g(𝐱)),  𝐱 ∈ 𝒳, (4)

noting that γ = 0 recovers the ordinary density-ratio definition, ρ_0(𝐱) = ℓ(𝐱)/g(𝐱). Now choose τ := Φ^{−1}(γ), where Φ(s) := p(y ≤ s), s ∈ ℝ, denotes the cumulative distribution function of the marginal distribution of observations (note that p(y ≤ s) = ∫_𝒳 p(y ≤ s|𝐱) p(𝐱) d𝐱, where we may assume p(𝐱) uniform). Replacing this τ in Equation 3, Bergstra et al. [15] have shown that a_EI(𝐱) ∝ ρ_γ(𝐱) for 𝐱 ∈ 𝒳, under the mild additional assumption that p(𝐱|y) ≈ ℓ(𝐱) for all y ≤ τ [15, 9]. Based on this fact, Tiao et al. [9] showed:

a_EI(𝐱) ∝ ρ_γ(𝐱) = γ^{−1} π(𝐱),  𝐱 ∈ 𝒳, (5)

where π(𝐱):=p(yτ|𝐱)\pi({\boldsymbol{\mathbf{x}}}):=p(y\leq\tau|{\boldsymbol{\mathbf{x}}}) can be approximated by a probabilistic classifier trained with a proper scoring rule, such as the binary cross-entropy loss:

ℒ_t[π] := −∑_{i=1}^{t} [ z_i log π(𝐱_i) + (1−z_i) log(1 − π(𝐱_i)) ]. (6)

Other examples of proper scoring rules include the least-squares loss, which leads to probabilistic least-squares classifiers [16], and the zero-one loss. We refer the reader to Gneiting and Raftery [17] for a review and theoretical analysis on this topic.

for t ∈ {1, …, T} do
    τ := Φ̂_{t−1}^{−1}(γ)
    z_i := 𝕀[y_i ≤ τ],  i ∈ {1, …, t−1}
    𝒟̃_{t−1} := {(𝐱_i, z_i)}_{i=1}^{t−1}
    π̂_{t−1} ∈ argmin_π ℒ[π | 𝒟̃_{t−1}]
    𝐱_t ∈ argmax_{𝐱∈𝒳} π̂_{t−1}(𝐱)
    y_t := f(𝐱_t) + ϵ_t
end for
Algorithm 1 BORE

BORE is summarised in Algorithm 1. As seen, the CDF of the marginal observations distribution, Φ(s) := p(y ≤ s), is replaced by the empirical approximation Φ̂_t(s) := (1/t) ∑_{i=1}^{t} 𝕀[y_i ≤ s] and its corresponding quantile function Φ̂_t^{−1}. At each iteration, observations are labelled according to the estimated γth quantile τ, and a classifier π̂_{t−1} is trained by minimising the loss ℒ[π | 𝒟̃_{t−1}] over the labelled data 𝒟̃_{t−1}. A query point 𝐱_t is chosen by maximising the classifier's probabilities, which in our case corresponds to maximising the expected improvement. A new observation is collected, and the algorithm continues running up to a given number of iterations T. As demonstrated, no explicit probabilistic model for f is needed, only a classifier, which can be efficiently trained via, e.g., stochastic gradient descent.
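To make the loop concrete, the following is a minimal Python sketch of Algorithm 1 (not the authors' implementation): it assumes a hypothetical objective f over a box-bounded domain, uses scikit-learn's MLPClassifier as the probabilistic classifier, and replaces the inner maximisation of π̂ with a simple random-candidate search.

import numpy as np
from sklearn.neural_network import MLPClassifier

def bore(f, bounds, T=50, gamma=0.25, n_init=5, n_cand=1000, seed=0):
    """Minimal BORE loop (Algorithm 1): label observations by the empirical
    gamma-quantile, fit a probabilistic classifier on the labels, and query
    the point maximising the predicted class-1 probability."""
    rng = np.random.default_rng(seed)
    d = bounds.shape[0]
    X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_init, d))
    y = np.array([f(x) for x in X])
    for _ in range(T):
        tau = np.quantile(y, gamma)              # tau := empirical gamma-quantile
        z = (y <= tau).astype(int)               # z_i := I[y_i <= tau]
        clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=1000)
        clf.fit(X, z)                            # minimises binary cross-entropy
        # maximise pi-hat over random candidates (a simple stand-in for a
        # continuous optimiser over the search space)
        cand = rng.uniform(bounds[:, 0], bounds[:, 1], size=(n_cand, d))
        x_next = cand[np.argmax(clf.predict_proba(cand)[:, 1])]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    return X[np.argmin(y)], y.min()

For example, bore(lambda x: float(np.sum(x**2)), np.array([[-1.0, 1.0]] * 3)) minimises a three-dimensional quadratic.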

3 Analysis of the BORE framework

In this section, we analyse limitations of the BORE framework in modelling uncertainty and their effects on the algorithm's performance. As presented in Section 2.2, at each iteration t ≥ 1, the original BORE framework trains a probabilistic classifier π̂_t(𝐱) to approximate p(y ≤ τ|𝐱), where τ denotes the γth quantile of the marginal observations distribution, i.e., p(y ≤ τ) = γ. This approach leads to a maximum likelihood estimate for the classifier π̂, which may not properly account for the uncertainty in the classifier's approximation.

Since BORE is based on probabilistic classifiers, instead of the regression models used in traditional BO frameworks [1], a natural first question is whether a classifier can guide the search to the global optimum of the objective function. The following lemma answers this question and is the basis for our analysis.

Lemma 1.

Let f:𝒳f:{\mathcal{{X}}}\to\mathbb{{R}} be a continuous function over a compact space 𝒳{\mathcal{{X}}}. Assume that, for any 𝐱𝒳{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}, we observe y=f(𝐱)+ϵy=f({\boldsymbol{\mathbf{x}}})+\epsilon, where ϵ\epsilon is i.i.d. noise with a strictly monotonic cumulative distribution function Φϵ:[0,1]\Phi_{\epsilon}:\mathbb{{R}}\to[0,1]. Then, for any τ\tau\in\mathbb{{R}}, we have:

argmax𝐱𝒳p(yτ|𝐱,f)=argmin𝐱𝒳f(𝐱).\operatornamewithlimits{argmax}_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}p(y\leq\tau|{\boldsymbol{\mathbf{x}}},f)=\operatornamewithlimits{argmin}_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}f({\boldsymbol{\mathbf{x}}})\,. (7)
Proof.

As the observation-noise CDF is strictly increasing, by basic properties of the argmax, we have:

argmax𝐱𝒳p(yτ|𝐱,f)=argmax𝐱𝒳Φϵ(τf(𝐱))=argmin𝐱𝒳f(𝐱),\begin{split}\operatornamewithlimits{argmax}_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}p(y\leq\tau|{\boldsymbol{\mathbf{x}}},f)=\operatornamewithlimits{argmax}_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}\Phi_{\epsilon}(\tau-f({\boldsymbol{\mathbf{x}}}))=\operatornamewithlimits{argmin}_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}f({\boldsymbol{\mathbf{x}}})\,,\end{split} (8)

which concludes the proof. ∎

According to this lemma, maximising class probabilities is equivalent to optimising the objective function when the classifier is optimal, i.e., when it has perfect knowledge of f. This result holds for any given threshold τ ∈ ℝ. We only make a mild assumption on the CDF of the observation noise Φ_ϵ, which is satisfied by any probability distribution whose support covers the real line (e.g., Gaussian, Student-t, Cauchy). This result could also be extended to distributions with bounded support, as long as their CDF is strictly monotonic within it; we keep the support as ℝ for simplicity and leave the extension for future work.

To analyse BORE’s optimisation performance, we will aim to bound the algorithm’s instant regret:

rt:=f(𝐱t)f(𝐱),t1,r_{t}:=f({\boldsymbol{\mathbf{x}}}_{t})-f({\boldsymbol{\mathbf{x}}}^{*}),\quad t\geq 1, (9)

and its cumulative version RT:=t=1Trt{{R}}_{T}:=\sum_{t=1}^{T}r_{t}. Sub-linear bounds on RT{{R}}_{T} lead to a no-regret algorithm, since limTRTT=0\lim_{T\to\infty}\frac{{{R}}_{T}}{T}=0 and mintTrtRTT\min_{t\leq T}r_{t}\leq\frac{{{R}}_{T}}{T}.

Assuming there is an optimal classifier π*: 𝒳 → [0,1] such that π*(𝐱) = p(y ≤ τ|𝐱, f), for a given τ ∈ ℝ, we can directly relate the classifier's probabilities to the values of the objective function f, since:

π(𝐱)=p(yτ|𝐱,f)=Φϵ(τf(𝐱))f(𝐱)=τΦϵ1(π(𝐱)).\begin{split}\pi^{*}({\boldsymbol{\mathbf{x}}})=p(y\leq\tau|{\boldsymbol{\mathbf{x}}},f)=\Phi_{\epsilon}(\tau-f({\boldsymbol{\mathbf{x}}}))\quad\therefore\quad f({\boldsymbol{\mathbf{x}}})=\tau-\Phi_{\epsilon}^{-1}(\pi^{*}({\boldsymbol{\mathbf{x}}}))\,.\end{split} (10)

The existence of the inverse Φ_ϵ^{−1} is ensured by the strict monotonicity assumption on Φ_ϵ in Lemma 1. Under this observation, the algorithm's regret at any iteration t ≥ 1 can be bounded in terms of classifier probabilities:

rt=f(𝐱t)f(𝐱)=Φϵ1(π(𝐱))Φϵ1(π(𝐱t))Lϵ(π(𝐱)π(𝐱t)),\begin{split}r_{t}&=f({\boldsymbol{\mathbf{x}}}_{t})-f({\boldsymbol{\mathbf{x}}}^{*})=\Phi_{\epsilon}^{-1}(\pi^{*}({\boldsymbol{\mathbf{x}}}^{*}))-\Phi_{\epsilon}^{-1}(\pi^{*}({\boldsymbol{\mathbf{x}}}_{t}))\leq L_{\epsilon}(\pi^{*}({\boldsymbol{\mathbf{x}}}^{*})-\pi^{*}({\boldsymbol{\mathbf{x}}}_{t}))\,,\end{split} (11)

where L_ϵ is any Lipschitz constant for Φ_ϵ^{−1} on the range of π*, which exists since 𝒳 is compact and f is continuous, so that π*(𝒳) is a compact subset of (0,1). Therefore, we should be able to bound BORE's regret by analysing the approximation error of π̂_t at each iteration t ≥ 1.
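To make Equation 10 and Lemma 1 concrete, here is a small numerical check assuming Gaussian observation noise (the objective and noise scale below are illustrative choices, not the paper's settings):

import numpy as np
from scipy.stats import norm

sigma_noise = 0.5                      # epsilon ~ N(0, sigma_noise^2)
tau = 0.0                              # an arbitrary fixed threshold
f = lambda x: np.sin(3 * x) + x ** 2   # an illustrative objective

x = np.linspace(-1.0, 1.0, 201)
pi_star = norm.cdf(tau - f(x), scale=sigma_noise)    # pi*(x) = Phi_eps(tau - f(x))
f_back = tau - norm.ppf(pi_star, scale=sigma_noise)  # f(x) = tau - Phi_eps^{-1}(pi*(x))
assert np.allclose(f_back, f(x))                     # Equation 10 holds exactly
assert np.argmax(pi_star) == np.argmin(f(x))         # Lemma 1: argmax pi* = argmin f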

Although approximation guarantees for classification algorithms under i.i.d. data settings are well known [18], each observation in BORE depends on the previous ones via the acquisition function. This process is also not necessarily stationary, so that we cannot apply known results for classifiers under stationary processes [19]. In the next section, we consider a particular setting for learning a classifier which allows us to bound the prediction error under BORE’s data-generating process.

3.1 Probabilistic least-squares classifiers

We consider the case of probabilistic least-squares (PLS) classifiers [20, 21]. In particular, we model a probabilistic classifier π: 𝒳 → [0,1] as an element of a reproducing kernel Hilbert space (RKHS) ℋ associated with a positive-definite kernel k: 𝒳 × 𝒳 → ℝ. An RKHS is a space of functions equipped with an inner product ⟨·,·⟩_k and norm ‖·‖_k := √⟨·,·⟩_k [22]. For the purposes of this analysis, we will also assume that k(𝐱,𝐱) ≤ 1 for all 𝐱 ∈ 𝒳, which can always be satisfied by proper scaling. This setting allows for both linear and non-parametric models. Gaussian assumptions on the function space would lead us to GP-based PLS classifiers [2], but we are not restricted to Gaussianity in our analysis. If the kernel k is universal, then, since Φ_ϵ is injective, the RKHS assumption allows for modelling any continuous objective function.

For a given τ\tau\in\mathbb{{R}}, a PLS classifier is obtained by minimising the regularised squared-error loss:

π^targminπi=1t(ziπ(𝐱i))2+λπk2,t1,\hat{\pi}_{t}\in\operatornamewithlimits{argmin}_{\pi\in{\mathcal{{H}}}}\sum_{i=1}^{t}(z_{i}-\pi({\boldsymbol{\mathbf{x}}}_{i}))^{2}+\lambda\lVert\pi\rVert_{k}^{2}\,,\quad t\geq 1, (12)

where λ>0\lambda>0 is a given regularisation factor and zi:=𝕀[yiτ]{0,1}z_{i}:=\mathbb{I}[y_{i}\leq\tau]\in\{0,1\}. In the RKHS case, the solution to the problem above is available in closed form [23, 16] as:

π^t(𝐱)=𝐤t(𝐱)𝖳(𝐊t+λ𝐈)1𝐳t,𝐱𝒳,t1,\hat{\pi}_{t}({\boldsymbol{\mathbf{x}}})={\boldsymbol{\mathbf{k}}}_{t}({\boldsymbol{\mathbf{x}}})^{\mathsf{T}}({\boldsymbol{\mathbf{{K}}}}_{t}+\lambda{\boldsymbol{\mathbf{{I}}}})^{-1}{\boldsymbol{\mathbf{z}}}_{t}\,,\quad{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}},\,t\geq 1, (13)

where 𝐤t(𝐱):=[k(𝐱,𝐱1),,k(𝐱,𝐱t)]𝖳t{\boldsymbol{\mathbf{k}}}_{t}({\boldsymbol{\mathbf{x}}}):=[k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}_{1}),\dots,k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}}_{t})]^{\mathsf{T}}\in\mathbb{{R}}^{t}, 𝐊t:=[k(𝐱i,𝐱j)]i,j=1tt×t{\boldsymbol{\mathbf{{K}}}}_{t}:=[k({\boldsymbol{\mathbf{x}}}_{i},{\boldsymbol{\mathbf{x}}}_{j})]_{i,j=1}^{t}\in\mathbb{{R}}^{t\times t} and 𝐳t:=[z1,,zt]𝖳t{\boldsymbol{\mathbf{z}}}_{t}:=[z_{1},\dots,z_{t}]^{\mathsf{T}}\in\mathbb{{R}}^{t}. This PLS approximation may not yield a valid classifier, since it is possible that π^t(𝐱)[0,1]\hat{\pi}_{t}({\boldsymbol{\mathbf{x}}})\notin[0,1] for some 𝐱𝒳{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}. However, it allows us to place a confidence interval on the optimal classifier’s prediction, as presented in the following theorem, which is based on theoretical results from the online learning literature [24, 25]. Our proofs can be found in the supplement.

Theorem 1.

Given τ ∈ ℝ, assume π(𝐱) := Φ_ϵ(τ − f(𝐱)) is such that π ∈ ℋ, with ‖π‖_k ≤ b. Let {𝐱_t}_{t=1}^∞ be an 𝒳-valued discrete-time stochastic process predictable with respect to the filtration {𝔉_t}_{t=0}^∞. Let {z_t}_{t=1}^∞ be a real-valued stochastic process such that ν_t := z_t − π(𝐱_t) is 1-sub-Gaussian conditionally on 𝔉_{t−1}, for all t ≥ 1. Then, for any δ ∈ (0,1), with probability at least 1−δ, we have that:

𝐱𝒳,|π(𝐱)π^t(𝐱)|βt(δ)σt(𝐱),t1,\forall{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}},\quad|\pi({\boldsymbol{\mathbf{x}}})-\hat{\pi}_{t}({\boldsymbol{\mathbf{x}}})|\leq\beta_{t}(\delta)\sigma_{t}({\boldsymbol{\mathbf{x}}}),\quad\forall t\geq 1\,, (14)

where βt(δ):=b+2λ1log(|𝐈+λ1𝐊t|1/2/δ){\beta_{t}(\delta):=b+\sqrt{2\lambda^{-1}\log(|{\boldsymbol{\mathbf{{I}}}}+\lambda^{-1}{\boldsymbol{\mathbf{{K}}}}_{t}|^{1/2}/\delta)}}, with |𝐀||{\boldsymbol{\mathbf{{A}}}}| denoting the determinant of matrix 𝐀{\boldsymbol{\mathbf{{A}}}}, and σt2(𝐱):=k(𝐱,𝐱)𝐤t(𝐱)𝖳(𝐊t+λ𝐈)1𝐤t(𝐱),𝐱𝒳,t1.\sigma_{t}^{2}({\boldsymbol{\mathbf{x}}}):=k({\boldsymbol{\mathbf{x}}},{\boldsymbol{\mathbf{x}}})-{\boldsymbol{\mathbf{k}}}_{t}({\boldsymbol{\mathbf{x}}})^{\mathsf{T}}({\boldsymbol{\mathbf{{K}}}}_{t}+\lambda{\boldsymbol{\mathbf{{I}}}})^{-1}{\boldsymbol{\mathbf{k}}}_{t}({\boldsymbol{\mathbf{x}}})\,,\quad{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}},\quad t\geq 1.
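For illustration, the estimator in Equation 13 together with the confidence width σ_t and scaling β_t(δ) of Theorem 1 can be computed in a few lines of NumPy (a sketch; the squared-exponential kernel and the values of λ, b and δ are illustrative assumptions):

import numpy as np

def se_kernel(A, B, ls=0.2):
    """Squared-exponential kernel, satisfying k(x, x) = 1."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def pls_classifier(X, z, X_query, lam=1.0, b=1.0, delta=0.05):
    """PLS prediction pi_hat (Eq. 13) with the width sigma_t and the
    scaling beta_t(delta) from Theorem 1."""
    n = len(X)
    K = se_kernel(X, X)                                   # K_t
    A = np.linalg.inv(K + lam * np.eye(n))                # (K_t + lam I)^{-1}
    k_q = se_kernel(X_query, X)                           # rows are k_t(x)^T
    pi_hat = k_q @ A @ z                                  # Eq. 13
    sigma2 = 1.0 - np.einsum('ij,jk,ik->i', k_q, A, k_q)  # k(x,x) - k_t(x)^T A k_t(x)
    sigma = np.sqrt(np.maximum(sigma2, 0.0))
    _, logdet = np.linalg.slogdet(np.eye(n) + K / lam)    # log|I + K_t/lam|
    beta = b + np.sqrt((2.0 / lam) * (0.5 * logdet + np.log(1.0 / delta)))
    return pi_hat, sigma, beta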

3.2 Regret analysis for BORE

We now consider BORE with a PLS classifier. For this analysis, we will assume an ideal setting where τ is fixed, possibly corresponding to the true γth quantile of the observations distribution. However, our results hold for any choice of τ ∈ ℝ and can therefore be assumed to approximately hold for a varying τ which is converging to a fixed value. In this setting, the algorithm's choices are given by:

𝐱targmax𝐱𝒳π^t1(𝐱),{\boldsymbol{\mathbf{x}}}_{t}\in\operatornamewithlimits{argmax}_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}\hat{\pi}_{t-1}({\boldsymbol{\mathbf{x}}})\,, (15)

where π̂_{t−1} is given by the estimator in Equation 13. We can then apply Theorem 1 to the classifier-based regret bound in Equation 11 to obtain a regret bound. For this result, we will also need the following quantity:

ξN:=max{𝐱i}i=1N𝒳12log|𝐈+λ1𝐊N|,N1,\xi_{N}:=\max_{\{{\boldsymbol{\mathbf{x}}}_{i}\}_{i=1}^{N}\subset{\mathcal{{X}}}}\frac{1}{2}\log|{\boldsymbol{\mathbf{{I}}}}+\lambda^{-1}{\boldsymbol{\mathbf{{K}}}}_{N}|\,,\quad N\geq 1\,, (16)

where the maximisation is taken over discrete sets of N locations {𝐱_i}_{i=1}^N ⊂ 𝒳, and 𝐊_N := [k(𝐱_i, 𝐱_j)]_{i,j=1}^N. This quantity corresponds to the maximum information gain of a Gaussian process model after N observations. We are now ready to state our theoretical result regarding BORE's regret.

Theorem 2.

Under the conditions in Theorem 1, with probability at least 1−δ, for δ ∈ (0,1), the instant regret of the BORE algorithm with a PLS classifier is bounded, for every t ∈ {1, …, T}, by:

rtLϵβt1(δ)(σt1(𝐱t)+σt1(𝐱)),r_{t}\leq L_{\epsilon}\beta_{t-1}(\delta)(\sigma_{t-1}({\boldsymbol{\mathbf{x}}}_{t})+\sigma_{t-1}({\boldsymbol{\mathbf{x}}}^{*})), (17)

and the cumulative regret by:

RTLϵβT(δ)(4(T+2)ξT+t=1Tσt1(𝐱)).{{R}}_{T}\leq L_{\epsilon}\beta_{T}(\delta)\left(\sqrt{4(T+2)\xi_{T}}+\sum_{t=1}^{T}\sigma_{t-1}({\boldsymbol{\mathbf{x}}}^{*})\right)\,. (18)

As Theorem 2 shows, the regret of the BORE algorithm in the PLS setting comprises two components. The first term is related to the regret of a GP-UCB algorithm [see 26, Thm. 3] and is known to grow sub-linearly for a few popular kernels, such as the squared exponential and the Matérn class [3, 27]. The second term, however, reflects the uncertainty of the algorithm around the optimum location 𝐱*. If the algorithm never samples at that location, this second summation may grow linearly, which will not lead to a vanishing regret. In fact, if we consider Equation 13 and an RKHS with a translation-invariant kernel, we see that, as soon as an observation z_t = 1 is collected at a location 𝐱_t ≠ 𝐱*, that location will constitute the maximum of the classifier output. The algorithm would then keep returning to that same location, missing opportunities to sample at 𝐱*.

It is worth noting that Theorem 2 reflects the regret of BORE in an idealised setting where the algorithm uses the optimal PLS estimator in the function space ℋ. However, if we train a parametric classifier, such as a neural network, via gradient descent, the behaviour will not necessarily be the same, and the algorithm might still achieve good performance. In the original BORE paper, for instance, a parametric classifier trained by minimising the binary cross-entropy loss [9] leads to successful performance in experiments. Neural network models trained via stochastic gradient descent are known to provide approximate samples from a posterior distribution [28, 29], instead of an optimal best-fit predictor, which might make BORE behave like Thompson sampling [30] (see discussion in the appendix). Nevertheless, Theorem 2 still shows us that BORE may get stuck in local optima, which is not ideal for BO methods. In the next section, we present an extension of the BORE framework which addresses this shortcoming.

4 BORE++: improved uncertainty estimates

As discussed in the previous section, the lack of uncertainty quantification in the estimation of the classifier for the original BORE might lead to sub-optimal performance. To address this shortcoming, we present an approach for uncertainty quantification in the BORE framework which leads to improvements in performance and theoretical optimality guarantees. Our approach is based on using an upper confidence bound (UCB) on the predicted class probabilities as the acquisition function for BORE. Due to its improved uncertainty estimates, we call this approach BORE++.

4.1 Class-probability upper confidence bounds

We propose replacing π̂_t in Algorithm 1 with an upper confidence bound π_{t,δ} which is such that:

t1,π(𝐱)πt,δ(𝐱),𝐱𝒳\forall t\geq 1,\quad\pi^{*}({\boldsymbol{\mathbf{x}}})\leq\pi_{t,\delta}({\boldsymbol{\mathbf{x}}}),\quad\forall{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}} (19)

which holds with probability greater than 1−δ, for a given δ ∈ (0,1). Therefore, π_{t,δ}(𝐱) represents an upper quantile over the optimal class probability π*(𝐱). BORE++ then selects 𝐱_t ∈ argmax_{𝐱∈𝒳} π_{t−1,δ}(𝐱).

To derive an upper confidence bound on a classifier’s predictions π(𝐱)\pi({\boldsymbol{\mathbf{x}}}), we can take a few different approaches. For a parametric model π𝜽\pi_{\boldsymbol{\mathbf{\theta}}}, a Bayesian model updating the posterior p(𝜽|𝒟t)p({\boldsymbol{\mathbf{\theta}}}|{\mathcal{{D}}}_{t}) leads to a corresponding predictive distribution over π𝜽(𝐱)\pi_{\boldsymbol{\mathbf{\theta}}}({\boldsymbol{\mathbf{x}}}). This is the case of ensemble models [31], for instance, where we approximate predictions p(yτt|𝐱,𝒟t)1Mi=1Mπ𝜽i(𝐱)p(y\leq\tau_{t}|{\boldsymbol{\mathbf{x}}},{\mathcal{{D}}}_{t})\approx\frac{1}{M}\sum_{i=1}^{M}\pi_{{\boldsymbol{\mathbf{\theta}}}^{i}}({\boldsymbol{\mathbf{x}}}) with 𝜽ip(𝜽|𝒟t){\boldsymbol{\mathbf{\theta}}}^{i}\sim p({\boldsymbol{\mathbf{\theta}}}|{\mathcal{{D}}}_{t}). Instead of using the expected class probability, however, BORE++ uses an (empirical) quantile approximation for πt,δ\pi_{t,\delta} to ensure Equation 19 holds. Bayesian neural networks [32], random forests [33], dropout methods, etc. [34], also constitute valid approaches for predictive uncertainty estimation. An alternative approach is to place a non-parametric prior over π\pi^{*}, such as a Gaussian process model [2], which allows for the modelling of uncertainty directly in the function space where π\pi^{*} lies. In the next section, we present a concrete derivation of BORE++ for the PLS classifier setting which takes the non-parametric perspective and allows us to derive theoretical performance guarantees.
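As an illustration of the ensemble route discussed above (a hedged sketch, not the paper's exact construction), one can train M classifiers on bootstrap resamples and use an empirical upper quantile of their predictions as π_{t,δ}:

import numpy as np
from sklearn.neural_network import MLPClassifier

def ensemble_ucb(X, z, X_query, M=10, delta=0.1, seed=0):
    """Empirical (1 - delta) upper quantile over an ensemble of classifiers,
    serving as the acquisition pi_{t,delta}(x). Assumes each bootstrap
    resample contains both classes."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, len(X), size=len(X))       # bootstrap resample
        clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000)
        clf.fit(X[idx], z[idx])
        preds.append(clf.predict_proba(X_query)[:, 1])
    return np.quantile(np.stack(preds), 1.0 - delta, axis=0)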

4.2 BORE++ with PLS classifiers

In the PLS setting, the result in Theorem 1 gives us a closed-form expression for a classifier upper confidence bound satisfying the condition in Equation 19. Given δ(0,1)\delta\in(0,1), we set:

πt,δ(𝐱):=min(1,max(0,π^t(𝐱)+βt(δ)σt(𝐱)))[0,1],𝐱𝒳,\pi_{t,\delta}({\boldsymbol{\mathbf{x}}}):=\min(1,\max(0,\hat{\pi}_{t}({\boldsymbol{\mathbf{x}}})+\beta_{t}(\delta)\sigma_{t}({\boldsymbol{\mathbf{x}}})))\in[0,1]\,,\quad{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}\,, (20)

where σt\sigma_{t} and βt\beta_{t} are set according to Theorem 1. We then obtain the following result for BORE++.
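Reusing the pls_classifier sketch from Section 3.1, the acquisition in Equation 20 and its maximisation then reduce to a few lines (the candidate-set search below is an illustrative simplification of the continuous maximisation):

import numpy as np

def bore_pp_step(X, z, X_cand, lam=1.0, b=1.0, delta=0.05):
    """BORE++ query selection: maximise the clipped UCB of Equation 20."""
    pi_hat, sigma, beta = pls_classifier(X, z, X_cand, lam, b, delta)
    ucb = np.clip(pi_hat + beta * sigma, 0.0, 1.0)       # pi_{t,delta}
    return X_cand[np.argmax(ucb)]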

Theorem 3.

Under the assumptions in Theorem 1, running the BORE++ algorithm with a PLS classifier πt,δ\pi_{t,\delta} as defined above yields, with probability at least 1δ1-\delta, an instant regret bound of:

r_t ≤ 2 L_ϵ β_{t−1}(δ) σ_{t−1}(𝐱_t),  ∀t ≥ 1, (21)

and a cumulative regret bound after T1T\geq 1 iterations:

RT4LϵβT(δ)(T+2)ξT𝒪(T(bξT+ξT)).\begin{split}{{R}}_{T}&\leq 4L_{\epsilon}\beta_{T}(\delta)\sqrt{(T+2)\xi_{T}}\in{\mathcal{{O}}}\left(\sqrt{T}(b\sqrt{\xi_{T}}+\xi_{T})\right).\end{split} (22)

According to Theorem 3, the regret of BORE++ vanishes if the maximum information gain ξ_T grows sub-linearly, since lim_{T→∞} R_T/T = 0 and min_{t≤T} r_t ≤ R_T/T. Sub-linear growth is known to be achieved for popular kernels, such as the squared exponential, the Matérn family and linear kernels [3, 27]. This result also tells us that, theoretically, BORE++ performs no worse than GP-UCB, since they share similar regret bounds [3, 26]. In practice, however, the BORE++ framework offers a series of advantages over GP-UCB, such as not needing an explicit surrogate model, and a classifier which does not need to be a GP and can therefore be more flexible and scalable to high-dimensional problems and large amounts of data. The connection with GP-UCB also brings new insights into how the density-ratio BO algorithm can still share some of the well-known guarantees of traditional BO methods.

5 Batch BORE

This section proposes an extension of the BORE framework which allows for multiple queries to the objective function to be performed in parallel. Although many methods for batch BO have been previously proposed in the literature, we here focus on approaching batch optimisation as an approximate Bayesian inference problem. Instead of having to derive complex heuristics to approximate the utility of a batch of query points, we can view points in a batch as samples from a posterior probability distribution which uses the acquisition function as a likelihood.

5.1 BORE batches via approximate inference

Applying an optimisation-as-inference perspective to BORE, we can formulate a batch BO algorithm which does not require an explicit regression model for f. The classifier π̂(𝐱) ≈ p(y ≤ τ|𝐱) naturally serves as a likelihood function over query locations 𝐱 ∈ 𝒳. Since the search space 𝒳 is compact, we can assume a uniform prior distribution p(𝐱) ∝ 1. Also note that the normalisation constant in this case is simply ∫_𝒳 p(y ≤ τ|𝐱) p(𝐱) d𝐱 = p(y ≤ τ) = γ. Our posterior distribution then becomes:

(𝐱)=p(𝐱|yτ)=p(yτ|𝐱)p(𝐱)p(yτ).\ell({\boldsymbol{\mathbf{x}}})=p({\boldsymbol{\mathbf{x}}}|y\leq\tau)=\frac{p(y\leq\tau|{\boldsymbol{\mathbf{x}}})p({\boldsymbol{\mathbf{x}}})}{p(y\leq\tau)}\,. (23)

Therefore, we formulate a batch version of BORE as an inference problem aiming for:

qargminq𝒫DKL(q||),q^{*}\in\operatornamewithlimits{argmin}_{q\in{\mathcal{{P}}}}D_{\mathrm{KL}}(q||\ell), (24)

where D_KL(q‖ℓ) denotes the Kullback-Leibler (KL) divergence between q and ℓ, and 𝒫 represents the space of probability distributions over 𝒳. Sampling from ℓ would allow us to obtain the points of interest in the search space, including the optimum 𝐱* and other locations where y ≤ τ. However, as the true p(y ≤ τ|𝐱) is unknown, we instead formulate a proxy inference problem with respect to a surrogate target distribution p̂_t based on the classifier model. For BORE, we set p̂_t(𝐱) ∝ π̂_{t−1}(𝐱), while for BORE++ the setting is p̂_t(𝐱) ∝ π_{t−1,δ}(𝐱). In contrast to ℓ(𝐱) ∝ p(y ≤ τ|𝐱), the normalisation constant of the surrogate distribution is unknown, leading us to the proxy problem q_t ∈ argmin_{q∈𝒫} D_KL(q‖p̂_t) at each iteration t ≥ 1. This variational inference problem can be efficiently solved via Stein variational gradient descent (SVGD) [35], described next.

5.2 Batch sampling via Stein variational gradient descent

In our implementation, we apply SVGD to approximately sample a batch t:={𝐱t,i}i=1M{\mathcal{{B}}}_{t}:=\{{\boldsymbol{\mathbf{x}}}_{t,i}\}_{i=1}^{M} of M1M\geq 1 points from p^t\hat{p}_{t}. Other approximate inference algorithms could also be applied. One of the main advantages of SVGD, however, is that it encourages diversification in the batch, capturing the possible multimodality of p^t\hat{p}_{t}. Given the batch locations, observations can be collected in parallel and then added to the dataset to update the classifier model.

SVGD is an approximate inference algorithm which represents a variational distribution qq as a set of particles {𝐱i}i=1M\{{\boldsymbol{\mathbf{x}}}^{i}\}_{i=1}^{M} [35]. The particles are initialised as i.i.d. samples from an arbitrary base distribution and then optimised via a sequence of smooth transformations towards the target distribution, which in our case corresponds to p^tπt1,δ\hat{p}_{t}\propto\pi_{t-1,\delta}. The SVGD steps are given by:

𝐱_t^i ← 𝐱_t^i + α ζ_t(𝐱_t^i),  i ∈ {1, …, M}, (25)
ζ_t(𝐱) := (1/M) ∑_{j=1}^{M} [ k(𝐱_t^j, 𝐱) ∇_{𝐱_t^j} log π_{t−1,δ}(𝐱_t^j) + ∇_{𝐱_t^j} k(𝐱_t^j, 𝐱) ], (26)

where k:𝒳×𝒳k:{\mathcal{{X}}}\times{\mathcal{{X}}}\to\mathbb{{R}} is a positive-definite kernel, and α>0\alpha>0 is a small step size. Intuitively, the first term in the definition of 𝜻t{\boldsymbol{\mathbf{\zeta}}}_{t} guides the particles to the modes of p^t\hat{p}_{t}, while the second term encourages diversification by repelling nearby particles. Theoretical convergence guarantees [36, 37] and practical extensions, such as second-order methods [38, 39] and derivative-free approaches [40], have been proposed in the literature. Further details on SVGD can be found in Liu and Wang [35].
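The update above can be vectorised in a few lines. Below is a minimal NumPy sketch of SVGD batch selection, assuming a user-supplied gradient of the log-acquisition (e.g., obtained via automatic differentiation); the RBF kernel with median-heuristic bandwidth is the common default of Liu and Wang [35], not a prescription of this paper:

import numpy as np

def svgd_batch(grad_log_pi, x0, n_steps=200, step=1e-2):
    """SVGD: move M particles towards the (unnormalised) target log pi while
    repelling them from one another, yielding a diverse batch."""
    x = x0.copy()                                         # (M, d) particles
    M = len(x)
    for _ in range(n_steps):
        diff = x[:, None, :] - x[None, :, :]              # pairwise x_i - x_j
        d2 = (diff ** 2).sum(-1)
        h = np.median(d2) / np.log(M + 1) + 1e-8          # median heuristic
        K = np.exp(-d2 / h)                               # RBF kernel matrix
        grads = grad_log_pi(x)                            # (M, d) gradients
        # zeta(x_i) = (1/M) sum_j [K_ij grad_j + grad_{x_j} K(x_j, x_i)]
        repulsion = (2.0 / h) * (K[:, :, None] * diff).sum(axis=1)
        x = x + step * (K @ grads + repulsion) / M
    return x

The returned particles form the batch ℬ_t, with grad_log_pi corresponding to ∇ log π_{t−1,δ} for BORE++ (or ∇ log π̂_{t−1} for BORE).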

5.3 Regret bound for Batch BORE++ with PLS classifiers

We follow the derivations in Oliveira et al. [41] to obtain a distributional regret bound for batch BORE++ with respect to its target sampling distribution ℓ, which is presented in the following result.

Theorem 4.

Under the same assumptions as in Theorem 1, running batch BORE++ with π_{t,δ} set as in Equation 20, we obtain a bound on the instantaneous distributional regret:

r¯t:=𝔼𝐱p^t[f(𝐱)]𝔼𝐱[f(𝐱)]2LϵLπβt1(δ)𝔼qt[σt1],t1,\bar{r}_{t}:=\mathbb{E}_{{\boldsymbol{\mathbf{x}}}\sim\hat{p}_{t}}[f({\boldsymbol{\mathbf{x}}})]-\mathbb{E}_{{\boldsymbol{\mathbf{x}}}\sim\ell}[f({\boldsymbol{\mathbf{x}}})]\leq 2L_{\epsilon}L_{\pi}\beta_{t-1}(\delta)\mathbb{E}_{q_{t}}[\sigma_{t-1}]\,,\quad t\geq 1\,, (27)

where Lπ:=max𝐱𝒳1π(𝐱)L_{\pi}:=\max_{{\boldsymbol{\mathbf{x}}}\in{\mathcal{{X}}}}\frac{1}{\pi({\boldsymbol{\mathbf{x}}})}, and on the cumulative distributional regret at T1T\geq 1:

R¯T:=t=1Tr¯t4LϵLπβT(δ)(T+2)ξT𝒪(T(bξT+ξTξMT))\bar{{{R}}}_{T}:=\sum_{t=1}^{T}\bar{r}_{t}\leq 4L_{\epsilon}L_{\pi}\beta_{T}(\delta)\sqrt{(T+2)\xi_{T}}\in{\mathcal{{O}}}(\sqrt{T}(b\sqrt{\xi_{T}}+\sqrt{\xi_{T}\xi_{MT}})) (28)

both of which hold with probability at least 1δ1-\delta.

As in the case of non-batch BORE++, the distributional regret bounds for the batch algorithm also grow sub-linearly for most popular kernels, leading to an asymptotically vanishing simple regret. Although the two notions of regret differ, we can compare the distributional regret of batch BORE++ with the non-distributional regret bounds for BORE++ by considering the case where τ is set to the function minimum, τ := f(𝐱*) = min_{𝐱∈𝒳} f(𝐱), and the observation noise is small. In this case, the batch sampling distribution would converge to a Dirac delta at the optimum, so that 𝔼_{𝐱∼ℓ}[f(𝐱)] ≈ f(𝐱*). Compared to the regret of non-batch BORE++ (Theorem 3) after collecting an equivalent number of observations T′ := MT, the expected regret of the batch version after T iterations is then lower by a factor of ξ_T/ξ_{MT}, noting that ξ_T ≤ ξ_{T′} = ξ_{MT}. Therefore, batch BORE++ should be able to achieve lower regret per runtime than sequential BORE++ with an equivalent number of observations.

6 Related work

Since their proposal by Schonlau et al. [42], batch Bayesian optimisation methods have appeared in various forms in the literature. Many methods are based on heuristics derived from estimates given by a Gaussian process regression model [11, 12, 43]. Others are based on Monte Carlo estimates of multi-query acquisition functions [10, 13], optimising points over GP posterior samples [44], solving local optimisation problems [45], or optimising over ensembles of acquisition functions [46]. Despite that, the prevalent approaches to batch BO are still based on a GP regression model, which requires prior knowledge about the objective function and does not scale to high-dimensional problems. We instead take a different approach by viewing BO as a density-ratio estimation problem following the BORE framework of Tiao et al. [9]. For batch design, we take an optimisation-as-inference approach [47, 48] by applying Stein variational gradient descent, a non-parametric approximate inference method [35] which has recently been combined with GP-based BO [49, 50]. Our theoretical results, however, are agnostic to the choice of inference algorithm. In contrast to traditional batch BO methods, the inference approach does not require solving inter-dependent optimisation problems for each batch point, as in heuristic-based approaches [43, 11, 12], Monte Carlo integration over the GP posterior [10, 13], or sampling from it [44]. SVGD allows batch selection to be solved in a vectorised way, which can take advantage of hardware accelerators, such as GPUs.

7 Experiments

This section presents experiments assessing the theoretical results and demonstrating the practical performance of batch BORE on a series of global optimisation benchmarks. We compared our methods against GP-based BO baselines in both sets of experiments. Additional experimental results, including the sequential setting (Appendix E), a description of the experimental setup (Appendix E), and further discussions on theoretical aspects can be found in the supplementary material. Code will be made available at https://github.com/rafaol/batch-bore-with-guarantees.

Theory assessment.

We first present simulated experiments assessing the theoretical results in practice, testing BORE and BORE++ in the PLS setting. As a baseline, we compare both methods against GP-UCB. This experiment was run by generating a random base classifier in the RKHS ℋ_k and then a corresponding objective function via the inverse noise CDF Φ_ϵ^{−1}. The search space was set as a uniformly sampled finite subset of the unit interval 𝒳 := [0,1] ⊂ ℝ. We applied the theory-backed settings for BORE++ (Section 4.2) and GP-UCB [25], while BORE employed the optimal PLS classifier (Equation 13).
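A sketch of this construction (with illustrative hyper-parameters; the exact settings are in the supplement): draw a random RKHS element, rescale it into valid probabilities, and induce the objective through the inverse noise CDF.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X_grid = np.sort(rng.uniform(0.0, 1.0, size=(100, 1)), axis=0)  # finite search space
ls = 0.1
K = np.exp(-0.5 * ((X_grid - X_grid.T) ** 2) / ls ** 2)  # SE-kernel Gram matrix

g = K @ rng.normal(size=100)                              # random RKHS element
pi_star = 0.05 + 0.9 * (g - g.min()) / (g.max() - g.min())  # rescale into (0, 1)
tau = 0.0
f = tau - norm.ppf(pi_star)                               # objective via Phi_eps^{-1}
y = f + rng.normal(size=100)                              # noisy observations, eps ~ N(0, 1)
z = (y <= tau).astype(float)                              # labels satisfy E[z | x_i] = pi_star[i]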

[Figure: Theory assessment] Regret in the theory assessment experiment. Results were averaged over 10 trials, and the shaded area indicates the 95% confidence interval. (Linear interpolation is applied to obtain the plotted confidence intervals when the number of trials is small.)

As the results in the figure show, BORE using an optimal PLS classifier simply gets stuck at its initial point, resulting in constant regret. BORE++, however, is able to make progress towards the global optimum, outperforming the GP-UCB baseline.

Figure 1: Performance on synthetic benchmarks. Plots show the simple regret, i.e., mintTrt\min_{t\leq T}r_{t}, per iteration. Results were averaged over 5 trials, and shaded areas indicate the 95% confidence interval.
Global optimisation benchmarks.

We evaluated the proposed SVGD-based batch BORE method on a series of test functions for global optimisation, comparing it against other BO baselines. In particular, for our comparisons, we ran the locally penalised EI (LP-EI) method [11] and the Monte Carlo based q-EI method [10], which are both based on the EI criterion, like BORE. Results are presented in Figure 1. All methods ran for T := 200 iterations and used a batch size of 10 evaluations per iteration. Additional experimental details are deferred to the supplementary material.

As Figure 1 shows, batch BORE is able to outperform its baselines on most of the global optimisation benchmarks. We also note that, in some cases, the LP-EI method becomes computationally infeasible after 100 iterations due to its complexity, having to be aborted halfway through the optimisation. Batch BORE, however, is able to maintain steady performance throughout its runs.

Real-data benchmarks.

Lastly, we compared the sequential version of BORE++ against BORE and other baselines, including traditional BO methods, such as GP-UCB and GP-EI [1], the Tree-structured Parzen Estimator (TPE) [15], and random search, on real-data benchmarks. In particular, we assessed the algorithms on some of the same benchmarks present in the original BORE paper [9].

[Figure 2 panels: (a) hyper-parameter tuning on the Parkinson's dataset; (b) hyper-parameter tuning on the CT slice dataset; (c) racing line optimisation; (d) neural architecture search on MNIST data]
Figure 2: Experimental results on real-data benchmarks. The plots show each algorithm’s simple regret averaged across multiple runs. The shaded areas correspond to the 95% confidence intervals.

Results are presented in Figure 2. As the plots show, BORE++ presents significantly better performance than BORE in the probabilistic least-squares (PLS) setting (i.e., with β_t := 0), as the theoretical results suggested. In fact, BORE (PLS) performs comparably to (or at times worse than) random search, indicating that the optimal least-squares classifier by itself is unable to properly capture the epistemic uncertainty. By using a neural network classifier trained via gradient descent with a different loss function (cross-entropy), the original BORE is still able to achieve top performance in most benchmarks. Both BORE versions are only surpassed by traditional GP-based BO on the racing line optimisation problem, as observed in Tiao et al. [9], due to the inherent smoothness of the problem, and in the final iterations of the neural architecture search problem by GP-EI. Interestingly, even though restricted to the kernel-based PLS setting, BORE++ is able to surpass the original BORE in the neural network hyper-parameter tuning problems (slice and parkinsons), while maintaining similar performance in the other tasks. These results confirm that improved uncertainty estimates can lead to practical performance gains.

8 Conclusion

This paper presented an extension of the BORE framework to the batch setting alongside the theoretical analysis of the proposed extension and an improvement over the original BORE. Theoretical results in terms of regret bounds and experiments show that BORE methods are able to maintain performance guarantees while outperforming traditional BO baselines. The main purpose of this work, however, was to establish the theoretical foundations for the analysis and derivation of new algorithmic frameworks for Bayesian optimisation via density-ratio estimation, equipping BO with new tools based on probabilistic classification, instead of regression models.

As future work, we plan to investigate the theoretical properties of BORE under different loss functions and analyse other batch design strategies, e.g., sampling methods for combinatorial domains. The theoretical contributions of this work can also be extended to other versions of BORE, such as its recent multi-objective version [51], and provide insights into other likelihood-free BO methods [52]. Finally, we consider integrating BORE++ with other probabilistic classification models equipped with predictive uncertainty estimates, such as neural network ensembles [53], random forests [54], and Bayesian generalised linear models, which should lead to improvements in scalability and additional performance gains.

Acknowledgments and Disclosure of Funding

Rafael Oliveira was supported by the Medical Research Future Fund Applied Artificial Intelligence in Health Care grant (MRFAI000097) by the Australian Department of Health and Aged Care.

References

  • Shahriari et al. [2016] Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P. Adams, and Nando De Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2016.
  • Rasmussen and Williams [2006] Carl E. Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, 2006.
  • Srinivas et al. [2010] Niranjan Srinivas, Andreas Krause, Sham M. Kakade, and Matthias Seeger. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), pages 1015–1022, 2010.
  • Bull [2011] Adam D. Bull. Convergence Rates of Efficient Global Optimization Algorithms. Journal of Machine Learning Research (JMLR), 12:2879–2904, 2011.
  • Wang et al. [2018] Zi Wang, Beomjoon Kim, and Leslie Kaelbling. Regret bounds for meta Bayesian optimization with an unknown Gaussian process prior. In Conference on Neural Information Processing Systems, Montreal, Canada, 2018.
  • Snoek et al. [2015] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, M Patwary, Prabhat, and R Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning (ICML), Lille, France, 2015.
  • Springenberg et al. [2016] Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems (NIPS), Barcelona, Spain, 2016.
  • Hutter et al. [2011] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pages 507–523. Springer, 2011.
  • Tiao et al. [2021] Louis C Tiao, Aaron Klein, Matthias Seeger, Edwin V Bonilla, Cédric Archambeau, and Fabio Ramos. Bayesian Optimization by Density-Ratio Estimation. In Proceedings of the 38th International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, Jul 2021.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimization of machine learning algorithms. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 2951–2959. Curran Associates, Inc., 2012.
  • Gonzalez et al. [2016] Javier Gonzalez, Zhenwen Dai, Philipp Hennig, and Neil D. Lawrence. Batch Bayesian optimization via local penalization. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 648–657, Cadiz, Spain, 2016.
  • Wang et al. [2017] Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional Bayesian optimization via structural kernel learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 3656–3664, International Convention Centre, Sydney, Australia, 2017. PMLR.
  • Wilson et al. [2018] James T. Wilson, Frank Hutter, and Marc Peter Deisenroth. Maximizing acquisition functions for Bayesian optimization. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, 2018.
  • Jones et al. [1998] D. R. Jones, M. Schonlau, and W. J. Welch. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 13:455–492, 1998.
  • Bergstra et al. [2011] James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-parameter Optimization. In Advances in Neural Information Processing Systems, pages 2546–2554, 2011.
  • Sugiyama et al. [2012] Masashi Sugiyama, Hirotaka Hachiya, Makoto Yamada, Jaak Simm, and Hyunha Nam. Least-squares probabilistic classifier: A computationally efficient alternative to kernel logistic regression. In Proceedings of International Workshop on Statistical Machine Learning for Speech Processing (IWSML2012), Kyoto, Japan, 2012.
  • Gneiting and Raftery [2007] Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
  • Barron [1994] Andrew R Barron. Approximation and estimation bounds for artificial neural networks. Machine learning, 14(1):115–133, 1994.
  • Steinwart and Christmann [2009] Ingo Steinwart and Andreas Christmann. Fast learning from Non-i.i.d. observations. In Advances in Neural Information Processing Systems 22, pages 1768–1776, 2009.
  • Selten [1998] Reinhard Selten. Axiomatic characterization of the quadratic scoring rule. Experimental Economics, 1(1):43–62, 1998.
  • Suykens and Vandewalle [1999] J. A. K. Suykens and J. Vandewalle. Least squares support vector machine classifier. Neural Processing Letters, 9:293–300, 1999.
  • Schölkopf and Smola [2002] Bernhard Schölkopf and Alexander J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge, Mass, 2002.
  • Abbasi-Yadkori et al. [2010] Yasin Abbasi-Yadkori, David Pal, and Csaba Szepesvari. Improved Algorithms for Linear Stochastic Bandits. In Advances in Neural Information Processing Systems (NIPS), pages 1–19, 2010.
  • Abbasi-Yadkori [2012] Yasin Abbasi-Yadkori. Online Learning for Linearly Parametrized Control Problems. Phd, University of Alberta, 2012.
  • Durand et al. [2018] Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernel regression with provably adaptive mean, variance, and regularization. Journal of Machine Learning Research, 19(1):650–683, 2018.
  • Chowdhury and Gopalan [2017] Sayak Ray Chowdhury and Aditya Gopalan. On Kernelized Multi-armed Bandits. In Proceedings of the 34th International Conference on Machine Learning (ICML), Sydney, Australia, 2017.
  • Vakili et al. [2021] Sattar Vakili, Kia Khezeli, and Victor Picheny. On information gain and regret bounds in gaussian process bandits. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 82–90. PMLR, 13–15 Apr 2021.
  • Bardsley et al. [2014] Johnathan M Bardsley, Antti Solonen, Heikki Haario, and Marko Laine. Randomize-then-optimize: A method for sampling from posterior distributions in nonlinear inverse problems. SIAM Journal on Scientific Computing, 36(4):A1895 – A1910, 2014.
  • Mandt et al. [2017] Stephan Mandt, Matthew D. Hoffman, and David M. Blei. Stochastic gradient descent as approximate Bayesian inference. Journal of Machine Learning Research, 18, 2017.
  • Russo and Van Roy [2016] Daniel Russo and Benjamin Van Roy. An Information-Theoretic Analysis of Thompson Sampling. Journal of Machine Learning Research (JMLR), 17:1–30, 2016.
  • Rokach [2010] Lior Rokach. Ensemble-based classifiers. Artificial intelligence review, 33(1):1–39, 2010.
  • Penny and Roberts [1999] William D Penny and Stephen J Roberts. Bayesian neural networks for classification: how useful is the evidence framework? Neural networks, 12(6):877–892, 1999.
  • Amit and Geman [1997] Yali Amit and Donald Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545–1588, 07 1997.
  • Polson and Sokolov [2017] Nicholas G. Polson and Vadim Sokolov. Deep learning: A Bayesian perspective. Bayesian Analysis, 12(4):1275–1304, 2017.
  • Liu and Wang [2016] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • Liu [2017] Qiang Liu. Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pages 8854–8863, Long Beach, CA, USA, 2017.
  • Korba et al. [2020] Anna Korba, Adil Salim, Michael Arbel, Giulia Luise, and Arthur Gretton. A non-asymptotic analysis for Stein variational gradient descent. In Advances in Neural Information Processing Systems, Vancouver, Canada, 2020.
  • Detommaso et al. [2018] Gianluca Detommaso, Tiangang Cui, Alessio Spantini, Youssef Marzouk, and Robert Scheichl. A Stein variational Newton method. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, 2018.
  • Liu et al. [2019] Chang Liu, Jingwei Zhuo, Pengyu Cheng, Ruiyi Zhang, Jun Zhu, and Lawrence Carin. Understanding and accelerating particle-based variational inference. In 36th International Conference on Machine Learning (ICML 2019), Long Beach, CA, 2019.
  • Han and Liu [2018] Jun Han and Qiang Liu. Stein variational gradient descent without gradient. In 35th International Conference on Machine Learning (ICML 2018), 2018.
  • Oliveira et al. [2021] Rafael Oliveira, Lionel Ott, and Fabio Ramos. No-regret approximate inference via Bayesian optimisation. In Cassio de Campos and Marloes H. Maathuis, editors, Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, volume 161 of Proceedings of Machine Learning Research, pages 2082–2092. PMLR, 27–30 Jul 2021.
  • Schonlau et al. [1998] Matthias Schonlau, William J. Welch, and Donald R. Jones. Global versus local search in constrained optimization of computer models. New Developments and Applications in Experimental Design, 34:11–25, 1998.
  • Azimi et al. [2010] Javad Azimi, Alan Fern, and Xiaoli Z. Fern. Batch Bayesian optimization via simulation matching. In Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010 (NIPS 2010), 2010.
  • Kandasamy et al. [2018] Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabas Poczos. Parallelised Bayesian optimisation via Thompson sampling. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), Lanzarote, Spain, 2018.
  • Eriksson et al. [2019] David Eriksson, Michael Pearce, Jacob R Gardner, Ryan Turner, and Matthias Poloczek. Scalable global optimization via local Bayesian optimization. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), 2019.
  • Zhang et al. [2022] Shuhan Zhang, Fan Yang, Changhao Yan, Dian Zhou, and Xuan Zeng. An efficient batch-constrained Bayesian optimization approach for analog circuit synthesis via multiobjective acquisition ensemble. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 41(1), 2022.
  • Todorov [2008] Emanuel Todorov. General duality between optimal control and estimation. In Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico, 2008.
  • Fellows et al. [2019] Matthew Fellows, Anuj Mahajan, Tim G. J. Rudner, and Shimon Whiteson. VIREL: A variational inference framework for reinforcement learning. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, 2019.
  • Gong et al. [2019] Chengyue Gong, Jian Peng, and Qiang Liu. Quantile Stein Variational Gradient Descent for Batch Bayesian Optimization. In Proceedings of the 36th International Conference on Machine Learning, Long Beach, CA, USA, 2019.
  • Oliveira et al. [2019] Rafael Oliveira, Lionel Ott, and Fabio Ramos. Distributional Bayesian optimisation for variational inference on black-box simulators. In 2nd Symposium on Advances in Approximate Bayesian Inference, Vancouver, Canada, 2019.
  • De Ath et al. [2022] George De Ath, Tinkle Chugh, and Alma A. M. Rahat. MBORE: Multi-objective Bayesian optimisation by density-ratio estimation. In GECCO ’22, pages 776–785, New York, NY, USA, 2022. Association for Computing Machinery.
  • Song et al. [2022] Jiaming Song, Lantao Yu, Willie Neiswanger, and Stefano Ermon. A general recipe for likelihood-free Bayesian optimization. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 20384–20404. PMLR, 17–23 Jul 2022.
  • Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Neural Information Processing Systems (NIPS), 2017.
  • Balog and Teh [2015] Matej Balog and Yee Whye Teh. The Mondrian Process for Machine Learning. Technical report, University of Oxford, 2015.
  • Boucheron et al. [2013] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • Munkres [1975] James Raymond Munkres. Topology: a first course. Prentice Hall, Edgewood Cliffs, NJ, 1975.
  • Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, Montreal, Canada, 2018.
  • de G. Matthews et al. [2018] Alexander G. de G. Matthews, Jiri Hron, Mark Rowland, Richard E. Turner, and Zoubin Ghahramani. Gaussian Process Behaviour in Wide Deep Neural Networks. In International Conference on Learning Representations, Vancouver, Canada, 2018. OpenReview.net.
  • Grünewälder et al. [2010] Steffen Grünewälder, Jean Yves Audibert, Manfred Opper, and John Shawe-Taylor. Regret bounds for Gaussian process bandit problems. In Journal of Machine Learning Research, volume 9, pages 273–280, 2010.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Byrd et al. [1995] Richard H Byrd, Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.
  • authors [2016] The GPyOpt authors. GPyOpt: A Bayesian optimization framework in Python. http://github.com/SheffieldML/GPyOpt, 2016.
  • Balandat et al. [2020] Maximilian Balandat, Brian Karrer, Daniel R. Jiang, Samuel Daulton, Benjamin Letham, Andrew Gordon Wilson, and Eytan Bakshy. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. In Advances in Neural Information Processing Systems 33, 2020.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2015.
  • Virtanen et al. [2020] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
  • Tsanas et al. [2009] Athanasios Tsanas, Max Little, Patrick McSharry, and Lorraine Ramig. Accurate telemonitoring of Parkinson’s disease progression by non-invasive speech tests. Nature Precedings, pages 1–1, 2009.
  • Graf et al. [2011] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian Pölsterl, and Alexander Cavallaro. 2D image registration in CT images using radial image descriptors. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 607–614. Springer, 2011.
  • Dua and Graff [2017] Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Eggensperger et al. [2021] Katharina Eggensperger, Philipp Müller, Neeratyoy Mallik, Matthias Feurer, Rene Sass, Aaron Klein, Noor Awad, Marius Lindauer, and Frank Hutter. HPOBench: A collection of reproducible multi-fidelity benchmark problems for HPO. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  • Liniger et al. [2015] Alexander Liniger, Alexander Domahidi, and Manfred Morari. Optimization-based autonomous racing of 1:43 scale RC cars. Optimal Control Applications and Methods, 36:628–647, 2015.
  • Jain and Morari [2020] Achin Jain and Manfred Morari. Computing the racing line using Bayesian optimization. In Proceedings of the 59th IEEE Conference on Decision and Control (CDC), 2020.
  • Deng [2012] Li Deng. The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • Falkner et al. [2018] Stefan Falkner, Aaron Klein, and Frank Hutter. BOHB: Robust and efficient hyperparameter optimization at scale. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1437–1446. PMLR, 10–15 Jul 2018.
  • Rahimi and Recht [2007] Ali Rahimi and Ben Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems (NIPS), 2007.

Appendix

This appendix complements the main paper with proofs, experiment details, and additional experiments and discussion. Appendix A presents full proofs for the main theoretical results in the paper. In Appendix B, we discuss an approach to derive alternative regret bounds for BORE under a Thompson sampling perspective. Appendix C discusses the theoretical analysis of BORE with a non-constant approximation of the observation quantile $\tau$. In Appendix D, we present further details on the experimental setup. Finally, Appendix E presents an additional experiment assessing the effect of dimensionality.

Appendix A Proofs

This section presents proofs for the main theoretical results in the paper. We start with a few auxiliary results from the GP-UCB literature [3, 26], followed by the proofs of the main theorems.

A.1 Auxiliary results

Lemma A.1 (Srinivas et al. [3, Lemma 5.3]).

The information gain for a sequence of $N \geq 1$ observations $\{\mathbf{x}_i, z_i\}_{i=1}^N$, where $z_i = f(\mathbf{x}_i) + \nu_i$ and $\nu_i \sim \mathcal{N}(0, \lambda)$, can be expressed in terms of the predictive variances. Namely, if $f \sim \mathcal{GP}(m, k)$, then the information gain provided by the observations is such that:

$I(\mathbf{z}_N, \mathbf{f}_N \mid \mathcal{X}_N) = \frac{1}{2} \sum_{i=1}^{N} \log(1 + \lambda^{-1} \sigma^2_{i-1}(\mathbf{x}_i))\,,$ (A.1)

where $\mathbf{f}_N := [f(\mathbf{x}_i)]_{i=1}^N$ and $\mathcal{X}_N := \{\mathbf{x}_i\}_{i=1}^N \subset \mathcal{X}$.
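As a sanity check, the identity in (A.1) can be verified numerically against the log-determinant form of the information gain, $\frac{1}{2}\log|\mathbf{I} + \lambda^{-1}\mathbf{K}_N|$. The sketch below is illustrative only, assuming an RBF kernel and arbitrary inputs of our choosing; it is not part of the paper's implementation.

import numpy as np

def rbf(A, B, ls=0.1):
    # Squared-exponential kernel matrix between rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(0)
N, lam = 8, 0.01
X = rng.uniform(size=(N, 1))                            # query points x_1, ..., x_N

K = rbf(X, X)
lhs = 0.5 * np.linalg.slogdet(np.eye(N) + K / lam)[1]   # (1/2) log|I + K/lambda|

rhs = 0.0
for i in range(N):
    k_i = K[:i, i]
    # GP posterior variance at x_i given the first i - 1 observations
    var = K[i, i] - (k_i @ np.linalg.solve(K[:i, :i] + lam * np.eye(i), k_i) if i else 0.0)
    rhs += 0.5 * np.log1p(var / lam)

assert np.isclose(lhs, rhs)  # both expressions for the information gain agree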

Lemma A.2 (Chowdhury and Gopalan [26, Lemma 4]).

Following the setting of Lemma A.1, the sum of predictive standard deviations at a sequence of $N$ points is bounded in terms of the maximum information gain:

$\sum_{i=1}^{N} \sigma_{i-1}(\mathbf{x}_i) \leq \sqrt{4(N+2)\xi_N}\,.$ (A.2)
Lemma A.3.

Let $\mathcal{A} \subset \mathcal{X}$ be a finite set of points where a function $f \sim \mathcal{GP}(m, k)$ was evaluated, so that the GP posterior covariance function and the corresponding variance are given by:

$k_{\mathcal{A}}(\mathbf{x}, \mathbf{x}') := k(\mathbf{x}, \mathbf{x}') - k(\mathbf{x}, \mathcal{A})^{\mathsf{T}} (\mathbf{K}(\mathcal{A}) + \eta\mathbf{I})^{-1} k(\mathcal{A}, \mathbf{x}')$ (A.3)
$\sigma^2_{\mathcal{A}}(\mathbf{x}) := k_{\mathcal{A}}(\mathbf{x}, \mathbf{x})\,, \quad \mathbf{x}, \mathbf{x}' \in \mathcal{X},$ (A.4)

where $k(\mathbf{x}, \mathcal{A}) := [k(\mathbf{x}, \mathbf{a})]_{\mathbf{a} \in \mathcal{A}}$ and $\mathbf{K}(\mathcal{A}) := [k(\mathbf{a}, \mathbf{a}')]_{\mathbf{a}, \mathbf{a}' \in \mathcal{A}}$. Then, for any given set $\mathcal{B} \supset \mathcal{A}$ of evaluations of $f$, we have:

$\sigma^2_{\mathcal{B}}(\mathbf{x}) \leq \sigma^2_{\mathcal{A}}(\mathbf{x}), \quad \forall \mathbf{x} \in \mathcal{X}\,.$ (A.5)
Proof.

The result follows by observing that the GP posterior given observations at $\mathcal{A}$ is a prior for the GP with the new observations at the complement $\mathcal{C} := \mathcal{B} \setminus \mathcal{A}$. Then we obtain, for all $\mathbf{x} \in \mathcal{X}$:

$\begin{split} \sigma^2_{\mathcal{B}}(\mathbf{x}) &:= k(\mathbf{x}, \mathbf{x}) - k(\mathbf{x}, \mathcal{B})(\mathbf{K}(\mathcal{B}) + \eta\mathbf{I})^{-1} k(\mathcal{B}, \mathbf{x}) \\ &= \sigma^2_{\mathcal{A}}(\mathbf{x}) - k_{\mathcal{A}}(\mathbf{x}, \mathcal{C})(\mathbf{K}_{\mathcal{A}}(\mathcal{C}) + \eta\mathbf{I})^{-1} k_{\mathcal{A}}(\mathcal{C}, \mathbf{x}) \\ &\leq \sigma^2_{\mathcal{A}}(\mathbf{x})\,, \end{split}$ (A.6)

since $k_{\mathcal{A}}(\mathbf{x}, \mathcal{C})(\mathbf{K}_{\mathcal{A}}(\mathcal{C}) + \eta\mathbf{I})^{-1} k_{\mathcal{A}}(\mathcal{C}, \mathbf{x})$ is non-negative. ∎
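The monotonicity property (A.5) is also easy to check numerically. The following sketch, under an assumed RBF kernel and hypothetical input sets (none of which come from the paper), verifies that enlarging the observation set never increases the posterior variance.

import numpy as np

def rbf(A, B, ls=0.1):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def post_var(Xq, S, eta=0.01):
    # Posterior variance at query points Xq given evaluations at S, as in (A.3)-(A.4)
    kqS = rbf(Xq, S)
    KS = rbf(S, S) + eta * np.eye(len(S))
    return rbf(Xq, Xq).diagonal() - np.einsum('ij,ji->i', kqS, np.linalg.solve(KS, kqS.T))

rng = np.random.default_rng(1)
A = rng.uniform(size=(5, 2))                   # evaluation set A
B = np.vstack([A, rng.uniform(size=(3, 2))])   # superset B of A with extra points
Xq = rng.uniform(size=(50, 2))                 # query locations

assert np.all(post_var(Xq, B) <= post_var(Xq, A) + 1e-12)  # inequality (A.5)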

A.2 Proof of Theorem 1

Proof of Theorem 1.

The proof follows by a simple application of Durand et al. [25, Thm. 1] on GP-UCB to our setting, as $\pi \in \mathcal{H}$, and the stochastic process defining the query locations $\mathbf{x}_t$ and the observation noise $\nu_t := z_t - \pi(\mathbf{x}_t)$ satisfy their sub-Gaussianity assumptions. In particular, $\nu_t$ is $\sigma_\nu$-sub-Gaussian with $\sigma_\nu \leq 1$, since $|\nu_t| \leq 1$ for all $t \geq 1$ [55]. ∎

A.3 Proof of Theorem 2

To prove Theorem 2, we follow the procedure of GP-UCB proofs [3, 26] by bounding the approximation error $|\pi(\mathbf{x}) - \hat{\pi}_t(\mathbf{x})|$ via a confidence bound (Theorem 1) and then applying it to the instant regret. From the instant regret to the cumulative regret, the bounds are extended by means of the maximum information gain $\xi_T$ introduced in the main text. One difference in our proof, however, is that BORE with a PLS classifier does not follow the optimal UCB policy, but rather a pure-exploitation approach which maximises the mean estimator $\hat{\pi}_t$ and does not account for uncertainty.

Proof of Theorem 2.

Recalling the classifier-based bound in Section 4 and that the result in Lemma 1 holds for any $\tau \in \mathbb{R}$, we have:

$\begin{split} r_t &= f(\mathbf{x}_t) - f(\mathbf{x}^*) \\ &\leq L_\epsilon(\pi(\mathbf{x}^*) - \pi(\mathbf{x}_t))\,. \end{split}$ (A.7)

According to Theorem 1, working with the confidence bounds on $\pi(\mathbf{x})$, we then have that the instant regret is bounded with probability at least $1 - \delta$ by:

$\begin{split} \forall t \geq 1, \quad r_t &\leq L_\epsilon(\hat{\pi}_{t-1}(\mathbf{x}^*) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}^*) - \pi(\mathbf{x}_t)) \\ &\leq L_\epsilon(\hat{\pi}_{t-1}(\mathbf{x}^*) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}^*) - \hat{\pi}_{t-1}(\mathbf{x}_t) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}_t)) \\ &\leq L_\epsilon\beta_{t-1}(\delta)(\sigma_{t-1}(\mathbf{x}^*) + \sigma_{t-1}(\mathbf{x}_t)), \end{split}$ (A.8)

since $\hat{\pi}_{t-1}(\mathbf{x}^*) \leq \max_{\mathbf{x} \in \mathcal{X}} \hat{\pi}_{t-1}(\mathbf{x}) = \hat{\pi}_{t-1}(\mathbf{x}_t)$. Now we can apply Lemma A.2, yielding, with probability at least $1 - \delta$:

$\begin{split} R_T := \sum_{t=1}^{T} r_t &\leq L_\epsilon\beta_T(\delta)\sum_{t=1}^{T}(\sigma_{t-1}(\mathbf{x}_t) + \sigma_{t-1}(\mathbf{x}^*)) \\ &\leq L_\epsilon\beta_T(\delta)\left(\sqrt{4(T+2)\xi_T} + \sum_{t=1}^{T}\sigma_{t-1}(\mathbf{x}^*)\right), \end{split}$ (A.9)

since $\beta_t(\delta) \leq \beta_{t+1}(\delta)$ for all $t \geq 1$. This concludes the proof. ∎

A.4 Proof of Theorem 3

Again, we follow the standard GP-UCB proofs for this result, using the bound in Theorem 1.

Proof of Theorem 3.

Extending the bound in Equation A.7 with Theorem 1, we have, with probability at least $1 - \delta$:

$\begin{split} \forall t \geq 1, \quad r_t &\leq L_\epsilon(\hat{\pi}_{t-1}(\mathbf{x}^*) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}^*) - \pi(\mathbf{x}_t)) \\ &\leq L_\epsilon(\hat{\pi}_{t-1}(\mathbf{x}^*) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}^*) - \hat{\pi}_{t-1}(\mathbf{x}_t) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}_t)) \\ &\leq 2L_\epsilon\beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}_t), \end{split}$ (A.10)

since $\hat{\pi}_{t-1}(\mathbf{x}^*) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}^*) \leq \max_{\mathbf{x} \in \mathcal{X}} \{\hat{\pi}_{t-1}(\mathbf{x}) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x})\} = \hat{\pi}_{t-1}(\mathbf{x}_t) + \beta_{t-1}(\delta)\sigma_{t-1}(\mathbf{x}_t)$. Turning our attention to the cumulative regret, with the same probability, we have:

$\begin{split} R_T := \sum_{t=1}^{T} r_t &\leq 2L_\epsilon\beta_T(\delta)\sum_{t=1}^{T}\sigma_{t-1}(\mathbf{x}_t) \\ &\leq 4L_\epsilon\beta_T(\delta)\sqrt{(T+2)\xi_T}, \end{split}$ (A.11)

which concludes the proof. ∎

A.5 Proof of Theorem 4

Proof.

Starting with the regret definition, we can derive a bound in terms of the discrepancy between the two sampling distributions:

$\begin{split} r_t &:= \mathbb{E}_{\mathbf{x} \sim \hat{p}_t}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim \ell}[f(\mathbf{x})] \\ &\leq L_\epsilon\left(\mathbb{E}_{\mathbf{x} \sim \ell}[\pi(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim \hat{p}_t}[\pi(\mathbf{x})]\right) \\ &\leq L_\epsilon\lVert\pi\rVert_\infty \int_{\mathcal{X}} |\ell(\mathbf{x}) - q_{t-1}(\mathbf{x})| \,\mathrm{d}\mathbf{x} \\ &\leq L_\epsilon\lVert\pi\rVert_\infty \sqrt{\tfrac{1}{2} D_{\mathrm{KL}}(q_{t-1} \| \ell)}\,, \qquad \forall t \geq 1, \end{split}$ (A.12)

where the last line is due to Pinsker’s inequality [55] applied to the total variation distance between $\hat{p}_t$ and $\ell$ (third line).

To obtain a bound on $D_{\mathrm{KL}}(\hat{p}_t \| \ell)$, starting from the definition of the terms, we have, with probability at least $1 - \delta$:

$\begin{split} \forall t \geq 0, \quad D_{\mathrm{KL}}(\hat{p}_t \| \ell) &= \mathbb{E}_{\mathbf{x} \sim \hat{p}_t}[\log \hat{p}_t(\mathbf{x}) - \log \ell(\mathbf{x})] \\ &= \mathbb{E}_{\mathbf{x} \sim \hat{p}_t}[\log(\hat{\pi}_t(\mathbf{x}) + \beta_t(\delta)\sigma_t(\mathbf{x})) - \log \pi(\mathbf{x}) - \log \eta_t + \log \gamma] \\ &\leq \mathbb{E}_{\mathbf{x} \sim \hat{p}_t}[\log(\hat{\pi}_t(\mathbf{x}) + \beta_t(\delta)\sigma_t(\mathbf{x})) - \log \pi(\mathbf{x})]\,, \end{split}$ (A.13)

which follows from $\eta_t := \int_{\mathcal{X}} (\hat{\pi}_t(\mathbf{x}) + \beta_t(\delta)\sigma_t(\mathbf{x})) p(\mathbf{x}) \,\mathrm{d}\mathbf{x} \geq \int_{\mathcal{X}} \pi(\mathbf{x}) p(\mathbf{x}) \,\mathrm{d}\mathbf{x} =: \gamma$, so that $\log \gamma - \log \eta_t \leq 0$. Now, by the mean value theorem [56], for all $t \geq 0$, the following holds with the same probability:

$\begin{split} |\log(\hat{\pi}_t(\mathbf{x}) + \beta_t(\delta)\sigma_t(\mathbf{x})) - \log \pi(\mathbf{x})| &\leq L_\pi |\hat{\pi}_t(\mathbf{x}) + \beta_t(\delta)\sigma_t(\mathbf{x}) - \pi(\mathbf{x})| \\ &\leq 2L_\pi\beta_t(\delta)\sigma_t(\mathbf{x})\,, \quad \forall \mathbf{x} \in \mathcal{X}\,, \end{split}$ (A.14)

since $\frac{\mathrm{d}\log(s)}{\mathrm{d}s} < L_\pi < \infty$ for all $s > \min_{\mathbf{x} \in \mathcal{X}} \pi(\mathbf{x}) > 0$, and $|\hat{\pi}_t(\mathbf{x}) - \pi(\mathbf{x})| \leq \beta_t(\delta)\sigma_t(\mathbf{x})$ by Theorem 1. The first result in the theorem then follows.

For the second part of the result, we first note that:

$\forall T \geq 1, \quad \min_{t \leq T} D_{\mathrm{KL}}(\hat{p}_t \| \ell) \leq \frac{1}{T} \sum_{t=1}^{T} D_{\mathrm{KL}}(\hat{p}_{t-1} \| \ell)$ (A.15)

Following the previous derivations, it holds with probability at least $1 - \delta$ that:

$\begin{split} \sum_{t=1}^{T} D_{\mathrm{KL}}(\hat{p}_t \| \ell) &\leq 2L_\pi\sum_{t=1}^{T}\beta_{t-1}(\delta)\,\mathbb{E}_{\tilde{\mathbf{x}}_t \sim \hat{p}_t}[\sigma_{t-1}(\tilde{\mathbf{x}}_t)] \\ &\leq 2L_\pi\beta_T(\delta)\sum_{t=1}^{T}\mathbb{E}_{\tilde{\mathbf{x}}_t \sim q_t}[\sigma_{t-1}(\tilde{\mathbf{x}}_t)] \\ &\leq 2L_\pi\beta_T(\delta)\,\mathbb{E}_{\tilde{\mathbf{x}}_1 \sim q_1, \dots, \tilde{\mathbf{x}}_T \sim q_T}\!\left[\sum_{t=1}^{T}\sigma_{t-1}(\tilde{\mathbf{x}}_t)\right]\,, \end{split}$ (A.16)

since $\beta_T \geq \beta_t$ for all $t \leq T$, and expectations are linear operations. Considering the predictive variances above, recall that, at each iteration $t \geq 1$, the algorithm selects a batch of i.i.d. points $\mathcal{B}_t := \{\mathbf{x}_{t,i}\}_{i=1}^M$, sampled from $\hat{p}_t$, at which to evaluate the objective function $f$. The predictive variance $\sigma^2_{t-1}$ is conditioned on all previous observations, which are grouped by batches. We can then decompose, for any $t \geq 1$:

$\sigma^2_t(\mathbf{x}) = \sigma^2_{t-1}(\mathbf{x}) - k_{t-1}(\mathbf{x}, \mathcal{B}_t)(\mathbf{K}_{t-1}(\mathcal{B}_t) + \eta\mathbf{I})^{-1} k_{t-1}(\mathcal{B}_t, \mathbf{x})\,,$ (A.17)

where we use the notation introduced in Lemma A.3, and:

$k_t(\mathbf{x}, \mathbf{x}') = k_{t-1}(\mathbf{x}, \mathbf{x}') - k_{t-1}(\mathbf{x}, \mathcal{B}_t)(\mathbf{K}_{t-1}(\mathcal{B}_t) + \eta\mathbf{I})^{-1} k_{t-1}(\mathcal{B}_t, \mathbf{x}')\,, \quad t \geq 1,$ (A.18)
$k_0(\mathbf{x}, \mathbf{x}') := k(\mathbf{x}, \mathbf{x}')\,.$ (A.19)

Therefore, the predictive variance of the batched algorithm is not the same as the predictive variance of a sequential algorithm, and we cannot directly apply Lemma A.2 to bound the last term in Equation A.16.

Lemma A.3 tells us that the predictive variance given a set of observations is no greater than the predictive variance given any subset of those observations. Selecting only the first point from each batch and applying Lemma A.3, we get, for $t \geq 1$:

$\sigma^2_t(\mathbf{x}) \leq s^2_t(\mathbf{x}) := k(\mathbf{x}, \mathbf{x}) - k(\mathbf{x}, \mathcal{X}_t)(\mathbf{K}(\mathcal{X}_t) + \eta\mathbf{I})^{-1} k(\mathcal{X}_t, \mathbf{x})\,,$ (A.20)

where $\mathcal{X}_t := \{\mathbf{x}_{i,1}\}_{i=1}^t$, with $\mathbf{x}_{i,1} \in \mathcal{B}_i$, $i \in \{1, \dots, t\}$. Note that the right-hand side of the equation above is simply the non-batched GP predictive variance. Furthermore, sample points within a batch are i.i.d., so that $\mathbf{x}_{t,1} \sim q_t$ and $\tilde{\mathbf{x}}_t \sim q_t$ are identically distributed. We can now apply Lemma A.2, yielding:

$\mathbb{E}_{\tilde{\mathbf{x}}_1 \sim q_1, \dots, \tilde{\mathbf{x}}_T \sim q_T}\!\left[\sum_{t=1}^{T}\sigma_{t-1}(\tilde{\mathbf{x}}_t)\right] \leq \mathbb{E}_{\tilde{\mathbf{x}}_1 \sim q_1, \dots, \tilde{\mathbf{x}}_T \sim q_T}\!\left[\sum_{t=1}^{T}s_{t-1}(\tilde{\mathbf{x}}_t)\right] \leq 2\sqrt{(T+2)\xi_T}\,.$ (A.21)

Combining this result with Equation A.16, we obtain:

$\sum_{t=1}^{T} D_{\mathrm{KL}}(\hat{p}_t \| \ell) \leq 4L_\pi\beta_T(\delta)\sqrt{(T+2)\xi_T} \in \mathcal{O}(\beta_T(\delta)\sqrt{T\xi_T})\,.$ (A.22)

Lastly, from the definition of $\beta_t(\delta)$, we have:

$\beta_T(\delta) := b + \sqrt{2\lambda^{-1}\log(|\mathbf{I} + \lambda^{-1}\mathbf{K}_{\mathcal{D}_T}|^{1/2}/\delta)}\,,$ (A.23)

where:

$\log(|\mathbf{I} + \lambda^{-1}\mathbf{K}_{\mathcal{D}_T}|^{1/2}) = I(\mathbf{z}_{N_T}, \mathbf{h}_{N_T}) \leq \xi_{N_T} = \xi_{MT}\,,$ (A.24)

for $h \sim \mathcal{GP}(m, k)$. Therefore, the cumulative sum of divergences is such that:

$\sum_{t=1}^{T} D_{\mathrm{KL}}(\hat{p}_t \| \ell) \in \mathcal{O}(\sqrt{T}(b\sqrt{\xi_T} + \sqrt{\xi_T\xi_{MT}}))\,,$ (A.25)

which concludes the proof. ∎
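For intuition, the sampling distribution $\hat{p}_t \propto (\hat{\pi}_t + \beta_t(\delta)\sigma_t)\,p$ analysed in this proof can be sketched on a finite domain as follows. All quantities below are hypothetical placeholders of ours, rather than outputs of a trained classifier.

import numpy as np

rng = np.random.default_rng(5)
N, M, beta_t = 100, 4, 3.0
pi_hat = rng.uniform(0.1, 0.9, size=N)      # placeholder classifier mean estimates
sigma = rng.uniform(0.0, 0.2, size=N)       # placeholder predictive standard deviations
p = np.full(N, 1.0 / N)                     # uniform prior p(x) over the finite domain

w = (pi_hat + beta_t * sigma) * p
p_hat = w / w.sum()                         # the sum plays the role of eta_t
batch_idx = rng.choice(N, size=M, p=p_hat)  # batch B_t: M i.i.d. draws from p_hat_t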

Appendix B Bayesian regret bounds for BORE as Thompson sampling

Although our main results considered BORE with an optimal classifier according to a least-squares loss, in practice the trained classifier might be sub-optimal due to training via gradient descent. In particular, in the case of stochastic gradient descent, Mandt et al. [29] showed that parameters learnt this way can be seen as approximate samples from a Bayesian posterior distribution. This setting, therefore, corresponds to Thompson (or posterior) sampling [30]. If we consider that the posterior over the model’s function space is Gaussian, e.g., as in the case of infinitely wide deep neural networks [57, 58], we may instead analyse the original BORE as a GP-based Thompson sampling algorithm. We can then apply theoretical results from Russo and Van Roy [30], together with general GP-UCB approximation guarantees [3, 59], to bound BORE’s Bayesian regret. Note, however, that this is a different type of analysis from the one presented in this paper, which considered a frequentist setting where the objective function is fixed, but unknown.

Appendix C Analysis with a non-constant quantile approximation

Our main theoretical results so far relied on the quantile $\tau$ being fixed throughout all iterations $t \in \{1, \dots, T\}$, though in practice we have to approximate the quantile based on the empirical distribution of the observations up to time $t \geq 1$. In this section, we discuss the plausibility of the theoretical results under this practical scenario.

The main impact of a time-varying quantile $\tau_t$, and the corresponding classifier $\pi_t(\mathbf{x}) := p(y \leq \tau_t \mid \mathbf{x})$, on the theoretical results is in the UCB approximation error (Theorem 1). This result requires the observation noise $\nu_{t,i} := z_{t,i} - \pi_t(\mathbf{x}_i)$, as perceived by a GP model with observations $z_{t,i} := \mathbb{I}[y_i \leq \tau_t]$, $i \in \{1, \dots, t\}$, to be sub-Gaussian when conditioned on the history. A few challenges arise from this requirement. Firstly, the past observations in the vector $\mathbf{z}_t := [z_{t,i}]_{i=1}^t$ change across iterations due to the updates in $\tau_t$. Secondly, the latent function $\pi_t$ is stochastic, as the quantile $\tau_t$ depends on the current set of observations $\mathbf{y}_t$. Lastly, it is not clear how to define a filtration for the resulting stochastic process such that the GP noise $\nu_{t,i}$ is sub-Gaussian. Nevertheless, as the number of observations increases, $\tau_t$ converges to a fixed value, making our asymptotic results valid.
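For concreteness, the time-varying labelling scheme discussed above can be sketched as follows; the observations and the percentile level here are placeholders, not values from our experiments.

import numpy as np

gamma = 0.25
rng = np.random.default_rng(2)
y = rng.normal(size=20)               # observations y_1, ..., y_t

tau_t = np.quantile(y, gamma)         # empirical quantile estimate at iteration t
z_t = (y <= tau_t).astype(float)      # labels z_{t,i} = I[y_i <= tau_t]
# Note that z_t changes whenever tau_t is updated, which is precisely what
# complicates the sub-Gaussian noise argument discussed above.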

Appendix D Experiment details

This section presents details of our experimental setup. We used PyTorch [60] for our implementation of batch BORE and BORE++, which we plan to make publicly available in the future.

D.1 Theory assessment

For this experiment, we generated a random classifier as an element of the RKHS $\mathcal{H}$ defined by the kernel $k$:

$\pi^* := \sum_{i=1}^{F} \alpha_i k(\cdot, \mathbf{x}_i^*) \in \mathcal{H}\,,$ (D.1)

where the centres $\{\mathbf{x}_i^*\}_{i=1}^F$ and the weights $\{\alpha_i\}_{i=1}^F$ were sampled i.i.d. from a unit uniform distribution $\mathcal{U}(0, 1)$, with $F := 5$. The norm of $\pi^*$ is given by:

$\lVert\pi^*\rVert_k = \sqrt{\boldsymbol{\alpha}_F^{\mathsf{T}}\mathbf{K}_F\boldsymbol{\alpha}_F}\,,$ (D.2)

where $\mathbf{K}_F := [k(\mathbf{x}_i^*, \mathbf{x}_j^*)]_{i,j=1}^F \in \mathbb{R}^{F \times F}$ and $\boldsymbol{\alpha}_F := [\alpha_1, \dots, \alpha_F]^{\mathsf{T}} \in \mathbb{R}^F$. To ensure $\pi^*(\mathbf{x}) \leq 1$, we normalised the weights by the classifier norm, i.e., $\boldsymbol{\alpha}_F := \frac{1}{\lVert\pi^*\rVert_k}\boldsymbol{\alpha}_F$, so that $\lVert\pi^*\rVert_k = 1$ and, consequently, $\pi^*(\mathbf{x}) \leq \sqrt{k(\mathbf{x}, \mathbf{x})}\,\lVert\pi^*\rVert_k = 1$ for all $\mathbf{x} \in \mathcal{X}$. The kernel was set as the squared exponential (RBF) with a length-scale of 0.1.
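The construction in (D.1)–(D.2) amounts to a few lines of code. The sketch below is a minimal reproduction under the stated RBF kernel with length-scale 0.1; the variable names and random seed are ours, not the paper's.

import numpy as np

F, ls = 5, 0.1
rng = np.random.default_rng(3)
X_star = rng.uniform(size=(F, 1))     # centres x_1*, ..., x_F*
alpha = rng.uniform(size=F)           # i.i.d. U(0, 1) weights

def k(A, B):
    # Squared-exponential (RBF) kernel with length-scale ls
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

alpha /= np.sqrt(alpha @ k(X_star, X_star) @ alpha)  # rescale so that ||pi*||_k = 1, via (D.2)

def pi_star(x):
    # pi*(x) = sum_i alpha_i k(x, x_i*), per (D.1)
    return k(x, X_star) @ alpha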

Given a threshold $\tau \in \mathbb{R}$, the objective function corresponding to $\pi^*$ satisfies:

$f(\mathbf{x}) := \tau - \Phi_\epsilon^{-1}(\pi^*(\mathbf{x})), \quad \forall \mathbf{x} \in \mathcal{X}\,.$ (D.3)

For this experiment, we set $\tau := 0$. Each trial had a different objective function generated for it. An example of a classifier and objective function pair is presented in Figure 1b (main paper). Observations were function evaluations corrupted by zero-mean Gaussian noise with variance $\sigma_\epsilon^2 := 0.01$.

The search space was configured as a finite set $\mathcal{X} := \{\mathbf{x}_i\}_{i=1}^{N_{\mathcal{X}}} \subset [0, 1]$ by sampling $N_{\mathcal{X}} := 100$ points from a unit uniform distribution. As the search space is finite, we also know $\gamma := p(y \leq \tau) = \int_{\mathcal{X}} \pi(\mathbf{x}) p(\mathbf{x}) \,\mathrm{d}\mathbf{x} = \frac{1}{N_{\mathcal{X}}} \sum_{\mathbf{x} \in \mathcal{X}} \pi^*(\mathbf{x})$.
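Continuing the sketch above, the objective (D.3) and the marginal $\gamma$ over the finite search space can be computed as follows, assuming $\Phi_\epsilon$ is the CDF of the zero-mean Gaussian observation noise with variance $\sigma_\epsilon^2$.

import numpy as np
from scipy.stats import norm

tau, sigma_eps = 0.0, 0.1             # tau := 0; noise std = sqrt(0.01)

def f(x):
    # f(x) = tau - Phi_eps^{-1}(pi*(x)), per (D.3)
    return tau - norm.ppf(pi_star(x), scale=sigma_eps)

N_X = 100
X_domain = np.random.default_rng(4).uniform(size=(N_X, 1))  # finite search space
gamma = pi_star(X_domain).mean()      # gamma = (1/N_X) sum_x pi*(x)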

Parameter           Setting
$\lambda$           0.025
$\delta$            0.1
$\tau$              0
Table D.1: Settings for BORE++ in the theory assessment experiment.

Parameter           Setting
$\lambda$           0.01
$\delta$            0.1
$\sigma_\epsilon^2$ 0.01
Table D.2: Settings for GP-UCB in the theory assessment experiment.

Regarding algorithm hyper-parameters, although any upper bound $b \geq \lVert\pi^*\rVert_k$ would work for setting up $\beta_t$, BORE++ was configured with the RKHS norm $\lVert\pi^*\rVert_k$ as the first term in the setting for $\beta_t$ (see Theorem 1). To configure GP-UCB according to its theoretical settings [25, Thm. 1], we computed the RKHS norm of the resulting $f$. The norm of $f$ as an element of $\mathcal{H}$ can be obtained by solving the following constrained optimisation problem:

$\lVert f\rVert_k = \min_{h \in \mathcal{H}} \lVert h\rVert_k, \quad \mathrm{s.t.} \quad h(\mathbf{x}) = f(\mathbf{x}), \quad \forall \mathbf{x} \in \mathcal{X}\,.$ (D.4)

As the search space is finite, the solution to this problem is available in closed form as:

$\lVert f\rVert_k = \sqrt{\mathbf{f}_{\mathcal{X}}^{\mathsf{T}}\mathbf{K}_{\mathcal{X}}^{-1}\mathbf{f}_{\mathcal{X}}}\,,$ (D.5)

where $\mathbf{f}_{\mathcal{X}} := [f(\mathbf{x})]_{\mathbf{x} \in \mathcal{X}} \in \mathbb{R}^{N_{\mathcal{X}}}$ and $\mathbf{K}_{\mathcal{X}} := [k(\mathbf{x}, \mathbf{x}')]_{\mathbf{x}, \mathbf{x}' \in \mathcal{X}}$. We set $\delta := 0.1$. For both BORE++ and GP-UCB, the information gain was recomputed at each iteration. Lastly, the regularisation factor $\lambda$ was set as $\lambda := \sigma_\epsilon^2$ for GP-UCB and as $\lambda := 0.025$ for BORE++, the latter found by grid search. In summary, the settings for BORE++ are listed in Table D.1 and those for GP-UCB in Table D.2.
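Given the kernel matrix over the finite search space, the closed form (D.5) is a one-liner. The sketch below continues the previous ones; the small jitter term is our addition purely for numerical stability, not part of the paper's setup.

import numpy as np

f_X = f(X_domain)                                   # objective values on the domain
K_X = k(X_domain, X_domain) + 1e-9 * np.eye(N_X)    # jitter added for stability (ours)
f_norm = np.sqrt(f_X @ np.linalg.solve(K_X, f_X))   # ||f||_k per (D.5)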

D.2 Global optimisation benchmarks

For each problem, all methods used 10 initial points uniformly sampled from the search space. As a performance indicator, we measured the simple regret:

$r_t^* := \min_{i \leq t} r_i = \min_{i \leq t} f(\mathbf{x}_i) - f(\mathbf{x}^*)\,, \quad t \geq 1\,,$ (D.6)

as the global minimum of each of the considered benchmark functions is known. All objective function evaluations were provided to the algorithms free of noise.
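Computing the simple regret (D.6) from a trace of objective values reduces to a running minimum; the helper below is a convenience sketch with hypothetical argument names.

import numpy as np

def simple_regret(f_vals, f_min):
    # r_t^* = min_{i <= t} f(x_i) - f(x^*), per (D.6)
    return np.minimum.accumulate(np.asarray(f_vals) - f_min)

# e.g., simple_regret([3.0, 1.2, 2.5], 1.0) -> array([2.0, 0.2, 0.2])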

Parameter        Setting
Optimiser        Adam
Batch size       64
Steps            100*
Table D.3: Stochastic gradient descent training settings for batch BORE. (*) For the Six-hump Camel problem, we applied 200 steps.

Parameter        Setting
Step size        0.001
Decay rate       0.9
Number of steps  1000*
Table D.4: SVGD settings for batch BORE. (*) For the Hartmann 3D problem, we used 500 steps.

Batch BORE was run with a percentile $\gamma := 0.25$, which was used to estimate the empirical quantile $\tau$ at every iteration $t \in \{1, \dots, T\}$. The method’s classifier was a multilayer perceptron with 2 hidden layers of 32 units each, trained to minimise the binary cross-entropy loss. The activation function was the rectified linear unit (ReLU), except for the Hartmann 3D and Six-hump Camel problems, which were run with the exponential linear unit (ELU) instead. The neural networks were trained via stochastic gradient descent with the settings presented in Table D.3. SVGD was run applying Adadelta to configure its steps according to the settings in Table D.4. The SVGD kernel was set as the squared exponential (RBF), using the median trick to adjust its length-scale [35].
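For reference, a minimal PyTorch sketch of the classifier described above is given below: a 2-hidden-layer, 32-unit MLP trained with binary cross-entropy. The input dimension, labels, and training data are placeholders, and the loop only mirrors the fixed-step scheme of Table D.3, not the exact training code.

import torch
import torch.nn as nn

def make_classifier(dim, act=nn.ReLU):
    # 2 hidden layers of 32 units; sigmoid output for class probabilities
    return nn.Sequential(
        nn.Linear(dim, 32), act(),
        nn.Linear(32, 32), act(),
        nn.Linear(32, 1), nn.Sigmoid(),
    )

model = make_classifier(dim=3)             # e.g., a 3-dimensional problem
opt = torch.optim.Adam(model.parameters())
loss_fn = nn.BCELoss()

X = torch.rand(64, 3)                      # placeholder mini-batch of inputs
z = (torch.rand(64, 1) < 0.25).float()     # placeholder labels z_i = I[y_i <= tau]
for _ in range(100):                       # fixed number of gradient steps
    opt.zero_grad()
    loss_fn(model(X), z).backward()
    opt.step()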

LP-EI [11] was run using L-BFGS [61] to optimise its acquisition function. The optimisation settings were kept as the default for GPyOpt [62].

The $q$-EI method [10] was run using the BoTorch implementation [63]. The acquisition function was optimised via multi-start optimisation with L-BFGS [61] using 10 random restarts. Monte Carlo integration for $q$-EI used 256 samples.

D.3 Comparisons on real-data benchmarks

We here present experiments comparing the sequential version of BORE++ against BORE and other baselines, including traditional BO methods such as GP-UCB and GP-EI [1], the Tree-structured Parzen Estimator (TPE) [15], and random search, on real-data benchmarks. In particular, we assessed the algorithms on some of the same benchmarks used in the original BORE paper [9].

D.3.1 Algorithm settings

All versions of BORE were set with $\gamma := 0.25$. The original BORE algorithm used a 2-layer, 32-unit fully connected neural network as a classifier. The network was trained via stochastic gradient descent using Adam [64]. As in the other experiments in this paper, we followed the same scheme that keeps the number of gradient steps per epoch fixed [see 9, Appendix J.3], set in our case to 100, with a mini-batch size of 64. The probabilistic least-squares version of BORE and BORE++ were configured with a GP classifier using the rational quadratic kernel [2, Ch. 4] with output scaling and independent length-scales per input dimension. All GP-based algorithms used the same type of kernel. GP hyper-parameters were estimated by maximising the GP’s marginal likelihood at each iteration using BoTorch’s hyper-parameter estimation methods, which apply L-BFGS by default [63]. BORE++ was set with a fixed value for its parameter, $\beta_t := 3$, and the regularisation factor was set as $\lambda := 0.025$. Acquisition function optimisation was run for 500 to 1000 iterations via L-BFGS with multiple restarts using SciPy’s toolkit [65]. Lastly, for the experiment with the MNIST dataset, we also used the Tree-structured Parzen Estimator (TPE) by Bergstra et al. [15] with default settings from the HyperOpt package. All algorithms were run for a given number of independent trials, and results are presented with their 95% confidence intervals (calculated via linear interpolation when the number of trials is small).

D.3.2 Benchmarks

Neural network hyper-parameter tuning.

We first considered two of the neural network (NN) tuning problems found in Tiao et al. [9], where a two-layer feed-forward NN is trained for regression. The NN is trained for 100 epochs with the Adam optimiser [64], and the objective is the validation mean-squared error (MSE). The hyper-parameters are the initial learning rate, learning rate schedule, and batch size, along with the layer-specific widths, activations, and dropout rates. In particular, we considered the Parkinson’s telemonitoring [66] and CT slice localisation [67] datasets, available at UCI’s machine learning repository [68], and used HPOBench [69], which tabulates, for each dataset, the MSEs resulting from all possible (62,208) configurations. The datasets and code are publicly available (tabular benchmarks: https://github.com/automl/nas_benchmarks). Each algorithm was run for 500 iterations across 10 independent trials.

Racing line optimisation.

We compare the different versions of BORE against a random search baseline on the UC Berkeley racing line optimisation benchmark [70], using the code provided by Jain and Morari [71]. The task consists of finding the optimal racing line across a given track by optimising the positions of a set of 10 waypoints on the Cartesian plane along the track’s centre line so as to reduce the lap time for a given car, resulting in a 20-dimensional problem. For this track, the car model is based on UC Berkeley’s 1:10-scale miniature racing car open-source model (https://github.com/MPC-Berkeley/barc/tree/devel-ugo). Each algorithm was run for 50 iterations across 5 independent trials.

Neural architecture search.

Lastly, we compare all algorithms on a neural network architecture search problem. The task consists of optimising hyper-parameters which control the training process (initial learning rate, batch size, dropout, exponential decay factor for learning rate) and the architecture (number of layers and units per layer) of a feed-forward neural network on the MNIST hand-written digit classification task [72]. The objective is to minimise the NN classification error. To allow for a wide range of hyper-parameter evaluations, this task uses a random forest surrogate trained with data obtained by training the actual NNs on MNIST [73]. For this experiment, each method was run for 200 iterations across 10 independent trials.

Appendix E Dimensionality effect on batch BORE vs. batch BORE++ performance

(a) 3D  (b) 4D  (c) 5D
Figure E.1: BORE vs. BORE++ in the batch setting tested on the Rosenbrock function at varying search space dimensionalities. The plots compare the cumulative regret of each algorithm averaged over 10 runs. Shaded areas correspond to the 95% confidence interval.

We compared batch BORE against the batch BORE++ algorithm on a synthetic optimisation problem with the Rosenbrock function. The dimensionality of the search space was varied. The cumulative regret curves for each algorithm are presented in Figure E.1.

Both algorithms were configured with a Bayesian logistic regression classifier using random Fourier features [74] as feature maps based on the squared-exponential kernel. The number of features was set to 300, and the classifier was trained via expectation maximisation. Observations were corrupted by additive zero-mean Gaussian noise with a small variance $\sigma_\epsilon^2 = 10^{-4}$, and each model was set accordingly. To demonstrate the practicality of the method, the UCB parameter for BORE++ was fixed at $\beta_t := 3$ across all iterations $t \geq 1$, instead of applying the theoretical setup. SVGD was configured in its second-order version [38], applying L-BFGS to adjust its steps [61].
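A minimal sketch of the random Fourier feature map used by this classifier is given below, approximating the squared-exponential kernel [74] with 300 features as stated; the length-scale, seed, and data shapes are illustrative assumptions of ours.

import numpy as np

def rff_features(X, n_features=300, ls=1.0, seed=0):
    # phi(x)^T phi(x') approximates exp(-||x - x'||^2 / (2 ls^2)) for large n_features
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / ls, size=(X.shape[1], n_features))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)             # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Bayesian logistic regression is then fit on Phi = rff_features(X), giving a
# posterior over linear weights in the feature space.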

As the results in Figure E.1 show, batch BORE++ has a clear advantage over batch BORE in low dimensions. However, the performance gains become less pronounced at higher dimensionalities and eventually deteriorate. One factor explaining this behaviour is that, as the dimensionality increases, uncertainty estimates become less useful: distances between data points grow substantially, affecting the posterior variance estimates provided by translation-invariant kernels, such as the squared-exponential kernel on which our feature maps were based. Other classification models may lead to different behaviours, and we leave their investigation for future work.