
Optimal Learning for Sequential Decision Making for Expensive Cost Functions with Stochastic Binary Feedbacks

Yingfei Wang Department of Computer Science, Princeton University, Princeton, NJ 08540, yingfei@cs.princeton.edu    Chu Wang The Program in Applied and Computational Mathematics, Princeton University, Princeton, NJ 08544    Warren Powell Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ 08544
Abstract

We consider the problem of sequentially making decisions that are rewarded by “successes" and “failures" which can be predicted through an unknown relationship that depends on a partially controllable vector of attributes for each instance. The learner takes an active role in selecting samples from the instance pool. The goal is to maximize the probability of success in either offline (training) or online (testing) phases. Our problem is motivated by real-world applications where observations are time consuming and/or expensive. We develop a knowledge gradient policy using an online Bayesian linear classifier to guide the experiment by maximizing the expected value of information of labeling each alternative. We provide a finite-time analysis of the estimation error and show that the maximum likelihood estimator produced by the KG policy is consistent and asymptotically normal. We also show that the knowledge gradient policy is asymptotically optimal in an offline setting. This work further extends the knowledge gradient to the setting of contextual bandits. We report the results of a series of experiments that demonstrate its efficiency.

1 Introduction

There are many real-world optimization tasks where observations are time consuming and/or expensive. This often occurs in experimental sciences where testing different values of controllable parameters of a physical system is expensive. Another example arises in health services, where physicians have to make medical decisions (e.g. a course of drugs, surgery, and expensive tests) and we can characterize an outcome as a success (patient does not need to return for more treatment) or a failure (patient does need followup care such as repeated operations). This puts us in the setting of optimal learning where the number of objective function samples is limited, requiring that we learn from our decisions as quickly as possible. This represents a distinctly different learning environment than what has traditionally been considered using popular policies such as upper confidence bound (UCB) methods, which have proven effective in fast-paced settings such as learning ad-clicks.

In this paper, we are interested in a sequential decision-making setting where at each time step we choose one of finitely many decisions and observe a stochastic binary reward where each instance is described by various controllable and uncontrollable attributes. The goal is to maximize the probability of success in either offline (training) or online (testing) phases. There are a number of applications that can easily fit into our success/failure model:

  • Producing single-walled nanotubes. Scientists use physical procedures to produce nanotubes. A procedure can produce either single-walled or double-walled nanotubes through an unknown relationship with the controllable parameters, e.g. laser power, ethylene, hydrogen and pressure. Yet only the single-walled nanotubes are acceptable. The problem is to quickly learn the best parameter values with the highest probability of success (producing single-walled nanotubes).

  • Personalized health care. We consider the problem of how to choose clinical pathways (including surgery, medication and tests) for different upcoming patients to maximize the success of the treatment.

  • Minimizing the default rate for loan applications. When facing borrowers with different background information and credit history, a lending company needs to decide whether to grant a loan, and with what terms (interest rate, payment schedule).

  • Enhancing the acceptance of the admitted students. A university needs to decide which students to admit and how much aid to offer so that the students will then accept the offer of admission and matriculate.

Our work will initially focus on offline settings such as laboratory experiments or medical trials where we are not punished for errors incurred during training and instead are only concerned with the final recommendation after the offline training phase. We then extend our discussion to online learning settings where the goal is to minimize cumulative regret. We also consider problems with partially controllable attributes, a setting known as contextual bandits. For example, in health care problems, we do not have control over the patients (each represented by a feature vector including demographic characteristics, diagnoses and medical history) and can only choose the medical decision. A university cannot control which students apply to it. When deciding whether to grant a loan, the lending company cannot choose the personal information of the borrowers.

Scientists can draw on an extensive body of literature on the classic design of experiments [12, 59, 42] where the goal is to decide which observations to make when fitting a function. Yet in our setting, the decisions are guided by a well-defined utility function (that is, maximizing the probability of success). The problem is related to the literature on active learning [49, 55, 18, 50], where our setting is most similar to membership query synthesis, in which the learner may request labels for any unlabeled instance in the input space to learn a classifier that accurately predicts the labels of new examples. By contrast, our goal is to maximize a utility function such as the success of a treatment. Moreover, the expense of labeling each alternative sharpens the conflict between learning the predictive model and finding the best alternative.

Another similar sequential decision making setting is multi-armed bandit problems [2, 6, 14, 39, 52, 7, 56]. Different belief models have been studied under the name of contextual bandits, including linear models [9] and Gaussian process regression [33]. The focus of bandit work is minimizing cumulative regret in an online setting, while we consider the performance of the final recommendation after an offline training phase. There is recent work that addresses the problem we describe here by minimizing the simple regret. However, UCB-type policies [1] are not best suited for expensive experiments, and the work on simple regret minimization [28, 27] mainly focuses on real-valued functions and does not consider problems with stochastic binary feedback.

There is a literature on Bayesian optimization [26, 8, 44]. Efficient global optimization (EGO) and related methods such as sequential kriging optimization (SKO) [32, 30] assume a Gaussian process belief model, which does not scale to the higher-dimensional settings that we consider. Others assume lookup table or low-dimensional parametric belief models, e.g. response surface/surrogate models [24, 31, 46]. The existing literature mainly focuses on real-valued functions and none of these methods are directly suitable for our problem of maximizing the probability of success with binary outcomes. A particularly relevant body of work in the Bayesian optimization literature is the expected improvement (EI) for binary outputs [53]. Yet when EI decides which alternative to measure, it is based on the expected improvement over the current predictive posterior distribution while ignoring the potential change of the posterior distribution resulting from the next stochastic measurement (see Section 5.6 of [44] and [30] for detailed explanations).

We investigate a knowledge gradient policy that maximizes the value of information, since this approach is particularly well suited to problems where observations are expensive. After its first appearance for ranking and selection problems [16], KG has been extended to various other belief models (e.g. [41, 43, 57]). Yet there is no KG variant designed for binary classification with parametric belief models. In this paper, we extend the KG policy to the setting of classification problems under a Bayesian classification belief model which introduces the computational challenge of working with nonlinear belief models.

This paper makes the following contributions. 1. Due to the sequential nature of the problem, we develop a fast online Bayesian linear classification procedure based on Laplace approximation for general link functions to recursively predict the response of each alternative. 2. We design a knowledge-gradient type policy for stochastic binary responses to guide the experiment. It can work with any choice of link function σ()\sigma(\cdot) and approximation procedure. 3. Since the knowledge gradient policy adaptively chooses the next sampled points, we provide a finite-time analysis on the estimation error that does not rely on the i.i.d. assumption. 4. We show that the maximum likelihood estimator based on the adaptively sampled points by the KG policy is consistent and asymptotically normal. We show furthermore that the knowledge gradient policy is asymptotically optimal. 5. We extend the KG policy to contextual bandit settings with stochastic binary outcomes.

The paper is organized as follows. In Section 2, we establish the mathematical model for the problem of sequentially maximizing the response under binary outcomes. We give a brief overview of (Bayesian) linear classification in Section 3. We develop an online Bayesian linear classification procedure based on Laplace approximation in Section 4. In Section 5, we propose a knowledge-gradient type policy for stochastic binary responses with or without contextual information, and we theoretically study its finite-time and asymptotic behavior. Extensive demonstrations and comparisons of methods for the offline objective are presented in Section 6.

2 Problem formulation

We assume that we have a finite set of alternatives 𝒙𝒳={𝒙1,,𝒙M}\bm{x}\in\mathcal{X}=\{\bm{x}_{1},\dots,\bm{x}_{M}\}. The observation of measuring each 𝒙\bm{x} is a binary outcome y{1,+1}y\in\{-1,+1\}/{failure, success} with some unknown probability p(y=+1|𝒙)p(y=+1|\bm{x}). The learner sequentially chooses a series of points (𝒙0,,𝒙N1)(\bm{x}^{0},\dots,\bm{x}^{N-1}) to run the experiments. Under a limited measurement budget NN, the goal of the learner is to recommend an implementation decision 𝒙N\bm{x}^{N} that maximizes p(y=+1|𝒙N)p(y=+1|\bm{x}^{N}).

We adopt probabilistic models for classification. Under general assumptions, the probability of success can be written as a link function acting on a linear function of the feature vector

p(y=+1|𝒙)=σ(𝒘T𝒙).p(y=+1|\bm{x})=\sigma(\bm{w}^{T}\bm{x}).

In this paper, we illustrate the ideas using the logistic link function σ(a)=11+exp(a)\sigma(a)=\frac{1}{1+\text{exp}(-a)} and probit link function σ(a)=Φ(a)=a𝒩(s|0,12)ds\sigma(a)=\Phi(a)=\int_{-\infty}^{a}\mathcal{N}(s|0,1^{2})\text{d}s given its analytic simplicity and popularity, but any monotonically increasing function σ:[0,1]\sigma:\mathbb{R}\mapsto[0,1] can be used. The main difference between the two sigmoid functions is that the logistic function has slightly heavier tails than the normal CDF. Classification using the logistic function is called logistic regression and that using the normal CDF is called probit regression.
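
For concreteness, both link functions are easy to evaluate numerically; the following is a minimal Python sketch (the helper names logistic and probit are ours, not part of the model):

```python
import numpy as np
from scipy.stats import norm

def logistic(a):
    """Logistic link: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def probit(a):
    """Probit link: sigma(a) = Phi(a), the standard normal CDF."""
    return norm.cdf(a)

# Both map a linear score w^T x to a success probability in [0, 1];
# the logistic link has slightly heavier tails than the probit link.
scores = np.linspace(-3.0, 3.0, 7)
print(logistic(scores))
print(probit(scores))
```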

We start with a multivariate prior distribution for the unknown parameter vector 𝒘\bm{w}. At iteration nn, we choose an alternative 𝒙n\bm{x}^{n} to measure and observe a binary outcome yn+1y^{n+1} assuming labels are generated independently given 𝒘\bm{w}. Each alternative can be evaluated more than once with potentially different outcomes. Let 𝒟n={(𝒙i,yi+1)}i=0n\mathcal{D}^{n}=\{(\bm{x}^{i},y^{i+1})\}_{i=0}^{n} denote the previously measured data set for any n=0,,Nn=0,\dots,N. Define the filtration (n)n=0N(\mathcal{F}^{n})_{n=0}^{N} by letting n\mathcal{F}^{n} be the sigma-algebra generated by 𝒙0,y1,,𝒙n1,yn\bm{x}^{0},y^{1},\dots,\bm{x}^{n-1},y^{n}. We use n\mathcal{F}^{n} and 𝒟n\mathcal{D}^{n} interchangeably. Note that the notation here is slightly different from the (passive) PAC learning model where the data are i.i.d. and are denoted as {(𝒙i,yi)}\{(\bm{x}_{i},y_{i})\}. Yet in our (adaptive) sequential decision setting, measurement and implementation decisions 𝒙n\bm{x}^{n} are restricted to be n\mathcal{F}^{n}-measurable so that decisions may only depend on measurements made in the past. This notation, with a superscript indexing the time stamp, is standard, for example, in control theory, stochastic optimization and optimal learning. We use Bayes’ theorem to form a sequence of posterior predictive distributions p(𝒘|𝒟n)p(\bm{w}|\mathcal{D}^{n}).

The next lemma states the equivalence of using true probabilities and sample estimates when evaluating a policy. The proof is provided in the supplementary material.

Lemma 2.1.

Let Π\Pi be the set of policies, πΠ\pi\in\Pi, and 𝐱π=argmax𝐱p(y=+1|𝐱,𝒟N)\bm{x}^{\pi}=\arg\max_{\bm{x}}p(y=+1|\bm{x},\mathcal{D}^{N}) be the implementation decision after the budget NN is exhausted. Then

𝔼𝒘[p(y=+1|𝒙π,𝒘)]=𝔼𝒘[max𝒙p(y=+1|𝒙,𝒟N)],\mathbb{E}_{\bm{w}}[p(y=+1|\bm{x}^{\pi},\bm{w})]=\mathbb{E}_{\bm{w}}[\max_{\bm{x}}p(y=+1|\bm{x},\mathcal{D}^{N})],

where the expectation is taken over the prior distribution of 𝐰\bm{w}.

Denoting 𝒳I\mathcal{X}^{I} as an implementation policy for selecting an alternative after the measurement budget is exhausted, 𝒳I\mathcal{X}^{I} is a mapping from the history 𝒟N\mathcal{D}^{N} to an alternative 𝒳I(𝒟N)\mathcal{X}^{I}(\mathcal{D}^{N}). Then as a corollary of Lemma 2.1, we have [44]

max𝒳I𝔼[p(y=+1|𝒳I(𝒟N))]=max𝒙p(y=+1|𝒙,𝒟N).\max_{\mathcal{X}^{I}}\mathbb{E}\big{[}p\big{(}y=+1|\mathcal{X}^{I}(\mathcal{D}^{N})\big{)}\big{]}=\max_{\bm{x}}p(y=+1|\bm{x},\mathcal{D}^{N}).

In other words, the optimal decision at time NN is to go with our final set of beliefs. By the equivalence of using true probabilities and sample estimates when evaluating a policy, as stated in Lemma 2.1, although we want to learn the unknown true value max𝒙p(y=+1|𝒙)\max_{\bm{x}}p(y=+1|\bm{x}), we may write our objective function as

maxπΠ𝔼π[max𝒙p(y=+1|𝒙,𝒟N)].\max_{\pi\in\Pi}\mathbb{E}^{\pi}[\max_{\bm{x}}p(y=+1|\bm{x},\mathcal{D}^{N})]. (1)

3 Background: Linear classification

Linear classification, especially logistic regression, is widely used in machine learning for binary classification [29]. Assume that the probability of success p(y=+1|𝒙)p(y=+1|\bm{x}) is a parameterized function σ(𝒘T𝒙)\sigma(\bm{w}^{T}\bm{x}) and further assume that observations are independent of each other. Given a training set 𝒟={(𝒙i,yi)}i=1n\mathcal{D}=\{(\bm{x}_{i},y_{i})\}_{i=1}^{n} with 𝒙i\bm{x}_{i} a dd-dimensional vector and yi{1,+1}y_{i}\in\{-1,+1\}, the likelihood function p(𝒟|𝒘)p(\mathcal{D}|\bm{w}) is p(𝒟|𝒘)=i=1nσ(yi𝒘T𝒙i).p(\mathcal{D}|\bm{w})=\prod_{i=1}^{n}\sigma(y_{i}\cdot\bm{w}^{T}\bm{x}_{i}). The weight vector 𝒘\bm{w} is found by maximizing the likelihood p(𝒟|𝒘)p(\mathcal{D}|\bm{w}) or equivalently, minimizing the negative log likelihood:

min𝒘i=1nlog(σ(yi𝒘T𝒙i)).\min_{\bm{w}}\sum_{i=1}^{n}-\log(\sigma(y_{i}\cdot\bm{w}^{T}\bm{x}_{i})).

In order to avoid over-fitting, especially when there are a large number of parameters to be learned, l2l_{2} regularization is often used. The estimate of the weight vector 𝒘\bm{w} is then given by:

min𝒘λ2𝒘2i=1nlog(σ(yi𝒘T𝒙i)).\min_{\bm{w}}\frac{\lambda}{2}\|\bm{w}\|^{2}-\sum_{i=1}^{n}\log(\sigma(y_{i}\cdot\bm{w}^{T}\bm{x}_{i})). (2)

It can be shown that this log-likelihood function is globally concave in 𝒘\bm{w} for both logistic and probit regression. As a result, numerous optimization techniques are available, such as steepest ascent, Newton’s method and conjugate gradient ascent.
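
As an illustration, the l2l_{2}-regularized estimate in Eq. (2) can be computed with any off-the-shelf convex optimizer; the sketch below is our own and assumes the logistic link:

```python
import numpy as np
from scipy.optimize import minimize

def fit_l2_logistic(X, y, lam):
    """Solve Eq. (2): min_w  lam/2 ||w||^2 - sum_i log sigma(y_i w^T x_i),
    with sigma the logistic link.  X is an n x d array, y is in {-1,+1}^n."""
    def objective(w):
        margins = y * (X @ w)
        # log(1 + exp(-m)) evaluated stably via logaddexp
        return 0.5 * lam * w @ w + np.sum(np.logaddexp(0.0, -margins))

    def gradient(w):
        margins = y * (X @ w)
        # d/dw of -log sigma(y w^T x) is -y x sigma(-y w^T x)
        return lam * w - X.T @ (y / (1.0 + np.exp(margins)))

    w0 = np.zeros(X.shape[1])
    return minimize(objective, w0, jac=gradient, method="BFGS").x
```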

This logic is suitable for batch learning, where we only need to conduct the minimization once to find the estimate of the weight vector 𝒘\bm{w} based on a given batch of training examples 𝒟\mathcal{D}. Yet due to the sequential nature of our problem setting, observations come one by one as in online learning. After each new observation, if we retrain the linear classifier using all the previous data, we need to re-do the minimization, which is computationally inefficient. In this paper, we instead extend Bayesian linear classification to perform recursive updates with each observation.

A Bayesian approach to linear classification models requires a prior distribution for the weight parameters 𝒘\bm{w}, and the ability to compute the conditional posterior p(𝒘|𝒟)p(\bm{w}|\mathcal{D}) given the observation. Specifically, suppose we begin with an arbitrary prior p(𝒘)p(\bm{w}) and apply Bayes’ theorem to calculate the posterior: p(𝒘|𝒟)=1Zp(𝒟|𝒘)p(𝒘),p(\bm{w}|\mathcal{D})=\frac{1}{Z}p(\mathcal{D}|\bm{w})p(\bm{w}), where the normalization constant ZZ is the unknown evidence. An l2l_{2}-regularized logistic regression can be interpreted as a Bayesian model with a Gaussian prior on the weights with standard deviation 1/λ1/\sqrt{\lambda}.

Unfortunately, exact Bayesian inference for linear classifiers is intractable since the evaluation of the posterior distribution comprises a product of sigmoid functions; in addition, the integral in the normalization constant is intractable as well, for both the logistic and probit functions. We can either use analytic approximations to the posterior, or solutions based on Monte Carlo sampling, foregoing a closed-form expression for the posterior. In this paper, we consider different analytic approximations to the posterior to make the computation tractable.

3.1 Online Bayesian probit regression based on assumed Gaussian density filtering

Assumed density filtering (ADF) is a general online learning schema for computing approximate posteriors in statistical models [5, 35, 40, 48]. In ADF, observations are processed one by one, updating the posterior which is then approximated and is used as the prior distribution for the next observation.

For a given Gaussian prior distribution on some latent parameter 𝜽\bm{\theta}, p(𝜽)=𝒩(𝜽|𝝁,𝚺)p(\bm{\theta})=\mathcal{N}(\bm{\theta}|\bm{\mu},\bm{\Sigma}) and a likelihood t(𝜽):=p(𝒟|𝜽)t(\bm{\theta}):=p(\mathcal{D}|\bm{\theta}), the posterior p(𝜽|𝒟)p(\bm{\theta}|\mathcal{D}) is generally non-Gaussian,

p(𝜽|𝒟)=t(𝜽)p(𝜽)t(𝜽~)p(𝜽~)d𝜽~.p(\bm{\theta}|\mathcal{D})=\frac{t(\bm{\theta})p(\bm{\theta})}{\int t(\tilde{\bm{\theta}})p(\tilde{\bm{\theta}})\text{d}\tilde{\bm{\theta}}}.

We find the best approximation by minimizing the Kullback-Leibler (KL) divergence between the true posterior p(𝜽|𝒟)p(\bm{\theta}|\mathcal{D}) and the Gaussian approximation. It is well known that when q(x)q(x) is restricted to be Gaussian, the distribution q(x)q(x) that minimizes KL(p(x)||q(x))(p(x)||q(x)) is the one whose first and second moments match those of p(x)p(x). It can be shown that the Gaussian approximation q^(𝜽)=𝒩(𝜽|𝝁^,𝚺^)\hat{q}(\bm{\theta})=\mathcal{N}(\bm{\theta}|\hat{\bm{\mu}},\hat{\bm{\Sigma}}) found by moment matching is given as:

𝝁^=𝝁+𝚺𝒈,𝚺^=𝚺𝚺(𝒈𝒈T2𝑮)𝚺,\hat{\bm{\mu}}=\bm{\mu}+\bm{\Sigma}\bm{g},~~~\hat{\bm{\Sigma}}=\bm{\Sigma}-\bm{\Sigma}(\bm{g}\bm{g}^{T}-2\bm{G})\bm{\Sigma}, (3)

where the vector 𝒈\bm{g} and the matrix 𝑮\bm{G} are given by

𝒈=log(Z(𝝁~,𝚺~))𝝁~|𝝁~=𝝁,𝚺~=𝚺,𝑮=log(Z(𝝁~,𝚺~))𝚺~|𝝁~=𝝁,𝚺~=𝚺,\bm{g}=\frac{\partial\log\bigg{(}Z(\tilde{\bm{\mu}},\tilde{\bm{\Sigma}})\bigg{)}}{\partial\tilde{\bm{\mu}}}|_{\tilde{\bm{\mu}}=\bm{\mu},\tilde{\bm{\Sigma}}=\bm{\Sigma}},~~~\bm{G}=\frac{\partial\log\bigg{(}Z(\tilde{\bm{\mu}},\tilde{\bm{\Sigma}})\bigg{)}}{\partial\tilde{\bm{\Sigma}}}|_{\tilde{\bm{\mu}}=\bm{\mu},\tilde{\bm{\Sigma}}=\bm{\Sigma}},

and the normalization function ZZ is defined by

Z(𝝁,𝚺):=t(𝜽~)𝒩(𝜽~|𝝁,𝚺)d𝜽~.Z(\bm{\mu},\bm{\Sigma}):=\int t(\tilde{\bm{\theta}})\mathcal{N}(\tilde{\bm{\theta}}|\bm{\mu},\bm{\Sigma})\text{d}\tilde{\bm{\theta}}.

For the sake of analytic convenience, we only consider probit regression under assumed Gaussian density filtering. Specifically, the distribution of 𝒘\bm{w} after nn observations is modeled as p(𝒘)=𝒩(𝒘|𝝁n,𝚺n)p(\bm{w})=\mathcal{N}(\bm{w}|\bm{\mu}^{n},\bm{\Sigma}^{n}). The likelihood function for the next available data (𝒙,y)(\bm{x},y) is t(𝒘):=Φ(y𝒘T𝒙)t(\bm{w}):=\Phi(y\bm{w}^{T}\bm{x}). Thus we have,

p(𝒘|𝒙,y)Φ(y𝒘T𝒙)𝒩(𝒘|𝝁n,𝚺n).p(\bm{w}|\bm{x},y)\propto\Phi(y\bm{w}^{T}\bm{x})\mathcal{N}(\bm{w}|\bm{\mu}^{n},\bm{\Sigma}^{n}).

Since the convolution of the normal CDF and a Gaussian distribution is another normal CDF, moment matching (3) results in an analytical solution to the Gaussian approximation:

𝝁n+1\displaystyle\bm{\mu}^{n+1} =\displaystyle= 𝝁n+y𝚺n𝒙1+𝒙T𝚺n𝒙v(y𝒙T𝝁n1+𝒙T𝚺n𝒙),\displaystyle\bm{\mu}^{n}+\frac{y\bm{\Sigma}^{n}\bm{x}}{\sqrt{1+\bm{x}^{T}\bm{\Sigma}^{n}\bm{x}}}v\bigg{(}\frac{y\bm{x}^{T}\bm{\mu}^{n}}{\sqrt{1+\bm{x}^{T}\bm{\Sigma}^{n}\bm{x}}}\bigg{)}, (4)
𝚺n+1\displaystyle\bm{\Sigma}^{n+1} =\displaystyle= 𝚺n(𝚺n𝒙)(𝚺n𝒙)T1+𝒙T𝚺n𝒙w(y𝒙T𝝁n1+𝒙T𝚺n𝒙),\displaystyle\bm{\Sigma}^{n}-\frac{(\bm{\Sigma}^{n}\bm{x})(\bm{\Sigma}^{n}\bm{x})^{T}}{1+\bm{x}^{T}\bm{\Sigma}^{n}\bm{x}}w\bigg{(}\frac{y\bm{x}^{T}\bm{\mu}^{n}}{\sqrt{1+\bm{x}^{T}\bm{\Sigma}^{n}\bm{x}}}\bigg{)}, (5)

where

v(z):=𝒩(z|0,1)Φ(z) and w(z):=v(z)(v(z)+z).v(z):=\frac{\mathcal{N}(z|0,1)}{\Phi(z)}\text{ and }w(z):=v(z)\Big{(}v(z)+z\Big{)}.

In this work, we focus on diagonal covariance matrices 𝚺n\bm{\Sigma}^{n} with (σin)2(\sigma_{i}^{n})^{2} as the diagonal elements, due to computational simplicity and the equivalence with l2l_{2} regularization, resulting in the following update for the posterior parameters:

μin+1\displaystyle\mu^{n+1}_{i} =\displaystyle= μin+yxi(σin)2σ~v(y𝒙T𝝁nσ~),\displaystyle\mu^{n}_{i}+\frac{yx_{i}(\sigma_{i}^{n})^{2}}{\tilde{\sigma}}v(\frac{y\bm{x}^{T}\bm{\mu}^{n}}{\tilde{\sigma}}), (6)
(σin+1)2\displaystyle(\sigma_{i}^{n+1})^{2} =\displaystyle= (σin)2xi2(σin)4σ~2w(y𝒙T𝝁nσ~),\displaystyle(\sigma_{i}^{n})^{2}-\frac{x_{i}^{2}(\sigma^{n}_{i})^{4}}{\tilde{\sigma}^{2}}w(\frac{y\bm{x}^{T}\bm{\mu}^{n}}{\tilde{\sigma}}), (7)

where σ~2:=1+j=1d(σjn)2xj2\tilde{\sigma}^{2}:=1+\sum_{j=1}^{d}(\sigma_{j}^{n})^{2}x_{j}^{2}. See, for example, [22] and [10] for successful applications of this online probit regression model in prediction of click-through rates and stream-based active learning.
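
For reference, the diagonal ADF update of Eqs. (6)-(7) amounts to only a few vectorized operations; the sketch below is our own implementation for the probit link:

```python
import numpy as np
from scipy.stats import norm

def v(z):
    return norm.pdf(z) / norm.cdf(z)

def w_corr(z):
    return v(z) * (v(z) + z)

def adf_update(mu, sigma2, x, y):
    """One ADF step, Eqs. (6)-(7): mu and sigma2 are the mean and variance
    vectors of the diagonal Gaussian belief, x is the feature vector, y in {-1,+1}."""
    s_tilde = np.sqrt(1.0 + np.sum(sigma2 * x**2))
    z = y * (x @ mu) / s_tilde
    mu_new = mu + y * x * sigma2 / s_tilde * v(z)                    # Eq. (6)
    sigma2_new = sigma2 - x**2 * sigma2**2 / s_tilde**2 * w_corr(z)  # Eq. (7)
    return mu_new, sigma2_new
```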

Due to the popularity of logistic regression and the computational limitations of ADF (on general link functions other than probit function), we develop an online Bayesian linear classification procedure for general link functions to recursively predict the response of each alternative in the next section.

4 Online Bayesian Linear Classification based on Laplace approximation

In this section, we consider the Laplace approximation to the posterior and develop an online Bayesian linear classification schema for general link functions.

4.1 Laplace approximation

Laplace’s method aims to find a Gaussian approximation to a probability density defined over a set of continuous variables. It can be obtained by finding the mode of the posterior distribution and then fitting a Gaussian distribution centered at that mode [4]. Specifically, define the logarithm of the unnormalized posterior distribution as

Ψ(𝒘)=logp(𝒟|𝒘)+logp(𝒘).\displaystyle\Psi(\bm{w})=\log p(\mathcal{D}|\bm{w})+\log p(\bm{w}).

Since the logarithm of a Gaussian distribution is a quadratic function, we consider a second-order Taylor expansion to Ψ\Psi around its MAP (maximum a posteriori) solution 𝒘^=argmax𝒘Ψ(𝒘)\hat{\bm{w}}=\arg\max_{\bm{w}}\Psi(\bm{w}):

Ψ(𝒘)Ψ(𝒘^)12(𝒘𝒘^)T𝑯(𝒘𝒘^),\Psi(\bm{w})\approx\Psi(\hat{\bm{w}})-\frac{1}{2}(\bm{w}-\hat{\bm{w}})^{T}\bm{H}(\bm{w}-\hat{\bm{w}}), (8)

where 𝑯\bm{H} is the Hessian of the negative log posterior evaluated at 𝒘^\hat{\bm{w}}:

𝑯=2Ψ(𝒘)|𝒘=𝒘^.\bm{H}=-\nabla^{2}\Psi(\bm{w})|_{\bm{w}=\hat{\bm{w}}}.

By exponentiating both sides of Eq. (8), we can see that the Laplace approximation results in a normal approximation to the posterior

p(𝒘|𝒟)𝒩(𝒘|𝒘^,𝑯1).p(\bm{w}|\mathcal{D})\approx\mathcal{N}(\bm{w}|\hat{\bm{w}},\bm{H}^{-1}). (9)

For multivariate Gaussian priors p(𝒘)=𝒩(𝒘|𝒎,𝚺)p(\bm{w})=\mathcal{N}(\bm{w}|\bm{m},\bm{\Sigma}),

Ψ(𝒘|𝒎,𝚺)=12(𝒘𝒎)T𝚺1(𝒘𝒎)+i=1nlog(σ(yi𝒘T𝒙i)),\Psi(\bm{w}|\bm{m},\bm{\Sigma})=-\frac{1}{2}(\bm{w}-\bm{m})^{T}\bm{\Sigma}^{-1}(\bm{w}-\bm{m})+\sum_{i=1}^{n}\log(\sigma(y_{i}\cdot\bm{w}^{T}\bm{x}_{i})), (10)

and the Hessian 𝑯\bm{H} evaluated at 𝒘^\hat{\bm{w}} is given for both logistic and normal CDF link functions as:

𝑯=𝚺1i=1nt^i𝒙i𝒙iT,\bm{H}=\bm{\Sigma}^{-1}-\sum_{i=1}^{n}\hat{t}_{i}\bm{x}_{i}\bm{x}_{i}^{T}, (11)

where t^i:=2logp(yi|𝒙i,𝒘)fi2|fi=𝒘^T𝒙i\hat{t}_{i}:=\frac{\partial^{2}\log p(y_{i}|\bm{x}_{i},\bm{w})}{\partial f_{i}^{2}}|_{f_{i}=\hat{\bm{w}}^{T}\bm{x}_{i}} and fi=𝒘T𝒙if_{i}=\bm{w}^{T}\bm{x}_{i}.
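
As an illustration of Eqs. (9)-(11), the mode 𝒘^\hat{\bm{w}} and the Hessian 𝑯\bm{H} can be computed numerically; the following sketch is ours and assumes the logistic link, for which t^i=σ(fi)(1σ(fi))-\hat{t}_{i}=\sigma(f_{i})(1-\sigma(f_{i})):

```python
import numpy as np
from scipy.optimize import minimize

def laplace_posterior(X, y, m, Sigma):
    """Return (w_hat, H) such that p(w | D) is approximated by N(w_hat, H^{-1}),
    following Eqs. (9)-(11) with a Gaussian prior N(m, Sigma) and the logistic link."""
    Sigma_inv = np.linalg.inv(Sigma)

    def neg_psi(w):                                   # negative of Psi in Eq. (10)
        margins = y * (X @ w)
        return 0.5 * (w - m) @ Sigma_inv @ (w - m) + np.sum(np.logaddexp(0.0, -margins))

    w_hat = minimize(neg_psi, m.copy(), method="BFGS").x
    p = 1.0 / (1.0 + np.exp(-(X @ w_hat)))            # sigma(w_hat^T x_i)
    H = Sigma_inv + X.T @ np.diag(p * (1.0 - p)) @ X  # Eq. (11) with -t_i = p_i (1 - p_i)
    return w_hat, H
```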

4.2 Online Bayesian linear classification based on Laplace approximation

Starting from a Gaussian prior 𝒩(𝒘|𝒎0,𝚺0)\mathcal{N}(\bm{w}|\bm{m}^{0},\bm{\Sigma}^{0}), after the first nn observations, the Laplace approximated posterior distribution is p(𝒘|𝒟n)𝒩(𝒘|𝒎n,𝚺n)p(\bm{w}|\mathcal{D}^{n})\approx\mathcal{N}(\bm{w}|\bm{m}^{n},\bm{\Sigma}^{n}) according to (9). We formally define the state space 𝒮\mathcal{S} to be the cross-product of d\mathbb{R}^{d} and the space of positive semidefinite matrices. At each time nn, our state of knowledge is thus Sn=(𝒎n,𝚺n)S^{n}=(\bm{m}^{n},\bm{\Sigma}^{n}). Observations come one by one due to the sequential nature of our problem setting. After each new observation, if we retrain the Bayesian classifier using all the previous data, we need to calculate the MAP solution of (10) with 𝒟=𝒟n\mathcal{D}=\mathcal{D}^{n} to update from SnS^{n} to Sn+1S^{n+1}. It is computationally inefficient even with a diagonal covariance matrix. It is better to extend the Bayesian linear classifier to handle recursive updates with each observation.

Here, we propose a fast and stable online algorithm for model updates with independent normal priors (with 𝚺=λ1𝑰\bm{\Sigma}=\lambda^{-1}\bm{I}, where 𝑰\bm{I} is the identity matrix), which is equivalent to l2l_{2} regularization and which also offers greater computational efficiency [58]. At each time step nn, the Laplace approximated posterior 𝒩(𝒘|𝒎n,𝚺n)\mathcal{N}(\bm{w}|\bm{m}^{n},\bm{\Sigma}^{n}) serves as a prior to update the model when the next observation is made. In this recursive way of model updating, previously measured data need not be stored or used for retraining the model. By setting the batch size n=1n=1 in Eq. (10) and (11), we have the sequential Bayesian linear model for classification as in Algorithm 1, where t^:=2log(σ(yf))f2|f=𝒘^T𝒙\hat{t}:=\frac{\partial^{2}\log(\sigma(yf))}{\partial f^{2}}|_{f=\hat{\bm{w}}^{T}\bm{x}}.

input : Regularization parameter λ>0\lambda>0
mj=0m_{j}=0, qj=λq_{j}=\lambda. (Each weight wjw_{j} has an independent prior 𝒩(wj|mj,qj1)\mathcal{N}(w_{j}|m_{j},q_{j}^{-1}))
for t=1t=1 to TT do
   Get a new point (𝒙,y)(\bm{x},y).
   Find 𝒘^\hat{\bm{w}} as the maximizer of (10): 12j=1dqj(wjmj)2+log(σ(y𝒘T𝒙)).-\frac{1}{2}\sum_{j=1}^{d}q_{j}(w_{j}-m_{j})^{2}+\log(\sigma(y\bm{w}^{T}\bm{x})).
 mj=w^jm_{j}=\hat{w}_{j}
   Update qjq_{j} according to (11): qjqjt^xj2q_{j}\leftarrow q_{j}-\hat{t}x^{2}_{j}.
end for
Algorithm 1 Online Bayesian linear classification
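
A minimal Python sketch of Algorithm 1 for the logistic link is given below (our own implementation; the one-dimensional MAP step simply calls a generic optimizer here, whereas the tailored bisection scheme described just below is more efficient):

```python
import numpy as np
from scipy.optimize import minimize

def online_bayes_logistic(stream, d, lam):
    """Algorithm 1 with the logistic link: process (x, y) pairs one at a time,
    maintaining an independent Gaussian N(m_j, 1/q_j) on each weight w_j."""
    m = np.zeros(d)
    q = np.full(d, lam)                      # prior precision lambda on every weight
    for x, y in stream:                      # y in {-1, +1}
        def neg_psi(w):                      # negative of the objective in (12)
            return 0.5 * np.sum(q * (w - m)**2) + np.logaddexp(0.0, -y * (w @ x))
        w_hat = minimize(neg_psi, m.copy(), method="BFGS").x
        m = w_hat                            # posterior mean <- MAP solution
        s = 1.0 / (1.0 + np.exp(-(w_hat @ x)))
        q = q + s * (1.0 - s) * x**2         # q_j <- q_j - t_hat x_j^2 with -t_hat = s(1-s)
    return m, q
```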

It is generally assumed that logσ()\log\sigma(\cdot) is concave to ensure a unique solution of Eq. (10). It is satisfied by commonly used sigmoid functions for classification problems, including logistic function, probit function, complementary log-log function σ(a)=1exp(exp(a))\sigma(a)=1-\exp(-\exp(a)) and log-log function exp(exp(a))\exp(-\exp(-a)).

We can tap a wide range of convex optimization algorithms including gradient search, conjugate gradient, and BFGS method [60]. But if we set n=1n=1 and Σ=λ1𝑰\Sigma=\lambda^{-1}\bm{I} in Eq. (10), a stable and efficient algorithm for solving

argmax𝒘12j=1dqj(wjmj)2+log(σ(y𝒘T𝒙))\arg\max_{\bm{w}}-\frac{1}{2}\sum_{j=1}^{d}q_{j}(w_{j}-m_{j})^{2}+\log(\sigma(y\bm{w}^{T}\bm{x})) (12)

can be obtained as follows. Denote the objective in (12) by FF. Taking derivatives with respect to wiw_{i} and setting Fwi\frac{\partial F}{\partial w_{i}} to zero, we have

qi(wimi)=yxiσ(y𝒘Tx)σ(y𝒘Tx),i=1,2,,d.q_{i}(w_{i}-m_{i})=\frac{yx_{i}\sigma^{\prime}(y\bm{w}^{T}x)}{\sigma(y\bm{w}^{T}x)},~~~~i=1,2,\dots,d.

Defining pp as

p:=σ(y𝒘Tx)σ(y𝒘Tx),p:=\frac{\sigma^{\prime}(y\bm{w}^{T}x)}{\sigma(y\bm{w}^{T}x)},

we then have wi=mi+ypxiqi.w_{i}=m_{i}+yp\frac{x_{i}}{q_{i}}. Plugging this back into the definition of pp to eliminate wiw_{i}’s, we get the equation for pp:

p=σ(pi=1dxi2/qi+y𝒎Tx)σ(pi=1dxi2/qi+y𝒎Tx).p=\frac{\sigma^{\prime}(p\sum_{i=1}^{d}x_{i}^{2}/q_{i}+y\bm{m}^{T}x)}{\sigma(p\sum_{i=1}^{d}x_{i}^{2}/q_{i}+y\bm{m}^{T}x)}.

Since log(σ())\log(\sigma(\cdot)) is concave, its derivative σ/σ\sigma^{\prime}/\sigma is monotonically decreasing, and thus the right-hand side of the equation decreases as pp goes from 0 to \infty. We notice that the right-hand side is positive when p=0p=0 and the left-hand side is larger than the right-hand side when p=σ(y𝒎Tx)/σ(y𝒎Tx)p=\sigma^{\prime}(y\bm{m}^{T}x)/\sigma(y\bm{m}^{T}x). Hence the equation has a unique solution in the interval [0,σ(y𝒎Tx)/σ(y𝒎Tx)][0,\sigma^{\prime}(y\bm{m}^{T}x)/\sigma(y\bm{m}^{T}x)]. A simple one-dimensional bisection method, as sketched below, is sufficient to efficiently find the root pp^{*} and thus the solution to the dd-dimensional optimization problem (12).
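
The bisection step just described might be implemented as follows (our own sketch; the argument functions sigma and sigma_prime encode the chosen link and its derivative, and the weights are recovered as wi=mi+ypxi/qiw_{i}=m_{i}+yp\,x_{i}/q_{i}):

```python
import numpy as np

def solve_weights_bisection(m, q, x, y, sigma, sigma_prime, tol=1e-10):
    """Solve problem (12) via the scalar equation
        p = sigma'(p*S + b) / sigma(p*S + b),  S = sum_i x_i^2 / q_i,  b = y m^T x,
    whose unique root lies in [0, sigma'(b)/sigma(b)]."""
    S = np.sum(x**2 / q)
    b = y * (m @ x)
    rhs = lambda p: sigma_prime(p * S + b) / sigma(p * S + b)   # decreasing in p
    lo, hi = 0.0, rhs(0.0)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid < rhs(mid):        # still to the left of the crossing point
            lo = mid
        else:
            hi = mid
    p = 0.5 * (lo + hi)
    return m + y * p * x / q      # w_i = m_i + y p x_i / q_i

# Usage with the logistic link, for which sigma'(a)/sigma(a) = 1/(1 + exp(a)):
logistic = lambda a: 1.0 / (1.0 + np.exp(-a))
logistic_prime = lambda a: logistic(a) * (1.0 - logistic(a))
```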

We illustrate and validate this schema using the logistic and probit functions. For the logistic function, logσ(y𝒘T𝒙)=log(1+exp(y𝒘T𝒙))\log\sigma(y\bm{w}^{T}\bm{x})=-\log(1+\exp(-y\bm{w}^{T}\bm{x})); by setting F/wi=0\partial F/\partial w_{i}=0 for all ii and then denoting (1+exp(y𝐰T𝐱))1(1+\exp(y\mathbf{w}^{T}\mathbf{x}))^{-1} as pp, we have

qi(wimi)=ypxi,i=1,2,,d,q_{i}(w_{i}-m_{i})=ypx_{i},~~~~i=1,2,\dots,d,

resulting in the following equation for pp:

1p=1+exp(yi=1d(mi+ypxiqi)xi)=1+exp(y𝐦T𝐱)exp(y2pi=1dxi2qi).\frac{1}{p}=1+\exp\Big{(}y\sum_{i=1}^{d}(m_{i}+yp\frac{x_{i}}{q_{i}})x_{i}\Big{)}=1+\exp(y\mathbf{m}^{T}\mathbf{x})\exp\Big{(}y^{2}p\sum_{i=1}^{d}\frac{x_{i}^{2}}{q_{i}}\Big{)}.

It is easy to see that, as pp goes from 0 to 1, the left-hand side decreases from infinity to 1 while the right-hand side increases from 1+exp(y𝐦T𝐱)1+\exp(y\mathbf{m}^{T}\mathbf{x}); therefore the solution exists and is unique in (0,1)(0,1).

For the normal CDF link function σ()=Φ()\sigma(\cdot)=\Phi(\cdot), the computation is a little lengthier. Using ϕ()\phi(\cdot) to denote the standard normal density 𝒩(|0,12)\mathcal{N}(\cdot|0,1^{2}), we have

Fwi=qi(wimi)+yxiϕ(𝒘T𝒙)Φ(y𝒘T𝒙).\frac{\partial F}{\partial w_{i}}=-q_{i}(w_{i}-m_{i})+\frac{yx_{i}\phi(\bm{w}^{T}\bm{x})}{\Phi(y\bm{w}^{T}\bm{x})}.

Let p=ϕ(𝒘T𝒙)Φ(y𝒘T𝒙)p=\frac{\phi(\bm{w}^{T}\bm{x})}{\Phi(y\bm{w}^{T}\bm{x})} and F/wi=0\partial F/\partial w_{i}=0 for all ii, and thus we have wi=mi+ypxiqiw_{i}=m_{i}+yp\frac{x_{i}}{q_{i}}. Plugging these into the definition of pp we have

p=ϕ(i=1d(mi+ypxiqi)xi)Φ(yi=1d(mi+ypxiqi)xi).p=\frac{\phi\big{(}\sum_{i=1}^{d}(m_{i}+yp\frac{x_{i}}{q_{i}})x_{i}\big{)}}{\Phi\big{(}y\sum_{i=1}^{d}(m_{i}+yp\frac{x_{i}}{q_{i}})x_{i}\big{)}}. (13)

Define the right-hand side of the above equation as g(p)g(p). The next lemma shows that ddpg(p)0\frac{\text{d}}{\text{d}{p}}g(p)\leq 0 and thus the right-hand side of Eq. (13) is non-increasing. Together with the fact that the left-hand side increases from 0, the bisection method can also be used based on Eq. (13). The proof can be found in Appendix 9.

Lemma 4.1.

Define

g(p):=ϕ(i=1d(mi+ypxiqi)xi)Φ(yi=1d(mi+ypxiqi)xi).g(p):=\frac{\phi\big{(}\sum_{i=1}^{d}(m_{i}+yp\frac{x_{i}}{q_{i}})x_{i}\big{)}}{\Phi\big{(}y\sum_{i=1}^{d}(m_{i}+yp\frac{x_{i}}{q_{i}})x_{i}\big{)}}.

We have ddpg(p)0\frac{\text{d}}{\text{d}{p}}g(p)\leq 0 for every pp\in\mathbb{R}.

5 Knowledge Gradient Policy for Bayesian Linear Classification Belief Model

We begin with a brief review of the knowledge gradient (KG) for ranking and selection problems, where each of the alternatives can be measured sequentially to estimate its unknown underlying expected performance μx\mu_{x}. The goal is to adaptively allocate alternatives to measure so as to find an implementation decision that has the largest mean after the budget is exhausted. In a Bayesian setting, the performance of each alternative is represented by a (non-parametric) lookup table model with a Gaussian distribution. Specifically, by imposing a Gaussian prior 𝒩(𝝁|𝜽0,𝚺0)\mathcal{N}(\bm{\mu}|\bm{\theta}^{0},\bm{\Sigma}^{0}), the posterior after the first nn observations is denoted by 𝒩(𝝁|𝜽n,𝚺n)\mathcal{N}(\bm{\mu}|\bm{\theta}^{n},\bm{\Sigma}^{n}). At the nnth iteration, the knowledge gradient policy chooses its (n+1n+1)th measurement to maximize the single-period expected increase in value [16]:

νxKG,n=𝔼[maxxθxn+1maxxθxn|xn=x,Sn].\nu_{x}^{\text{KG},n}=\mathbb{E}[\max_{x^{\prime}}\theta_{x^{\prime}}^{n+1}-\max_{x^{\prime}}\theta_{x^{\prime}}^{n}|x^{n}=x,S^{n}].

It enjoys nice properties, including myopic and asymptotic optimality. KG has been extended to various belief models (e.g. [41, 43, 47, 57]). The knowledge gradient can be extended to online problems where we need to maximize cumulative rewards [47],

νxOLKG,n=θxn+τνxKG,n,\nu_{x}^{\text{OLKG},n}=\theta_{x}^{n}+\tau\nu^{\text{KG},n}_{x},

where τ\tau reflects a planning horizon.

Yet there is no KG variant designed for binary classification with parametric models, primarily because of the computational intractability of dealing with nonlinear belief models. In what follows, we first formulate our learning problem as a Markov decision process and then extend the KG policy for stochastic binary outcomes where, for example, each choice (say, a medical decision) influences the success or failure of a medical outcome.

5.1 Markov decision process formulation

Our learning problem is a dynamic program that can be formulated as a Markov decision process. Define the state space 𝒮\mathcal{S} as the space of all possible predictive distributions for 𝒘\bm{w}. By Bayes’ Theorem, the transition function T:𝒮×𝒳×{1,1}𝒮T:\mathcal{S}\times\mathcal{X}\times\{-1,1\}\mapsto\mathcal{S} is:

T(q(𝒘),𝒙,y)q(𝒘)σ(y𝒘T𝒙).T\bigg{(}q(\bm{w}),\bm{x},y\bigg{)}\propto q(\bm{w})\sigma(y\bm{w}^{T}\bm{x}). (14)

If we start from a Gaussian prior 𝒩(𝒘|𝝁0,𝚺0)\mathcal{N}(\bm{w}|\bm{\mu}^{0},\bm{\Sigma}^{0}), after the first nn observed data, the approximated posterior distribution is p(𝒘|𝒟n)𝒩(𝒘|𝝁n,𝚺n)p(\bm{w}|\mathcal{D}^{n})\approx\mathcal{N}(\bm{w}|\bm{\mu}^{n},\bm{\Sigma}^{n}). The state space 𝒮\mathcal{S} is the cross-product of d\mathbb{R}^{d} and the space of positive semidefinite matrices. The transition function for updating the belief state depends on the belief model σ()\sigma(\cdot) and the approximation strategy. For example, for different update equations in Algorithm 1 and (6)(7) under different approximation methods, the transition function can be defined as follows with degenerate state space 𝒮:=d×[0,)d\mathcal{S}:=\mathbb{R}^{d}\times[0,\infty)^{d}:

Definition 5.1.

The transition function based on online Bayesian classification with Laplace approximation, T:𝒮×𝒳×{1,1}𝒮T:\mathcal{S}\times\mathcal{X}\times\{-1,1\}\mapsto\mathcal{S}, is defined as

TL((𝝁,𝝈2),𝒙,y)=(argmax𝒘Ψ(𝒘|𝝁,𝝈2),𝝈2t^(y)diag(𝒙𝒙T)),T^{\text{L}}\Big{(}(\bm{\mu},\bm{\sigma}^{2}),\bm{x},y\Big{)}=\Big{(}\arg\max_{\bm{w}}\Psi\big{(}\bm{w}|\bm{\mu},\bm{\sigma}^{2}\big{)},\bm{\sigma}^{-2}-\hat{t}(y)\cdot\textbf{diag}(\bm{x}\bm{x}^{T})\Big{)},

where t^(y):=2logp(y|𝐱,𝐰)f2|f=𝐰^T𝐱\hat{t}(y):=\frac{\partial^{2}\log p(y|\bm{x},\bm{w})}{\partial f^{2}}|_{f=\hat{\bm{w}}^{T}\bm{x}} for either logistic or probit functions, diag(𝐱𝐱T)\textbf{diag}(\bm{x}\bm{x}^{T}) is a column vector containing the diagonal elements of 𝐱𝐱T\bm{x}\bm{x}^{T} and 𝛔2\bm{\sigma}^{-2} is understood as a column vector containing σi2\sigma_{i}^{-2}, so that Sn+1=TL(Sn,𝐱,Yn+1)S^{n+1}=T^{\text{L}}(S^{n},\bm{x},Y^{n+1}). Yn+1Y^{n+1} denotes the unobserved binary random variable at time nn.

Definition 5.2.

The transition function based on assumed density filtering, T:𝒮×𝒳×{1,1}𝒮T:\mathcal{S}\times\mathcal{X}\times\{-1,1\}\mapsto\mathcal{S}, is defined as

T^{\text{ADF}}\Big{(}(\bm{\mu},\bm{\sigma}^{2}),\bm{x},y\Big{)}=\Big{(}\bm{\mu}+\frac{y}{\tilde{\sigma}}\,v\Big{(}\frac{y\bm{x}^{T}\bm{\mu}}{\tilde{\sigma}}\Big{)}\,\bm{\sigma}^{2}\circ\bm{x},\;\bm{\sigma}^{2}-\frac{1}{\tilde{\sigma}^{2}}\,w\Big{(}\frac{y\bm{x}^{T}\bm{\mu}}{\tilde{\sigma}}\Big{)}\,\bm{\sigma}^{4}\circ\bm{x}^{2}\Big{)},

where σ~=1+(𝒙2)T𝝈2\tilde{\sigma}=\sqrt{1+(\bm{x}^{2})^{T}\bm{\sigma}^{2}}, v(z):=𝒩(z|0,1)Φ(z)v(z):=\frac{\mathcal{N}(z|0,1)}{\Phi(z)}, w(z):=v(z)(v(z)+z)w(z):=v(z)\Big{(}v(z)+z\Big{)}, \circ denotes the element-wise product, and 𝒙2\bm{x}^{2} and 𝝈4\bm{\sigma}^{4} are understood as the column vectors containing xi2x_{i}^{2} and σi4\sigma_{i}^{4}, so that Sn+1=TADF(Sn,𝒙,Yn+1)S^{n+1}=T^{\text{ADF}}(S^{n},\bm{x},Y^{n+1}).

In a dynamic program, the value function is defined as the value of the optimal policy given a particular state SnS^{n} at time nn, and may also be determined recursively through Bellman’s equation. In the case of stochastic binary feedback, the terminal value function VN:𝒮V^{N}:\mathcal{S}\mapsto\mathbb{R} is given by (1) as

VN(s)=max𝒙p(y=+1|𝒙,s),s𝒮.V^{N}(s)=\max_{\bm{x}}p(y=+1|\bm{x},s),\forall s\in\mathcal{S}.

The dynamic programming principle tells us that the value function at any earlier time n=0,,N1n=0,\dots,N-1, VnV^{n}, is given recursively by

Vn(s)=max𝒙𝔼[Vn+1(T(s,𝒙,Yn+1))|𝒙,s],s𝒮.V^{n}(s)=\max_{\bm{x}}\mathbb{E}[V^{n+1}(T(s,\bm{x},Y^{n+1}))|\bm{x},s],\forall s\in\mathcal{S}.

Since the curse of dimensionality on the state space 𝒮\mathcal{S} makes direct computation of the value function intractable, computationally efficient approximate policies need to be considered. A computationally attractive policy for ranking and selection problems is known as the knowledge gradient (KG) [16], which will be extended to handle Bayesian classification models in the next section.

5.2 Knowledge Gradient for Binary Responses

The knowledge gradient of measuring an alternative 𝒙\bm{x} can be defined as follows:

Definition 5.3.

The knowledge gradient of measuring an alternative 𝐱\bm{x} while in state ss is

ν𝒙KG(s):=𝔼[VN(T(s,𝒙,Y))VN(s)|𝒙,s].\nu_{\bm{x}}^{\text{KG}}(s):=\mathbb{E}\Big{[}V^{N}\Big{(}T(s,\bm{x},Y)\Big{)}-V^{N}(s)|\bm{x},s\Big{]}. (15)

VN(s)V^{N}(s) is deterministic given ss and is independent of alternatives 𝒙\bm{x}. Since the label for alternative 𝒙\bm{x} is not known at the time of selection, the expectation is computed conditional on the current belief state s=(𝝁,𝚺)s=(\bm{\mu},\bm{\Sigma}). Specifically, given a state s=(𝝁,𝚺)s=(\bm{\mu},\bm{\Sigma}), the outcome yy of an alternative 𝒙\bm{x} is a random variable that follows from a Bernoulli distribution with a predictive distribution

p(y=+1|𝒙,s)=p(y=+1|𝒙,𝒘)p(𝒘|s)d𝒘=σ(𝒘T𝒙)p(𝒘|s)d𝒘.\displaystyle p(y=+1|\bm{x},s)=\int p(y=+1|\bm{x},\bm{w})p(\bm{w}|s)\text{d}\bm{w}=\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}. (16)

We can calculate the expected value in the next state as

𝔼[VN(T(s,𝒙,y))]\displaystyle\mathbb{E}[V^{N}(T(s,\bm{x},y))]
=\displaystyle= p(y=+1|𝒙,s)VN(T(s,𝒙,+1))+p(y=1|𝒙,s)VN(T(s,𝒙,1))\displaystyle p(y=+1|\bm{x},s)V^{N}\Big{(}T(s,\bm{x},+1)\Big{)}+p(y=-1|\bm{x},s)V^{N}\Big{(}T(s,\bm{x},-1)\Big{)}
=\displaystyle= p(y=+1|𝒙,s)max𝒙p(y=+1|𝒙,T(s,𝒙,+1))\displaystyle p(y=+1|\bm{x},s)\cdot\max_{\bm{x}^{\prime}}p\Big{(}y=+1|\bm{x}^{\prime},T(s,\bm{x},+1)\Big{)}
+p(y=1|𝒙,s)max𝒙p(y=+1|𝒙,T(s,𝒙,1)).\displaystyle+p(y=-1|\bm{x},s)\cdot\max_{\bm{x}^{\prime}}p\Big{(}y=+1|\bm{x}^{\prime},T(s,\bm{x},-1)\Big{)}.

The knowledge gradient policy suggests at each time nn selecting the alternative that maximizes ν𝒙KG,n(sn)\nu_{\bm{x}}^{\text{KG},n}(s^{n}) where ties are broken randomly. Because of the errors incurred by approximation and numerical calculation, the tie should be understood as within ϵ\epsilon-accuracy. The knowledge gradient policy can work with any choice of link function σ()\sigma(\cdot) and approximation procedures by adjusting the transition function T(s,x,)T(s,x,\cdot) accordingly. That is, TLT^{\text{L}} or TADFT^{\text{ADF}}.

The predictive distribution σ(𝒘T𝒙)p(𝒘|s)d𝒘\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w} is obtained by marginalizing with respect to the distribution specified by current belief state p(𝒘|s)=𝒩(𝒘|𝝁,𝚺)p(\bm{w}|s)=\mathcal{N}(\bm{w}|\bm{\mu},\bm{\Sigma}). Denoting a=𝒘T𝒙a=\bm{w}^{T}\bm{x} and δ()\delta(\cdot) as the Dirac delta function, we have σ(𝒘T𝒙)=δ(a𝒘T𝒙)σ(a)da.\sigma(\bm{w}^{T}\bm{x})=\int\delta(a-\bm{w}^{T}\bm{x})\sigma(a)\text{d}a. Hence

σ(𝒘T𝒙)p(𝒘|s)d𝒘=σ(a)p(a)da,\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}=\int\sigma(a)p(a)\text{d}a,

where p(a)=δ(a𝒘T𝒙)p(𝒘|s)d𝒘.p(a)=\int\delta(a-\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}. Since the delta function imposes a linear constraint on 𝒘\bm{w} and p(𝒘|s)=𝒩(𝒘|𝝁,𝝈2)p(\bm{w}|s)=\mathcal{N}(\bm{w}|\bm{\mu},\bm{\sigma}^{2}) is Gaussian, the marginal distribution p(a)p(a) is also Gaussian. We can evaluate p(a)p(a) by calculating the mean and variance of this distribution [4]. We have

μa\displaystyle\mu_{a} =\displaystyle= 𝔼[a]=p(a)a da=p(𝒘|s)𝒘T𝒙 d𝒘=𝝁T𝒙,\displaystyle\mathbb{E}[a]=\int p(a)a\text{ d}a=\int p(\bm{w}|s)\bm{w}^{T}\bm{x}\text{ d}\bm{w}=\bm{\mu}^{T}\bm{x},
σa2\displaystyle\sigma_{a}^{2} =\displaystyle= Var[a]=p(𝒘|s)((𝒘T𝒙)2(𝝁T𝒙)2) d𝒘=j=1dσj2xj2.\displaystyle\text{Var}[a]=\int p(\bm{w}|s)\big{(}(\bm{w}^{T}\bm{x})^{2}-(\bm{\mu}^{T}\bm{x})^{2}\big{)}\text{ d}\bm{w}=\sum_{j=1}^{d}\sigma_{j}^{2}x_{j}^{2}.

Thus σ(𝒘T𝒙)p(𝒘|s)d𝒘=σ(a)p(a)da=σ(a)𝒩(a|μa,σa2)da.\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}=\int\sigma(a)p(a)\text{d}a=\int\sigma(a)\mathcal{N}(a|\mu_{a},\sigma^{2}_{a})\text{d}a.

For the probit function σ(a)=Φ(a)\sigma(a)=\Phi(a), the convolution of a Gaussian and a normal CDF can be evaluated analytically. Thus for probit regression, the predictive distribution can be computed exactly as p(y=+1|𝒙,s)=Φ(μa1+σa2).p(y=+1|\bm{x},s)=\Phi(\frac{\mu_{a}}{\sqrt{1+\sigma^{2}_{a}}}). Yet, the convolution of a Gaussian with a logistic sigmoid function cannot be evaluated analytically. We apply the approximation σ(a)Φ(αa)\sigma(a)\approx\Phi(\alpha a) with α2=π/8\alpha^{2}=\pi/8 (see [3, 51]), leading to the following approximation for the convolution of a logistic sigmoid with a Gaussian

p(y=+1|𝒙,s)=σ(𝒘T𝒙)p(𝒘|s)d𝒘σ(κ(σa2)μa),p(y=+1|\bm{x},s)=\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}\approx\sigma(\kappa(\sigma^{2}_{a})\mu_{a}),

where κ(σ2)=(1+πσ2/8)1/2\kappa(\sigma^{2})=(1+\pi\sigma^{2}/8)^{-1/2}.
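
Both predictive probabilities reduce to a one-dimensional computation; a short sketch (our own helper, with the link selected by a string argument):

```python
import numpy as np
from scipy.stats import norm

def predictive_prob(mu, sigma2, x, link="probit"):
    """p(y = +1 | x, s) for a diagonal Gaussian belief s = (mu, sigma2).
    Probit: exact convolution  Phi(mu_a / sqrt(1 + sigma_a^2)).
    Logistic: the approximation sigma(kappa(sigma_a^2) mu_a), kappa = (1 + pi sigma^2 / 8)^{-1/2}."""
    mu_a = mu @ x
    sigma2_a = np.sum(sigma2 * x**2)
    if link == "probit":
        return norm.cdf(mu_a / np.sqrt(1.0 + sigma2_a))
    kappa = 1.0 / np.sqrt(1.0 + np.pi * sigma2_a / 8.0)
    return 1.0 / (1.0 + np.exp(-kappa * mu_a))
```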

Because of the one-step look ahead, the KG calculation can also benefit from the online recursive update of the belief, either from ADF or from online Bayesian linear classification based on Laplace approximation. We summarize the decision rules of the knowledge gradient policy at each iteration under different sigmoid functions and different approximation methods in Algorithms 2, 3 and 4, respectively.

input : mjm_{j}, qjq_{j} (Each weight wjw_{j} has an independent prior 𝒩(wj|mj,qj1)\mathcal{N}(w_{j}|m_{j},q_{j}^{-1}))
for 𝐱\bm{x} in 𝒳\mathcal{X} do
   Let Ψ(𝒘,y)=12j=1dqj(wjmj)2log(1+exp(y𝒘T𝒙))\Psi(\bm{w},y)=-\frac{1}{2}\sum_{j=1}^{d}q_{j}(w_{j}-m_{j})^{2}-\log(1+\exp(-y\bm{w}^{T}\bm{x}))
   Use bisection method to find 𝒘^+=argmax𝒘Ψ(𝒘,+1),𝒘^=argmax𝒘Ψ(𝒘,1)\hat{\bm{w}}_{+}=\arg\max_{\bm{w}}\Psi(\bm{w},+1),~\hat{\bm{w}}_{-}=\arg\max_{\bm{w}}\Psi(\bm{w},-1) μ=𝒎T𝒙\mu=\bm{m}^{T}\bm{x}, σ2=j=1dqj1xj2\sigma^{2}=\sum_{j=1}^{d}q_{j}^{-1}x_{j}^{2}
   Let σ(a)=(1+exp(a))1\sigma(a)=\big{(}1+\exp(-a)\big{)}^{-1}
 μ+(𝒙):=𝒘^+T𝒙\mu_{+}(\bm{x}^{\prime}):=\hat{\bm{w}}_{+}^{T}\bm{x}^{\prime}, μ(𝒙):=𝒘^T𝒙\mu_{-}(\bm{x}^{\prime}):=\hat{\bm{w}}_{-}^{T}\bm{x}^{\prime}
 σ±2(𝒙):=j=1d(qj+σ(𝒘^±T𝒙)(1σ(𝒘^±T𝒙))xj2)1(xj)2\sigma^{2}_{\pm}(\bm{x}^{\prime}):=\sum_{j=1}^{d}\Big{(}q_{j}+\sigma(\hat{\bm{w}}^{T}_{\pm}\bm{x})(1-\sigma(\hat{\bm{w}}_{\pm}^{T}\bm{x}))x^{2}_{j}\Big{)}^{-1}(x^{\prime}_{j})^{2}
 ν~𝒙=σ(κ(σ2)μ)max𝒙σ(κ(σ+2(𝒙))μ+(𝒙))+σ(κ(σ2)μ)max𝒙σ(κ(σ2(𝒙))μ(𝒙))\tilde{\nu}_{\bm{x}}=\sigma(\kappa(\sigma^{2})\mu)\cdot\max_{\bm{x^{\prime}}}\sigma\big{(}\kappa(\sigma_{+}^{2}(\bm{x}^{\prime}))\mu_{+}(\bm{x^{\prime}})\big{)}+\sigma(-\kappa(\sigma^{2})\mu)\cdot\max_{\bm{x^{\prime}}}\sigma\big{(}\kappa(\sigma_{-}^{2}(\bm{x}^{\prime}))\mu_{-}(\bm{x^{\prime}})\big{)}
 
end for
𝒙KGargmax𝒙ν~x\bm{x}^{\text{KG}}\in\arg\max_{\bm{x}}\tilde{\nu}_{x}
Algorithm 2 Knowledge Gradient Policy for Logistic Model based on Laplace approximation
input : mjm_{j}, qjq_{j} (Each weight wjw_{j} has an independent prior 𝒩(wj|mj,qj1)\mathcal{N}(w_{j}|m_{j},q_{j}^{-1}))
for 𝐱\bm{x} in 𝒳\mathcal{X} do
   Let Ψ(𝒘,y)=12j=1dqj(wjmj)2+log(Φ(y𝒘T𝒙))\Psi(\bm{w},y)=-\frac{1}{2}\sum_{j=1}^{d}q_{j}(w_{j}-m_{j})^{2}+\log(\Phi(y\bm{w}^{T}\bm{x}))
   Use bisection method to find 𝒘^+=argmax𝒘Ψ(𝒘,+1),𝒘^=argmax𝒘Ψ(𝒘,1)\hat{\bm{w}}_{+}=\arg\max_{\bm{w}}\Psi(\bm{w},+1),\hat{\bm{w}}_{-}=\arg\max_{\bm{w}}\Psi(\bm{w},-1)
 μ=𝒎T𝒙\mu=\bm{m}^{T}\bm{x}, σ2=j=1dqj1xj2\sigma^{2}=\sum_{j=1}^{d}q_{j}^{-1}x_{j}^{2}
  μ+(𝒙):=𝒘^+T𝒙\mu_{+}(\bm{x}^{\prime}):=\hat{\bm{w}}_{+}^{T}\bm{x}^{\prime}, μ(𝒙):=𝒘^T𝒙\mu_{-}(\bm{x}^{\prime}):=\hat{\bm{w}}_{-}^{T}\bm{x}^{\prime}
 σ±2(𝒙):=j=1d(qj+(𝒩(𝒘^±T𝒙|0,1)2Φ(𝒘^±T𝒙)2+𝒘^±T𝒙𝒩(𝒘^±T𝒙|0,1)Φ(𝒘^±T𝒙))xj2)1(xj)2\sigma^{2}_{\pm}(\bm{x}^{\prime}):=\sum_{j=1}^{d}\Big{(}q_{j}+(\frac{\mathcal{N}(\hat{\bm{w}}^{T}_{\pm}\bm{x}|0,1)^{2}}{\Phi(\hat{\bm{w}}^{T}_{\pm}\bm{x})^{2}}+\frac{\hat{\bm{w}}^{T}_{\pm}\bm{x}\mathcal{N}(\hat{\bm{w}}^{T}_{\pm}\bm{x}|0,1)}{\Phi(\hat{\bm{w}}^{T}_{\pm}\bm{x})})x^{2}_{j}\Big{)}^{-1}(x^{\prime}_{j})^{2}
 ν~𝒙=Φ(μ1+σ2)max𝒙Φ(μ+(𝒙)1+σ+2(𝒙))+Φ(μ1+σ2)max𝒙Φ(μ(𝒙)1+σ2(𝒙))\tilde{\nu}_{\bm{x}}=\Phi(\frac{\mu}{\sqrt{1+\sigma^{2}}})\cdot\max_{\bm{x^{\prime}}}\Phi(\frac{\mu_{+}(\bm{x}^{\prime})}{\sqrt{1+\sigma_{+}^{2}(\bm{x}^{\prime})}})+\Phi(-\frac{\mu}{\sqrt{1+\sigma^{2}}})\cdot\max_{\bm{x^{\prime}}}\Phi(\frac{\mu_{-}(\bm{x}^{\prime})}{\sqrt{1+\sigma_{-}^{2}(\bm{x}^{\prime})}})
 
end for
𝒙KGargmax𝒙ν~x\bm{x}^{\text{KG}}\in\arg\max_{\bm{x}}\tilde{\nu}_{x}
Algorithm 3 Knowledge Gradient Policy for Probit Model based on Laplace approximation
input : mjm_{j}, qjq_{j} (Each weight wjw_{j} has an independent prior 𝒩(wj|mj,qj1)\mathcal{N}(w_{j}|m_{j},q_{j}^{-1}))
σj2=1/qj\sigma_{j}^{2}=1/q_{j}
for 𝐱\bm{x} in 𝒳\mathcal{X} do
   Define v(z):=𝒩(z|0,1)Φ(z) and w(z):=v(z)(v(z)+z)v(z):=\frac{\mathcal{N}(z|0,1)}{\Phi(z)}\text{ and }w(z):=v(z)\Big{(}v(z)+z\Big{)}
 
 μ=𝒎T𝒙\mu=\bm{m}^{T}\bm{x}, σ2=j=1dσj2xj2\sigma^{2}=\sum_{j=1}^{d}\sigma_{j}^{2}x_{j}^{2}
 m+j=mj+xjσj21+σ2v(𝒎T𝒙1+σ2)m_{+j}=m_{j}+\frac{x_{j}\sigma_{j}^{2}}{\sqrt{1+\sigma^{2}}}v(\frac{\bm{m}^{T}\bm{x}}{\sqrt{1+\sigma^{2}}}), mj=mjxjσj21+σ2v(𝒎T𝒙1+σ2)m_{-j}=m_{j}-\frac{x_{j}\sigma_{j}^{2}}{\sqrt{1+\sigma^{2}}}v(-\frac{\bm{m}^{T}\bm{x}}{\sqrt{1+\sigma^{2}}})
 σ+j2=σj2xj2σj41+σ22w(𝒎T𝒙1+σ2)\sigma_{+j}^{2}=\sigma_{j}^{2}-\frac{x_{j}^{2}\sigma_{j}^{4}}{\sqrt{1+\sigma^{2}}^{2}}w(\frac{\bm{m}^{T}\bm{x}}{\sqrt{1+\sigma^{2}}}), σj2=σj2xj2σj41+σ22w(𝒎T𝒙1+σ2)\sigma_{-j}^{2}=\sigma_{j}^{2}-\frac{x_{j}^{2}\sigma_{j}^{4}}{\sqrt{1+\sigma^{2}}^{2}}w(-\frac{\bm{m}^{T}\bm{x}}{\sqrt{1+\sigma^{2}}})
 
  μ+(𝒙):=𝒎+T𝒙\mu_{+}(\bm{x}^{\prime}):=\bm{m}_{+}^{T}\bm{x}^{\prime}, μ(𝒙):=𝒎T𝒙\mu_{-}(\bm{x}^{\prime}):=\bm{m}_{-}^{T}\bm{x}^{\prime}
 σ+2(𝒙):=j=1dσ+j2(xj)2\sigma_{+}^{2}(\bm{x}^{\prime}):=\sum_{j=1}^{d}\sigma_{+j}^{2}(x^{\prime}_{j})^{2}, σ2(𝒙):=j=1dσj2(xj)2\sigma_{-}^{2}(\bm{x}^{\prime}):=\sum_{j=1}^{d}\sigma_{-j}^{2}(x^{\prime}_{j})^{2}
 ν~𝒙=Φ(μ1+σ2)max𝒙Φ(μ+(𝒙)1+σ+2(𝒙))+Φ(μ1+σ2)max𝒙Φ(μ(𝒙)1+σ2(𝒙))\tilde{\nu}_{\bm{x}}=\Phi(\frac{\mu}{\sqrt{1+\sigma^{2}}})\cdot\max_{\bm{x^{\prime}}}\Phi(\frac{\mu_{+}(\bm{x}^{\prime})}{\sqrt{1+\sigma_{+}^{2}(\bm{x}^{\prime})}})+\Phi(-\frac{\mu}{\sqrt{1+\sigma^{2}}})\cdot\max_{\bm{x^{\prime}}}\Phi(\frac{\mu_{-}(\bm{x}^{\prime})}{\sqrt{1+\sigma_{-}^{2}(\bm{x}^{\prime})}})
 
end for
𝒙KGargmax𝒙ν~x\bm{x}^{\text{KG}}\in\arg\max_{\bm{x}}\tilde{\nu}_{x}
Algorithm 4 Knowledge Gradient Policy for Probit Model based on assumed density filtering
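
To make the decision rules concrete, the sketch below is our own implementation of Algorithm 4 (the probit/ADF variant, whose one-step belief update is available in closed form); ties are broken by the first maximizer rather than randomly:

```python
import numpy as np
from scipy.stats import norm

def v(z):      return norm.pdf(z) / norm.cdf(z)
def w_corr(z): return v(z) * (v(z) + z)

def pred(mu, sigma2, x):
    """Exact probit predictive probability p(y = +1 | x, (mu, sigma2))."""
    return norm.cdf((mu @ x) / np.sqrt(1.0 + np.sum(sigma2 * x**2)))

def adf_update(mu, sigma2, x, y):
    """Closed-form ADF update of the diagonal belief, Eqs. (6)-(7)."""
    s = np.sqrt(1.0 + np.sum(sigma2 * x**2))
    z = y * (x @ mu) / s
    return (mu + y * x * sigma2 / s * v(z),
            sigma2 - x**2 * sigma2**2 / s**2 * w_corr(z))

def kg_probit_adf(mu, sigma2, X):
    """Algorithm 4: return the alternative (row of X) with the largest KG score."""
    best_x, best_val = None, -np.inf
    for x in X:
        p_plus = pred(mu, sigma2, x)                   # Pr(y = +1 | x, s)
        mu_p, s2_p = adf_update(mu, sigma2, x, +1)     # belief after observing +1
        mu_m, s2_m = adf_update(mu, sigma2, x, -1)     # belief after observing -1
        val = (p_plus * max(pred(mu_p, s2_p, xp) for xp in X)
               + (1.0 - p_plus) * max(pred(mu_m, s2_m, xp) for xp in X))
        if val > best_val:
            best_x, best_val = x, val
    return best_x
```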

We next present the following finite-time bound on the mean squared error (MSE) of the estimated weight for Bayesian logistic regression. The proof is in the supplement. Since the learner plays an active role in selecting the measurements, the bound does not make the i.i.d. assumption on the examples, which differs from PAC learning bounds. Without loss of generality, we assume 𝒙21\|\bm{x}\|_{2}\leq 1, 𝒙𝒳\forall{\bm{x}\in\mathcal{X}}. In our finite-time analysis, we would like to highlight the difference between the ideal n-step estimate 𝒘n=argmax𝒘Ψ(𝒘|𝒎0,Σ0)\bm{w}^{n}=\arg\max_{\bm{w}}\Psi(\bm{w}|\bm{m}^{0},\Sigma^{0}) under the prior distribution p(𝒘)=𝒩(𝒘|𝒎0,𝚺0)p(\bm{w}^{*})=\mathcal{N}(\bm{w}^{*}|\bm{m}^{0},\bm{\Sigma}^{0}), and its approximation 𝒘^n\hat{\bm{w}}^{n}. The approximation error may come from the Laplace approximation we adopt when an explicit form is not available, or from accumulation of numerical error.

Theorem 5.4.

Let 𝒟n\mathcal{D}^{n} be the nn measurements produced by the KG policy. Then with probability Pd(M)P_{d}(M), the expected error of 𝐰n\bm{w}^{n} is bounded as

𝔼𝒚(𝒟n,𝒘)𝒘n𝒘2Cmin+λmin(𝚺1)2,\mathbb{E}_{\bm{y}\sim\mathcal{B}(\mathcal{D}^{n},\bm{w}^{*})}||\bm{w}^{n}-\bm{w}^{*}||_{2}\leq\frac{C_{min}+\lambda_{min}\big{(}\bm{\Sigma}^{-1}\big{)}}{2},

where the distribution (𝒟n,𝐰)\mathcal{B}(\mathcal{D}^{n},\bm{w}^{*}) is the vector Bernoulli distribution Pr(yi=+1)=σ((𝐰)T𝐱i)Pr(y^{i}=+1)=\sigma((\bm{w}^{*})^{T}\bm{x}^{i}), Pd(M)P_{d}(M) is the probability that a dd-dimensional standard normal random variable falls in the ball of radius M=116λmin2λmaxM=\frac{1}{16}\frac{\lambda_{min}^{2}}{\sqrt{\lambda_{max}}} and Cmin=λmin(1ni=1nσ((𝐰)T𝐱i)(1σ((𝐰)T𝐱i))𝐱i(𝐱i)T).C_{min}=\lambda_{min}\Big{(}\frac{1}{n}\sum_{i=1}^{n}\sigma((\bm{w}^{*})^{T}\bm{x}^{i})\big{(}1-\sigma((\bm{w}^{*})^{T}\bm{x}^{i})\big{)}\bm{x}^{i}(\bm{x}^{i})^{T}\Big{)}. The same finite-time bound holds for 𝐰^n\hat{\bm{w}}^{n} as long as the approximation satisfies

Ψ(𝒘^n|𝒎0,Σ0)Ψ(𝒘n|𝒎0,Σ0)132λmin3(𝚺1).\Psi(\hat{\bm{w}}^{n}|\bm{m}^{0},\Sigma^{0})\geq\Psi(\bm{w}^{n}|\bm{m}^{0},\Sigma^{0})-\frac{1}{32}\lambda^{3}_{min}\big{(}\bm{\Sigma}^{-1}\big{)}. (17)

In the special case where 𝚺0=λ1𝑰\bm{\Sigma}^{0}=\lambda^{-1}\bm{I}, we have λmax=λmin=λ\lambda_{max}=\lambda_{min}=\lambda and M=λ3/216M=\frac{\lambda^{3/2}}{16}. The bound holds with higher probability Pd(M)P_{d}(M) with larger λ\lambda. This is natural since a larger λ\lambda represents a normal distribution with narrower bandwidth, resulting in a more concentrated 𝒘\bm{w}^{*} around 𝒎0\bm{m}^{0}.

Condition (17) quantifies how well the approximation should be carried out in order for the bound to hold. It places a condition on the optimality of the log-likelihood value instead of on the distance to the global maximizer. This is a friendly condition, since the maximizer generally does not depend continuously on the objective function. The approximation error could come from either the Laplace approximation or the numerical calculation in practice, which calls for experiments to further verify the performance of the algorithm. Indeed, we conduct extensive experiments in Section 6 to demonstrate the behavior of the proposed algorithm together with benchmark methods.

If the goal of the learner is to maximize the cumulative successes as in bandit settings, rather than finding the final recommendation after the offline training phase, the knowledge gradient policy can be modified for online learning by defining XKG,nX^{\text{KG},n} at each time step nn as:

XKG,n(Sn)=argmax𝒙p(y=+1|x,Sn)+τν𝒙KG(Sn),X^{\text{KG},n}(S^{n})=\arg\max_{\bm{x}}p(y=+1|x,S^{n})+\tau\nu_{\bm{x}}^{\text{KG}}(S^{n}), (18)

where τ\tau reflects a planning horizon.
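
Reusing the pred and adf_update helpers from the sketch following Algorithm 4, rule (18) can be written for the probit/ADF variant roughly as follows (our own illustration, not standalone code):

```python
def online_kg_choice(mu, sigma2, X, tau):
    """Online KG rule (18): immediate success probability plus tau times the
    value of information, computed with the probit/ADF helpers sketched above."""
    current_best = max(pred(mu, sigma2, xp) for xp in X)
    def kg_value(x):
        p_plus = pred(mu, sigma2, x)
        mu_p, s2_p = adf_update(mu, sigma2, x, +1)
        mu_m, s2_m = adf_update(mu, sigma2, x, -1)
        expected_next = (p_plus * max(pred(mu_p, s2_p, xp) for xp in X)
                         + (1.0 - p_plus) * max(pred(mu_m, s2_m, xp) for xp in X))
        return expected_next - current_best
    return max(X, key=lambda x: pred(mu, sigma2, x) + tau * kg_value(x))
```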

5.3 Behavior and Asymptotic Optimality

In this section, we study theoretically the behavior of the KG policy, especially in the limit as the number of measurements NN grows large. For the purposes of the theoretical analysis, we do not approximate the predictive posterior distribution. We use the logistic function as the sigmoid link function throughout this section, yet the theoretical results can be generalized to other link functions. We begin by showing the positive value of information (benefits of measurement).

Proposition 5.5 (Benefits of measurement).

The knowledge gradient of measuring any alternative 𝐱\bm{x} while in any state s𝒮s\in\mathcal{S} is nonnegative, ν𝐱KG(s)0.\nu_{\bm{x}}^{\text{KG}}(s)\geq 0. The state space 𝒮\mathcal{S} is the space of all possible predictive distributions for 𝐰\bm{w}.

The next lemma shows that the asymptotic probability of success of each alternative is well defined.

Lemma 5.6.

For any alternative 𝐱\bm{x}, p𝐱np_{\bm{x}}^{n} converges almost surely to a random variable p𝐱=𝔼[σ(𝐰T𝐱)|]p_{\bm{x}}^{\infty}=\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x})|\mathcal{F}^{\infty}\big{]}, where p𝐱np_{\bm{x}}^{n} is short hand notation for p(y=+1|𝐱,n)=𝔼[σ(𝐰T𝐱)|n].p(y=+1|\bm{x},\mathcal{F}^{n})=\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x})|\mathcal{F}^{n}\big{]}.

Proof.

Since |σ(𝒘T𝒙)|1|\sigma(\bm{w}^{T}\bm{x})|\leq 1, the definition of p𝒙np_{\bm{x}}^{n} implies that p𝒙np_{\bm{x}}^{n} is a bounded martingale and hence converges. ∎

The rest of this section shows that this limiting probability of success of each alternative is one in which the posterior is consistent and thus the KG is asymptotically optimal. We also show that as the number of measurements grows large, the maximum likelihood estimator 𝒘MLE\bm{w}_{\text{MLE}} based on the alternatives measured by the KG policy is consistent and asymptotically normal. The next proposition states that if we have measured an alternative infinitely many times, there is no benefit to measuring it one more time. This is a key step for establishing the consistency of the KG policy and the MLE. The proof is similar to that by [15] with additional mathematical steps for Bernoulli distributed random variables. See Appendix 11.2 for details.

Proposition 5.7.

If the policy π\pi measures alternative 𝐱\bm{x} infinitely often almost surely, then the value of information of that alternative ν𝐱()=0\nu_{\bm{x}}(\mathcal{F}^{\infty})=0 almost surely under policy π\pi.

Without loss of generality, we assume 𝒙21\|\bm{x}\|_{2}\leq 1 and that the d×dd\times d matrix PP formed by (𝒙1,𝒙2,,𝒙d)(\bm{x}_{1},\bm{x}_{2},\dots,\bm{x}_{d}) is invertible. The next theorem states the strong consistency and asymptotic normality of the maximum likelihood estimator 𝒘MLEn\bm{w}^{n}_{\text{MLE}} (e.g. with λ=0\lambda=0) based on KG’s sequential selection of alternatives by verifying the following regularity conditions:

  • (C1)

    The exogenous variables are uniformly bounded.

  • (C2)

    Let λ1n\lambda_{1n} and λdn\lambda_{dn} be respectively the smallest and the largest eigenvalue of the Fisher information of the first nn observations 𝑭n(𝒘)\bm{F}_{n}(\bm{w}^{*}). There exists CC such that λdN/λ1N<C\lambda_{dN}/\lambda_{1N}<C, N\forall N.

  • (C3)

    The smallest eigenvalue of the Fisher information is divergent, λ1N+.\lambda_{1N}\rightarrow+\infty.

Theorem 5.8.

The sequence of 𝐰MLEn\bm{w}^{n}_{\text{MLE}} based on KG’s sequential selection of alternatives converges almost surely to the true value 𝐰\bm{w}^{*} and is asymptotically normal:

𝑭n12(𝒘MLEn𝒘)𝑑𝒩(0,𝑰).\bm{F}_{n}^{\frac{1}{2}}(\bm{w}^{n}_{\text{MLE}}-\bm{w}^{*})\xrightarrow{d}\mathcal{N}(0,\bm{I}).
Proof.

We first prove that any alternative 𝒙\bm{x} will be measured infinitely many times almost surely. We argue by contradiction: if this is not the case, then there exists a time TT such that for any n>Tn>T,

ν𝒙KG,n<max𝒙𝒳ν𝒙KG,nϵ.\nu^{\text{KG},n}_{\bm{x}}<\max_{\bm{x}^{\prime}\in\mathcal{X}}\nu^{\text{KG},n}_{\bm{x}^{\prime}}-\epsilon. (19)

This is because otherwise the difference between the KG value of 𝒙\bm{x} and the maximum KG value will be smaller than ϵ\epsilon for infinitely many times, and thus the probability of not measuring 𝒙\bm{x} after TT will be 0.

In addition, since the KG value is always non-negative, we have max𝒙𝒳ν𝒙KG,n>ϵ\max_{\bm{x}^{\prime}\in\mathcal{X}}\nu^{\text{KG},n}_{\bm{x}^{\prime}}>\epsilon for each n>Tn>T. Since 𝒳\mathcal{X} is a finite set, it follows that there exists an alternative 𝒙\bm{x}^{\prime} such that the following happens infinitely many times:

ν𝒙KG,n=max𝒙𝒳ν𝒙KG,n.\nu^{\text{KG},n}_{\bm{x}^{\prime}}=\max_{\bm{x}\in\mathcal{X}}\nu^{\text{KG},n}_{\bm{x}}. (20)

Therefore, 𝒙\bm{x}^{\prime} is measured infinitely many times almost surely. However, we have proved in Proposition 5.7 that ν𝒙KG,n\nu^{\text{KG},n}_{\bm{x}^{\prime}} goes to zero as long as we measure 𝒙\bm{x}^{\prime} infinitely many times, which contradicts the fact that the maximum KG value exceeds ϵ\epsilon infinitely often. The contradiction shows that our original assumption that 𝒙\bm{x} is only measured finitely many times is incorrect. As a consequence, we have proved that, almost surely, 𝒙\bm{x} will be measured infinitely many times.

Since the argument holds for an arbitrary x, we have in fact proved that every alternative is measured infinitely many times, which immediately leads to max_{x∈X} μ_x^{KG,n} → 0. Therefore the algorithm eventually selects alternatives uniformly at random. Hence conditions (C2) and (C3) are satisfied, leading to strong consistency and asymptotic normality [21, 13, 25, 11].

In particular, we can prove in a simple way that the smallest eigenvalue of the Fisher information goes to infinity. Without loss of generality, assume that the alternatives x_1, x_2, …, x_d are in general linear position, which means that the d×d matrix P := (x_1, x_2, …, x_d) is invertible. We use X' to denote the set of these d alternatives and denote P^{-1} by Q. Then, for x_t = x_k ∈ X', Q x_t x_t^T Q^T is the matrix whose entries are all zero except for its k-th diagonal element, which equals one.

We use F_n to denote the Fisher information at time n. Then, from the fact that any matrix of the form x x^T is positive semi-definite, it is straightforward to see that

𝑭n(𝒘)min𝒙𝒳(1σ(𝒙T𝒘))σ(𝒙T𝒘)1tn,𝒙t𝒳𝒙t𝒙tT,\bm{F}_{n}(\bm{w})\geq\min_{\bm{x}\in\mathcal{X}}(1-\sigma(\bm{x}^{T}\bm{w}))\sigma(\bm{x}^{T}\bm{w})\sum_{1\leq t\leq n,\bm{x}_{t}\in\mathcal{X}^{\prime}}\bm{x}_{t}\bm{x}_{t}^{T},

where the constant min𝒙𝒳(1σ(𝒙T𝒘))σ(𝒙T𝒘)>0\min_{\bm{x}\in\mathcal{X}}(1-\sigma(\bm{x}^{T}\bm{w}))\sigma(\bm{x}^{T}\bm{w})>0 since 𝒳\mathcal{X} is a finite set. Now define matrix RnR_{n} as

Rn:=1tn,𝒙t𝒳Q𝒙t𝒙tTQT,R_{n}:=\sum_{1\leq t\leq n,\bm{x}_{t}\in\mathcal{X}^{\prime}}Q\bm{x}_{t}\bm{x}_{t}^{T}Q^{T},

which is a diagonal matrix whose i-th diagonal element equals the number of times that x_i has been measured. Since each x_k, 1 ≤ k ≤ d, is measured infinitely many times, each diagonal element of R_n goes to infinity. Now notice that ∑_{1≤t≤n, x_t∈X'} x_t x_t^T = Q^{-1} R_n (Q^{-1})^T is congruent to R_n and Q^{-1} is a constant matrix; it then follows that the smallest eigenvalue of ∑_{1≤t≤n, x_t∈X'} x_t x_t^T, and hence the smallest eigenvalue of F_n, goes to infinity.

Similarly, by the same congruence argument, we only need to show that, almost surely, for any k there exists a constant C such that n_k(t)/t ≥ C, where n_k(t) is the number of times that x_k is measured in the first t measurements. Since the algorithm eventually selects alternatives uniformly at random, without loss of generality we may treat n_k(t) as the sum of t i.i.d. Bernoulli random variables with success probability 1/m, so that E[n_k(t)] = t/m and Var[n_k(t)] = t(m-1)/m^2. Let L be the event that liminf_{t→∞} n_k(t)/t < 1/2m. Notice that L implies that there exists a sequence t_1 < t_2 < t_3 < … such that n_k(t_s)/t_s < 1/2m. However, by Chebyshev's inequality, we have

[nk(ts)ts<12m][|nk(ts)tsm|ts2m]4(m1)ts.\mathbb{P}\left[\frac{n_{k}(t_{s})}{t_{s}}<\frac{1}{2m}\right]\leq\mathbb{P}\left[\left|n_{k}(t_{s})-\frac{t_{s}}{m}\right|\geq\frac{t_{s}}{2m}\right]\leq\frac{4(m-1)}{t_{s}}.

Since this bound holds for every t_s and tends to zero as t_s → ∞, it follows that P[L] = 0, and thus liminf_{t→∞} n_k(t)/t ≥ 1/2m almost surely, which completes the proof. ∎

Having established the consistency and asymptotic normality of w_MLE^n, for any λ > 0 (the inverse of the variance of the prior Gaussian distribution), the asymptotic bias of the regularized estimator w_λ^n is as follows [36]:

𝑬[𝒘λn𝒘]=2λ{𝑭n(𝒘)+2λ𝑰}1𝒘,\bm{E}[\bm{w}^{n}_{\lambda}-\bm{w}^{*}]=-2\lambda\{\bm{F}_{n}(\bm{w}^{*})+2\lambda\bm{I}\}^{-1}\bm{w}^{*},

and the asymptotic variance of 𝒘λn\bm{w}^{n}_{\lambda} is {𝑭n(𝒘)+2λ𝑰}1𝑭n(𝒘){𝑭n(𝒘)+2λ𝑰}1.\{\bm{F}_{n}(\bm{w}^{*})+2\lambda\bm{I}\}^{-1}\bm{F}_{n}(\bm{w}^{*})\{\bm{F}_{n}(\bm{w}^{*})+2\lambda\bm{I}\}^{-1}.
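As a small numerical illustration, these two expressions can be evaluated directly once the Fisher information F_n(w*) is available; the following sketch assumes F_n, the true weights and λ are supplied as inputs and is not part of the analysis itself:

```python
import numpy as np

def ridge_bias_and_variance(F_n, w_star, lam):
    """Asymptotic bias E[w_lambda^n - w*] and covariance of the regularized
    estimator, evaluated from the two expressions above."""
    d = F_n.shape[0]
    A_inv = np.linalg.inv(F_n + 2.0 * lam * np.eye(d))
    bias = -2.0 * lam * (A_inv @ w_star)      # E[w_lambda^n - w*]
    covariance = A_inv @ F_n @ A_inv          # asymptotic covariance matrix
    return bias, covariance
```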

Finally, we show in the next theorem that, given the opportunity to measure infinitely often, the posterior probability of any given neighborhood of w* goes to 1, so that KG discovers the true best alternative. The detailed proof can be found in Appendix 11.3.

Theorem 5.9 (Asymptotic optimality).

For any true parameter value 𝐰d\bm{w}^{*}\in\mathbb{R}^{d} and any normal prior distribution with positive definite covariance matrix, under the knowledge gradient policy, the posterior is consistent and the KG policy is asymptotically optimal: argmax𝐱𝔼KG[σ(𝐰T𝐱)|]=argmax𝐱σ((𝐰)T𝐱)\arg\max_{\bm{x}}\mathbb{E}^{\text{KG}}[\sigma(\bm{w}^{T}\bm{x})|\mathcal{F}^{\infty}]=\arg\max_{\bm{x}}\sigma((\bm{w^{*}})^{T}\bm{x}).

5.4 Knowledge Gradient for contextual bandits

In many applications, the "success" or "failure" can be predicted through an unknown relationship that depends on a partially controllable vector of attributes for each instance. For example, the patient attributes are given and cannot be controlled; we can only choose a medical decision, and the outcome depends on both the patient attributes and the selected medical decision.

This problem is known as a contextual bandit [34, 37, 33]. At each round n, the learner is presented with a context vector c_n (patient attributes) and a set of actions a ∈ A (medical decisions). After choosing an action, we observe an outcome y^{n+1}. The goal is to find a policy that selects actions so that the cumulative reward is as large as possible over time. In contrast, the previous section considers a context-free scenario with the goal of maximizing the probability of success after the offline training phase, so that errors incurred during training are not penalized.

Since very few patients share the same characteristics, it is unlikely that we can rely on a group of patients with identical characteristics to produce statistically reliable measurements of medical outcomes. The contextual bandit formulation provides the potential to use parametric belief models that allow us to learn relationships across a wide range of patients and medical decisions, and to personalize health care based on the patient attributes.

Each past observation is a triplet (c^n, a^n, y^{n+1}). Under Bayesian linear classification belief models, the probability of the binary outcome, p(y=+1|c,a), is predicted through σ(F(c,a)),

F(𝒄,𝒂)=θ+𝜶T𝝋𝒄+𝜷T𝝍𝒂,F(\bm{c},\bm{a})=\theta+\bm{\alpha}^{T}\bm{\varphi_{\bm{c}}}+\bm{\beta}^{T}\bm{\psi_{\bm{a}}}, (21)

where each context c and each action a are represented by feature vectors φ_c and ψ_a, respectively. At each round n, the model updates can be slightly modified to use the observation triplet (c^n, a^n, y^{n+1}) by treating 1||φ_c||ψ_a as the alternative x, where u||v denotes the concatenation of the two vectors u and v.

The knowledge gradient ν𝒂|𝒄KG(s)\nu_{\bm{a}|\bm{c}}^{\text{KG}}(s) of measuring an action 𝒂\bm{a} given context 𝒄\bm{c} can be defined as follows:

Definition 5.10.

The knowledge gradient of measuring an action 𝐚\bm{a} given a context 𝐜\bm{c} while in state ss is

ν𝒂|𝒄KG(s):=𝔼[VN(T(s,1𝝋𝒄𝝍𝒂,y))VN(s)|𝒄,𝒂,s].\nu_{\bm{a}|\bm{c}}^{\text{KG}}(s):=\mathbb{E}\Big{[}V^{N}\Big{(}T\big{(}s,1||\bm{\varphi_{\bm{c}}}||\bm{\psi_{\bm{a}}},y\big{)}\Big{)}-V^{N}(s)|\bm{c},\bm{a},s\Big{]}. (22)

The calculation of ν_{a|c}^{KG}(s) can be adapted from Section 5.2 by replacing x with 1||φ_c||ψ_a throughout.

Since the objective in contextual bandit problems is to maximize the cumulative number of successes, the knowledge gradient policy developed in Section 5.2 for stochastic binary feedback can be easily extended to online learning following the "stop-learning" (SL) policy adopted by [47]. The action X_c^{KG,n}(S^n) chosen by KG at time n, given a context c and a knowledge state S^n, is obtained as:

X𝒄KG,n(Sn)=argmax𝒂p(y=+1|𝒄,𝒂,Sn)+τν𝒂|𝒄KG,n(Sn),X^{\text{KG},n}_{\bm{c}}(S^{n})=\arg\max_{\bm{a}}p(y=+1|\bm{c},\bm{a},S^{n})+\tau\nu_{\bm{a}|\bm{c}}^{\text{KG},n}(S^{n}),

where τ\tau reflects a planning horizon.

It should be noted that if the context is constant over time, X_c^{KG,n}(S^n) degenerates to the context-free multi-armed bandit problem discussed in Eq. (18).
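As a minimal sketch of this selection rule, the following Python fragment treats the predictive probability p(y=+1|x,S^n) and the KG computation of Section 5.2 as given black-box functions; p_success, kg_value and the feature dictionary psi are placeholder names introduced only for illustration:

```python
import numpy as np

def contextual_kg_action(state, phi_c, actions, psi, p_success, kg_value, tau):
    """Choose an action for context features phi_c by maximizing the current
    success probability plus tau times the knowledge gradient value.

    p_success(state, x) and kg_value(state, x) stand in for the predictive
    probability and the KG computation of Section 5.2."""
    best_action, best_score = None, -np.inf
    for a in actions:
        # treat the concatenation 1 || phi_c || psi_a as the alternative x
        x = np.concatenate(([1.0], phi_c, psi[a]))
        score = p_success(state, x) + tau * kg_value(state, x)
        if score > best_score:
            best_action, best_score = a, score
    return best_action
```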

6 Experimental results

In this section, we evaluate the proposed method in offline learning settings where we are not penalized for errors incurred during training and are only concerned with the final recommendation after the offline training phase.

We experiment with both synthetic datasets and datasets from the UCI machine learning repository [38], covering classification problems drawn from settings including sonar, glass identification, blood transfusion, survival, breast cancer (wpbc), planning relax and climate model failure. We first analyze the behavior of the KG policy and then compare it to state-of-the-art learning algorithms. On synthetic datasets, we randomly generate a set of M d-dimensional alternatives x from [-3,3]^d. At each run, the stochastic binary labels are simulated using a (d+1)-dimensional weight vector w* which is sampled from the prior distribution w*_i ~ N(0,λ). The +1 label for each alternative x is generated with probability σ(w*_0 + Σ_{j=1}^d w*_j x_j). For each UCI dataset, we use all the data points as the set of alternatives with their original attributes. We then simulate their labels using a weight vector w*. This weight vector could have been chosen arbitrarily, but it was in fact a perturbed version of the weight vector trained through logistic regression on the original dataset.
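The synthetic data generation described above can be sketched as follows; the sketch interprets N(0,λ) as a Gaussian with variance λ, draws each coordinate of an alternative uniformly from [-3,3], and uses illustrative names only:

```python
import numpy as np

def make_synthetic(M, d, lam, seed=0):
    """Generate M alternatives in [-3, 3]^d and stochastic +1/-1 labels."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-3.0, 3.0, size=(M, d))               # alternatives x
    w_star = rng.normal(0.0, np.sqrt(lam), size=d + 1)    # (d+1)-dim weights, w*_i ~ N(0, lam)
    p_plus = 1.0 / (1.0 + np.exp(-(w_star[0] + X @ w_star[1:])))  # sigma(w*_0 + sum_j w*_j x_j)
    y = np.where(rng.uniform(size=M) < p_plus, 1, -1)      # stochastic binary labels
    return X, w_star, y
```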

6.1 Behavior of the KG policy

Figure 1: The scatter plots illustrate the KG values at 1-4 iterations from left to right with both the color and the size reflecting the magnitude. The star, the red square and pink circle indicate the true best alternative, the alternative to be selected and the implementation decision, respectively.

To better understand the behavior of the KG policy, Figs. 1 and 2 show snapshots of the KG policy at each iteration on a 2-dimensional and a 3-dimensional synthetic dataset, respectively, in one run. Fig. 1 shows the snapshots on a 2-dimensional dataset with 200 alternatives. The scatter plots show the KG values, with both the color and the size of each point reflecting the KG value of the corresponding alternative. The star denotes the true alternative with the largest response. The red square is the alternative with the largest KG value. The pink circle is the implementation decision that maximizes the response under the current estimate of w* if the budget were exhausted after that iteration.

It can be seen from the figure that the KG policy finds the true best alternative after only three measurements, reaching out to different alternatives to improve its estimates. We can infer from Fig. 1 that the KG policy tends to choose alternatives near the boundary of the region. This criterion is natural: in order to find the true maximum, we need enough information about w* to estimate well the probability of points near the true maximum, which lies near the boundary. On the other hand, in a logistic model with labeling noise, a data point x with small x^T x inherently brings little information, as pointed out by [61]. For an extreme example, when x = 0 the label is completely random for any w, since p(y=+1|w,0) ≡ 0.5. This is an issue when perfect classification is not achievable. So it is essential to label data points with larger x^T x, which have the most potential to improve the confidence of the estimate.

Figure 2: Snapshots on a 33-dimensional dataset. The scatter plots illustrate the KG values at 1-10 iterations from left to right, top to bottom. The star, the red square and pink circle indicate the best alternative, the alternative to be selected and the implementation decision.

Fig. 2 illustrates the snapshots of the KG policy on a 3-dimensional synthetic dataset with 300 alternatives. It can be seen that the KG policy finds the true best alternative after only 10 measurements. This set of plots also verifies our statement that the KG policy tends to choose data points near the boundary of the region.

Figure 3: Absolute error between the predictive probability of +1 under the current estimate and the true probability: (a) 2-dimensional dataset; (b) 3-dimensional dataset.

Also depicted in Fig. 3 is the absolute class distribution error of each alternative, which is the absolute difference between the predictive probability of class +1 under the current estimate and the true probability, after 6 iterations for the 2-dimensional dataset and 10 iterations for the 3-dimensional dataset. We see that the probability at the true maximum is well approximated, while moderate error in the estimate is located away from this region of interest.

6.2 Comparison with other policies

Recall that our goal is to maximize the expected response of the implementation decision. We define the Opportunity Cost (OC) metric as the difference between the true maximal response under the weight w* and the response of the implementation decision x^{N+1} := argmax_x p(y=+1|x, w^N):

OC:=max𝒙𝒳p(y=+1|𝒙,𝒘)p(y=+1|𝒙N+1,𝒘).\text{OC}:=\max_{\bm{x}\in\mathcal{X}}p(y=+1|\bm{x},\bm{w}^{*})-p(y=+1|\bm{x}^{N+1},\bm{w}^{*}).

Note that the opportunity cost is always non-negative, and smaller values are better. To make a fair comparison, on each run all the time-N labels of all the alternatives are randomly pre-generated according to the weight vector w* and shared across all the competing policies. We allow each algorithm to sequentially measure N = 30 alternatives.
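A minimal sketch of the OC computation is given below; it assumes each row of X already includes the constant feature so that the success probability is σ(x^T w), and w_hat stands for the estimate w^N at the end of the measurement budget:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def opportunity_cost(X, w_star, w_hat):
    """OC = max_x p(y=+1 | x, w*) - p(y=+1 | x^{N+1}, w*), where x^{N+1} is the
    alternative that maximizes the success probability under the estimate w_hat."""
    p_true = sigmoid(X @ w_star)                     # true success probabilities
    implementation = np.argmax(sigmoid(X @ w_hat))   # implementation decision x^{N+1}
    return p_true.max() - p_true[implementation]     # non-negative, smaller is better
```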

We compare with the following state-of-the-art active learning and Bayesian optimization policies that are compatible with logistic regression: random sampling (Random), a myopic method that selects the most uncertain instance at each step (MostUncertain), discriminative batch-mode active learning (Disc) [23] with batch size equal to 1, expected improvement (EI) [53] with an initial fit of 5 examples, and Thompson sampling (TS) [7]. In addition, since upper confidence bound (UCB) methods are often considered in bandit and optimization problems, we also compare against UCB on the latent function w^T x (UCB) [37] with α tuned to 1. All the state transitions are based on the online Bayesian logistic regression framework developed in Section 4, while different policies provide different rules for the measurement decision at each iteration. The experimental results are shown in Fig. 4. In all the figures, the x-axis denotes the number of measured alternatives and the y-axis represents the opportunity cost averaged over 100 runs.

Figure 4: Opportunity cost on UCI and synthetic datasets: (a) sonar; (b) glass identification; (c) blood transfusion; (d) survival; (e) breast cancer (wpbc); (f) planning relax; (g) climate; (h) synthetic data, d=10; (i) synthetic data, d=15.

Fig. 4 demonstrates that KG outperforms the other policies in most cases, especially in early iterations, without requiring a tuning parameter. As an unbiased selection procedure, random sampling is at least a consistent algorithm, yet it is not suitable for expensive experiments where one needs to learn the most within a small experimental budget. MostUncertain and Disc perform quite well on some datasets and badly on others. A possible explanation is that the goal of active learning is to learn a classifier that accurately predicts the labels of new examples, so their criteria are not directly related to maximizing the probability of success beyond the intent to learn the prediction. After enough iterations, when active learning methods presumably obtain a good estimate of w*, their performance improves. Thompson sampling works quite well in general, as reported in other literature [7]. Yet KG performs better, especially during the early iterations. In the case when an experiment is expensive and only a small budget is allowed, the KG policy, which is designed specifically to maximize the response, is preferred.

We also note that KG works better than EI in most cases, especially in Fig. 4(b), 4(c) and 4(e). Although both KG and EI work with the expected value of information, when EI decides which alternative to measure, it relies on the expected improvement over the current predictive posterior distribution while ignoring the potential change of the posterior distribution resulting from the next (stochastic) outcome y. In comparison, KG considers an additional level of expectation over the random binary outcome y, which has not yet been observed at the time of the decision.

Finally, KG, EI and Thompson sampling outperform the naive use of UCB policies on the latent function w^T x due to the errors in the variance introduced by the nonlinear transformation. At each time step, the posterior of log(p/(1-p)) is approximated as a Gaussian distribution. An upper confidence bound on log(p/(1-p)) does not translate into one on p with binary outcomes. Meanwhile, KG, EI and Thompson sampling make decisions in the underlying binary outcome probability space and find the right balance of exploration and exploitation.

7 Conclusion

Motivated by real-world applications, we consider binary classification problems where we have to sequentially run expensive experiments, forcing us to learn the most from each experiment. With a small budget of measurements, the goal is to learn the classification model as quickly as possible in order to identify the alternative with the highest probability of success. Due to the sequential nature of this problem, we develop a fast online Bayesian linear classifier for general response functions to achieve recursive updates. We propose a knowledge gradient policy using Bayesian linear classification belief models, for which we use different analytical approximation methods to overcome computational challenges. We further extend the knowledge gradient to contextual bandit settings. We provide a finite-time analysis of the estimation error, and show that the maximum likelihood estimator based on the points adaptively sampled by the KG policy is consistent and asymptotically normal. We show furthermore that the knowledge gradient policy is asymptotically optimal. We demonstrate its efficiency through a series of experiments.

References

  • [1] Jean-Yves Audibert and Sébastien Bubeck. Best arm identification in multi-armed bandits. In COLT-23th Conference on Learning Theory-2010, pages 13–p, 2010.
  • [2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47(2-3):235–256, 2002.
  • [3] David Barber and Christopher M Bishop. Ensemble learning for multi-layer networks. Advances in neural information processing systems, pages 395–401, 1998.
  • [4] Christopher M Bishop. Pattern recognition and machine learning, volume 4. Springer New York, 2006.
  • [5] Xavier Boyen and Daphne Koller. Tractable inference for complex stochastic processes. In Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence, pages 33–42. Morgan Kaufmann Publishers Inc., 1998.
  • [6] Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 2012.
  • [7] Olivier Chapelle and Lihong Li. An empirical evaluation of thompson sampling. In Advances in neural information processing systems, pages 2249–2257, 2011.
  • [8] Stephen E Chick. New two-stage and sequential procedures for selecting the best simulated system. Operations Research, 49(5):732–743, 2001.
  • [9] Wei Chu, Lihong Li, Lev Reyzin, and Robert E Schapire. Contextual bandits with linear payoff functions. In International Conference on Artificial Intelligence and Statistics, pages 208–214, 2011.
  • [10] Wei Chu, Martin Zinkevich, Lihong Li, Achint Thomas, and Belle Tseng. Unbiased online active learning in data streams. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 195–203. ACM, 2011.
  • [11] David Roxbee Cox and David Victor Hinkley. Theoretical statistics. CRC Press, 1979.
  • [12] M. H. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.
  • [13] Ludwig Fahrmeir and Heinz Kaufmann. Consistency and asymptotic normality of the maximum likelihood estimator in generalized linear models. The Annals of Statistics, pages 342–368, 1985.
  • [14] Sarah Filippi, Olivier Cappe, Aurélien Garivier, and Csaba Szepesvári. Parametric bandits: The generalized linear case. In Advances in Neural Information Processing Systems, pages 586–594, 2010.
  • [15] Peter Frazier, Warren Powell, and Savas Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS journal on Computing, 21(4):599–613, 2009.
  • [16] Peter I Frazier, Warren B Powell, and Savas Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.
  • [17] David A Freedman. On the asymptotic behavior of bayes’ estimates in the discrete case. The Annals of Mathematical Statistics, pages 1386–1403, 1963.
  • [18] Yoav Freund, H Sebastian Seung, Eli Shamir, and Naftali Tishby. Selective sampling using the query by committee algorithm. Machine learning, 28(2-3):133–168, 1997.
  • [19] Subhashis Ghosal, Jayanta K Ghosh, RV Ramamoorthi, et al. Posterior consistency of dirichlet mixtures in density estimation. Ann. Statist, 27(1):143–158, 1999.
  • [20] Subhashis Ghosal and Anindya Roy. Posterior consistency of gaussian process prior for nonparametric binary regression. The Annals of Statistics, pages 2413–2429, 2006.
  • [21] Christian Gourieroux and Alain Monfort. Asymptotic properties of the maximum likelihood estimator in dichotomous logit models. Journal of Econometrics, 17(1):83–97, 1981.
  • [22] Thore Graepel, Joaquin Q Candela, Thomas Borchert, and Ralf Herbrich. Web-scale bayesian click-through rate prediction for sponsored search advertising in microsoft’s bing search engine. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 13–20, 2010.
  • [23] Yuhong Guo and Dale Schuurmans. Discriminative batch mode active learning. In Advances in neural information processing systems, pages 593–600, 2008.
  • [24] H-M Gutmann. A radial basis function method for global optimization. Journal of Global Optimization, 19(3):201–227, 2001.
  • [25] Shelby J Haberman. The analysis of frequency data, volume 194. University of Chicago Press, Chicago, 1974.
  • [26] Donghai He, Stephen E Chick, and Chun-Hung Chen. Opportunity cost and OCBA selection procedures in ordinal optimization for a fixed number of alternative systems. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, 37(5):951–961, 2007.
  • [27] Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. The Journal of Machine Learning Research, 13(1):1809–1837, 2012.
  • [28] Matthew D Hoffman, Bobak Shahriari, and Nando de Freitas. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In AISTATS, pages 365–374, 2014.
  • [29] David W Hosmer Jr and Stanley Lemeshow. Applied logistic regression. John Wiley & Sons, 2004.
  • [30] Deng Huang, Theodore T Allen, William I Notz, and N Zeng. Global optimization of stochastic black-box systems via sequential kriging meta-models. Journal of global optimization, 34(3):441–466, 2006.
  • [31] Donald R Jones. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21(4):345–383, 2001.
  • [32] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
  • [33] Andreas Krause and Cheng S Ong. Contextual gaussian process bandit optimization. In Advances in Neural Information Processing Systems, pages 2447–2455, 2011.
  • [34] John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. In Advances in Neural Information Processing Systems, 2008.
  • [35] Steffen L Lauritzen. Propagation of probabilities, means, and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.
  • [36] Saskia Le Cessie and Johannes C Van Houwelingen. Ridge estimators in logistic regression. Applied statistics, pages 191–201, 1992.
  • [37] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670. ACM, 2010.
  • [38] M. Lichman. UCI machine learning repository, 2013.
  • [39] Dhruv Kumar Mahajan, Rajeev Rastogi, Charu Tiwari, and Adway Mitra. Logucb: an explore-exploit algorithm for comments recommendation. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 6–15. ACM, 2012.
  • [40] Peter S Maybeck. Stochastic models, estimation, and control, volume 3. Academic press, 1982.
  • [41] Martijn RK Mes, Warren B Powell, and Peter I Frazier. Hierarchical knowledge gradient for sequential sampling. The Journal of Machine Learning Research, 12:2931–2974, 2011.
  • [42] D. C. Montgomery. Design and Analysis of Experiments. John Wiley and Sons, 2008.
  • [43] Diana M Negoescu, Peter I Frazier, and Warren B Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
  • [44] Warren B Powell and Ilya O Ryzhov. Optimal learning. John Wiley & Sons, 2012.
  • [45] Pradeep Ravikumar, Martin J Wainwright, John D Lafferty, et al. High-dimensional ising model selection using ℓ1-regularized logistic regression. The Annals of Statistics, 38(3):1287–1319, 2010.
  • [46] Rommel G Regis and Christine A Shoemaker. Constrained global optimization of expensive black box functions using radial basis functions. Journal of Global Optimization, 31(1):153–171, 2005.
  • [47] Ilya O Ryzhov, Warren B Powell, and Peter I Frazier. The knowledge gradient algorithm for a general class of online learning problems. Operations Research, 60(1):180–195, 2012.
  • [48] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. A bayesian approach to filtering junk e-mail. In Learning for Text Categorization: Papers from the 1998 workshop, volume 62, pages 98–105, 1998.
  • [49] Andrew I Schein and Lyle H Ungar. Active learning for logistic regression: an evaluation. Machine Learning, 68(3):235–265, 2007.
  • [50] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
  • [51] David J Spiegelhalter and Steffen L Lauritzen. Sequential updating of conditional probabilities on directed graphical structures. Networks, 20(5):579–605, 1990.
  • [52] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • [53] Matthew Tesch, Jeff Schneider, and Howie Choset. Expensive function optimization with stochastic binary outcomes. In Proceedings of The 30th International Conference on Machine Learning, pages 1283–1291, 2013.
  • [54] Surya T Tokdar and Jayanta K Ghosh. Posterior consistency of logistic gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42, 2007.
  • [55] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. The Journal of Machine Learning Research, 2:45–66, 2002.
  • [56] Yingfei Wang, Hua Ouyang, Chu Wang, Jianhui Chen, Tsvetan Asamov, and Yi Chang. Efficient ordered combinatorial semi-bandits for whole-page recommendation. In AAAI, pages 2746–2753, 2017.
  • [57] Yingfei Wang, Kristofer G Reyes, Keith A Brown, Chad A Mirkin, and Warren B Powell. Nested-batch-mode learning and stochastic optimization with an application to sequential multistage testing in materials science. SIAM Journal on Scientific Computing, 37(3):B361–B381, 2015.
  • [58] Yingfei Wang, Chu Wang, and Warren Powell. The knowledge gradient for sequential decision making with stochastic binary feedbacks. In Proceedings of the 33rd international conference on Machine learning, 2016.
  • [59] G. B. Wetherill and K. D. Glazebrook. Sequential Methods in Statistics. Chapman and Hall, 1986.
  • [60] Stephen J Wright and Jorge Nocedal. Numerical optimization, volume 2. Springer New York, 1999.
  • [61] Tong Zhang and F Oles. The value of unlabeled data for classification problems. In Proceedings of the Seventeenth International Conference on Machine Learning,(Langley, P., ed.), pages 1191–1198. Citeseer, 2000.

8 Proof of Lemma 2.1

First notice that for any fixed point 𝒙\bm{x},

𝔼N[Pr(y=+1|𝒙,𝒘)]\displaystyle\mathbb{E}_{N}[\text{Pr}(y=+1|\bm{x},\bm{w})] =\displaystyle= 𝔼N[σ(𝒘T𝒙)]\displaystyle\mathbb{E}_{N}[\sigma(\bm{w}^{T}\bm{x})]
=\displaystyle= σ(𝒘T𝒙)Pr(𝒘|𝒟N)d𝒘\displaystyle\int\sigma(\bm{w}^{T}\bm{x})\text{Pr}(\bm{w}|\mathcal{D}^{N})\text{d}\bm{w}
=\displaystyle= Pr(y=+1|𝒙,𝒟N).\displaystyle\text{Pr}(y=+1|\bm{x},\mathcal{D}^{N}).

By the tower property of conditional expectations, and since 𝒙π\bm{x}^{\pi} is N\mathcal{F}^{N} measurable,

𝔼[Pr(y=+1|𝒙π,𝒘)]\displaystyle\mathbb{E}[\text{Pr}(y=+1|\bm{x}^{\pi},\bm{w})] =\displaystyle= 𝔼[σ(𝒘T𝒙π)]\displaystyle\mathbb{E}[\sigma(\bm{w}^{T}\bm{x}^{\pi})]
=\displaystyle= 𝔼𝔼N[σ(𝒘T𝒙π)]\displaystyle\mathbb{E}\mathbb{E}_{N}[\sigma(\bm{w}^{T}\bm{x}^{\pi})]
=\displaystyle= 𝔼[Pr(y=+1|𝒙π,𝒟N)].\displaystyle\mathbb{E}[\text{Pr}(y=+1|\bm{x}^{\pi},\mathcal{D}^{N})].

Then by the definition of 𝒙π\bm{x}^{\pi}, we have Pr(y=+1|𝒙π,𝒟N)=max𝒙Pr[y=+1|𝒙,𝒟N].\text{Pr}(y=+1|\bm{x}^{\pi},\mathcal{D}^{N})=\max_{\bm{x}}\text{Pr}[y=+1|\bm{x},\mathcal{D}^{N}].

9 Proof of Lemma 4.1

Let f(p)=i=1d(mi+ypxiqi)xif(p)=\sum_{i=1}^{d}(m_{i}+yp\frac{x_{i}}{q_{i}})x_{i}. Then

ddpg(p)\displaystyle\frac{\text{d}}{\text{d}{p}}g(p) =\displaystyle= (ϕ(f)2yΦ(yf)2+fϕ(f)Φ(yf))yi=1dxi2qi\displaystyle-\bigg{(}\frac{\phi(f)^{2}}{y\Phi(yf)^{2}}+\frac{f\phi(f)}{\Phi(yf)}\bigg{)}\cdot y\sum_{i=1}^{d}\frac{x_{i}^{2}}{q_{i}}
=\displaystyle= ϕ(f)Φ(yf)2i=1dxi2qi(ϕ(f)+yfΦ(yf)).\displaystyle-\frac{\phi(f)}{\Phi(yf)^{2}}\sum_{i=1}^{d}\frac{x_{i}^{2}}{q_{i}}\bigg{(}\phi(f)+yf\Phi(yf)\bigg{)}.

It is obvious that ϕ(f)Φ(yf)2i=1dxi2qi0\frac{\phi(f)}{\Phi(yf)^{2}}\sum_{i=1}^{d}\frac{x_{i}^{2}}{q_{i}}\geq 0. We only need to show that ϕ(f)+yfΦ(yf)0\phi(f)+yf\Phi(yf)\geq 0. Since the normal density function 𝒩(|0,1)\mathcal{N}(\cdot|0,1) is symmetric with respect to the origin and y{+1,1}y\in\{+1,-1\}, we have ϕ(f)+yfΦ(yf)=ϕ(yf)+yfΦ(yf)\phi(f)+yf\Phi(yf)=\phi(yf)+yf\Phi(yf). Define u(t)=ϕ(t)+tΦ(t)u(t)=\phi(t)+t\Phi(t). Since limtu(t)=0\lim_{t\to-\infty}u(t)=0, we show u(t)0u(t)\geq 0 for t[,]t\in[-\infty,\infty] by proving that u(t)0u^{\prime}(t)\geq 0.

First note that ddtet2/2=tet2/2\frac{\text{d}}{\text{d}t}e^{-t^{2}/2}=-te^{-t^{2}/2}. So we have ϕ(t)=tϕ(t)\phi^{\prime}(t)=-t\phi(t). Thus u(t)=tϕ(t)+Φ(t)+tϕ(t)=Φ(t)0u^{\prime}(t)=-t\phi(t)+\Phi(t)+t\phi(t)=\Phi(t)\geq 0.

10 Proof of Theorem 5.4

For notational simplicity, we use Σ and m instead of Σ^0 and m^0, and use x_i, y_i to denote x^i and y^i in this proof. Consider the objective function f(w) defined as

f(𝒘)=12(𝒘𝒎)T𝚺1(𝒘𝒎)𝔼𝒚(𝒟n,𝒘)[i=1nlogσ(yi𝒘T𝒙i)].f(\bm{w})=\frac{1}{2}(\bm{w}-\bm{m})^{T}\bm{\Sigma}^{-1}(\bm{w}-\bm{m})-\mathbb{E}_{\bm{y}\sim\mathcal{B}(\mathcal{D}^{n},\bm{w}^{*})}\Big{[}\sum_{i=1}^{n}\log\sigma(y_{i}\bm{w}^{T}\bm{x}_{i})\Big{]}.

We use g(𝒘)g(\bm{w}) and h(𝒘)h(\bm{w}) to denote the quadratic term and the rest, respectively. By applying the mean value theorem to h(𝒘)h(\bm{w}), we have

h(𝒘)h(𝒘)=(𝒘𝒘)Th(𝒘)+12(𝒘𝒘)TH(𝒘+η(𝒘𝒘))(𝒘𝒘),h(\bm{w})-h(\bm{w}^{*})=(\bm{w}-\bm{w}^{*})^{T}\nabla h(\bm{w}^{*})+\frac{1}{2}(\bm{w}-\bm{w}^{*})^{T}H\big{(}\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*})\big{)}(\bm{w}-\bm{w}^{*}),

where H is the Hessian of h(w). To analyze the first and second order terms in the expansion, we use a technique similar to that adopted by [45]. For the gradient in the first order term, we have

h(𝒘)=𝔼𝒚(𝒟n,𝒘)[i=1n(1σ(yi𝒘T𝒙i))yi𝒙i]|𝒘=𝒘\displaystyle\nabla h(\bm{w}^{*})=\mathbb{E}_{\bm{y}\sim\mathcal{B}(\mathcal{D}^{n},\bm{w}^{*})}\Big{[}\sum_{i=1}^{n}\big{(}1-\sigma(y_{i}\bm{w}^{T}\bm{x}_{i})\big{)}y_{i}\bm{x}_{i}\Big{]}\bigg{|}_{\bm{w}=\bm{w}^{*}}
=\displaystyle= i=1n((1σ(𝒘T𝒙i))σ((𝒘)T𝒙i)(1σ(𝒘T𝒙i))(1σ((𝒘)T𝒙i)))𝒙i|𝒘=𝒘=0.\displaystyle\sum_{i=1}^{n}\Big{(}\big{(}1-\sigma(\bm{w}^{T}\bm{x}_{i})\big{)}\sigma((\bm{w}^{*})^{T}\bm{x}_{i})-\big{(}1-\sigma(-\bm{w}^{T}\bm{x}_{i})\big{)}\big{(}1-\sigma((\bm{w}^{*})^{T}\bm{x}_{i})\big{)}\Big{)}\bm{x}_{i}\bigg{|}_{\bm{w}=\bm{w}^{*}}=0.

For the second order term, we have

H(𝒘+η(𝒘𝒘))\displaystyle H\big{(}\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*}))
=\displaystyle= 𝔼𝒚(𝒟n,𝒘)(i=1nσ(yi(𝒘+η(𝒘𝒘))T𝒙i)(1σ(yi(𝒘+η(𝒘𝒘))T𝒙i))𝒙i𝒙iT)\displaystyle\mathbb{E}_{\bm{y}\sim\mathcal{B}(\mathcal{D}^{n},\bm{w}^{*})}\Bigg{(}\sum_{i=1}^{n}\sigma(y_{i}(\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*}))^{T}\bm{x}_{i})\Big{(}1-\sigma(y_{i}(\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*}))^{T}\bm{x}_{i})\Big{)}\bm{x}_{i}\bm{x}_{i}^{T}\Bigg{)}
=\displaystyle= i=1nσ((𝒘+η(𝒘𝒘))T𝒙i)(1σ((𝒘+η(𝒘𝒘))T𝒙i))𝒙i𝒙iT\displaystyle\sum_{i=1}^{n}\sigma((\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*}))^{T}\bm{x}_{i})\Big{(}1-\sigma((\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*}))^{T}\bm{x}_{i})\Big{)}\bm{x}_{i}\bm{x}_{i}^{T}
=\displaystyle= i=1nJi(𝒘,η)𝒙i𝒙iT,\displaystyle\sum_{i=1}^{n}J_{i}(\bm{w},\eta)\bm{x}_{i}\bm{x}_{i}^{T},

where Ji(𝒘,η)=σ((𝒘+η(𝒘𝒘))T𝒙i)(1σ((𝒘+η(𝒘𝒘))T𝒙i))J_{i}(\bm{w},\eta)=\sigma((\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*}))^{T}\bm{x}_{i})\Big{(}1-\sigma((\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*}))^{T}\bm{x}_{i})\Big{)}.

We expand Ji(𝒘,η)J_{i}(\bm{w},\eta) to its first order and use mean value theorem again,

|Ji(𝒘,η)Ji(𝒘,η)|\displaystyle|J_{i}(\bm{w},\eta)-J_{i}(\bm{w}^{*},\eta)| =\displaystyle= |η(𝒘𝒘)T𝒙iJi|\displaystyle|\eta(\bm{w}-\bm{w}^{*})^{T}\bm{x}_{i}J_{i}^{\prime}|
\displaystyle\leq |σ(1σ)(12σ)||(𝒘𝒘)T𝒙i|\displaystyle|\sigma(1-\sigma)(1-2\sigma)||(\bm{w}-\bm{w}^{*})^{T}\bm{x}_{i}|
\displaystyle\leq 𝒘𝒘2,\displaystyle\|\bm{w}-\bm{w}^{*}\|_{2},

where we omit the dependence of σ\sigma on (𝒘+η(𝒘𝒘))T𝒙i(\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*}))^{T}\bm{x}_{i} for simplicity and use the fact σ(0,1)\sigma\in(0,1).

Combining the first order and second order analysis and denoting 𝒘𝒘2\|\bm{w}-\bm{w}^{*}\|_{2} as RR, we have

h(\bm{w})-h(\bm{w}^{*}) \geq \frac{1}{2}\|\bm{w}-\bm{w}^{*}\|_{2}^{2}\,\lambda_{min}\Big(H\big(\bm{w}^{*}+\eta(\bm{w}-\bm{w}^{*})\big)\Big)
\geq \frac{1}{2}R^{2}\lambda_{min}\Big(\sum_{i=1}^{n}\big[\sigma(\bm{x}_{i}^{T}\bm{w}^{*})\big(1-\sigma(\bm{x}_{i}^{T}\bm{w}^{*})\big)-R\big]\bm{x}_{i}\bm{x}_{i}^{T}\Big)
\geq \frac{1}{2}R^{2}\lambda_{min}\Big(\sum_{i=1}^{n}\sigma(\bm{x}_{i}^{T}\bm{w}^{*})\big(1-\sigma(\bm{x}_{i}^{T}\bm{w}^{*})\big)\bm{x}_{i}\bm{x}_{i}^{T}\Big)-\frac{1}{2}R^{3},

where we use the fact that 𝒙i𝒙iT\bm{x}_{i}\bm{x}_{i}^{T} is positive semi-definite and 𝒙i21\|\bm{x}_{i}\|_{2}\leq 1. By using CminC_{min} to denote the minimal eigenvalue in the last inequality, we have

h(𝒘)h(𝒘)12CminR212R3.h(\bm{w})-h(\bm{w}^{*})\geq\frac{1}{2}C_{min}R^{2}-\frac{1}{2}R^{3}.

On the other hand, for the quadratic term g(𝒘)g(\bm{w}), we have

g(𝒘)g(𝒘)\displaystyle g(\bm{w})-g(\bm{w}^{*}) =\displaystyle= (𝒘𝒘)T𝚺1(𝒘𝒎)+12(𝒘𝒘)T𝚺1(𝒘𝒘)\displaystyle(\bm{w}-\bm{w}^{*})^{T}\bm{\Sigma}^{-1}(\bm{w}^{*}-\bm{m})+\frac{1}{2}(\bm{w}-\bm{w}^{*})^{T}\bm{\Sigma}^{-1}(\bm{w}-\bm{w}^{*})
\displaystyle\geq DRλmax(𝚺1)+12λmin(𝚺1)R2,\displaystyle-DR\sqrt{\lambda_{max}\big{(}\bm{\Sigma}^{-1}\big{)}}+\frac{1}{2}\lambda_{min}\big{(}\bm{\Sigma}^{-1}\big{)}R^{2},

where D=\|\bm{\Sigma}^{-\frac{1}{2}}(\bm{w}^{*}-\bm{m})\|_{2}.

Now define function F(Δ)F(\Delta) as F(Δ)=f(𝒘+Δ)f(𝒘)F(\Delta)=f(\bm{w}^{*}+\Delta)-f(\bm{w}^{*}), then we have

F(\Delta) = g(\bm{w})-g(\bm{w}^{*})+h(\bm{w})-h(\bm{w}^{*}) \geq -\sqrt{\lambda_{max}\big(\bm{\Sigma}^{-1}\big)}\,DR+\frac{1}{2}\lambda_{min}\big(\bm{\Sigma}^{-1}\big)R^{2}+\frac{1}{2}C_{min}R^{2}-\frac{1}{2}R^{3}, \qquad (23)

where \bm{w}=\bm{w}^{*}+\Delta.

From now on we use the simplified symbols λ_max and λ_min instead of λ_max(Σ^{-1}) and λ_min(Σ^{-1}). We consider the sign of F(Δ) in the case when ||Δ||_2 = (1/2)(C_min + λ_min) and

D116λmin2λmax.D\leq\frac{1}{16}\frac{\lambda_{min}^{2}}{\sqrt{\lambda_{max}}}. (24)

Based on the analysis in (23), we have

F(Δ)\displaystyle F(\Delta) \displaystyle\geq 116λmin2R+12λminR2+12CminR212R3\displaystyle-\frac{1}{16}\lambda_{min}^{2}R+\frac{1}{2}\lambda_{min}R^{2}+\frac{1}{2}C_{min}R^{2}-\frac{1}{2}R^{3}
=\displaystyle= R2(R2+(λmin+Cmin)R18λmin2)\displaystyle\frac{R}{2}\left(-R^{2}+(\lambda_{min}+C_{min})R-\frac{1}{8}\lambda_{min}^{2}\right)
=\displaystyle= R2(R218λmin2)>λmin332.\displaystyle\frac{R}{2}\left(R^{2}-\frac{1}{8}\lambda_{min}^{2}\right)>\frac{\lambda_{min}^{3}}{32}.

Notice that F(0)=0F(0)=0 and recall that 𝒘n\bm{w}^{n} minimizes f(𝒘)f(\bm{w}) so we have

F(𝒘n𝒘)=f(𝒘n)f(𝒘)0.F(\bm{w}^{n}-\bm{w}^{*})=f(\bm{w}^{n})-f(\bm{w}^{*})\leq 0.

Then, based on the convexity of F, we know that ||w^n - w*||_2 ≤ (1/2)(C_min + λ_min); otherwise the values of F at 0, at w^n - w*, and at the intersection of the ball ||Δ||_2 = (1/2)(C_min + λ_min) with the line segment from 0 to w^n - w* would form a concave pattern, which is a contradiction. Similarly, based on the assumption that f(ŵ^n) ≤ min_w f(w) + λ_min^3/32, we have

F(𝒘^n𝒘)=f(𝒘^n)f(𝒘)λmin332.F(\hat{\bm{w}}^{n}-\bm{w}^{*})=f(\hat{\bm{w}}^{n})-f(\bm{w}^{*})\leq\frac{\lambda_{min}^{3}}{32}.

Recall that F(0) = 0 and that F(Δ) > λ_min^3/32 for ||Δ||_2 = (1/2)(C_min + λ_min); the convexity of F then also ensures that ||ŵ^n - w*||_2 ≤ (1/2)(C_min + λ_min).

Now we start to calculate the probability for (24) to hold. Recall that 𝒘\bm{w}^{*} has a prior distribution 𝒘𝒩(𝒎,𝚺)\bm{w}^{*}\sim\mathcal{N}(\bm{m},\bm{\Sigma}). Then by denoting the right hand side of Eq. (24) as MM, we have

Pr(D116λmin2λmax)\displaystyle\text{Pr}\Big{(}D\leq\frac{1}{16}\frac{\lambda_{min}^{2}}{\sqrt{\lambda_{max}}}\Big{)}
=\displaystyle= 𝚺12(𝒂𝒎)M(2π)d2|𝚺|12exp(12(𝒂𝒎)T𝚺1(𝒂𝒎))d𝒂\displaystyle\int_{\|\bm{\Sigma}^{-\frac{1}{2}}(\bm{a}-\bm{m})\|\leq M}(2\pi)^{-\frac{d}{2}}|\bm{\Sigma}|^{-\frac{1}{2}}\exp\big{(}-\frac{1}{2}(\bm{a}-\bm{m})^{T}\bm{\Sigma}^{-1}(\bm{a}-\bm{m})\big{)}\mathrm{d}\bm{a}
=\displaystyle= 𝒃2M(2π)d2exp(12𝒃T𝒃)d𝒃,\displaystyle\int_{\|\bm{b}\|_{2}\leq M}(2\pi)^{-\frac{d}{2}}\exp\big{(}-\frac{1}{2}\bm{b}^{T}\bm{b}\big{)}\mathrm{d}\bm{b},

which is the probability that a d-dimensional standard normal random variable falls in the ball of radius M, denoted P_d(M). This completes the proof.
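Since the squared norm of a d-dimensional standard normal vector follows a chi-square distribution with d degrees of freedom, P_d(M) can be evaluated numerically; the following sketch uses SciPy purely for illustration:

```python
from scipy.stats import chi2

def P_d(M, d):
    """P(||b||_2 <= M) for b ~ N(0, I_d), i.e. P(chi-square_d <= M^2)."""
    return chi2.cdf(M ** 2, df=d)
```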

11 Proofs of Asymptotic Optimality

In this section, we provide detailed proofs of all the asymptotic optimality results in Section 5.3.

11.1 Proof of Proposition 5.5

We use q(𝒘)q(\bm{w}) to denote the predictive distribution under the state ss and use p(𝒘|s,𝒙,y𝒙)p(\bm{w}|s,\bm{x},y_{\bm{x}}) to denote the posterior distribution after we observe the outcome of 𝒙\bm{x} to be yy. By Jensen’s inequality, we have

μ𝒙KG(s)\displaystyle\mu_{\bm{x}}^{\text{KG}}(s) =\displaystyle= 𝔼[max𝒙p(y=+1|𝒙,T(s,𝒙,y𝒙))]max𝒙p(y=+1|𝒙,s)\displaystyle\mathbb{E}\Big{[}\max_{\bm{x}^{\prime}}p(y=+1|\bm{x}^{\prime},T(s,\bm{x},y_{\bm{x}}))\Big{]}-\max_{\bm{x}^{\prime}}p(y=+1|\bm{x}^{\prime},s)
\displaystyle\geq max𝒙𝔼[p(y=+1|𝒙,T(s,𝒙,y𝒙))]max𝒙p(y=+1|𝒙,s).\displaystyle\max_{\bm{x}^{\prime}}\mathbb{E}\big{[}p(y=+1|\bm{x}^{\prime},T(s,\bm{x},y_{\bm{x}}))\big{]}-\max_{\bm{x}^{\prime}}p(y=+1|\bm{x}^{\prime},s).

We then show that E[p(y=+1|x', T(s,x,y_x))] = p(y=+1|x', s) for any x, x' and s, which leads to μ_x^{KG}(s) ≥ 0. Since y_x is Bernoulli distributed with success probability p(y_x=+1|x,s), we have

𝔼[p(y=+1|𝒙,T(s,𝒙,y𝒙))]\displaystyle\mathbb{E}\big{[}p(y=+1|\bm{x}^{\prime},T(s,\bm{x},y_{\bm{x}}))\big{]}
=\displaystyle= p(y𝒙=+1|𝒙,s)p(y=+1|𝒙,T(s,𝒙,+1))\displaystyle p(y_{\bm{x}}=+1|\bm{x},s)p(y=+1|\bm{x}^{\prime},T(s,\bm{x},+1))
+\displaystyle+ (1p(y𝒙=+1|𝒙,s))p(y=+1|𝒙,T(s,𝒙,1)).\displaystyle(1-p(y_{\bm{x}}=+1|\bm{x},s))p(y=+1|\bm{x}^{\prime},T(s,\bm{x},-1)).

Recall that p(y=+1|𝒙,s)=σ(𝒘T𝒙)p(𝒘|s)d𝒘.p(y=+1|\bm{x},s)=\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}. By Bayes’ Theorem, the posterior distribution in the updated state T(s,𝒙,yx)T(s,\bm{x},y_{x}) becomes

p(𝒘|T(s,𝒙,+1))=σ((𝒘)T𝒙)p(𝒘|s)σ(𝒘T𝒙)p(𝒘|s)d𝒘,p(\bm{w}^{\prime}|T(s,\bm{x},+1))=\frac{\sigma((\bm{w}^{\prime})^{T}\bm{x})p(\bm{w}^{\prime}|s)}{\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}},

and

p(𝒘|T(s,𝒙,1))=(1σ((𝒘)T𝒙))p(𝒘|s)(1σ(𝒘T𝒙))p(𝒘|s)d𝒘.p(\bm{w}^{\prime}|T(s,\bm{x},-1))=\frac{\left(1-\sigma((\bm{w}^{\prime})^{T}\bm{x})\right)p(\bm{w}^{\prime}|s)}{\int\left(1-\sigma(\bm{w}^{T}\bm{x})\right)p(\bm{w}|s)\text{d}\bm{w}}.

Notice that

p(y=+1|𝒙,T(s,𝒙,+1))\displaystyle p(y=+1|\bm{x}^{\prime},T(s,\bm{x},+1)) =\displaystyle= σ((𝒘)T𝒙)p(𝒘|T(s,𝒙,+1))d𝒘\displaystyle\int\sigma((\bm{w}^{\prime})^{T}\bm{x}^{\prime})p(\bm{w}^{\prime}|T(s,\bm{x},+1))\text{d}\bm{w}^{\prime}
=\displaystyle= σ((𝒘)T𝒙)σ((𝒘)T𝒙)p(𝒘|s)σ(𝒘T𝒙)p(𝒘|s)d𝒘d𝒘\displaystyle\int\sigma((\bm{w}^{\prime})^{T}\bm{x}^{\prime})\frac{\sigma((\bm{w}^{\prime})^{T}\bm{x})p(\bm{w}^{\prime}|s)}{\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}}\text{d}\bm{w}^{\prime}
=\displaystyle= σ((𝒘)T𝒙)σ((𝒘)T𝒙)p(𝒘|s)d𝒘σ(𝒘T𝒙)p(𝒘|s)d𝒘,\displaystyle\frac{\int\sigma((\bm{w}^{\prime})^{T}\bm{x}^{\prime})\sigma((\bm{w}^{\prime})^{T}\bm{x})p(\bm{w}^{\prime}|s)\text{d}\bm{w}^{\prime}}{\int\sigma(\bm{w}^{T}\bm{x})p(\bm{w}|s)\text{d}\bm{w}},

and similarly, we have

p(y=+1|𝒙,T(s,𝒙,1))=σ((𝒘)T𝒙)(1σ((𝒘)T𝒙))p(𝒘|s)d𝒘(1σ(𝒘T𝒙))p(𝒘|s)d𝒘.p(y=+1|\bm{x}^{\prime},T(s,\bm{x},-1))=\frac{\int\sigma((\bm{w}^{\prime})^{T}\bm{x}^{\prime})(1-\sigma((\bm{w}^{\prime})^{T}\bm{x}))p(\bm{w}^{\prime}|s)\text{d}\bm{w}^{\prime}}{\int(1-\sigma(\bm{w}^{T}\bm{x}))p(\bm{w}|s)\text{d}\bm{w}}.

Therefore,

𝔼[p(y=+1|𝒙,T(s,𝒙,y𝒙))]\displaystyle\mathbb{E}\big{[}p(y=+1|\bm{x}^{\prime},T(s,\bm{x},y_{\bm{x}}))\big{]}
=\displaystyle= p(y=+1|s,𝒙)p(y=+1|𝒙,T(s,𝒙,+1))+p(y=1|s,𝒙)p(y=+1|𝒙,T(s,𝒙,1))\displaystyle p(y=+1|s,\bm{x})p(y=+1|\bm{x}^{\prime},T(s,\bm{x},+1))+p(y=-1|s,\bm{x})p(y=+1|\bm{x}^{\prime},T(s,\bm{x},-1))
=\displaystyle= σ((𝒘)T𝒙)σ((𝒘)T𝒙)p(𝒘|s)d𝒘+σ((𝒘)T𝒙)(1σ((𝒘)T𝒙))p(𝒘|s)d𝒘\displaystyle\int\sigma((\bm{w}^{\prime})^{T}\bm{x}^{\prime})\sigma((\bm{w}^{\prime})^{T}\bm{x})p(\bm{w}^{\prime}|s)\text{d}\bm{w}^{\prime}+\int\sigma((\bm{w}^{\prime})^{T}\bm{x}^{\prime})(1-\sigma((\bm{w}^{\prime})^{T}\bm{x}))p(\bm{w}^{\prime}|s)\text{d}\bm{w}^{\prime}
=\displaystyle= σ((𝒘)T𝒙)p(𝒘|s)d𝒘\displaystyle\int\sigma((\bm{w}^{\prime})^{T}\bm{x}^{\prime})p(\bm{w}^{\prime}|s)\text{d}\bm{w}^{\prime}
=\displaystyle= p(y=+1|s,𝒙),\displaystyle p(y=+1|s,\bm{x}^{\prime}),

and thus we obtain

𝔼[p(y=+1|𝒙,T(s,𝒙,y𝒙))]=p(y=+1|s,𝒙).\mathbb{E}\big{[}p(y=+1|\bm{x}^{\prime},T(s,\bm{x},y_{\bm{x}}))\big{]}=p(y=+1|s,\bm{x}^{\prime}).
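The cancellation above can also be checked numerically with a particle (Monte Carlo) representation of p(w|s); the sketch below uses a standard normal belief and two arbitrary alternatives purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = rng.normal(size=(100_000, 2))                 # particles approximating p(w | s)
x = np.array([1.0, 0.5])                          # measured alternative
x_prime = np.array([-0.3, 2.0])                   # alternative whose probability we track

p_x = sigmoid(W @ x)                              # sigma(w^T x) per particle
p_xp = sigmoid(W @ x_prime)                       # sigma(w^T x') per particle

prior_pred = p_xp.mean()                          # p(y=+1 | x', s)
post_plus = (p_xp * p_x).mean() / p_x.mean()      # p(y=+1 | x', T(s, x, +1))
post_minus = (p_xp * (1 - p_x)).mean() / (1 - p_x).mean()  # p(y=+1 | x', T(s, x, -1))

mixture = p_x.mean() * post_plus + (1 - p_x.mean()) * post_minus
print(prior_pred, mixture)                        # identical up to floating point
```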

11.2 Proof of Proposition 5.7

The proof is similar to that of [15], with additional steps for Bernoulli distributed random variables. Let G be the sigma-algebra generated by the collection {ŷ^{n+1} 1_{x^n = x}}. If the policy π measures alternative x infinitely often, this collection contains an infinite sequence of independent random variables with common Bernoulli distribution with mean σ(w^T x), so the strong law of large numbers implies that σ(w^T x) is G-measurable. Since G ⊆ F^∞, σ(w^T x) is also F^∞-measurable. Let U be a uniform random variable on [0,1]. Then the Bernoulli random variable y_x can be rewritten as 1_{U ≤ σ(w^T x)}. Since U is independent of F^∞ and of the σ-algebra generated by σ(w^T x), it follows from the properties of conditional expectations that

𝔼[σ(𝒘T𝒙)|,𝟏Uσ(𝒘T𝒙)]=𝔼[σ(𝒘T𝒙)|].\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x}^{\prime})|\mathcal{F}^{\infty},\bm{1}_{U\leq\sigma(\bm{w}^{T}\bm{x})}\big{]}=\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x}^{\prime})|\mathcal{F}^{\infty}\big{]}.

We next show that the knowledge gradient value of measuring alternative 𝒙\bm{x} is zero by substituting this relation into the definition of the knowledge gradient. We have

ν𝒙()\displaystyle\nu_{\bm{x}}(\mathcal{F}^{\infty}) =\displaystyle= 𝔼[max𝒙𝔼[σ(𝒘T𝒙)|,𝟏Uσ(𝒘T𝒙)]|]max𝒙𝔼[σ(𝒘T𝒙)|]\displaystyle\mathbb{E}\bigg{[}\max_{\bm{x}^{\prime}}\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x}^{\prime})|\mathcal{F}^{\infty},\bm{1}_{U\leq\sigma(\bm{w}^{T}\bm{x})}\big{]}|\mathcal{F}^{\infty}\bigg{]}-\max_{\bm{x}^{\prime}}\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x}^{\prime})|\mathcal{F}^{\infty}\big{]}
=\displaystyle= 𝔼[max𝒙𝔼[σ(𝒘T𝒙)|]|]max𝒙𝔼[σ(𝒘T𝒙)|]\displaystyle\mathbb{E}\bigg{[}\max_{\bm{x}^{\prime}}\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x}^{\prime})|\mathcal{F}^{\infty}\big{]}|\mathcal{F}^{\infty}\bigg{]}-\max_{\bm{x}^{\prime}}\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x}^{\prime})|\mathcal{F}^{\infty}\big{]}
=\displaystyle= max𝒙𝔼[σ(𝒘T𝒙)|]max𝒙𝔼[σ(𝒘T𝒙)|]\displaystyle\max_{\bm{x}^{\prime}}\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x}^{\prime})|\mathcal{F}^{\infty}\big{]}-\max_{\bm{x}^{\prime}}\mathbb{E}\big{[}\sigma(\bm{w}^{T}\bm{x}^{\prime})|\mathcal{F}^{\infty}\big{]}
=\displaystyle= 0.\displaystyle 0.

11.3 Proof of Theorem 5.9: Consistency of the KG Policy

It has been established that, almost surely, the knowledge gradient policy reaches a state (at some time T) in which the KG values of all the alternatives are smaller than ε, after which the probability of selecting each alternative is 1/M, where M is the number of alternatives. Notice that in each round the KG policy first picks one of the M alternatives, and then a feedback of either +1 or -1 is observed. Equivalently, we can interpret these two steps as drawing one of the 2M possible outcomes from the set Ỹ := {(0, …, +1/-1, …, 0)}, where each ỹ ∈ Ỹ is an M-dimensional vector with exactly one nonzero element, equal to +1 or -1. It should be noted that changing the feedback scheme in this way does not affect the Bayesian update equations, because the likelihood function and the normalization factor in the posterior are both multiplied by a factor of 1/M. On the other hand, this combined feedback scheme makes it possible to treat the measurements (x^n, y^n) as i.i.d. samples in Ỹ.

Define the K-L neighborhood as K_ε(u) = {v : KL(u,v) < ε}, where the K-L divergence is defined as KL(u,v) := ∫ v log(v/u). Since the prior distribution is Gaussian with a positive definite covariance matrix, and the likelihood function is the sigmoid function, which takes only positive values, the posterior probability of the K-L neighborhood of w* after time T is positive. Based on standard results on the consistency of Bayes' estimates [20, 19, 54, 17], the posterior is weakly consistent at w*, in the sense that for any neighborhood U of w*, the posterior probability of U converges to 1:

[U|y~1,y~2,,y~n]1.\mathbb{P}[U|\tilde{y}^{1},\tilde{y}^{2},\dots,\tilde{y}^{n}]\rightarrow 1.

Without loss of generality, assume that the alternative x* with the largest probability of +1 is unique, which means σ((x*)^T w*) > σ(x^T w*) for any alternative x other than x*. Then we can pick U to be a neighborhood of w* such that σ((x*)^T w) > σ(x^T w) holds for every such x and every w ∈ U. The neighborhood U exists because we only have a finite number of alternatives. From the consistency result, the probability that the best arm under the posterior estimate is the true best alternative goes to 1 as the measurement budget goes to infinity.