
Learning to Actively Learn: A Robust Approach

Jifan Zhang
Department of Computer Science
University of Wisconsin
Madison, WI, USA
jifan@cs.wisc.edu
Lalit Jain
Foster School of Business
University of Washington
Seattle, WA, USA
lalitj@uw.edu
Kevin Jamieson
Allen School of Computer Science & Engineering
University of Washington
Seattle, WA, USA
jamieson@cs.washington.edu
Abstract

This work proposes a procedure for designing algorithms for specific adaptive data collection tasks like active learning and pure-exploration multi-armed bandits. Unlike the design of traditional adaptive algorithms that rely on concentration of measure and careful analysis to justify the correctness and sample complexity of the procedure, our adaptive algorithm is learned via adversarial training over equivalence classes of problems derived from information theoretic lower bounds. In particular, a single adaptive learning algorithm is learned that competes with the best adaptive algorithm learned for each equivalence class. Our procedure takes as input just the available queries, set of hypotheses, loss function, and total query budget. This is in contrast to existing meta-learning work that learns an adaptive algorithm relative to an explicit, user-defined subset or prior distribution over problems, which can be challenging to define and may be mismatched to the instance encountered at test time. This work is particularly focused on the regime when the total query budget is very small, such as a few dozen, which is much smaller than the budgets typically considered by theoretically derived algorithms. We perform synthetic experiments to justify the stability and effectiveness of the training procedure, and then evaluate the method on tasks derived from real data including a noisy 20 Questions game and a joke recommendation task.

1 Introduction

Closed-loop learning algorithms use previous observations to inform what measurements to take next in a closed loop in order to accomplish inference tasks far faster than any fixed measurement plan set in advance. For example, active learning algorithms for binary classification have been proposed that under favorable conditions require exponentially fewer labels than passive, random sampling to identify the optimal classifier (Hanneke et al., 2014; Katz-Samuels et al., 2021). And in the multi-armed bandits literature, adaptive sampling techniques have demonstrated the ability to identify the “best arm” that optimizes some metric with far fewer experiments than a fixed design (Garivier & Kaufmann, 2016; Fiez et al., 2019). Unfortunately, such guarantees often either require simplifying assumptions that limit robustness and applicability, or algorithmic use of concentration inequalities that are very loose unless the number of samples is very large.

This work proposes a framework for producing algorithms that are learned through simulated experience to be as effective and robust as possible, even on a tiny measurement budget (e.g., 20 queries) where most theoretical guarantees do not apply. Our work fits into a recent trend sometimes referred to as learning to actively learn and differentiable meta-learning in bandits (Konyushkova et al., 2017; Bachman et al., 2017; Fang et al., 2017; Boutilier et al., 2020; Kveton et al., 2020), which tune existing algorithms or learn entirely new active learning algorithms by policy optimization. Previous works in this area learn a policy by optimizing with respect to data observed through prior experience (e.g., meta-learning or transfer learning) or an assumed explicit prior distribution of problem parameters (e.g., a Gaussian prior over the true weight vector for linear regression). In contrast, our approach makes no assumptions about what parameters are likely to be encountered at test time, and therefore produces algorithms that do not suffer from mismatched priors at test time. Instead, our method learns a policy that attempts to mirror the guarantees of frequentist algorithms with instance-dependent sample complexities: there is an intrinsic difficulty measure that orders problem instances, and given a fixed budget, higher accuracy can be obtained on easier instances than on harder ones. This difficulty measure is most naturally derived from information theoretic lower bounds.

But unlike information theoretic bounds that hand-craft adversarial instances, inspired by the robust reinforcement learning literature, we formulate a novel adversarial training objective that automatically trains minimax policies, and we propose a tractable and computationally efficient relaxation. This allows our learned policies to be very aggressive while maintaining robustness over the difficulty of problem instances, without resorting to loose concentration inequalities inside the algorithm. Indeed, this work is particularly useful in the setting where relatively few rounds of querying can be made. The learning framework is general enough to be applied to many active learning settings of interest and is intended to be used to produce robust and high performing algorithms. We implement the framework for the pure-exploration combinatorial bandit problem, a paradigm including problems such as active binary classification and the 20 question game. We empirically validate our framework on a simple synthetic experiment before turning our attention to datasets derived from real data, including a noisy 20 Questions game and a joke recommendation task, which are also embedded as combinatorial bandits. As demonstrated in our experiments, in the low budget setting, our learned algorithms are the only ones that both enjoy robustness guarantees (as opposed to greedy and existing learning to actively learn methods) and perform non-vacuously and instance-optimally (as opposed to statistically justified algorithms).

2 Proposed Framework for Robust Learning to Actively Learn

From a bird's-eye perspective, whether learned or defined by an expert, any algorithm for active learning can be thought of as a policy in the sense of reinforcement learning. To be precise, at time $t$, based on an internal state $s_t$, the policy $\pi$ defines a distribution $\pi(s_t)$ over the set of potential actions $\mathcal{X}$. It then takes an action $x_t \in \mathcal{X}$, $x_t \sim \pi(s_t)$, receives an observation $y_t$, updates the state, and the process repeats.

Fix a horizon $T \in \mathbb{N}$ and a problem instance $\theta_* \in \Theta \subseteq \mathbb{R}^d$ which parameterizes the observation distribution. For $t = 1, 2, \dots, T$:

  • the state $s_t \in \mathcal{S}$ is a function of the history $\{(x_i, y_i)\}_{i=1}^{t-1}$,

  • the action $x_t \in \mathcal{X}$ is drawn at random from the distribution $\pi(s_t)$ defined over $\mathcal{X}$, and

  • the next state $s_{t+1} \in \mathcal{S}$ is constructed by taking action $x_t$ in state $s_t$ and observing $y_t \sim f(\cdot \mid \theta_*, s_t, x_t)$,

until the game terminates at time $t = T$ and the learner receives a loss $L_T$ which is task specific. Note that $L_T$ is a random variable that depends on the tuple $(\pi, \{(x_i, y_i)\}_{i=1}^T, \theta_*)$. We assume that $f$ is a known parametric distribution to the policy but the parameter $\theta_*$ is unknown to the policy. Let $\mathbb{P}_{\pi,\theta}, \mathbb{E}_{\pi,\theta}$ denote the probability and expectation under the probability law induced by executing policy $\pi$ in the game with $\theta_* = \theta$ to completion. Note that $\mathbb{P}_{\pi,\theta}$ includes any internal randomness of the policy $\pi$ and the random observations $y_t \sim f(\cdot \mid \theta, s_t, x_t)$. Thus, $\mathbb{P}_{\pi,\theta}$ assigns a probability to any trajectory $\{(x_i, y_i)\}_{i=1}^T$. For a given policy $\pi$ and $\theta_* = \theta$, the metric of interest we wish to minimize is the expected loss $\ell(\pi, \theta) := \mathbb{E}_{\pi,\theta}[L_T]$, where $L_T$ as defined above is the loss observed at the end of the episode. For a fixed policy $\pi$, $\ell(\pi, \theta)$ defines a loss surface over all possible values of $\theta$. This loss surface captures the fact that some values of $\theta$ are just intrinsically harder than others, but also that a policy may be better suited for some values of $\theta$ than others.

Finally, we assume we are equipped with a positive function $\mathcal{C} : \Theta \rightarrow (0, \infty)$ that assigns a score to each $\theta \in \Theta$, intuitively capturing the "difficulty" of a particular $\theta$, and which can be used as a partial ordering of $\Theta$. Ideally, $\mathcal{C}(\theta)$ is a monotonic transformation of $\ell(\pi^*, \theta)$ for some "best" policy $\pi^*$ that we will define shortly. Our plan is as follows: in Section 2.1, we ground the discussion and describe $\mathcal{C}(\theta)$ for the combinatorial bandit problem. Then in Section 2.2, we zoom out to define our main objective of finding a min-gap optimal policy, finally providing an adversarial training approach in Section 3.

2.1 Complexity for Combinatorial Bandits

A concrete example of the framework above is the combinatorial bandit problem. The learner has access to sets $\mathcal{X} = \{e_1, \dots, e_d\} \subset \mathbb{R}^d$, where $e_i$ is the $i$-th standard basis vector, and $\mathcal{Z} \subset \{0,1\}^d$. In each round the learner chooses an $x_t \in \mathcal{X}$ according to a policy $\pi(\{(x_i, y_i)\}_{i=1}^{t-1})$ and observes $y_t$ with $\mathbb{E}[y_t \mid x_t, \theta_*] = \langle x_t, \theta_* \rangle$ for some unknown $\theta_* \in \mathbb{R}^d$. The goal of the learner is Best Arm Identification. Denote $z_*(\theta_*) = \operatorname{arg\,max}_{z \in \mathcal{Z}} \langle z, \theta_* \rangle$; at time $T$ the learner outputs a recommendation $\widehat{z}$ and incurs loss $L_{BAI,T} = \mathbf{1}\{z_* \neq \widehat{z}\}$. This setting naturally captures the 20 question game. Indeed, assume there are $d \gg T = 20$ potential yes/no questions that can be asked at each time, corresponding to the elements of $\mathcal{X}$, and that each element of $\mathcal{Z}$ is a binary vector representing the answers to these questions for a given item. If answers $y_t$ are deterministic then $\theta_* \in \{-1, 1\}^d$, but this framework also captures the case $\theta_* \in [-1, 1]^d$ when answers are stochastic, or answered incorrectly with some probability. A policy $\pi$ then decides at each time which question to ask, based on the answers so far, in order to determine the item closest to the unknown vector $\theta_*$.
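To make the interaction protocol concrete, here is a minimal Python sketch of such a combinatorial bandit environment in the noisy 20 Questions view. The class and variable names (CombinatorialBanditEnv, Z, theta) are ours for illustration, not from the paper's implementation.

```python
import numpy as np

# A minimal sketch of the combinatorial bandit / noisy 20 Questions environment
# described above: X = {e_1, ..., e_d} are the questions, each row of Z is an
# item's binary answer vector, and theta in [-1, 1]^d gives the expected answers.
class CombinatorialBanditEnv:
    def __init__(self, theta, Z, T=20, rng=None):
        self.theta = np.asarray(theta, dtype=float)
        self.Z = np.asarray(Z)
        self.T = T
        self.rng = np.random.default_rng() if rng is None else rng

    def step(self, i):
        # Ask question i; observe y in {-1, +1} with E[y] = theta_i.
        return 1 if self.rng.random() < (self.theta[i] + 1) / 2 else -1

    def loss(self, z_hat_idx):
        # Best Arm Identification loss 1{z_* != z_hat}.
        return int(np.argmax(self.Z @ self.theta) != z_hat_idx)
```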

As described in Section 5 and Appendix A, combinatorial bandits generalize standard multi-armed bandits and all of binary classification, and have thus received a tremendous amount of interest in recent years. A large portion of this work has focused on providing a precise characterization of the information theoretic limit on the minimal number of samples needed to identify $z_*(\theta_*)$ with high probability, a quantity denoted $\rho_*(\theta_*)$, which is the solution to an optimization problem (Soare et al., 2014; Fiez et al., 2019; Degenne et al., 2020):
$$\rho_*(\theta_*)^{-1} := \max_{\lambda \in \triangle_{\mathcal{X}}} \min_{\substack{\theta' \in \Theta \\ z_*(\theta') \neq z_*(\theta_*)}} \sum_{x \in \mathcal{X}} \lambda_x \langle x, \theta_* - \theta' \rangle^2$$
for some set of alternatives $\Theta$. This quantity provides a natural complexity measure $\mathcal{C}(\theta_*) = \rho_*(\theta_*)$ for a given instance $\theta_*$, and we describe it in a few specific cases below.

As a warmup example, consider the standard best-arm identification problem where $\mathcal{Z} = \mathcal{X} = \{\mathbf{e}_i : i \in [d]\}$ and choosing action $x_t \in \mathcal{X}$ results in reward $y_t \sim \text{Bernoulli}(\theta_{i_t})$. Let $i_*(\theta) = \operatorname{arg\,max}_{z \in \mathcal{Z}} z^\top \theta = \operatorname{arg\,max}_i \theta_i$. Then in this case $\rho_*(\theta) \approx \sum_{i \neq i_*(\theta)} (\theta_{i_*(\theta)} - \theta_i)^{-2}$, and it has been shown that there exists a constant $c_0 > 0$ such that for any sufficiently large $\nu > 0$ we have

$$\min_{\pi} \max_{\theta : \rho_*(\theta) \leq \nu} \ell_{BAI}(\pi, \theta) \geq \exp(-c_0 T / \nu).$$

In other words, more difficult instances correspond to $\theta$ with a small gap between the best arm and any other arm. Moreover, for any $\theta \in \mathbb{R}^d$ there exists a policy $\widetilde{\pi}$ that achieves $\ell(\widetilde{\pi}, \theta) \leq c_1 \exp(-c_2 T / \rho_*(\theta))$ where $c_1, c_2$ capture constant and low-order terms (Carpentier & Locatelli, 2016; Karnin et al., 2013; Garivier & Kaufmann, 2016). Said plainly, the above correspondence between the lower bound and the upper bound for the multi-armed bandit problem shows that $\rho_*(\theta_*)$ is a natural choice for $\mathcal{C}(\theta)$ in this setting.

In recent years, algorithms for the more general combinatorial bandit setting have been established with instance-dependent sample complexities matching $\rho_*(\theta_*)$ (up to logarithmic factors) (Karnin et al., 2013; Chen et al., 2014; Fiez et al., 2019; Chen et al., 2017; Degenne et al., 2020; Katz-Samuels et al., 2020). Another complexity term that appears in Cao & Krishnamurthy (2017) for combinatorial bandits is

$$\widetilde{\rho}(\theta) = \sum_{i=1}^d \max_{z : z_i \neq z_{*,i}(\theta)} \frac{\|z - z_*(\theta)\|_2^2}{\langle z - z_*(\theta), \theta \rangle^2}. \tag{1}$$

One can show $\rho_*(\theta) \leq \widetilde{\rho}(\theta)$ (Katz-Samuels et al., 2020), and in many cases the two track each other. Because $\widetilde{\rho}(\theta)$ can be computed much more efficiently than $\rho_*(\theta)$, we take $\mathcal{C}(\theta) = \widetilde{\rho}(\theta)$.
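Since $\widetilde{\rho}(\theta)$ is the complexity measure used throughout, the following is a small sketch of how equation (1) can be evaluated for a finite $\mathcal{Z}$. The function name and the assumption of a small, tie-free $\mathcal{Z}$ are ours.

```python
import numpy as np

def rho_tilde(theta, Z):
    """Surrogate complexity of Eq. (1) for a finite Z given as a (|Z|, d) binary
    array; assumes <z, theta> has a unique maximizer and no zero gaps."""
    Z = np.asarray(Z, dtype=float)
    theta = np.asarray(theta, dtype=float)
    z_star = Z[np.argmax(Z @ theta)]
    total = 0.0
    for i in range(len(theta)):
        mask = Z[:, i] != z_star[i]        # alternatives disagreeing at coordinate i
        if not mask.any():
            continue
        diffs = Z[mask] - z_star           # z - z_*(theta)
        gaps = diffs @ theta               # <z - z_*(theta), theta>
        total += np.max(np.sum(diffs ** 2, axis=1) / gaps ** 2)
    return total
```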

2.2 Objective: Responding to All Difficulties

As described above, though there exist algorithms for the combinatorial bandit problem that are instance-optimal in the fixed-confidence setting, along with algorithms for the fixed-budget setting, they do not work well with small budgets because they rely on statistical guarantees. Indeed, for their guarantees to be non-vacuous, the budget $T$ needs to be large enough to compare to the potentially large constants in the upper bounds. In practice, they are so conservative that for the first 20 samples they would just sample uniformly. To overcome this, we now provide a different framework for policy learning in a worst-case setting that is effective even in the small-budget regime.

The challenge is in finding a policy that performs well across all potential problem instances simultaneously. It is common to consider minimax optimal policies which attempt to perform well on worst case instances, but as a result they may perform poorly on easier instances. Thus, an ideal policy $\pi$ would perform uniformly well over a set of $\theta$'s that are all equivalent in "difficulty". Since each $\theta \in \Theta$ is equipped with an inherent notion of difficulty, $\mathcal{C}(\theta)$, we can stratify the space of all possible instances by difficulty. A good policy is one whose worst case performance over all possible problem difficulties is minimized. We formalize this idea below.

For any set of problem instances $\widetilde{\Theta} \subset \Theta$ and $r \geq 0$ define

$$\ell(\pi, \widetilde{\Theta}) := \max_{\theta \in \widetilde{\Theta}} \ell(\pi, \theta) \quad\quad \text{ and } \quad\quad \Theta^{(r)} := \{\theta : \mathcal{C}(\theta) \leq r\}.$$

For a fixed $r > 0$ (including $r = \infty$), a policy $\pi'$ that aims to minimize just $\ell(\pi', \Theta^{(r)})$ will be minimax for $\Theta^{(r)}$ and may not perform well on easy instances. To overcome this shortsightedness we introduce a new objective by focusing on $\ell(\pi, \Theta^{(r)}) - \min_{\pi'} \ell(\pi', \Theta^{(r)})$: the sub-optimality gap of a given policy $\pi$ relative to an $r$-dependent baseline policy trained specifically for each $r$.

Objective: Return the policy
$$\pi_* := \operatorname{arg\,min}_{\pi} \max_{r > 0} \left( \ell(\pi, \Theta^{(r)}) - \min_{\pi'} \ell(\pi', \Theta^{(r)}) \right) \tag{2}$$
which minimizes the worst case sub-optimality gap over all $r > 0$.

Figure 1: Performance curves for various policies.

Figure 1 illustrates these definitions. The blue curve ($r$-dependent baseline) captures the best performance $\min_{\pi'} \ell(\pi', \Theta^{(r)})$ that is possible for each difficulty level $r$. In other words, the $r$-dependent baseline defines a different policy for each value of $r$; therefore, the blue curve may be unachievable with a single policy. The green curve captures a policy that achieves the minimum (the $r$-dependent baseline) at a given $r'$. Though it is the ideal policy for this difficulty, it could be sub-optimal at any other difficulty. The orange curve is the performance of our optimal policy $\pi_*$: it is willing to sacrifice performance at any given $r$ to achieve an overall better worst-case gap from the baseline.

3 MAPO: Adversarial Training Algorithm

Identifying $\pi_*$ naively requires the computation of $\min_{\pi'} \ell(\pi', \Theta^{(r)})$ for all $r > 0$. However, in practice, given an increasing sequence $r_1 < \dots < r_K$ that indexes nested sets of problem instances of increasing difficulty, $\Theta^{(r_1)} \subset \Theta^{(r_2)} \subset \dots \subset \Theta^{(r_K)}$, we wish to identify a policy $\widehat{\pi}$ that minimizes the maximum sub-optimality gap with respect to this sequence. Explicitly, we seek to learn

$$\widehat{\pi} = \operatorname{arg\,min}_{\pi} \max_{k \leq K} \left( \ell(\pi, \Theta^{(r_k)}) - \ell(\pi_k, \Theta^{(r_k)}) \right) \quad \text{ where } \quad \pi_k \in \operatorname{arg\,min}_{\pi} \max_{\theta : \mathcal{C}(\theta) \leq r_k} \ell(\pi, \theta). \tag{3}$$

Note that as $K \to \infty$ and $\sup_k \frac{r_{k+1}}{r_k} \to 1$, equation 2 and equation 3 are essentially equivalent under benign smoothness conditions on $\mathcal{C}(\theta)$, in which case $\widehat{\pi} \to \pi_*$. In practice, we choose $\Theta^{(r_K)}$ to contain all problems that can be solved relatively accurately within the budget $T$, and a small $\epsilon > 0$ such that $\max_k \frac{r_{k+1}}{r_k} = 1 + \epsilon$. In Algorithm 1, our algorithm MAPO efficiently solves this objective by first computing $\pi_k$ for all $k \in [K]$ to obtain $\ell(\pi_k, \Theta^{(r_k)})$ as benchmarks, and then using these benchmarks to train $\widehat{\pi}$. The next section focuses on the challenges of the optimization problems in equation 4 and equation 5.

  Input: sequence $\{r_k\}_{k=1}^K$, complexity function $\mathcal{C}$.
  Define $k(\theta) \in [K]$ such that $r_{k(\theta)-1} < \mathcal{C}(\theta) \leq r_{k(\theta)}$ for all $\theta$ with $\mathcal{C}(\theta) \leq r_K$.
  for $k = 1, \dots, K$ do
     Obtain policy $\pi_k$ by solving:
$$\pi_k := \operatorname{arg\,min}_{\pi} \ell(\pi, \Theta^{(r_k)}) = \operatorname{arg\,min}_{\pi} \max_{\theta \in \Theta^{(r_k)}} \ell(\pi, \theta) \quad\quad \text{and} \quad\quad b^{(r_k)} := \ell(\pi_k, \Theta^{(r_k)}) \tag{4}$$
  end for
  Training for min-gap optimal policy: Solve the following:
$$\widehat{\pi} = \operatorname{arg\,min}_{\pi} \max_{\theta \in \Theta^{(r_K)}} \left[ \ell(\pi, \theta) - b^{(r_{k(\theta)})} \right] \tag{5}$$
Output: $\widehat{\pi}$ (a solution to equation 3).
Algorithm 1 MAPO: Min-gap Adversarial Policy Optimization
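The outer structure of Algorithm 1 is simple once a min-max solver for equations 4 and 5 is available. The sketch below is a high-level paraphrase under assumed helper routines (train_minimax, worst_case_loss, instance_set); these names are ours, and the actual training procedure is given in Algorithm 2 and the appendices.

```python
# A high-level sketch of Algorithm 1 (MAPO), assuming two helper routines that
# are not defined here: `train_minimax(instance_set, loss_shift)` solves a
# min-max problem of the form in equations (4)/(5) (e.g., via Algorithm 2), and
# `worst_case_loss(policy, instance_set)` estimates ell(pi, Theta^(r)).
def mapo(radii, instance_set, train_minimax, worst_case_loss):
    # Step 1: train an r_k-dependent baseline pi_k and record its benchmark
    # b^(r_k) = ell(pi_k, Theta^(r_k)) for each difficulty level r_k.
    benchmarks = []
    for r in radii:
        pi_k = train_minimax(instance_set(r), loss_shift=lambda c: 0.0)
        benchmarks.append(worst_case_loss(pi_k, instance_set(r)))

    # Step 2: train the min-gap policy on the largest set Theta^(r_K), shifting
    # each instance's loss by the benchmark of its difficulty bracket k(theta).
    def shift(complexity):
        k = next(i for i, r in enumerate(radii) if complexity <= r)
        return benchmarks[k]

    return train_minimax(instance_set(radii[-1]), loss_shift=shift)
```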

3.1 Differentiable policy optimization

The critical part of running MAPO (Algorithm 1) is solving equation 4 and equation 5. Note that equation 5 is an optimization of the same form as equation 4 after shifting the loss by the scalar value $b^{(r_{k(\theta)})}$. Consequently, to learn $\{\pi_k\}_k$ and $\widehat{\pi}$, it suffices to develop a training procedure that solves $\min_{\pi} \max_{\theta \in \Omega} \ell'(\pi, \theta)$ for an arbitrary set $\Omega$ and generic loss function $\ell'(\pi, \theta)$.

We solve this saddle-point problem using the alternating gradient descent/ascent method of Algorithm 2, which we describe now. Instead of optimizing over all possible policies, we restrict the policy class to neural networks that take a state representation as input and output a probability distribution over actions, parameterized by weights $\psi$. In practice, $\ell'(\pi^\psi, \theta)$ may be poorly behaved in $(\psi, \theta)$, so a gradient descent/ascent procedure may get stuck in a neighborhood of a critical point that is not an optimal solution to the saddle-point problem. To avoid this, we instead track many different possible $\theta$'s (intuitively corresponding to different initializations):

$$\min_{\psi} \max_{\theta \in \Omega} \ell'(\pi^\psi, \theta) = \min_{\psi} \max_{\widetilde{\theta}_{1:N} \subset \Omega} \max_{i \in [N]} \ell'(\pi^\psi, \widetilde{\theta}_i) \tag{6}$$
$$= \min_{\psi} \max_{\widetilde{\theta}_{1:N} \subset \Omega} \max_{\lambda \in \Delta_N} \mathbb{E}_{i \sim \lambda}\, \ell'(\pi^\psi, \widetilde{\theta}_i) \tag{7}$$
$$= \min_{\psi} \max_{w \in \mathbb{R}^N,\, \widetilde{\theta}_{1:N} \subset \Omega} \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}\left[ \ell'(\pi^\psi, \widetilde{\theta}_i) \right]. \tag{8}$$

In the first equality we replace the maximum over all of $\Omega$ with a maximum over all subsets $\widetilde{\Theta} = \widetilde{\theta}_{1:N}$ of size $N$. The resulting maximum over the $N$ points is still a discrete optimization. To smooth it out, we use the fact that a $\max$ over a set equals the maximum of the expectation over all distributions on that set. In the last equality, we reparameterize the set of distributions with a softmax over weights $w$ on the different values of $\widetilde{\theta}$. In each round, we backpropagate through $w$ and $\widetilde{\theta}_{1:N}$.
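As a concrete illustration of equation 8, the following PyTorch sketch shows the smoothed adversary objective with softmax weights $w$ over the tracked instances $\widetilde{\theta}_{1:N}$. The placeholder loss and all variable names are ours; in the actual procedure the per-instance losses are estimated from rollouts and the gradients use the score-function identity (Appendix C).

```python
import torch

# Sketch of the smoothed inner objective in equation (8), assuming the losses
# ell'(pi^psi, theta_i) for the N tracked instances are available as a
# differentiable function of theta_tilde (a stand-in below).
N, d = 16, 25
w = torch.zeros(N, requires_grad=True)                # softmax weights over instances
theta_tilde = torch.randn(N, d, requires_grad=True)   # tracked problem instances

def smoothed_objective(loss_fn):
    losses = loss_fn(theta_tilde)                     # shape (N,)
    weights = torch.softmax(w, dim=0)
    return (weights * losses).sum()                   # E_{i ~ SOFTMAX(w)}[loss_i]

# One ascent step on the adversary's parameters (descent on psi would alternate).
opt_adv = torch.optim.Adam([w, theta_tilde], lr=1e-2)
obj = smoothed_objective(lambda th: (th ** 2).sum(dim=1))   # placeholder losses
(-obj).backward()                                     # maximize by minimizing the negative
opt_adv.step()
```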

Now we discuss the optimization routine outlined in Algorithm 2. For the inner optimization, ideally, in each round we would build an estimate of the loss function at our current choice of $\pi^\psi$ for each of the $\widetilde{\theta}_{1:N}$'s under consideration. To do so, we roll out the policy $L$ times for each $\theta \in \widetilde{\theta}_{1:N}$ under consideration and then average the resulting losses (this also allows us to construct a stochastic gradient of the loss). In practice we cannot consider all $\theta \in \widetilde{\theta}_{1:N}$, so instead we sample $M$ of them according to $w$. This has a computational benefit, allowing us to be strategic by considering in each round the $\theta$'s that are closest to $\operatorname{arg\,max}_{\widetilde{\theta}_{1:N}} \ell'(\pi^\psi, \theta)$.

After this we backpropagate through $w$ and $\widetilde{\Theta}$ using the stochastic gradients obtained from the rollouts. Finally, we update $\pi$ by backpropagation through the neural network under consideration. The gradient steps are taken with unbiased gradient estimates $g^w(i, \tau)$, $g^{\widetilde{\Theta}}(i, \tau)$ and $g^\psi(i, \tau)$, which are computed using the score-function identity and are described in detail in Appendix C. We outline more implementation details in Appendix B, along with the algorithm below stated with explicit gradient estimate formulas. Hyperparameters can be found in Appendix D.
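For instance, the policy-gradient piece $g^\psi$ is a standard score-function (REINFORCE) estimate. The sketch below shows its generic form, assuming a `policy` module that maps a state tensor to action logits and a `rollout` helper returning the visited states, chosen actions, and final loss $L_T$; both names are ours, and the exact estimators used are those in Appendix C.

```python
import torch

def policy_gradient_estimate(policy, rollout, theta):
    # One-trajectory score-function (REINFORCE) estimate of grad_psi E[L_T],
    # i.e. the form of g^psi: L_T * grad_psi sum_t log pi^psi(x_t | s_t).
    states, actions, L_T = rollout(policy, theta)
    log_probs = torch.stack([
        torch.log_softmax(policy(s), dim=-1)[a] for s, a in zip(states, actions)
    ])
    surrogate = L_T * log_probs.sum()
    return torch.autograd.grad(surrogate, list(policy.parameters()))
```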

Algorithm 2 Gradient Based Optimization of equation 8
  Input: partition $\Omega$, number of iterations $N_{\text{it}}$, number of problem samples $M$, number of rollouts per problem $L$, and loss variable $L_T$ at horizon $T$ (see beginning of Section 2).
  Goal: Compute the optimal policy $\operatorname{arg\,min}_{\pi} \max_{\theta \in \Omega} \ell'(\pi, \theta) = \operatorname{arg\,min}_{\pi} \max_{\theta \in \Omega} \mathbb{E}_{\pi,\theta}[L_T]$. Note that in the case $\ell'(\pi, \theta) = \ell(\pi, \theta) - b^{(r_{k(\theta)})}$, $L_T$ inherently has the scalar value $b^{(r_{k(\theta)})}$ subtracted.
  Initialization: $w$, finite set $\widetilde{\Theta} = \widetilde{\theta}_{1:N}$, and $\psi$.
  for $t = 1, \dots, N_{\text{it}}$ do
     for $m = 1, \dots, M$ do
        Sample $I_m \overset{\text{i.i.d.}}{\sim} \text{SOFTMAX}(w)$.
        Collect $L$ independent rollout trajectories, denoted $\tau_{m,1:L}$, by running the policy $\pi^\psi$ on $\theta_{I_m}$.
     end for
     Update the generating distribution by taking ascending steps on the gradient estimates:
$$\widetilde{\Theta}, w \leftarrow \widetilde{\Theta} + \frac{1}{ML} \sum_{m=1}^M \left( \nabla_{\widetilde{\Theta}} \mathcal{L}_{\text{barrier}}(\widetilde{\theta}_{I_m}, \Omega) + \sum_{l=1}^L g^{\widetilde{\Theta}}(I_m, \tau_{m,l}) \right), \quad w + \frac{1}{ML} \sum_{m=1}^M \sum_{l=1}^L g^w(I_m, \tau_{m,l})$$
     where $\mathcal{L}_{\text{barrier}}$ is a differentiable barrier loss that heavily penalizes $\widetilde{\theta}_{I_m}$'s outside $\Omega$.
     Update the policy by taking a descending step on the gradient estimate:
$$\psi \leftarrow \psi - \frac{1}{ML} \sum_{m=1}^M \sum_{l=1}^L g^\psi(I_m, \tau_{m,l})$$
  end for

4 Experiments

We now evaluate the approach described in the previous section for combinatorial bandits with $\mathcal{X} = \{\mathbf{e}_i : i \in [d]\}$ and $\mathcal{Z} \subset \{0,1\}^d$. This setting generalizes both binary active classification for an arbitrary model class and active recommendation, which we evaluate by conducting experiments on two corresponding real datasets. We evaluate based on two criteria: instance-dependent worst-case and average-case. For the instance-dependent worst-case, we measure, for each $r_k$ and policy $\pi$, $\ell(\pi, \Theta^{(r_k)}) := \max_{\theta \in \Theta^{(r_k)}} \ell(\pi, \theta)$ and plot this value as a function of $r_k$. We note that our algorithm is designed to optimize exactly this metric. For the secondary average-case metric, we instead measure, for policy $\pi$ and some collected set $\Theta$, $\frac{1}{|\Theta|} \sum_{\theta \in \Theta} \ell(\pi, \theta)$. Performance on the instance-dependent worst-case metric is reported in Figures 2, 3, 4, 6, and 7 below, while the average-case performance is reported in the tables and Figure 5. Full-scale versions of the figures can also be found in Appendix F.

4.1 Algorithms

We compare against a number of baseline active learning algorithms (see Section 5 for a review). Uncertainty sampling at time $t$ computes the empirical maximizer of $\langle z, \widehat{\theta} \rangle$ and the runner-up, and samples an index uniformly from their symmetric difference (i.e., thinking of elements of $\mathcal{Z}$ as subsets of $[d]$); if either is not unique, an index is sampled from the region of disagreement of the winners (see Appendix G for details). The greedy methods are represented by soft generalized binary search (SGBS) (Nowak, 2011), which maintains a posterior distribution over $\mathcal{Z}$ and samples to maximize information gain. A hyperparameter $\beta \in (0, 1/2)$ of SGBS determines the strength of the likelihood update. We plot or report a range of performance over $\beta \in \{.01, .03, .1, .2, .3, .4\}$. The agnostic algorithms for classification (Balcan et al., 2006; Hanneke, 2007b;a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019) or combinatorial bandits (Chen et al., 2014; Gabillon et al., 2016; Chen et al., 2017; Cao & Krishnamurthy, 2017; Fiez et al., 2019; Jain & Jamieson, 2019) are so conservative that given just $T = 20$ samples they are all exactly equivalent to uniform sampling, and hence are represented by Uniform. To represent a policy based on learning to actively learn with respect to a prior, we employ the method of Kveton et al. (2020), denoted Bayes-LAL, with a fixed prior $\widetilde{\mathcal{P}}$ constructed by drawing a $z$ uniformly at random from $\mathcal{Z}$ and defining $\theta = 2z - 1 \in [-1, 1]^d$ (details in Appendix H). When evaluating each policy, we use the successive halving algorithm (Li et al., 2017; 2018) for optimizing our non-convex objective with randomly initialized gradient descent and restarts (details in Appendix E).
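For reference, the following is a rough sketch of the SGBS baseline as we paraphrase it from Nowak (2011): maintain a posterior over $\mathcal{Z}$, query the coordinate that most evenly bisects the posterior mass, and apply a soft likelihood update with parameter $\beta$. This is an illustrative approximation under our own variable names, not necessarily the exact implementation we ran.

```python
import numpy as np

def sgbs_step(p, Z, beta, observe):
    """One step of soft generalized binary search (sketch). p: posterior over
    the rows of Z (a (|Z|, d) binary array); beta in (0, 1/2); observe(i)
    returns a label in {-1, +1} for question i."""
    splits = p @ (2 * Z - 1)                        # signed posterior mass per question
    i = int(np.argmin(np.abs(splits)))              # most informative (bisecting) question
    y = observe(i)
    agree = (2 * Z[:, i] - 1) == y
    p = np.where(agree, (1 - beta) * p, beta * p)   # soft likelihood update
    return p / p.sum(), i
```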

4.2 Synthetic Dataset: Thresholds

We begin with a very simple instance to demonstrate the instance-dependent performance achieved by our learned policy. For $d = 25$, let $\mathcal{X} = \{\mathbf{e}_i : i \in [d]\}$, $\mathcal{Z} = \{\sum_{i=1}^k \mathbf{e}_i : k = 0, 1, \dots, d\}$, and let $f(\cdot|\theta, x)$ be a Bernoulli distribution over $\{-1, 1\}$ with mean $\langle x, \theta \rangle \in [-1, 1]$. Appendix A shows that $z_*(\theta_*) = \operatorname{arg\,max}_z \langle z, \theta_* \rangle$ is the best threshold classifier for the label distribution induced by $(\theta_* + 1)/2$. We trained baseline policies $\{\pi_k\}_{k=1}^9$ for the Best Identification metric with $\mathcal{C}(\theta) = \widetilde{\rho}(\mathcal{X}, \mathcal{Z}, \theta)$ and $r_i = 2^{3 + i/2}$ for $i \in \{0, \dots, 8\}$.
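A minimal sketch of this thresholds instance (variable names are ours):

```python
import numpy as np

d = 25
X = np.eye(d)                                                   # arms e_1, ..., e_d
Z = np.array([[1] * k + [0] * (d - k) for k in range(d + 1)])   # threshold classifiers

def pull(theta, i, rng=np.random.default_rng()):
    # Observation in {-1, +1} with mean <e_i, theta> = theta_i.
    return 1 if rng.random() < (theta[i] + 1) / 2 else -1
```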

Figure 2: Learned policies, lower is better
Figure 3: Sub-optimality of individual policies, lower is better
Figure 4: Max over $\{\theta : \widetilde{\rho}(\theta) \leq r\}$, lower is better
Figure 5: Average $\mathbb{E}_{\theta \sim \mathcal{P}_h}[\cdot]$, lower is better

First we compare the base policies $\pi_k$ to $\widehat{\pi}$. Figure 2 presents $\ell(\pi, \Theta^{(r)}) = \max_{\theta : \widetilde{\rho}(\theta) \leq r} \ell(\pi, \theta) = \max_{\theta : \widetilde{\rho}(\theta) \leq r} \mathbb{P}_{\pi,\theta}(\widehat{z} \neq z_*(\theta))$ as a function of $r$ for our base policies $\{\pi_k\}_k$ and the global policy $\widehat{\pi}$, each as an individual curve. Figure 3 plots the same information in terms of the gap $\ell(\pi, \Theta^{(r)}) - \min_{k : r_{k-1} < r \leq r_k} \ell(\pi_k, \Theta^{(r_k)})$. We observe that each $\pi_k$ performs best in a particular region and that $\widehat{\pi}$ performs almost as well as the $r$-dependent baseline policies over the entire range of $r$.

Under the same conditions as Figure 2, Figure 4 compares the performance of $\widehat{\pi}$ to the algorithm benchmarks. Since SGBS and Bayes-LAL are deterministic, the adversarial training finds a $\theta$ that tricks them into catastrophic failure. Figure 5 trades adversarial evaluation for evaluation with respect to a parameterized prior: for each $h \in \{0.5, 0.6, \dots, 1\}$, $\theta \sim \mathcal{P}_h$ is defined by drawing a $z$ uniformly at random from $\mathcal{Z}$ and then setting $\theta_i = (2z_i - 1)(2\alpha_i - 1)$ where $\alpha_i \sim \text{Bernoulli}(h)$. Thus, each sign of $2z - 1$ is flipped with probability $1 - h$. We then compute $\mathbb{E}_{\theta \sim \mathcal{P}_h}[\mathbb{P}_{\pi,\theta}(\widehat{z} \neq z_*(\theta))] = \mathbb{E}_{\theta \sim \mathcal{P}_h}[\ell(\pi, \theta)]$. While SGBS now performs much better than uniform and uncertainty sampling, our policy $\widehat{\pi}$ is still superior to these policies. However, Bayes-LAL is best overall, which is expected since the support of $\mathcal{P}_h$ is essentially a rescaled version of the prior used in Bayes-LAL.
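Drawing $\theta \sim \mathcal{P}_h$ for this evaluation amounts to the following sketch (the function name is ours):

```python
import numpy as np

def sample_theta(Z, h, rng=np.random.default_rng()):
    # theta_i = (2 z_i - 1)(2 alpha_i - 1) with alpha_i ~ Bernoulli(h),
    # so each sign of 2z - 1 is flipped independently with probability 1 - h.
    z = Z[rng.integers(len(Z))]
    alpha = rng.random(len(z)) < h
    return (2 * z - 1) * (2 * alpha - 1)
```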

4.3 Real Datasets

20 Questions. Our dataset is constructed from the real data of Hu et al. (2018). Summarizing how we used the data: $100$ yes/no questions were considered for $1000$ celebrities. Each question $i \in [100]$ for each person $j \in [1000]$ was answered by several annotators to construct an empirical probability $\bar{p}_i^{(j)} \in [0, 1]$ denoting the proportion of annotators that answered "yes." To construct our instance, we take $\mathcal{X} = \{\mathbf{e}_i : i \in [100]\}$ to encode questions and $\mathcal{Z} = \{z^{(j)} : [z^{(j)}]_i = \mathbf{1}\{\bar{p}_i^{(j)} > 1/2\}\} \subset \{0,1\}^{100}$. Just as before, we trained $\{\pi_k\}_{k=1}^4$ for the Best Identification metric with $\mathcal{C}(\theta) = \widetilde{\rho}(\mathcal{X}, \mathcal{Z}, \theta)$ and $r_i = 2^{3 + i/2}$ for $i \in \{1, \dots, 4\}$. See Appendix I for details.
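A sketch of this construction, assuming p_bar is a (1000, 100) array of the empirical "yes" frequencies $\bar{p}_i^{(j)}$ (the array name and the random stand-in below are ours; see Appendix I for the actual preprocessing):

```python
import numpy as np

p_bar = np.random.rand(1000, 100)      # stand-in for the annotated frequencies
Z = (p_bar > 0.5).astype(int)          # z^(j) in {0,1}^100, one row per celebrity
X = np.eye(100)                        # e_i encodes asking question i
# For the average-case evaluation of Section 4.3.2, theta is drawn by picking a
# random person j and setting theta_i = 2 * p_bar[j, i] - 1.
theta = 2 * p_bar[np.random.randint(1000)] - 1
```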

Jester Joke Recommendation. We now turn our attention away from Best Identification to Simple Regret, where $\ell(\pi, \theta) = \mathbb{E}_{\pi,\theta}[\langle z_*(\theta) - \widehat{z}, \theta \rangle]$. We consider the Jester jokes dataset of Goldberg et al. (2001), which contains jokes ranging from innocent puns to grossly offensive jokes. We filter the dataset to only contain users that rated all $100$ jokes, resulting in 14116 users. A rating of each joke was provided on a $[-10, 10]$ scale, which we rescale to $[-1, 1]$, and observations are simulated as Bernoullis as above. We then clustered the ratings of these users into 10 groups (see Appendix J for details) to obtain $\mathcal{Z} = \{z^{(k)} : k \in [10], z^{(k)} \in \{0,1\}^{100}\}$, where $z_i^{(k)} = 1$ corresponds to recommending the $i$th joke in user cluster $z^{(k)} \in \mathcal{Z}$.
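The Simple Regret objective used here is straightforward to evaluate; a small sketch, assuming Z is the (10, 100) binary matrix of cluster recommendation vectors and theta holds a user's rescaled ratings (names are ours):

```python
import numpy as np

def simple_regret(Z, theta, z_hat_idx):
    # <z_*(theta) - z_hat, theta> with z_*(theta) = argmax_z <z, theta>.
    scores = np.asarray(Z) @ np.asarray(theta)
    return scores.max() - scores[z_hat_idx]
```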

4.3.1 Instance-dependent Worst-case

Figure 6 and Figure 7 are analogous to Figure 4 but for the 20 questions and Jester joke instances, respectively. The two deterministic policies, SGBS and Bayes-LAL, fail on these datasets as well against the worst-case instances.

Figure 6: 20 Questions
Figure 7: Jester Joke

On the Jester joke dataset, our policy alone nearly achieves the $r$-dependent baseline for all $r$. But on 20 questions, uncertainty sampling also performs remarkably well. These experiments on real datasets demonstrate that our policy obtains near-optimal instance-dependent sample complexity.

4.3.2 Average Case Performance

While the metric of the previous section rewarded algorithms that perform uniformly well over all possible environments that could be encountered, in this section we consider the performance of an algorithm with respect to a distribution over environments, which we denote as average case.

Table 1: 20 Questions, higher is better

Method | Accuracy (%)
$\pi^*$ (Ours) | 17.9
SGBS | {26.5, 26.2, 27.2, 26.5, 21.4, 12.8}
Uncertainty | 14.3
Bayes-LAL | 4.1
Uniform | 6.9

Table 2: Jester Joke, lower is better

Method | Regret
$\pi^*$ (Ours) | 3.209
SGBS | {3.180, 3.224, 3.278, 3.263, 3.153, 3.090}
Uncertainty | 3.027
Bayes-LAL | 3.610
Uniform | 3.877

While heuristic-based algorithms (such as SGBS, uncertainty sampling, and Bayes-LAL) can perform catastrophically on worst-case instances, they can perform very well with respect to a benign distribution over instances. Here we demonstrate that our policy not only performs optimally under the instance-dependent worst-case metric but also remains comparable even when evaluated under the average-case metric. To measure average performance, we construct prior distributions $\widehat{\mathcal{P}}$ based on the individual datasets:

  • For the 20 questions dataset, to draw a $\theta \sim \widehat{\mathcal{P}}$, we uniformly at random select a $j \in [1000]$ and set $\theta_i = 2\bar{p}_i^{(j)} - 1$ for all $i \in [d]$.

  • For the Jester joke recommendation dataset, to draw a $\theta \sim \widehat{\mathcal{P}}$, we uniformly sample a user and use their ratings of each joke.

On the 20 questions dataset, as shown in Table 1, SGBS and $\widehat{\pi}$ are the winners. Bayes-LAL performs much worse in this case, potentially because of the distribution shift from $\widetilde{\mathcal{P}}$ (the prior we train on) to $\widehat{\mathcal{P}}$ (the prior at test time). The strong performance of SGBS may be due to the fact that $\text{sign}(\theta_i) = 2z_*(\theta)_i - 1$ for all $i$ and $\theta \sim \widehat{\mathcal{P}}$, a realizability condition under which SGBS has strong guarantees (Nowak, 2011). On the Jester joke dataset, Table 2 shows that despite our policy not being trained for this setting, its performance is still among the top.

5 Related work

Learning to actively learn. Previous works vary in how they parameterize the policy, ranging from parameterized mixtures of existing expertly designed active learning algorithms (Baram et al., 2004; Hsu & Lin, 2015; Agarwal et al., 2016), to parameterizing hyperparameters (e.g., learning rate, rate of forced exploration, etc.) of an existing popular algorithm (e.g., EXP3) (Konyushkova et al., 2017; Bachman et al., 2017; Cella et al., 2020), to, most ambitiously, policies parameterized end-to-end as in this work (Boutilier et al., 2020; Kveton et al., 2020; Sharaf & Daumé III, 2019; Fang et al., 2017; Woodward & Finn, 2017). These works define a prior distribution either through past experience (meta-learning) or by expert construction (e.g., $\theta \sim \mathcal{N}(0, \Sigma)$), and then evaluate their policy with respect to this prior distribution. Defining this prior can be difficult, and moreover, if the $\theta$ encountered at test time does not follow this prior distribution, performance can suffer significantly. Our approach, on the other hand, takes an adversarial training approach and can be interpreted as learning a parameterized least favorable prior (Wasserman, 2013), yielding a much more robust policy as an end result.

Robust and Safe Reinforcement Learning. Our work is also highly related to the field of robust and safe reinforcement learning, where our objective can be considered an instance of the minimax criterion under parameter uncertainty (Garcıa & Fernández, 2015). Widely applied in areas such as robotics (Mordatch et al., 2015; Rajeswaran et al., 2016), these methods train a policy in a simulator like Mujoco (Todorov et al., 2012) to minimize a defined loss objective while remaining robust to uncertainties and perturbations of the environment. Ranges of these uncertainty parameters are chosen based on the potential values that could be encountered when deploying the robot in the real world. In our setting, however, defining the set of environments is far less straightforward, which we overcome by adopting the $\mathcal{C}(\theta)$ function.

Active Binary Classification Algorithms. The literature on active learning algorithms can be partitioned into model-based heuristics like uncertainty sampling, query by committee, or model-change sampling (Settles, 2009), greedy binary-search like algorithms that typically rely on a form of bounded noise for correctness (Dasgupta, 2005; 2006; Kääriäinen, 2006; Golovin & Krause, 2011; Nowak, 2011), and agnostic algorithms that make no assumptions on the probabilistic model (Balcan et al., 2006; Hanneke, 2007b; a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019; Katz-Samuels et al., 2020; 2021). Though the heuristics and greedy methods can perform very well for some problems, it is typically easy to construct counter-examples (e.g., outside the assumptions) in which they catastrophically fail as demonstrated in our experiments. The agnostic algorithms have strong robustness guarantees but rely on concentration inequalities, and consequently require at least hundreds of labels to observe any deviation from random sampling (see Huang et al. (2015) for comparison). Therefore, they were implicitly represented by uniform in our experiments.

Pure-exploration Multi-armed Bandit Algorithms. In the linear structure setting, for sets $\mathcal{X}, \mathcal{Z} \subset \mathbb{R}^d$ known to the player, pulling an "arm" $x \in \mathcal{X}$ results in an observation $\langle x, \theta_* \rangle$ plus zero-mean noise, and the objective is to identify $\operatorname{arg\,max}_{z \in \mathcal{Z}} \langle z, \theta_* \rangle$ for a vector $\theta_*$ unknown to the player (Soare et al., 2014; Karnin, 2016; Tao et al., 2018; Xu et al., 2017; Fiez et al., 2019). A special case of linear bandits is combinatorial bandits, where $\mathcal{X} = \{\mathbf{e}_i : i \in [d]\}$ and $\mathcal{Z} \subset \{0,1\}^d$ (Chen et al., 2014; Gabillon et al., 2016; Chen et al., 2017; Cao & Krishnamurthy, 2017; Fiez et al., 2019; Jain & Jamieson, 2019). Active binary classification is a special case of combinatorial pure-exploration multi-armed bandits (Jain & Jamieson, 2019), which we exploit in the threshold experiments. While the above works have made great theoretical advances in deriving algorithms and information theoretic lower bounds that match up to constants, the constants are so large that these algorithms only behave well when the number of measurements is very large. When applied to the instances of our paper (only 20 queries are made), these algorithms behave no differently than random sampling.

6 Discussion and Future Directions

We see this work as an exciting but preliminary step towards realizing the full potential of this general approach. Although our experiments have focused on applications of combinatorial bandits, we see this framework generalizing with minor changes to many more widely applicable settings such as multi-class active classification, contextual bandits, etc. To generalize $\mathcal{C}(\theta)$ to these settings, one can refer to the existing literature on instance-dependent lower bounds (Katz-Samuels et al., 2021; Agarwal et al., 2014). Alternatively, when such a lower bound does not exist, we conjecture that a heuristic scoring function could also serve as $\mathcal{C}(\theta)$. For example, in a chess game, one could simply use the scoring function of the pieces left on the board as a proxy for difficulty.

From a practical perspective, training $\widehat{\pi}$ can take many hours of computational resources even for these small instances. Scaling these methods to larger instances is an important next step. While training time scales linearly with the horizon length $T$, we note that one can take multiple samples per time step. With minimal computational overhead, this could enable training on problems that require larger sample complexities. In our implementation we hard-coded the decision rule for $\widehat{z}$ given $s_T$, but it could also be learned as in Luedtke et al. (2020). Likewise, the parameterization of the policy and generator worked well for our purposes but was chosen somewhat arbitrarily; are there more natural choices? Finally, while we focused on stochastic settings, this work naturally extends to constrained, fully adaptive adversarial sequences, which is an interesting direction for future work.

Funding disclosure

This work was supported in part by NSF IIS-1907907.

Acknowledgement

The authors would like to thank Aravind Rajeswaran for insightful discussions on Robust Reinforcement Learning. We would also like to thank Julian Katz-Samuels and Andrew Wagenmaker for helpful comments.

References

  • Agarwal et al. (2014) Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646. PMLR, 2014.
  • Agarwal et al. (2016) Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. arXiv preprint arXiv:1612.06246, 2016.
  • Aleksandrov et al. (1968) V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemenev. Stochastic optimization. Engineering Cybernetics, (5):11–+, 1968.
  • Bachman et al. (2017) Philip Bachman, Alessandro Sordoni, and Adam Trischler. Learning algorithms for active learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  301–310. JMLR. org, 2017.
  • Balcan et al. (2006) Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pp.  65–72, 2006.
  • Baram et al. (2004) Yoram Baram, Ran El Yaniv, and Kobi Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5(Mar):255–291, 2004.
  • Boutilier et al. (2020) Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, and Manzil Zaheer. Differentiable bandit exploration. arXiv preprint arXiv:2002.06772, 2020.
  • Cao & Krishnamurthy (2017) Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization. stat, 2017.
  • Carpentier & Locatelli (2016) Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp.  590–604, 2016.
  • Cella et al. (2020) Leonardo Cella, Alessandro Lazaric, and Massimiliano Pontil. Meta-learning with stochastic linear bandits. arXiv preprint arXiv:2005.08531, 2020.
  • Chen et al. (2017) Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp.  482–534, 2017.
  • Chen et al. (2014) Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 379–387, 2014.
  • Dasgupta (2005) Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in neural information processing systems, pp. 337–344, 2005.
  • Dasgupta (2006) Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in neural information processing systems, pp. 235–242, 2006.
  • Dasgupta et al. (2008) Sanjoy Dasgupta, Daniel J Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in neural information processing systems, pp. 353–360, 2008.
  • Degenne et al. (2020) Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp. 2432–2442. PMLR, 2020.
  • Fang et al. (2017) Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383, 2017.
  • Fiez et al. (2019) Tanner Fiez, Lalit Jain, Kevin G Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pp. 10666–10676, 2019.
  • Gabillon et al. (2016) Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner, and Peter Bartlett. Improved learning complexity in combinatorial pure exploration bandits. In Artificial Intelligence and Statistics, pp.  1004–1012, 2016.
  • Garcıa & Fernández (2015) Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • Garivier & Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp.  998–1027, 2016.
  • Goldberg et al. (2001) Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. information retrieval, 4(2):133–151, 2001.
  • Golovin & Krause (2011) Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
  • Hanneke (2007a) Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th international conference on Machine learning, pp.  353–360, 2007a.
  • Hanneke (2007b) Steve Hanneke. Teaching dimension and the complexity of active learning. In International Conference on Computational Learning Theory, pp.  66–81. Springer, 2007b.
  • Hanneke et al. (2014) Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
  • Hao et al. (2019) Botao Hao, Tor Lattimore, and Csaba Szepesvari. Adaptive exploration in linear contextual bandit. arXiv preprint arXiv:1910.06996, 2019.
  • Hsu & Lin (2015) Wei-Ning Hsu and Hsuan-Tien Lin. Active learning by learning. In Twenty-Ninth AAAI conference on artificial intelligence, 2015.
  • Hu et al. (2018) Huang Hu, Xianchao Wu, Bingfeng Luo, Chongyang Tao, Can Xu, Wei Wu, and Zhan Chen. Playing 20 question game with policy-based reinforcement learning. arXiv preprint arXiv:1808.07645, 2018.
  • Huang et al. (2015) Tzu-Kuo Huang, Alekh Agarwal, Daniel J Hsu, John Langford, and Robert E Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, pp. 2755–2763, 2015.
  • Jain & Jamieson (2019) Lalit Jain and Kevin G Jamieson. A new perspective on pool-based active classification and false-discovery control. In Advances in Neural Information Processing Systems, pp. 13992–14003, 2019.
  • Kääriäinen (2006) Matti Kääriäinen. Active learning in the non-realizable case. In International Conference on Algorithmic Learning Theory, pp.  63–77. Springer, 2006.
  • Karnin et al. (2013) Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238–1246, 2013.
  • Karnin (2016) Zohar S Karnin. Verification based solution for structured mab problems. In Advances in Neural Information Processing Systems, pp. 145–153, 2016.
  • Katz-Samuels et al. (2020) Julian Katz-Samuels, Lalit Jain, Zohar Karnin, and Kevin Jamieson. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. arXiv preprint arXiv:2006.11685, 2020.
  • Katz-Samuels et al. (2021) Julian Katz-Samuels, Jifan Zhang, Lalit Jain, and Kevin Jamieson. Improved algorithms for agnostic pool-based active classification. arXiv preprint arXiv:2105.06499, 2021.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Konyushkova et al. (2017) Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, pp. 4225–4235, 2017.
  • Kveton et al. (2020) Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, and Craig Boutilier. Differentiable meta-learning in contextual bandits. arXiv preprint arXiv:2006.05094, 2020.
  • Lattimore & Szepesvari (2016) Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491, 2016.
  • Li et al. (2018) Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934, 2018.
  • Li et al. (2017) Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
  • Luedtke et al. (2020) Alex Luedtke, Marco Carone, Noah Simon, and Oleg Sofrygin. Learning to learn from data: Using deep adversarial learning to construct optimal statistical procedures. Science Advances, 6(9), 2020. doi: 10.1126/sciadv.aaw2140. URL https://advances.sciencemag.org/content/6/9/eaaw2140.
  • Mannor & Tsitsiklis (2004) Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
  • Mordatch et al. (2015) Igor Mordatch, Kendall Lowrey, and Emanuel Todorov. Ensemble-cio: Full-body dynamic motion planning that transfers to physical humanoids. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  5307–5314. IEEE, 2015.
  • Nowak (2011) Robert D Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
  • Ok et al. (2018) Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8874–8882, 2018.
  • Rajeswaran et al. (2016) Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
  • Settles (2009) Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
  • Sharaf & Daumé III (2019) Amr Sharaf and Hal Daumé III. Meta-learning for contextual bandit exploration. arXiv preprint arXiv:1901.08159, 2019.
  • Simchowitz & Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Advances in Neural Information Processing Systems, pp. 1151–1160, 2019.
  • Simchowitz et al. (2017) Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. arXiv preprint arXiv:1702.05186, 2017.
  • Soare et al. (2014) Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pp. 828–836, 2014.
  • Tao et al. (2018) Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pp. 4877–4886, 2018.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.  5026–5033. IEEE, 2012.
  • Tsybakov (2008) Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
  • Van Parys & Golrezaei (2020) Bart Van Parys and Negin Golrezaei. Optimal learning for structured bandits. 2020.
  • Wasserman (2013) Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
  • Woodward & Finn (2017) Mark Woodward and Chelsea Finn. Active one-shot learning. arXiv preprint arXiv:1702.06559, 2017.
  • Xu et al. (2017) Liyuan Xu, Junya Honda, and Masashi Sugiyama. Fully adaptive algorithm for pure exploration in linear bandits. arXiv preprint arXiv:1710.05552, 2017.

Appendix A Instance Dependent Sample Complexity

Identifying forms of $\mathcal{C}(\theta)$ is not as difficult a task as one might think, due to the proliferation of tools for proving lower bounds for active learning (Mannor & Tsitsiklis, 2004; Tsybakov, 2008; Garivier & Kaufmann, 2016; Carpentier & Locatelli, 2016; Simchowitz et al., 2017; Chen et al., 2014). One can directly extract values of $\mathcal{C}(\theta)$ from the literature on regret minimization for linear or other structured bandits (Lattimore & Szepesvari, 2016; Van Parys & Golrezaei, 2020), contextual bandits (Hao et al., 2019), and tabular as well as structured MDPs (Simchowitz & Jamieson, 2019; Ok et al., 2018). Moreover, we believe that even reasonable surrogates of $\mathcal{C}(\theta)$ should result in a high quality policy $\pi_*$.

We review some canonical examples:

  • Multi-armed bandits. In the best-arm identification problem, there are $d \in \mathbb{N}$ Gaussian distributions where the $i$th distribution has mean $\theta_i \in \mathbb{R}$ for $i = 1, \dots, d$. In the above formulation, this problem is encoded as follows: action $x_t = i_t$ results in observation $y_t \sim \text{Bernoulli}(\theta_{i_t})$, and the loss is $\ell(\pi, \theta) := \mathbb{E}_{\pi,\theta}[\mathbf{1}\{\widehat{i} \neq i_*(\theta)\}]$ where $\widehat{i}$ is $\pi$'s recommended index and $i_*(\theta) = \operatorname{arg\,max}_i \theta_i$. It has been shown that there exists a constant $c_0 > 0$ such that for any sufficiently large $\nu > 0$ we have

    minπmaxθ:𝒞MAB(θ)ν(π,θ)exp(c0T/ν)\displaystyle\min_{\pi}\max_{\theta:\mathcal{C}_{MAB}(\theta)\leq\nu}\ell(\pi,\theta)\geq\exp(-c_{0}T/\nu)
    where 𝒞MAB(θ):=ii(θ)(θi(θ)θi)2\displaystyle\text{ where }\quad\mathcal{C}_{MAB}(\theta):=\sum_{i\neq i_{\ast}(\theta)}(\theta_{i_{\ast}(\theta)}-\theta_{i})^{-2}

    Moreover, for any θd\theta\in\mathbb{R}^{d} there exists a policy π~\widetilde{\pi} that achieves (π~,θ)c1exp(c2T/𝒞MAB(θ))\ell(\widetilde{\pi},\theta)\leq c_{1}\exp(-c_{2}T/\mathcal{C}_{MAB}(\theta)) where c1,c2c_{1},c_{2} capture constant and low-order terms (Carpentier & Locatelli, 2016; Karnin et al., 2013; Simchowitz et al., 2017; Garivier & Kaufmann, 2016).

The above correspondence between the lower bound and the upper bound suggests that $\mathcal{C}_{MAB}(\theta)$ plays a critical role in determining the difficulty of identifying $i_{\ast}(\theta)$ for any $\theta$. This exercise extends to more structured settings as well:

  • Content recommendation / active search. Consider $n$ items (e.g., movies, proteins) where the $i$th item is represented by a feature vector $x_{i}\in\mathcal{X}\subset\mathbb{R}^{d}$, and a measurement of $x_{t}=x_{i}$ (e.g., preference rating, binding affinity to a target) is modeled by a linear response model such that $y_{t}\sim\mathcal{N}(\langle x_{i},\theta\rangle,1)$ for some unknown $\theta\in\mathbb{R}^{d}$. If $\ell(\pi,\theta):=\mathbb{E}_{\pi,\theta}[\bm{1}\{\widehat{i}\neq i_{\ast}(\theta)\}]$ as above, then nearly identical results to those above hold for an analogous function of $\mathcal{C}_{MAB}(\theta)$ (Soare et al., 2014; Karnin, 2016; Fiez et al., 2019).

  • Active binary classification. For $i=1,\dots,d$ let $\phi_{i}\in\mathbb{R}^{p}$ be a feature vector of an unlabeled item (e.g., an image) that can be queried for its binary label $y_{i}\in\{-1,1\}$, where $y_{i}\sim\text{Bernoulli}(\theta_{i})$ for some $\theta\in\mathbb{R}^{d}$. Let $\mathcal{H}$ be an arbitrary set of classifiers (e.g., neural nets, random forests, etc.) such that each $h\in\mathcal{H}$ assigns a label in $\{-1,1\}$ to each of the items $\{\phi_{i}\}_{i=1}^{d}$ in the pool. If items are chosen sequentially to observe their labels, the objective is to identify the true risk minimizer $h_{\ast}(\theta)=\arg\min_{h\in\mathcal{H}}\sum_{i=1}^{d}\mathbb{E}_{\theta}[\bm{1}\{h(\phi_{i})\neq y_{i}\}]$ using as few requested labels as possible, and $\ell(\pi,\theta):=\mathbb{E}_{\pi,\theta}[\bm{1}\{\widehat{h}\neq h_{\ast}(\theta)\}]$ where $\widehat{h}\in\mathcal{H}$ is $\pi$'s recommended classifier. Many candidates for $\mathcal{C}(\theta)$ have been proposed in the agnostic active learning literature (Balcan et al., 2006; Hanneke, 2007b; a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019), but we believe the most granular candidates come from the combinatorial bandit literature (Chen et al., 2017; Fiez et al., 2019; Cao & Krishnamurthy, 2017; Jain & Jamieson, 2019). To make the reduction, for each $h\in\mathcal{H}$ assign a $z^{(h)}\in\{0,1\}^{d}$ such that $[z^{(h)}]_{i}:=\bm{1}\{h(\phi_{i})=1\}$ for all $i=1,\dots,d$, and set $\mathcal{Z}=\{z^{(h)}:h\in\mathcal{H}\}$. It is easy to check that $z_{\ast}(\theta):=\arg\max_{z\in\mathcal{Z}}\langle z,\theta\rangle$ satisfies $z_{\ast}(\theta)=z^{(h_{\ast}(\theta))}$. Thus, requesting the label of example $i$ is equivalent to sampling from $\text{Bernoulli}(\langle\mathbf{e}_{i},\theta\rangle)\in\{-1,1\}$, completing the reduction to combinatorial bandits: $\mathcal{X}=\{\mathbf{e}_{i}:i\in[d]\}$, $\mathcal{Z}\subset\{0,1\}^{d}$. We then apply the exact same $\mathcal{C}(\theta)$ as above for linear bandits (a small code sketch of $\mathcal{C}_{MAB}$ and this reduction follows this list).
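As referenced above, the following is a minimal sketch (with illustrative function names, not the paper's code) of computing $\mathcal{C}_{MAB}(\theta)$ and of mapping a pool of classifiers to the binary vectors $z^{(h)}$ used in the combinatorial-bandit reduction:

```python
import numpy as np

def c_mab(theta):
    """C_MAB(theta): sum over suboptimal arms of the inverse squared gap
    to the best arm, as in the lower bound above."""
    theta = np.asarray(theta, dtype=float)
    i_star = int(np.argmax(theta))
    gaps = theta[i_star] - np.delete(theta, i_star)
    return float(np.sum(gaps ** -2.0))

def hypotheses_to_arms(classifiers, pool):
    """Reduction from active classification to combinatorial bandits:
    each classifier h becomes a binary vector z with [z]_i = 1{h(phi_i) = 1}."""
    return np.array([[1 if h(phi) == 1 else 0 for phi in pool]
                     for h in classifiers])

# A well-separated instance has small complexity (it is an "easy" instance):
print(c_mab([0.9, 0.1, 0.1]))  # 2 * 0.8**-2 = 3.125
```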

Appendix B Gradient Based Optimization Algorithm Implementation

First, we restate the algorithm with explicit gradient estimator formulas derived in Appendix C.

Algorithm 3 Gradient-based optimization of equation 8 (Algorithm 2) with explicit gradient estimators.
  Input: partition $\Omega$, number of iterations $N_{\text{it}}$, number of problem samples $M$, number of rollouts per problem $L$, and loss variable $L_{T}$ at horizon $T$ (see the beginning of Section 2).
  Goal: compute the optimal policy $\arg\min_{\pi}\max_{\theta\in\Omega}\ell^{\prime}(\pi,\theta)=\arg\min_{\pi}\max_{\theta\in\Omega}\mathbb{E}_{\pi,\theta}[L_{T}]$.
  Initialization: $w$, finite set $\widetilde{\Theta}$, and $\psi$.
  for $t=1,\dots,N_{\text{it}}$ do
     Collect rollouts of play:
     for $m=1,\dots,M$ do
        Sample problem index $I_{m}\overset{\text{i.i.d.}}{\sim}\text{SOFTMAX}(w)$.
        Collect $L$ independent rollout trajectories, denoted $\tau_{m,1:L}$, by running the policy $\pi^{\psi}$ on problem instance $\widetilde{\theta}_{I_{m}}$, and observe the losses $L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})$ for all $1\leq l\leq L$.
     end for
     Optimize worst cases in $\Omega$: update the generating distribution by taking ascending steps on the gradient estimates

        $w\leftarrow w+\frac{1}{ML}\sum_{m=1}^{M}\nabla_{w}\log(\text{SOFTMAX}(w)_{I_{m}})\cdot\Big(\sum_{l=1}^{L}L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})\Big)$  (9)

        $\widetilde{\Theta}\leftarrow\widetilde{\Theta}+\frac{1}{ML}\sum_{m=1}^{M}\sum_{l=1}^{L}\Big(\nabla_{\widetilde{\Theta}}\mathcal{L}_{\text{barrier}}(\widetilde{\theta}_{I_{m}},\Omega)+\nabla_{\widetilde{\Theta}}L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})+L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})\cdot\nabla_{\widetilde{\Theta}}\log(\mathbb{P}_{\pi^{\psi},\widetilde{\theta}_{I_{m}}}(\tau_{m,l}))\Big)$  (10)

     where $\mathcal{L}_{\text{barrier}}$ is a differentiable barrier loss that heavily penalizes the $\widetilde{\theta}_{I_{m}}$'s outside $\Omega$.
     Optimize policy: update the policy by taking a descending step on the gradient estimate

        $\psi\leftarrow\psi-\frac{1}{ML}\sum_{m=1}^{M}\sum_{l=1}^{L}L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})\cdot\nabla_{\psi}\log(\mathbb{P}_{\pi^{\psi},\widetilde{\theta}_{I_{m}}}(\tau_{m,l}))$.  (11)
  end for

In the above algorithm, the gradient estimates are unbiased estimates of the true gradients with respect to $\psi$, $w$, and $\widetilde{\Theta}$ (shown in Appendix C). We choose $N$ large enough to avoid mode collapse, and $M,L$ as large as possible to reduce the variance of the gradient estimates while fitting within the memory constraint. We then choose a sufficiently large number of optimization iterations so that the variance of the gradient estimates is further reduced by averaging over time. We use the Adam optimizer (Kingma & Ba, 2014) for all gradient updates.

Note the decomposition of $\log(\mathbb{P}_{\pi^{\psi},\theta^{\prime}}(\tau))$ in equation 10 and equation 11, where the rollout is $\tau=\{(x_{t},y_{t})\}_{t=1}^{T}$ and

    $\log(\mathbb{P}_{\pi^{\psi},\theta^{\prime}}(\{(x_{t},y_{t})\}_{t=1}^{T}))=\log\Big(\pi^{\psi}(x_{1})\cdot f(y_{1}|\theta^{\prime},s_{1})\cdot\textstyle\prod_{t=2}^{T}\pi^{\psi}(s_{t},x_{t})\cdot f(y_{t}|\theta^{\prime},s_{t},x_{t})\Big).$

Here $\pi^{\psi}$ and $f$ depend only on $\psi$ and $\widetilde{\Theta}$, respectively. When evaluating a fixed policy $\pi$, we are interested in solving $\max_{\theta\in\Omega}\ell^{\prime}(\pi,\theta)$ by gradient ascent updates like equation 10. The decoupling of $\pi^{\psi}$ and $f$ thus enables us to optimize the objective without differentiating through the policy $\pi$, which may be non-differentiable (e.g., a deterministic algorithm).
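As a concrete illustration of this decoupling, the sketch below (an illustrative example under our own naming assumptions, not the paper's code) computes the two terms of $\log\mathbb{P}_{\pi^{\psi},\theta^{\prime}}(\tau)$ separately for the Bernoulli observation model of Appendix A; gradients with respect to $\widetilde{\Theta}$ only ever flow through the second term.

```python
import torch

def log_prob_decomposition(policy, theta, states, actions, ys):
    """Split log P_{pi,theta}(tau) into a policy term (depends only on psi)
    and an observation term (depends only on theta)."""
    # sum_t log pi^psi(x_t | s_t)
    lp_policy = torch.stack([torch.log(policy(s)[a])
                             for s, a in zip(states, actions)]).sum()
    # sum_t log f(y_t | theta, s_t, x_t) for y_t ~ Bernoulli(theta_{x_t})
    p = theta[torch.tensor(actions)]
    y = torch.tensor(ys, dtype=theta.dtype)
    lp_obs = (y * torch.log(p) + (1 - y) * torch.log(1 - p)).sum()
    return lp_policy, lp_obs
```

Backpropagating through `lp_obs` alone would then give the $\nabla_{\widetilde{\Theta}}\log\mathbb{P}$ term of equation 10 without requiring $\pi^{\psi}$ to be differentiable.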

B.1 Implementation Details

Training. When training our policies for Best Identification, we warm-start training by optimizing Simple Regret. This is because a randomly initialized policy performs so poorly that the Best Identification loss is nearly always 1, making it difficult to improve the policy. After training $\pi_{1:K}$ in MAPO (Algorithm 1), we warm-start the training of $\widehat{\pi}$ with $\widehat{\pi}=\pi_{\lfloor K/2\rfloor}$. In addition, our generating distribution parameterization exactly follows Section 3.1.

Loss functions. Instead of optimizing the approximated quantity from equation 8 directly, we add regularizers to the losses for both the policy and the generator. First, we choose $\mathcal{L}_{\text{barrier}}$ in equation 10 to be $\lambda_{\text{barrier}}\cdot\max\{0,\log(\mathcal{C}(\mathcal{X},\mathcal{Z},\theta))-\log(r_{k})\}$ for some large constant $\lambda_{\text{barrier}}$. To discourage the policy from over-committing to a particular action, and the generating distribution from covering only a small subset of particles (i.e., mode collapse), we also add negative entropy penalties to both the policy's output distributions and $\text{SOFTMAX}(w)$, with scaling factors $\lambda_{\text{Pol-reg}}$ and $\lambda_{\text{Gen-reg}}$.
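For concreteness, a minimal sketch of these two regularizers, assuming the complexity $\mathcal{C}(\mathcal{X},\mathcal{Z},\theta)$ is available as a differentiable scalar tensor:

```python
import torch

def barrier_loss(complexity, r_k, lam_barrier=1e3):
    """lambda_barrier * max{0, log C(X, Z, theta) - log r_k}: pushes a
    particle back toward the difficulty set when its complexity exceeds r_k."""
    return lam_barrier * torch.clamp(
        torch.log(complexity) - torch.log(torch.tensor(float(r_k))), min=0.0)

def negative_entropy(probs, eps=1e-12):
    """Negative entropy of a probability vector; added with scale
    lambda_Pol-reg or lambda_Gen-reg to discourage over-commitment."""
    return (probs * torch.log(probs + eps)).sum()
```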

State representation. We parameterize our state space $\mathcal{S}$ as a flattened $|\mathcal{X}|\times 3$ matrix where each row represents a distinct $x\in\mathcal{X}$. Specifically, at time $t$ the row of $s_{t}$ corresponding to some $x\in\mathcal{X}$ records the number of times that action $x$ has been taken, $\sum_{s=1}^{t-1}\mathbf{1}\{x_{s}=x\}$, its inverse $(\sum_{s=1}^{t-1}\mathbf{1}\{x_{s}=x\})^{-1}$, and the sum of the corresponding observations $\sum_{s=1}^{t-1}\mathbf{1}\{x_{s}=x\}y_{s}$.
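A minimal sketch of building this state is given below; using 0 for the inverse count of an action that has not yet been played is our own convention, since the paper does not specify one.

```python
import numpy as np

def build_state(num_actions, actions, ys):
    """Flattened |X| x 3 state: per action x, the pull count, its inverse,
    and the sum of observations received when playing x."""
    state = np.zeros((num_actions, 3))
    for x, y in zip(actions, ys):
        state[x, 0] += 1.0     # count of pulls of action x
        state[x, 2] += y       # running sum of observations for x
    counts = state[:, 0]
    state[:, 1] = np.divide(1.0, counts, out=np.zeros(num_actions), where=counts > 0)
    return state.flatten()     # length 3|X| input to the policy MLP
```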

Policy MLP architecture. Our policy $\pi^{\psi}$ is a multi-layer perceptron with weights $\psi$. The policy takes a state of size $3|\mathcal{X}|$ as input and outputs a vector of size $|\mathcal{X}|$, which is then pushed through a softmax to create a probability distribution over $\mathcal{X}$. At the end of the game, regardless of the policy's weights, we set $\widehat{z}=\operatorname*{arg\,max}_{z\in\mathcal{Z}}\langle z,\widehat{\theta}\rangle$ where $\widehat{\theta}$ is the minimum $\ell_{2}$-norm solution to $\operatorname*{arg\,min}_{\theta}\sum_{s=1}^{T}(y_{s}-\langle x_{s},\theta\rangle)^{2}$.

Our policy network is a simple 6-layer MLP with layer sizes $\{3|\mathcal{X}|,256,256,256,256,|\mathcal{X}|\}$, where $3|\mathcal{X}|$ is the size of the input layer and $|\mathcal{X}|$ is the size of the output layer, which is pushed through a softmax function to create a probability distribution over arms. All intermediate layers use leaky ReLU activations with a negative slope of $.01$. The 1D thresholds and 20 Questions experiments share this network structure, with $|\mathcal{X}|=25$ and $|\mathcal{X}|=100$ respectively.
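A sketch of this architecture in PyTorch; the layer sizes, leaky-ReLU slope, softmax output, and Xavier-normal initialization scaled by $.01$ follow the description in this appendix, while anything else (names, batching conventions) is an assumption.

```python
import torch.nn as nn

class PolicyMLP(nn.Module):
    """MLP policy with layer sizes {3|X|, 256, 256, 256, 256, |X|}."""
    def __init__(self, num_actions):
        super().__init__()
        dims = [3 * num_actions, 256, 256, 256, 256, num_actions]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.LeakyReLU(negative_slope=0.01)]
        layers[-1] = nn.Softmax(dim=-1)   # final activation: distribution over X
        self.net = nn.Sequential(*layers)
        for m in self.net:
            if isinstance(m, nn.Linear):  # Xavier-normal init, scaled by .01
                nn.init.xavier_normal_(m.weight)
                m.weight.data.mul_(0.01)
                nn.init.zeros_(m.bias)

    def forward(self, state):             # state: tensor of shape (..., 3|X|)
        return self.net(state)
```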

Appendix C Gradient Estimate Derivation

Here we derive the unbiased gradient estimates in equation 9, equation 10, and equation 11 of Algorithm 2. Since each of the gradient estimates above averages over $M\cdot L$ identically distributed trajectories, it suffices to show that the gradient estimate is unbiased for a single problem $\widetilde{\theta}_{i}$ and its rollout trajectory $\{(x_{t},y_{t})\}_{t=1}^{T}$.

For a feasible $w$, using the score-function identity (Aleksandrov et al., 1968),

    $\nabla_{w}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]=\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\cdot\nabla_{w}\log(\text{SOFTMAX}(w)_{i})\right].$

Observe that if $i\sim\text{SOFTMAX}(w)$ and $\{(x_{t},y_{t})\}_{t=1}^{T}$ is the result of rolling out the policy $\pi^{\psi}$ on $\widetilde{\theta}_{i}$, then

    $g^{w}(i,\{(x_{t},y_{t})\}_{t=1}^{T}):=L_{T}(\pi^{\psi},\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})\cdot\nabla_{w}\log(\text{SOFTMAX}(w)_{i})$

is an unbiased estimate of $\nabla_{w}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]$.

For a feasible set $\widetilde{\Theta}$, by definition of $\ell(\pi,\theta)$,

    $\nabla_{\widetilde{\Theta}}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]=\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\nabla_{\widetilde{\Theta}}\mathbb{E}_{\pi,\widetilde{\theta}_{i}}\left[L_{T}(\pi,\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})\right]\right]$
    $=\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\mathbb{E}_{\pi,\widetilde{\theta}_{i}}\left[\nabla_{\widetilde{\Theta}}L_{T}(\pi,\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})+L_{T}(\pi,\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})\cdot\nabla_{\widetilde{\Theta}}\log(\mathbb{P}_{\pi^{\psi},\widetilde{\theta}_{i}}(\{(x_{t},y_{t})\}_{t=1}^{T}))\right]\right]$  (12)

where the last equality follows from the chain rule and the score-function identity (Aleksandrov et al., 1968). The quantity inside the expectations, call it $g^{\widetilde{\Theta}}(i,\{(x_{t},y_{t})\}_{t=1}^{T})$, is therefore an unbiased estimator of $\nabla_{\widetilde{\Theta}}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]$, given that $i$ and $\{(x_{t},y_{t})\}_{t=1}^{T}$ are rolled out accordingly. Note that if $\mathcal{L}_{\text{barrier}}\neq 0$, then $\nabla_{\widetilde{\Theta}}\mathcal{L}_{\text{barrier}}(\widetilde{\theta}_{i},\Omega)$ is clearly an unbiased gradient estimator of $\nabla_{\widetilde{\Theta}}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}[\mathbb{E}_{\pi,\widetilde{\theta}_{i}}[\mathcal{L}_{\text{barrier}}(\widetilde{\theta}_{i},\Omega)]]$, given that $i$ and the rollout are sampled accordingly.

Likewise, for the policy,

    $g^{\psi}(i,\{(x_{t},y_{t})\}_{t=1}^{T}):=L_{T}(\pi^{\psi},\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})\cdot\nabla_{\psi}\log(\mathbb{P}_{\pi^{\psi},\widetilde{\theta}_{i}}(\{(x_{t},y_{t})\}_{t=1}^{T}))$

is an unbiased estimate of $\nabla_{\psi}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]$.

Appendix D Hyper-parameters

In this section, we list our hyperparameters. First, we define $\lambda_{\text{binary}}$ to be a coefficient multiplied into binary losses, so instead of $\bm{1}\{z_{\ast}(\theta_{*})\neq\widehat{z}\}$ we receive loss $\lambda_{\text{binary}}\cdot\bm{1}\{z_{\ast}(\theta_{*})\neq\widehat{z}\}$. We choose $\lambda_{\text{binary}}$ so that the received rewards are approximately on the same scale as Simple Regret. All optimizers in our experiments are Adam, and all budget sizes are $T=20$. For fairness of evaluation, within each experiment (1D thresholds or 20 Questions), all parameters below are shared when evaluating all of the policies. To elaborate on the training strategy proposed in MAPO (Algorithm 1), we divide training into four procedures, as indicated in Table 3:

  • Init. The initialization procedure takes a rather small portion of iterations, primarily to optimize $\mathcal{L}_{\text{barrier}}$ so that the particles converge into the constrained difficulty sets. During initialization we also initialize and freeze $w=\vec{0}$, putting a uniform distribution over the particles. This allows us to utilize the entire set of particles without $w$ converging to only a few particles early on. To initialize $\widetilde{\Theta}$, we sample $2/3$ of the $N$ particles uniformly from $[-1,1]^{|\mathcal{X}|}$, and the remaining $1/3$ by sampling, for each $i\in[|\mathcal{Z}|]$, $\frac{N}{3|\mathcal{Z}|}$ particles uniformly from $\{\theta:\operatorname*{arg\,max}_{j}\langle\theta,z_{j}\rangle=i\}$. We initialize the policy weights by Xavier initialization with weights sampled from a normal distribution and scaled by $.01$.

  • Regret training ($\widetilde{\pi}_{i}$). Training with the Simple Regret objective usually takes the longest among the procedures. Its primary purpose is to let the policy converge to a reasonable warm start that already captures some essence of the task.

  • Fine-tune $\pi_{i}$. Training with the Best Identification objective is run multiple times, once for each $\pi_{i}$ with its corresponding complexity set $\Theta_{i}$. During each run, we start with a warm-started policy and reinitialize the rest of the models by running the initialization procedure, followed by optimizing the Best Identification objective.

  • Fine-tune $\widehat{\pi}$. This procedure optimizes equation 3, with baselines $\min_{k}\ell(\pi_{k},\Theta^{(r_{k})})$ evaluated using each $\pi_{i}$ learned in the previous procedure. Similar to fine-tuning each individual $\pi_{i}$, we warm-start the policy from $\pi_{\lfloor K/2\rfloor}$ and reinitialize $w$ and $\Theta$ by running the initialization procedure again.

| Procedure | Hyper-parameter | 1D Threshold ($\lvert\mathcal{X}\rvert=25$) | 20 Questions ($\lvert\mathcal{X}\rvert=100$) | Jester Joke ($\lvert\mathcal{X}\rvert=100$) |
|---|---|---|---|---|
| Init | $N_{\text{it}}$ | 20000 | 20000 | 20000 |
| | $\psi$ learning rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| | $\widetilde{\Theta}$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| | $w$ learning rate | 0 | 0 | 0 |
| Regret Training | $N_{\text{it}}$ | 480000 | 480000 | 480000 |
| | $\psi$ learning rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| | $\widetilde{\Theta}$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| | $w$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| Fine-tune | $N_{\text{it}}$ for $\widetilde{\pi}_{i}$ | 200000 | 0 | 200000 |
| | $N_{\text{it}}$ for $\pi_{i}$ | 200000 | 1500000 | N/A |
| | $N_{\text{it}}$ for $\pi_{*}$ | 500000 | 250000 | 500000 |
| | $\psi$ learning rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| | $\widetilde{\Theta}$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| | $w$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| Adam Optimizer | $\beta_{1}$ | .9 | .9 | .9 |
| | $\beta_{2}$ | .999 | .999 | .999 |

Table 3: Number of iterations and learning rates.
| Procedure | Hyper-parameter | 1D Threshold ($\lvert\mathcal{X}\rvert=25$) | 20 Questions ($\lvert\mathcal{X}\rvert=100$) | Jester Joke ($\lvert\mathcal{X}\rvert=100$) |
|---|---|---|---|---|
| Init + Train + Fine-tune | $N$ | $1000\times\lvert\mathcal{Z}\rvert$ | $300\times\lvert\mathcal{Z}\rvert$ | $2000\times\lvert\mathcal{Z}\rvert$ |
| | $M$ | 1000 | 500 | 500 |
| | $L$ | 10 | 30 | 30 |
| | $\lambda_{\text{binary}}$ | 7.5 | 30 | 30 |
| | $\lambda_{\text{Pol-reg}}$ (regret) | .2 | .8 | .8 |
| | $\lambda_{\text{Pol-reg}}$ (fine-tune) | .3 | .8 | .8 |
| | $\lambda_{\text{Gen-reg}}$ | .05 | .1 | .05 |
| | $\lambda_{\text{barrier}}$ | $10^{3}$ | $10^{3}$ | $10^{3}$ |

Table 4: Parallel sizes and regularization coefficients.

To provide a general strategy for choosing hyper-parameters: first, $L$, $\lambda_{\text{binary}}$, and $\lambda_{\text{Pol-reg}}$ are primarily tuned against $|\mathcal{X}|$, since the noisiness and scale of the gradients, as well as the entropy over the arms $\mathcal{X}$, grow with $|\mathcal{X}|$. Second, $\lambda_{\text{Gen-reg}}$ is primarily tuned against $|\mathcal{Z}|$, as it penalizes the entropy over the $N$ particles, which is a multiple of $|\mathcal{Z}|$. Third, the learning rate for $\theta$ is primarily tuned for convergence of the particles into the restricted class; $\mathcal{L}_{\text{barrier}}$ reaching 0 within the specified number of initialization iterations is a good indicator. Finally, we choose $N$ and $M$ according to the memory constraint of our GPU. The hyper-parameters for each experiment were tuned with fewer than 20 hyper-parameter assignments; metrics to monitor while tuning include, but are not limited to, the gradient magnitude of each component, the convergence of each loss, and the entropy losses for each regularization term (how close each is to the entropy of a uniform distribution).

Appendix E Policy Evaluation

When evaluating a policy, we are essentially solving the following objective for a fixed policy $\pi$:

    $\max_{\theta\in\Omega}\ell(\pi,\theta)$

where $\Omega$ is a set of problems. However, due to the non-concavity of this loss function, gradient ascent initialized randomly may converge to a local maximum. To reduce this possibility, we randomly initialize many iterates and take gradient steps in a round-robin fashion, eliminating poorly performing trajectories. To do this with a fixed amount of computational resources, we apply the successive halving algorithm of Li et al. (2018). Specifically, we choose hyperparameters $\eta=4$, $r=100$, $R=1600$, and $s=0$. This translates to:

  • Initialize $|\widetilde{\Theta}|=1600$ and optimize each $\widetilde{\theta}_{i}\in\widetilde{\Theta}$ for 100 iterations.

  • Take the top 400 of them and optimize for another 400 iterations.

  • Take the top 100 of the remaining 400 and optimize for an additional 1600 iterations.

We take gradient steps with the Adam optimizer (Kingma & Ba, 2014) with a learning rate of $10^{-3}$, $\beta_{1}=.9$, and $\beta_{2}=.999$.
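A minimal sketch of this evaluation loop follows; here `ascend` and `score` are assumed callbacks that run Adam ascent steps on a candidate $\theta$ and return its current loss value $\ell(\pi,\theta)$, respectively.

```python
def successive_halving(thetas, ascend, score, eta=4, r=100, rungs=3):
    """Give each candidate `budget` ascent iterations, keep the top 1/eta
    fraction, multiply the budget by eta, and repeat; return the best survivor."""
    budget = r
    for rung in range(rungs):
        thetas = [ascend(theta, budget) for theta in thetas]
        if rung < rungs - 1:
            keep = max(1, len(thetas) // eta)
            thetas = sorted(thetas, key=score, reverse=True)[:keep]
            budget *= eta
    return max(thetas, key=score)

# With 1600 random initializations this reproduces the schedule above:
# 1600 x 100 iterations -> top 400 x 400 iterations -> top 100 x 1600 iterations.
```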

Appendix F Figures at Full Scale

Figure 8: Full scale of Figure 2.
Figure 9: Full scale of Figure 3.
Figure 10: Full scale of Figure 4.
Figure 11: Full scale of Figure 5.
Figure 12: Full scale of Figure 6.
Figure 13: Full scale of Figure 7.

Appendix G Uncertainty Sampling

We define the symmetric difference of a set of binary vectors, $\text{SymDiff}(\{z_{1},...,z_{n}\})=\{i:\exists j,k\in[n]\ \text{s.t.}\ z_{j}^{(i)}=1\land z_{k}^{(i)}=0\}$, as the set of coordinates on which the vectors disagree.

  Input: $\mathcal{X},\mathcal{Z}$
  for $t=1,\dots,T$ do
     $\widehat{\theta}_{t-1}=\operatorname*{arg\,min}_{\theta}\sum_{s=1}^{t-1}(y_{s}-\langle x_{s},\theta\rangle)^{2}$
     $\widehat{\mathcal{Z}}=\{z\in\mathcal{Z}:\max_{z^{\prime}\in\mathcal{Z}}\langle z^{\prime},\widehat{\theta}_{t-1}\rangle=\langle z,\widehat{\theta}_{t-1}\rangle\}$
     if $|\widehat{\mathcal{Z}}|=1$ then
        $\widehat{\mathcal{Z}}_{t}=\widehat{\mathcal{Z}}\cup\{z\in\mathcal{Z}:\max_{z^{\prime}\in\mathcal{Z}\backslash\widehat{\mathcal{Z}}}\langle z^{\prime},\widehat{\theta}_{t-1}\rangle=\langle z,\widehat{\theta}_{t-1}\rangle\}$
     else
        $\widehat{\mathcal{Z}}_{t}=\widehat{\mathcal{Z}}$
     end if
     Uniformly sample $I_{t}$ from $\text{SymDiff}(\widehat{\mathcal{Z}}_{t})$
     Pull $x_{I_{t}}$ and observe $y_{t}$
  end for
Algorithm 4: Uncertainty sampling in the very-small-budget setting
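A sketch of Algorithm 4 in Python, specialized to $\mathcal{X}=\{\mathbf{e}_{1},\dots,\mathbf{e}_{d}\}$ so that the least-squares estimate reduces to per-coordinate empirical means; `observe(i)` is an assumed callback that returns the noisy observation for coordinate $i$.

```python
import numpy as np

def sym_diff(Z_hat):
    """Coordinates on which the candidate z's disagree."""
    Z_hat = np.asarray(Z_hat)
    return np.where(Z_hat.min(axis=0) != Z_hat.max(axis=0))[0]

def uncertainty_sampling(Z, d, T, observe, seed=0):
    rng = np.random.default_rng(seed)
    sums, counts = np.zeros(d), np.zeros(d)
    queries = []
    for t in range(T):
        # least-squares estimate from the data so far (unqueried coords -> 0)
        theta_hat = np.divide(sums, counts, out=np.zeros(d), where=counts > 0)
        vals = Z @ theta_hat
        best = np.flatnonzero(vals == vals.max())
        if len(best) == 1:                       # add the runner-up arms
            rest = np.setdiff1d(np.arange(len(Z)), best)
            runner = rest[Z[rest] @ theta_hat == (Z[rest] @ theta_hat).max()]
            Z_hat = Z[np.concatenate([best, runner])]
        else:
            Z_hat = Z[best]
        i_t = int(rng.choice(sym_diff(Z_hat)))   # uniform over disagreements
        y_t = observe(i_t)
        sums[i_t] += y_t
        counts[i_t] += 1
        queries.append((i_t, y_t))
    return queries
```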

Appendix H Learning to Actively Learn Algorithm

To train a policy under the learning to actively learn setting, we aim to solve for the objective

    $\min_{\psi}\mathbb{E}_{\theta\sim\widehat{\mathcal{P}}}[\ell(\pi^{\psi},\theta)]$

where our policy and states are parameterized the same way as in Appendix B.1 for a fair comparison. To optimize this parameter, we take gradient steps as in equation 11, but with the new sampling and rollout procedure where $\widetilde{\theta}_{i}\sim\widehat{\mathcal{P}}$. This gradient step follows from both the classical policy gradient algorithm in reinforcement learning and the recent learning to actively learn work by Kveton et al. (2020).

Moreover, note that the optimal policy for this objective must be deterministic, since deterministic policies are optimal for MDPs. Therefore, under our experimental setting, the deterministic Bayes-LAL policy will perform poorly in the adversarial setting (for the same reason that SGBS performs poorly).

Appendix I 20 Questions Setup

Hu et al. (2018) collected a dataset of 1000 celebrities and 500 possible questions to ask about each celebrity. We chose 100 questions out of the 500 by first constructing $\bar{p}^{\prime}$, $\mathcal{X}^{\prime}$, and $\mathcal{Z}^{\prime}$ for the 500-dimensional data, and then sampling 100 of the 500 dimensions without replacement from a distribution derived from a static allocation. We down-sampled the number of questions so that training can run with sufficiently large $M$ and $L$ to de-noise the gradients while being prototyped on a single GPU.

Specifically, the dataset from Hu et al. (2018) consists of the probabilities of people answering Yes / No / Unknown to each celebrity-question pair, collected from some population. To better fit the combinatorial bandit scenario, we re-normalize the probability of answering Yes / No, conditioning on the event that these people did not answer Unknown. The probabilities of answering Yes to all 500 questions for each celebrity then constitute vectors $\bar{p}^{\prime(1)},...,\bar{p}^{\prime(1000)}\in\mathbb{R}^{500}$, where the $i$th dimension of a given $\bar{p}^{\prime(j)}$ is the probability of answering Yes to the $i$th question about the $j$th person. The action set is constructed as $\mathcal{X}^{\prime}=\{\mathbf{e}_{i}:i\in[500]\}$, while $\mathcal{Z}^{\prime}=\{z^{(j)}:[z^{(j)}]_{i}=\bm{1}\{\bar{p}^{\prime(j)}_{i}>1/2\}\}\subset\{0,1\}^{500}$ are the binary vectors obtained by taking majority votes.
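A minimal sketch of this preprocessing step (the array names and the $(1000,500)$ layout are our assumptions about how the raw data is stored):

```python
import numpy as np

def preprocess_20q(p_yes, p_no):
    """p_yes, p_no: (1000, 500) arrays of answer probabilities per
    celebrity-question pair.  Renormalize conditioned on not answering
    Unknown, then threshold at 1/2 to get the majority-vote vectors z^(j)."""
    p_bar = p_yes / (p_yes + p_no)        # P(Yes | answer is Yes or No)
    Z = (p_bar > 0.5).astype(int)         # one binary vector per celebrity
    return p_bar, Z
```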

To sub-sample 100 questions from the 500, we could have selected questions uniformly at random, but many of these questions are not very discriminative. Thus, we chose a "good" set of queries based on the design recommended by the $\rho_{\ast}$ procedure of Fiez et al. (2019). If questions were answered noiselessly in response to a particular $z\in\mathcal{Z}^{\prime}$, then equivalently we have $\theta=2z-1$ for this setting. Since $\rho_{\ast}$ optimizes an allocation $\lambda$ over $\mathcal{X}^{\prime}$ that reduces the number of required queries as much as possible (according to the information-theoretic bound of Fiez et al. (2019)), and we want a single allocation for all $z^{\prime}\in\mathcal{Z}^{\prime}$ simultaneously, we solve the optimization problem

    $\min_{\lambda\in\Delta^{(|\mathcal{X}^{\prime}|-1)}}\max_{z^{\prime}\in\mathcal{Z}^{\prime}}\max_{z\neq z^{\prime}}\frac{\lVert z^{\prime}-z\rVert_{(\sum_{i}\lambda_{i}x_{i}x_{i}^{T})^{-1}}^{2}}{((z^{\prime}-z)^{T}(2z^{\prime}-1))^{2}}.$

We then sample elements from $\mathcal{X}^{\prime}$ according to this optimal $\lambda$ without replacement and add them to $\mathcal{X}$ until $|\mathcal{X}|=100$.
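Given the optimal allocation $\lambda$ from the optimization above, the sub-sampling step can be sketched as follows (the minimax design optimization itself is not shown):

```python
import numpy as np

def subsample_questions(lam, k=100, seed=0):
    """Draw k question indices without replacement, weighted by the design lam."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)
    return rng.choice(len(lam), size=k, replace=False, p=lam / lam.sum())
```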

Appendix J Jester Joke Recommendation Setup

We consider the Jester jokes dataset of Goldberg et al. (2001), which contains jokes ranging from pun-based to grossly offensive. We filter the dataset to contain only users that rated all 100 jokes, resulting in 14116 users. Each joke was rated on a $[-10,10]$ scale, which we rescale to $[-1,1]$. Denote this set of ratings as $\hat{\Theta}=\{\theta_{i}:i\in[14116],\theta_{i}\in[-1,1]^{100}\}$, where $\theta_{i}$ encodes the ratings of all 100 jokes by user $i$. To construct the set of arms $\mathcal{Z}$, we cluster the ratings of these users into 10 groups to obtain $\mathcal{Z}=\{z_{i}:i\in[10],z_{i}\in\{0,1\}^{100}\}$ by minimizing the following objective:

    $\min_{\mathcal{Z}:|\mathcal{Z}|=10}\sum_{i=1}^{14116}\Big(\max_{z_{*}\in\{0,1\}^{100}}\langle z_{*},\theta_{i}\rangle-\max_{z\in\mathcal{Z}}\langle z,\theta_{i}\rangle\Big).$

To solve for $\mathcal{Z}$, we adapt the $k$-means algorithm, using the objective above instead of the traditional $L_{2}$ metric. A minimal sketch of this adaptation follows.
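The sketch alternates between assigning each user to the $z$ with the highest inner product and updating each cluster's binary vector coordinate-wise (include joke $i$ iff the cluster members' summed rating at coordinate $i$ is positive), which is the exact minimizer of the objective above for a fixed assignment; the function and variable names are illustrative.

```python
import numpy as np

def cluster_jokes(thetas, k=10, iters=50, seed=0):
    """k-means-style alternating minimization of the regret-like objective."""
    rng = np.random.default_rng(seed)
    n, d = thetas.shape
    Z = rng.integers(0, 2, size=(k, d))              # random binary init
    for _ in range(iters):
        assign = np.argmax(thetas @ Z.T, axis=1)     # best z for each user
        for j in range(k):
            members = thetas[assign == j]
            if len(members):                         # binary "centroid" update
                Z[j] = (members.sum(axis=0) > 0).astype(int)
    return Z
```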