
Learning to Actively Learn: A Robust Approach

Jifan Zhang
Department of Computer Science
University of Wisconsin
Madison, WI, USA
jifan@cs.wisc.edu
Lalit Jain
Foster School of Business
University of Washington
Seattle, WA, USA
lalitj@uw.edu
Kevin Jamieson
Allen School of Computer Science & Engineering
University of Washington
Seattle, WA, USA
jamieson@cs.washington.edu
Abstract

This work proposes a procedure for designing algorithms for specific adaptive data collection tasks like active learning and pure-exploration multi-armed bandits. Unlike the design of traditional adaptive algorithms that rely on concentration of measure and careful analysis to justify the correctness and sample complexity of the procedure, our adaptive algorithm is learned via adversarial training over equivalence classes of problems derived from information theoretic lower bounds. In particular, a single adaptive learning algorithm is learned that competes with the best adaptive algorithm learned for each equivalence class. Our procedure takes as input just the available queries, set of hypotheses, loss function, and total query budget. This is in contrast to existing meta-learning work that learns an adaptive algorithm relative to an explicit, user-defined subset or prior distribution over problems, which can be challenging to define and may be mismatched to the instance encountered at test time. This work is particularly focused on the regime when the total query budget is very small, such as a few dozen, which is much smaller than the budgets typically considered by theoretically derived algorithms. We perform synthetic experiments to justify the stability and effectiveness of the training procedure, and then evaluate the method on tasks derived from real data including a noisy 20 Questions game and a joke recommendation task.

1 Introduction

Closed-loop learning algorithms use previous observations to inform what measurements to take next in a closed loop in order to accomplish inference tasks far faster than any fixed measurement plan set in advance. For example, active learning algorithms for binary classification have been proposed that under favorable conditions require exponentially fewer labels than passive, random sampling to identify the optimal classifier (Hanneke et al., 2014; Katz-Samuels et al., 2021). And in the multi-armed bandits literature, adaptive sampling techniques have demonstrated the ability to identify the “best arm” that optimizes some metric with far fewer experiments than a fixed design (Garivier & Kaufmann, 2016; Fiez et al., 2019). Unfortunately, such guarantees often either require simplifying assumptions that limit robustness and applicability, or algorithmic use of concentration inequalities that are very loose unless the number of samples is very large.

This work proposes a framework for producing algorithms that are learned through simulated experience to be as effective and robust as possible, even on a tiny measurement budget (e.g., 20 queries) where most theoretical guarantees do not apply. Our work fits into a recent trend sometimes referred to as learning to actively learn and differentiable meta-learning in bandits (Konyushkova et al., 2017; Bachman et al., 2017; Fang et al., 2017; Boutilier et al., 2020; Kveton et al., 2020), which tune existing algorithms or learn entirely new active learning algorithms by policy optimization. Previous works in this area learn a policy by optimizing with respect to data observed through prior experience (e.g., meta-learning or transfer learning) or an assumed explicit prior distribution of problem parameters (e.g., a Gaussian prior over the true weight vector for linear regression). In contrast, our approach makes no assumptions about what parameters are likely to be encountered at test time, and therefore produces algorithms that do not suffer from mismatched priors at test time. Instead, our method learns a policy that attempts to mirror the guarantees of frequentist algorithms with instance-dependent sample complexities: there is an intrinsic difficulty measure that orders problem instances, and given a fixed budget, higher accuracy can be obtained on easier instances than on harder ones. This difficulty measure is most naturally derived from information theoretic lower bounds.

But unlike information theoretic bounds that hand-craft adversarial instances, inspired by the robust reinforcement learning literature, we formulate a novel adversarial training objective that automatically trains minimax policies, and we propose a tractable and computationally efficient relaxation. This allows our learned policies to be very aggressive while maintaining robustness over the difficulty of problem instances, without resorting to loose concentration inequalities inside the algorithm. Indeed, this work is particularly useful in the setting where relatively few rounds of querying can be made. The learning framework is general enough to be applied to many active learning settings of interest and is intended to be used to produce robust and high performing algorithms. We implement the framework for the pure-exploration combinatorial bandit problem, a paradigm including problems such as active binary classification and the 20 question game. We empirically validate our framework on a simple synthetic experiment before turning our attention to datasets derived from real data, including a noisy 20 Questions game and a joke recommendation task, which are also embedded as combinatorial bandits. As demonstrated in our experiments, in the low budget setting, our learned algorithms are the only ones that both enjoy robustness guarantees (as opposed to greedy and existing learning to actively learn methods) and perform non-vacuously and instance-optimally (as opposed to statistically justified algorithms).

2 Proposed Framework for Robust Learning to Actively Learn

From a bird's-eye perspective, whether learned or defined by an expert, any algorithm for active learning can be thought of as a policy in the sense of reinforcement learning. To be precise, at time $t$, based on an internal state $s_t$, the policy $\pi$ defines a distribution $\pi(s_t)$ over the set of potential actions $\mathcal{X}$. It then takes an action $x_t \in \mathcal{X}$, $x_t \sim \pi(s_t)$, receives an observation $y_t$, updates the state, and the process repeats.

Fix a horizon $T \in \mathbb{N}$ and a problem instance $\theta_* \in \Theta \subseteq \mathbb{R}^d$ which parameterizes the observation distribution. For $t = 1, 2, \dots, T$:

  • the state $s_t \in \mathcal{S}$ is a function of the history $\{(x_i, y_i)\}_{i=1}^{t-1}$,

  • the action $x_t \in \mathcal{X}$ is drawn at random from the distribution $\pi(s_t)$ defined over $\mathcal{X}$, and

  • the next state $s_{t+1} \in \mathcal{S}$ is constructed by taking action $x_t$ in state $s_t$ and observing $y_t \sim f(\cdot \mid \theta_*, s_t, x_t)$,

until the game terminates at time $t = T$ and the learner receives a loss $L_T$ which is task specific. Note that $L_T$ is a random variable that depends on the tuple $(\pi, \{(x_i, y_i)\}_{i=1}^T, \theta_*)$. We assume that $f$ is a known parametric distribution to the policy but the parameter $\theta_*$ is unknown to the policy. Let $\mathbb{P}_{\pi,\theta}, \mathbb{E}_{\pi,\theta}$ denote the probability and expectation under the probability law induced by executing policy $\pi$ in the game with $\theta_* = \theta$ to completion. Note that $\mathbb{P}_{\pi,\theta}$ includes any internal randomness of the policy $\pi$ and the random observations $y_t \sim f(\cdot \mid \theta, s_t, x_t)$. Thus, $\mathbb{P}_{\pi,\theta}$ assigns a probability to any trajectory $\{(x_i, y_i)\}_{i=1}^T$. For a given policy $\pi$ and $\theta_* = \theta$, the metric of interest we wish to minimize is the expected loss $\ell(\pi, \theta) := \mathbb{E}_{\pi,\theta}[L_T]$, where $L_T$ as defined above is the loss observed at the end of the episode. For a fixed policy $\pi$, $\ell(\pi, \theta)$ defines a loss surface over all possible values of $\theta$. This loss surface captures the fact that some values of $\theta$ are just intrinsically harder than others, but also that a policy may be better suited for some values of $\theta$ than others.

Finally, we assume we are equipped with a positive function $\mathcal{C} : \Theta \rightarrow (0, \infty)$ that assigns a score to each $\theta \in \Theta$, intuitively capturing the "difficulty" of a particular $\theta$, and which can be used as a partial ordering of $\Theta$. Ideally, $\mathcal{C}(\theta)$ is a monotonic transformation of $\ell(\pi^*, \theta)$ for some "best" policy $\pi^*$ that we will define shortly. Our plan is as follows: in Section 2.1, we ground the discussion and describe $\mathcal{C}(\theta)$ for the combinatorial bandit problem. Then in Section 2.2, we zoom out to define our main objective of finding a min-gap optimal policy, finally providing an adversarial training approach in Section 3.

2.1 Complexity for Combinatorial Bandits

A concrete example of the framework above is the combinatorial bandit problem. The learner has access to sets $\mathcal{X} = \{e_1, \dots, e_d\} \subset \mathbb{R}^d$, where $e_i$ is the $i$-th standard basis vector, and $\mathcal{Z} \subset \{0,1\}^d$. In each round the learner chooses an $x_t \in \mathcal{X}$ according to a policy $\pi(\{(x_i, y_i)\}_{i=1}^{t-1})$ and observes $y_t$ with $\mathbb{E}[y_t \mid x_t, \theta_*] = \langle x_t, \theta_* \rangle$ for some unknown $\theta_* \in \mathbb{R}^d$. The goal of the learner is Best Arm Identification. Denote $z_*(\theta_*) = \operatorname{arg\,max}_{z \in \mathcal{Z}} \langle z, \theta_* \rangle$; at time $T$ the learner outputs a recommendation $\widehat{z}$ and incurs loss $L_{BAI,T} = \mathbf{1}\{z_* \neq \widehat{z}\}$. This setting naturally captures the 20 question game. Indeed, assume there are $d \gg T = 20$ potential yes/no questions that can be asked at each time, corresponding to the elements of $\mathcal{X}$, and that each element of $\mathcal{Z}$ is a binary vector representing the answers to these questions for a given item. If answers $y_t$ are deterministic then $\theta_* \in \{-1, 1\}^d$, but this framework also captures the case $\theta_* \in [-1, 1]^d$ when answers are stochastic, or answered incorrectly with some probability. A policy $\pi$ then decides at each time which question to ask, based on the answers so far, in order to determine the item closest to the unknown vector $\theta_*$.
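To make the interaction protocol concrete, here is a minimal Python sketch of such a combinatorial bandit environment in the noisy 20 Questions view. The class and variable names (CombinatorialBanditEnv, Z, theta) are ours for illustration, not from the paper's implementation.

```python
import numpy as np

# A minimal sketch of the combinatorial bandit / noisy 20 Questions environment
# described above: X = {e_1, ..., e_d} are the questions, each row of Z is an
# item's binary answer vector, and theta in [-1, 1]^d gives the expected answers.
class CombinatorialBanditEnv:
    def __init__(self, theta, Z, T=20, rng=None):
        self.theta = np.asarray(theta, dtype=float)
        self.Z = np.asarray(Z)
        self.T = T
        self.rng = np.random.default_rng() if rng is None else rng

    def step(self, i):
        # Ask question i; observe y in {-1, +1} with E[y] = theta_i.
        return 1 if self.rng.random() < (self.theta[i] + 1) / 2 else -1

    def loss(self, z_hat_idx):
        # Best Arm Identification loss 1{z_* != z_hat}.
        return int(np.argmax(self.Z @ self.theta) != z_hat_idx)
```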

As described in Section 5 and Appendix A, combinatorial bandits generalize standard multi-armed bandits and all of binary classification, and have thus received a tremendous amount of interest in recent years. A large portion of this work has focused on providing a precise characterization of the information theoretic limit on the minimal number of samples needed to identify $z_*(\theta_*)$ with high probability, a quantity denoted $\rho_*(\theta_*)$, which is the solution to an optimization problem (Soare et al., 2014; Fiez et al., 2019; Degenne et al., 2020):
$$\rho_*(\theta_*)^{-1} := \max_{\lambda \in \triangle_{\mathcal{X}}} \min_{\substack{\theta' \in \Theta \\ z_*(\theta') \neq z_*(\theta_*)}} \sum_{x \in \mathcal{X}} \lambda_x \langle x, \theta_* - \theta' \rangle^2$$
for some set of alternatives $\Theta$. This quantity provides a natural complexity measure $\mathcal{C}(\theta_*) = \rho_*(\theta_*)$ for a given instance $\theta_*$, and we describe it in a few specific cases below.

As a warmup example, consider the standard best-arm identification problem where $\mathcal{Z} = \mathcal{X} = \{\mathbf{e}_i : i \in [d]\}$ and choosing action $x_t \in \mathcal{X}$ results in reward $y_t \sim \text{Bernoulli}(\theta_{i_t})$. Let $i_*(\theta) = \operatorname{arg\,max}_{z \in \mathcal{Z}} z^\top \theta = \operatorname{arg\,max}_i \theta_i$. Then in this case $\rho_*(\theta) \approx \sum_{i \neq i_*(\theta)} (\theta_{i_*(\theta)} - \theta_i)^{-2}$, and it has been shown that there exists a constant $c_0 > 0$ such that for any sufficiently large $\nu > 0$ we have

$$\min_{\pi} \max_{\theta : \rho_*(\theta) \leq \nu} \ell_{BAI}(\pi, \theta) \geq \exp(-c_0 T / \nu).$$

In other words, more difficult instances correspond to $\theta$ with a small gap between the best arm and any other arm. Moreover, for any $\theta \in \mathbb{R}^d$ there exists a policy $\widetilde{\pi}$ that achieves $\ell(\widetilde{\pi}, \theta) \leq c_1 \exp(-c_2 T / \rho_*(\theta))$ where $c_1, c_2$ capture constant and low-order terms (Carpentier & Locatelli, 2016; Karnin et al., 2013; Garivier & Kaufmann, 2016). Said plainly, the above correspondence between the lower bound and the upper bound for the multi-armed bandit problem shows that $\rho_*(\theta_*)$ is a natural choice for $\mathcal{C}(\theta)$ in this setting.

In recent years, algorithms for the more general combinatorial bandit setting have been established with instance-dependent sample complexities matching $\rho_*(\theta_*)$ (up to logarithmic factors) (Karnin et al., 2013; Chen et al., 2014; Fiez et al., 2019; Chen et al., 2017; Degenne et al., 2020; Katz-Samuels et al., 2020). Another complexity term that appears in Cao & Krishnamurthy (2017) for combinatorial bandits is

$$\widetilde{\rho}(\theta) = \sum_{i=1}^d \max_{z : z_i \neq z_{*,i}(\theta)} \frac{\|z - z_*(\theta)\|_2^2}{\langle z - z_*(\theta), \theta \rangle^2}. \tag{1}$$

One can show $\rho_*(\theta) \leq \widetilde{\rho}(\theta)$ (Katz-Samuels et al., 2020), and in many cases the two track each other. Because $\widetilde{\rho}(\theta)$ can be computed much more efficiently than $\rho_*(\theta)$, we take $\mathcal{C}(\theta) = \widetilde{\rho}(\theta)$.
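Since $\widetilde{\rho}(\theta)$ is the complexity measure used throughout, the following is a small sketch of how equation (1) can be evaluated for a finite $\mathcal{Z}$. The function name and the assumption of a small, tie-free $\mathcal{Z}$ are ours.

```python
import numpy as np

def rho_tilde(theta, Z):
    """Surrogate complexity of Eq. (1) for a finite Z given as a (|Z|, d) binary
    array; assumes <z, theta> has a unique maximizer and no zero gaps."""
    Z = np.asarray(Z, dtype=float)
    theta = np.asarray(theta, dtype=float)
    z_star = Z[np.argmax(Z @ theta)]
    total = 0.0
    for i in range(len(theta)):
        mask = Z[:, i] != z_star[i]        # alternatives disagreeing at coordinate i
        if not mask.any():
            continue
        diffs = Z[mask] - z_star           # z - z_*(theta)
        gaps = diffs @ theta               # <z - z_*(theta), theta>
        total += np.max(np.sum(diffs ** 2, axis=1) / gaps ** 2)
    return total
```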

2.2 Objective: Responding to All Difficulties

As described above, though there exist algorithms for the combinatorial bandit problem that are instance-optimal in the fixed-confidence setting, along with algorithms for the fixed-budget setting, they do not work well with small budgets because they rely on statistical guarantees. Indeed, for their guarantees to be non-vacuous, the budget $T$ needs to be large enough to compare to the potentially large constants in the upper bounds. In practice, they are so conservative that for the first 20 samples they would just sample uniformly. To overcome this, we now provide a different framework for policy learning in a worst-case setting that is effective even in the small-budget regime.

The challenge is in finding a policy that performs well across all potential problem instances simultaneously. It is common to consider minimax optimal policies which attempt to perform well on worst case instances, but as a result they may perform poorly on easier instances. Thus, an ideal policy $\pi$ would perform uniformly well over a set of $\theta$'s that are all equivalent in "difficulty". Since each $\theta \in \Theta$ is equipped with an inherent notion of difficulty, $\mathcal{C}(\theta)$, we can stratify the space of all possible instances by difficulty. A good policy is one whose worst case performance over all possible problem difficulties is minimized. We formalize this idea below.

For any set of problem instances $\widetilde{\Theta} \subset \Theta$ and $r \geq 0$ define

$$\ell(\pi, \widetilde{\Theta}) := \max_{\theta \in \widetilde{\Theta}} \ell(\pi, \theta) \quad\quad \text{ and } \quad\quad \Theta^{(r)} := \{\theta : \mathcal{C}(\theta) \leq r\}.$$

For a fixed $r > 0$ (including $r = \infty$), a policy $\pi'$ that aims to minimize just $\ell(\pi', \Theta^{(r)})$ will be minimax for $\Theta^{(r)}$ and may not perform well on easy instances. To overcome this shortsightedness we introduce a new objective by focusing on $\ell(\pi, \Theta^{(r)}) - \min_{\pi'} \ell(\pi', \Theta^{(r)})$: the sub-optimality gap of a given policy $\pi$ relative to an $r$-dependent baseline policy trained specifically for each $r$.

Objective: Return the policy
$$\pi_* := \operatorname{arg\,min}_{\pi} \max_{r > 0} \left( \ell(\pi, \Theta^{(r)}) - \min_{\pi'} \ell(\pi', \Theta^{(r)}) \right) \tag{2}$$
which minimizes the worst case sub-optimality gap over all $r > 0$.

Figure 1: Performance curves for various policies.

Figure 1 illustrates these definitions. The blue curve ($r$-dependent baseline) captures the best performance $\min_{\pi'} \ell(\pi', \Theta^{(r)})$ that is possible for each difficulty level $r$. In other words, the $r$-dependent baseline defines a different policy for each value of $r$; therefore, the blue curve may be unachievable with a single policy. The green curve captures a policy that achieves the minimum (the $r$-dependent baseline) at a given $r'$. Though it is the ideal policy for this difficulty, it could be sub-optimal at any other difficulty. The orange curve is the performance of our optimal policy $\pi_*$: it is willing to sacrifice performance at any given $r$ to achieve an overall better worst-case gap from the baseline.

3 MAPO: Adversarial Training Algorithm

Identifying $\pi_*$ naively requires the computation of $\min_{\pi'} \ell(\pi', \Theta^{(r)})$ for all $r > 0$. However, in practice, given an increasing sequence $r_1 < \dots < r_K$ that indexes nested sets of problem instances of increasing difficulty, $\Theta^{(r_1)} \subset \Theta^{(r_2)} \subset \dots \subset \Theta^{(r_K)}$, we wish to identify a policy $\widehat{\pi}$ that minimizes the maximum sub-optimality gap with respect to this sequence. Explicitly, we seek to learn

$$\widehat{\pi} = \operatorname{arg\,min}_{\pi} \max_{k \leq K} \left( \ell(\pi, \Theta^{(r_k)}) - \ell(\pi_k, \Theta^{(r_k)}) \right) \quad \text{ where } \quad \pi_k \in \operatorname{arg\,min}_{\pi} \max_{\theta : \mathcal{C}(\theta) \leq r_k} \ell(\pi, \theta). \tag{3}$$

Note that as $K \to \infty$ and $\sup_k \frac{r_{k+1}}{r_k} \to 1$, equation 2 and equation 3 are essentially equivalent under benign smoothness conditions on $\mathcal{C}(\theta)$, in which case $\widehat{\pi} \to \pi_*$. In practice, we choose $\Theta^{(r_K)}$ to contain all problems that can be solved relatively accurately within the budget $T$, and a small $\epsilon > 0$ such that $\max_k \frac{r_{k+1}}{r_k} = 1 + \epsilon$. In Algorithm 1, our algorithm MAPO efficiently solves this objective by first computing $\pi_k$ for all $k \in [K]$ to obtain $\ell(\pi_k, \Theta^{(r_k)})$ as benchmarks, and then using these benchmarks to train $\widehat{\pi}$. The next section focuses on the challenges of the optimization problems in equation 4 and equation 5.

  Input: sequence $\{r_k\}_{k=1}^K$, complexity function $\mathcal{C}$.
  Define $k(\theta) \in [K]$ such that $r_{k(\theta)-1} < \mathcal{C}(\theta) \leq r_{k(\theta)}$ for all $\theta$ with $\mathcal{C}(\theta) \leq r_K$.
  for $k = 1, \dots, K$ do
     Obtain policy $\pi_k$ by solving:
$$\pi_k := \operatorname{arg\,min}_{\pi} \ell(\pi, \Theta^{(r_k)}) = \operatorname{arg\,min}_{\pi} \max_{\theta \in \Theta^{(r_k)}} \ell(\pi, \theta) \quad\quad \text{and} \quad\quad b^{(r_k)} := \ell(\pi_k, \Theta^{(r_k)}) \tag{4}$$
  end for
  Training for min-gap optimal policy: Solve the following:
$$\widehat{\pi} = \operatorname{arg\,min}_{\pi} \max_{\theta \in \Theta^{(r_K)}} \left[ \ell(\pi, \theta) - b^{(r_{k(\theta)})} \right] \tag{5}$$
Output: $\widehat{\pi}$ (a solution to equation 3).
Algorithm 1 MAPO: Min-gap Adversarial Policy Optimization
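The outer structure of Algorithm 1 is simple once a min-max solver for equations 4 and 5 is available. The sketch below is a high-level paraphrase under assumed helper routines (train_minimax, worst_case_loss, instance_set); these names are ours, and the actual training procedure is given in Algorithm 2 and the appendices.

```python
# A high-level sketch of Algorithm 1 (MAPO), assuming two helper routines that
# are not defined here: `train_minimax(instance_set, loss_shift)` solves a
# min-max problem of the form in equations (4)/(5) (e.g., via Algorithm 2), and
# `worst_case_loss(policy, instance_set)` estimates ell(pi, Theta^(r)).
def mapo(radii, instance_set, train_minimax, worst_case_loss):
    # Step 1: train an r_k-dependent baseline pi_k and record its benchmark
    # b^(r_k) = ell(pi_k, Theta^(r_k)) for each difficulty level r_k.
    benchmarks = []
    for r in radii:
        pi_k = train_minimax(instance_set(r), loss_shift=lambda c: 0.0)
        benchmarks.append(worst_case_loss(pi_k, instance_set(r)))

    # Step 2: train the min-gap policy on the largest set Theta^(r_K), shifting
    # each instance's loss by the benchmark of its difficulty bracket k(theta).
    def shift(complexity):
        k = next(i for i, r in enumerate(radii) if complexity <= r)
        return benchmarks[k]

    return train_minimax(instance_set(radii[-1]), loss_shift=shift)
```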

3.1 Differentiable policy optimization

The critical part of running MAPO (Algorithm 1) is solving equation 4 and equation 5. Note that equation 5 is an optimization of the same form as equation 4 after shifting the loss by the scalar value $b^{(r_{k(\theta)})}$. Consequently, to learn $\{\pi_k\}_k$ and $\widehat{\pi}$, it suffices to develop a training procedure that solves $\min_{\pi} \max_{\theta \in \Omega} \ell'(\pi, \theta)$ for an arbitrary set $\Omega$ and generic loss function $\ell'(\pi, \theta)$.

We solve this saddle-point problem using the alternating gradient descent/ascent method of Algorithm 2, which we describe now. Instead of optimizing over all possible policies, we restrict the policy class to neural networks that take a state representation as input and output a probability distribution over actions, parameterized by weights $\psi$. In practice, $\ell'(\pi^\psi, \theta)$ may be poorly behaved in $(\psi, \theta)$, so a gradient descent/ascent procedure may get stuck in a neighborhood of a critical point that is not an optimal solution to the saddle-point problem. To avoid this, we instead track many different possible $\theta$'s (intuitively corresponding to different initializations):

$$\min_{\psi} \max_{\theta \in \Omega} \ell'(\pi^\psi, \theta) = \min_{\psi} \max_{\widetilde{\theta}_{1:N} \subset \Omega} \max_{i \in [N]} \ell'(\pi^\psi, \widetilde{\theta}_i) \tag{6}$$
$$= \min_{\psi} \max_{\widetilde{\theta}_{1:N} \subset \Omega} \max_{\lambda \in \Delta_N} \mathbb{E}_{i \sim \lambda}\, \ell'(\pi^\psi, \widetilde{\theta}_i) \tag{7}$$
$$= \min_{\psi} \max_{w \in \mathbb{R}^N,\, \widetilde{\theta}_{1:N} \subset \Omega} \mathbb{E}_{i \sim \mathrm{SOFTMAX}(w)}\left[ \ell'(\pi^\psi, \widetilde{\theta}_i) \right]. \tag{8}$$

In the first equality we replace the maximum over all of $\Omega$ with a maximum over all subsets $\widetilde{\Theta} = \widetilde{\theta}_{1:N}$ of size $N$. The resulting maximum over the $N$ points is still a discrete optimization. To smooth it out, we use the fact that a $\max$ over a set equals the maximum of the expectation over all distributions on that set. In the last equality, we reparameterize the set of distributions with a softmax over weights $w$ on the different values of $\widetilde{\theta}$. In each round, we backpropagate through $w$ and $\widetilde{\theta}_{1:N}$.
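As a concrete illustration of equation 8, the following PyTorch sketch shows the smoothed adversary objective with softmax weights $w$ over the tracked instances $\widetilde{\theta}_{1:N}$. The placeholder loss and all variable names are ours; in the actual procedure the per-instance losses are estimated from rollouts and the gradients use the score-function identity (Appendix C).

```python
import torch

# Sketch of the smoothed inner objective in equation (8), assuming the losses
# ell'(pi^psi, theta_i) for the N tracked instances are available as a
# differentiable function of theta_tilde (a stand-in below).
N, d = 16, 25
w = torch.zeros(N, requires_grad=True)                # softmax weights over instances
theta_tilde = torch.randn(N, d, requires_grad=True)   # tracked problem instances

def smoothed_objective(loss_fn):
    losses = loss_fn(theta_tilde)                     # shape (N,)
    weights = torch.softmax(w, dim=0)
    return (weights * losses).sum()                   # E_{i ~ SOFTMAX(w)}[loss_i]

# One ascent step on the adversary's parameters (descent on psi would alternate).
opt_adv = torch.optim.Adam([w, theta_tilde], lr=1e-2)
obj = smoothed_objective(lambda th: (th ** 2).sum(dim=1))   # placeholder losses
(-obj).backward()                                     # maximize by minimizing the negative
opt_adv.step()
```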

Now we discuss the optimization routine outlined in Algorithm 2. For the inner optimization, ideally, in each round we would build an estimate of the loss function at our current choice of $\pi^\psi$ for each of the $\widetilde{\theta}_{1:N}$'s under consideration. To do so, we roll out the policy $L$ times for each $\theta \in \widetilde{\theta}_{1:N}$ under consideration and then average the resulting losses (this also allows us to construct a stochastic gradient of the loss). In practice we cannot consider all $\theta \in \widetilde{\theta}_{1:N}$, so instead we sample $M$ of them according to $w$. This has a computational benefit, allowing us to be strategic by considering in each round the $\theta$'s that are closest to $\operatorname{arg\,max}_{\widetilde{\theta}_{1:N}} \ell'(\pi^\psi, \theta)$.

After this we backpropagate through $w$ and $\widetilde{\Theta}$ using the stochastic gradients obtained from the rollouts. Finally, we update $\pi$ by backpropagation through the neural network under consideration. The gradient steps are taken with unbiased gradient estimates $g^w(i, \tau)$, $g^{\widetilde{\Theta}}(i, \tau)$ and $g^\psi(i, \tau)$, which are computed using the score-function identity and are described in detail in Appendix C. We outline more implementation details in Appendix B, along with the algorithm below stated with explicit gradient estimate formulas. Hyperparameters can be found in Appendix D.
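For instance, the policy-gradient piece $g^\psi$ is a standard score-function (REINFORCE) estimate. The sketch below shows its generic form, assuming a `policy` module that maps a state tensor to action logits and a `rollout` helper returning the visited states, chosen actions, and final loss $L_T$; both names are ours, and the exact estimators used are those in Appendix C.

```python
import torch

def policy_gradient_estimate(policy, rollout, theta):
    # One-trajectory score-function (REINFORCE) estimate of grad_psi E[L_T],
    # i.e. the form of g^psi: L_T * grad_psi sum_t log pi^psi(x_t | s_t).
    states, actions, L_T = rollout(policy, theta)
    log_probs = torch.stack([
        torch.log_softmax(policy(s), dim=-1)[a] for s, a in zip(states, actions)
    ])
    surrogate = L_T * log_probs.sum()
    return torch.autograd.grad(surrogate, list(policy.parameters()))
```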

Algorithm 2 Gradient Based Optimization of equation 8
  Input: partition $\Omega$, number of iterations $N_{\text{it}}$, number of problem samples $M$, number of rollouts per problem $L$, and loss variable $L_T$ at horizon $T$ (see beginning of Section 2).
  Goal: Compute the optimal policy $\operatorname{arg\,min}_{\pi} \max_{\theta \in \Omega} \ell'(\pi, \theta) = \operatorname{arg\,min}_{\pi} \max_{\theta \in \Omega} \mathbb{E}_{\pi,\theta}[L_T]$. Note that in the case $\ell'(\pi, \theta) = \ell(\pi, \theta) - b^{(r_{k(\theta)})}$, $L_T$ inherently has the scalar value $b^{(r_{k(\theta)})}$ subtracted.
  Initialization: $w$, finite set $\widetilde{\Theta} = \widetilde{\theta}_{1:N}$, and $\psi$.
  for $t = 1, \dots, N_{\text{it}}$ do
     for $m = 1, \dots, M$ do
        Sample $I_m \overset{\text{i.i.d.}}{\sim} \text{SOFTMAX}(w)$.
        Collect $L$ independent rollout trajectories, denoted $\tau_{m,1:L}$, by running the policy $\pi^\psi$ on $\theta_{I_m}$.
     end for
     Update the generating distribution by taking ascending steps on the gradient estimates:
$$\widetilde{\Theta}, w \leftarrow \widetilde{\Theta} + \frac{1}{ML} \sum_{m=1}^M \left( \nabla_{\widetilde{\Theta}} \mathcal{L}_{\text{barrier}}(\widetilde{\theta}_{I_m}, \Omega) + \sum_{l=1}^L g^{\widetilde{\Theta}}(I_m, \tau_{m,l}) \right), \quad w + \frac{1}{ML} \sum_{m=1}^M \sum_{l=1}^L g^w(I_m, \tau_{m,l})$$
     where $\mathcal{L}_{\text{barrier}}$ is a differentiable barrier loss that heavily penalizes $\widetilde{\theta}_{I_m}$'s outside $\Omega$.
     Update the policy by taking a descending step on the gradient estimate:
$$\psi \leftarrow \psi - \frac{1}{ML} \sum_{m=1}^M \sum_{l=1}^L g^\psi(I_m, \tau_{m,l})$$
  end for

4 Experiments

We now evaluate the approach described in the previous section for combinatorial bandits with $\mathcal{X} = \{\mathbf{e}_i : i \in [d]\}$ and $\mathcal{Z} \subset \{0,1\}^d$. This setting generalizes both binary active classification for an arbitrary model class and active recommendation, which we evaluate by conducting experiments on two corresponding real datasets. We evaluate based on two criteria: instance-dependent worst-case and average-case. For the instance-dependent worst-case, we measure, for each $r_k$ and policy $\pi$, $\ell(\pi, \Theta^{(r_k)}) := \max_{\theta \in \Theta^{(r_k)}} \ell(\pi, \theta)$ and plot this value as a function of $r_k$. We note that our algorithm is designed to optimize exactly this metric. For the secondary average-case metric, we instead measure, for policy $\pi$ and some collected set $\Theta$, $\frac{1}{|\Theta|} \sum_{\theta \in \Theta} \ell(\pi, \theta)$. Performance on the instance-dependent worst-case metric is reported in Figures 2, 3, 4, 6, and 7 below, while the average-case performance is reported in the tables and Figure 5. Full-scale versions of the figures can also be found in Appendix F.

4.1 Algorithms

We compare against a number of baseline active learning algorithms (see Section 5 for a review). Uncertainty sampling at time $t$ computes the empirical maximizer of $\langle z, \widehat{\theta} \rangle$ and the runner-up, and samples an index uniformly from their symmetric difference (i.e., thinking of elements of $\mathcal{Z}$ as subsets of $[d]$); if either is not unique, an index is sampled from the region of disagreement of the winners (see Appendix G for details). The greedy methods are represented by soft generalized binary search (SGBS) (Nowak, 2011), which maintains a posterior distribution over $\mathcal{Z}$ and samples to maximize information gain. A hyperparameter $\beta \in (0, 1/2)$ of SGBS determines the strength of the likelihood update. We plot or report a range of performance over $\beta \in \{.01, .03, .1, .2, .3, .4\}$. The agnostic algorithms for classification (Balcan et al., 2006; Hanneke, 2007b;a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019) or combinatorial bandits (Chen et al., 2014; Gabillon et al., 2016; Chen et al., 2017; Cao & Krishnamurthy, 2017; Fiez et al., 2019; Jain & Jamieson, 2019) are so conservative that given just $T = 20$ samples they are all exactly equivalent to uniform sampling, and hence are represented by Uniform. To represent a policy based on learning to actively learn with respect to a prior, we employ the method of Kveton et al. (2020), denoted Bayes-LAL, with a fixed prior $\widetilde{\mathcal{P}}$ constructed by drawing a $z$ uniformly at random from $\mathcal{Z}$ and defining $\theta = 2z - 1 \in [-1, 1]^d$ (details in Appendix H). When evaluating each policy, we use the successive halving algorithm (Li et al., 2017; 2018) for optimizing our non-convex objective with randomly initialized gradient descent and restarts (details in Appendix E).
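For reference, the following is a rough sketch of the SGBS baseline as we paraphrase it from Nowak (2011): maintain a posterior over $\mathcal{Z}$, query the coordinate that most evenly bisects the posterior mass, and apply a soft likelihood update with parameter $\beta$. This is an illustrative approximation under our own variable names, not necessarily the exact implementation we ran.

```python
import numpy as np

def sgbs_step(p, Z, beta, observe):
    """One step of soft generalized binary search (sketch). p: posterior over
    the rows of Z (a (|Z|, d) binary array); beta in (0, 1/2); observe(i)
    returns a label in {-1, +1} for question i."""
    splits = p @ (2 * Z - 1)                        # signed posterior mass per question
    i = int(np.argmin(np.abs(splits)))              # most informative (bisecting) question
    y = observe(i)
    agree = (2 * Z[:, i] - 1) == y
    p = np.where(agree, (1 - beta) * p, beta * p)   # soft likelihood update
    return p / p.sum(), i
```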

4.2 Synthetic Dataset: Thresholds

We begin with a very simple instance to demonstrate the instance-dependent performance achieved by our learned policy. For $d = 25$, let $\mathcal{X} = \{\mathbf{e}_i : i \in [d]\}$, $\mathcal{Z} = \{\sum_{i=1}^k \mathbf{e}_i : k = 0, 1, \dots, d\}$, and let $f(\cdot|\theta, x)$ be a Bernoulli distribution over $\{-1, 1\}$ with mean $\langle x, \theta \rangle \in [-1, 1]$. Appendix A shows that $z_*(\theta_*) = \operatorname{arg\,max}_z \langle z, \theta_* \rangle$ is the best threshold classifier for the label distribution induced by $(\theta_* + 1)/2$. We trained baseline policies $\{\pi_k\}_{k=1}^9$ for the Best Identification metric with $\mathcal{C}(\theta) = \widetilde{\rho}(\mathcal{X}, \mathcal{Z}, \theta)$ and $r_i = 2^{3 + i/2}$ for $i \in \{0, \dots, 8\}$.
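A minimal sketch of this thresholds instance (variable names are ours):

```python
import numpy as np

d = 25
X = np.eye(d)                                                   # arms e_1, ..., e_d
Z = np.array([[1] * k + [0] * (d - k) for k in range(d + 1)])   # threshold classifiers

def pull(theta, i, rng=np.random.default_rng()):
    # Observation in {-1, +1} with mean <e_i, theta> = theta_i.
    return 1 if rng.random() < (theta[i] + 1) / 2 else -1
```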

Figure 2: Learned policies, lower is better
Figure 3: Sub-optimality of individual policies, lower is better
Figure 4: Max over $\{\theta : \widetilde{\rho}(\theta) \leq r\}$, lower is better
Figure 5: Average $\mathbb{E}_{\theta \sim \mathcal{P}_h}[\cdot]$, lower is better

First we compare the base policies $\pi_k$ to $\widehat{\pi}$. Figure 2 presents $\ell(\pi, \Theta^{(r)}) = \max_{\theta : \widetilde{\rho}(\theta) \leq r} \ell(\pi, \theta) = \max_{\theta : \widetilde{\rho}(\theta) \leq r} \mathbb{P}_{\pi,\theta}(\widehat{z} \neq z_*(\theta))$ as a function of $r$ for our base policies $\{\pi_k\}_k$ and the global policy $\widehat{\pi}$, each as an individual curve. Figure 3 plots the same information in terms of the gap $\ell(\pi, \Theta^{(r)}) - \min_{k : r_{k-1} < r \leq r_k} \ell(\pi_k, \Theta^{(r_k)})$. We observe that each $\pi_k$ performs best in a particular region and that $\widehat{\pi}$ performs almost as well as the $r$-dependent baseline policies over the entire range of $r$.

Under the same conditions as Figure 2, Figure 4 compares the performance of $\widehat{\pi}$ to the algorithm benchmarks. Since SGBS and Bayes-LAL are deterministic, the adversarial training finds a $\theta$ that tricks them into catastrophic failure. Figure 5 trades adversarial evaluation for evaluation with respect to a parameterized prior: for each $h \in \{0.5, 0.6, \dots, 1\}$, $\theta \sim \mathcal{P}_h$ is defined by drawing a $z$ uniformly at random from $\mathcal{Z}$ and then setting $\theta_i = (2z_i - 1)(2\alpha_i - 1)$ where $\alpha_i \sim \text{Bernoulli}(h)$. Thus, each sign of $2z - 1$ is flipped with probability $1 - h$. We then compute $\mathbb{E}_{\theta \sim \mathcal{P}_h}[\mathbb{P}_{\pi,\theta}(\widehat{z} \neq z_*(\theta))] = \mathbb{E}_{\theta \sim \mathcal{P}_h}[\ell(\pi, \theta)]$. While SGBS now performs much better than uniform and uncertainty sampling, our policy $\widehat{\pi}$ is still superior to these policies. However, Bayes-LAL is best overall, which is expected since the support of $\mathcal{P}_h$ is essentially a rescaled version of the prior used in Bayes-LAL.
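Drawing $\theta \sim \mathcal{P}_h$ for this evaluation amounts to the following sketch (the function name is ours):

```python
import numpy as np

def sample_theta(Z, h, rng=np.random.default_rng()):
    # theta_i = (2 z_i - 1)(2 alpha_i - 1) with alpha_i ~ Bernoulli(h),
    # so each sign of 2z - 1 is flipped independently with probability 1 - h.
    z = Z[rng.integers(len(Z))]
    alpha = rng.random(len(z)) < h
    return (2 * z - 1) * (2 * alpha - 1)
```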

4.3 Real Datasets

20 Questions. Our dataset is constructed from the real data of Hu et al. (2018). Summarizing how we used the data: $100$ yes/no questions were considered for $1000$ celebrities. Each question $i \in [100]$ for each person $j \in [1000]$ was answered by several annotators to construct an empirical probability $\bar{p}_i^{(j)} \in [0, 1]$ denoting the proportion of annotators that answered "yes." To construct our instance, we take $\mathcal{X} = \{\mathbf{e}_i : i \in [100]\}$ to encode questions and $\mathcal{Z} = \{z^{(j)} : [z^{(j)}]_i = \mathbf{1}\{\bar{p}_i^{(j)} > 1/2\}\} \subset \{0,1\}^{100}$. Just as before, we trained $\{\pi_k\}_{k=1}^4$ for the Best Identification metric with $\mathcal{C}(\theta) = \widetilde{\rho}(\mathcal{X}, \mathcal{Z}, \theta)$ and $r_i = 2^{3 + i/2}$ for $i \in \{1, \dots, 4\}$. See Appendix I for details.
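A sketch of this construction, assuming p_bar is a (1000, 100) array of the empirical "yes" frequencies $\bar{p}_i^{(j)}$ (the array name and the random stand-in below are ours; see Appendix I for the actual preprocessing):

```python
import numpy as np

p_bar = np.random.rand(1000, 100)      # stand-in for the annotated frequencies
Z = (p_bar > 0.5).astype(int)          # z^(j) in {0,1}^100, one row per celebrity
X = np.eye(100)                        # e_i encodes asking question i
# For the average-case evaluation of Section 4.3.2, theta is drawn by picking a
# random person j and setting theta_i = 2 * p_bar[j, i] - 1.
theta = 2 * p_bar[np.random.randint(1000)] - 1
```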

Jester Joke Recommendation. We now turn our attention away from Best Identification to Simple Regret, where $\ell(\pi, \theta) = \mathbb{E}_{\pi,\theta}[\langle z_*(\theta) - \widehat{z}, \theta \rangle]$. We consider the Jester jokes dataset of Goldberg et al. (2001), which contains jokes ranging from innocent puns to grossly offensive jokes. We filter the dataset to only contain users that rated all $100$ jokes, resulting in 14116 users. A rating of each joke was provided on a $[-10, 10]$ scale, which we rescale to $[-1, 1]$, and observations are simulated as Bernoullis as above. We then clustered the ratings of these users into 10 groups (see Appendix J for details) to obtain $\mathcal{Z} = \{z^{(k)} : k \in [10], z^{(k)} \in \{0,1\}^{100}\}$, where $z_i^{(k)} = 1$ corresponds to recommending the $i$th joke in user cluster $z^{(k)} \in \mathcal{Z}$.
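The Simple Regret objective used here is straightforward to evaluate; a small sketch, assuming Z is the (10, 100) binary matrix of cluster recommendation vectors and theta holds a user's rescaled ratings (names are ours):

```python
import numpy as np

def simple_regret(Z, theta, z_hat_idx):
    # <z_*(theta) - z_hat, theta> with z_*(theta) = argmax_z <z, theta>.
    scores = np.asarray(Z) @ np.asarray(theta)
    return scores.max() - scores[z_hat_idx]
```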

4.3.1 Instance-dependent Worst-case

Figure 6 and Figure 7 are analogous to Figure 4 but for the 20 questions and Jester joke instances, respectively. The two deterministic policies, SGBS and Bayes-LAL, fail on these datasets as well against the worst-case instances.

Figure 6: 20 Questions
Figure 7: Jester Joke

On the Jester joke dataset, our policy alone nearly achieves the $r$-dependent baseline for all $r$. But on 20 questions, uncertainty sampling also performs remarkably well. These experiments on real datasets demonstrate that our policy obtains near-optimal instance-dependent sample complexity.

4.3.2 Average Case Performance

While the metric of the previous section rewarded algorithms that perform uniformly well over all possible environments that could be encountered, in this section we consider the performance of an algorithm with respect to a distribution over environments, which we denote as average case.

Table 1: 20 Questions, higher is better

Method | Accuracy (%)
$\pi^*$ (Ours) | 17.9
SGBS | {26.5, 26.2, 27.2, 26.5, 21.4, 12.8}
Uncertainty | 14.3
Bayes-LAL | 4.1
Uniform | 6.9

Table 2: Jester Joke, lower is better

Method | Regret
$\pi^*$ (Ours) | 3.209
SGBS | {3.180, 3.224, 3.278, 3.263, 3.153, 3.090}
Uncertainty | 3.027
Bayes-LAL | 3.610
Uniform | 3.877

While heuristic-based algorithms (such as SGBS, uncertainty sampling, and Bayes-LAL) can perform catastrophically on worst-case instances, they can perform very well with respect to a benign distribution over instances. Here we demonstrate that our policy not only performs optimally under the instance-dependent worst-case metric but also remains comparable even when evaluated under the average-case metric. To measure average performance, we construct prior distributions $\widehat{\mathcal{P}}$ based on the individual datasets:

  • For the 20 questions dataset, to draw a $\theta \sim \widehat{\mathcal{P}}$, we uniformly at random select a $j \in [1000]$ and set $\theta_i = 2\bar{p}_i^{(j)} - 1$ for all $i \in [d]$.

  • For the Jester joke recommendation dataset, to draw a $\theta \sim \widehat{\mathcal{P}}$, we uniformly sample a user and use their ratings of each joke.

On the 20 questions dataset, as shown in Table 1, SGBS and $\widehat{\pi}$ are the winners. Bayes-LAL performs much worse in this case, potentially because of the distribution shift from $\widetilde{\mathcal{P}}$ (the prior we train on) to $\widehat{\mathcal{P}}$ (the prior at test time). The strong performance of SGBS may be due to the fact that $\text{sign}(\theta_i) = 2z_*(\theta)_i - 1$ for all $i$ and $\theta \sim \widehat{\mathcal{P}}$, a realizability condition under which SGBS has strong guarantees (Nowak, 2011). On the Jester joke dataset, Table 2 shows that despite our policy not being trained for this setting, its performance is still among the top.

5 Related work

Learning to actively learn. Previous works vary in how they parameterize the policy, ranging from parameterized mixtures of existing expertly designed active learning algorithms (Baram et al., 2004; Hsu & Lin, 2015; Agarwal et al., 2016), to parameterizing hyperparameters (e.g., learning rate, rate of forced exploration, etc.) of an existing popular algorithm (e.g., EXP3) (Konyushkova et al., 2017; Bachman et al., 2017; Cella et al., 2020), to, most ambitiously, policies parameterized end-to-end as in this work (Boutilier et al., 2020; Kveton et al., 2020; Sharaf & Daumé III, 2019; Fang et al., 2017; Woodward & Finn, 2017). These works define a prior distribution either through past experience (meta-learning) or by expert construction (e.g., $\theta \sim \mathcal{N}(0, \Sigma)$), and then evaluate their policy with respect to this prior distribution. Defining this prior can be difficult, and moreover, if the $\theta$ encountered at test time does not follow this prior distribution, performance can suffer significantly. Our approach, on the other hand, takes an adversarial training approach and can be interpreted as learning a parameterized least favorable prior (Wasserman, 2013), yielding a much more robust policy as an end result.

Robust and Safe Reinforcement Learning. Our work is also highly related to the field of robust and safe reinforcement learning, where our objective can be considered an instance of the minimax criterion under parameter uncertainty (Garcıa & Fernández, 2015). Widely applied in areas such as robotics (Mordatch et al., 2015; Rajeswaran et al., 2016), these methods train a policy in a simulator like Mujoco (Todorov et al., 2012) to minimize a defined loss objective while remaining robust to uncertainties and perturbations of the environment. Ranges of these uncertainty parameters are chosen based on the potential values that could be encountered when deploying the robot in the real world. In our setting, however, defining the set of environments is far less straightforward, which we overcome by adopting the $\mathcal{C}(\theta)$ function.

Active Binary Classification Algorithms. The literature on active learning algorithms can be partitioned into model-based heuristics like uncertainty sampling, query by committee, or model-change sampling (Settles, 2009), greedy binary-search like algorithms that typically rely on a form of bounded noise for correctness (Dasgupta, 2005; 2006; Kääriäinen, 2006; Golovin & Krause, 2011; Nowak, 2011), and agnostic algorithms that make no assumptions on the probabilistic model (Balcan et al., 2006; Hanneke, 2007b; a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019; Katz-Samuels et al., 2020; 2021). Though the heuristics and greedy methods can perform very well for some problems, it is typically easy to construct counter-examples (e.g., outside the assumptions) in which they catastrophically fail as demonstrated in our experiments. The agnostic algorithms have strong robustness guarantees but rely on concentration inequalities, and consequently require at least hundreds of labels to observe any deviation from random sampling (see Huang et al. (2015) for comparison). Therefore, they were implicitly represented by uniform in our experiments.

Pure-exploration Multi-armed Bandit Algorithms. In the linear structure setting, for sets $\mathcal{X}, \mathcal{Z} \subset \mathbb{R}^d$ known to the player, pulling an "arm" $x \in \mathcal{X}$ results in an observation $\langle x, \theta_* \rangle$ plus zero-mean noise, and the objective is to identify $\operatorname{arg\,max}_{z \in \mathcal{Z}} \langle z, \theta_* \rangle$ for a vector $\theta_*$ unknown to the player (Soare et al., 2014; Karnin, 2016; Tao et al., 2018; Xu et al., 2017; Fiez et al., 2019). A special case of linear bandits is combinatorial bandits, where $\mathcal{X} = \{\mathbf{e}_i : i \in [d]\}$ and $\mathcal{Z} \subset \{0,1\}^d$ (Chen et al., 2014; Gabillon et al., 2016; Chen et al., 2017; Cao & Krishnamurthy, 2017; Fiez et al., 2019; Jain & Jamieson, 2019). Active binary classification is a special case of combinatorial pure-exploration multi-armed bandits (Jain & Jamieson, 2019), which we exploit in the threshold experiments. While the above works have made great theoretical advances in deriving algorithms and information theoretic lower bounds that match up to constants, the constants are so large that these algorithms only behave well when the number of measurements is very large. When applied to the instances of our paper (only 20 queries are made), these algorithms behave no differently than random sampling.

6 Discussion and Future Directions

We see this work as an exciting but preliminary step towards realizing the full potential of this general approach. Although our experiments have focused on applications of combinatorial bandits, we see this framework generalizing with minor changes to many more widely applicable settings such as multi-class active classification, contextual bandits, etc. To generalize $\mathcal{C}(\theta)$ to these settings, one can refer to the existing literature on instance-dependent lower bounds (Katz-Samuels et al., 2021; Agarwal et al., 2014). Alternatively, when such a lower bound does not exist, we conjecture that a heuristic scoring function could also serve as $\mathcal{C}(\theta)$. For example, in a chess game, one could simply use the scoring function of the pieces left on the board as a proxy for difficulty.

From a practical perspective, training $\widehat{\pi}$ can take many hours of computational resources even for these small instances. Scaling these methods to larger instances is an important next step. While training time scales linearly with the horizon length $T$, we note that one can take multiple samples per time step. With minimal computational overhead, this could enable training on problems that require larger sample complexities. In our implementation we hard-coded the decision rule for $\widehat{z}$ given $s_T$, but it could also be learned as in Luedtke et al. (2020). Likewise, the parameterization of the policy and generator worked well for our purposes but was chosen somewhat arbitrarily; are there more natural choices? Finally, while we focused on stochastic settings, this work naturally extends to constrained, fully adaptive adversarial sequences, which is an interesting direction for future work.

Funding disclosure

This work was supported in part by NSF IIS-1907907.

Acknowledgement

The authors would like to thank Aravind Rajeswaran for insightful discussions on Robust Reinforcement Learning. We would also like to thank Julian Katz-Samuels and Andrew Wagenmaker for helpful comments.

References

  • Agarwal et al. (2014) Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646. PMLR, 2014.
  • Agarwal et al. (2016) Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. arXiv preprint arXiv:1612.06246, 2016.
  • Aleksandrov et al. (1968) V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemenev. Stochastic optimization. Engineering Cybernetics, (5):11–+, 1968.
  • Bachman et al. (2017) Philip Bachman, Alessandro Sordoni, and Adam Trischler. Learning algorithms for active learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp.  301–310. JMLR. org, 2017.
  • Balcan et al. (2006) Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pp.  65–72, 2006.
  • Baram et al. (2004) Yoram Baram, Ran El Yaniv, and Kobi Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5(Mar):255–291, 2004.
  • Boutilier et al. (2020) Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, and Manzil Zaheer. Differentiable bandit exploration. arXiv preprint arXiv:2002.06772, 2020.
  • Cao & Krishnamurthy (2017) Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization. stat, 2017.
  • Carpentier & Locatelli (2016) Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp.  590–604, 2016.
  • Cella et al. (2020) Leonardo Cella, Alessandro Lazaric, and Massimiliano Pontil. Meta-learning with stochastic linear bandits. arXiv preprint arXiv:2005.08531, 2020.
  • Chen et al. (2017) Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp.  482–534, 2017.
  • Chen et al. (2014) Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 379–387, 2014.
  • Dasgupta (2005) Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in neural information processing systems, pp. 337–344, 2005.
  • Dasgupta (2006) Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in neural information processing systems, pp. 235–242, 2006.
  • Dasgupta et al. (2008) Sanjoy Dasgupta, Daniel J Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in neural information processing systems, pp. 353–360, 2008.
  • Degenne et al. (2020) Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp. 2432–2442. PMLR, 2020.
  • Fang et al. (2017) Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383, 2017.
  • Fiez et al. (2019) Tanner Fiez, Lalit Jain, Kevin G Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pp. 10666–10676, 2019.
  • Gabillon et al. (2016) Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner, and Peter Bartlett. Improved learning complexity in combinatorial pure exploration bandits. In Artificial Intelligence and Statistics, pp.  1004–1012, 2016.
  • Garcıa & Fernández (2015) Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • Garivier & Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp.  998–1027, 2016.
  • Goldberg et al. (2001) Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. information retrieval, 4(2):133–151, 2001.
  • Golovin & Krause (2011) Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
  • Hanneke (2007a) Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th international conference on Machine learning, pp.  353–360, 2007a.
  • Hanneke (2007b) Steve Hanneke. Teaching dimension and the complexity of active learning. In International Conference on Computational Learning Theory, pp.  66–81. Springer, 2007b.
  • Hanneke et al. (2014) Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
  • Hao et al. (2019) Botao Hao, Tor Lattimore, and Csaba Szepesvari. Adaptive exploration in linear contextual bandit. arXiv preprint arXiv:1910.06996, 2019.
  • Hsu & Lin (2015) Wei-Ning Hsu and Hsuan-Tien Lin. Active learning by learning. In Twenty-Ninth AAAI conference on artificial intelligence, 2015.
  • Hu et al. (2018) Huang Hu, Xianchao Wu, Bingfeng Luo, Chongyang Tao, Can Xu, Wei Wu, and Zhan Chen. Playing 20 question game with policy-based reinforcement learning. arXiv preprint arXiv:1808.07645, 2018.
  • Huang et al. (2015) Tzu-Kuo Huang, Alekh Agarwal, Daniel J Hsu, John Langford, and Robert E Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, pp. 2755–2763, 2015.
  • Jain & Jamieson (2019) Lalit Jain and Kevin G Jamieson. A new perspective on pool-based active classification and false-discovery control. In Advances in Neural Information Processing Systems, pp. 13992–14003, 2019.
  • Kääriäinen (2006) Matti Kääriäinen. Active learning in the non-realizable case. In International Conference on Algorithmic Learning Theory, pp.  63–77. Springer, 2006.
  • Karnin et al. (2013) Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238–1246, 2013.
  • Karnin (2016) Zohar S Karnin. Verification based solution for structured mab problems. In Advances in Neural Information Processing Systems, pp. 145–153, 2016.
  • Katz-Samuels et al. (2020) Julian Katz-Samuels, Lalit Jain, Zohar Karnin, and Kevin Jamieson. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. arXiv preprint arXiv:2006.11685, 2020.
  • Katz-Samuels et al. (2021) Julian Katz-Samuels, Jifan Zhang, Lalit Jain, and Kevin Jamieson. Improved algorithms for agnostic pool-based active classification. arXiv preprint arXiv:2105.06499, 2021.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Konyushkova et al. (2017) Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, pp. 4225–4235, 2017.
  • Kveton et al. (2020) Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, and Craig Boutilier. Differentiable meta-learning in contextual bandits. arXiv preprint arXiv:2006.05094, 2020.
  • Lattimore & Szepesvari (2016) Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491, 2016.
  • Li et al. (2018) Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934, 2018.
  • Li et al. (2017) Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
  • Luedtke et al. (2020) Alex Luedtke, Marco Carone, Noah Simon, and Oleg Sofrygin. Learning to learn from data: Using deep adversarial learning to construct optimal statistical procedures. Science Advances, 6(9), 2020. doi: 10.1126/sciadv.aaw2140. URL https://advances.sciencemag.org/content/6/9/eaaw2140.
  • Mannor & Tsitsiklis (2004) Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
  • Mordatch et al. (2015) Igor Mordatch, Kendall Lowrey, and Emanuel Todorov. Ensemble-cio: Full-body dynamic motion planning that transfers to physical humanoids. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp.  5307–5314. IEEE, 2015.
  • Nowak (2011) Robert D Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
  • Ok et al. (2018) Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8874–8882, 2018.
  • Rajeswaran et al. (2016) Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
  • Settles (2009) Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
  • Sharaf & Daumé III (2019) Amr Sharaf and Hal Daumé III. Meta-learning for contextual bandit exploration. arXiv preprint arXiv:1901.08159, 2019.
  • Simchowitz & Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Advances in Neural Information Processing Systems, pp. 1151–1160, 2019.
  • Simchowitz et al. (2017) Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. arXiv preprint arXiv:1702.05186, 2017.
  • Soare et al. (2014) Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pp. 828–836, 2014.
  • Tao et al. (2018) Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pp. 4877–4886, 2018.
  • Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.  5026–5033. IEEE, 2012.
  • Tsybakov (2008) Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
  • Van Parys & Golrezaei (2020) Bart Van Parys and Negin Golrezaei. Optimal learning for structured bandits. 2020.
  • Wasserman (2013) Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
  • Woodward & Finn (2017) Mark Woodward and Chelsea Finn. Active one-shot learning. arXiv preprint arXiv:1702.06559, 2017.
  • Xu et al. (2017) Liyuan Xu, Junya Honda, and Masashi Sugiyama. Fully adaptive algorithm for pure exploration in linear bandits. arXiv preprint arXiv:1710.05552, 2017.

Appendix A Instance Dependent Sample Complexity

Identifying forms of $\mathcal{C}(\theta)$ is not as difficult a task as one might think, due to the proliferation of tools for proving lower bounds for active learning (Mannor & Tsitsiklis, 2004; Tsybakov, 2008; Garivier & Kaufmann, 2016; Carpentier & Locatelli, 2016; Simchowitz et al., 2017; Chen et al., 2014). One can directly extract values of $\mathcal{C}(\theta)$ from the literature on regret minimization for linear or other structured bandits (Lattimore & Szepesvari, 2016; Van Parys & Golrezaei, 2020), contextual bandits (Hao et al., 2019), and tabular as well as structured MDPs (Simchowitz & Jamieson, 2019; Ok et al., 2018). Moreover, we believe that even reasonable surrogates of $\mathcal{C}(\theta)$ should result in a high quality policy $\pi_*$.

We review some canonical examples:

  • Multi-armed bandits. In the best-arm identification problem, there are $d \in \mathbb{N}$ Gaussian distributions where the $i$th distribution has mean $\theta_i \in \mathbb{R}$ for $i = 1, \dots, d$. In the above formulation, this problem is encoded as follows: action $x_t = i_t$ results in observation $y_t \sim \text{Bernoulli}(\theta_{i_t})$, and the loss is $\ell(\pi, \theta) := \mathbb{E}_{\pi,\theta}[\mathbf{1}\{\widehat{i} \neq i_*(\theta)\}]$ where $\widehat{i}$ is $\pi$'s recommended index and $i_*(\theta) = \operatorname{arg\,max}_i \theta_i$. It has been shown that there exists a constant $c_0 > 0$ such that for any sufficiently large $\nu > 0$ we have

    minπmaxθ:𝒞MAB(θ)ν(π,θ)exp(c0T/ν)\displaystyle\min_{\pi}\max_{\theta:\mathcal{C}_{MAB}(\theta)\leq\nu}\ell(\pi,\theta)\geq\exp(-c_{0}T/\nu)
    where 𝒞MAB(θ):=ii(θ)(θi(θ)θi)2\displaystyle\text{ where }\quad\mathcal{C}_{MAB}(\theta):=\sum_{i\neq i_{\ast}(\theta)}(\theta_{i_{\ast}(\theta)}-\theta_{i})^{-2}

    Moreover, for any θd\theta\in\mathbb{R}^{d} there exists a policy π~\widetilde{\pi} that achieves (π~,θ)c1exp(c2T/𝒞MAB(θ))\ell(\widetilde{\pi},\theta)\leq c_{1}\exp(-c_{2}T/\mathcal{C}_{MAB}(\theta)) where c1,c2c_{1},c_{2} capture constant and low-order terms (Carpentier & Locatelli, 2016; Karnin et al., 2013; Simchowitz et al., 2017; Garivier & Kaufmann, 2016).

The above correspondence between the lower bound and the upper bound suggests that $\mathcal{C}_{MAB}(\theta)$ plays a critical role in determining the difficulty of identifying $i_{\ast}(\theta)$ for any $\theta$. This exercise extends to more structured settings as well:

  • Content recommendation / active search. Consider $n$ items (e.g., movies, proteins) where the $i$th item is represented by a feature vector $x_{i}\in\mathcal{X}\subset\mathbb{R}^{d}$, and a measurement of $x_{t}=x_{i}$ (e.g., preference rating, binding affinity to a target) is modeled by a linear response model such that $y_{t}\sim\mathcal{N}(\langle x_{i},\theta\rangle,1)$ for some unknown $\theta\in\mathbb{R}^{d}$. If $\ell(\pi,\theta):=\mathbb{E}_{\pi,\theta}[\bm{1}\{\widehat{i}\neq i_{\ast}(\theta)\}]$ as above, then nearly identical results to those above hold for an analogous function of $\mathcal{C}_{MAB}(\theta)$ (Soare et al., 2014; Karnin, 2016; Fiez et al., 2019).

  • Active binary classification. For $i=1,\dots,d$ let $\phi_{i}\in\mathbb{R}^{p}$ be a feature vector of an unlabeled item (e.g., an image) that can be queried for its binary label $y_{i}\in\{-1,1\}$, where $y_{i}\sim\text{Bernoulli}(\theta_{i})$ for some $\theta\in\mathbb{R}^{d}$. Let $\mathcal{H}$ be an arbitrary set of classifiers (e.g., neural nets, random forests, etc.) such that each $h\in\mathcal{H}$ assigns a label in $\{-1,1\}$ to each of the items $\{\phi_{i}\}_{i=1}^{d}$ in the pool. If items are chosen sequentially to observe their labels, the objective is to identify the true risk minimizer $h_{\ast}(\theta)=\arg\min_{h\in\mathcal{H}}\sum_{i=1}^{d}\mathbb{E}_{\theta}[\bm{1}\{h(\phi_{i})\neq y_{i}\}]$ using as few requested labels as possible, and $\ell(\pi,\theta):=\mathbb{E}_{\pi,\theta}[\bm{1}\{\widehat{h}\neq h_{\ast}(\theta)\}]$ where $\widehat{h}\in\mathcal{H}$ is $\pi$'s recommended classifier. Many candidates for $\mathcal{C}(\theta)$ have been proposed in the agnostic active learning literature (Balcan et al., 2006; Hanneke, 2007b; a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019), but we believe the most granular candidates come from the combinatorial bandit literature (Chen et al., 2017; Fiez et al., 2019; Cao & Krishnamurthy, 2017; Jain & Jamieson, 2019). To make the reduction, for each $h\in\mathcal{H}$ assign a $z^{(h)}\in\{0,1\}^{d}$ such that $[z^{(h)}]_{i}:=\bm{1}\{h(\phi_{i})=1\}$ for all $i=1,\dots,d$, and set $\mathcal{Z}=\{z^{(h)}:h\in\mathcal{H}\}$. It is easy to check that $z_{\ast}(\theta):=\arg\max_{z\in\mathcal{Z}}\langle z,\theta\rangle$ satisfies $z_{\ast}(\theta)=z^{(h_{\ast}(\theta))}$. Thus, requesting the label of example $i$ is equivalent to sampling from $\text{Bernoulli}(\langle\mathbf{e}_{i},\theta\rangle)\in\{-1,1\}$, completing the reduction to combinatorial bandits: $\mathcal{X}=\{\mathbf{e}_{i}:i\in[d]\}$, $\mathcal{Z}\subset\{0,1\}^{d}$. We then apply the exact same $\mathcal{C}(\theta)$ as above for linear bandits (a small code sketch of $\mathcal{C}_{MAB}$ and this reduction follows this list).
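As referenced above, the following is a minimal sketch (with illustrative function names, not the paper's code) of computing $\mathcal{C}_{MAB}(\theta)$ and of mapping a pool of classifiers to the binary vectors $z^{(h)}$ used in the combinatorial-bandit reduction:

```python
import numpy as np

def c_mab(theta):
    """C_MAB(theta): sum over suboptimal arms of the inverse squared gap
    to the best arm, as in the lower bound above."""
    theta = np.asarray(theta, dtype=float)
    i_star = int(np.argmax(theta))
    gaps = theta[i_star] - np.delete(theta, i_star)
    return float(np.sum(gaps ** -2.0))

def hypotheses_to_arms(classifiers, pool):
    """Reduction from active classification to combinatorial bandits:
    each classifier h becomes a binary vector z with [z]_i = 1{h(phi_i) = 1}."""
    return np.array([[1 if h(phi) == 1 else 0 for phi in pool]
                     for h in classifiers])

# A well-separated instance has small complexity (it is an "easy" instance):
print(c_mab([0.9, 0.1, 0.1]))  # 2 * 0.8**-2 = 3.125
```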

Appendix B Gradient Based Optimization Algorithm Implementation

First, we restate the algorithm with explicit gradient estimator formulas derived in Appendix C.

Algorithm 3 Gradient-based optimization of equation 8 (Algorithm 2) with explicit gradient estimators.
  Input: partition $\Omega$, number of iterations $N_{\text{it}}$, number of problem samples $M$, number of rollouts per problem $L$, and loss variable $L_{T}$ at horizon $T$ (see the beginning of Section 2).
  Goal: compute the optimal policy $\arg\min_{\pi}\max_{\theta\in\Omega}\ell^{\prime}(\pi,\theta)=\arg\min_{\pi}\max_{\theta\in\Omega}\mathbb{E}_{\pi,\theta}[L_{T}]$.
  Initialization: $w$, finite set $\widetilde{\Theta}$, and $\psi$.
  for $t=1,\dots,N_{\text{it}}$ do
     Collect rollouts of play:
     for $m=1,\dots,M$ do
        Sample problem index $I_{m}\overset{\text{i.i.d.}}{\sim}\text{SOFTMAX}(w)$.
        Collect $L$ independent rollout trajectories, denoted $\tau_{m,1:L}$, by running the policy $\pi^{\psi}$ on problem instance $\widetilde{\theta}_{I_{m}}$, and observe the losses $L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})$ for all $1\leq l\leq L$.
     end for
     Optimize worst cases in $\Omega$: update the generating distribution by taking ascending steps on the gradient estimates

        $w\leftarrow w+\frac{1}{ML}\sum_{m=1}^{M}\nabla_{w}\log(\text{SOFTMAX}(w)_{I_{m}})\cdot\Big(\sum_{l=1}^{L}L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})\Big)$  (9)

        $\widetilde{\Theta}\leftarrow\widetilde{\Theta}+\frac{1}{ML}\sum_{m=1}^{M}\sum_{l=1}^{L}\Big(\nabla_{\widetilde{\Theta}}\mathcal{L}_{\text{barrier}}(\widetilde{\theta}_{I_{m}},\Omega)+\nabla_{\widetilde{\Theta}}L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})+L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})\cdot\nabla_{\widetilde{\Theta}}\log(\mathbb{P}_{\pi^{\psi},\widetilde{\theta}_{I_{m}}}(\tau_{m,l}))\Big)$  (10)

     where $\mathcal{L}_{\text{barrier}}$ is a differentiable barrier loss that heavily penalizes the $\widetilde{\theta}_{I_{m}}$'s outside $\Omega$.
     Optimize policy: update the policy by taking a descending step on the gradient estimate

        $\psi\leftarrow\psi-\frac{1}{ML}\sum_{m=1}^{M}\sum_{l=1}^{L}L_{T}(\pi^{\psi},\tau_{m,l},\widetilde{\theta}_{I_{m}})\cdot\nabla_{\psi}\log(\mathbb{P}_{\pi^{\psi},\widetilde{\theta}_{I_{m}}}(\tau_{m,l}))$.  (11)
  end for

In the above algorithm, the gradient estimates are unbiased estimates of the true gradients with respect to $\psi$, $w$, and $\widetilde{\Theta}$ (shown in Appendix C). We choose $N$ large enough to avoid mode collapse, and $M,L$ as large as possible to reduce the variance of the gradient estimates while fitting within the memory constraint. We then choose a sufficiently large number of optimization iterations so that the variance of the gradient estimates is further reduced by averaging over time. We use the Adam optimizer (Kingma & Ba, 2014) for all gradient updates.

Note the decomposition of $\log(\mathbb{P}_{\pi^{\psi},\theta^{\prime}}(\tau))$ in equation 10 and equation 11, where the rollout is $\tau=\{(x_{t},y_{t})\}_{t=1}^{T}$ and

    $\log(\mathbb{P}_{\pi^{\psi},\theta^{\prime}}(\{(x_{t},y_{t})\}_{t=1}^{T}))=\log\Big(\pi^{\psi}(x_{1})\cdot f(y_{1}|\theta^{\prime},s_{1})\cdot\textstyle\prod_{t=2}^{T}\pi^{\psi}(s_{t},x_{t})\cdot f(y_{t}|\theta^{\prime},s_{t},x_{t})\Big).$

Here $\pi^{\psi}$ and $f$ depend only on $\psi$ and $\widetilde{\Theta}$, respectively. When evaluating a fixed policy $\pi$, we are interested in solving $\max_{\theta\in\Omega}\ell^{\prime}(\pi,\theta)$ by gradient ascent updates like equation 10. The decoupling of $\pi^{\psi}$ and $f$ thus enables us to optimize the objective without differentiating through the policy $\pi$, which may be non-differentiable (e.g., a deterministic algorithm).
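As a concrete illustration of this decoupling, the sketch below (an illustrative example under our own naming assumptions, not the paper's code) computes the two terms of $\log\mathbb{P}_{\pi^{\psi},\theta^{\prime}}(\tau)$ separately for the Bernoulli observation model of Appendix A; gradients with respect to $\widetilde{\Theta}$ only ever flow through the second term.

```python
import torch

def log_prob_decomposition(policy, theta, states, actions, ys):
    """Split log P_{pi,theta}(tau) into a policy term (depends only on psi)
    and an observation term (depends only on theta)."""
    # sum_t log pi^psi(x_t | s_t)
    lp_policy = torch.stack([torch.log(policy(s)[a])
                             for s, a in zip(states, actions)]).sum()
    # sum_t log f(y_t | theta, s_t, x_t) for y_t ~ Bernoulli(theta_{x_t})
    p = theta[torch.tensor(actions)]
    y = torch.tensor(ys, dtype=theta.dtype)
    lp_obs = (y * torch.log(p) + (1 - y) * torch.log(1 - p)).sum()
    return lp_policy, lp_obs
```

Backpropagating through `lp_obs` alone would then give the $\nabla_{\widetilde{\Theta}}\log\mathbb{P}$ term of equation 10 without requiring $\pi^{\psi}$ to be differentiable.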

B.1 Implementation Details

Training. When training our policies for Best Identification, we warm-start training by optimizing Simple Regret. This is because a randomly initialized policy performs so poorly that the Best Identification loss is nearly always 1, making it difficult to improve the policy. After training $\pi_{1:K}$ in MAPO (Algorithm 1), we warm-start the training of $\widehat{\pi}$ with $\widehat{\pi}=\pi_{\lfloor K/2\rfloor}$. In addition, our generating distribution parameterization exactly follows Section 3.1.

Loss functions. Instead of optimizing the approximated quantity from equation 8 directly, we add regularizers to the losses for both the policy and the generator. First, we choose $\mathcal{L}_{\text{barrier}}$ in equation 10 to be $\lambda_{\text{barrier}}\cdot\max\{0,\log(\mathcal{C}(\mathcal{X},\mathcal{Z},\theta))-\log(r_{k})\}$ for some large constant $\lambda_{\text{barrier}}$. To discourage the policy from over-committing to a particular action, and the generating distribution from covering only a small subset of particles (i.e., mode collapse), we also add negative entropy penalties to both the policy's output distributions and $\text{SOFTMAX}(w)$, with scaling factors $\lambda_{\text{Pol-reg}}$ and $\lambda_{\text{Gen-reg}}$.
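For concreteness, a minimal sketch of these two regularizers, assuming the complexity $\mathcal{C}(\mathcal{X},\mathcal{Z},\theta)$ is available as a differentiable scalar tensor:

```python
import torch

def barrier_loss(complexity, r_k, lam_barrier=1e3):
    """lambda_barrier * max{0, log C(X, Z, theta) - log r_k}: pushes a
    particle back toward the difficulty set when its complexity exceeds r_k."""
    return lam_barrier * torch.clamp(
        torch.log(complexity) - torch.log(torch.tensor(float(r_k))), min=0.0)

def negative_entropy(probs, eps=1e-12):
    """Negative entropy of a probability vector; added with scale
    lambda_Pol-reg or lambda_Gen-reg to discourage over-commitment."""
    return (probs * torch.log(probs + eps)).sum()
```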

State representation. We parameterize our state space $\mathcal{S}$ as a flattened $|\mathcal{X}|\times 3$ matrix where each row represents a distinct $x\in\mathcal{X}$. Specifically, at time $t$ the row of $s_{t}$ corresponding to some $x\in\mathcal{X}$ records the number of times that action $x$ has been taken, $\sum_{s=1}^{t-1}\mathbf{1}\{x_{s}=x\}$, its inverse $(\sum_{s=1}^{t-1}\mathbf{1}\{x_{s}=x\})^{-1}$, and the sum of the corresponding observations $\sum_{s=1}^{t-1}\mathbf{1}\{x_{s}=x\}y_{s}$.
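A minimal sketch of building this state is given below; using 0 for the inverse count of an action that has not yet been played is our own convention, since the paper does not specify one.

```python
import numpy as np

def build_state(num_actions, actions, ys):
    """Flattened |X| x 3 state: per action x, the pull count, its inverse,
    and the sum of observations received when playing x."""
    state = np.zeros((num_actions, 3))
    for x, y in zip(actions, ys):
        state[x, 0] += 1.0     # count of pulls of action x
        state[x, 2] += y       # running sum of observations for x
    counts = state[:, 0]
    state[:, 1] = np.divide(1.0, counts, out=np.zeros(num_actions), where=counts > 0)
    return state.flatten()     # length 3|X| input to the policy MLP
```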

Policy MLP architecture. Our policy $\pi^{\psi}$ is a multi-layer perceptron with weights $\psi$. The policy takes a state of size $3|\mathcal{X}|$ as input and outputs a vector of size $|\mathcal{X}|$, which is then pushed through a softmax to create a probability distribution over $\mathcal{X}$. At the end of the game, regardless of the policy's weights, we set $\widehat{z}=\operatorname*{arg\,max}_{z\in\mathcal{Z}}\langle z,\widehat{\theta}\rangle$ where $\widehat{\theta}$ is the minimum $\ell_{2}$-norm solution to $\operatorname*{arg\,min}_{\theta}\sum_{s=1}^{T}(y_{s}-\langle x_{s},\theta\rangle)^{2}$.

Our policy network is a simple 6-layer MLP with layer sizes $\{3|\mathcal{X}|,256,256,256,256,|\mathcal{X}|\}$, where $3|\mathcal{X}|$ is the size of the input layer and $|\mathcal{X}|$ is the size of the output layer, which is pushed through a softmax function to create a probability distribution over arms. All intermediate layers use leaky ReLU activations with a negative slope of $.01$. The 1D thresholds and 20 Questions experiments share this network structure, with $|\mathcal{X}|=25$ and $|\mathcal{X}|=100$ respectively.
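A sketch of this architecture in PyTorch; the layer sizes, leaky-ReLU slope, softmax output, and Xavier-normal initialization scaled by $.01$ follow the description in this appendix, while anything else (names, batching conventions) is an assumption.

```python
import torch.nn as nn

class PolicyMLP(nn.Module):
    """MLP policy with layer sizes {3|X|, 256, 256, 256, 256, |X|}."""
    def __init__(self, num_actions):
        super().__init__()
        dims = [3 * num_actions, 256, 256, 256, 256, num_actions]
        layers = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            layers += [nn.Linear(d_in, d_out), nn.LeakyReLU(negative_slope=0.01)]
        layers[-1] = nn.Softmax(dim=-1)   # final activation: distribution over X
        self.net = nn.Sequential(*layers)
        for m in self.net:
            if isinstance(m, nn.Linear):  # Xavier-normal init, scaled by .01
                nn.init.xavier_normal_(m.weight)
                m.weight.data.mul_(0.01)
                nn.init.zeros_(m.bias)

    def forward(self, state):             # state: tensor of shape (..., 3|X|)
        return self.net(state)
```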

Appendix C Gradient Estimate Derivation

Here we derive the unbiased gradient estimates in equation 9, equation 10, and equation 11 of Algorithm 2. Since each of the gradient estimates above averages over $M\cdot L$ identically distributed trajectories, it suffices to show that the gradient estimate is unbiased for a single problem $\widetilde{\theta}_{i}$ and its rollout trajectory $\{(x_{t},y_{t})\}_{t=1}^{T}$.

For a feasible $w$, using the score-function identity (Aleksandrov et al., 1968),

    $\nabla_{w}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]=\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\cdot\nabla_{w}\log(\text{SOFTMAX}(w)_{i})\right].$

Observe that if $i\sim\text{SOFTMAX}(w)$ and $\{(x_{t},y_{t})\}_{t=1}^{T}$ is the result of rolling out the policy $\pi^{\psi}$ on $\widetilde{\theta}_{i}$, then

    $g^{w}(i,\{(x_{t},y_{t})\}_{t=1}^{T}):=L_{T}(\pi^{\psi},\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})\cdot\nabla_{w}\log(\text{SOFTMAX}(w)_{i})$

is an unbiased estimate of $\nabla_{w}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]$.

For a feasible set $\widetilde{\Theta}$, by definition of $\ell(\pi,\theta)$,

    $\nabla_{\widetilde{\Theta}}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]=\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\nabla_{\widetilde{\Theta}}\mathbb{E}_{\pi,\widetilde{\theta}_{i}}\left[L_{T}(\pi,\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})\right]\right]$
    $=\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\mathbb{E}_{\pi,\widetilde{\theta}_{i}}\left[\nabla_{\widetilde{\Theta}}L_{T}(\pi,\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})+L_{T}(\pi,\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})\cdot\nabla_{\widetilde{\Theta}}\log(\mathbb{P}_{\pi^{\psi},\widetilde{\theta}_{i}}(\{(x_{t},y_{t})\}_{t=1}^{T}))\right]\right]$  (12)

where the last equality follows from the chain rule and the score-function identity (Aleksandrov et al., 1968). The quantity inside the expectations, call it $g^{\widetilde{\Theta}}(i,\{(x_{t},y_{t})\}_{t=1}^{T})$, is therefore an unbiased estimator of $\nabla_{\widetilde{\Theta}}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]$, given that $i$ and $\{(x_{t},y_{t})\}_{t=1}^{T}$ are rolled out accordingly. Note that if $\mathcal{L}_{\text{barrier}}\neq 0$, then $\nabla_{\widetilde{\Theta}}\mathcal{L}_{\text{barrier}}(\widetilde{\theta}_{i},\Omega)$ is clearly an unbiased gradient estimator of $\nabla_{\widetilde{\Theta}}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}[\mathbb{E}_{\pi,\widetilde{\theta}_{i}}[\mathcal{L}_{\text{barrier}}(\widetilde{\theta}_{i},\Omega)]]$, given that $i$ and the rollout are sampled accordingly.

Likewise, for the policy,

    $g^{\psi}(i,\{(x_{t},y_{t})\}_{t=1}^{T}):=L_{T}(\pi^{\psi},\{(x_{t},y_{t})\}_{t=1}^{T},\widetilde{\theta}_{i})\cdot\nabla_{\psi}\log(\mathbb{P}_{\pi^{\psi},\widetilde{\theta}_{i}}(\{(x_{t},y_{t})\}_{t=1}^{T}))$

is an unbiased estimate of $\nabla_{\psi}\mathbb{E}_{i\sim\text{SOFTMAX}(w)}\left[\ell(\pi^{\psi},\widetilde{\theta}_{i})\right]$.

Appendix D Hyper-parameters

In this section, we list our hyperparameters. First, we define $\lambda_{\text{binary}}$ to be a coefficient multiplied into binary losses, so instead of $\bm{1}\{z_{\ast}(\theta_{*})\neq\widehat{z}\}$ we receive loss $\lambda_{\text{binary}}\cdot\bm{1}\{z_{\ast}(\theta_{*})\neq\widehat{z}\}$. We choose $\lambda_{\text{binary}}$ so that the received rewards are approximately on the same scale as Simple Regret. All optimizers in our experiments are Adam, and all budget sizes are $T=20$. For fairness of evaluation, within each experiment (1D thresholds or 20 Questions), all parameters below are shared when evaluating all of the policies. To elaborate on the training strategy proposed in MAPO (Algorithm 1), we divide training into four procedures, as indicated in Table 3:

  • Init. The initialization procedure takes a rather small portion of iterations, primarily to optimize $\mathcal{L}_{\text{barrier}}$ so that the particles converge into the constrained difficulty sets. During initialization we also initialize and freeze $w=\vec{0}$, putting a uniform distribution over the particles. This allows us to utilize the entire set of particles without $w$ converging to only a few particles early on. To initialize $\widetilde{\Theta}$, we sample $2/3$ of the $N$ particles uniformly from $[-1,1]^{|\mathcal{X}|}$, and the remaining $1/3$ by sampling, for each $i\in[|\mathcal{Z}|]$, $\frac{N}{3|\mathcal{Z}|}$ particles uniformly from $\{\theta:\operatorname*{arg\,max}_{j}\langle\theta,z_{j}\rangle=i\}$. We initialize the policy weights by Xavier initialization with weights sampled from a normal distribution and scaled by $.01$.

  • Regret training ($\widetilde{\pi}_{i}$). Training with the Simple Regret objective usually takes the longest among the procedures. Its primary purpose is to let the policy converge to a reasonable warm start that already captures some essence of the task.

  • Fine-tune $\pi_{i}$. Training with the Best Identification objective is run multiple times, once for each $\pi_{i}$ with its corresponding complexity set $\Theta_{i}$. During each run, we start with a warm-started policy and reinitialize the rest of the models by running the initialization procedure, followed by optimizing the Best Identification objective.

  • Fine-tune $\widehat{\pi}$. This procedure optimizes equation 3, with baselines $\min_{k}\ell(\pi_{k},\Theta^{(r_{k})})$ evaluated using each $\pi_{i}$ learned in the previous procedure. Similar to fine-tuning each individual $\pi_{i}$, we warm-start the policy from $\pi_{\lfloor K/2\rfloor}$ and reinitialize $w$ and $\Theta$ by running the initialization procedure again.

| Procedure | Hyper-parameter | 1D Threshold ($\lvert\mathcal{X}\rvert=25$) | 20 Questions ($\lvert\mathcal{X}\rvert=100$) | Jester Joke ($\lvert\mathcal{X}\rvert=100$) |
|---|---|---|---|---|
| Init | $N_{\text{it}}$ | 20000 | 20000 | 20000 |
| | $\psi$ learning rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| | $\widetilde{\Theta}$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| | $w$ learning rate | 0 | 0 | 0 |
| Regret Training | $N_{\text{it}}$ | 480000 | 480000 | 480000 |
| | $\psi$ learning rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| | $\widetilde{\Theta}$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| | $w$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| Fine-tune | $N_{\text{it}}$ for $\widetilde{\pi}_{i}$ | 200000 | 0 | 200000 |
| | $N_{\text{it}}$ for $\pi_{i}$ | 200000 | 1500000 | N/A |
| | $N_{\text{it}}$ for $\pi_{*}$ | 500000 | 250000 | 500000 |
| | $\psi$ learning rate | $10^{-4}$ | $10^{-4}$ | $10^{-4}$ |
| | $\widetilde{\Theta}$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| | $w$ learning rate | $10^{-3}$ | $10^{-3}$ | $10^{-3}$ |
| Adam Optimizer | $\beta_{1}$ | .9 | .9 | .9 |
| | $\beta_{2}$ | .999 | .999 | .999 |

Table 3: Number of iterations and learning rates.
| Procedure | Hyper-parameter | 1D Threshold ($\lvert\mathcal{X}\rvert=25$) | 20 Questions ($\lvert\mathcal{X}\rvert=100$) | Jester Joke ($\lvert\mathcal{X}\rvert=100$) |
|---|---|---|---|---|
| Init + Train + Fine-tune | $N$ | $1000\times\lvert\mathcal{Z}\rvert$ | $300\times\lvert\mathcal{Z}\rvert$ | $2000\times\lvert\mathcal{Z}\rvert$ |
| | $M$ | 1000 | 500 | 500 |
| | $L$ | 10 | 30 | 30 |
| | $\lambda_{\text{binary}}$ | 7.5 | 30 | 30 |
| | $\lambda_{\text{Pol-reg}}$ (regret) | .2 | .8 | .8 |
| | $\lambda_{\text{Pol-reg}}$ (fine-tune) | .3 | .8 | .8 |
| | $\lambda_{\text{Gen-reg}}$ | .05 | .1 | .05 |
| | $\lambda_{\text{barrier}}$ | $10^{3}$ | $10^{3}$ | $10^{3}$ |

Table 4: Parallel sizes and regularization coefficients.

To provide a general strategy for choosing hyper-parameters: first, $L$, $\lambda_{\text{binary}}$, and $\lambda_{\text{Pol-reg}}$ are primarily tuned against $|\mathcal{X}|$, since the noisiness and scale of the gradients, as well as the entropy over the arms $\mathcal{X}$, grow with $|\mathcal{X}|$. Second, $\lambda_{\text{Gen-reg}}$ is primarily tuned against $|\mathcal{Z}|$, as it penalizes the entropy over the $N$ particles, which is a multiple of $|\mathcal{Z}|$. Third, the learning rate for $\theta$ is primarily tuned for convergence of the particles into the restricted class; $\mathcal{L}_{\text{barrier}}$ reaching 0 within the specified number of initialization iterations is a good indicator. Finally, we choose $N$ and $M$ according to the memory constraint of our GPU. The hyper-parameters for each experiment were tuned with fewer than 20 hyper-parameter assignments; metrics to monitor while tuning include, but are not limited to, the gradient magnitude of each component, the convergence of each loss, and the entropy losses for each regularization term (how close each is to the entropy of a uniform distribution).

Appendix E Policy Evaluation

When evaluating a policy, we are essentially solving the following objective for a fixed policy $\pi$:

    $\max_{\theta\in\Omega}\ell(\pi,\theta)$

where $\Omega$ is a set of problems. However, due to the non-concavity of this loss function, gradient ascent initialized randomly may converge to a local maximum. To reduce this possibility, we randomly initialize many iterates and take gradient steps in a round-robin fashion, eliminating poorly performing trajectories. To do this with a fixed amount of computational resources, we apply the successive halving algorithm of Li et al. (2018). Specifically, we choose hyperparameters $\eta=4$, $r=100$, $R=1600$, and $s=0$. This translates to:

  • Initialize $|\widetilde{\Theta}|=1600$ and optimize each $\widetilde{\theta}_{i}\in\widetilde{\Theta}$ for 100 iterations.

  • Take the top 400 of them and optimize for another 400 iterations.

  • Take the top 100 of the remaining 400 and optimize for an additional 1600 iterations.

We take gradient steps with the Adam optimizer (Kingma & Ba, 2014) with a learning rate of $10^{-3}$, $\beta_{1}=.9$, and $\beta_{2}=.999$.
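A minimal sketch of this evaluation loop follows; here `ascend` and `score` are assumed callbacks that run Adam ascent steps on a candidate $\theta$ and return its current loss value $\ell(\pi,\theta)$, respectively.

```python
def successive_halving(thetas, ascend, score, eta=4, r=100, rungs=3):
    """Give each candidate `budget` ascent iterations, keep the top 1/eta
    fraction, multiply the budget by eta, and repeat; return the best survivor."""
    budget = r
    for rung in range(rungs):
        thetas = [ascend(theta, budget) for theta in thetas]
        if rung < rungs - 1:
            keep = max(1, len(thetas) // eta)
            thetas = sorted(thetas, key=score, reverse=True)[:keep]
            budget *= eta
    return max(thetas, key=score)

# With 1600 random initializations this reproduces the schedule above:
# 1600 x 100 iterations -> top 400 x 400 iterations -> top 100 x 1600 iterations.
```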

Appendix F Figures at Full Scale

Figure 8: Full scale of Figure 2.
Figure 9: Full scale of Figure 3.
Figure 10: Full scale of Figure 4.
Figure 11: Full scale of Figure 5.
Figure 12: Full scale of Figure 6.
Figure 13: Full scale of Figure 7.

Appendix G Uncertainty Sampling

We define the symmetric difference of a set of binary vectors, $\text{SymDiff}(\{z_{1},...,z_{n}\})=\{i:\exists j,k\in[n]\ \text{s.t.}\ z_{j}^{(i)}=1\land z_{k}^{(i)}=0\}$, as the set of coordinates on which the vectors disagree.

  Input: $\mathcal{X},\mathcal{Z}$
  for $t=1,\dots,T$ do
     $\widehat{\theta}_{t-1}=\operatorname*{arg\,min}_{\theta}\sum_{s=1}^{t-1}(y_{s}-\langle x_{s},\theta\rangle)^{2}$
     $\widehat{\mathcal{Z}}=\{z\in\mathcal{Z}:\max_{z^{\prime}\in\mathcal{Z}}\langle z^{\prime},\widehat{\theta}_{t-1}\rangle=\langle z,\widehat{\theta}_{t-1}\rangle\}$
     if $|\widehat{\mathcal{Z}}|=1$ then
        $\widehat{\mathcal{Z}}_{t}=\widehat{\mathcal{Z}}\cup\{z\in\mathcal{Z}:\max_{z^{\prime}\in\mathcal{Z}\backslash\widehat{\mathcal{Z}}}\langle z^{\prime},\widehat{\theta}_{t-1}\rangle=\langle z,\widehat{\theta}_{t-1}\rangle\}$
     else
        $\widehat{\mathcal{Z}}_{t}=\widehat{\mathcal{Z}}$
     end if
     Uniformly sample $I_{t}$ from $\text{SymDiff}(\widehat{\mathcal{Z}}_{t})$
     Pull $x_{I_{t}}$ and observe $y_{t}$
  end for
Algorithm 4: Uncertainty sampling in the very-small-budget setting
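A sketch of Algorithm 4 in Python, specialized to $\mathcal{X}=\{\mathbf{e}_{1},\dots,\mathbf{e}_{d}\}$ so that the least-squares estimate reduces to per-coordinate empirical means; `observe(i)` is an assumed callback that returns the noisy observation for coordinate $i$.

```python
import numpy as np

def sym_diff(Z_hat):
    """Coordinates on which the candidate z's disagree."""
    Z_hat = np.asarray(Z_hat)
    return np.where(Z_hat.min(axis=0) != Z_hat.max(axis=0))[0]

def uncertainty_sampling(Z, d, T, observe, seed=0):
    rng = np.random.default_rng(seed)
    sums, counts = np.zeros(d), np.zeros(d)
    queries = []
    for t in range(T):
        # least-squares estimate from the data so far (unqueried coords -> 0)
        theta_hat = np.divide(sums, counts, out=np.zeros(d), where=counts > 0)
        vals = Z @ theta_hat
        best = np.flatnonzero(vals == vals.max())
        if len(best) == 1:                       # add the runner-up arms
            rest = np.setdiff1d(np.arange(len(Z)), best)
            runner = rest[Z[rest] @ theta_hat == (Z[rest] @ theta_hat).max()]
            Z_hat = Z[np.concatenate([best, runner])]
        else:
            Z_hat = Z[best]
        i_t = int(rng.choice(sym_diff(Z_hat)))   # uniform over disagreements
        y_t = observe(i_t)
        sums[i_t] += y_t
        counts[i_t] += 1
        queries.append((i_t, y_t))
    return queries
```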

Appendix H Learning to Actively Learn Algorithm

To train a policy under the learning to actively learn setting, we aim to solve for the objective

    $\min_{\psi}\mathbb{E}_{\theta\sim\widehat{\mathcal{P}}}[\ell(\pi^{\psi},\theta)]$

where our policy and states are parameterized the same way as in Appendix B.1 for a fair comparison. To optimize this parameter, we take gradient steps as in equation 11, but with the new sampling and rollout procedure where $\widetilde{\theta}_{i}\sim\widehat{\mathcal{P}}$. This gradient step follows from both the classical policy gradient algorithm in reinforcement learning and the recent learning to actively learn work by Kveton et al. (2020).

Moreover, note that the optimal policy for this objective must be deterministic, since deterministic policies are optimal for MDPs. Therefore, under our experimental setting, the deterministic Bayes-LAL policy will perform poorly in the adversarial setting (for the same reason that SGBS performs poorly).

Appendix I 20 Questions Setup

Hu et al. (2018) collected a dataset of 1000 celebrities and 500 possible questions to ask about each celebrity. We chose 100 questions out of the 500 by first constructing $\bar{p}^{\prime}$, $\mathcal{X}^{\prime}$, and $\mathcal{Z}^{\prime}$ for the 500-dimensional data, and then sampling 100 of the 500 dimensions without replacement from a distribution derived from a static allocation. We down-sampled the number of questions so that training can run with sufficiently large $M$ and $L$ to de-noise the gradients while being prototyped on a single GPU.

Specifically, the dataset from Hu et al. (2018) consists of the probabilities of people answering Yes / No / Unknown to each celebrity-question pair, collected from some population. To better fit the combinatorial bandit scenario, we re-normalize the probability of answering Yes / No, conditioning on the event that these people did not answer Unknown. The probabilities of answering Yes to all 500 questions for each celebrity then constitute vectors $\bar{p}^{\prime(1)},...,\bar{p}^{\prime(1000)}\in\mathbb{R}^{500}$, where the $i$th dimension of a given $\bar{p}^{\prime(j)}$ is the probability of answering Yes to the $i$th question about the $j$th person. The action set is constructed as $\mathcal{X}^{\prime}=\{\mathbf{e}_{i}:i\in[500]\}$, while $\mathcal{Z}^{\prime}=\{z^{(j)}:[z^{(j)}]_{i}=\bm{1}\{\bar{p}^{\prime(j)}_{i}>1/2\}\}\subset\{0,1\}^{500}$ are the binary vectors obtained by taking majority votes.
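A minimal sketch of this preprocessing step (the array names and the $(1000,500)$ layout are our assumptions about how the raw data is stored):

```python
import numpy as np

def preprocess_20q(p_yes, p_no):
    """p_yes, p_no: (1000, 500) arrays of answer probabilities per
    celebrity-question pair.  Renormalize conditioned on not answering
    Unknown, then threshold at 1/2 to get the majority-vote vectors z^(j)."""
    p_bar = p_yes / (p_yes + p_no)        # P(Yes | answer is Yes or No)
    Z = (p_bar > 0.5).astype(int)         # one binary vector per celebrity
    return p_bar, Z
```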

To sub-sample 100 questions from the 500, we could have selected questions uniformly at random, but many of these questions are not very discriminative. Thus, we chose a "good" set of queries based on the design recommended by the $\rho_{\ast}$ procedure of Fiez et al. (2019). If questions were answered noiselessly in response to a particular $z\in\mathcal{Z}^{\prime}$, then equivalently we have $\theta=2z-1$ for this setting. Since $\rho_{\ast}$ optimizes an allocation $\lambda$ over $\mathcal{X}^{\prime}$ that reduces the number of required queries as much as possible (according to the information-theoretic bound of Fiez et al. (2019)), and we want a single allocation for all $z^{\prime}\in\mathcal{Z}^{\prime}$ simultaneously, we solve the optimization problem

    $\min_{\lambda\in\Delta^{(|\mathcal{X}^{\prime}|-1)}}\max_{z^{\prime}\in\mathcal{Z}^{\prime}}\max_{z\neq z^{\prime}}\frac{\lVert z^{\prime}-z\rVert_{(\sum_{i}\lambda_{i}x_{i}x_{i}^{T})^{-1}}^{2}}{((z^{\prime}-z)^{T}(2z^{\prime}-1))^{2}}.$

We then sample elements from $\mathcal{X}^{\prime}$ according to this optimal $\lambda$ without replacement and add them to $\mathcal{X}$ until $|\mathcal{X}|=100$.
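Given the optimal allocation $\lambda$ from the optimization above, the sub-sampling step can be sketched as follows (the minimax design optimization itself is not shown):

```python
import numpy as np

def subsample_questions(lam, k=100, seed=0):
    """Draw k question indices without replacement, weighted by the design lam."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(lam, dtype=float)
    return rng.choice(len(lam), size=k, replace=False, p=lam / lam.sum())
```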

Appendix J Jester Joke Recommendation Setup

We consider the Jester jokes dataset of Goldberg et al. (2001), which contains jokes ranging from pun-based to grossly offensive. We filter the dataset to contain only users that rated all 100 jokes, resulting in 14116 users. Each joke was rated on a $[-10,10]$ scale, which we rescale to $[-1,1]$. Denote this set of ratings as $\hat{\Theta}=\{\theta_{i}:i\in[14116],\theta_{i}\in[-1,1]^{100}\}$, where $\theta_{i}$ encodes the ratings of all 100 jokes by user $i$. To construct the set of arms $\mathcal{Z}$, we cluster the ratings of these users into 10 groups to obtain $\mathcal{Z}=\{z_{i}:i\in[10],z_{i}\in\{0,1\}^{100}\}$ by minimizing the following objective:

    $\min_{\mathcal{Z}:|\mathcal{Z}|=10}\sum_{i=1}^{14116}\Big(\max_{z_{*}\in\{0,1\}^{100}}\langle z_{*},\theta_{i}\rangle-\max_{z\in\mathcal{Z}}\langle z,\theta_{i}\rangle\Big).$

To solve for $\mathcal{Z}$, we adapt the $k$-means algorithm, using the objective above instead of the traditional $L_{2}$ metric. A minimal sketch of this adaptation follows.
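The sketch alternates between assigning each user to the $z$ with the highest inner product and updating each cluster's binary vector coordinate-wise (include joke $i$ iff the cluster members' summed rating at coordinate $i$ is positive), which is the exact minimizer of the objective above for a fixed assignment; the function and variable names are illustrative.

```python
import numpy as np

def cluster_jokes(thetas, k=10, iters=50, seed=0):
    """k-means-style alternating minimization of the regret-like objective."""
    rng = np.random.default_rng(seed)
    n, d = thetas.shape
    Z = rng.integers(0, 2, size=(k, d))              # random binary init
    for _ in range(iters):
        assign = np.argmax(thetas @ Z.T, axis=1)     # best z for each user
        for j in range(k):
            members = thetas[assign == j]
            if len(members):                         # binary "centroid" update
                Z[j] = (members.sum(axis=0) > 0).astype(int)
    return Z
```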