Learning to Actively Learn: A Robust Approach
Abstract
This work proposes a procedure for designing algorithms for specific adaptive data collection tasks like active learning and pure-exploration multi-armed bandits. Unlike the design of traditional adaptive algorithms that rely on concentration of measure and careful analysis to justify the correctness and sample complexity of the procedure, our adaptive algorithm is learned via adversarial training over equivalence classes of problems derived from information theoretic lower bounds. In particular, a single adaptive learning algorithm is learned that competes with the best adaptive algorithm learned for each equivalence class. Our procedure takes as input just the available queries, set of hypotheses, loss function, and total query budget. This is in contrast to existing meta-learning work that learns an adaptive algorithm relative to an explicit, user-defined subset or prior distribution over problems which can be challenging to define and be mismatched to the instance encountered at test time. This work is particularly focused on the regime when the total query budget is very small, such as a few dozen, which is much smaller than those budgets typically considered by theoretically derived algorithms. We perform synthetic experiments to justify the stability and effectiveness of the training procedure, and then evaluate the method on tasks derived from real data including a noisy 20 Questions game and a joke recommendation task.
1 Introduction
Closed-loop learning algorithms use previous observations to inform what measurements to take next in a closed loop in order to accomplish inference tasks far faster than any fixed measurement plan set in advance. For example, active learning algorithms for binary classification have been proposed that under favorable conditions require exponentially fewer labels than passive, random sampling to identify the optimal classifier (Hanneke et al., 2014; Katz-Samuels et al., 2021). And in the multi-armed bandits literature, adaptive sampling techniques have demonstrated the ability to identify the “best arm” that optimizes some metric with far fewer experiments than a fixed design (Garivier & Kaufmann, 2016; Fiez et al., 2019). Unfortunately, such guarantees often either require simplifying assumptions that limit robustness and applicability, or algorithmic use of concentration inequalities that are very loose unless the number of samples is very large.
This work proposes a framework for producing algorithms that are learned through simulated experience to be as effective and robust as possible, even on a tiny measurement budget (e.g., 20 queries) where most theoretical guarantees do not apply. Our work fits into a recent trend sometimes referred to as learning to actively learn and differentiable meta-learning in bandits (Konyushkova et al., 2017; Bachman et al., 2017; Fang et al., 2017; Boutilier et al., 2020; Kveton et al., 2020), which tunes existing algorithms or learns entirely new active learning algorithms by policy optimization. Previous works in this area learn a policy by optimizing with respect to data observed through prior experience (e.g., meta-learning or transfer learning) or an assumed explicit prior distribution over problem parameters (e.g., a Gaussian prior over the true weight vector for linear regression). In contrast, our approach makes no assumptions about which parameters are likely to be encountered at test time, and therefore produces algorithms that do not suffer from a mismatched prior at test time. Instead, our method learns a policy that attempts to mirror the guarantees of frequentist algorithms with instance-dependent sample complexities: there is an intrinsic difficulty measure that orders problem instances, and for a fixed budget, higher accuracy can be obtained on easier instances than on harder ones. This difficulty measure is most naturally derived from information theoretic lower bounds.
But unlike information theoretic bounds that hand-craft adversarial instances, and inspired by the robust reinforcement learning literature, we formulate a novel adversarial training objective that automatically trains minimax policies, and we propose a tractable and computationally efficient relaxation. This allows our learned policies to be very aggressive while maintaining robustness across problem difficulties, without resorting to loose concentration inequalities inside the algorithm. Indeed, this work is particularly useful in the setting where relatively few rounds of querying can be made. The learning framework is general enough to be applied to many active learning settings of interest and is intended to be used to produce robust and high-performing algorithms. We implement the framework for the pure-exploration combinatorial bandit problem, a paradigm that includes problems such as active binary classification and the 20 questions game. We empirically validate our framework on a simple synthetic experiment before turning our attention to tasks derived from real data, including a noisy 20 Questions game and a joke recommendation task, which are also embedded as combinatorial bandits. As demonstrated in our experiments, in the low-budget setting, our learned algorithms are the only ones that both enjoy robustness guarantees (as opposed to greedy and existing learning-to-actively-learn methods) and perform non-vacuously and near instance-optimally (as opposed to statistically justified algorithms).
2 Proposed Framework for Robust Learning to Actively Learn
From a bird's-eye perspective, whether learned or defined by an expert, any algorithm for active learning can be thought of as a policy from the perspective of reinforcement learning. To be precise, at time $t$, based on an internal state $s_t$, the policy $\pi$ defines a distribution $\pi(\cdot \mid s_t)$ over the set of potential actions $\mathcal{X}$. It then takes an action $x_t \sim \pi(\cdot \mid s_t)$, receives an observation $y_t$, updates the state to $s_{t+1}$, and the process repeats.
Fix a horizon $T$ and a problem instance $\theta \in \Theta$ which parameterizes the observation distribution. For $t = 1, \dots, T$:

- the state $s_t$ is a function of the history $(x_1, y_1, \dots, x_{t-1}, y_{t-1})$,
- an action $x_t$ is drawn at random from the distribution $\pi(\cdot \mid s_t)$ defined over $\mathcal{X}$, and
- the next state $s_{t+1}$ is constructed by taking action $x_t$ in state $s_t$ and observing $y_t$,
until the game terminates at time $T$ and the learner receives a task-specific loss $\ell(\theta, s_{T+1})$. Note that this loss is a random variable that depends on the tuple $(\pi, \theta, T)$. We assume that the observation distribution is a known parametric family, but the parameter $\theta$ is unknown to the policy. Let $\mathbb{P}_{\pi,\theta}$ and $\mathbb{E}_{\pi,\theta}$ denote the probability and expectation under the probability law induced by executing policy $\pi$ in the game with $\theta$ to completion. Note that this law includes any internal randomness of the policy and the random observations $y_t$. Thus, $\mathbb{P}_{\pi,\theta}$ assigns a probability to any trajectory $(x_1, y_1, \dots, x_T, y_T)$. For a given policy $\pi$ and instance $\theta$, the metric of interest we wish to minimize is the expected loss $\ell(\pi, \theta) := \mathbb{E}_{\pi,\theta}[\ell(\theta, s_{T+1})]$, where, as defined above, $\ell(\theta, s_{T+1})$ is the loss observed at the end of the episode. For a fixed policy $\pi$, the map $\theta \mapsto \ell(\pi, \theta)$ defines a loss surface over all possible values of $\theta$. This loss surface captures the fact that some values of $\theta$ are just intrinsically harder than others, but also that a policy may be better suited to some values of $\theta$ than others.
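To make this interaction protocol concrete, the following sketch (the function and variable names are ours, not the paper's code) shows how a policy is rolled out on a problem instance and how the expected loss can be estimated by Monte Carlo. The `sample_observation` and `loss` callables are placeholders for the task-specific observation model and terminal loss.

```python
import numpy as np

def rollout(policy, theta, T, sample_observation, loss, rng):
    """Run one episode of the generic active-learning game.

    policy: maps the history (the state) to a distribution over actions.
    theta:  problem instance, unknown to the policy, parameterizing observations.
    """
    history = []  # the state s_t is a function of (x_1, y_1, ..., x_{t-1}, y_{t-1})
    for t in range(T):
        probs = policy(history)                      # distribution over actions
        x_t = rng.choice(len(probs), p=probs)        # sample an action
        y_t = sample_observation(theta, x_t, rng)    # observe feedback
        history.append((x_t, y_t))
    return loss(theta, history)                      # task-specific terminal loss

def expected_loss(policy, theta, T, sample_observation, loss, n_rollouts=100, seed=0):
    """Monte Carlo estimate of the expected loss of `policy` on instance `theta`."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout(policy, theta, T, sample_observation, loss, rng)
                    for _ in range(n_rollouts)])
```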
Finally, we assume we are equipped with a positive function $C : \Theta \to \mathbb{R}_{>0}$ that assigns a score to each $\theta$, intuitively capturing the "difficulty" of a particular $\theta$, which can be used as a partial ordering of $\Theta$. Ideally, $C(\theta)$ is a monotonic transformation of $\ell(\pi^\star, \theta)$ for some "best" policy $\pi^\star$ that we will define shortly. Our plan is now as follows: in Section 2.1, we ground the discussion and describe $C(\theta)$ for the combinatorial bandit problem. Then in Section 2.2, we zoom out to define our main objective of finding a min-gap optimal policy, finally providing an adversarial training approach in Section 3.
2.1 Complexity for Combinatorial Bandits
A concrete example of the framework above is the combinatorial bandit problem. The learner has access to sets $\mathcal{X} = \{e_1, \dots, e_d\}$, where $e_i$ is the $i$-th standard basis vector, and $\mathcal{Z} \subseteq \{0, 1\}^d$. In each round the learner chooses an $x_t \in \mathcal{X}$ according to a policy and observes $y_t$ with $\mathbb{E}[y_t] = \langle x_t, \theta \rangle$ for some unknown $\theta \in \mathbb{R}^d$. The goal of the learner is Best Arm Identification: denoting $z^\star = \arg\max_{z \in \mathcal{Z}} \langle z, \theta \rangle$, at time $T$ the learner outputs a recommendation $\widehat{z}$ and incurs loss $\mathbf{1}\{\widehat{z} \neq z^\star\}$. This setting naturally captures the 20 questions game. Indeed, assume there are $d$ potential yes/no questions that can be asked at each time, corresponding to the elements of $\mathcal{X}$, and that each element of $\mathcal{Z}$ is a binary vector representing the answers to these questions for a given item. If answers are deterministic then $\theta$ is itself a binary vector, but this framework also captures the case when answers are stochastic, or answered incorrectly with some probability. A policy then decides at each time which question to ask, based on the answers so far, to determine the item closest to the unknown vector $\theta$.
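A minimal instantiation of this encoding for a noisy 20 questions game, as a sketch (the helper names are ours; Bernoulli observations are one choice consistent with the description):

```python
import numpy as np

def make_20q_instance(Z, theta):
    """Combinatorial bandit instance for a noisy 20-questions game.

    Z:     (n_items, d) binary matrix; row j holds item j's answers to the d questions.
    theta: length-d vector; theta[i] is the probability of a "yes" to question i
           for the hidden item (noisy answers).
    """
    z_star = int(np.argmax(Z @ theta))  # item best aligned with theta

    def sample_observation(question, rng):
        return rng.binomial(1, theta[question])  # noisy yes/no answer

    def best_identification_loss(recommended_item):
        return float(recommended_item != z_star)  # 0/1 loss

    return sample_observation, best_identification_loss
```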
As described in Section 5 and Appendix A, combinatorial bandits generalize standard multi-armed bandits as well as binary classification, and have thus received a tremendous amount of interest in recent years. A large portion of this work has focused on providing a precise characterization of the information theoretic limit on the minimal number of samples needed to identify $z^\star$ with high probability, a quantity denoted $\rho^\star(\theta)$, which is the solution to an optimization problem over a set of alternatives to $z^\star$ (Soare et al., 2014; Fiez et al., 2019; Degenne et al., 2020). This quantity provides a natural complexity measure for a given instance, and we describe it in a few specific cases below.
As a warmup example, consider the standard best-arm identification problem where $\mathcal{Z} = \mathcal{X} = \{e_1, \dots, e_d\}$ and choosing action $e_i$ results in a reward with mean $\theta_i$. Let $i^\star = \arg\max_i \theta_i$ and $\Delta_i = \theta_{i^\star} - \theta_i$. In this case $\rho^\star(\theta)$ is essentially the sum of inverse squared gaps, and it has been shown that there exists a constant $c$ such that for any sufficiently large $T$, the probability of misidentifying the best arm is at least on the order of $\exp(-c\, T / \rho^\star(\theta))$.
In other words, more difficult instances correspond to $\theta$ with a small gap between the best arm and the other arms. Moreover, for any $\theta$ there exists a policy that achieves a matching upper bound, up to constant and lower-order terms (Carpentier & Locatelli, 2016; Karnin et al., 2013; Garivier & Kaufmann, 2016). Said plainly, this correspondence between the lower bound and the upper bound for the multi-armed bandit problem shows that $\rho^\star(\theta)$ is a natural choice for $C(\theta)$ in this setting.
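As a point of reference, the correspondence can be summarized as follows (this is our reconstruction in the notation above; the cited works differ in constants and logarithmic factors):

$$
\Delta_i := \theta_{i^\star} - \theta_i, \qquad
\rho^\star(\theta) := \sum_{i \neq i^\star} \Delta_i^{-2}, \qquad
\mathbb{P}_{\pi,\theta}\big(\widehat{z} \neq z^\star\big) \;\asymp\; \exp\!\big(-c\, T / \rho^\star(\theta)\big),
$$

with the lower bound holding for every policy on some nearby instance and the upper bound achieved by the best fixed-budget algorithms.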
In recent years, algorithms for the more general combinatorial bandit setting have been established with instance-dependent sample complexities matching $\rho^\star(\theta)$ (up to logarithmic factors) (Karnin et al., 2013; Chen et al., 2014; Fiez et al., 2019; Chen et al., 2017; Degenne et al., 2020; Katz-Samuels et al., 2020). Another complexity term, which appears in Cao & Krishnamurthy (2017) for combinatorial bandits, is
(1)
One can show that this quantity and $\rho^\star(\theta)$ are closely related (Katz-Samuels et al., 2020), and in many cases they track each other. Because the quantity in equation 1 can be computed much more efficiently than $\rho^\star(\theta)$, we take it as our complexity measure $C(\theta)$.
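Since equation 1 is not reproduced above, the sketch below computes the standard sum-of-inverse-squared-gaps complexity for plain best-arm identification together with a gap-based surrogate for combinatorial bandits that is in the spirit of, though not necessarily identical to, equation 1.

```python
import numpy as np

def mab_complexity(theta):
    """Sum of inverse squared gaps for standard best-arm identification."""
    i_star = np.argmax(theta)
    gaps = theta[i_star] - np.delete(theta, i_star)
    return np.sum(1.0 / gaps**2)

def combinatorial_complexity(Z, theta):
    """Illustrative gap-based surrogate for combinatorial bandits (the paper's
    equation 1 may differ): max over suboptimal z of |symmetric difference| / gap^2."""
    values = Z @ theta
    j_star = int(np.argmax(values))
    z_star, best = Z[j_star], values[j_star]
    comp = 0.0
    for j, z in enumerate(Z):
        if j == j_star:
            continue
        gap = best - values[j]
        sym_diff = np.sum(np.abs(z_star - z))  # coordinates where the two differ
        comp = max(comp, sym_diff / gap**2)
    return comp
```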
2.2 Objective: Responding to All Difficulties
As described above, although there exist algorithms for the combinatorial bandit problem that are instance-optimal in the fixed-confidence setting, along with algorithms for the fixed-budget setting, they do not work well with small budgets because they rely on statistical guarantees. Indeed, for their guarantees to be non-vacuous, the budget must be large relative to the potentially large constants in the upper bounds. In practice, these algorithms are so conservative that for the first 20 samples they would just sample uniformly. To overcome this, we now provide a different framework for policy learning in a worst-case setting that is effective even in the small-budget regime.
The challenge is in finding a policy that performs well across all potential problem instances simultaneously. It is common to consider minimax optimal policies, which attempt to perform well on worst-case instances but, as a result, may perform poorly on easier instances. Thus, an ideal policy would perform uniformly well over a set of $\theta$'s that are all equivalent in "difficulty". Since each $\theta$ is equipped with an inherent notion of difficulty, $C(\theta)$, we can stratify the space of all possible instances by difficulty. A good policy is one whose worst-case performance over all possible problem difficulties is minimized. We formalize this idea below.
For any set of problem instances $\Theta' \subseteq \Theta$ and policy $\pi$, define the worst-case expected loss $\ell(\pi, \Theta') := \sup_{\theta \in \Theta'} \ell(\pi, \theta)$, and for any $r > 0$ let $\Theta_r := \{\theta : C(\theta) \le r\}$ denote the set of instances of difficulty at most $r$.

For a fixed $r$ (including $r = \infty$), a policy that aims to minimize just $\ell(\pi, \Theta_r)$ will be minimax for $\Theta_r$ and may not perform well on easy instances. To overcome this shortsightedness, we introduce a new objective by focusing on the sub-optimality gap of a given policy relative to an $r$-dependent baseline policy trained specifically for each $\Theta_r$. Objective: Return the policy of equation 2, which minimizes the worst-case sub-optimality gap over all $r$.
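A plausible formalization of this objective in the notation above (our reconstruction of equation 2, not a verbatim restatement) is:

$$
\pi^\star \;\in\; \arg\min_{\pi} \; \sup_{r > 0}
\Big[\, \ell(\pi, \Theta_r) \;-\; \min_{\pi'} \ell(\pi', \Theta_r) \,\Big],
\qquad \Theta_r = \{\theta : C(\theta) \le r\}.
$$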
Figure 1 illustrates these definitions. The blue curve ($r$-dependent baseline) captures the best performance that is achievable at each difficulty level $r$. In other words, the $r$-dependent baseline defines a different policy for each value of $r$, so the blue curve may be unachievable by any single policy. The green curve is a policy that matches the $r$-dependent baseline at one particular $r$; though it is the ideal policy for that difficulty, it may be sub-optimal at any other difficulty. The orange curve is the performance of our optimal policy: it is willing to sacrifice performance at any given $r$ to achieve an overall better worst-case gap from the baseline.
3 MAPO: Adversarial Training Algorithm
Identifying $\pi^\star$ naively requires computing the $r$-dependent baseline for every $r > 0$. In practice, given an increasing sequence $r_1 < \dots < r_K$ that indexes nested sets of problem instances of increasing difficulty, $\Theta_{r_1} \subseteq \dots \subseteq \Theta_{r_K}$, we instead identify a policy that minimizes the maximum sub-optimality gap with respect to this sequence. Explicitly, we seek to learn
(3)
Note that as $K \to \infty$ and the grid of difficulties becomes finer, equation 2 and equation 3 are essentially equivalent under benign smoothness conditions on the loss as a function of $r$. In practice, we choose $r_K$ so that $\Theta_{r_K}$ contains all problems that can be solved relatively accurately within the budget, and we choose a small $r_1$. In Algorithm 1, our algorithm MAPO efficiently solves this objective by first computing an $r_k$-dependent baseline policy for each $k$ to obtain benchmark values, and then uses these benchmarks to train the final policy. The next section focuses on the challenges of the optimization problems in equation 4 and equation 5.
(4)
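At a high level, the procedure can be sketched as follows (this is our reading of Algorithm 1; the two training subroutines are placeholders for the saddle-point solvers of Section 3.1):

```python
def mapo(difficulty_levels, train_minimax_policy, train_gap_policy):
    """Sketch of the MAPO outer loop.

    difficulty_levels:    increasing values r_1 < ... < r_K defining nested sets Theta_{r_k}.
    train_minimax_policy: solves min over policies of the worst-case loss on Theta_r (equation 4),
                          returning the baseline policy and its worst-case loss value.
    train_gap_policy:     trains one policy minimizing the maximum gap to the benchmarks (equation 5).
    """
    benchmarks = {}
    for r in difficulty_levels:
        baseline_policy, worst_case_value = train_minimax_policy(r)
        benchmarks[r] = worst_case_value
    # A single policy competing with every r-dependent baseline simultaneously (equation 3).
    return train_gap_policy(difficulty_levels, benchmarks)
```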
3.1 Differentiable policy optimization
The critical part of running MAPO (Algorithm 1) is solving equation 4 and equation 5. Note that equation 5 is an optimization of the same form as equation 4 after shifting the loss by a scalar benchmark value. Consequently, to learn both the baseline policies and the final policy, it suffices to develop a training procedure that solves this problem for an arbitrary set of instances and a generic loss function.
We would like to solve this saddle-point problem using the alternating gradient descent/ascent method in Algorithm 2, which we describe now. Instead of optimizing over all possible policies, we restrict the policy class to neural networks that take a state representation as input and output a probability distribution over actions, parameterized by weights $w$. In practice, the loss may be poorly behaved as a function of $\theta$, so a gradient descent/ascent procedure may get stuck in a neighborhood of a critical point that is not an optimal solution to the saddle-point problem. To avoid this, we instead track many different possible $\theta$'s (intuitively corresponding to different initializations):
(6)
(7)
(8)
In the first equality we replace the maximum over all of $\Theta$ with a maximum over all subsets of $\Theta$ of size $M$. The resulting maximum over the $M$ points is still a discrete optimization. To smooth it out, we use the fact that the maximum over a finite set equals the maximum, over all probability distributions on that set, of the expectation under the distribution. In the last equality, we reparameterize this set of distributions with a softmax whose logits $\lambda$ weight the different values of $\theta$. In each round, we backpropagate through the particles $\theta_1, \dots, \theta_M$ and the softmax logits $\lambda$.
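In symbols, the chain of relaxations just described can be written as follows (our reconstruction of equations 6–8, with $f(\theta)$ standing for the objective being maximized over instances):

$$
\max_{\theta \in \Theta} f(\theta)
\;=\; \max_{\theta_1, \dots, \theta_M \in \Theta} \; \max_{1 \le i \le M} f(\theta_i)
\;=\; \max_{\theta_1, \dots, \theta_M \in \Theta} \; \max_{p \in \Delta_M} \; \mathbb{E}_{i \sim p}\big[f(\theta_i)\big]
\;\approx\; \max_{\theta_1, \dots, \theta_M \in \Theta} \; \max_{\lambda \in \mathbb{R}^M} \; \sum_{i=1}^{M} \mathrm{softmax}(\lambda)_i\, f(\theta_i).
$$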
Now we discuss the optimization routine outlined in Algorithm 2. For the inner optimization, ideally, in each round we would build an estimate of the loss at the current policy for each of the $M$ particles under consideration. To do so, we roll out the policy on each particle several times and average the resulting losses (this also allows us to construct a stochastic gradient of the loss). In practice we cannot afford to consider all $M$ particles in every round, so instead we sample $L$ of them from the softmax distribution. This has a computational benefit: it allows us to be strategic by considering, in each round, the particles that are closest to the maximizer.
After this, we backpropagate through the particles and the softmax logits using the stochastic gradients obtained from the rollouts. Finally, we update the policy weights $w$ by backpropagating through the neural network under consideration. The gradient steps are taken with unbiased gradient estimates computed using the score-function identity, described in detail in Appendix C. We outline further implementation details in Appendix B, which restates the algorithm with explicit gradient-estimate formulas. Hyperparameters can be found in Appendix D.
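The following PyTorch-style sketch illustrates one alternating update, assuming the loss is differentiable in $\theta$ and using a score-function (REINFORCE) estimator for the policy. The names are ours, and the full estimator of Appendix C also includes a score term through the observation likelihood that is omitted here for brevity.

```python
import torch

def rollout_objective(policy, particles, lam, rollout, loss_fn, L):
    """Monte Carlo surrogate whose gradients recover (i) a pathwise gradient for the
    particles, (ii) a score-function gradient for the softmax logits lam, and
    (iii) a REINFORCE gradient for the policy weights."""
    weights = torch.softmax(lam, dim=0)
    idx = torch.multinomial(weights.detach(), L, replacement=True)  # sample L of M particles
    total = 0.0
    for i in idx:
        theta = particles[i]
        logp, recommendation = rollout(policy, theta)   # logp = sum_t log pi(x_t | s_t)
        loss = loss_fn(theta, recommendation)           # differentiable in theta
        total = total + loss + loss.detach() * (logp + torch.log(weights[i]))
    return total / L

def alternating_step(policy, particles, lam, rollout, loss_fn,
                     opt_policy, opt_adversary, L=10):
    """One ascent step on the adversary (particles, lam) and one descent step on the policy."""
    # Adversary: gradient ascent on the softmax-weighted expected loss.
    obj = rollout_objective(policy, particles, lam, rollout, loss_fn, L)
    opt_adversary.zero_grad()
    (-obj).backward()
    opt_adversary.step()

    # Policy: gradient descent on the same objective, estimated from a fresh batch of rollouts.
    obj = rollout_objective(policy, particles, lam, rollout, loss_fn, L)
    opt_policy.zero_grad()
    obj.backward()
    opt_policy.step()
```

Here `opt_adversary` optimizes `[particles, lam]` and `opt_policy` optimizes the policy network's parameters; both can be Adam optimizers as in the paper.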
4 Experiments
We now evaluate the approach described in the previous section for combinatorial bandits with a query budget of 20. This setting generalizes both binary active classification for an arbitrary model class and active recommendation, which we evaluate by conducting experiments on two corresponding real datasets. We evaluate based on two criteria: instance-dependent worst-case and average-case. For the instance-dependent worst-case criterion, we measure, for each difficulty $r$ and policy $\pi$, the worst-case expected loss over $\Theta_r$ and plot this value as a function of $r$; we note that our algorithm is designed to optimize exactly this kind of metric. For the secondary average-case metric, we instead measure, for a policy $\pi$, the average loss over a collected set of instances. Performance under the instance-dependent worst-case metric is reported in Figures 2, 3, 4, 6, and 7 below, while average-case performance is reported in the tables and Figure 5. Full-scale versions of the figures can be found in Appendix F.
4.1 Algorithms
We compare against a number of baseline active learning algorithms (see Section 5 for a review). Uncertainty sampling at each time computes the empirical maximizer of $\langle z, \widehat{\theta}_t \rangle$ over $\mathcal{Z}$ and the runner-up, and samples an index uniformly from their symmetric difference (i.e., thinking of elements of $\mathcal{Z}$ as subsets of $[d]$); if either is not unique, an index is sampled from the region of disagreement of the winners (see Appendix G for details). The greedy methods are represented by soft generalized binary search (SGBS) (Nowak, 2011), which maintains a posterior distribution over $\mathcal{Z}$ and samples to maximize information gain. A hyperparameter of SGBS determines the strength of the likelihood update; we plot or report a range of performance over several settings of this hyperparameter. The agnostic algorithms for classification (Balcan et al., 2006; Hanneke, 2007b; a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019) or combinatorial bandits (Chen et al., 2014; Gabillon et al., 2016; Chen et al., 2017; Cao & Krishnamurthy, 2017; Fiez et al., 2019; Jain & Jamieson, 2019) are so conservative that, given just 20 samples, they are all exactly equivalent to uniform sampling and hence represented by Uniform. To represent a policy based on learning to actively learn with respect to a prior, we employ the method of Kveton et al. (2020), denoted Bayes-LAL, with a fixed prior constructed by drawing a $z$ uniformly at random from $\mathcal{Z}$ and defining $\theta$ accordingly (details in Appendix H). When evaluating each policy, we use the successive halving algorithm (Li et al., 2017; 2018) to optimize our non-convex evaluation objective with randomly initialized gradient-based search and restarts (details in Appendix E).
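For concreteness, here is a minimal sketch of an SGBS-style baseline as we understand it; choosing the question that most evenly splits the posterior mass is a standard proxy for maximizing information gain, and the hyperparameter name `beta` is ours.

```python
import numpy as np

def sgbs_policy(Z, T, beta, sample_answer):
    """Soft generalized binary search (in the spirit of Nowak, 2011), sketched.

    Z:             (n, d) binary matrix of hypotheses (items x questions).
    beta:          in (0, 1/2); strength of the soft likelihood update.
    sample_answer: function(question) -> 0/1 observation from the environment.
    """
    n, d = Z.shape
    posterior = np.full(n, 1.0 / n)
    for _ in range(T):
        # Choose the question whose predicted "yes" mass is closest to 1/2.
        yes_mass = posterior @ Z
        q = int(np.argmin(np.abs(yes_mass - 0.5)))
        y = sample_answer(q)
        # Soft multiplicative update: agreeing hypotheses get weight (1 - beta), others beta.
        agree = (Z[:, q] == y)
        posterior *= np.where(agree, 1.0 - beta, beta)
        posterior /= posterior.sum()
    return int(np.argmax(posterior))  # recommended item
```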
4.2 Synthetic Dataset: Thresholds
We begin with a very simple instance to demonstrate the instance-dependent performance achieved by our learned policy. The queries are the $d$ coordinates, $\mathcal{Z}$ encodes the one-dimensional threshold classifiers over those coordinates, and each query returns a Bernoulli observation whose mean is determined by the corresponding coordinate of the instance. Appendix A shows that the maximizer of $\langle z, \theta \rangle$ is the best threshold classifier for the label distribution induced by the instance. We trained $r$-dependent baseline policies for the Best Identification metric with a budget of $T = 20$ and for a grid of difficulty levels $r$.
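A hypothetical construction consistent with this description (the dimension, threshold location, and noise level below are illustrative only, not the paper's values):

```python
import numpy as np

def make_threshold_instance(d=25, tau=10, p_flip=0.4):
    """Hypothetical 1D-threshold instance.

    Queries are the d locations on a line; hypothesis z_j labels the first j
    locations 1 and the rest 0.  Querying location i returns Bernoulli(p[i]).
    Following the classification-to-combinatorial-bandit reduction (Appendix A),
    the score vector is theta = 2p - 1, so the best hypothesis maximizes <z, theta>.
    """
    Z = np.tril(np.ones((d + 1, d), dtype=int), k=-1)        # d+1 threshold classifiers
    p = np.where(np.arange(d) < tau, 1.0 - p_flip, p_flip)   # Bernoulli label means
    theta = 2.0 * p - 1.0
    assert np.argmax(Z @ theta) == tau                       # best threshold sits at tau
    return Z, p, theta
```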
First we compare the base policies to the final policy. Figure 2 presents the worst-case loss as a function of $r$ for our base policies and the final policy, each as an individual curve. Figure 3 plots the same information in terms of the gap from the $r$-dependent baseline. We observe that each base policy performs best in a particular region, and the final policy performs almost as well as the $r$-dependent baseline policies over the whole range of $r$.
Under the same conditions as Figure 2, Figure 4 compares the performance of our policy to the algorithm benchmarks. Since SGBS and Bayes-LAL are deterministic, the adversarial training finds a $\theta$ that tricks them into catastrophic failure. Figure 5 trades adversarial evaluation for evaluation with respect to a parameterized prior: for each noise level, an instance is defined by drawing a $z$ uniformly at random from $\mathcal{Z}$ and then flipping each of its signs independently with some probability. We then compute the expected loss under this prior. While SGBS now performs much better than uniform and uncertainty sampling, our policy is still superior to these policies. However, Bayes-LAL is best overall, which is expected since the support of this prior is essentially a rescaled version of the prior used in Bayes-LAL.
4.3 Real Datasets
20 Questions. Our dataset is constructed from the real data of Hu et al. (2018). Summarizing how we used the data, 100 yes/no questions were considered for 1000 celebrities. Each question for each person was answered by several annotators to construct an empirical probability denoting the proportion of annotators that answered "yes." To construct our instance, we take $\mathcal{X}$ to encode the questions and $\mathcal{Z}$ to encode the celebrities. Just as before, we trained for the Best Identification metric with a budget of $T = 20$ and for a grid of difficulty levels $r$. See Appendix I for details.
Jester Joke Recommendation. We now turn our attention away from Best Identification to Simple Regret, where the loss is the gap in value between the best arm and the recommended arm. We consider the Jester jokes dataset of Goldberg et al. (2001), which contains 100 jokes ranging from innocent puns to grossly offensive jokes. We filter the dataset to only contain users that rated all jokes, resulting in 14116 users. A rating of each joke was provided on a continuous scale, which was rescaled to $[0, 1]$, and observations were simulated as Bernoullis as above. We then clustered the ratings of these users into 10 groups (see Appendix J for details) to obtain $\mathcal{Z}$, where each arm corresponds to recommending a particular joke to a particular user cluster.
4.3.1 Instance-dependent Worst-case
Figure 6 and Figure 7 are analogous to Figure 4 but for the 20 questions and Jester joke instances, respectively. The two deterministic policies, SGBS and Bayes-LAL, fail on these datasets as well against the worst-case instances.
On the Jester joke dataset, our policy alone nearly achieves the $r$-dependent baseline for all $r$. But on 20 Questions, uncertainty sampling also performs remarkably well. These experiments on real datasets demonstrate that our policy obtains near-optimal instance-dependent sample complexity.
4.3.2 Average Case Performance
While the metric of the previous section rewarded algorithms that perform uniformly well over all possible environments that could be encountered, in this section we consider the performance of an algorithm with respect to a distribution over environments, which we denote as average case.
20 Questions (average accuracy, higher is better):

| Method | Accuracy (%) |
|---|---|
| Ours | 17.9 |
| SGBS | {26.5, 26.2, 27.2, 26.5, 21.4, 12.8} |
| Uncertainty | 14.3 |
| Bayes-LAL | 4.1 |
| Uniform | 6.9 |

Jester Joke Recommendation (average simple regret, lower is better):

| Method | Regret |
|---|---|
| Ours | 3.209 |
| SGBS | {3.180, 3.224, 3.278, 3.263, 3.153, 3.090} |
| Uncertainty | 3.027 |
| Bayes-LAL | 3.610 |
| Uniform | 3.877 |
While heuristic-based algorithms (such as SGBS, uncertainty sampling, and Bayes-LAL) can perform catastrophically on worst-case instances, they can perform very well with respect to a benign distribution over instances. Here we demonstrate that our policy not only performs optimally under the instance-dependent worst-case metric but also remains comparable when evaluated under the average-case metric. To measure average performance, we construct prior distributions based on the individual datasets:
- For the 20 Questions dataset, to draw an instance, we uniformly at random select a celebrity and set each coordinate of $\theta$ to that celebrity's empirical probability of a "yes" answer to the corresponding question.
- For the Jester joke recommendation dataset, to draw an instance, we uniformly sample a user and use their ratings of each joke.
On the 20 Questions dataset, as shown in Table 4.3.2, SGBS and our policy are the winners. Bayes-LAL performs much worse in this case, potentially because of the distribution shift from the prior we train on to the prior at test time. The strong performance of SGBS may be due to the fact that these instances nearly satisfy a realizability condition under which SGBS has strong guarantees (Nowak, 2011). On the Jester joke dataset, Table 4.3.2 shows that despite our policy not being trained for this setting, its performance is still among the top.
5 Related work
Learning to actively learn. Previous works vary in how they parameterize the policy, ranging from parameterized mixtures of existing expertly designed active learning algorithms (Baram et al., 2004; Hsu & Lin, 2015; Agarwal et al., 2016), to parameterizing hyperparameters (e.g., learning rate, rate of forced exploration, etc.) of an existing popular algorithm (e.g., EXP3) (Konyushkova et al., 2017; Bachman et al., 2017; Cella et al., 2020), to the most ambitious, policies parameterized end-to-end as in this work (Boutilier et al., 2020; Kveton et al., 2020; Sharaf & Daumé III, 2019; Fang et al., 2017; Woodward & Finn, 2017). These works define a prior distribution either through past experience (meta-learning) or expert knowledge, and then evaluate their policy with respect to this prior distribution. Defining this prior can be difficult, and moreover, if the instance encountered at test time does not follow this prior distribution, performance can suffer significantly. Our approach, on the other hand, takes an adversarial training approach and can be interpreted as learning a parameterized least favorable prior (Wasserman, 2013), thus producing a much more robust policy as an end result.
Robust and Safe Reinforcement Learning. Our work is also closely related to the field of robust and safe reinforcement learning, where our objective can be considered an instance of the minimax criterion under parameter uncertainty (Garcıa & Fernández, 2015). Widely applied in areas such as robotics (Mordatch et al., 2015; Rajeswaran et al., 2016), these methods train a policy in a simulator like Mujoco (Todorov et al., 2012) to minimize a defined loss objective while remaining robust to uncertainties and perturbations of the environment. Ranges of these uncertainty parameters are chosen based on potential values that could be encountered when deploying the robot in the real world. In our setting, however, defining the set of environments is far less straightforward, a difficulty we overcome by adopting the complexity function $C(\theta)$.
Active Binary Classification Algorithms. The literature on active learning algorithms can be partitioned into model-based heuristics like uncertainty sampling, query by committee, or model-change sampling (Settles, 2009), greedy binary-search like algorithms that typically rely on a form of bounded noise for correctness (Dasgupta, 2005; 2006; Kääriäinen, 2006; Golovin & Krause, 2011; Nowak, 2011), and agnostic algorithms that make no assumptions on the probabilistic model (Balcan et al., 2006; Hanneke, 2007b; a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019; Katz-Samuels et al., 2020; 2021). Though the heuristics and greedy methods can perform very well for some problems, it is typically easy to construct counter-examples (e.g., outside the assumptions) in which they catastrophically fail as demonstrated in our experiments. The agnostic algorithms have strong robustness guarantees but rely on concentration inequalities, and consequently require at least hundreds of labels to observe any deviation from random sampling (see Huang et al. (2015) for comparison). Therefore, they were implicitly represented by uniform in our experiments.
Pure-exploration Multi-armed Bandit Algorithms. In the linear structured setting, for sets $\mathcal{X}, \mathcal{Z} \subset \mathbb{R}^d$ known to the player, pulling an "arm" $x \in \mathcal{X}$ results in an observation equal to $\langle x, \theta \rangle$ plus zero-mean noise, and the objective is to identify $\arg\max_{z \in \mathcal{Z}} \langle z, \theta \rangle$ for a vector $\theta$ unknown to the player (Soare et al., 2014; Karnin, 2016; Tao et al., 2018; Xu et al., 2017; Fiez et al., 2019). A special case of linear bandits is combinatorial bandits, where $\mathcal{X} = \{e_1, \dots, e_d\}$ and $\mathcal{Z} \subseteq \{0, 1\}^d$ (Chen et al., 2014; Gabillon et al., 2016; Chen et al., 2017; Cao & Krishnamurthy, 2017; Fiez et al., 2019; Jain & Jamieson, 2019). Active binary classification is a special case of combinatorial pure-exploration multi-armed bandits (Jain & Jamieson, 2019), which we exploit in the threshold experiments. While the above works have made great theoretical advances in deriving algorithms and information theoretic lower bounds that match up to constants, the constants are so large that these algorithms only behave well when the number of measurements is very large. When applied to the instances of our paper (only 20 queries are made), these algorithms behave no differently than random sampling.
6 Discussion and Future Directions
We see this work as an exciting but preliminary step towards realizing the full potential of this general approach. Although our experiments have focused on applications of combinatorial bandits, we see this framework generalizing with minor changes to many more widely applicable settings such as multi-class active classification, contextual bandits, etc. To generalize to these settings, one can refer to existing literature for instance-dependent lower bounds (Katz-Samuels et al., 2021; Agarwal et al., 2014). Alternatively, when such a lower bound does not exist, we conjecture that a heuristic scoring function could also serve as $C(\theta)$. For example, in a chess game, one could simply use the scoring function of the pieces left on the board as a proxy for difficulty.
From a practical perspective, training a policy can take many hours of computational resources even for these small instances. Scaling these methods to larger instances is an important next step. While training time scales linearly with the horizon length $T$, we note that one can take multiple samples per time step; with minimal computational overhead, this could enable training on problems that require larger sample complexities. In our implementation we hard-coded the recommendation rule given the final state, but it could also be learned as in Luedtke et al. (2020). Likewise, the parameterization of the policy and generator worked well for our purposes but was chosen somewhat arbitrarily; are there more natural choices? Finally, while we focused on stochastic settings, this work naturally extends to constrained, fully adaptive adversarial sequences, which is an interesting direction for future work.
Funding disclosure
This work was supported in part by NSF IIS-1907907.
Acknowledgement
The authors would like to thank Aravind Rajeswaran for insightful discussions on Robust Reinforcement Learning. We would also like to thank Julian Katz-Samuels and Andrew Wagenmaker for helpful comments.
References
- Agarwal et al. (2014) Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646. PMLR, 2014.
- Agarwal et al. (2016) Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E Schapire. Corralling a band of bandit algorithms. arXiv preprint arXiv:1612.06246, 2016.
- Aleksandrov et al. (1968) V. M. Aleksandrov, V. I. Sysoyev, and V. V. Shemeneva. Stochastic optimization. Engineering Cybernetics, (5):11–16, 1968.
- Bachman et al. (2017) Philip Bachman, Alessandro Sordoni, and Adam Trischler. Learning algorithms for active learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 301–310. JMLR. org, 2017.
- Balcan et al. (2006) Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd international conference on Machine learning, pp. 65–72, 2006.
- Baram et al. (2004) Yoram Baram, Ran El Yaniv, and Kobi Luz. Online choice of active learning algorithms. Journal of Machine Learning Research, 5(Mar):255–291, 2004.
- Boutilier et al. (2020) Craig Boutilier, Chih-Wei Hsu, Branislav Kveton, Martin Mladenov, Csaba Szepesvari, and Manzil Zaheer. Differentiable bandit exploration. arXiv preprint arXiv:2002.06772, 2020.
- Cao & Krishnamurthy (2017) Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Efficient algorithms and an analysis with localization. stat, 2017.
- Carpentier & Locatelli (2016) Alexandra Carpentier and Andrea Locatelli. Tight (lower) bounds for the fixed budget best arm identification bandit problem. In Conference on Learning Theory, pp. 590–604, 2016.
- Cella et al. (2020) Leonardo Cella, Alessandro Lazaric, and Massimiliano Pontil. Meta-learning with stochastic linear bandits. arXiv preprint arXiv:2005.08531, 2020.
- Chen et al. (2017) Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pp. 482–534, 2017.
- Chen et al. (2014) Shouyuan Chen, Tian Lin, Irwin King, Michael R Lyu, and Wei Chen. Combinatorial pure exploration of multi-armed bandits. In Advances in Neural Information Processing Systems, pp. 379–387, 2014.
- Dasgupta (2005) Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in neural information processing systems, pp. 337–344, 2005.
- Dasgupta (2006) Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in neural information processing systems, pp. 235–242, 2006.
- Dasgupta et al. (2008) Sanjoy Dasgupta, Daniel J Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in neural information processing systems, pp. 353–360, 2008.
- Degenne et al. (2020) Rémy Degenne, Pierre Ménard, Xuedong Shang, and Michal Valko. Gamification of pure exploration for linear bandits. In International Conference on Machine Learning, pp. 2432–2442. PMLR, 2020.
- Fang et al. (2017) Meng Fang, Yuan Li, and Trevor Cohn. Learning how to active learn: A deep reinforcement learning approach. arXiv preprint arXiv:1708.02383, 2017.
- Fiez et al. (2019) Tanner Fiez, Lalit Jain, Kevin G Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. In Advances in Neural Information Processing Systems, pp. 10666–10676, 2019.
- Gabillon et al. (2016) Victor Gabillon, Alessandro Lazaric, Mohammad Ghavamzadeh, Ronald Ortner, and Peter Bartlett. Improved learning complexity in combinatorial pure exploration bandits. In Artificial Intelligence and Statistics, pp. 1004–1012, 2016.
- Garcıa & Fernández (2015) Javier Garcıa and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
- Garivier & Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pp. 998–1027, 2016.
- Goldberg et al. (2001) Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Eigentaste: A constant time collaborative filtering algorithm. information retrieval, 4(2):133–151, 2001.
- Golovin & Krause (2011) Daniel Golovin and Andreas Krause. Adaptive submodularity: Theory and applications in active learning and stochastic optimization. Journal of Artificial Intelligence Research, 42:427–486, 2011.
- Hanneke (2007a) Steve Hanneke. A bound on the label complexity of agnostic active learning. In Proceedings of the 24th international conference on Machine learning, pp. 353–360, 2007a.
- Hanneke (2007b) Steve Hanneke. Teaching dimension and the complexity of active learning. In International Conference on Computational Learning Theory, pp. 66–81. Springer, 2007b.
- Hanneke et al. (2014) Steve Hanneke et al. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014.
- Hao et al. (2019) Botao Hao, Tor Lattimore, and Csaba Szepesvari. Adaptive exploration in linear contextual bandit. arXiv preprint arXiv:1910.06996, 2019.
- Hsu & Lin (2015) Wei-Ning Hsu and Hsuan-Tien Lin. Active learning by learning. In Twenty-Ninth AAAI conference on artificial intelligence, 2015.
- Hu et al. (2018) Huang Hu, Xianchao Wu, Bingfeng Luo, Chongyang Tao, Can Xu, Wei Wu, and Zhan Chen. Playing 20 question game with policy-based reinforcement learning. arXiv preprint arXiv:1808.07645, 2018.
- Huang et al. (2015) Tzu-Kuo Huang, Alekh Agarwal, Daniel J Hsu, John Langford, and Robert E Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems, pp. 2755–2763, 2015.
- Jain & Jamieson (2019) Lalit Jain and Kevin G Jamieson. A new perspective on pool-based active classification and false-discovery control. In Advances in Neural Information Processing Systems, pp. 13992–14003, 2019.
- Kääriäinen (2006) Matti Kääriäinen. Active learning in the non-realizable case. In International Conference on Algorithmic Learning Theory, pp. 63–77. Springer, 2006.
- Karnin et al. (2013) Zohar Karnin, Tomer Koren, and Oren Somekh. Almost optimal exploration in multi-armed bandits. In International Conference on Machine Learning, pp. 1238–1246, 2013.
- Karnin (2016) Zohar S Karnin. Verification based solution for structured mab problems. In Advances in Neural Information Processing Systems, pp. 145–153, 2016.
- Katz-Samuels et al. (2020) Julian Katz-Samuels, Lalit Jain, Zohar Karnin, and Kevin Jamieson. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. arXiv preprint arXiv:2006.11685, 2020.
- Katz-Samuels et al. (2021) Julian Katz-Samuels, Jifan Zhang, Lalit Jain, and Kevin Jamieson. Improved algorithms for agnostic pool-based active classification. arXiv preprint arXiv:2105.06499, 2021.
- Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Konyushkova et al. (2017) Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, pp. 4225–4235, 2017.
- Kveton et al. (2020) Branislav Kveton, Martin Mladenov, Chih-Wei Hsu, Manzil Zaheer, Csaba Szepesvari, and Craig Boutilier. Differentiable meta-learning in contextual bandits. arXiv preprint arXiv:2006.05094, 2020.
- Lattimore & Szepesvari (2016) Tor Lattimore and Csaba Szepesvari. The end of optimism? an asymptotic analysis of finite-armed linear bandits. arXiv preprint arXiv:1610.04491, 2016.
- Li et al. (2018) Liam Li, Kevin Jamieson, Afshin Rostamizadeh, Ekaterina Gonina, Moritz Hardt, Benjamin Recht, and Ameet Talwalkar. Massively parallel hyperparameter tuning. arXiv preprint arXiv:1810.05934, 2018.
- Li et al. (2017) Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1):6765–6816, 2017.
- Luedtke et al. (2020) Alex Luedtke, Marco Carone, Noah Simon, and Oleg Sofrygin. Learning to learn from data: Using deep adversarial learning to construct optimal statistical procedures. Science Advances, 6(9), 2020. doi: 10.1126/sciadv.aaw2140. URL https://advances.sciencemag.org/content/6/9/eaaw2140.
- Mannor & Tsitsiklis (2004) Shie Mannor and John N Tsitsiklis. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648, 2004.
- Mordatch et al. (2015) Igor Mordatch, Kendall Lowrey, and Emanuel Todorov. Ensemble-cio: Full-body dynamic motion planning that transfers to physical humanoids. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 5307–5314. IEEE, 2015.
- Nowak (2011) Robert D Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
- Ok et al. (2018) Jungseul Ok, Alexandre Proutiere, and Damianos Tranos. Exploration in structured reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8874–8882, 2018.
- Rajeswaran et al. (2016) Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine. Epopt: Learning robust neural network policies using model ensembles. arXiv preprint arXiv:1610.01283, 2016.
- Settles (2009) Burr Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
- Sharaf & Daumé III (2019) Amr Sharaf and Hal Daumé III. Meta-learning for contextual bandit exploration. arXiv preprint arXiv:1901.08159, 2019.
- Simchowitz & Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Advances in Neural Information Processing Systems, pp. 1151–1160, 2019.
- Simchowitz et al. (2017) Max Simchowitz, Kevin Jamieson, and Benjamin Recht. The simulator: Understanding adaptive sampling in the moderate-confidence regime. arXiv preprint arXiv:1702.05186, 2017.
- Soare et al. (2014) Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. In Advances in Neural Information Processing Systems, pp. 828–836, 2014.
- Tao et al. (2018) Chao Tao, Saúl Blanco, and Yuan Zhou. Best arm identification in linear bandits with linear dimension dependency. In International Conference on Machine Learning, pp. 4877–4886, 2018.
- Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.
- Tsybakov (2008) Alexandre B Tsybakov. Introduction to nonparametric estimation. Springer Science & Business Media, 2008.
- Van Parys & Golrezaei (2020) Bart Van Parys and Negin Golrezaei. Optimal learning for structured bandits. 2020.
- Wasserman (2013) Larry Wasserman. All of statistics: a concise course in statistical inference. Springer Science & Business Media, 2013.
- Woodward & Finn (2017) Mark Woodward and Chelsea Finn. Active one-shot learning. arXiv preprint arXiv:1702.06559, 2017.
- Xu et al. (2017) Liyuan Xu, Junya Honda, and Masashi Sugiyama. Fully adaptive algorithm for pure exploration in linear bandits. arXiv preprint arXiv:1710.05552, 2017.
Appendix A Instance Dependent Sample Complexity
Identifying a suitable form of $C(\theta)$ is not as difficult a task as one might think, due to the proliferation of tools for proving lower bounds for active learning (Mannor & Tsitsiklis, 2004; Tsybakov, 2008; Garivier & Kaufmann, 2016; Carpentier & Locatelli, 2016; Simchowitz et al., 2017; Chen et al., 2014). One can directly extract candidate complexity measures from the literature on regret minimization for linear or other structured bandits (Lattimore & Szepesvari, 2016; Van Parys & Golrezaei, 2020), contextual bandits (Hao et al., 2019), and tabular as well as structured MDPs (Simchowitz & Jamieson, 2019; Ok et al., 2018). Moreover, we believe that even reasonable surrogates of $C(\theta)$ should result in a high-quality policy.
We review some canonical examples:
- Multi-armed bandits. In the best-arm identification problem, there are $d$ Gaussian distributions, where the $i$-th distribution has mean $\theta_i$. In the above formulation, this problem is encoded as follows: action $e_i$ results in an observation with mean $\theta_i$, and the loss is the indicator that the recommended index is not the best arm $i^\star = \arg\max_i \theta_i$. It has been shown that there exists a constant $c$ such that for any sufficiently large $T$, the probability of error is at least on the order of $\exp(-c\, T / \rho^\star(\theta))$ with $\rho^\star(\theta) = \sum_{i \neq i^\star} (\theta_{i^\star} - \theta_i)^{-2}$. Moreover, for any $\theta$ there exists a policy that achieves a matching upper bound up to constant and lower-order terms (Carpentier & Locatelli, 2016; Karnin et al., 2013; Simchowitz et al., 2017; Garivier & Kaufmann, 2016).

The above correspondence between the lower bound and the upper bound suggests that this quantity plays a critical role in determining the difficulty of identifying the best arm for any $\theta$. This exercise extends to more structured settings as well:

- Content recommendation / active search. Consider $n$ items (e.g., movies, proteins), where the $i$-th item is represented by a feature vector $x_i \in \mathbb{R}^d$ and a measurement (e.g., preference rating, binding affinity to a target) follows a linear response model with mean $\langle x_i, \theta \rangle$ for some unknown $\theta$. If the goal is to identify the best item, as above, then nearly identical results hold for an analogous complexity function of $\theta$ (Soare et al., 2014; Karnin, 2016; Fiez et al., 2019).

- Active binary classification. For $i \in [n]$, let $x_i$ be the feature vector of an unlabeled item (e.g., an image) that can be queried for its binary label, where the label is 1 with some unknown probability $\eta_i$. Let $\mathcal{H}$ be an arbitrary set of classifiers (e.g., neural nets, random forests, etc.) such that each $h \in \mathcal{H}$ assigns a label to each of the $n$ items in the pool. If items are chosen sequentially to observe their labels, the objective is to identify the true risk minimizer in $\mathcal{H}$ using as few requested labels as possible, the loss being the indicator that the recommended classifier is not the risk minimizer. Many candidates for the complexity measure have been proposed in the agnostic active learning literature (Balcan et al., 2006; Hanneke, 2007b; a; Dasgupta et al., 2008; Huang et al., 2015; Jain & Jamieson, 2019), but we believe the most granular candidates come from the combinatorial bandit literature (Chen et al., 2017; Fiez et al., 2019; Cao & Krishnamurthy, 2017; Jain & Jamieson, 2019). To make the reduction, for each $h \in \mathcal{H}$ assign a binary vector $z_h \in \{0, 1\}^n$ with $(z_h)_i = h(x_i)$, and set $\mathcal{Z} = \{z_h : h \in \mathcal{H}\}$. With $\theta$ defined as an affine transformation of the label probabilities $\eta$, the true risk minimizer corresponds to the maximizer of $\langle z_h, \theta \rangle$. Thus, requesting the label of example $i$ is equivalent to sampling from a Bernoulli with mean $\eta_i$, completing the reduction to combinatorial bandits with $\mathcal{X} = \{e_1, \dots, e_n\}$. We then apply the exact same complexity measure as above for linear bandits.
Appendix B Gradient Based Optimization Algorithm Implementation
First, we restate the algorithm with explicit gradient estimator formulas derived in Appendix C.
(9)
(10)
(11)
In the above algorithm, the gradient estimates are unbiased estimates of the true gradients with respect to the policy weights, the particles, and the softmax logits (shown in Appendix C). We choose the number of particles $M$ large enough to avoid mode collapse, and the number of rollouts as large as possible to reduce variance in the gradient estimates while fitting within the memory constraint. We then choose an appropriately large number of optimization iterations so that the variance of the gradient estimates is reduced dramatically by averaging over time. We use the Adam optimizer (Kingma & Ba, 2014) for the gradient updates.
Note the decomposition of the gradient estimates in equation 10 and equation 11 into two terms, one depending only on the generating distribution and one depending only on the policy. During evaluation of a fixed policy, we are interested in maximizing the loss over instances via gradient ascent updates like equation 10. This decoupling thus enables us to optimize the objective without differentiating through a policy, which may be non-differentiable (e.g., a deterministic algorithm).
B.1 Implementation Details
Training. When training our policies for Best Identification, we warm start the training by first optimizing Simple Regret. This is because a randomly initialized policy performs so poorly that the Best Identification loss is nearly always 1, making it difficult to improve the policy. After training the $r$-dependent baseline policies in MAPO (Algorithm 1), we warm start the training of the final policy from one of them. In addition, our generating distribution parameterization exactly follows Section 3.1.
Loss functions. Instead of optimizing the approximated quantity from equation 8 directly, we add regularizers to the losses for both the policy and the generator. First, the objective in equation 10 is scaled by a large constant. To discourage the policy from over-committing to a certain action and/or the generating distribution from covering only a small subset of particles (i.e., mode collapse), we also add negative-entropy penalties to both the policy's output distribution and the softmax distribution over particles, each with its own scaling factor.
State representation. We parameterize the state as a flattened matrix in which each row corresponds to a distinct action. Specifically, at time $t$ the row corresponding to an action records the number of times that action has been taken, the inverse of that count, and the sum of the observations received from that action.
Policy MLP architecture. Our policy is a multi-layer perceptron with weights $w$. The policy takes the flattened state as input and outputs a vector of size $|\mathcal{X}|$, which is then pushed through a softmax to create a probability distribution over $\mathcal{X}$. At the end of the game, regardless of the policy's weights, the recommendation is the maximizer of $\langle z, \widehat{\theta} \rangle$ over $\mathcal{Z}$, where $\widehat{\theta}$ is the minimum-norm solution to the least-squares problem defined by the observations.
Our policy network is a simple 6-layer MLP whose input layer matches the state dimension and whose output layer has size $|\mathcal{X}|$, pushed through a softmax function to create a probability distribution over arms. All intermediate layers use leaky ReLU activations with a small negative slope. The 1D thresholds and 20 Questions experiments share the same network structure described above, differing only in the input and output sizes.
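A sketch of this architecture and state featurization; the hidden width and negative slope below are hypothetical values since the exact numbers are not reproduced above.

```python
import torch
import torch.nn as nn

class PolicyMLP(nn.Module):
    """6-layer MLP policy: flattened per-action statistics in, action distribution out."""
    def __init__(self, n_actions, stats_per_action=3, hidden=256, negative_slope=0.01):
        super().__init__()
        in_dim = n_actions * stats_per_action  # per action: pull count, its inverse, sum of observations
        dims = [in_dim, hidden, hidden, hidden, hidden, hidden, n_actions]
        layers = []
        for i in range(len(dims) - 1):
            layers.append(nn.Linear(dims[i], dims[i + 1]))
            if i < len(dims) - 2:
                layers.append(nn.LeakyReLU(negative_slope))
        self.net = nn.Sequential(*layers)

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)  # probability distribution over arms

def featurize(counts, obs_sums):
    """Flattened state: for each action, its pull count, 1/max(count, 1), and observation sum."""
    inv = 1.0 / torch.clamp(counts, min=1.0)
    return torch.cat([counts, inv, obs_sums], dim=-1)
```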
Appendix C Gradient Estimate Derivation
Here we derive the unbiased gradient estimates in equation 9, equation 10, and equation 11 of Algorithm 2. Since each of the gradient estimates above averages over identically distributed trajectories, it suffices to show that the gradient estimate is unbiased for a single problem instance and its rollout trajectory.
For a feasible $\theta$, the score-function identity (Aleksandrov et al., 1968) expresses the gradient of the expected loss as an expectation of the loss weighted by the gradient of the log-likelihood. Observe that if $\theta$ is drawn from the generating distribution and a trajectory is the result of rolling out a policy on $\theta$, then the corresponding score-weighted loss is an unbiased estimate of this gradient.
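For reference, the generic form of the score-function identity used throughout this appendix is, in our notation,

$$
\nabla_{\phi}\, \mathbb{E}_{x \sim p_{\phi}}\big[f(x)\big]
= \mathbb{E}_{x \sim p_{\phi}}\big[f(x)\, \nabla_{\phi} \log p_{\phi}(x)\big],
$$

so that $f(x)\, \nabla_{\phi} \log p_{\phi}(x)$ with $x \sim p_{\phi}$ is an unbiased estimate of the gradient.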
For a feasible set of particles and softmax weights, by definition of the objective,
(12)
where the last equality follows from the chain rule and the score-function identity (Aleksandrov et al., 1968). The quantity inside the expectations is then an unbiased estimator of the corresponding gradient, given that the instance and the rollout trajectory are sampled accordingly; the same reasoning yields an unbiased gradient estimator with respect to the particles themselves.
Likewise, for the policy, the analogous score-weighted loss is an unbiased estimate of the gradient with respect to the policy weights.
Appendix D Hyper-parameters
In this section, we list our hyperparameters. First, we define a coefficient that is multiplied with the binary losses so that the received losses are approximately on the same scale as Simple Regret. During our experiments, all optimizers are Adam. All budget sizes are $T = 20$. For fairness of evaluation, within each experiment (1D thresholds, 20 Questions, or Jester), all parameters below are shared when evaluating all of the policies. To elaborate on the training strategy proposed in MAPO (Algorithm 1), we divide training into four procedures, as indicated in Table 3:
- Init. The initialization procedure takes up a rather small portion of the iterations, primarily to optimize the particles so that they converge into the constrained difficulty sets. In addition, during initialization we initialize and freeze the softmax logits, thus putting a uniform distribution over the particles. This allows us to utilize the entire set of particles without converging to only a few particles early on. To initialize the particles, we sample a fraction of them uniformly from the instance space and the rest by sampling uniformly within each constrained difficulty set. We initialize our policy weights by Xavier initialization, with weights sampled from a normal distribution and appropriately scaled.
- Regret Training. Training with the Simple Regret objective usually takes the longest among the procedures. Its primary purpose is to let the policy converge to a reasonable warm start that already captures some essence of the task.
- Fine-tune baseline policies. Training with the Best Identification objective is run multiple times, once for each difficulty level with its corresponding complexity set. During each run, we start with a warm-started policy and reinitialize the rest of the models by running the initialization procedure, followed by optimizing the Best Identification objective.
- Fine-tune final policy. This procedure optimizes equation 3, with baselines evaluated using the policies learned in the previous procedure. Similar to fine-tuning each individual baseline, we warm start a policy and reinitialize the generating distribution by running the initialization procedure again.
| Procedure | Hyper-parameter | 1D Threshold | 20 Questions | Jester Joke |
|---|---|---|---|---|
| Init | iterations | 20000 (all) | | |
| | learning rate | (all) | | |
| | learning rate | (all) | | |
| | learning rate | (all) | | |
| Regret Training | iterations | 480000 (all) | | |
| | learning rate | (all) | | |
| | learning rate | (all) | | |
| | learning rate | (all) | | |
| Fine-tune | iterations for first baseline | 200000 | 0 | 200000 |
| | iterations for second baseline | 200000 | 1500000 | N/A |
| | iterations for final policy | 500000 | 250000 | 500000 |
| | learning rate | (all) | | |
| | learning rate | (all) | | |
| | learning rate | (all) | | |
| Adam Optimizer | $\beta_1$ | .9 (all) | | |
| | $\beta_2$ | .999 (all) | | |
| Procedure | Hyper-parameter | 1D Threshold | 20 Questions | Jester Joke |
|---|---|---|---|---|
| Init + Train + Fine-tune | M | 1000 | 500 | 500 |
| | L | 10 | 30 | 30 |
| | loss scaling | 7.5 | 30 | 30 |
| | entropy coefficient (regret) | .2 | .8 | .8 |
| | entropy coefficient (fine-tune) | .3 | .8 | .8 |
| | entropy coefficient (generator) | .05 | .1 | .05 |
| | | (all) | | |
To provide a general strategy for choosing hyper-parameters: firstly, the loss-scaling coefficient and learning rates are primarily tuned per experiment, as the noisiness and scale of the gradients and the entropy over the arms grow with the instance size. Secondly, the policy-entropy coefficient is primarily tuned per experiment, as it penalizes the entropy over the arms, which scales with the number of arms. Thirdly, the learning rate of the particles is primarily tuned for convergence of the particles into the restricted difficulty class; whether the constraint is satisfied after the specified number of initialization iterations is a good indicator. Finally, we choose the remaining size parameters (such as the number of particles sampled per round and the number of rollouts) based on the memory constraints of our GPU. The hyper-parameters for each experiment were tuned with fewer than 20 hyper-parameter assignments; metrics to monitor while tuning include, but are not limited to, the gradient magnitudes of each component, the convergence of each loss, and the entropy losses for each regularization term (how close each is to the entropy of a uniform distribution).
Appendix E Policy Evaluation
When evaluating a policy, we are essentially solving the following objective for a fixed policy $\pi$: maximize the expected loss $\ell(\pi, \theta)$ over instances $\theta$ in a given set of problems. However, due to the non-concavity of this objective in $\theta$, randomly initialized gradient ascent may converge to a poor local maximum. To reduce this possibility, we randomly initialize many iterates and take gradient steps round-robin, eliminating poorly performing trajectories. To do this with a fixed amount of computational resources, we apply the successive halving algorithm from Li et al. (2018). Concretely, this translates to:
- Initialize a pool of candidate instances and optimize each for 100 iterations.
- Keep the top-performing fraction of the candidates and optimize each for an additional number of iterations.
- Keep the top-performing fraction of the remaining candidates and optimize each for a final round of additional iterations.
We take gradient steps with the Adam optimizer (Kingma & Ba, 2014).
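A sketch of this successive-halving evaluation loop; the per-stage iteration counts and survivor fraction below are placeholders, since the exact values are not reproduced above.

```python
import numpy as np

def successive_halving_eval(policy, init_thetas, ascent_step, estimate_loss,
                            rounds=(100, 300, 600), keep_frac=1/3):
    """Find a hard instance for a fixed policy by racing many random initializations.

    init_thetas:   list of randomly initialized candidate instances.
    ascent_step:   function(theta) -> theta after one gradient-ascent step on the loss.
    estimate_loss: function(policy, theta) -> Monte Carlo estimate of the expected loss.
    """
    survivors = list(init_thetas)
    for n_iters in rounds:
        for _ in range(n_iters):
            survivors = [ascent_step(theta) for theta in survivors]
        scores = np.array([estimate_loss(policy, theta) for theta in survivors])
        k = max(1, int(len(survivors) * keep_frac))
        keep_idx = np.argsort(-scores)[:k]          # keep the candidates with the highest loss
        survivors = [survivors[i] for i in keep_idx]
    return max(survivors, key=lambda th: estimate_loss(policy, th))
```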
Appendix F Figures at Full Scale
(Full-scale versions of Figures 2–7.)
Appendix G Uncertainty Sampling
We define the symmetric difference of a set of binary vectors as the set of dimensions on which the vectors do not all agree.
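In code, one simple way to compute this is the following sketch:

```python
import numpy as np

def symmetric_difference(vectors):
    """Indices at which a collection of binary vectors disagree (not all equal)."""
    V = np.asarray(vectors)
    return np.flatnonzero(V.min(axis=0) != V.max(axis=0))

# Example: symmetric_difference([[1, 0, 1], [1, 1, 0]]) -> array([1, 2])
```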
Appendix H Learning to Actively Learn Algorithm
To train a policy under the learning to actively learn setting, we aim to minimize the expected loss under a fixed prior distribution over instances, where our policy and states are parameterized the same way as in Appendix B.1 for a fair comparison. To optimize the policy parameters, we take gradient steps like equation 11, but with instances sampled from the prior and trajectories rolled out accordingly. This gradient step follows from both the classical policy gradient algorithm in reinforcement learning and the recent learning to actively learn work of Kveton et al. (2020).
Moreover, note that the optimal policy for this objective can be taken to be deterministic, since deterministic policies are optimal for MDPs. Therefore, under our experimental setting, the deterministic Bayes-LAL policy will perform poorly in the adversarial evaluation (for the same reason that SGBS performs poorly).
Appendix I 20 Questions Setup
Hu et al. (2018) collected a dataset of 1000 celebrities and 500 possible questions to ask about each celebrity. We chose 100 questions out of the 500 by first constructing $\mathcal{X}$, $\mathcal{Z}$, and the instances for the full 500-dimensional data, and then sampling 100 of the dimensions without replacement from a distribution derived from a static allocation. We down-sampled the number of questions so that our training could run with enough particles and rollouts to de-noise the gradients while being prototyped on a single GPU.
Specifically, the dataset from Hu et al. (2018) consists of the probabilities of people answering Yes / No / Unknown to each celebrity-question pair, collected from some population. To better fit the combinatorial bandit scenario, we re-normalize the probabilities of Yes and No, conditioning on the event that these people did not answer Unknown. The probabilities of answering Yes to all 500 questions for each celebrity then constitute 1000 vectors, where each dimension of a given vector represents the probability of a yes to the corresponding question about that person. The action set $\mathcal{X}$ is then constructed from the questions, while the elements of $\mathcal{Z}$ are the binary vectors obtained by taking majority votes.
To sub-sample 100 questions from the 500, we could have selected the questions uniformly at random, but many of these questions are not very discriminative. Thus, we chose a "good" set of queries based on the experimental design recommended by Fiez et al. (2019). If questions were answered noiselessly in response to a particular celebrity, this setting would reduce to exactly identifying a binary vector in $\mathcal{Z}$. Since the design of Fiez et al. (2019) optimizes an allocation over questions that would reduce the number of required queries as much as possible (according to their information theoretic bound), and we want a single allocation that works for all celebrities simultaneously, we solve the corresponding optimization problem for one shared allocation. We then sample questions according to this optimal allocation without replacement and add them to $\mathcal{X}$ until it contains 100 questions.
Appendix J Jester Joke Recommendation Setup
We consider the Jester jokes dataset of Goldberg et al. (2001), which contains 100 jokes ranging from pun-based jokes to grossly offensive ones. We filter the dataset to only contain users that rated all jokes, resulting in 14116 users. A rating of each joke was provided on a continuous scale, which was rescaled to $[0, 1]$. Denote this set of ratings as a matrix whose rows encode the ratings of all jokes by each user. To construct the set of arms $\mathcal{Z}$, we then clustered the ratings of these users into 10 groups by minimizing the following metric:
To solve for the clustering, we adapt a standard clustering algorithm, using the metric above in place of the one used traditionally.