
Contextual Bandits with Large Action Spaces: Made Practical

Yinglun Zhu (yinglun@cs.wisc.edu)
Dylan J. Foster (dylanfoster@microsoft.com)
John Langford (jcl@microsoft.com)
Paul Mineiro (pmineiro@microsoft.com)

University of Wisconsin-Madison; Microsoft Research, NYC
Abstract

A central problem in sequential decision making is to develop algorithms that are practical and computationally efficient, yet support the use of flexible, general-purpose models. Focusing on the contextual bandit problem, recent progress provides provably efficient algorithms with strong empirical performance when the number of possible alternatives (“actions”) is small, but guarantees for decision making in large, continuous action spaces have remained elusive, leading to a significant gap between theory and practice. We present the first efficient, general-purpose algorithm for contextual bandits with continuous, linearly structured action spaces. Our algorithm makes use of computational oracles for (i) supervised learning, and (ii) optimization over the action space, and achieves sample complexity, runtime, and memory independent of the size of the action space. In addition, it is simple and practical. We perform a large-scale empirical evaluation, and show that our approach typically enjoys superior performance and efficiency compared to standard baselines.

1 Introduction

We consider the design of practical, theoretically motivated algorithms for sequential decision making with contextual information, better known as the contextual bandit problem. Here, a learning agent repeatedly receives a context (e.g., a user’s profile), selects an action (e.g., a news article to display), and receives a reward (e.g., whether the article was clicked). Contextual bandits are a useful model for decision making in unknown environments in which both exploration and generalization are required, but pose significant algorithm design challenges beyond classical supervised learning. Recent years have seen development on two fronts: On the theoretical side, extensive research into finite-action contextual bandits has resulted in practical, provably efficient algorithms capable of supporting flexible, general-purpose models (Langford and Zhang, 2007; Agarwal et al., 2014; Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2021; Foster and Krishnamurthy, 2021). Empirically, contextual bandits have been widely deployed in practice for online personalization and recommendation tasks (Li et al., 2010; Agarwal et al., 2016; Tewari and Murphy, 2017; Cai et al., 2021), leveraging the availability of high-quality action slates (e.g., subsets of candidate articles selected by an editor).

The developments above critically rely on the existence of a small number of possible decisions or alternatives. However, many applications demand the ability to make contextual decisions in large, potentially continuous spaces, where actions might correspond to images in a database or high-dimensional embeddings of rich documents such as webpages. Contextual bandits in large (e.g., million-action) settings remain a major challenge—both statistically and computationally—and constitute a substantial gap between theory and practice. In particular:

  • Existing general-purpose algorithms (Langford and Zhang, 2007; Agarwal et al., 2014; Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2021; Foster and Krishnamurthy, 2021) allow for the use of flexible models (e.g., neural networks, forests, or kernels) to facilitate generalization across contexts, but have sample complexity and computational requirements linear in the number of actions. These approaches can degrade in performance under benign operations such as duplicating actions.

  • While certain recent approaches extend the general-purpose methods above to accommodate large action spaces, they either require sample complexity exponential in action dimension (Krishnamurthy et al., 2020), or require additional distributional assumptions (Sen et al., 2021).

  • Various results efficiently handle large or continuous action spaces (Dani et al., 2008; Jun et al., 2017; Yang et al., 2021) with specific types of function approximation, but do not accommodate general-purpose models.

As a result of these algorithmic limitations, empirical aspects of contextual decision making in large action spaces have remained relatively unexplored compared to the small-action regime (Bietti et al., 2021), with little in the way of readily deployable out-of-the-box solutions.

Contributions

We provide the first efficient algorithms for contextual bandits with continuous, linearly structured action spaces and general function approximation. Following Chernozhukov et al. (2019); Xu and Zeevi (2020); Foster et al. (2020), we adopt a modeling approach, and assume rewards for each context-action pair $(x,a)$ are structured as

$$f^{\star}(x,a)=\left\langle\phi(x,a),g^{\star}(x)\right\rangle. \tag{1}$$

Here $\phi(x,a)\in\mathbb{R}^{d}$ is a known context-action embedding (or feature map) and $g^{\star}\in\mathcal{G}$ is a context embedding to be learned online, which belongs to an arbitrary, user-specified function class $\mathcal{G}$. Our algorithm, SpannerIGW, is computationally efficient (in particular, the runtime and memory are independent of the number of actions) whenever the user has access to (i) an online regression oracle for supervised learning over the reward function class, and (ii) an action optimization oracle capable of solving problems of the form

$$\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\langle\phi(x,a),\theta\right\rangle$$

for any $\theta\in\mathbb{R}^{d}$. The former oracle follows prior approaches to finite-action contextual bandits (Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2021; Foster and Krishnamurthy, 2021), while the latter generalizes efficient approaches to (non-contextual) linear bandits (McMahan and Blum, 2004; Dani et al., 2008; Bubeck et al., 2012; Hazan and Karnin, 2016). We provide a regret bound for SpannerIGW which scales as $\sqrt{\operatorname{poly}(d)\cdot T}$, and—like the computational complexity—is independent of the number of actions. Beyond these results, we provide a particularly practical variant of SpannerIGW (SpannerGreedy), which enjoys even faster runtime at the cost of slightly worse ($\operatorname{poly}(d)\cdot T^{2/3}$-type) regret.

Our techniques

On the technical side, we show how to efficiently combine the inverse gap weighting technique (Abe and Long, 1999; Foster and Rakhlin, 2020) previously used in the finite-action setting with optimal design-based approaches for exploration with linearly structured actions. This offers a computational improvement upon the results of Xu and Zeevi (2020); Foster et al. (2020), which provide algorithms with $\sqrt{\operatorname{poly}(d)\cdot T}$ regret for the setting we consider, but require enumeration over the action space. Conceptually, our results expand the class of problems for which minimax approaches to exploration (Foster et al., 2021b) can be made efficient.

Empirical performance

As with previous approaches based on regression oracles, SpannerIGW is simple, practical, and well-suited to flexible, general-purpose function approximation. In extensive experiments ranging from thousands to millions of actions, we find that our methods typically enjoy superior performance compared to existing baselines. In addition, our experiments validate the statistical model in Eq. 1, which we find to be well-suited to learning with large-scale language models (Devlin et al., 2019).

1.1 Organization

This paper is organized as follows. In Section 2, we formally introduce our statistical model and the computational oracles upon which our algorithms are built. Subsequent sections are dedicated to our main results.

  • As a warm-up, Section 3 presents a simplified algorithm, SpannerGreedy, which illustrates the principle of exploration over an approximate optimal design. This algorithm is practical and oracle-efficient, but has suboptimal $\operatorname{poly}(d)\cdot T^{2/3}$-type regret.

  • Building on these ideas, Section 4 presents our main algorithm, SpannerIGW, which combines the idea of approximate optimal design used by SpannerGreedy with the inverse gap weighting method (Abe and Long, 1999; Foster and Rakhlin, 2020), resulting in an oracle-efficient algorithm with $\sqrt{\operatorname{poly}(d)\cdot T}$ regret.

Section 5 presents empirical results for both algorithms. We close with discussion of additional related work (Section 6) and future directions (Section 7). All proofs are deferred to the appendix.

2 Problem Setting

The contextual bandit problem proceeds over $T$ rounds. At each round $t\in[T]$, the learner receives a context $x_{t}\in\mathcal{X}$ (the context space), selects an action $a_{t}\in\mathcal{A}$ (the action space), and then observes a reward $r_{t}(a_{t})$, where $r_{t}:\mathcal{A}\to[-1,1]$ is the underlying reward function. We assume that for each round $t$, conditioned on $x_{t}$, the reward $r_{t}$ is sampled from an (unknown) distribution $\mathbb{P}_{r_{t}}(\cdot\mid x_{t})$. We allow both the contexts $x_{1},\ldots,x_{T}$ and the distributions $\mathbb{P}_{r_{1}},\ldots,\mathbb{P}_{r_{T}}$ to be selected in an arbitrary, potentially adaptive fashion based on the history.
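To make the protocol concrete, here is a minimal sketch of the interaction loop in Python; the `environment` and `learner` objects are hypothetical stand-ins for the quantities defined above, not part of any library.

```python
def run_contextual_bandit(environment, learner, T):
    """One pass through the T-round protocol described above.

    `environment` and `learner` are hypothetical objects: the environment
    supplies the context x_t and the (stochastic) reward r_t(a_t), and the
    learner maps a context to an action and consumes bandit feedback.
    """
    rewards = []
    for t in range(T):
        x_t = environment.context(t)           # learner receives x_t
        a_t = learner.select_action(x_t)       # learner selects a_t in A
        r_t = environment.reward(t, x_t, a_t)  # only r_t(a_t) is observed
        learner.update(x_t, a_t, r_t)
        rewards.append(r_t)
    return sum(rewards)
```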

Function approximation

Following a standard approach to developing efficient contextual bandit methods, we take a modeling approach, and work with a user-specified class of regression functions $\mathcal{F}\subseteq(\mathcal{X}\times\mathcal{A}\rightarrow[-1,1])$ that aims to model the underlying mean reward function. We make the following realizability assumption (Agarwal et al., 2012; Foster et al., 2018; Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2021).

Assumption 1 (Realizability).

There exists a regression function $f^{\star}\in\mathcal{F}$ such that $\mathbb{E}[r_{t}(a)\mid x_{t}=x]=f^{\star}(x,a)$ for all $a\in\mathcal{A}$ and $t\in[T]$.

Without further assumptions, there exist function classes $\mathcal{F}$ for which the regret of any algorithm must grow proportionally to $\lvert\mathcal{A}\rvert$ (e.g., Agarwal et al. (2012)). In order to facilitate generalization across actions and achieve sample complexity and computational complexity independent of $\lvert\mathcal{A}\rvert$, we assume that each function $f\in\mathcal{F}$ is linear in a known (context-dependent) feature embedding of the action. Following Xu and Zeevi (2020); Foster et al. (2020), we assume that $\mathcal{F}$ takes the form

$$\mathcal{F}=\left\{f_{g}(x,a)=\langle\phi(x,a),g(x)\rangle:g\in\mathcal{G}\right\},$$

where $\phi(x,a)\in\mathbb{R}^{d}$ is a known, context-dependent action embedding and $\mathcal{G}$ is a user-specified class of context embedding functions.

This formulation assumes linearity in the action space (after featurization), but allows for nonlinear, learned dependence on the context $x$ through the function class $\mathcal{G}$, which can be taken to consist of neural networks, forests, or any other flexible function class a user chooses. For example, in news article recommendation, $\phi(x,a)=\phi(a)$ might correspond to an embedding of an article $a$ obtained using a large pre-trained language model, while $g(x)$ might correspond to a task-dependent embedding of a user $x$, which our methods can learn online. Well-studied special cases include the linear contextual bandit setting (Chu et al., 2011; Abbasi-Yadkori et al., 2011), which corresponds to the special case where each $g\in\mathcal{G}$ has the form $g(x)=\theta$ for some fixed $\theta\in\mathbb{R}^{d}$, as well as the standard finite-action contextual bandit setting, where $d=\lvert\mathcal{A}\rvert$ and $\phi(x,a)=e_{a}$.

We let $g^{\star}\in\mathcal{G}$ denote the embedding for which $f^{\star}=f_{g^{\star}}$. We assume that $\sup_{x\in\mathcal{X},a\in\mathcal{A}}\|\phi(x,a)\|\leq 1$ and $\sup_{g\in\mathcal{G},x\in\mathcal{X}}\|g(x)\|\leq 1$. In addition, we assume that $\operatorname{span}(\{\phi(x,a)\}_{a\in\mathcal{A}})=\mathbb{R}^{d}$ for all $x\in\mathcal{X}$.
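As a concrete instance of this function class, the following sketch (our own illustration, with a linear $g$ for simplicity) implements $f_{g}(x,a)=\langle\phi(x,a),g(x)\rangle$; any flexible, differentiable context embedding could be substituted for the hypothetical `LinearContextEmbedding`.

```python
import numpy as np

class LinearContextEmbedding:
    """g(x) = Theta @ x: the linear contextual bandit special case of G."""
    def __init__(self, d, context_dim, rng):
        self.Theta = 0.01 * rng.standard_normal((d, context_dim))

    def __call__(self, x):
        return self.Theta @ x  # context embedding g(x) in R^d

def f_g(phi, g, x, a):
    """Reward model f_g(x, a) = <phi(x, a), g(x)> from the class F."""
    return float(phi(x, a) @ g(x))
```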

Regret

For each regression function $f\in\mathcal{F}$, let $\pi_{f}(x_{t}):=\operatorname*{arg\,max}_{a\in\mathcal{A}}f(x_{t},a)$ denote the induced policy, and define $\pi^{\star}:=\pi_{f^{\star}}$ as the optimal policy. We measure the performance of the learner in terms of regret:

$$\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T):=\sum_{t=1}^{T}r_{t}(\pi^{\star}(x_{t}))-r_{t}(a_{t}).$$

2.1 Computational Oracles

To derive efficient algorithms with sublinear runtime, we make use of two computational oracles: First, following Foster and Rakhlin (2020); Simchi-Levi and Xu (2021); Foster et al. (2020, 2021a), we use an online regression oracle for supervised learning over the reward function class $\mathcal{F}$. Second, we use an action optimization oracle, which facilitates linear optimization over the action space $\mathcal{A}$ (McMahan and Blum, 2004; Dani et al., 2008; Bubeck et al., 2012; Hazan and Karnin, 2016).

Function approximation: Regression oracles

A fruitful approach to designing efficient contextual bandit algorithms is through reduction to supervised regression with the class $\mathcal{F}$, which facilitates the use of off-the-shelf supervised learning algorithms and models (Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2021; Foster et al., 2020, 2021a). Following Foster and Rakhlin (2020), we assume access to an online regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$, which is an algorithm for online learning (or, sequential prediction) with the square loss.

We consider the following protocol. At each round $t\in[T]$, the oracle produces an estimator $\widehat{f}_{t}=f_{\widehat{g}_{t}}$, then receives a context-action-reward tuple $(x_{t},a_{t},r_{t}(a_{t}))$. The goal of the oracle is to accurately predict the reward as a function of the context and action, and we evaluate its prediction error via the square loss $(\widehat{f}_{t}(x_{t},a_{t})-r_{t}(a_{t}))^{2}$. We measure the oracle's cumulative performance through square-loss regret to $\mathcal{F}$.

Assumption 2 (Bounded square-loss regret).

The regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$ guarantees that for any (potentially adaptively chosen) sequence $\{(x_{t},a_{t},r_{t}(a_{t}))\}_{t=1}^{T}$,

$$\sum_{t=1}^{T}\left(\widehat{f}_{t}(x_{t},a_{t})-r_{t}(a_{t})\right)^{2}-\inf_{f\in\mathcal{F}}\sum_{t=1}^{T}\left(f(x_{t},a_{t})-r_{t}(a_{t})\right)^{2}\leq\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T),$$

for some (non-data-dependent) function $\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)$.

We let $\mathcal{T}_{\mathsf{Sq}}$ denote an upper bound on the time required to (i) query the oracle's estimator $\widehat{g}_{t}$ with $x_{t}$ and receive the vector $\widehat{g}_{t}(x_{t})\in\mathbb{R}^{d}$, and (ii) update the oracle with the example $(x_{t},a_{t},r_{t}(a_{t}))$. We let $\mathcal{M}_{\mathsf{Sq}}$ denote the maximum memory used by the oracle throughout its execution.

Online regression is a well-studied problem, with computationally efficient algorithms for many models. Basic examples include finite classes $\mathcal{F}$, where one can attain $\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)=O(\log\lvert\mathcal{F}\rvert)$ (Vovk, 1998), and linear models ($g(x)=\theta$), where the online Newton step algorithm (Hazan et al., 2007) satisfies Assumption 2 with $\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)=O(d\log T)$. More generally, even for classes such as deep neural networks for which provable guarantees may not be available, regression is well-suited to gradient-based methods. We refer to Foster and Rakhlin (2020); Foster et al. (2020) for more comprehensive discussion.
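As an illustration of the oracle protocol in Assumption 2, the following is a minimal online least-squares oracle for the linear case $g(x)=\theta$, trained with plain online gradient descent; it is a sketch of the interface only, not the online Newton step method cited above.

```python
import numpy as np

class OnlineLinearRegressionOracle:
    """Online square-loss regression for f_theta(x, a) = <phi(x, a), theta>.

    Implements the predict-then-update protocol of Assumption 2 with
    online gradient descent (online Newton step would give the
    O(d log T) regret cited in the text)."""
    def __init__(self, d, lr=0.1):
        self.theta = np.zeros(d)
        self.lr = lr

    def predict(self, phi_xa):
        # Oracle's current estimate f_hat_t(x_t, a_t).
        return float(phi_xa @ self.theta)

    def update(self, phi_xa, reward):
        # Gradient of (<phi, theta> - r)^2 with respect to theta.
        residual = self.predict(phi_xa) - reward
        self.theta -= self.lr * 2.0 * residual * phi_xa
```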

Large action spaces: Action optimization oracles

The regression oracle setup in the prequel is identical to that considered in the finite-action setting (Foster and Rakhlin, 2020). In order to develop efficient algorithms for large or infinite action spaces, we assume access to an oracle for linear optimization over actions.

Definition 1 (Action optimization oracle).

An action optimization oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$ takes as input a context $x\in\mathcal{X}$ and a vector $\theta\in\mathbb{R}^{d}$, and returns

$$a^{\star}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\langle\phi(x,a),\theta\right\rangle. \tag{2}$$

We let $\mathcal{T}_{\mathsf{Opt}}$ denote a bound on the runtime for a single query to the oracle, and let $\mathcal{M}_{\mathsf{Opt}}$ denote the maximum memory used by the oracle throughout its execution.

The action optimization oracle in Eq. 2 is widely used throughout the literature on linear bandits (Dani et al., 2008; Chen et al., 2017; Cao and Krishnamurthy, 2019; Katz-Samuels et al., 2020), and can be implemented in polynomial time for standard combinatorial action spaces. It is a basic computational primitive in the theory of convex optimization, and when $\mathcal{A}$ is convex, it is equivalent (up to polynomial-time reductions) to other standard primitives such as separation oracles and membership oracles (Schrijver, 1998; Grötschel et al., 2012). It is also equivalent to the well-known Maximum Inner Product Search (MIPS) problem (Shrivastava and Li, 2014), for which sublinear-time hashing-based methods are available.
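When the action set is given explicitly as a matrix of embeddings, the oracle in Definition 1 is a single maximum-inner-product computation; the brute-force sketch below is our own (MIPS-style hashing, as noted above, can make this sublinear in $\lvert\mathcal{A}\rvert$).

```python
import numpy as np

def action_optimization_oracle(Phi, theta):
    """Brute-force oracle for Eq. (2) over a finite action set.

    Phi:   (num_actions, d) array whose rows are phi(x, a).
    theta: (d,) query vector.
    Returns the index of argmax_a <phi(x, a), theta>.
    """
    return int(np.argmax(Phi @ theta))
```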

Example 1.

Let $G=(V,E)$ be a graph, let $\phi(x,a)\in\{0,1\}^{\lvert E\rvert}$ represent a matching, and let $\theta\in\mathbb{R}^{\lvert E\rvert}$ be a vector of edge weights. The problem of finding the maximum-weight matching for a given set of edge weights can be written as a linear optimization problem of the form in Eq. 2, and Edmonds' algorithm (Edmonds, 1965) can be used to find the maximum-weight matching in $O(\lvert V\rvert^{2}\cdot\lvert E\rvert)$ time.

Other combinatorial problems that admit polynomial-time action optimization oracles include the maximum-weight spanning tree problem, the assignment problem, and others (Awerbuch and Kleinberg, 2008; Cesa-Bianchi and Lugosi, 2012).
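For the matching setting of Example 1, a hedged sketch using networkx's blossom-based `max_weight_matching` (assuming the weight vector `theta` is keyed consistently with `G.edges()`; not a verbatim implementation from the paper):

```python
import networkx as nx

def matching_oracle(G, theta):
    """Action optimization oracle for the matching action space.

    G:     networkx graph whose edges index the coordinates of theta.
    theta: dict mapping each edge (u, v) to its weight.
    Returns the maximum-weight matching, i.e. the support of the
    indicator vector phi(x, a) maximizing <phi(x, a), theta>.
    """
    for u, v in G.edges():
        G[u][v]["weight"] = theta[(u, v)]
    # Blossom-based solver in the spirit of Edmonds' algorithm.
    return nx.max_weight_matching(G, weight="weight")
```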

Action representation

We define $b_{\mathcal{A}}$ as the number of bits used to represent actions in $\mathcal{A}$, which is always upper bounded by $O(\log\lvert\mathcal{A}\rvert)$ for finite action sets, and by $\widetilde{O}(d)$ for actions that can be represented as vectors in $\mathbb{R}^{d}$. Tighter bounds are possible with additional structural assumptions. Since representing actions is a minimal assumption, we hide the dependence on $b_{\mathcal{A}}$ in big-$O$ notation for our runtime and memory analysis.

2.2 Additional Notation

We adopt non-asymptotic big-oh notation: For functions $f,g:\mathcal{Z}\to\mathbb{R}_{+}$, we write $f=O(g)$ (resp. $f=\Omega(g)$) if there exists a constant $C>0$ such that $f(z)\leq Cg(z)$ (resp. $f(z)\geq Cg(z)$) for all $z\in\mathcal{Z}$. We write $f=\widetilde{O}(g)$ if $f=O(g\cdot\mathrm{polylog}(T))$, and $f=\widetilde{\Omega}(g)$ if $f=\Omega(g/\mathrm{polylog}(T))$. We use $\lesssim$ only in informal statements to highlight salient elements of an inequality.

For a vector $z\in\mathbb{R}^{d}$, we let $\|z\|$ denote the Euclidean norm. We define $\|z\|_{W}^{2}:=\langle z,Wz\rangle$ for a positive definite matrix $W\in\mathbb{R}^{d\times d}$. For an integer $n\in\mathbb{N}$, we let $[n]$ denote the set $\{1,\dots,n\}$. For a set $\mathcal{Z}$, we let $\Delta(\mathcal{Z})$ denote the set of all Radon probability measures over $\mathcal{Z}$. We let $\operatorname{conv}(\mathcal{Z})$ denote the set of all finitely supported convex combinations of elements in $\mathcal{Z}$. When $\mathcal{Z}$ is finite, we let $\mathrm{unif}(\mathcal{Z})$ denote the uniform distribution over all the elements in $\mathcal{Z}$. We let $\mathbb{I}_{z}\in\Delta(\mathcal{Z})$ denote the delta distribution on $z$. We use the conventions $a\wedge b=\min\{a,b\}$ and $a\vee b=\max\{a,b\}$.

3 Warm-Up: Efficient Algorithms via Uniform Exploration

In this section, we present our first result: an efficient algorithm based on uniform exploration over a representative basis (SpannerGreedy; Algorithm 1). This algorithm achieves computational efficiency by taking advantage of an online regression oracle, but its regret bound has sub-optimal dependence on TT. Beyond being practically useful in its own right, this result serves as a warm-up for Section 4.

Our algorithm is based on exploration with a G-optimal design for the embedding $\phi$, which is a distribution over actions that minimizes a certain notion of worst-case variance (Kiefer and Wolfowitz, 1960; Atwood, 1969).

Definition 2 (G-optimal design).

Let a set $\mathcal{Z}\subseteq\mathbb{R}^{d}$ be given. A distribution $q\in\Delta(\mathcal{Z})$ is said to be a G-optimal design with approximation factor $C_{\operatorname{opt}}\geq 1$ if

$$\sup_{z\in\mathcal{Z}}\|z\|_{V(q)^{-1}}^{2}\leq C_{\operatorname{opt}}\cdot d,$$

where $V(q):=\mathbb{E}_{z\sim q}\big[zz^{\top}\big]$.
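The design condition in Definition 2 is easy to check numerically for a finite set; the sketch below (our own) computes the worst-case value $\sup_{z\in\mathcal{Z}}\|z\|_{V(q)^{-1}}^{2}/d$, i.e. the achieved approximation factor $C_{\operatorname{opt}}$.

```python
import numpy as np

def design_approximation_factor(Z, q):
    """Check the G-optimal design condition of Definition 2.

    Z: (n, d) array whose rows are the points z in the set.
    q: (n,) array of design weights (a distribution over rows of Z).
    Returns sup_z ||z||^2_{V(q)^{-1}} / d, the achieved C_opt.
    """
    d = Z.shape[1]
    V = (Z * q[:, None]).T @ Z              # V(q) = E_{z~q}[z z^T]
    V_inv = np.linalg.pinv(V)
    variances = np.einsum("ij,jk,ik->i", Z, V_inv, Z)  # z^T V^{-1} z
    return float(variances.max()) / d
```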

The following classical result guarantees existence of a G-optimal design.

Lemma 1 (Kiefer and Wolfowitz (1960)).

For any compact set $\mathcal{Z}\subseteq\mathbb{R}^{d}$, there exists a G-optimal design with $C_{\operatorname{opt}}=1$.

Algorithm 1 SpannerGreedy
0: Exploration parameter $\varepsilon\in(0,1]$, online regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$, action optimization oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$.
1: for $t=1,2,\dots,T$ do
2:   Observe context $x_{t}$.
3:   Receive $\widehat{f}_{t}=f_{\widehat{g}_{t}}$ from regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$.
4:   Get $\widehat{a}_{t}\leftarrow\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\langle\phi(x_{t},a),\widehat{g}_{t}(x_{t})\right\rangle$.
5:   Call subroutine to compute $C_{\operatorname{opt}}$-approximate optimal design $q_{t}^{\operatorname{opt}}\in\Delta(\mathcal{A})$ for the set $\{\phi(x_{t},a)\}_{a\in\mathcal{A}}$. // See Algorithm 5 for efficient solver.
6:   Define $p_{t}:=\varepsilon\cdot q_{t}^{\operatorname{opt}}+(1-\varepsilon)\cdot\mathbb{I}_{\widehat{a}_{t}}$.
7:   Sample $a_{t}\sim p_{t}$ and observe reward $r_{t}(a_{t})$.
8:   Update oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$ with $(x_{t},a_{t},r_{t}(a_{t}))$.

Algorithm 1 uses optimal design as a basis for exploration: At each round, the learner obtains an estimator $\widehat{f}_{t}$ from the regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$, then appeals to a subroutine to compute an (approximate) G-optimal design $q_{t}^{\operatorname{opt}}\in\Delta(\mathcal{A})$ for the action embeddings $\{\phi(x_{t},a)\}_{a\in\mathcal{A}}$. Given an exploration parameter $\varepsilon>0$, the algorithm then samples an action $a_{t}\sim q_{t}^{\operatorname{opt}}$ from the optimal design with probability $\varepsilon$ (“exploration”), or plays the greedy action $\widehat{a}_{t}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{f}_{t}(x_{t},a)$ with probability $1-\varepsilon$ (“exploitation”). Algorithm 1 is efficient whenever an approximate optimal design can be computed efficiently, which can be achieved using Algorithm 5. We defer a detailed discussion of efficiency for a moment, and first state the main regret bound for the algorithm.
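A minimal sketch of one round of this scheme, assuming a finite embedded action set and a regression oracle that returns the vector $\widehat{g}_{t}(x_{t})$ (hypothetical helper names, not the paper's released code):

```python
import numpy as np

def spanner_greedy_round(Phi, g_hat_x, design_q, epsilon, rng):
    """One round of SpannerGreedy over a finite embedded action set.

    Phi:      (num_actions, d) array of embeddings phi(x_t, a).
    g_hat_x:  (d,) vector g_hat_t(x_t) returned by the regression oracle.
    design_q: (num_actions,) approximate optimal design weights q_t^opt.
    epsilon:  exploration probability.
    Returns the index of the sampled action a_t.
    """
    scores = Phi @ g_hat_x              # f_hat_t(x_t, a) for every action
    a_greedy = int(np.argmax(scores))   # exploit with probability 1 - eps
    if rng.random() < epsilon:          # explore from the optimal design
        return int(rng.choice(len(design_q), p=design_q))
    return a_greedy
```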

Theorem 1.

With a $C_{\operatorname{opt}}$-approximate optimal design subroutine and an appropriate choice for $\varepsilon\in(0,1]$, Algorithm 1, with probability at least $1-\delta$, enjoys regret

$$\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T)=O\left((C_{\operatorname{opt}}\cdot d)^{1/3}\,T^{2/3}\,(\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+\log(\delta^{-1}))^{1/3}\right).$$

In particular, when invoked with Algorithm 5 (with $C=2$) as a subroutine, the algorithm enjoys regret

$$\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T)=O\left(d^{2/3}\,T^{2/3}\,(\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+\log(\delta^{-1}))^{1/3}\right),$$

and has per-round runtime $O(\mathcal{T}_{\mathsf{Sq}}+\mathcal{T}_{\mathsf{Opt}}\cdot d^{2}\log d+d^{4}\log d)$ and maximum memory $O(\mathcal{M}_{\mathsf{Sq}}+\mathcal{M}_{\mathsf{Opt}}+d^{2})$.

Computational efficiency

The computational efficiency of Algorithm 1 hinges on the ability to efficiently compute an approximate optimal design (or, by convex duality, the John ellipsoid (John, 1948)) for the set $\{\phi(x_{t},a)\}_{a\in\mathcal{A}}$. All off-the-shelf optimal design solvers that we are aware of require solving quadratic maximization subproblems, which in general cannot be reduced to a linear optimization oracle (Definition 1). While there are some special cases where efficient solvers exist (e.g., when $\mathcal{A}$ is a polytope (Cohen et al. (2019) and references therein)), computing an exact optimal design is NP-hard in general (Grötschel et al., 2012; Summa et al., 2014). To overcome this issue, we use the notion of a barycentric spanner, which acts as an approximate optimal design and can be computed efficiently using an action optimization oracle.

Definition 3 (Awerbuch and Kleinberg (2008)).

Let a compact set $\mathcal{Z}\subseteq\mathbb{R}^{d}$ of full dimension be given. For $C\geq 1$, a subset of points $\mathcal{S}=\{z_{1},\dots,z_{d}\}\subseteq\mathcal{Z}$ is said to be a $C$-approximate barycentric spanner for $\mathcal{Z}$ if every point $z\in\mathcal{Z}$ can be expressed as a weighted combination of points in $\mathcal{S}$ with coefficients in $[-C,C]$.

The following result shows that any barycentric spanner yields an approximate optimal design.

Lemma 2.

If $\mathcal{S}=\{z_{1},\dots,z_{d}\}$ is a $C$-approximate barycentric spanner for $\mathcal{Z}\subseteq\mathbb{R}^{d}$, then $q:=\mathrm{unif}(\mathcal{S})$ is a $(C^{2}\cdot d)$-approximate optimal design.

Using an algorithm introduced by Awerbuch and Kleinberg (2008), one can efficiently compute a $C$-approximate barycentric spanner for the set $\{\phi(x,a)\}_{a\in\mathcal{A}}$ using $O(d^{2}\log_{C}d)$ calls to an action optimization oracle; their method is restated as Algorithm 5 in Appendix A.
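The sketch below illustrates the Awerbuch-Kleinberg approach under our own simplifying conventions: it is a hedged restatement, not a verbatim copy of Algorithm 5. Here `oracle(theta)` plays the role of Definition 1 composed with $\phi$, returning $\operatorname{arg\,max}_{z}\langle z,\theta\rangle$ over the embedded action set as a vector in $\mathbb{R}^{d}$; the key fact used is that the determinant is linear in any single column.

```python
import numpy as np

def linear_det_coeffs(S, i):
    """Coefficient vector c with det(S with column i set to z) = <c, z>.

    S: (d, d) matrix whose columns are the current spanner candidates.
    By multilinearity, c[j] = det(S with column i replaced by e_j)."""
    d = S.shape[0]
    c, M = np.zeros(d), S.copy()
    for j in range(d):
        M[:, i] = np.eye(d)[:, j]
        c[j] = np.linalg.det(M)
    return c

def barycentric_spanner(oracle, d, C=2.0):
    """Awerbuch-Kleinberg-style spanner via a linear optimization oracle."""
    S = np.eye(d)  # placeholder columns, replaced during the first pass
    for i in range(d):
        c = linear_det_coeffs(S, i)
        z = max(oracle(c), oracle(-c), key=lambda v: abs(v @ c))
        S[:, i] = z
    improved = True
    while improved:  # swap phase: stop when no column improves by factor C
        improved = False
        base = abs(np.linalg.det(S))
        for i in range(d):
            c = linear_det_coeffs(S, i)
            z = max(oracle(c), oracle(-c), key=lambda v: abs(v @ c))
            if abs(z @ c) >= C * base:
                S[:, i] = z
                improved = True
                break  # recompute det(S) before further swaps
    return S  # columns form a C-approximate barycentric spanner
```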

Key features of Algorithm 1

While the regret bound for Algorithm 1 scales with T2/3T^{2/3}, which is not optimal, this result constitutes the first computationally efficient algorithm for contextual bandits with linearly structured actions and general function approximation. Additional features include:

  • Simplicity and practicality. Appealing to uniform exploration makes Algorithm 1 easy to implement and highly practical. In particular, in the case where the action embedding does not depend on the context (i.e., $\phi(x,a)=\phi(a)$), an approximate design can be precomputed and reused, reducing the per-round runtime to $\widetilde{O}(\mathcal{T}_{\mathsf{Sq}}+\mathcal{T}_{\mathsf{Opt}})$ and the maximum memory to $O(\mathcal{M}_{\mathsf{Sq}}+d)$.

  • Lifting optimal design to contextual bandits. Previous bandit algorithms based on optimal design are limited to the non-contextual setting, and to pure exploration. Our result highlights for the first time that optimal design can be efficiently combined with general function approximation.

Proof sketch for Theorem 1

To analyze Algorithm 1, we follow a recipe introduced by Foster and Rakhlin (2020); Foster et al. (2021b) based on the Decision-Estimation Coefficient (DEC) (the original definition in Foster et al. (2021b) uses Hellinger distance rather than squared error; the squared error version we consider here leads to tighter guarantees for bandit problems where the mean rewards serve as a sufficient statistic), defined as $\mathsf{dec}_{\gamma}(\mathcal{F}):=\sup_{\widehat{f}\in\operatorname{conv}(\mathcal{F}),x\in\mathcal{X}}\mathsf{dec}_{\gamma}(\mathcal{F};\widehat{f},x)$, where

$$\mathsf{dec}_{\gamma}(\mathcal{F};\widehat{f},x):=\inf_{p\in\Delta(\mathcal{A})}\sup_{a^{\star}\in\mathcal{A}}\sup_{f^{\star}\in\mathcal{F}}\mathbb{E}_{a\sim p}\Big[f^{\star}(x,a^{\star})-f^{\star}(x,a)-\gamma\cdot(\widehat{f}(x,a)-f^{\star}(x,a))^{2}\Big]. \tag{3}$$

Foster et al. (2021b) consider a meta-algorithm which, at each round $t$, (i) computes $\widehat{f}_{t}$ by appealing to a regression oracle, (ii) computes a distribution $p_{t}\in\Delta(\mathcal{A})$ that solves the minimax problem in Eq. 3 with $x_{t}$ and $\widehat{f}_{t}$ plugged in, and (iii) chooses the action $a_{t}$ by sampling from this distribution. One can show (Lemma 7 in Appendix A) that for any $\gamma>0$, this strategy enjoys the following regret bound:

$$\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T)\lesssim T\cdot\mathsf{dec}_{\gamma}(\mathcal{F})+\gamma\cdot\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T). \tag{4}$$

More generally, if one computes a distribution that does not solve Eq. 3 exactly, but instead certifies an upper bound on the DEC of the form $\mathsf{dec}_{\gamma}(\mathcal{F})\leq\overline{\mathsf{dec}}_{\gamma}(\mathcal{F})$, the same result holds with $\mathsf{dec}_{\gamma}(\mathcal{F})$ replaced by $\overline{\mathsf{dec}}_{\gamma}(\mathcal{F})$. Algorithm 1 is a special case of this meta-algorithm, so to bound the regret it suffices to show that the exploration strategy in the algorithm certifies a bound on the DEC.

Lemma 3.

For any $\gamma\geq 1$, by choosing $\varepsilon=\sqrt{C_{\operatorname{opt}}\cdot d/4\gamma}\wedge 1$, the exploration strategy in Algorithm 1 certifies that $\mathsf{dec}_{\gamma}(\mathcal{F})=O(\sqrt{C_{\operatorname{opt}}\cdot d/\gamma})$.

Using Lemma 3, one can upper bound the first term in Eq. 4 by $O(T\sqrt{C_{\operatorname{opt}}\cdot d/\gamma})$. The regret bound in Theorem 1 follows by choosing $\gamma$ to balance the two terms.

4 Efficient, Near-Optimal Algorithms

In this section we present SpannerIGW (Algorithm 2), an efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret. We provide the algorithm and statistical guarantees in Section 4.1, then discuss computational efficiency in Section 4.2.

4.1 Algorithm and Statistical Guarantees

Building on the approach in Section 3, SpannerIGW uses the idea of exploration with an optimal design. However, in order to achieve $\sqrt{T}$ regret, we combine optimal design with the inverse gap weighting (IGW) technique previously used in the finite-action contextual bandit setting (Abe and Long, 1999; Foster and Rakhlin, 2020).

Recall that for finite-action contextual bandits, the inverse gap weighting technique works as follows. Given a context $x_{t}$ and estimator $\widehat{f}_{t}$ from the regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$, we assign a distribution to actions in $\mathcal{A}$ via the rule

$$p_{t}(a):=\frac{1}{\lambda+\gamma\cdot\big(\widehat{f}_{t}(x_{t},\widehat{a}_{t})-\widehat{f}_{t}(x_{t},a)\big)},$$

where $\widehat{a}_{t}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{f}_{t}(x_{t},a)$ and $\lambda>0$ is chosen such that $\sum_{a}p_{t}(a)=1$. This strategy certifies that $\mathsf{dec}_{\gamma}(\mathcal{F};\widehat{f}_{t},x_{t})\leq\frac{\lvert\mathcal{A}\rvert}{\gamma}$, which leads to regret $O\big(\sqrt{\lvert\mathcal{A}\rvert\,T\cdot\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)}\big)$. While this is essentially optimal for the finite-action setting, the linear dependence on $\lvert\mathcal{A}\rvert$ makes it unsuitable for the large-action setting we consider.
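For reference, a standard way to compute the finite-action IGW distribution is binary search for the normalizing constant $\lambda$, since the total mass is monotone decreasing in $\lambda$; the sketch below is our own.

```python
import numpy as np

def inverse_gap_weighting(scores, gamma, tol=1e-10):
    """Finite-action IGW distribution for estimated rewards `scores`.

    p(a) = 1 / (lambda + gamma * (max_score - scores[a])), with lambda
    chosen by binary search so that the probabilities sum to one.
    """
    gaps = scores.max() - scores           # nonnegative gaps
    lo, hi = 1e-12, float(len(scores))     # at lambda = K the total is <= 1
    while hi - lo > tol:
        lam = (lo + hi) / 2.0
        total = np.sum(1.0 / (lam + gamma * gaps))
        lo, hi = (lam, hi) if total > 1.0 else (lo, lam)
    p = 1.0 / (lo + gamma * gaps)
    return p / p.sum()                     # renormalize residual error
```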

To lift the IGW strategy to the large-action setting, Algorithm 2 combines it with optimal design with respect to a reweighted embedding. Let $\widehat{f}\in\mathcal{F}$ be given. For each action $a\in\mathcal{A}$, we define a reweighted embedding via

$$\bar{\phi}(x,a):=\frac{\phi(x,a)}{\sqrt{1+\eta\big(\widehat{f}(x,\widehat{a})-\widehat{f}(x,a)\big)}}, \tag{5}$$

where $\widehat{a}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{f}(x,a)$ and $\eta>0$ is a reweighting parameter to be tuned later. This reweighting is action-dependent, since the $\widehat{f}(x,a)$ term appears in the denominator. Within Algorithm 2, we compute a new reweighted embedding at each round $t\in[T]$ using $\widehat{f}_{t}=f_{\widehat{g}_{t}}$, the output of the regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$.

Algorithm 2 proceeds by computing an optimal design $q_{t}^{\operatorname{opt}}\in\Delta(\mathcal{A})$ with respect to the reweighted embedding defined in Eq. 5. The algorithm then creates a distribution $q_{t}:=\frac{1}{2}q_{t}^{\operatorname{opt}}+\frac{1}{2}\mathbb{I}_{\widehat{a}_{t}}$ by mixing the optimal design with a delta mass at the greedy action $\widehat{a}_{t}$. Finally, in Eq. 6, the algorithm computes an augmented version of the inverse gap weighting distribution by reweighting according to $q_{t}$. This approach certifies the following bound on the Decision-Estimation Coefficient.

Lemma 4.

For any $\gamma>0$, by setting $\eta=\gamma/(C_{\operatorname{opt}}\cdot d)$, the exploration strategy used in Algorithm 2 certifies that $\mathsf{dec}_{\gamma}(\mathcal{F})=O(C_{\operatorname{opt}}\cdot d/\gamma)$.

This lemma shows that the reweighted IGW strategy enjoys the best of both worlds: By leveraging optimal design, we ensure good coverage for all actions, leading to $O(d)$ (rather than $O(\lvert\mathcal{A}\rvert)$) scaling, and by leveraging inverse gap weighting, we avoid excessive exploration, leading to $O(1/\gamma)$ rather than $O(1/\sqrt{\gamma})$ scaling. Combining this result with Lemma 7 leads to our main regret bound for SpannerIGW.
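A sketch of the augmented sampling rule in Eq. 6 over the support of $q_{t}$ (our own illustration; it assumes, as the algorithm guarantees, that the normalizing constant lies in $[\frac{1}{2},1]$ and that $\widehat{a}_{t}\in\operatorname{supp}(q_{t})$):

```python
import numpy as np

def spanner_igw_distribution(scores_support, q_support, eta, tol=1e-10):
    """Augmented IGW distribution of Eq. (6) over supp(q_t).

    scores_support: f_hat_t(x_t, a) for each action in supp(q_t); the
                    greedy action is included via the 1/2 delta mass.
    q_support:      q_t(a) = (1/2) q_t^opt(a) + (1/2) 1{a = a_hat_t}.
    """
    gaps = scores_support.max() - scores_support
    lo, hi = 0.5, 1.0                   # lambda is known to lie in [1/2, 1]
    while hi - lo > tol:
        lam = (lo + hi) / 2.0
        total = np.sum(q_support / (lam + eta * gaps))
        lo, hi = (lam, hi) if total > 1.0 else (lo, lam)
    p = q_support / (lo + eta * gaps)
    return p / p.sum()                  # renormalize residual error
```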

Algorithm 2 SpannerIGW
0: Exploration parameter $\gamma>0$, online regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$, action optimization oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$.
1: Define $\eta:=\frac{\gamma}{C_{\operatorname{opt}}\cdot d}$.
2: for $t=1,2,\dots,T$ do
3:   Observe context $x_{t}$.
4:   Receive $\widehat{f}_{t}=f_{\widehat{g}_{t}}$ from regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$.
5:   Get $\widehat{a}_{t}\leftarrow\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\langle\phi(x_{t},a),\widehat{g}_{t}(x_{t})\right\rangle$.
6:   Call subroutine to compute $C_{\operatorname{opt}}$-approximate optimal design $q_{t}^{\operatorname{opt}}\in\Delta(\mathcal{A})$ for the reweighted embedding $\{\bar{\phi}(x_{t},a)\}_{a\in\mathcal{A}}$ (Eq. 5 with $\widehat{f}=\widehat{f}_{t}$). // See Algorithm 3 for efficient solver.
7:   Define $q_{t}:=\frac{1}{2}q_{t}^{\operatorname{opt}}+\frac{1}{2}\mathbb{I}_{\widehat{a}_{t}}$.
8:   For each $a\in\operatorname{supp}(q_{t})$, define
$$p_{t}(a):=\frac{q_{t}(a)}{\lambda+\eta\big(\widehat{f}_{t}(x_{t},\widehat{a}_{t})-\widehat{f}_{t}(x_{t},a)\big)}, \tag{6}$$
where $\lambda\in[\frac{1}{2},1]$ is chosen so that $\sum_{a\in\operatorname{supp}(q_{t})}p_{t}(a)=1$.
9:   Sample $a_{t}\sim p_{t}$ and observe reward $r_{t}(a_{t})$.
10:  Update $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$ with $(x_{t},a_{t},r_{t}(a_{t}))$.
Theorem 2.

Let $\delta\in(0,1)$ be given. With a $C_{\operatorname{opt}}$-approximate optimal design subroutine and an appropriate choice for $\gamma>0$, Algorithm 2 ensures that with probability at least $1-\delta$,

$$\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T)=O\left(\sqrt{C_{\operatorname{opt}}\cdot d\,T\,\big(\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+\log(\delta^{-1})\big)}\right).$$

In particular, when invoked with Algorithm 3 (with $C=2$) as a subroutine, the algorithm has

$$\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T)=O\left(d\,\sqrt{T\,\big(\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+\log(\delta^{-1})\big)}\right),$$

and has per-round runtime $O\big(\mathcal{T}_{\mathsf{Sq}}+(\mathcal{T}_{\mathsf{Opt}}\cdot d^{3}+d^{4})\cdot\log^{2}\big(\frac{T}{r}\big)\big)$ and maximum memory $O\big(\mathcal{M}_{\mathsf{Sq}}+\mathcal{M}_{\mathsf{Opt}}+d^{2}+d\log\big(\frac{T}{r}\big)\big)$.

Algorithm 2 is the first computationally efficient algorithm with $\sqrt{T}$-regret for contextual bandits with general function approximation and linearly structured action spaces. In what follows, we show how to leverage the action optimization oracle (Definition 1) to achieve this efficiency.

4.2 Computational Efficiency

The computational efficiency of Algorithm 2 hinges on the ability to efficiently compute an optimal design. As with Algorithm 1, we address this issue by appealing to the notion of a barycentric spanner, which serves as an approximate optimal design. However, compared to Algorithm 1, a substantial additional challenge is that Algorithm 2 requires an approximate optimal design for the reweighted embeddings. Since the reweighting is action-dependent, the action optimization oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$ cannot be directly applied to optimize over the reweighted embeddings, which prevents us from appealing to an out-of-the-box solver (Algorithm 5) in the same fashion as the prequel.

Algorithm 3 ReweightedSpanner
0: Context $x\in\mathcal{X}$, oracle prediction $\widehat{g}(x)\in\mathbb{R}^{d}$, action $\widehat{a}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}\langle\phi(x,a),\widehat{g}(x)\rangle$, reweighting parameter $\eta>0$, approximation factor $C>\sqrt{2}$, initial set $\mathcal{S}=(a_{1},\dots,a_{d})$ with $\lvert\det(\phi(x,\mathcal{S}))\rvert\geq r^{d}$ for $r\in(0,1)$.
1: while not break do
2:   for $i=1,\dots,d$ do
3:     Compute $\theta\in\mathbb{R}^{d}$ representing the linear function $\bar{\phi}(x,a)\mapsto\det(\bar{\phi}(x,\mathcal{S}_{i}(a)))$, where $\mathcal{S}_{i}(a):=(a_{1},\ldots,a_{i-1},a,a_{i+1},\ldots,a_{d})$. // $\bar{\phi}$ is computed from $f_{\widehat{g}}$, $\widehat{a}$, and $\eta$ via Eq. 5.
4:     Get $a\leftarrow\textsf{IGW-ArgMax}(\theta;x,\widehat{g}(x),\eta,r)$. // Algorithm 4.
5:     if $\lvert\det(\bar{\phi}(x,\mathcal{S}_{i}(a)))\rvert\geq\frac{\sqrt{2}C}{2}\lvert\det(\bar{\phi}(x,\mathcal{S}))\rvert$ then
6:       Update $a_{i}\leftarrow a$.
7:       continue to line 2.
8: break
9: return $C$-approximate barycentric spanner $\mathcal{S}$.

To address the challenges above, we introduce ReweightedSpanner (Algorithm 3), a barycentric spanner computation algorithm tailored to the reweighted embedding $\bar{\phi}$. To describe the algorithm, let us introduce some additional notation. For a set $\mathcal{S}\subseteq\mathcal{A}$ of $d$ actions, we let $\det(\bar{\phi}(x,\mathcal{S}))$ denote the determinant of the $d$-by-$d$ matrix whose columns are $\{\bar{\phi}(x,a)\}_{a\in\mathcal{S}}$. ReweightedSpanner adapts the barycentric spanner computation approach of Awerbuch and Kleinberg (2008), which aims to identify a subset $\mathcal{S}\subseteq\mathcal{A}$ with $\lvert\mathcal{S}\rvert=d$ that approximately maximizes $\lvert\det(\bar{\phi}(x,\mathcal{S}))\rvert$. The key feature of ReweightedSpanner is a subroutine, IGW-ArgMax (Algorithm 4), which implements an (approximate) action optimization oracle for the reweighted embedding:

$$\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\langle\bar{\phi}(x,a),\theta\right\rangle. \tag{7}$$

IGW-ArgMax uses line search to reduce the problem in Eq. 7 to a sequence of linear optimization problems with respect to the unweighted embeddings, each of which can be solved using $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$.
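To see why line search suffices, write $v(a):=\langle\phi(x,a),\theta\rangle$ and $w(a):=1+\eta(\widehat{f}(x,\widehat{a})-\widehat{f}(x,a))$, so that $\langle\bar{\phi}(x,a),\theta\rangle^{2}=v(a)^{2}/w(a)$. A completing-the-square identity (our own rendering of the calculation behind Algorithm 4) gives

$$\frac{v(a)^{2}}{w(a)}=\max_{\varepsilon}\left\{2\varepsilon\,v(a)-\varepsilon^{2}w(a)\right\},$$

with the maximum attained at $\varepsilon=v(a)/w(a)$, and for any fixed $\varepsilon$,

$$2\varepsilon\,v(a)-\varepsilon^{2}w(a)=\left\langle\phi(x,a),\,2\varepsilon\theta+\varepsilon^{2}\eta\,\widehat{g}(x)\right\rangle-\varepsilon^{2}\big(1+\eta\widehat{f}(x,\widehat{a})\big),$$

where the second term does not depend on $a$. Thus, for each $\varepsilon$ in a geometric grid $\mathcal{E}$, the inner maximization over $a$ is exactly one call to $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$ with query vector $2\varepsilon\theta+\varepsilon^{2}\eta\,\widehat{g}(x)$, matching lines 5-6 of Algorithm 4. This reduction yields the following guarantee for Algorithm 3.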

Theorem 3.

Suppose that Algorithm 3 is invoked with parameters $\eta>0$, $r\in(0,1)$, and $C>\sqrt{2}$, and that the initialization set $\mathcal{S}$ satisfies $\lvert\det(\phi(x,\mathcal{S}))\rvert\geq r^{d}$. Then the algorithm returns a $C$-approximate barycentric spanner with respect to the reweighted embedding set $\{\bar{\phi}(x,a)\}_{a\in\mathcal{A}}$, and does so with $O((\mathcal{T}_{\mathsf{Opt}}\cdot d^{3}+d^{4})\cdot\log^{2}(e\vee\frac{\eta}{r}))$ runtime and $O(\mathcal{M}_{\mathsf{Opt}}+d^{2}+d\log(e\vee\frac{\eta}{r}))$ memory.

We refer to Section C.1 for a self-contained analysis of IGW-ArgMax.

Algorithm 4 IGW-ArgMax
0: Linear parameter $\theta\in\mathbb{R}^{d}$, context $x\in\mathcal{X}$, oracle prediction $\widehat{g}(x)\in\mathbb{R}^{d}$, reweighting parameter $\eta>0$, initialization constant $r\in(0,1)$.
1: Define $N:=\lceil d\log_{4/3}(\frac{2\eta+1}{r})\rceil$.
2: Define $\mathcal{E}:=\{(\frac{3}{4})^{i}\}_{i=1}^{N}\cup\{-(\frac{3}{4})^{i}\}_{i=1}^{N}$.
3: Initialize $\widehat{\mathcal{A}}=\emptyset$.
4: for each $\varepsilon\in\mathcal{E}$ do
5:   Compute $\bar{\theta}\leftarrow 2\varepsilon\theta+\varepsilon^{2}\eta\cdot\widehat{g}(x)$.
6:   Get $a\leftarrow\operatorname*{arg\,max}_{a\in\mathcal{A}}\langle\phi(x,a),\bar{\theta}\rangle$; add $a$ to $\widehat{\mathcal{A}}$.
7: return $\operatorname*{arg\,max}_{a\in\widehat{\mathcal{A}}}\langle\bar{\phi}(x,a),\theta\rangle^{2}$ // $\widetilde{O}(d)$ candidates.
On the initialization requirement

The runtime of Algorithm 3 scales with $\log(r^{-1})$, where $r\in(0,1)$ is such that $\lvert\det(\phi(x,\mathcal{S}))\rvert\geq r^{d}$ for the initial set $\mathcal{S}$. In Section C.3, we provide computationally efficient algorithms for initialization under various assumptions on the action space.

5 Empirical Results

In this section we investigate the empirical performance of SpannerGreedy and SpannerIGW through three experiments. First, we compare the spanner-based algorithms to state-of-the-art finite-action algorithms on a large-action dataset; this experiment features nonlinear, learned context embeddings $g\in\mathcal{G}$. Next, we study the impact of redundant actions on the statistical performance of said algorithms. Finally, we experiment with a large-scale large-action contextual bandit benchmark, where we find that the spanner-based methods exhibit excellent performance.

Preliminaries

We conduct experiments on three datasets, whose details are summarized in Table 1. oneshotwiki (Singh et al., 2012; Vasnetsov, 2018) is a named-entity recognition task where contexts are text phrases preceding and following the mention text, and where actions are text phrases corresponding to the concept names. amazon-3m (Bhatia et al., 2016) is an extreme multi-label dataset whose contexts are text phrases corresponding to the title and description of an item, and whose actions are integers corresponding to item tags. Actions are embedded into $\mathbb{R}^{d}$ with $d$ specified in Table 1. We construct binary rewards for each dataset, and report 90% bootstrap confidence intervals (CIs) of the rewards in the experiments. We defer other experimental details to Section D.1. Code to reproduce all results is available at https://github.com/pmineiro/linrepcb.

Table 1: Datasets used for experiments.
Dataset             $T$        $|\mathcal{A}|$   $d$
oneshotwiki-311     622000     311               50
oneshotwiki-14031   2806200    14031             50
amazon-3m           1717899    2812281           800
Comparison with finite-action baselines

We compare SpannerGreedy and SpannerIGW with their finite-action counterparts $\varepsilon$-Greedy and SquareCB (Foster and Rakhlin, 2020) on the oneshotwiki-14031 dataset. We consider bilinear models, in which regression functions take the form $f(x,a)=\langle\phi(a),Wx\rangle$ where $W$ is a matrix of learned parameters, and deep models of the form $f(x,a)=\langle\phi(a),W\bar{g}(x)\rangle$, where $\bar{g}$ is a learned two-layer neural network and $W$ contains learned parameters as before (see Section D.1 for details). Table 2 presents our results. We find that SpannerIGW performs best, and that both spanner-based algorithms either tie or exceed their finite-action counterparts. In addition, we find that working with deep models uniformly improves performance for all methods. We refer to Table 4 in Section D.3 for timing information.

Table 2: Comparison on oneshotwiki-14031. Values are the average progressive rewards (confidence intervals), scaled by 1000, under bilinear and deep regression functions. We include the performance of the best constant predictor and supervised learning as a baseline and skyline, respectively.
Algorithm              Bilinear          Deep
best constant          0.07127
$\varepsilon$-Greedy   [5.00, 6.27]      [7.15, 8.52]
SpannerGreedy          [6.29, 7.08]      [6.67, 8.30]
SquareCB               [7.57, 8.59]      [10.4, 11.3]
SpannerIGW             [8.84, 9.68]      [11.2, 12.2]
supervised             [31.2, 31.3]      [36.7, 36.8]
Impact of redundancy

Finite-action contextual bandit algorithms can explore excessively in the presence of redundant actions. To evaluate performance in the face of redundancy, we augment oneshotwiki-311 by duplicating the final action. Table 3 displays the performance of SpannerIGW and its finite-action counterpart, SquareCB, with a varying number of duplicates. We find that SpannerIGW is completely invariant to duplicates (in fact, the algorithm produces numerically identical output when the random seed is fixed), whereas SquareCB is negatively impacted and over-explores the duplicated action. SpannerGreedy and $\varepsilon$-Greedy behave analogously (not shown).

Table 3: Redundancy study on oneshotwiki-311. Values are the average progressive rewards (confidence intervals), scaled by 100.
Duplicates   SpannerIGW      SquareCB
0            [12.6, 13.0]    [12.2, 12.6]
16           [12.6, 13.0]    [12.1, 12.4]
256          [12.6, 13.0]    [10.2, 10.6]
1024         [12.6, 13.0]    [8.3, 8.6]
Large scale exhibition

We conduct a large-scale experiment using the amazon-3m dataset. Following Sen et al. (2021), we study the top-$k$ setting in which $k$ actions are selected at each round; out of the actions selected, we let $r$ denote the number of actions sampled for exploration. We apply SpannerGreedy to this dataset and consider regression functions similar to the deep models discussed before. The setting $(k=1)$ corresponds to running our algorithm unmodified, and $(k=5,r=3)$ corresponds to selecting 5 actions per round and using 3 exploration slots. Fig. 1 in Section D.4 displays the results. For $(k=1)$ the final CI is $[0.1041,0.1046]$, and for $(k=5,r=3)$ the final CI is $[0.438,0.440]$.

In the setup with $(k=5,r=3)$, our results are directly comparable to Sen et al. (2021), who evaluated a tree-based contextual bandit method on the same dataset. The best result from Sen et al. (2021) achieves roughly 0.19 reward with $(k=5,r=3)$, which we exceed by a factor of 2. This indicates that our use of embeddings provides favorable inductive bias for this problem, and underscores the broad utility of our techniques (which leverage embeddings). For $(k=5,r=3)$, our inference time on a commodity CPU with batch size 1 is 160ms per example, which is slower than the time of 7.85ms per example reported in Sen et al. (2021).

6 Additional Related Work

In this section we highlight some relevant lines of research not already discussed.

Efficient general-purpose contextual bandit algorithms

There is a long line of research on computationally efficient methods for contextual bandits with general function approximation, typically based on reduction to either cost-sensitive classification oracles (Langford and Zhang, 2007; Dudik et al., 2011; Agarwal et al., 2014) or regression oracles (Foster et al., 2018; Foster and Rakhlin, 2020; Simchi-Levi and Xu, 2021). Most of these works deal with finite action spaces and have regret scaling with the number of actions, which is necessary without further structural assumptions (Agarwal et al., 2012). Exceptions include the works of Foster et al. (2020) and Xu and Zeevi (2020), both of which consider the same setting as the present paper. Both of the algorithms in these works require solving subproblems based on maximizing quadratic forms (which is NP-hard in general (Sahni, 1974)), and cannot directly take advantage of the linear optimization oracle we consider. Also related is the work of Zhang (2021), which proposes a posterior sampling-style algorithm for the setting we consider. This algorithm is not fully comparable computationally, as it requires sampling from a specific posterior distribution; it is unclear whether this can be achieved in a provably efficient fashion.

Linear contextual bandits

The linear contextual bandit problem is a special case of our setting in which $g^{\star}(x)=\theta\in\mathbb{R}^{d}$ is constant (that is, the reward function only depends on the context through the feature map $\phi$). The most well-studied families of algorithms for this setting are UCB-style algorithms and posterior sampling. With a well-chosen prior and posterior distribution, posterior sampling can be implemented efficiently (Agrawal and Goyal, 2013), but it is unclear how to efficiently adapt this approach to accommodate general function approximation. Existing UCB-type algorithms require solving sub-problems based on maximizing quadratic forms, which is NP-hard in general (Sahni, 1974). One line of research aims to make UCB efficient by using hashing-based methods (MIPS) to approximate the maximum inner product (Yang et al., 2021; Jun et al., 2017). These methods have runtime sublinear (but still polynomial) in the number of actions.

Non-contextual linear bandits

For the problem of non-contextual linear bandits (with either stochastic or adversarial rewards), there is a long line of research on efficient algorithms that can take advantage of linear optimization oracles (Awerbuch and Kleinberg, 2008; McMahan and Blum, 2004; Dani and Hayes, 2006; Dani et al., 2008; Bubeck et al., 2012; Hazan and Karnin, 2016; Ito et al., 2019); see also work on the closely related problem of combinatorial pure exploration (Chen et al., 2017; Cao and Krishnamurthy, 2019; Katz-Samuels et al., 2020; Wagenmaker et al., 2021). In general, it is not clear how to lift these techniques to contextual bandits with linearly structured actions and general function approximation. We also mention that optimal design has been applied in the context of linear bandits, but these algorithms are restricted to the non-contextual setting (Lattimore and Szepesvári, 2020; Lattimore et al., 2020), or to pure exploration (Soare et al., 2014; Fiez et al., 2019). The only exception we are aware of is Ruan et al. (2021), who extend these developments to linear contextual bandits (i.e., where $g^{\star}(x)=\theta$), but critically use the fact that contexts are stochastic.

Other approaches

Another line of research provides efficient contextual bandit methods under specific modeling assumptions on the context space or action space that differ from the ones we consider here. Zhou et al. (2020); Xu et al. (2020); Zhang et al. (2021); Kassraie and Krause (2022) provide generalizations of the UCB algorithm and posterior sampling based on the Neural Tangent Kernel (NTK). These algorithms can be used to learn context embeddings (i.e., $g(x)$) with general function approximation, but only lead to theoretical guarantees under strong RKHS-based assumptions. For large action spaces, these algorithms typically require enumeration over actions. Majzoubi et al. (2020) consider a setting with nonparametric action spaces and design an efficient tree-based learner; their guarantees, however, scale exponentially in the dimensionality of the action space. Sen et al. (2021) provide heuristically-motivated but empirically-effective tree-based algorithms for contextual bandits with large action spaces, with theoretical guarantees when the actions satisfy certain tree-structured properties. Lastly, another empirically-successful approach is the policy gradient method (e.g., Williams (1992); Bhatnagar et al. (2009); Pan et al. (2019)). On the theoretical side, policy gradient methods do not address the issue of systematic exploration, and—to our knowledge—do not lead to provable guarantees for the setting considered in our paper.

7 Discussion

We provide the first efficient algorithms for contextual bandits with continuous, linearly structured action spaces and general-purpose function approximation. We highlight some natural directions for future research below.

  • Efficient algorithms for nonlinear action spaces. Our algorithms take advantage of linearly structured action spaces by appealing to optimal design. Can we develop computationally efficient methods for contextual bandits with nonlinear dependence on the action space?

  • Reinforcement learning. The contextual bandit problem is a special case of the reinforcement learning problem with horizon one. Given our positive results in the contextual bandit setting, a natural next step is to extend our methods to reinforcement learning problems with large action/decision spaces. For example, Foster et al. (2021b) build on our computational tools to provide efficient algorithms for reinforcement learning with bilinear classes.

Beyond these directions, natural domains in which to extend our techniques include pure exploration and off-policy learning with linearly structured actions.

References

  • Abbasi-Yadkori et al. (2011) Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In NIPS, volume 11, pages 2312–2320, 2011.
  • Abe and Long (1999) Naoki Abe and Philip M Long. Associative reinforcement learning using linear probabilistic concepts. In ICML, pages 3–11. Citeseer, 1999.
  • Agarwal et al. (2012) Alekh Agarwal, Miroslav Dudík, Satyen Kale, John Langford, and Robert Schapire. Contextual bandit learning with predictable rewards. In Artificial Intelligence and Statistics, pages 19–26. PMLR, 2012.
  • Agarwal et al. (2014) Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, and Robert Schapire. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pages 1638–1646. PMLR, 2014.
  • Agarwal et al. (2016) Alekh Agarwal, Sarah Bird, Markus Cozowicz, Luong Hoang, John Langford, Stephen Lee, Jiaji Li, Dan Melamed, Gal Oshri, Oswaldo Ribas, Siddhartha Sen, and Aleksandrs Slivkins. Making contextual decisions with low technical debt. arXiv:1606.03966, 2016.
  • Agrawal and Goyal (2013) Shipra Agrawal and Navin Goyal. Thompson sampling for contextual bandits with linear payoffs. In International Conference on Machine Learning, pages 127–135. PMLR, 2013.
  • Atwood (1969) Corwin L Atwood. Optimal and efficient designs of experiments. The Annals of Mathematical Statistics, pages 1570–1602, 1969.
  • Awerbuch and Kleinberg (2008) Baruch Awerbuch and Robert Kleinberg. Online linear optimization and adaptive routing. Journal of Computer and System Sciences, 74(1):97–114, 2008.
  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of machine learning research, 13(2), 2012.
  • Bhatia et al. (2016) K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, and M. Varma. The extreme classification repository: Multi-label datasets and code, 2016. URL http://manikvarma.org/downloads/XC/XMLRepository.html.
  • Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural Actor–Critic algorithms. Automatica, 45(11):2471–2482, 2009.
  • Bietti et al. (2021) Alberto Bietti, Alekh Agarwal, and John Langford. A contextual bandit bake-off. Journal of Machine Learning Research, 22(133):1–49, 2021.
  • Bubeck et al. (2012) Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online linear optimization with bandit feedback. In Conference on Learning Theory, pages 41–1. JMLR Workshop and Conference Proceedings, 2012.
  • Cai et al. (2021) William Cai, Josh Grossman, Zhiyuan Jerry Lin, Hao Sheng, Johnny Tian-Zheng Wei, Joseph Jay Williams, and Sharad Goel. Bandit algorithms to personalize educational chatbots. Machine Learning, pages 1–30, 2021.
  • Cao and Krishnamurthy (2019) Tongyi Cao and Akshay Krishnamurthy. Disagreement-based combinatorial pure exploration: Sample complexity bounds and an efficient algorithm. In Conference on Learning Theory, pages 558–588. PMLR, 2019.
  • Cesa-Bianchi and Lugosi (2012) Nicolo Cesa-Bianchi and Gábor Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
  • Chen et al. (2017) Lijie Chen, Anupam Gupta, Jian Li, Mingda Qiao, and Ruosong Wang. Nearly optimal sampling algorithms for combinatorial pure exploration. In Conference on Learning Theory, pages 482–534. PMLR, 2017.
  • Chernozhukov et al. (2019) Victor Chernozhukov, Mert Demirer, Greg Lewis, and Vasilis Syrgkanis. Semi-parametric efficient policy learning with continuous actions. Advances in Neural Information Processing Systems, 32:15065–15075, 2019.
  • Chu et al. (2011) Wei Chu, Lihong Li, Lev Reyzin, and Robert Schapire. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings, 2011.
  • Cohen et al. (2019) Michael B Cohen, Ben Cousins, Yin Tat Lee, and Xin Yang. A near-optimal algorithm for approximating the John Ellipsoid. In Conference on Learning Theory, pages 849–873. PMLR, 2019.
  • Dani and Hayes (2006) Varsha Dani and Thomas P Hayes. Robbing the bandit: Less regret in online geometric optimization against an adaptive adversary. In SODA, volume 6, pages 937–943, 2006.
  • Dani et al. (2008) Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. Conference on Learning Theory (COLT), 2008.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT (1), 2019.
  • Dudik et al. (2011) Miroslav Dudik, Daniel Hsu, Satyen Kale, Nikos Karampatziakis, John Langford, Lev Reyzin, and Tong Zhang. Efficient optimal learning for contextual bandits. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pages 169–178, 2011.
  • Edmonds (1965) Jack Edmonds. Paths, trees, and flowers. Canadian Journal of mathematics, 17:449–467, 1965.
  • Fiez et al. (2019) Tanner Fiez, Lalit Jain, Kevin G Jamieson, and Lillian Ratliff. Sequential experimental design for transductive linear bandits. Advances in neural information processing systems, 32, 2019.
  • Foster and Rakhlin (2020) Dylan Foster and Alexander Rakhlin. Beyond UCB: Optimal and efficient contextual bandits with regression oracles. In International Conference on Machine Learning, pages 3199–3210. PMLR, 2020.
  • Foster et al. (2018) Dylan Foster, Alekh Agarwal, Miroslav Dudik, Haipeng Luo, and Robert Schapire. Practical contextual bandits with regression oracles. In International Conference on Machine Learning, pages 1539–1548. PMLR, 2018.
  • Foster et al. (2021a) Dylan Foster, Alexander Rakhlin, David Simchi-Levi, and Yunzong Xu. Instance-dependent complexity of contextual bandits and reinforcement learning: A disagreement-based perspective. In Conference on Learning Theory, pages 2059–2059. PMLR, 2021a.
  • Foster and Krishnamurthy (2021) Dylan J Foster and Akshay Krishnamurthy. Efficient first-order contextual bandits: Prediction, allocation, and triangular discrimination. Advances in Neural Information Processing Systems, 34, 2021.
  • Foster et al. (2020) Dylan J Foster, Claudio Gentile, Mehryar Mohri, and Julian Zimmert. Adapting to misspecification in contextual bandits. Advances in Neural Information Processing Systems, 33, 2020.
  • Foster et al. (2021b) Dylan J Foster, Sham M Kakade, Jian Qian, and Alexander Rakhlin. The statistical complexity of interactive decision making. arXiv preprint arXiv:2112.13487, 2021b.
  • Grötschel et al. (2012) Martin Grötschel, László Lovász, and Alexander Schrijver. Geometric algorithms and combinatorial optimization, volume 2. Springer Science & Business Media, 2012.
  • Hazan and Karnin (2016) Elad Hazan and Zohar Karnin. Volumetric spanners: An efficient exploration basis for learning. The Journal of Machine Learning Research, 17(1):4062–4095, 2016.
  • Hazan et al. (2007) Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192, 2007.
  • Ito et al. (2019) Shinji Ito, Daisuke Hatano, Hanna Sumita, Kei Takemura, Takuro Fukunaga, Naonori Kakimura, and Ken-Ichi Kawarabayashi. Oracle-efficient algorithms for online linear optimization with bandit feedback. Advances in Neural Information Processing Systems, 32:10590–10599, 2019.
  • John (1948) F John. Extremum problems with inequalities as subsidiary conditions. R. Courant Anniversary Volume, pages 187–204, 1948.
  • Jun et al. (2017) Kwang-Sung Jun, Aniruddha Bhargava, Robert Nowak, and Rebecca Willett. Scalable generalized linear bandits: Online computation and hashing. Advances in Neural Information Processing Systems, 30, 2017.
  • Kassraie and Krause (2022) Parnian Kassraie and Andreas Krause. Neural contextual bandits without regret. In International Conference on Artificial Intelligence and Statistics, pages 240–278. PMLR, 2022.
  • Katz-Samuels et al. (2020) Julian Katz-Samuels, Lalit Jain, Kevin G Jamieson, et al. An empirical process approach to the union bound: Practical algorithms for combinatorial and linear bandits. Advances in Neural Information Processing Systems, 33, 2020.
  • Kiefer and Wolfowitz (1960) Jack Kiefer and Jacob Wolfowitz. The equivalence of two extremum problems. Canadian Journal of Mathematics, 12:363–366, 1960.
  • Krishnamurthy et al. (2020) Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, and Chicheng Zhang. Contextual bandits with continuous actions: Smoothing, zooming, and adapting. Journal of Machine Learning Research, 21(137):1–45, 2020.
  • Langford and Zhang (2007) John Langford and Tong Zhang. The epoch-greedy algorithm for contextual multi-armed bandits. Advances in neural information processing systems, 20(1):96–1, 2007.
  • Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Lattimore et al. (2020) Tor Lattimore, Csaba Szepesvari, and Gellert Weisz. Learning with good feature representations in bandits and in RL with a generative model. In International Conference on Machine Learning, pages 5662–5670. PMLR, 2020.
  • Lebret and Collobert (2014) Rémi Lebret and Ronan Collobert. Word embeddings through Hellinger PCA. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 482–490, 2014.
  • Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670, 2010.
  • Mahabadi et al. (2019) Sepideh Mahabadi, Piotr Indyk, Shayan Oveis Gharan, and Alireza Rezaei. Composable core-sets for determinant maximization: A simple near-optimal algorithm. In International Conference on Machine Learning, pages 4254–4263. PMLR, 2019.
  • Majzoubi et al. (2020) Maryam Majzoubi, Chicheng Zhang, Rajan Chari, Akshay Krishnamurthy, John Langford, and Aleksandrs Slivkins. Efficient contextual bandits with continuous actions. Advances in Neural Information Processing Systems, 33, 2020.
  • McMahan and Blum (2004) H Brendan McMahan and Avrim Blum. Online geometric optimization in the bandit setting against an adaptive adversary. In International Conference on Computational Learning Theory, pages 109–123. Springer, 2004.
  • Meyer (2000) Carl D Meyer. Matrix analysis and applied linear algebra, volume 71. Siam, 2000.
  • Pan et al. (2019) Feiyang Pan, Qingpeng Cai, Pingzhong Tang, Fuzhen Zhuang, and Qing He. Policy gradients for contextual recommendations. In The World Wide Web Conference, pages 1421–1431, 2019.
  • Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019. URL http://arxiv.org/abs/1908.10084.
  • Ruan et al. (2021) Yufei Ruan, Jiaqi Yang, and Yuan Zhou. Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87, 2021.
  • Sahni (1974) Sartaj Sahni. Computationally related problems. SIAM Journal on computing, 3(4):262–279, 1974.
  • Schrijver (1998) Alexander Schrijver. Theory of linear and integer programming. John Wiley & Sons, 1998.
  • Sen et al. (2021) Rajat Sen, Alexander Rakhlin, Lexing Ying, Rahul Kidambi, Dean Foster, Daniel N Hill, and Inderjit S Dhillon. Top-k extreme contextual bandits with arm hierarchy. In International Conference on Machine Learning, pages 9422–9433. PMLR, 2021.
  • Sherman and Morrison (1950) Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.
  • Shrivastava and Li (2014) Anshumali Shrivastava and Ping Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (MIPS). Advances in neural information processing systems, 27, 2014.
  • Simchi-Levi and Xu (2021) David Simchi-Levi and Yunzong Xu. Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability. Mathematics of Operations Research, 2021.
  • Singh et al. (2012) Sameer Singh, Amarnag Subramanya, Fernando Pereira, and Andrew McCallum. Wikilinks: A large-scale cross-document coreference corpus labeled via links to Wikipedia. Technical Report UM-CS-2012-015, University of Massachusetts, Amherst, 2012.
  • Soare et al. (2014) Marta Soare, Alessandro Lazaric, and Rémi Munos. Best-arm identification in linear bandits. Advances in Neural Information Processing Systems, 27, 2014.
  • Summa et al. (2014) Marco Di Summa, Friedrich Eisenbrand, Yuri Faenza, and Carsten Moldenhauer. On largest volume simplices and sub-determinants. In Proceedings of the twenty-sixth annual ACM-SIAM symposium on Discrete algorithms, pages 315–323. SIAM, 2014.
  • Tewari and Murphy (2017) Ambuj Tewari and Susan A Murphy. From Ads to interventions: Contextual bandits in mobile health. In Mobile Health, pages 495–517. Springer, 2017.
  • Vasnetsov (2018) Andrey Vasnetsov. Oneshot-wikilinks. https://www.kaggle.com/generall/oneshotwikilinks, 2018.
  • Vovk (1998) Vladimir Vovk. A game of prediction with expert advice. Journal of Computer and System Sciences, 56(2):153–173, 1998.
  • Wagenmaker et al. (2021) Andrew Wagenmaker, Julian Katz-Samuels, and Kevin Jamieson. Experimental design for regret minimization in linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 3088–3096, 2021.
  • Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3):229–256, 1992.
  • Xu et al. (2020) Pan Xu, Zheng Wen, Handong Zhao, and Quanquan Gu. Neural contextual bandits with deep representation and shallow exploration. arXiv preprint arXiv:2012.01780, 2020.
  • Xu and Zeevi (2020) Yunbei Xu and Assaf Zeevi. Upper counterfactual confidence bounds: a new optimism principle for contextual bandits. arXiv preprint arXiv:2007.07876, 2020.
  • Yang et al. (2021) Shuo Yang, Tongzheng Ren, Sanjay Shakkottai, Eric Price, Inderjit S Dhillon, and Sujay Sanghavi. Linear bandit algorithms with sublinear time complexity. arXiv preprint arXiv:2103.02729, 2021.
  • Zhang (2021) Tong Zhang. Feel-good thompson sampling for contextual bandits and reinforcement learning. arXiv preprint arXiv:2110.00871, 2021.
  • Zhang et al. (2021) Weitong Zhang, Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural Thompson sampling. In International Conference on Learning Representation (ICLR), 2021.
  • Zhou et al. (2020) Dongruo Zhou, Lihong Li, and Quanquan Gu. Neural contextual bandits with ucb-based exploration. In International Conference on Machine Learning, pages 11492–11502. PMLR, 2020.

Appendix A Proofs and Supporting Results from Section 3

This section is organized as follows. We provide supporting results in Section A.1, then give the proof of Theorem 1 in Section A.2.

A.1 Supporting Results

A.1.1 Barycentric Spanner and Optimal Design

Algorithm 5 restates an algorithm of Awerbuch and Kleinberg (2008), which efficiently computes a barycentric spanner (Definition 3) given access to a linear optimization oracle (Definition 1). Recall that, for a set $\mathcal{S}\subset\mathcal{A}$ of $d$ actions, the notation $\det(\bar{\phi}(x,\mathcal{S}))$ (resp. $\det(\phi(x,\mathcal{S}))$) denotes the determinant of the $d$-by-$d$ matrix whose columns are the $\bar{\phi}$ (resp. $\phi$) embeddings of the actions in $\mathcal{S}$.

Algorithm 5 Approximate Barycentric Spanner (Awerbuch and Kleinberg, 2008)
0: Context x𝒳x\in\mathcal{X} and approximation factor C>1C>1.
1:for i=1,,di=1,\dots,d do
2:  Compute θd\theta\in{\mathbb{R}}^{d} representing linear function ϕ(x,a)det(ϕ(x,a1),,ϕ(x,ai1),ϕ(x,a),ei+1,,ed)\phi(x,a)\mapsto\det(\phi(x,a_{1}),\ldots,\phi(x,a_{i-1}),\phi(x,a),e_{i+1},\ldots,e_{d}).
3:  Get aiargmaxa𝒜|ϕ(x,a),θ|a_{i}\leftarrow\operatorname*{arg\,max}_{a\in\mathcal{A}}\lvert\langle\phi(x,a),\theta\rangle\rvert.
4: Construct 𝒮=(a1,,ad)\mathcal{S}=(a_{1},\ldots,a_{d}). // Initial set of actions 𝒮𝒜\mathcal{S}\subseteq\mathcal{A} such that |𝒮|=d\lvert\mathcal{S}\rvert=d and |det(ϕ(x,𝒮))|>0\lvert\det(\phi(x,\mathcal{S}))\rvert>0.
5:while not break do
6:  for i=1,,di=1,\dots,d do
7:   Compute θd\theta\in{\mathbb{R}}^{d} representing linear function ϕ(x,a)det(ϕ(x,𝒮i(a)))\phi(x,a)\mapsto\det(\phi(x,\mathcal{S}_{i}(a))), where 𝒮i(a):=(a1,,ai1,a,ai+1,,ad)\mathcal{S}_{i}(a)\vcentcolon=(a_{1},\ldots,a_{i-1},a,a_{i+1},\ldots,a_{d}).
8:   Get aargmaxa𝒜|ϕ(x,a),θ|a\leftarrow\operatorname*{arg\,max}_{a\in\mathcal{A}}\lvert\langle\phi(x,a),\theta\rangle\rvert.
9:   if |det(ϕ(x,𝒮i(a)))|C|det(ϕ(x,𝒮))|\lvert\det(\phi(x,\mathcal{S}_{i}(a)))\rvert\geq C\lvert\det(\phi(x,\mathcal{S}))\rvert then
10:    Update aiaa_{i}\leftarrow a.
11:    continue to line 5.
12:  break
13:return  CC-approximate barycentric spanner 𝒮\mathcal{S}.
Lemma 5 (Awerbuch and Kleinberg (2008)).

For any x𝒳x\in\mathcal{X}, Algorithm 5 computes a CC-approximate barycentric spanner for {ϕ(x,a):a𝒜}\{\phi(x,a):a\in\mathcal{A}\} within O(dlogCd)O(d\log_{C}d) iterations of the while-loop.
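To make the procedure concrete, here is a minimal Python sketch of Algorithm 5. The interfaces are assumptions: `phi(a)` returns the embedding $\phi(x,a)$ for the fixed context, and `lin_opt(theta)` stands in for $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$, returning an action maximizing $\lvert\langle\phi(x,a),\theta\rangle\rvert$. For clarity the sketch recomputes determinants and inverses directly; Lemma 6 below shows how to maintain them with rank-one updates instead.

```python
import numpy as np

def approx_barycentric_spanner(phi, lin_opt, d, C=2.0):
    """Sketch of Algorithm 5. `phi` and `lin_opt` are assumed interfaces:
    `phi(a)` returns the embedding of action a for the fixed context, and
    `lin_opt(theta)` returns an action maximizing |<phi(a), theta>|.
    Determinants/inverses are recomputed directly for clarity; the analysis
    in Lemma 6 instead maintains them via rank-one updates."""
    S = [None] * d
    Phi = np.eye(d)  # columns replaced by phi(a_1), ..., phi(a_d) in turn
    for i in range(d):
        # theta represents a -> det(phi(a_1),...,phi(a_{i-1}), phi(a), e_{i+1},...,e_d)
        theta = np.linalg.det(Phi) * np.linalg.inv(Phi).T[:, i]
        S[i] = lin_opt(theta)
        Phi[:, i] = phi(S[i])
    improved = True
    while improved:  # swap until no column improves |det| by a factor of C
        improved = False
        for i in range(d):
            theta = np.linalg.det(Phi) * np.linalg.inv(Phi).T[:, i]
            a = lin_opt(theta)
            # |det with column i replaced by phi(a)| equals |<phi(a), theta>|
            if abs(phi(a) @ theta) >= C * abs(np.linalg.det(Phi)):
                S[i] = a
                Phi[:, i] = phi(a)
                improved = True
                break
    return S  # C-approximate barycentric spanner
```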

Lemma 6.

Fix any constant C>1C>1. Algorithm 5 can be implemented with runtime O(𝒯𝖮𝗉𝗍d2logd+d4logd)O(\mathcal{T}_{\mathsf{Opt}}\cdot d^{2}\log d+d^{4}\log d) and memory O(𝖮𝗉𝗍+d2)O(\mathcal{M}_{\mathsf{Opt}}+d^{2}).

Proof of Lemma 6.

We provide the computational complexity analysis starting from the while-loop (lines 5-12); the analysis for the first for-loop (lines 1-3) is analogous.

  • Outer loops (lines 5-6). From Lemma 5, we know that Algorithm 5 terminates within O(dlogd)O(d\log d) iterations of the while-loop (line 5). It is also clear that the for-loop (line 6) is invoked at most dd times.

  • Computational complexity for lines 7-10. We discuss how to efficiently implement this part using rank-one updates. We analyze the computational complexity for each line in the following.

    • Line 7. We discuss how to efficiently compute the linear function θ\theta through rank-one updates. Fix any YdY\in{\mathbb{R}}^{d}. Let Φ𝒮\Phi_{\mathcal{S}} denote the invertible (by construction) matrix whose kk-th column is ϕ(x,ak)\phi(x,a_{k}) (with ak𝒮a_{k}\in\mathcal{S}). Using the rank-one update formula for the determinant (Meyer, 2000), we have

      det(ϕ(x,a1),,ϕ(x,ai1),Y,ϕ(x,ai+1),,ϕ(x,ad))\displaystyle\det(\phi(x,a_{1}),\ldots,\phi(x,a_{i-1}),Y,\phi(x,a_{i+1}),\ldots,\phi(x,a_{d}))
      =det(Φ𝒮+(Yϕ(x,ai))ei)\displaystyle=\det\Big{(}\Phi_{\mathcal{S}}+\big{(}Y-\phi(x,a_{i})\big{)}e_{i}^{\top}\Big{)}
      =det(Φ𝒮)(1+eiΦ𝒮1(Yϕ(x,ai)))\displaystyle=\det(\Phi_{\mathcal{S}})\cdot\Big{(}1+e_{i}^{\top}\Phi_{\mathcal{S}}^{-1}\big{(}Y-\phi(x,a_{i})\big{)}\Big{)}
      =Y,det(Φ𝒮)(Φ𝒮1)ei+det(Φ𝒮)(1eiΦ𝒮1ϕ(x,ai)).\displaystyle=\big{\langle}Y,\det(\Phi_{\mathcal{S}})\cdot\left(\Phi_{\mathcal{S}}^{-1}\right)^{\top}e_{i}\big{\rangle}+\det(\Phi_{\mathcal{S}})\cdot\big{(}1-e_{i}^{\top}\Phi_{\mathcal{S}}^{-1}\phi(x,a_{i})\big{)}. (8)

      We first notice that det(Φ𝒮)(1eiΦ𝒮1ϕ(x,ai))=0\det(\Phi_{\mathcal{S}})\cdot\big{(}1-e_{i}^{\top}\Phi_{\mathcal{S}}^{-1}\phi(x,a_{i})\big{)}=0 since one can take Y=0dY=0\in{\mathbb{R}}^{d}. We can then write

      det(ϕ(x,a1),,ϕ(x,ai1),Y,ϕ(x,ai+1),,ϕ(x,ad))=Y,θ\displaystyle\det(\phi(x,a_{1}),\ldots,\phi(x,a_{i-1}),Y,\phi(x,a_{i+1}),\ldots,\phi(x,a_{d}))=\langle Y,\theta\rangle

      where $\theta=\det(\Phi_{\mathcal{S}})\cdot\left(\Phi_{\mathcal{S}}^{-1}\right)^{\top}e_{i}$. Thus, whenever $\det(\Phi_{\mathcal{S}})$ and $\Phi_{\mathcal{S}}^{-1}$ are known, computing $\theta$ takes $O(d)$ time. The maximum memory requirement is $O(d^{2})$, due to the storage of $\Phi_{\mathcal{S}}^{-1}$.

    • Line 8. Once $\theta$ is computed, we can compute $a$ by first computing $a_{+}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}\langle\phi(x,a),\theta\rangle$ and $a_{-}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}-\langle\phi(x,a),\theta\rangle$, and then comparing the two. This process takes two oracle calls to $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$, i.e., $O(\mathcal{T}_{\mathsf{Opt}})$ time. The maximum memory requirement is $O(\mathcal{M}_{\mathsf{Opt}}+d)$, due to the memory requirement of $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$ and the storage of $\theta$.

    • Line 9. Once $\theta$ and $\det(\Phi_{\mathcal{S}})$ are computed, checking the updating criterion takes $O(d)$ time. The maximum memory requirement is $O(d)$, due to the storage of $\phi(x,a)$ and $\theta$.

    • Line 10. We discuss how to efficiently update $\det(\Phi_{\mathcal{S}})$ and $\Phi_{\mathcal{S}}^{-1}$ through rank-one updates. If an update $a_{i}\leftarrow a$ is made, we can update the determinant using the rank-one update formula (as in Eq. 8) with runtime $O(d)$ and memory $O(d^{2})$, and update the inverse matrix using the Sherman-Morrison rank-one update formula (Sherman and Morrison, 1950), i.e.,

      $\Big(\Phi_{\mathcal{S}}+\big(\phi(x,a)-\phi(x,a_{i})\big)e_{i}^{\top}\Big)^{-1}=\Phi_{\mathcal{S}}^{-1}-\frac{\Phi_{\mathcal{S}}^{-1}\big(\phi(x,a)-\phi(x,a_{i})\big)e_{i}^{\top}\Phi_{\mathcal{S}}^{-1}}{1+e_{i}^{\top}\Phi_{\mathcal{S}}^{-1}\big(\phi(x,a)-\phi(x,a_{i})\big)},$

      which can be implemented in O(d2)O(d^{2}) time and memory. Note that the updated matrix must be invertible by construction.

    Thus, using rank-one updates, each execution of lines 7-10 takes $O(\mathcal{T}_{\mathsf{Opt}}+d^{2})$ time, and the maximum memory requirement is $O(\mathcal{M}_{\mathsf{Opt}}+d^{2})$. We also remark that the initial determinant and inverse are cheap to obtain, since the first iteration of the first for-loop (i.e., line 2 with $i=1$) starts from the identity matrix.

To summarize, Algorithm 5 has runtime O(𝒯𝖮𝗉𝗍d2logd+d4logd)O(\mathcal{T}_{\mathsf{Opt}}\cdot d^{2}\log d+d^{4}\log d) and uses at most O(𝖮𝗉𝗍+d2)O(\mathcal{M}_{\mathsf{Opt}}+d^{2}) units of memory. ∎
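As a sanity check on the two identities used above, the following numpy snippet verifies the rank-one determinant update (Eq. 8) and the Sherman-Morrison inverse update on a random column replacement (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, i = 5, 2
Phi = rng.standard_normal((d, d))          # invertible with high probability
Phi_inv, det_Phi = np.linalg.inv(Phi), np.linalg.det(Phi)
u = rng.standard_normal(d)                 # new column to place at index i
delta = u - Phi[:, i]

# Matrix-determinant lemma (Eq. 8): replacing column i is a rank-one update.
det_new = det_Phi * (1.0 + Phi_inv[i] @ delta)
Phi_new = Phi.copy(); Phi_new[:, i] = u
assert np.isclose(det_new, np.linalg.det(Phi_new))

# Sherman-Morrison: O(d^2) update of the inverse after the column swap.
inv_new = Phi_inv - np.outer(Phi_inv @ delta, Phi_inv[i]) / (1.0 + Phi_inv[i] @ delta)
assert np.allclose(inv_new, np.linalg.inv(Phi_new))
```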

The next result (Lemma 2, restated below) shows that a barycentric spanner yields an approximate optimal design. The result is well known (e.g., Hazan and Karnin (2016)), but we provide a proof here for completeness.

See 2

Proof of Lemma 2.

Assume without loss of generality that $\mathcal{Z}\subseteq\mathbb{R}^{d}$ spans $\mathbb{R}^{d}$. By Definition 3, we know that for any $z\in\mathcal{Z}$, we can represent $z$ as a weighted sum of elements in $\mathcal{S}$ with coefficients in the range $[-C,C]$. Let $\Phi_{\mathcal{S}}\in\mathbb{R}^{d\times d}$ be the matrix whose columns are the vectors in $\mathcal{S}$. For any $z\in\mathcal{Z}$, we can find $\theta\in[-C,C]^{d}$ such that $z=\Phi_{\mathcal{S}}\theta$. Since $\Phi_{\mathcal{S}}$ is invertible (by construction), we can write $\theta=\Phi_{\mathcal{S}}^{-1}z$. Letting $q$ denote the uniform distribution over $\mathcal{S}$, so that $V(q)=\frac{1}{d}\Phi_{\mathcal{S}}\Phi_{\mathcal{S}}^{\top}$, this implies the result via

$C^{2}\cdot d\geq\lVert\theta\rVert_{2}^{2}=\lVert z\rVert^{2}_{(\Phi_{\mathcal{S}}\Phi_{\mathcal{S}}^{\top})^{-1}}=\frac{1}{d}\cdot\lVert z\rVert^{2}_{V(q)^{-1}}.$ ∎
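As a numerical illustration of this argument, the following snippet builds a set that satisfies the $C$-approximate spanner property by construction and checks the resulting design guarantee $\lVert z\rVert^{2}_{V(q)^{-1}}\leq C^{2}d^{2}$ (all array names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, C = 4, 50, 2.0

# Construct Z to satisfy the spanner property by design: every z in Z equals
# Phi_S @ theta with theta in [-C, C]^d, where the columns of Phi_S form S.
Phi_S = rng.standard_normal((d, d))
Theta = rng.uniform(-C, C, size=(n, d))
Z = Theta @ Phi_S.T                              # rows are z = Phi_S @ theta

V = (Phi_S @ Phi_S.T) / d                        # V(q) for q uniform over S
V_inv = np.linalg.inv(V)
norms = np.einsum('nd,de,ne->n', Z, V_inv, Z)    # ||z||^2_{V(q)^{-1}} per row
assert norms.max() <= C**2 * d**2 + 1e-9         # the C^2 * d^2 design bound
```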

A.1.2 Regret Decomposition

Fix any $\gamma>0$. We consider the following meta-algorithm, which utilizes the online regression oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$ defined in Assumption 2.

For t=1,2,,Tt=1,2,\ldots,T:

  • Get context xt𝒳x_{t}\in\mathcal{X} from the environment and regression function f^tconv()\widehat{f}_{t}\in\operatorname{{conv}}(\mathcal{F}) from the online regression oracle 𝐀𝐥𝐠𝖲𝗊\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}.

  • Identify the distribution ptΔ(𝒜)p_{t}\in\Delta(\mathcal{A}) that solves the minimax problem 𝖽𝖾𝖼γ(;f^t,xt)\mathsf{dec}_{\gamma}(\mathcal{F};\widehat{f}_{t},x_{t}) (defined in Eq. 3) and play action atpta_{t}\sim p_{t}.

  • Observe reward rtr_{t} and update regression oracle with example (xt,at,rt)(x_{t},a_{t},r_{t}).
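The following schematic shows the shape of this loop in code. All interfaces are assumptions rather than the paper's concrete APIs: the environment, the regression oracle, and the minimax solver are stand-ins.

```python
import numpy as np

def run_meta_algorithm(env, reg_oracle, solve_dec, T, gamma,
                       rng=np.random.default_rng(0)):
    """Schematic of the meta-algorithm above. Every interface here is an
    assumption: `env.context()` and `env.reward(x, a)` model the environment,
    `reg_oracle.predict(x)` returns the oracle's current estimate (e.g.,
    g_hat_t(x)), `reg_oracle.update(...)` performs one online-regression
    step, and `solve_dec(pred, gamma)` returns a finite-support distribution
    (actions, probs) certifying a bound on dec_gamma at this context."""
    for _ in range(T):
        x = env.context()                         # observe context x_t
        pred = reg_oracle.predict(x)              # f_hat_t via g_hat_t(x_t)
        actions, probs = solve_dec(pred, gamma)   # the distribution p_t
        idx = rng.choice(len(actions), p=probs)
        a = actions[idx]                          # play a_t ~ p_t
        r = env.reward(x, a)                      # observe reward r_t
        reg_oracle.update(x, a, r)                # feed back (x_t, a_t, r_t)
```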

The following result bounds the contextual bandit regret for the meta algorithm described above. The result is a variant of the regret decomposition based on the Decision-Estimation Coefficient given in Foster et al. (2021b), which generalizes Foster and Rakhlin (2020). The slight differences in constant terms are due to the difference in reward range.

Lemma 7 (Foster and Rakhlin (2020); Foster et al. (2021b)).

Suppose that Assumption 2 holds. Then, with probability at least $1-\delta$, the contextual bandit regret is upper bounded as follows:

𝐑𝐞𝐠𝖢𝖡(T)𝖽𝖾𝖼γ()T+2γ𝐑𝐞𝐠𝖲𝗊(T)+64γlog(2δ1)+8Tlog(2δ1).\displaystyle\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T)\leq\mathsf{dec}_{\gamma}(\mathcal{F})\cdot{}T+2\gamma\cdot\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+64\gamma\cdot\log(2\delta^{-1})+\sqrt{8T\log(2\delta^{-1})}.

In general, identifying a distribution that exactly solves the minimax problem corresponding to the DEC may be impractical. However, if one can identify a distribution that instead certifies an upper bound $\overline{\mathsf{dec}}_{\gamma}(\mathcal{F})$ on the Decision-Estimation Coefficient (in the sense that $\mathsf{dec}_{\gamma}(\mathcal{F})\leq\overline{\mathsf{dec}}_{\gamma}(\mathcal{F})$), the regret bound in Lemma 7 continues to hold with $\mathsf{dec}_{\gamma}(\mathcal{F})$ replaced by $\overline{\mathsf{dec}}_{\gamma}(\mathcal{F})$.

A.1.3 Proof of Lemma 3

See 3

Proof of Lemma 3.

Fix a context x𝒳x\in\mathcal{X}. In our setting, where actions are linearly structured, we can equivalently write the Decision-Estimation Coefficient 𝖽𝖾𝖼γ(;f^,x)\mathsf{dec}_{\gamma}(\mathcal{F};\widehat{f},x) as

𝖽𝖾𝖼γ(𝒢;g^,x):=infpΔ(𝒜)supa𝒜supg𝒢𝔼ap[ϕ(x,a)ϕ(x,a),g(x)γ(ϕ(x,a),g(x)g^(x))2].\displaystyle\mathsf{dec}_{\gamma}(\mathcal{G};\widehat{g},x)\vcentcolon=\inf_{p\in\Delta(\mathcal{A})}\sup_{a^{\star}\in\mathcal{A}}\sup_{g^{\star}\in\mathcal{G}}{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,a^{\star})-\phi(x,a),g^{\star}(x)\big{\rangle}-{\gamma}\cdot\big{(}\big{\langle}\phi(x,a),g^{\star}(x)-\widehat{g}(x)\big{\rangle}\big{)}^{2}\bigg{]}. (9)

Recall that within our algorithms, g^conv(𝒢)\widehat{g}\in\operatorname{{conv}}(\mathcal{G}) is obtained from the estimator f^=fg^\widehat{f}=f_{\widehat{g}} output by 𝐀𝐥𝐠𝖲𝗊\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}. We will bound the quantity in Eq. 9 uniformly for all x𝒳x\in\mathcal{X} and g^:𝒳d\widehat{g}:\mathcal{X}\to\mathbb{R}^{d} with g^1\|\widehat{g}\|\leq{}1. Recall that we assume supg𝒢,x𝒳g(x)1\sup_{g\in\mathcal{G},x\in\mathcal{X}}\|g(x)\|\leq 1.

Denote a^:=argmaxa𝒜ϕ(x,a),g^(x)\widehat{a}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}\big{\langle}\phi(x,a),\widehat{g}(x)\big{\rangle} and a:=argmaxa𝒜ϕ(x,a),g(x)a^{\star}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}\big{\langle}\phi(x,a),g^{\star}(x)\big{\rangle}. For any ε1\varepsilon\leq 1, let p:=εqopt+(1ε)𝕀a^p\vcentcolon=\varepsilon\cdot q^{\operatorname{{opt}}}+(1-\varepsilon)\cdot\mathbb{I}_{\widehat{a}}, where qoptΔ(𝒜)q^{\operatorname{{opt}}}\in\Delta(\mathcal{A}) is any CoptC_{\operatorname{{opt}}}-approximate optimal design for the embedding {ϕ(x,a)}a𝒜\left\{\phi(x,a)\right\}_{a\in\mathcal{A}}. We have the following decomposition.

𝔼ap[ϕ(x,a)ϕ(x,a),g(x)]\displaystyle{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,a^{\star})-\phi(x,a),g^{\star}(x)\big{\rangle}\Big{]} =𝔼ap[ϕ(x,a^)ϕ(x,a),g^(x)]+𝔼ap[ϕ(x,a),g^(x)g(x)]\displaystyle={\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}\Big{]}+{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,a),\widehat{g}(x)-g^{\star}(x)\big{\rangle}\Big{]}
+(ϕ(x,a),g(x)ϕ(x,a^),g^(x)).\displaystyle\quad+\Big{(}\big{\langle}\phi(x,a^{\star}),g^{\star}(x)\big{\rangle}-\big{\langle}\phi(x,\widehat{a}),\widehat{g}(x)\big{\rangle}\Big{)}. (10)

For the first term in Eq. 10, we have

𝔼ap[ϕ(x,a^)ϕ(x,a),g^(x)]=ε𝔼aqopt[ϕ(x,a^)ϕ(x,a),g^(x)]2εsupx𝒳,a𝒜ϕ(x,a)supx𝒳g^(x)2ε.\displaystyle{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}\Big{]}=\varepsilon\cdot{\mathbb{E}}_{a\sim q^{\operatorname{{opt}}}}\Big{[}\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}\Big{]}\leq 2\varepsilon\cdot\sup_{x\in\mathcal{X},a\in\mathcal{A}}\|\phi(x,a)\|\cdot\sup_{x\in\mathcal{X}}\|\widehat{g}(x)\|\leq 2\varepsilon.

Next, since

ϕ(x,a),g^(x)g(x)γ2(ϕ(x,a),g^(x)g(x))2+12γ\displaystyle\big{\langle}\phi(x,a),\widehat{g}(x)-g^{\star}(x)\big{\rangle}\leq\frac{\gamma}{2}\cdot\big{(}\big{\langle}\phi(x,a),\widehat{g}(x)-g^{\star}(x)\big{\rangle}\big{)}^{2}+\frac{1}{2\gamma}

by AM-GM inequality, we can bound the second term in Eq. 10 by

𝔼ap[ϕ(x,a),g^(x)g(x)]γ2𝔼ap[(ϕ(x,a),g^(x)g(x))2]+12γ.\displaystyle{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,a),\widehat{g}(x)-g^{\star}(x)\big{\rangle}\Big{]}\leq\frac{\gamma}{2}\cdot{\mathbb{E}}_{a\sim p}\Big{[}\big{(}\big{\langle}\phi(x,a),\widehat{g}(x)-g^{\star}(x)\big{\rangle}\big{)}^{2}\Big{]}+\frac{1}{2\gamma}.

We now turn our attention to the third term. Observe that since a^\widehat{a} is optimal for g^\widehat{g}, ϕ(x,a^),g^(x)ϕ(x,a),g^(x)\big{\langle}\phi(x,\widehat{a}),\widehat{g}(x)\big{\rangle}\geq\big{\langle}\phi(x,a^{\star}),\widehat{g}(x)\big{\rangle}. As a result, defining V(qopt):=𝔼aqopt[ϕ(x,a)ϕ(x,a)]V(q^{\operatorname{{opt}}})\vcentcolon={\mathbb{E}}_{a\sim q^{\operatorname{{opt}}}}[\phi(x,a)\phi(x,a)^{\top}], we have

$\langle\phi(x,a^{\star}),g^{\star}(x)\rangle-\langle\phi(x,\widehat{a}),\widehat{g}(x)\rangle \leq \langle\phi(x,a^{\star}),g^{\star}(x)-\widehat{g}(x)\rangle$
$\leq \lVert\phi(x,a^{\star})\rVert_{V(q^{\operatorname{opt}})^{-1}}\cdot\lVert g^{\star}(x)-\widehat{g}(x)\rVert_{V(q^{\operatorname{opt}})}$
$\leq \frac{1}{2\gamma\varepsilon}\cdot\lVert\phi(x,a^{\star})\rVert^{2}_{V(q^{\operatorname{opt}})^{-1}}+\frac{\gamma\varepsilon}{2}\cdot\mathbb{E}_{a\sim q^{\operatorname{opt}}}\Big[\big(\langle\phi(x,a),g^{\star}(x)-\widehat{g}(x)\rangle\big)^{2}\Big]$
$\leq \frac{C_{\operatorname{opt}}\cdot d}{2\gamma\varepsilon}+\frac{\gamma}{2}\cdot\mathbb{E}_{a\sim p}\Big[\big(\langle\phi(x,a),g^{\star}(x)-\widehat{g}(x)\rangle\big)^{2}\Big].$

Here, the third line follows from the AM-GM inequality, and the last line follows from the (CoptC_{\operatorname{{opt}}}-approximate) optimal design property and the definition of pp.

Combining these bounds, we have

$\mathsf{dec}_{\gamma}(\mathcal{F};\widehat{f},x)=\mathsf{dec}_{\gamma}(\mathcal{G};\widehat{g},x)\leq 2\varepsilon+\frac{1}{2\gamma}+\frac{C_{\operatorname{opt}}\cdot d}{2\gamma\varepsilon}.$

Since γ1\gamma\geq 1, taking ε:=Coptd/4γ1\varepsilon\vcentcolon=\sqrt{C_{\operatorname{{opt}}}\cdot d/4\gamma}\wedge 1 gives

𝖽𝖾𝖼γ()2Coptdγ+12γ3Coptdγ\displaystyle\mathsf{dec}_{\gamma}(\mathcal{F})\leq 2\sqrt{\frac{C_{\operatorname{{opt}}}\cdot d}{\gamma}}+\frac{1}{2\gamma}\leq 3\sqrt{\frac{C_{\operatorname{{opt}}}\cdot d}{\gamma}}

whenever ε<1\varepsilon<1. On the other hand, when ε=1\varepsilon=1, this bound holds trivially. ∎
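For concreteness, here is a minimal sketch of the exploration distribution constructed in this proof, with the stated choice of $\varepsilon$; the array bookkeeping (aligning $\operatorname{supp}(q^{\operatorname{opt}})$ and $\widehat{a}$ into a single array) is hypothetical.

```python
import numpy as np

def lemma3_distribution(q_opt, a_hat_index, C_opt, d, gamma):
    """Sketch of the mixture p = eps * q_opt + (1 - eps) * I_{a_hat}, with
    eps = min(sqrt(C_opt * d / (4 * gamma)), 1) as in the proof above.
    `q_opt` is an array over supp(q_opt) union {a_hat}; `a_hat_index`
    locates the greedy action a_hat within that array."""
    eps = min(np.sqrt(C_opt * d / (4.0 * gamma)), 1.0)
    p = eps * np.asarray(q_opt, dtype=float)
    p[a_hat_index] += 1.0 - eps
    return p

# Example: C = 2 gives C_opt = 4d by Lemma 2; here d = 5 and gamma = 100.
p = lemma3_distribution(np.array([0.2] * 5 + [0.0]), a_hat_index=5,
                        C_opt=20, d=5, gamma=100.0)
assert np.isclose(p.sum(), 1.0)
```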

A.2 Proof of Theorem 1

See 1

Proof of Theorem 1.

Consider γ1\gamma\geq 1. Combining Lemma 3 with Lemma 7, we have

𝐑𝐞𝐠𝖢𝖡(T)3TCoptdγ+2γ𝐑𝐞𝐠𝖲𝗊(T)+64γlog(2δ1)+8Tlog(2δ1).\displaystyle\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T)\leq 3T\cdot\sqrt{\frac{C_{\operatorname{{opt}}}\cdot d}{\gamma}}+2\gamma\cdot\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+64\gamma\cdot\log(2\delta^{-1})+\sqrt{8T\log(2\delta^{-1})}.

The regret bound in Theorem 1 immediately follows by choosing

γ=(3TCoptd2𝐑𝐞𝐠𝖲𝗊(T)+64log(2δ1))2/31.\gamma=\left(\frac{3T\sqrt{C_{\operatorname{{opt}}}\cdot d}}{2\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+64\log(2\delta^{-1})}\right)^{2/3}\vee 1.

In particular, when Algorithm 5 is invoked as a subroutine with parameter C=2C=2, Lemma 2 implies that we may take Copt4dC_{\operatorname{{opt}}}\leq{}4d.

Computational complexity. We now bound the per-round computational complexity of Algorithm 1 when Algorithm 5 is used as a subroutine to compute the approximate optimal design. Outside of the call to Algorithm 5, Algorithm 1 uses $O(1)$ calls to $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$ to obtain $\widehat{g}_{t}(x_{t})\in\mathbb{R}^{d}$ and to update $\widehat{f}_{t}$, and uses a single call to $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$ to compute $\widehat{a}_{t}$. With the optimal design $q^{\operatorname{opt}}_{t}$ returned by Algorithm 5 (represented as a barycentric spanner), sampling from $p_{t}$ takes at most $O(d)$ time, since $\lvert\operatorname{supp}(p_{t})\rvert\leq d+1$. Thus, the total runtime outside of Algorithm 5 adds up to $O(\mathcal{T}_{\mathsf{Sq}}+\mathcal{T}_{\mathsf{Opt}}+d)$. In terms of memory, calling $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$ and $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$ takes $O(\mathcal{M}_{\mathsf{Sq}}+\mathcal{M}_{\mathsf{Opt}})$ units, and maintaining the distribution $p_{t}$ (the barycentric spanner) takes $O(d)$ units, so the maximum memory (outside of Algorithm 5) is $O(\mathcal{M}_{\mathsf{Sq}}+\mathcal{M}_{\mathsf{Opt}}+d)$. The stated results follow from combining these bounds with the computational complexities analyzed in Lemma 6. ∎

Appendix B Proofs and Supporting Results from Section 4.1

In this section we provide supporting results concerning Algorithm 2 (Section B.1), and then give the proof of Theorem 2 (Section B.2).

B.1 Supporting Results

Lemma 8.

In Algorithm 2 (Eq. (6)), there exists a unique choice of λ>0\lambda>0 such that a𝒜pt(a)=1\sum_{a\in\mathcal{A}}p_{t}(a)=1, and its value lies in [12,1][\frac{1}{2},1].

Proof of Lemma 8.

Define h(λ):=asupp(qt)qt(a)λ+η(f^t(xt,a^t)f^t(xt,a))h(\lambda)\vcentcolon=\sum_{a\in\operatorname{supp}(q_{t})}\frac{q_{t}(a)}{\lambda+\eta(\widehat{f}_{t}(x_{t},\widehat{a}_{t})-\widehat{f}_{t}(x_{t},a))}. We first notice that h(λ)h(\lambda) is continuous and strictly decreasing over (0,)(0,\infty). We further have

h(1/2)\displaystyle h({1}/{2}) qt(a^t)1/2+η(f^t(xt,a^t)f^t(xt,a^t))1/21/2=1;\displaystyle\geq\frac{q_{t}(\widehat{a}_{t})}{{1}/{2}+\eta(\widehat{f}_{t}(x_{t},\widehat{a}_{t})-\widehat{f}_{t}(x_{t},\widehat{a}_{t}))}\geq\frac{1/2}{1/2}=1;

and

h(1)asupp(qt)qt(a)=12+12asupp(qtopt)qtopt(a)=1.\displaystyle h(1)\leq\sum_{a\in\operatorname{supp}(q_{t})}q_{t}(a)=\frac{1}{2}+\frac{1}{2}\sum_{a\in\operatorname{supp}(q^{\operatorname{{opt}}}_{t})}q^{\operatorname{{opt}}}_{t}(a)=1.

As a result, there exists a unique normalization constant λ[12,1]\lambda^{\star}\in[\frac{1}{2},1] such that h(λ)=1h(\lambda^{\star})=1. ∎
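Since $h$ is continuous and strictly decreasing with $h(1/2)\geq 1\geq h(1)$, the constant $\lambda^{\star}$ can be computed to any desired precision by bisection. A minimal sketch, with hypothetical arrays over $\operatorname{supp}(q_{t})$:

```python
import numpy as np

def normalization_constant(q, gaps, eta, tol=1e-12):
    """Find lambda in [1/2, 1] with sum_a q(a) / (lambda + eta * gap(a)) = 1,
    where gap(a) = f_hat(x, a_hat) - f_hat(x, a) >= 0. `q` and `gaps` are
    arrays over supp(q_t). h is continuous and strictly decreasing, and
    Lemma 8 gives h(1/2) >= 1 >= h(1), so bisection applies."""
    h = lambda lam: np.sum(q / (lam + eta * gaps))
    lo, hi = 0.5, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) > 1.0 else (lo, mid)
    return 0.5 * (lo + hi)

# Example: q = (1/2) q_opt + (1/2) indicator(a_hat), gap at a_hat is zero.
q = np.array([0.5, 0.25, 0.25])
gaps = np.array([0.0, 0.3, 0.8])
lam = normalization_constant(q, gaps, eta=2.0)
p = q / (lam + 2.0 * gaps)          # the distribution of Eq. (6)
assert np.isclose(p.sum(), 1.0)
```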

See 4

Proof of Lemma 4.

As in the proof of Lemma 3, we use the linear structure of the action space to rewrite the Decision-Estimation Coefficient 𝖽𝖾𝖼γ(;f^,x)\mathsf{dec}_{\gamma}(\mathcal{F};\widehat{f},x) as

𝖽𝖾𝖼γ(𝒢;g^,x):=infpΔ(𝒜)supa𝒜supg𝒢𝔼ap[ϕ(x,a)ϕ(x,a),g(x)γ(ϕ(x,a),g(x)g^(x))2],\displaystyle\mathsf{dec}_{\gamma}(\mathcal{G};\widehat{g},x)\vcentcolon=\inf_{p\in\Delta(\mathcal{A})}\sup_{a^{\star}\in\mathcal{A}}\sup_{g^{\star}\in\mathcal{G}}{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,a^{\star})-\phi(x,a),g^{\star}(x)\big{\rangle}-{\gamma}\cdot\big{(}\big{\langle}\phi(x,a),g^{\star}(x)-\widehat{g}(x)\big{\rangle}\big{)}^{2}\bigg{]},

where $\widehat{g}$ is such that $\widehat{f}=f_{\widehat{g}}$. We will bound the quantity above uniformly for all $x\in\mathcal{X}$ and $\widehat{g}:\mathcal{X}\to\mathbb{R}^{d}$.

Denote $\widehat{a}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}\langle\phi(x,a),\widehat{g}(x)\rangle$ and $a^{\star}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}\langle\phi(x,a),g^{\star}(x)\rangle$, and let $q^{\operatorname{opt}}\in\Delta(\mathcal{A})$ be a $C_{\operatorname{opt}}$-approximate optimal design with respect to the reweighted embedding $\bar{\phi}(x,\cdot)$. We use the setting $\eta=\frac{\gamma}{C_{\operatorname{opt}}\cdot d}$ throughout the proof. Recall that for the sampling distribution in Algorithm 2, we set $q\vcentcolon=\frac{1}{2}q^{\operatorname{opt}}+\frac{1}{2}\mathbb{I}_{\widehat{a}}$ and define

p(a)=q(a)λ+γCoptd(ϕ(x,a^)ϕ(x,a),g^(x)),\displaystyle p(a)=\frac{q(a)}{\lambda+\frac{\gamma}{C_{\operatorname{{opt}}}\cdot d}\left(\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}\right)}, (11)

where λ[12,1]\lambda\in[\frac{1}{2},1] is a normalization constant (cf. Lemma 8).

We decompose the regret of the distribution pp in Eq. 11 as

𝔼ap[ϕ(x,a)ϕ(x,a),g(x)]\displaystyle{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,a^{\star})-\phi(x,a),g^{\star}(x)\big{\rangle}\Big{]} =𝔼ap[ϕ(x,a^)ϕ(x,a),g^(x)]+𝔼ap[ϕ(x,a),g^(x)g(x)]\displaystyle={\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}\Big{]}+{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,a),\widehat{g}(x)-g^{\star}(x)\big{\rangle}\Big{]}
+ϕ(x,a),g(x)g^(x)+ϕ(x,a)ϕ(x,a^),g^(x).\displaystyle\quad+\big{\langle}\phi(x,a^{\star}),g^{\star}(x)-\widehat{g}(x)\big{\rangle}+\big{\langle}\phi(x,a^{\star})-\phi(x,\widehat{a}),\widehat{g}(x)\big{\rangle}. (12)

Writing out the expectation, the first term in Eq. 12 is upper bounded as follows.

𝔼ap[ϕ(x,a^)ϕ(x,a),g^(x)]\displaystyle{\mathbb{E}}_{a\sim p}\Big{[}\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}\Big{]} =asupp(qopt){a^}p(a)ϕ(x,a^)ϕ(x,a),g^(x)\displaystyle=\sum_{a\in\operatorname{supp}(q^{\operatorname{{opt}}})\cup\{\widehat{a}\}}p(a)\cdot\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}
<asupp(qopt)qopt(a)/2γCoptd(ϕ(x,a^)ϕ(x,a),g^(x))ϕ(x,a^)ϕ(x,a),g^(x)\displaystyle<\sum_{a\in\operatorname{supp}(q^{\operatorname{{opt}}})}\frac{q^{\operatorname{{opt}}}(a)/2}{\frac{\gamma}{C_{\operatorname{{opt}}}\cdot d}\left(\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}\right)}\cdot\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}
Coptd2γ,\displaystyle\leq\frac{C_{\operatorname{{opt}}}\cdot d}{2\gamma},

where we use that λ>0\lambda>0 in the second inequality (with the convention that 00=0\frac{0}{0}=0).

The second term in Eq. 12 can be upper bounded as in the proof of Lemma 3, by applying the AM-GM inequality:

$\mathbb{E}_{a\sim p}\Big[\langle\phi(x,a),\widehat{g}(x)-g^{\star}(x)\rangle\Big]\leq\frac{\gamma}{2}\cdot\mathbb{E}_{a\sim p}\Big[\big(\langle\phi(x,a),\widehat{g}(x)-g^{\star}(x)\rangle\big)^{2}\Big]+\frac{1}{2\gamma}.$

The third term in Eq. 12 is the most involved. To begin, we define V(p):=𝔼ap[ϕ(x,a)ϕ(x,a)]V(p)\vcentcolon={\mathbb{E}}_{a\sim p}[\phi(x,a)\phi(x,a)^{\top}] and apply the following standard bound:

$\langle\phi(x,a^{\star}),g^{\star}(x)-\widehat{g}(x)\rangle \leq \lVert\phi(x,a^{\star})\rVert_{V(p)^{-1}}\cdot\lVert g^{\star}(x)-\widehat{g}(x)\rVert_{V(p)}$
$\leq \frac{1}{2\gamma}\cdot\lVert\phi(x,a^{\star})\rVert^{2}_{V(p)^{-1}}+\frac{\gamma}{2}\cdot\lVert g^{\star}(x)-\widehat{g}(x)\rVert^{2}_{V(p)}$
$= \frac{1}{2\gamma}\cdot\lVert\phi(x,a^{\star})\rVert^{2}_{V(p)^{-1}}+\frac{\gamma}{2}\cdot\mathbb{E}_{a\sim p}\Big[\big(\langle\phi(x,a),g^{\star}(x)-\widehat{g}(x)\rangle\big)^{2}\Big],$ (13)

where the second line follows from the AM-GM inequality. The second term in Eq. 13 matches the desired bound, so it remains to bound the first term. Let $\check{q}^{\operatorname{opt}}$ be the following sub-probability measure:

qˇopt(a):=qopt(a)/2λ+γCoptd(ϕ(x,a^)ϕ(x,a),g^(x)),\displaystyle{\check{q}}^{\operatorname{{opt}}}(a)\vcentcolon=\frac{q^{\operatorname{{opt}}}(a)/2}{\lambda+\frac{\gamma}{C_{\operatorname{{opt}}}\cdot d}\left(\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}\right)},

and let V(qˇopt):=𝔼aqˇopt[ϕ(x,a)ϕ(x,a)]V({\check{q}}^{\operatorname{{opt}}})\vcentcolon={\mathbb{E}}_{a\sim{\check{q}}^{\operatorname{{opt}}}}[\phi(x,a)\phi(x,a)^{\top}]. We clearly have V(p)V(qˇopt)V(p)\succeq V({\check{q}}^{\operatorname{{opt}}}) from the definition of pp (cf. Eq. 11). We observe that

$V(\check{q}^{\operatorname{opt}}) = \sum_{a\in\operatorname{supp}(\check{q}^{\operatorname{opt}})}\check{q}^{\operatorname{opt}}(a)\,\phi(x,a)\phi(x,a)^{\top}$
$= \frac{1}{2}\cdot\sum_{a\in\operatorname{supp}(q^{\operatorname{opt}})}q^{\operatorname{opt}}(a)\,\bar{\phi}(x,a)\bar{\phi}(x,a)^{\top}\cdot\frac{1+\frac{\gamma}{C_{\operatorname{opt}}\cdot d}\big(\langle\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\rangle\big)}{\lambda+\frac{\gamma}{C_{\operatorname{opt}}\cdot d}\big(\langle\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\rangle\big)}$
$\succeq \frac{1}{2}\cdot\sum_{a\in\operatorname{supp}(q^{\operatorname{opt}})}q^{\operatorname{opt}}(a)\,\bar{\phi}(x,a)\bar{\phi}(x,a)^{\top} =\vcentcolon \frac{1}{2}\bar{V}(q^{\operatorname{opt}}),$

where the last line uses that $\lambda\leq 1$. Since $\bar{V}(q^{\operatorname{opt}})$ is positive-definite by construction, we have that $V(p)^{-1}\preceq V(\check{q}^{\operatorname{opt}})^{-1}\preceq 2\cdot\bar{V}(q^{\operatorname{opt}})^{-1}$. As a result,

$\frac{1}{2\gamma}\cdot\lVert\phi(x,a^{\star})\rVert^{2}_{V(p)^{-1}} \leq \frac{1}{\gamma}\cdot\lVert\phi(x,a^{\star})\rVert^{2}_{\bar{V}(q^{\operatorname{opt}})^{-1}}$
$= \frac{1+\frac{\gamma}{C_{\operatorname{opt}}\cdot d}\big(\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle\big)}{\gamma}\cdot\lVert\bar{\phi}(x,a^{\star})\rVert^{2}_{\bar{V}(q^{\operatorname{opt}})^{-1}}$
$\leq \frac{C_{\operatorname{opt}}\cdot d}{\gamma}+\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle,$ (14)

where the last line uses that $\lVert\bar{\phi}(x,a^{\star})\rVert^{2}_{\bar{V}(q^{\operatorname{opt}})^{-1}}\leq C_{\operatorname{opt}}\cdot d$, since $q^{\operatorname{opt}}$ is a $C_{\operatorname{opt}}$-approximate optimal design for the set $\{\bar{\phi}(x,a)\}_{a\in\mathcal{A}}$. Finally, we observe that the second term in Eq. 14 is cancelled out by the fourth term in Eq. 12.

Summarizing the bounds on the terms in Eq. 12 leads to:

$\mathsf{dec}_{\gamma}(\mathcal{F};\widehat{f},x)=\mathsf{dec}_{\gamma}(\mathcal{G};\widehat{g},x)\leq\frac{C_{\operatorname{opt}}\cdot d}{2\gamma}+\frac{1}{2\gamma}+\frac{C_{\operatorname{opt}}\cdot d}{\gamma}\leq\frac{2\,C_{\operatorname{opt}}\cdot d}{\gamma}.$ ∎

B.2 Proof of Theorem 2

See 2

Proof of Theorem 2.

Combining Lemma 4 with Lemma 7, we have

𝐑𝐞𝐠𝖢𝖡(T)2TCoptdγ+2γ𝐑𝐞𝐠𝖲𝗊(T)+64γlog(2δ1)+8Tlog(2δ1).\displaystyle\mathrm{\mathbf{Reg}}_{\mathsf{CB}}(T)\leq 2T\cdot{\frac{C_{\operatorname{{opt}}}\cdot d}{\gamma}}+2\gamma\cdot\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+64\gamma\cdot\log(2\delta^{-1})+\sqrt{8T\log(2\delta^{-1})}.

The theorem follows by choosing

γ=(CoptdT𝐑𝐞𝐠𝖲𝗊(T)+32log(2δ1))1/2.\displaystyle\gamma=\left(\frac{C_{\operatorname{{opt}}}\cdot d\,T}{\mathrm{\mathbf{Reg}}_{\mathsf{Sq}}(T)+32\log(2\delta^{-1})}\right)^{1/2}.

In particular, when Algorithm 3 is invoked as the subroutine with parameter $C=2$, we may take $C_{\operatorname{opt}}\leq 4d$.

Computational complexity. We now discuss the per-round computational complexity of Algorithm 2. We analyze a variant of the sampling rule specified in Section D.2 that does not require computation of the normalization constant. Outside of the runtime and memory required to compute the barycentric spanner using Algorithm 3, which are stated in Theorem 3, Algorithm 2 uses $O(1)$ calls to the oracle $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$ to obtain $\widehat{g}_{t}(x_{t})\in\mathbb{R}^{d}$ and update $\widehat{f}_{t}$, and uses a single call to $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$ to compute $\widehat{a}_{t}$. With $\widehat{g}_{t}(x_{t})$ and $\widehat{a}_{t}$ in hand, we can compute $\widehat{f}_{t}(x_{t},\widehat{a}_{t})-\widehat{f}_{t}(x_{t},a)=\langle\phi(x_{t},\widehat{a}_{t})-\phi(x_{t},a),\widehat{g}_{t}(x_{t})\rangle$ in $O(d)$ time for any $a\in\mathcal{A}$; thus, with the optimal design $q^{\operatorname{opt}}_{t}$ returned by Algorithm 3 (represented as a barycentric spanner), we can construct the sampling distribution $p_{t}$ in $O(d^{2})$ time. Sampling from $p_{t}$ takes $O(d)$ time since $\lvert\operatorname{supp}(p_{t})\rvert\leq d+1$. This adds up to runtime $O(\mathcal{T}_{\mathsf{Sq}}+\mathcal{T}_{\mathsf{Opt}}+d^{2})$. In terms of memory, calling $\mathrm{\mathbf{Alg}}_{\mathsf{Sq}}$ and $\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}$ takes $O(\mathcal{M}_{\mathsf{Sq}}+\mathcal{M}_{\mathsf{Opt}})$ units, and maintaining the distribution $p_{t}$ (the barycentric spanner) takes $O(d)$ units, so the maximum memory (outside of Algorithm 3) is $O(\mathcal{M}_{\mathsf{Sq}}+\mathcal{M}_{\mathsf{Opt}}+d)$. The stated results follow from combining these bounds with the computational complexities analyzed in Theorem 3, together with the choice of $\gamma$ described above. ∎

Appendix C Proofs and Supporting Results from Section 4.2

This section of the appendix is dedicated to the analysis of Algorithm 3. We begin with the analysis of its subroutine, Algorithm 4 (Section C.1).

Throughout this section of the appendix, we assume that the context x𝒳x\in\mathcal{X} and estimator g^:𝒳d\widehat{g}:\mathcal{X}\to\mathbb{R}^{d}—which are arguments to Algorithm 3 and Algorithm 4—are fixed.

C.1 Analysis of Algorithm 4 (Linear Optimization Oracle for Reweighted Embeddings)

A first step is to construct an (approximate) argmax oracle (after taking absolute value) with respect to the reweighted embedding $\bar{\phi}$. Recall that the goal of Algorithm 4 is to implement a linear optimization oracle for the reweighted embeddings constructed by Algorithm 3. That is, for any $\theta\in\mathbb{R}^{d}$, we would like to compute an action that (approximately) solves

$\operatorname*{arg\,max}_{a\in\mathcal{A}}\big\lvert\langle\bar{\phi}(x,a),\theta\rangle\big\rvert=\operatorname*{arg\,max}_{a\in\mathcal{A}}\langle\bar{\phi}(x,a),\theta\rangle^{2}.$

Define

$\iota(a)\vcentcolon=\langle\bar{\phi}(x,a),\theta\rangle^{2},\quad\text{and}\quad a^{\star}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}\iota(a).$ (15)

The main result of this section, Theorem 4, shows that Algorithm 4 identifies an action that achieves the maximum value in Eq. 15 up to a multiplicative constant.

Theorem 4.

Fix any η>0\eta>{}0, r(0,1)r\in(0,1). Suppose ζι(a)1\zeta\leq\sqrt{\iota(a^{\star})}\leq 1 for some ζ>0\zeta>0. Then Algorithm 4 identifies an action aˇ\check{a} such that ι(aˇ)22ι(a)\sqrt{\iota(\check{a})}\geq\frac{\sqrt{2}}{2}\cdot{}\sqrt{\iota(a^{\star})}, and does so with runtime O((𝒯𝖮𝗉𝗍+d)log(eηζ))O((\mathcal{T}_{\mathsf{Opt}}+d)\cdot\log(e\vee\frac{\eta}{\zeta})) and maximum memory O(𝖮𝗉𝗍+log(eηζ)+d)O(\mathcal{M}_{\mathsf{Opt}}+\log(e\vee\frac{\eta}{\zeta})+d).

Proof of Theorem 4.

Recall from Eq. 5 that we have

$\langle\bar{\phi}(x,a),\theta\rangle^{2}=\left(\frac{\langle\phi(x,a),\theta\rangle}{\sqrt{1+\eta\langle\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\rangle}}\right)^{2}=\frac{\langle\phi(x,a),\theta\rangle^{2}}{1+\eta\langle\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\rangle},$

where a^:=argmaxa𝒜ϕ(x,a),g^(x)\widehat{a}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}\left\langle\phi(x,a),\widehat{g}(x)\right\rangle; note that the denominator is at least 11. To proceed, we use that for any XX\in\mathbb{R} and Y2>0Y^{2}>0, we have

X2Y2=supε{2εXε2Y2}.\displaystyle\frac{X^{2}}{Y^{2}}=\sup_{\varepsilon\in{\mathbb{R}}}\left\{2\varepsilon X-\varepsilon^{2}Y^{2}\right\}.

Taking $X=\langle\phi(x,a),\theta\rangle$ and $Y^{2}=1+\eta\langle\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\rangle$ above (note that the supremum is attained at $\varepsilon=X/Y^{2}$), we can write

$\langle\bar{\phi}(x,a),\theta\rangle^{2} = \frac{\langle\phi(x,a),\theta\rangle^{2}}{1+\eta\langle\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\rangle}$
$= \sup_{\varepsilon\in\mathbb{R}}\Big\{2\varepsilon\langle\phi(x,a),\theta\rangle-\varepsilon^{2}\cdot\big(1+\eta\langle\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\rangle\big)\Big\}$ (16)
$= \sup_{\varepsilon\in\mathbb{R}}\Big\{\langle\phi(x,a),2\varepsilon\theta+\eta\varepsilon^{2}\widehat{g}(x)\rangle-\varepsilon^{2}-\eta\varepsilon^{2}\langle\phi(x,\widehat{a}),\widehat{g}(x)\rangle\Big\}.$ (17)

The key property of this representation is that for any fixed ε\varepsilon\in\mathbb{R}, Eq. 17 is a linear function of the unweighted embedding ϕ\phi, and hence can be optimized using 𝐀𝐥𝐠𝖮𝗉𝗍\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}. In particular, for any fixed ε\varepsilon\in\mathbb{R}, consider the following linear optimization problem, which can be solved by calling 𝐀𝐥𝐠𝖮𝗉𝗍\mathrm{\mathbf{Alg}}_{\mathsf{Opt}}:

argmaxa𝒜{2εϕ(x,a),θε2(1+ηϕ(x,a^)ϕ(x,a),g^(x))}=:argmaxa𝒜W(a;ε).\displaystyle\operatorname*{arg\,max}_{a\in\mathcal{A}}\Big{\{}2\varepsilon\big{\langle}\phi(x,a),\theta\big{\rangle}-\varepsilon^{2}\cdot\big{(}{1+\eta\big{\langle}\phi(x,\widehat{a})-\phi(x,a),\widehat{g}(x)\big{\rangle}}\big{)}\Big{\}}=\vcentcolon\operatorname*{arg\,max}_{a\in\mathcal{A}}W(a;\varepsilon). (18)

Define

ε:=ϕ(x,a),θ1+ηϕ(x,a^)ϕ(x,a),g^(x).\displaystyle\varepsilon^{\star}\vcentcolon=\frac{\langle\phi(x,a^{\star}),\theta\rangle}{{1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle}}. (19)

If $\varepsilon^{\star}$ were known (which is not the case, since $a^{\star}$ is unknown), we could set $\varepsilon=\varepsilon^{\star}$ in Eq. 18 and compute an action $\bar{a}\vcentcolon=\operatorname*{arg\,max}_{a\in\mathcal{A}}W(a;\varepsilon^{\star})$ using a single oracle call. We would then have $\iota(\bar{a})\geq W(\bar{a};\varepsilon^{\star})\geq W(a^{\star};\varepsilon^{\star})=\iota(a^{\star})$, which follows because $\varepsilon^{\star}$ is the maximizer in Eq. 16 for $a=a^{\star}$.

To get around the fact that ε\varepsilon^{\star} is unknown, Algorithm 4 performs a grid search over possible values of ε\varepsilon. To show that the procedure succeeds, we begin by bounding the range of ε\varepsilon^{\star}. With some rewriting, we have

$$|\varepsilon^{\star}|=\frac{\sqrt{\iota(a^{\star})}}{\sqrt{1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle}}.$$

Since $0<\zeta\leq\sqrt{\iota(a^{\star})}\leq 1$ and $1\leq 1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle\leq 1+2\eta$, we have

$$\varepsilon_{\min}:=\frac{\zeta}{\sqrt{1+2\eta}}\leq|\varepsilon^{\star}|\leq 1.$$

Algorithm 4 performs a $(3/4)$-multiplicative grid search over the intervals $[\varepsilon_{\min},1]$ and $[-1,-\varepsilon_{\min}]$, which uses $2\lceil\log_{4/3}(\varepsilon_{\min}^{-1})\rceil=O(\log(e\vee\frac{\eta}{\zeta}))$ grid points. It is immediate that the grid contains a point $\bar{\varepsilon}\in\mathbb{R}$ such that $\bar{\varepsilon}\cdot\varepsilon^{\star}>0$ and $\frac{3}{4}|\varepsilon^{\star}|\leq|\bar{\varepsilon}|\leq|\varepsilon^{\star}|$. Invoking Lemma 9 (stated and proven in the sequel) with $\bar{a}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}W(a;\bar{\varepsilon})$ implies that $\iota(\bar{a})\geq\frac{1}{2}\iota(a^{\star})$. To conclude, recall that Algorithm 4 outputs the maximizer

$$\check{a}:=\operatorname*{arg\,max}_{a\in\widehat{\mathcal{A}}}\iota(a),$$

where $\widehat{\mathcal{A}}$ is the set of argmax actions encountered during the grid search. Since $\bar{a}\in\widehat{\mathcal{A}}$, we have $\iota(\check{a})\geq\iota(\bar{a})\geq\frac{1}{2}\iota(a^{\star})$, as desired.

Computational complexity. Finally, we bound the computational complexity of Algorithm 4. The algorithm maintains a grid of $O(\log(e\vee\frac{\eta}{\zeta}))$ points, and hence calls the oracle $\mathbf{Alg}_{\mathsf{Opt}}$ $O(\log(e\vee\frac{\eta}{\zeta}))$ times in total; this takes $O(\mathcal{T}_{\mathsf{Opt}}\cdot\log(e\vee\frac{\eta}{\zeta}))$ time. Computing the final maximizer from the set $\widehat{\mathcal{A}}$, which contains $O(\log(e\vee\frac{\eta}{\zeta}))$ actions, takes $O(d\log(e\vee\frac{\eta}{\zeta}))$ time (computing each $\langle\bar{\phi}(x,a),\theta\rangle^{2}$ takes $O(d)$ time). Hence, the total runtime of Algorithm 4 adds up to $O((\mathcal{T}_{\mathsf{Opt}}+d)\cdot\log(e\vee\frac{\eta}{\zeta}))$. The maximum memory requirement is $O(\mathcal{M}_{\mathsf{Opt}}+\log(e\vee\frac{\eta}{\zeta})+d)$, which follows from calling $\mathbf{Alg}_{\mathsf{Opt}}$ and storing the grid $\mathcal{E}$, the set $\widehat{\mathcal{A}}$, and other quantities such as $\widehat{g}(x)$, $\theta$, $\bar{\varepsilon}$, $\phi(x,a)$, and $\bar{\phi}(x,a)$. ∎
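To make the reduction concrete, the following Python sketch implements the grid search, assuming a linear-maximization oracle `linmax` (playing the role of $\mathbf{Alg}_{\mathsf{Opt}}$) and a callable embedding `phi`; all names and the interface are illustrative rather than the paper's implementation.

```python
import numpy as np

def igw_argmax(linmax, phi, theta, g_hat, a_hat, eta, zeta):
    # linmax(w): assumed oracle returning argmax_{a in A} <phi(x, a), w>.
    # phi(a):    embedding phi(x, a) for the fixed context x (assumed callable).
    def objective_weights(eps):
        # a-dependent part of W(a; eps) from Eq. 17:
        # <phi(x, a), 2*eps*theta + eta*eps^2*g_hat>
        return 2.0 * eps * theta + eta * eps**2 * g_hat

    def iota(a):
        num = float(np.dot(phi(a), theta)) ** 2
        den = 1.0 + eta * float(np.dot(phi(a_hat) - phi(a), g_hat))
        return num / den

    eps_min = zeta / np.sqrt(1.0 + 2.0 * eta)
    grid, eps = [], 1.0
    while eps >= eps_min:           # (3/4)-multiplicative grid on [eps_min, 1],
        grid.extend([eps, -eps])    # mirrored onto [-1, -eps_min]
        eps *= 0.75
    candidates = [linmax(objective_weights(e)) for e in grid]  # one oracle call per point
    return max(candidates, key=iota)  # the output a_check
```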

C.1.1 Supporting Results

Lemma 9.

Let $\varepsilon^{\star}$ be defined as in Eq. 19. Suppose $\bar{\varepsilon}\in\mathbb{R}$ satisfies $\bar{\varepsilon}\cdot\varepsilon^{\star}>0$ and $\frac{3}{4}|\varepsilon^{\star}|\leq|\bar{\varepsilon}|\leq|\varepsilon^{\star}|$. Then, if $\bar{a}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}W(a;\bar{\varepsilon})$, we have $\iota(\bar{a})\geq\frac{1}{2}\iota(a^{\star})$.

Proof of Lemma 9.

First observe that, using the definition of $\iota(a)$ along with Eq. 16 and Eq. 18, we have $\iota(\bar{a})\geq W(\bar{a};\bar{\varepsilon})\geq W(a^{\star};\bar{\varepsilon})$, where the second inequality uses that $\bar{a}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}W(a;\bar{\varepsilon})$. Since $\bar{\varepsilon}\cdot\varepsilon^{\star}>0$, we have $\operatorname{sign}(\bar{\varepsilon}\cdot\langle\phi(x,a^{\star}),\theta\rangle)=\operatorname{sign}(\varepsilon^{\star}\cdot\langle\phi(x,a^{\star}),\theta\rangle)$. If $\operatorname{sign}(\bar{\varepsilon}\cdot\langle\phi(x,a^{\star}),\theta\rangle)\geq 0$, then since $\frac{3}{4}|\varepsilon^{\star}|\leq|\bar{\varepsilon}|\leq|\varepsilon^{\star}|$, we have

$$\begin{aligned}
W(a^{\star};\bar{\varepsilon}) &= 2\bar{\varepsilon}\langle\phi(x,a^{\star}),\theta\rangle-\bar{\varepsilon}^{2}\big(1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle\big)\\
&\geq \frac{3}{2}\varepsilon^{\star}\langle\phi(x,a^{\star}),\theta\rangle-(\varepsilon^{\star})^{2}\big(1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle\big)\\
&= \frac{1}{2}\cdot\frac{\langle\phi(x,a^{\star}),\theta\rangle^{2}}{1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle}=\frac{1}{2}\iota(a^{\star}),
\end{aligned}$$

where the first inequality uses that $1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle\geq 1$, and the second equality uses the definition of $\varepsilon^{\star}$.

On the other hand, when $\operatorname{sign}(\bar{\varepsilon}\cdot\langle\phi(x,a^{\star}),\theta\rangle)<0$, we similarly have

$$\begin{aligned}
W(a^{\star};\bar{\varepsilon}) &= 2\bar{\varepsilon}\langle\phi(x,a^{\star}),\theta\rangle-\bar{\varepsilon}^{2}\big(1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle\big)\\
&\geq 2\varepsilon^{\star}\langle\phi(x,a^{\star}),\theta\rangle-(\varepsilon^{\star})^{2}\big(1+\eta\langle\phi(x,\widehat{a})-\phi(x,a^{\star}),\widehat{g}(x)\rangle\big)=\iota(a^{\star}).
\end{aligned}$$

Summarizing both cases, we have $\iota(\bar{a})\geq\frac{1}{2}\iota(a^{\star})$. ∎

C.2 Proof of Theorem 3


Proof of Theorem 3.

We begin by examining the range of $\sqrt{\iota(a^{\star})}$ used in Theorem 4. Note that the linear function $\theta$ passed as an argument to Algorithm 3 represents the map $\bar{\phi}(x,a)\mapsto\det(\bar{\phi}(x,\mathcal{S}_{i}(a)))$, i.e., $\langle\bar{\phi}(x,a),\theta\rangle=\det(\bar{\phi}(x,\mathcal{S}_{i}(a)))$, where $\mathcal{S}_{i}(a):=(a_{1},\ldots,a_{i-1},a,a_{i+1},\ldots,a_{d})$. For the upper bound, we have

$$|\langle\bar{\phi}(x,a^{\star}),\theta\rangle|=|\det(\bar{\phi}(x,\mathcal{S}_{i}(a^{\star})))|\leq\prod_{a\in\mathcal{S}_{i}(a^{\star})}\|\bar{\phi}(x,a)\|_{2}\leq\sup_{a\in\mathcal{A}}\|\phi(x,a)\|_{2}^{d}\leq 1$$

by Hadamard's inequality and the fact that the reweighting in Eq. 5 satisfies $\|\bar{\phi}(x,a)\|_{2}\leq\|\phi(x,a)\|_{2}$. This shows that $\sqrt{\iota(a^{\star})}\leq 1$. For the lower bound, we first recall that in Algorithm 3, the set $\mathcal{S}$ is initialized so that $|\det(\phi(x,\mathcal{S}))|\geq r^{d}$, and thus $|\det(\bar{\phi}(x,\mathcal{S}))|\geq\bar{r}^{d}$, where $\bar{r}:=\frac{r}{\sqrt{1+2\eta}}$ accounts for the reweighting in Eq. 5. Next, we observe that as a consequence of the update rule in Algorithm 3, we are guaranteed that $|\det(\bar{\phi}(x,\mathcal{S}))|\geq\bar{r}^{d}$ across all rounds. Thus, whenever Algorithm 4 is invoked with the linear function $\theta$ described above, there must exist an action $a\in\mathcal{A}$ such that $|\langle\bar{\phi}(x,a),\theta\rangle|\geq\bar{r}^{d}$, which implies that $\sqrt{\iota(a^{\star})}\geq\bar{r}^{d}$; hence we can take $\zeta:=\bar{r}^{d}$ in Theorem 4.

We next bound the number of iterations of the while-loop before the algorithm terminates. Let $\bar{C}:=\frac{\sqrt{2}}{2}\cdot C>1$. At each iteration of Algorithm 3 (beginning from line 3), one of two outcomes occurs:

1. We find an index $i\in[d]$ and an action $a\in\mathcal{A}$ such that $|\det(\bar{\phi}(x,\mathcal{S}_{i}(a)))|>\bar{C}|\det(\bar{\phi}(x,\mathcal{S}))|$, and set $a_{i}\leftarrow a$.

2. We conclude that $\sup_{a\in\mathcal{A}}\max_{i\in[d]}|\det(\bar{\phi}(x,\mathcal{S}_{i}(a)))|\leq C|\det(\bar{\phi}(x,\mathcal{S}))|$ and terminate the algorithm.

We observe that (i) the initial set $\mathcal{S}$ has $|\det(\bar{\phi}(x,\mathcal{S}))|\geq\bar{r}^{d}$ with $\bar{r}:=\frac{r}{\sqrt{1+2\eta}}$ (as discussed above), (ii) $\sup_{\mathcal{S}\subseteq\mathcal{A},|\mathcal{S}|=d}|\det(\bar{\phi}(x,\mathcal{S}))|\leq 1$ by Hadamard's inequality, and (iii) each update of $\mathcal{S}$ increases the absolute determinant by a factor of at least $\bar{C}$. Thus, for any fixed $C>\sqrt{2}$, Algorithm 3 is guaranteed to terminate within $O(d\log(e\vee\frac{\eta}{r}))$ iterations of the while-loop.
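Concretely, letting $T$ denote the total number of updates, facts (i)–(iii) combine to give

$$\bar{r}^{d}\cdot\bar{C}^{\,T}\leq\sup_{\mathcal{S}\subseteq\mathcal{A},|\mathcal{S}|=d}|\det(\bar{\phi}(x,\mathcal{S}))|\leq 1,\qquad\text{and hence}\qquad T\leq\frac{d\log(1/\bar{r})}{\log\bar{C}}=O\Big(d\log\Big(e\vee\frac{\eta}{r}\Big)\Big).$$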

We now argue correctness of Algorithm 3: upon termination, the set $\mathcal{S}$ is a $C$-approximate barycentric spanner with respect to the reweighted embedding $\bar{\phi}$. First, note that by Theorem 4, Algorithm 4 is guaranteed to identify an action $\check{a}\in\mathcal{A}$ such that $|\det(\bar{\phi}(x,\mathcal{S}_{i}(\check{a})))|>\bar{C}|\det(\bar{\phi}(x,\mathcal{S}))|$ whenever there exists an action $a^{\star}\in\mathcal{A}$ such that $|\det(\bar{\phi}(x,\mathcal{S}_{i}(a^{\star})))|>C|\det(\bar{\phi}(x,\mathcal{S}))|$; indeed, Theorem 4 guarantees $\iota(\check{a})\geq\frac{1}{2}\iota(a^{\star})$, and since $\iota$ corresponds to the squared determinant, the determinant degrades by at most a factor of $\sqrt{2}$, matching $\bar{C}=C/\sqrt{2}$. As a result, by Observation 2.3 in Awerbuch and Kleinberg (2008), if no update is made and Algorithm 3 terminates, we have identified a $C$-approximate barycentric spanner with respect to the embedding $\bar{\phi}$.

Computational complexity. We now provide the computational complexity analysis for Algorithm 3. We write $\bar{\Phi}_{\mathcal{S}}$ for the matrix whose $k$-th column is $\bar{\phi}(x,a_{k})$, where $a_{k}\in\mathcal{S}$.

  • Initialization. We first notice that, given $\widehat{g}(x)\in\mathbb{R}^{d}$ and $\widehat{a}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}\langle\phi(x,a),\widehat{g}(x)\rangle$, it takes $O(d)$ time to compute $\bar{\phi}(x,a)$ for any $a\in\mathcal{A}$. Thus, computing $\det(\bar{\Phi}_{\mathcal{S}})$ and $\bar{\Phi}_{\mathcal{S}}^{-1}$ takes $O(d^{2}+d^{\omega})=O(d^{\omega})$ time, where $O(d^{\omega})$ (with $2\leq\omega\leq 3$) denotes the cost of computing a matrix determinant/inverse. The maximum memory requirement is $O(d^{2})$, from storing $\{\bar{\phi}(x,a)\}_{a\in\mathcal{S}}$ and $\bar{\Phi}_{\mathcal{S}}^{-1}$.

  • Outer loops (lines 1-2). We have already shown that Algorithm 3 terminates within $O(d\log(e\vee\frac{\eta}{r}))$ iterations of the while-loop (line 2). It is also clear that the for-loop (line 2) is invoked at most $d$ times.

  • Computational complexity for lines 3-7. We discuss how to efficiently implement this part using rank-one updates, and analyze the computational complexity of each line below. The analysis largely follows the proof of Lemma 6.

    • Line 3. Using a rank-one update of the matrix determinant (as discussed in the proof of Lemma 6), we have

      $$\det(\bar{\phi}(x,a_{1}),\ldots,\bar{\phi}(x,a_{i-1}),Y,\bar{\phi}(x,a_{i+1}),\ldots,\bar{\phi}(x,a_{d}))=\langle Y,\theta\rangle,$$

      where $\theta=\det(\bar{\Phi}_{\mathcal{S}})\cdot(\bar{\Phi}_{\mathcal{S}}^{-1})^{\top}e_{i}$. Thus, whenever $\det(\bar{\Phi}_{\mathcal{S}})$ and $\bar{\Phi}_{\mathcal{S}}^{-1}$ are known, computing $\theta$ takes $O(d)$ time. The maximum memory requirement is $O(d^{2})$, from storing $\bar{\Phi}_{\mathcal{S}}^{-1}$.

    • Line 4. Once $\theta$ is computed, we can compute $a$ by invoking IGW-ArgMax (Algorithm 4). As discussed in Theorem 4, this step takes runtime $O((\mathcal{T}_{\mathsf{Opt}}\cdot d+d^{2})\cdot\log(e\vee\frac{\eta}{r}))$ and maximum memory $O(\mathcal{M}_{\mathsf{Opt}}+d\log(e\vee\frac{\eta}{r})+d)$, by taking $\zeta=\bar{r}^{d}$ as discussed above (so that $\log(e\vee\frac{\eta}{\zeta})=O(d\log(e\vee\frac{\eta}{r}))$).

    • Line 5. Once $\theta$ and $\det(\bar{\Phi}_{\mathcal{S}})$ are computed, checking the update criterion takes $O(d)$ time. The maximum memory requirement is $O(d)$, from storing $\bar{\phi}(x,a)$ and $\theta$.

    • Line 6. As discussed in the proof of Lemma 6, if an update $a_{i}\leftarrow a$ is made, we can update $\det(\bar{\Phi}_{\mathcal{S}})$ and $\bar{\Phi}_{\mathcal{S}}^{-1}$ using rank-one updates with $O(d^{2})$ time and memory (see the sketch after this list).

    Thus, using rank-one updates, the total runtime for lines 3-7 adds up to $O((\mathcal{T}_{\mathsf{Opt}}\cdot d+d^{2})\cdot\log(e\vee\frac{\eta}{r}))$, and the maximum memory requirement is $O(\mathcal{M}_{\mathsf{Opt}}+d^{2}+d\log(e\vee\frac{\eta}{r}))$.
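The following NumPy snippet sketches the rank-one bookkeeping referenced above; `replace_column` is a hypothetical helper (not from the paper's code) that maintains $\det(\bar{\Phi}_{\mathcal{S}})$ and $\bar{\Phi}_{\mathcal{S}}^{-1}$ via the matrix determinant lemma and Sherman-Morrison, and the assertions check the line-3 identity $\det(\ldots,Y,\ldots)=\langle Y,\theta\rangle$ and the line-6 update against direct recomputation.

```python
import numpy as np

def replace_column(Phi_inv, det_Phi, i, y):
    # Rank-one update for Phi' = Phi with column i replaced by y, in O(d^2):
    # det(Phi') = det(Phi) * (Phi^{-1} y)_i        (matrix determinant lemma)
    # Phi'^{-1} via Sherman-Morrison.
    u = Phi_inv @ y
    ratio = u[i]
    u_adj = u.copy()
    u_adj[i] -= 1.0                                # Phi^{-1} (y - Phi e_i)
    new_inv = Phi_inv - np.outer(u_adj, Phi_inv[i, :]) / ratio
    return new_inv, det_Phi * ratio

rng = np.random.default_rng(0)
d, i = 5, 2
Phi = rng.standard_normal((d, d))      # columns play the role of phi_bar(x, a_k)
y = rng.standard_normal(d)
Phi_inv, det_Phi = np.linalg.inv(Phi), np.linalg.det(Phi)

# Line 3: det with column i replaced by y equals <y, theta>,
# where theta = det(Phi) * (Phi^{-1})^T e_i.
theta = det_Phi * Phi_inv.T[:, i]
M = Phi.copy()
M[:, i] = y
assert np.isclose(np.linalg.det(M), y @ theta)

# Line 6: rank-one update matches recomputation from scratch.
new_inv, new_det = replace_column(Phi_inv, det_Phi, i, y)
assert np.allclose(new_inv, np.linalg.inv(M))
assert np.isclose(new_det, np.linalg.det(M))
```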

To summarize, Algorithm 3 has runtime $O((\mathcal{T}_{\mathsf{Opt}}\cdot d^{3}+d^{4})\cdot\log^{2}(e\vee\frac{\eta}{r}))$ and uses at most $O(\mathcal{M}_{\mathsf{Opt}}+d^{2}+d\log(e\vee\frac{\eta}{r}))$ units of memory. ∎

C.3 Efficient Initializations for Algorithm 3

In this section we discuss specific settings in which the initialization required by Algorithm 3 can be computed efficiently. For the first result, we let $\mathsf{Ball}(0,r):=\{x\in\mathbb{R}^{d}\mid\|x\|_{2}\leq r\}$ denote the ball of radius $r$ in $\mathbb{R}^{d}$.

Example 2.

Suppose that there exists $r\in(0,1)$ such that $\mathsf{Ball}(0,r)\subseteq\{\phi(x,a):a\in\mathcal{A}\}$. Then, choosing $\mathcal{S}\subseteq\mathcal{A}$ to consist of actions whose embeddings are $re_{1},\ldots,re_{d}$, we have $|\det(\phi(x,\mathcal{S}))|=r^{d}$.

The next example is stronger, and shows that we can efficiently compute a set with large determinant whenever such a set exists.

Example 3.

Suppose there exists a set $\mathcal{S}^{\star}\subseteq\mathcal{A}$ such that $|\det(\phi(x,\mathcal{S}^{\star}))|\geq\bar{r}^{d}$ for some $\bar{r}>0$. Then there exists an efficient algorithm that identifies a set $\mathcal{S}\subseteq\mathcal{A}$ with $|\det(\phi(x,\mathcal{S}))|\geq r^{d}$ for $r:=\frac{\bar{r}}{8d}$, and does so with runtime $O(\mathcal{T}_{\mathsf{Opt}}\cdot d^{2}\log d+d^{4}\log d)$ and memory $O(\mathcal{M}_{\mathsf{Opt}}+d^{2})$.

Proof of Example 3.

The guarantee is achieved by running Algorithm 5 with $C=2$. One can show that this strategy achieves the desired approximation guarantee by slightly generalizing the proof of a similar result in Mahabadi et al. (2019). In more detail, Mahabadi et al. (2019) study the problem of identifying a subset $\mathcal{S}\subseteq\mathcal{A}$ such that $|\mathcal{S}|=k$ and $\det(\Phi_{\mathcal{S}}^{\top}\Phi_{\mathcal{S}})$ is (approximately) maximized, where $\Phi_{\mathcal{S}}\in\mathbb{R}^{d\times|\mathcal{S}|}$ denotes the matrix whose columns are $\phi(x,a)$ for $a\in\mathcal{S}$. We consider the case $k=d$, and make the following observations.

  • We have $\det(\Phi_{\mathcal{S}}^{\top}\Phi_{\mathcal{S}})=(\det(\Phi_{\mathcal{S}}))^{2}=(\det(\phi(x,\mathcal{S})))^{2}$. Thus, maximizing $\det(\Phi_{\mathcal{S}}^{\top}\Phi_{\mathcal{S}})$ is equivalent to maximizing $|\det(\phi(x,\mathcal{S}))|$.

  • The Local Search Algorithm of Mahabadi et al. (2019) (Algorithm 4.1 therein) has the same update and termination condition as Algorithm 5. As a result, one can show that the conclusion of their Lemma 4.1 also applies to Algorithm 5. ∎

Appendix D Other Details for Experiments

D.1 Basic Details

Datasets

oneshotwiki (Singh et al., 2012; Vasnetsov, 2018) is a named-entity recognition task where contexts are text phrases preceding and following the mention text, and where actions are text phrases corresponding to the concept names. We use the Python package sentence-transformers (Reimers and Gurevych, 2019) to separately embed the text preceding and following the reference into $\mathbb{R}^{768}$, and then concatenate, resulting in a context embedding in $\mathbb{R}^{1536}$. We embed the action (mentioned entity) text into $\mathbb{R}^{768}$ and then use SVD on the collection of embedded actions to reduce the dimensionality to $\mathbb{R}^{50}$. The reward function is an indicator for whether the action corresponds to the actual entity mentioned. oneshotwiki-311 (resp. oneshotwiki-14031) is the subset of this dataset obtained by taking all actions with at least 2000 (resp. 200) examples.
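As an illustration, the embedding pipeline can be reproduced along the following lines; the specific encoder name (`all-mpnet-base-v2`) and the placeholder data are assumptions of the sketch, since the text only specifies the package and the output dimensions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Placeholder data; the real dataset has thousands of mentions/entities.
pre_texts = ["The novel was written by", "The treaty was signed in"]
post_texts = ["in 1949.", "after the war."]
entity_names = ["George Orwell", "Paris", "London", "Leo Tolstoy"]

model = SentenceTransformer("all-mpnet-base-v2")   # assumed 768-dim encoder

# Contexts: embed text before/after the mention separately, then concatenate.
contexts = np.concatenate(
    [model.encode(pre_texts), model.encode(post_texts)], axis=1)  # (n, 1536)

# Actions: embed entity names, then reduce to 50 dims via SVD of the collection.
act_emb = model.encode(entity_names)               # (m, 768)
_, _, Vt = np.linalg.svd(act_emb, full_matrices=False)
actions = act_emb @ Vt[:50].T                      # (m, min(m, 50))
```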

amazon-3m (Bhatia et al., 2016) is an extreme multi-label dataset whose contexts are text phrases corresponding to the title and description of an item, and whose actions are integers corresponding to item tags. We separately embed the title and description phrases using sentence-transformers, which leads to a context embedding in $\mathbb{R}^{1536}$. Following the protocol of Sen et al. (2021), the first 50000 examples are fully supervised, and subsequent examples have bandit feedback. We use Hellinger PCA (Lebret and Collobert, 2014) on the supervised-data label co-occurrences to construct action embeddings in $\mathbb{R}^{800}$. Rewards are binary, and indicate whether a given item has the chosen tag. Actions that do not occur in the supervised portion of the dataset cannot be output by the model, but are retained for evaluation: for example, if during the bandit feedback phase an example consists solely of tags that did not occur during the supervised phase, the algorithm receives a reward of 0 for every feasible action on that example. For a typical seed, this results in roughly 890,000 feasible actions for the model. In the $(k=5,r=3)$ setup, we take the top-$k$ actions as the greedy slate, and then independently decide whether to explore for each exploration slot (the bottom $r$ slots); for exploration, we sample from the spanner set without replacement (see the sketch below).
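A sketch of the $(k=5,r=3)$ slate construction described above; the per-slot exploration probability `explore_prob` and the function names are assumptions, as the text does not pin them down.

```python
import numpy as np

def build_slate(scores, spanner, k=5, r=3, explore_prob=0.1, rng=None):
    # scores: predicted reward per feasible action; spanner: spanner action indices.
    rng = rng or np.random.default_rng()
    slate = list(np.argsort(scores)[::-1][:k])      # greedy top-k slate
    pool = [a for a in spanner if a not in slate]   # spanner actions, used w/o replacement
    for slot in range(k - r, k):                    # bottom r slots are exploration slots
        if pool and rng.random() < explore_prob:    # independent explore decision per slot
            slate[slot] = pool.pop(rng.integers(len(pool)))
    return slate
```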

Regression functions and oracles

For bilinear models, regression functions take the form $f(x,a)=\langle\phi(a),Wx\rangle$, where $W$ is a matrix of learned parameters. For deep models, regression functions pass the original context through 2 residual leaky-ReLU layers before applying the bilinear layer, $f(x,a)=\langle\phi(a),W\bar{g}(x)\rangle$, where $\bar{g}$ is a learned two-layer neural network and $W$ is a matrix of learned parameters. For experiments on the oneshotwiki datasets, we add a learned bias term to the regression functions (shared across actions); for experiments on the amazon-3m dataset, we additionally add an action-dependent bias term obtained from the supervised examples. The online regression oracle is implemented using PyTorch's Adam optimizer with log loss (recall that rewards are 0/1).
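A minimal PyTorch sketch of the bilinear regression function and the online oracle, under assumed dimensions; the class and function names are illustrative rather than the paper's actual implementation.

```python
import torch
import torch.nn as nn

class BilinearScorer(nn.Module):
    # f(x, a) = <phi(a), W x> + b, with a bias b shared across actions.
    def __init__(self, ctx_dim=1536, act_dim=50):
        super().__init__()
        self.W = nn.Linear(ctx_dim, act_dim, bias=False)
        self.b = nn.Parameter(torch.zeros(1))

    def forward(self, x, phi_a):
        # x: (batch, ctx_dim); phi_a: (batch, act_dim); returns logits (batch,)
        return (phi_a * self.W(x)).sum(dim=-1) + self.b

model = BilinearScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()   # log loss on 0/1 rewards

def oracle_update(x, phi_a, reward):
    # One online regression step on observed (context, action, reward) data.
    opt.zero_grad()
    loss = loss_fn(model(x, phi_a), reward)
    loss.backward()
    opt.step()
```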

Hyperparameters

For each algorithm, we optimize hyperparameters using random search (Bergstra and Bengio, 2012). Specifically, hyperparameters are tuned by taking the best of 59 randomly selected configurations for a fixed seed (this seed is not used for evaluation). A seed determines dataset shuffling, initialization of regressor parameters, and the random choices made by any action sampling scheme.

Evaluation

We evaluate each algorithm on 32 seeds. All reported confidence intervals are 90% bootstrap CIs for the mean.
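For reference, the percentile bootstrap behind these intervals can be computed as in the following sketch; the resample count is an assumption.

```python
import numpy as np

def bootstrap_ci_mean(xs, level=0.90, n_boot=10_000, seed=0):
    # Percentile bootstrap: resample with replacement, take the mean each time.
    rng = np.random.default_rng(seed)
    xs = np.asarray(xs, dtype=float)
    means = rng.choice(xs, size=(n_boot, xs.size), replace=True).mean(axis=1)
    alpha = (1.0 - level) / 2.0
    return np.quantile(means, [alpha, 1.0 - alpha])

# e.g., xs = per-seed final rewards over the 32 evaluation seeds
```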

D.2 Practical Modification to Sampling Procedure in SpannerIGW

For experiments with SpannerIGW, we slightly modify the action sampling distribution so as to avoid computing the normalization constant $\lambda$. First, we modify the weighted embedding scheme given in Eq. 5 as follows:

$$\bar{\phi}(x_{t},a):=\frac{\phi(x_{t},a)}{\sqrt{1+d+\frac{\gamma}{4d}\big(\widehat{f}_{t}(x_{t},\widehat{a}_{t})-\widehat{f}_{t}(x_{t},a)\big)}}.$$

We obtain a $4d$-approximate optimal design for the reweighted embeddings by first computing a $2$-approximate barycentric spanner $\mathcal{S}$, then taking $q_{t}^{\mathrm{opt}}:=\mathrm{unif}(\mathcal{S})$. To proceed, let $\widehat{a}_{t}:=\operatorname*{arg\,max}_{a\in\mathcal{A}}\widehat{f}_{t}(x_{t},a)$ and $\bar{d}:=|\mathcal{S}\cup\{\widehat{a}_{t}\}|$. We construct the sampling distribution $p_{t}\in\Delta(\mathcal{A})$ as follows:

  • Set $p_{t}(a):=\frac{1}{\bar{d}+\frac{\gamma}{4d}(\widehat{f}_{t}(x_{t},\widehat{a}_{t})-\widehat{f}_{t}(x_{t},a))}$ for each $a\in\mathcal{S}$.

  • Assign the remaining probability mass to $\widehat{a}_{t}$.

With a small modification to the proof of Lemma 4, one can show that this construction certifies that $\mathsf{dec}_{\gamma}(\mathcal{F})=O\big(\frac{d^{2}}{\gamma}\big)$, so the regret bound in Theorem 2 holds up to a constant factor. Similarly, with a small modification to the proof of Theorem 3, we can also show that, with respect to this new embedding, Algorithm 3 has $O((\mathcal{T}_{\mathsf{Opt}}\cdot d^{3}+d^{4})\cdot\log^{2}\big(\frac{d+\gamma/d}{r}\big))$ runtime and $O(\mathcal{M}_{\mathsf{Opt}}+d^{2}+d\log\big(\frac{d+\gamma/d}{r}\big))$ memory.
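A sketch of this sampling scheme follows; the interface (`f_hat` as a mapping from actions to predicted rewards) and the function name are hypothetical, not from the paper's code.

```python
import numpy as np

def spanner_igw_sample(spanner, f_hat, a_hat, gamma, d, rng):
    # spanner: actions in the 2-approximate barycentric spanner S.
    # f_hat[a]: predicted reward for action a; a_hat: greedy action.
    support = list(dict.fromkeys(list(spanner) + [a_hat]))  # S union {a_hat}
    d_bar = len(support)
    probs = {a: 0.0 for a in support}
    for a in spanner:
        probs[a] = 1.0 / (d_bar + (gamma / (4.0 * d)) * (f_hat[a_hat] - f_hat[a]))
    probs[a_hat] += 1.0 - sum(probs.values())   # remaining mass goes to a_hat
    acts = list(probs)
    return acts[rng.choice(len(acts), p=[probs[a] for a in acts])]
```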

D.3 Timing Information

Table 4 contains timing information for the oneshotwiki-14031 dataset with a bilinear model. The CPU timings are most relevant for practical scenarios such as information retrieval and recommendation systems, while the GPU timings are relevant for scenarios where simulation is possible. Timings for SpannerGreedy do not include the one-time cost of computing the spanner set. Timings for all algorithms use precomputed context and action embeddings. For all algorithms but SpannerIGW, the timings reflect the major bottleneck of computing the argmax action, since all subsequent steps take $O(1)$ time with respect to $|\mathcal{A}|$. In particular, SquareCB is implemented using rejection sampling, which does not require explicit construction of the action distribution. For SpannerIGW, the additional overhead is due to the time required to construct an approximate optimal design for each example.
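As an illustration of the rejection-sampling trick, the following sketch draws from an inverse-gap-weighted distribution without computing the normalizer. This is a generic IGW sampler under assumed names, not necessarily the exact distribution used in the experiments (which may assign the leftover mass to the greedy action instead).

```python
import numpy as np

def igw_rejection_sample(f_hat, f_hat_max, n_actions, gamma, rng):
    # Draws a ~ p with p(a) proportional to 1 / (n_actions + gamma * gap(a)),
    # via rejection: uniform proposal, accept with prob w(a) / max_a w(a).
    while True:
        a = int(rng.integers(n_actions))
        gap = f_hat_max - f_hat(a)
        if rng.random() < 1.0 / (1.0 + (gamma / n_actions) * gap):
            return a
```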

Table 4: Per-example inference timings for oneshotwiki-14031. CPU timings use batch size 1 on an Azure STANDARD_D4_V2 machine. GPU timings use batch size 1024 on an Azure STANDARD_NC6S_V2 (Nvidia P100-based) machine.
Algorithm CPU GPU
ε-Greedy 2 ms 10 μs
SpannerGreedy 2 ms 10 μs
SquareCB 2 ms 10 μs
SpannerIGW 25 ms 180 μs

D.4 Additional Figures

Figure 1: amazon-3m results for SpannerGreedy. Confidence intervals are rendered, but are too small to visualize. For $(k=1)$, the final CI is $[0.1041,0.1046]$, and for $(k=5,r=3)$, the final CI is $[0.438,0.440]$.