
Avoiding Catastrophe in Online Learning by Asking for Help

Benjamin Plaut    Hanlin Zhu    Stuart Russell
Abstract

Most learning algorithms with formal regret guarantees assume that all mistakes are recoverable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are catastrophic, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe in that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We also assume that the agent can transfer knowledge between similar inputs. We first show that in general, any algorithm either queries the mentor at a linear rate or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Although our focus is the product of payoffs, we provide matching bounds for the typical additive regret. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.

online learning, AI safety, asking for help, irreversibility

1 Introduction

There has been mounting concern over catastrophic risk from AI, including but not limited to autonomous weapon accidents (Abaimov & Martellini, 2020), bioterrorism (Mouton et al., 2024), cyberattacks on critical infrastructure (Guembe et al., 2022), and loss of control (Bengio et al., 2024). See Critch & Russell (2023) and Hendrycks et al. (2023) for taxonomies of societal-scale AI risks. In this paper, we use “catastrophe” to refer to any kind of irreparable harm. In addition to the large-scale risks above, our definition also covers smaller-scale (yet still unacceptable) incidents such as serious medical errors (Rajpurkar et al., 2022), crashing a robotic vehicle (Kohli & Chadha, 2020), and discriminatory sentencing (Villasenor & Foggo, 2020).

The gravity of these risks contrasts starkly with the dearth of theoretical understanding of how to avoid them. Nearly all of learning theory explicitly or implicitly assumes that no single mistake is too costly. We focus on online learning, where an agent repeatedly interacts with an unknown environment and uses its observations to gradually improve its performance. Most online learning algorithms essentially try all possible behaviors and see what works well. We do not want autonomous weapons or surgical robots to try all possible behaviors.

More precisely, trial-and-error-style algorithms only work when catastrophe is assumed to be impossible. This assumption can take multiple forms, such as that the agent’s actions do not affect future inputs (e.g., Slivkins, 2011), that no action has irreversible effects (e.g., Jaksch et al., 2010) or that the environment is reset at the start of each “episode” (e.g., Azar et al., 2017). One could train an agent entirely in a controlled lab setting where one of those assumptions does hold, but we argue that sufficiently general agents will inevitably encounter novel scenarios when deployed in the real world. Machine learning models often behave unpredictably in unfamiliar environments (see, e.g., Quiñonero-Candela et al., 2022), and we do not want AI biologists or robotic vehicles to behave unpredictably.

The goal of this paper is to understand the conditions under which it is possible to formally guarantee avoidance of catastrophe in online learning. Certainly some conditions are necessary, because the problem is hopeless if the agent must rely purely on trial-and-error: any untried action could lead to paradise or disaster and the agent has no way to predict which. In the real world, however, one need not learn through pure trial-and-error: one can also ask for help. We think it is critical for high-stakes AI applications to employ a designated supervisor who can be asked for help. Examples include a human doctor supervising AI doctors, a robotic vehicle with a human driver who can take over in emergencies, autonomous weapons with a human operator, and many more. We hope that our work constitutes a step in the direction of practical safety guarantees for such applications.

1.1 Our model

We propose an online learning model of avoiding catastrophe with mentor help. On each time step, the agent observes an input, selects an action or queries the mentor, and obtains a payoff. Each payoff represents the probability of avoiding catastrophe on that time step (conditioned on no prior catastrophe). The agent's goal is to maximize the product of payoffs, which is equal to the overall probability of avoiding catastrophe. (Conditioning on no prior catastrophe means we do not need to assume that these probabilities are independent, and if catastrophe has already occurred, this time step does not matter; this is due to the chain rule of probability.) As is standard in online learning, we consider the product of payoffs obtained while learning, not the product of payoffs of some final policy.
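To make the chain rule argument concrete, here is a minimal Python sketch (with arbitrary made-up per-round probabilities, not taken from the paper) verifying by simulation that the product of the conditional per-round survival probabilities equals the overall probability of avoiding catastrophe.

```python
import math
import random

# Hypothetical per-round probabilities of avoiding catastrophe,
# each conditioned on no catastrophe so far (values chosen arbitrarily).
payoffs = [0.999, 0.95, 0.98, 0.999, 0.97]

# Chain rule: the product of the conditional survival probabilities
# is the overall probability of avoiding catastrophe.
overall = math.prod(payoffs)

# Monte Carlo check: simulate the process round by round.
trials = 200_000
survived = sum(
    all(random.random() < p for p in payoffs) for _ in range(trials)
)

print(f"product of payoffs:  {overall:.4f}")
print(f"empirical survival:  {survived / trials:.4f}")
```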

The (possibly suboptimal) mentor has a stationary policy, and when queried, the mentor illustrates their policy's action for the current input. We want the agent's regret – defined as the gap between the mentor's performance and the agent's performance – to go to zero as the time horizon $T$ grows. In other words, with enough time, the agent should avoid catastrophe nearly as well as the mentor. We also expect the agent to become self-sufficient over time: the number of queries to the mentor should be sublinear in $T$, or equivalently, the rate of querying the mentor should go to zero.

1.2 Our assumptions

The agent needs some way to make inferences about unqueried inputs in order to decide when to ask for help. Much past work has used Bayesian inference, which suffers from tractability issues in complex environments. (For the curious reader, Betancourt (2018) provides a thorough treatment; see also Section 2.) We instead assume that the mentor policy satisfies a novel property that we call local generalization: informally, if the mentor told us that an action was safe for a similar input, then that action is probably also safe for the current input. For example, if it is safe to ignore a 3 mm spot on an X-ray, it is likely (but not certainly) also safe to ignore a 3.1 mm spot with the same density, location, etc. Unlike Bayesian inference, local generalization only requires computing distances and is compatible with any input space that admits a distance metric. See Section 5.2 for further discussion of local generalization.

Unlike the standard online learning model, we assume that the agent does not observe payoffs. This is because the payoff in our model represents the chance of avoiding catastrophe on that time step. In the real world, one only observes whether catastrophe occurred, not its probability. (One may be able to detect “close calls” in some cases, but observing the precise probability seems unrealistic.)

1.3 Standard online learning

To properly understand our results, it is important to understand standard online learning. In the standard model, the agent observes an input on each time step and must choose an action. An adversary then reveals the correct action, which results in some payoff to the agent. The goal is sublinear regret with respect to the sum of payoffs, or equivalently, the average regret per time step should go to 0 as $T \to \infty$. Table 1 delineates the precise differences between the standard model and our model.

If the adversary's choices are unconstrained, the problem is hopeless: if the adversary determines the correct action on each time step randomly and independently, the agent can do no better than random guessing. However, sublinear regret becomes possible if (1) the hypothesis class has finite Littlestone dimension (Littlestone, 1988), or (2) the hypothesis class has finite VC dimension (Vapnik & Chervonenkis, 1971) and the input is $\sigma$-smooth (Haghtalab et al., 2024). (Informally, $\sigma$-smoothness means the adversary chooses a distribution over inputs instead of a precise input; see Section 3 for the formal definition.)

The goal of sublinear regret in online learning implicitly assumes catastrophe is impossible: the agent can make arbitrarily many (and arbitrarily costly) mistakes as long as the average regret per time step goes to 0. In contrast, we demand subconstant regret: the total probability of catastrophe should go to 0. Furthermore, standard online learning allows the agent to observe payoffs on every time step, while our agent only receives feedback on time steps with queries. However, the combination of a mentor and local generalization allows our agent to learn without trying actions directly, which is enough to offset all of the above disadvantages.

Table 1: Comparison between the standard online learning model and our model.
                      Standard model     Our model
Objective             Sum of payoffs     Product of payoffs
Regret goal           Sublinear          Subconstant
Feedback              Every time step    Only from queries
Mentor                No                 Yes
Local generalization  No                 Yes

1.4 Our results

At a high level, we show that avoiding catastrophe with the help of a mentor and local generalization is no harder than online learning without catastrophic risk.

We first show that in general, any algorithm with sublinear queries to the mentor has unbounded regret in the worst-case (Theorem 4.1). As a corollary, even when the mentor can avoid catastrophe with certainty, any algorithm either needs extensive supervision or is nearly guaranteed to cause catastrophe (Corollary 4.1.1).

Our primary result is a simple algorithm whose total regret and rate of querying the mentor both go to 0 as $T \to \infty$ when either (1) the mentor policy class has finite Littlestone dimension or (2) the mentor policy class has finite VC dimension and the input sequence is $\sigma$-smooth (Theorem 5.2). Conceptually, the algorithm has two components: (1) for “in-distribution” inputs, run a standard online learning algorithm (adjusted to account for only receiving feedback in response to queries), and (2) for “out-of-distribution” inputs, ask for help. Our algorithm can handle an unbounded input space and does not need to know the local generalization constant.

Although we focus on the product of payoffs, we show that the results above (both positive and negative) hold for the typical additive regret as well. In fact, we show that multiplicative regret and additive regret are tightly related in our setting (Lemma A.1).

In summary, the combination of local generalization and a mentor allows us to reduce the regret by an entire factor of $T$, resulting in subconstant regret (multiplicative or additive) instead of the typical sublinear regret.

2 Related work

Learning with irreversible costs. Despite the ubiquity of irreversible costs in the real world, theoretical work on this topic remains limited. This may be due to the fundamental modeling question of how the agent should learn about novel inputs or actions without trying them directly.

The most common approach is to allow the agent to ask for help. This alone is insufficient, however: the agent must have some way to decide when to ask for help. A popular solution is to perform Bayesian inference on the world model, but this has two tricky requirements: (1) a prior distribution which contains the true world model (or an approximation), and (2) an environment where computing (or approximating) the posterior is tractable. A finite set of possible environments satisfies both conditions but is unrealistic in many real-world scenarios. In contrast, our algorithm can handle an uncountable policy class and a continuous unbounded input space, which is crucial for many real-world scenarios in which one never sees the exact same input twice.

Bayesian inference combined with asking for help is studied by Cohen et al. (2021); Cohen & Hutter (2020); Kosoy (2019); Mindermann et al. (2018). We also mention Hadfield-Menell et al. (2017); Moldovan & Abbeel (2012); Turchetta et al. (2016), who utilize Bayesian inference in the context of safe (online) reinforcement learning without asking for help (and without regret bounds).

We are only aware of two papers that theoretically address irreversibility without Bayesian inference: Grinsztajn et al. (2021) and Maillard et al. (2019). The former proposes to sample trajectories and learn reversibility based on temporal consistency between states: intuitively, if $s_1$ always precedes $s_2$, we can infer that $s_1$ is unreachable from $s_2$. Although the paper theoretically grounds this intuition, there is no formal regret guarantee. The latter presents an algorithm which asks for help in the form of rollouts from the current state. However, the regret bound and number of rollouts are both linear in the worst case, due to the dependence on the $\gamma^*$ parameter which roughly captures how bad an irreversible action can be. In contrast, our algorithm achieves good regret even when actions are maximally bad.

To our knowledge, we are the first to provide an algorithm which formally guarantees avoidance of catastrophe (with high probability) without Bayesian inference. We are also not aware of prior results comparable to our negative result, including in the Bayesian regime.

Safe reinforcement learning (RL). The safe RL problem is typically formulated as a constrained Markov Decision Process (CMDP) (Altman, 2021). In CMDPs, the agent must maximize reward while also satisfying safety constraints. See Gu et al. (2024); Zhao et al. (2023); Wachi et al. (2024) for surveys. The two most relevant safe RL papers are Liu et al. (2021) and Stradi et al. (2024), both of which provide algorithms guaranteed to satisfy initially unknown safety constraints. Since neither paper allows external help, they require strong assumptions to make the problem tractable: the aforementioned results assume that the agent (1) knows a strictly safe policy upfront (i.e., a policy which satisfies the safety constraints with slack), (2) is periodically reset, and (3) observes the safety costs. In contrast, our agent has no prior knowledge, is never reset, and never observes payoffs.

Online learning. See Cesa-Bianchi & Lugosi (2006) and Chapter 21 of Shalev-Shwartz & Ben-David (2014) for introductions to online learning. A classical result states that sublinear regret is possible if and only if the hypothesis class has finite Littlestone dimension (Littlestone, 1988). However, even some simple hypothesis classes have infinite Littlestone dimension, such as the class of thresholds on $[0,1]$ (Example 21.4 in Shalev-Shwartz & Ben-David, 2014). Recently, Haghtalab et al. (2024) showed that if the adversary only chooses a distribution over inputs rather than the precise input, only finite VC dimension (Vapnik & Chervonenkis, 1971) is needed for sublinear regret. Specifically, they assume that each input is sampled from a distribution whose concentration is upper bounded by $\frac{1}{\sigma}$ times the uniform distribution. This framework is known as smoothed analysis, originally due to Spielman & Teng (2004).

Multiplicative objectives. Although online learning traditionally studies the sum of payoffs, there is some work which aims to maximize the product of payoffs (or equivalently, the sum of logarithms). See, e.g., Chapter 9 of Cesa-Bianchi & Lugosi (2006). However, these regret bounds are still sublinear in $T$, in comparison to our subconstant regret bounds. Also, like most online learning work, those results assume that payoffs are observed on every time step. In contrast, our agent only receives feedback in response to queries (Table 1) and never observes payoffs. Barman et al. (2023) studied a multiplicative objective in a multi-armed bandit context, but their objective is the geometric mean of payoffs instead of the product. Interpreted in our context, their regret bounds imply that the average chance of catastrophe goes to zero, while we guarantee that the total chance of catastrophe goes to zero. This distinction is closely related to the difference between subconstant and sublinear regret.

Active learning and imitation learning. Our assumption that the agent only receives feedback in response to queries falls under the umbrella of active learning (Hanneke, 2014). This contrasts with passive learning, where the agent receives feedback automatically. The way our agent learns from the mentor is also reminiscent of imitation learning (Osa et al., 2018). Although ideas from these areas could be useful in our setting, we are not aware of any results from that literature which account for irreversible costs.

3 Model

Inputs. Let $\mathbb{N}$ denote the strictly positive integers and let $T \in \mathbb{N}$ be the time horizon. Let $\boldsymbol{x} = (x_1, x_2, \dots, x_T) \in \mathcal{X}^T$ be the sequence of inputs. In the fully adversarial setting, each $x_t$ can have arbitrary (possibly randomized) dependence on the events of prior time steps. In the smoothed setting, the adversary only chooses the distribution $\mathcal{D}_t$ from which $x_t$ is sampled. Formally, a distribution $\mathcal{D}$ over $\mathcal{X}$ is $\sigma$-smooth if $\mathcal{D}(S) \leq \frac{1}{\sigma} U(S)$ for any $S \subseteq \mathcal{X}$. (In the smoothed setting, we assume that $\mathcal{X}$ supports a uniform distribution $U$; for example, it suffices for $\mathcal{X}$ to have finite Lebesgue measure, which does not imply boundedness. Alternatively, $\sigma$-smoothness can be defined with respect to a different distribution; see Definition 1 of Block et al. (2022).) If each $x_t$ is sampled from a $\sigma$-smooth $\mathcal{D}_t$, we say that $\boldsymbol{x}$ is $\sigma$-smooth. The sequence $\boldsymbol{\mathcal{D}} = \mathcal{D}_1, \dots, \mathcal{D}_T$ can still be adaptive, i.e., the choice of $\mathcal{D}_t$ can depend on the events of prior time steps.
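For intuition, the following sketch (our own illustration, not part of the paper's formalism) checks $\sigma$-smoothness for a distribution over a uniform discretization of $\mathcal{X} = [0,1]$: since both measures are additive over cells, the condition $\mathcal{D}(S) \leq \frac{1}{\sigma} U(S)$ for all $S$ reduces to a per-cell check.

```python
import numpy as np

def is_sigma_smooth(cell_probs: np.ndarray, sigma: float) -> bool:
    """Check sigma-smoothness of a distribution over N equal-width cells of [0, 1].

    Under the uniform distribution each cell has mass 1/N, so D(S) <= U(S)/sigma
    for every S is equivalent to checking each cell individually.
    """
    uniform_mass = 1.0 / len(cell_probs)
    return bool(np.all(cell_probs <= uniform_mass / sigma))

# Example: a distribution supported on the first half of [0, 1], split into 100 cells.
probs = np.array([0.02] * 50 + [0.0] * 50)  # sums to 1
print(is_sigma_smooth(probs, sigma=0.5))    # True:  0.02 <= (1/100) / 0.5
print(is_sigma_smooth(probs, sigma=0.6))    # False: (1/100) / 0.6 < 0.02
```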

Actions and payoffs. Let $\mathcal{Y}$ be a finite set of actions. There also exists a special action $\tilde{y}$ which corresponds to querying the mentor. For $k \in \mathbb{N}$, let $[k] = \{1, 2, \dots, k\}$. On each time step $t \in [T]$, the agent must select an action $y_t \in \mathcal{Y} \cup \{\tilde{y}\}$, which generates a payoff. Let $\boldsymbol{y} = (y_1, \dots, y_T)$. We allow the payoff function to vary between time steps: let $\boldsymbol{\mu} = (\mu_1, \dots, \mu_T) \in (\mathcal{X} \times \mathcal{Y} \to [0,1])^T$ be the sequence of payoff functions. Then $\mu_t(x_t, y_t) \in [0,1]$ is the agent's payoff at time $t$. Like $\boldsymbol{\mathcal{D}}$, we allow $\boldsymbol{\mu}$ to be adaptive. Unless otherwise noted, all expectations are over any randomization in the agent's decisions, any randomization in $\boldsymbol{x}$, and any randomization in the adaptive choice of $\boldsymbol{\mu}$.

Asking for help. The mentor is endowed with a (possibly suboptimal) policy $\pi^m : \mathcal{X} \to \mathcal{Y}$. When action $\tilde{y}$ is chosen, the mentor informs the agent of the action $\pi^m(x_t)$ and the agent obtains payoff $\mu_t(x_t, \pi^m(x_t))$. For brevity, let $\mu_t^m(x) = \mu_t(x, \pi^m(x))$. The agent never observes payoffs: the only way to learn about $\boldsymbol{\mu}$ is by querying the mentor.

We would like an algorithm which becomes “self-sufficient” over time: the rate of querying the mentor should go to 0 as $T \to \infty$, or equivalently, the cumulative number of queries should be sublinear in $T$. Formally, let $Q_T(\boldsymbol{\mu}, \pi^m) = \{t \in [T] : y_t = \tilde{y}\}$ be the random variable denoting the set of time steps with queries. Then we say that the (expected) number of queries is sublinear in $T$ if $\sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[|Q_T(\boldsymbol{\mu}, \pi^m)|] \in o(T)$. In other words, there must exist $g : \mathbb{N} \to \mathbb{N}$ which does not depend on $\boldsymbol{\mu}$ or $\pi^m$ such that $g(T) \in o(T)$ and $\sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[|Q_T(\boldsymbol{\mu}, \pi^m)|] \leq g(T)$. For brevity, we will usually write $Q_T = Q_T(\boldsymbol{\mu}, \pi^m)$.

Local generalization. We assume that $\boldsymbol{\mu}$ and $\pi^m$ satisfy local generalization. Informally, if the agent is given an input $x$, taking the mentor action for a similar input $x'$ is almost as good as taking the mentor action for $x$. Formally, we assume $\mathcal{X} \subseteq \mathbb{R}^n$ and there exists $L > 0$ such that for all $x, x' \in \mathcal{X}$ and $t \in [T]$, $|\mu_t^m(x) - \mu_t(x, \pi^m(x'))| \leq L\|x - x'\|$, where $\|\cdot\|$ denotes Euclidean distance. This represents the ability to transfer knowledge between similar inputs:

\[
\big|\underbrace{\mu_t(x, \pi^m(x))}_{\text{Taking the right action}} - \underbrace{\mu_t(x, \pi^m(x'))}_{\text{Using what you learned in } x'}\big| \;\leq\; \underbrace{L\|x - x'\|}_{\text{Input similarity}}
\]

This ability seems fundamental to intelligence and is well-understood in psychology (e.g., Esser et al., 2023) and education (e.g., Hajian, 2019). Note that the input space $\mathcal{X} \subseteq \mathbb{R}^n$ can be any encoding of the agent's situation, not just its physical positioning. See Section 5.2 for further discussion.
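As a sanity check, the following sketch (using a toy payoff function and an optimal mentor of our own choosing, not one from the paper) empirically estimates the smallest constant $L$ for which the displayed inequality holds.

```python
import itertools
import numpy as np

def mu(x: float, y: int) -> float:
    """Toy payoff on [0, 1]: each action is safest near its own reference point."""
    return 1.0 - 0.5 * abs(x - (0.3 if y == 0 else 0.7))

def mentor(x: float) -> int:
    """An optimal mentor for this toy payoff."""
    return 0 if mu(x, 0) >= mu(x, 1) else 1

# Empirically estimate the smallest L such that
# |mu(x, mentor(x)) - mu(x, mentor(x'))| <= L * |x - x'| over a grid of pairs.
xs = np.linspace(0.0, 1.0, 201)
ratios = [
    abs(mu(x, mentor(x)) - mu(x, mentor(xp))) / abs(x - xp)
    for x, xp in itertools.combinations(xs, 2)
]
print(f"empirical local generalization constant: {max(ratios):.3f}")
```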

All suprema over $\boldsymbol{\mu}, \pi^m$ pairs are assumed to be restricted to pairs which satisfy local generalization.

Regret. If $\mu_t(x_t, y_t) \in [0,1]$ is the chance of avoiding catastrophe at time $t$ (conditioned on no prior catastrophe), then by the chain rule of probability, $\prod_{t=1}^T \mu_t(x_t, y_t)$ is the agent's overall chance of avoiding catastrophe. For given $\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{\mu}, \pi^m$, the agent's multiplicative regret is

\[
R_T^\times(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{\mu}, \pi^m) = \log \prod_{t=1}^T \mu_t^m(x_t) - \log \prod_{t=1}^T \mu_t(x_t, y_t)
\]

when all payoffs are strictly positive. (One could also define the multiplicative regret as $R_T' = \prod_{t=1}^T \mu_t^m(x_t) - \prod_{t=1}^T \mu_t(x_t, y_t)$, but our definition is actually stricter: $\lim_{T\to\infty} R_T^\times = 0$ implies $\lim_{T\to\infty} R_T' = 0$, while the reverse is not true. In particular, $\lim_{T\to\infty} R_T' = 0$ is trivial if $\lim_{T\to\infty} \prod_{t=1}^T \mu_t^m(x_t) = 0$.) To handle the case where some payoffs are zero, we assume the existence of $\mu_0^m > 0$ such that $\mu^m(x_t) \geq \mu_0^m$ always. Thus only the agent's payoffs can be zero, so we can safely define $R_T^\times(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{\mu}, \pi^m) = \infty$ whenever $\mu_t(x_t, y_t) = 0$ for some $t \in [T]$. We write $R_T^\times = R_T^\times(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{\mu}, \pi^m)$ for brevity.

The assumption of $\mu^m(x_t) \geq \mu_0^m > 0$ means that the mentor cannot be abysmal. In fact, we argue that high-stakes AI applications should employ a mentor who is almost always safe, i.e., $\mu_0^m \approx 1$. If no such mentor exists for some application, perhaps that application should be avoided altogether.

We also define the agent’s additive regret as

\[
R_T^+(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{\mu}, \pi^m) = \sum_{t=1}^T \mu_t^m(x_t) - \sum_{t=1}^T \mu_t(x_t, y_t)
\]

and similarly write $R_T^+ = R_T^+(\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{\mu}, \pi^m)$ for brevity. For both objectives, we desire subconstant worst-case regret: the total (not average) expected regret should go to 0 for any $\boldsymbol{\mu}$ and $\pi^m$. Formally, we want $\lim_{T\to\infty} \sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[R_T^\times] = 0$ and $\lim_{T\to\infty} \sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[R_T^+] = 0$.
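For concreteness, here is a small Python sketch (with arbitrary illustrative payoff sequences) that computes both regret notions exactly as defined above, using the convention that $R_T^\times = \infty$ if any agent payoff is zero.

```python
import math

def multiplicative_regret(mentor_payoffs, agent_payoffs):
    """R_T^x = log prod_t mu_t^m(x_t) - log prod_t mu_t(x_t, y_t)."""
    if any(p == 0 for p in agent_payoffs):
        return math.inf  # convention when an agent payoff is zero
    return sum(map(math.log, mentor_payoffs)) - sum(map(math.log, agent_payoffs))

def additive_regret(mentor_payoffs, agent_payoffs):
    """R_T^+ = sum_t mu_t^m(x_t) - sum_t mu_t(x_t, y_t)."""
    return sum(mentor_payoffs) - sum(agent_payoffs)

# Illustrative payoff sequences (not from the paper).
mentor = [0.999, 0.998, 1.0, 0.997]
agent = [0.999, 0.950, 1.0, 0.997]
print(multiplicative_regret(mentor, agent))  # ~0.049
print(additive_regret(mentor, agent))        # ~0.048
```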

VC and Littlestone dimensions. VC dimension (Vapnik & Chervonenkis, 1971) and Littlestone dimension (Littlestone, 1988) are standard measures of learning difficulty which capture the ability of a hypothesis class (in our case, a policy class) to realize arbitrary combinations of labels (in our case, actions). We omit the precise definitions since we only utilize these concepts via existing results. See Shalev-Shwartz & Ben-David (2014) for a comprehensive overview.

Misc. The diameter of a set $S \subseteq \mathcal{X}$ is defined by $\operatorname{diam}(S) = \max_{x, x' \in S} \|x - x'\|$. All logarithms and exponents are base $e$ unless otherwise noted.

Figure 1: An illustration of the construction we use to prove Theorem 4.1 (not to scale). The horizontal axis indicates the input $x \in [0,1] = \mathcal{X}$ and the vertical axis indicates the payoff $\mu(x,y) \in [0,1]$. The solid line represents $\mu(x,0)$ and the dotted line represents $\mu(x,1)$. In each section, one of the actions has the optimal payoff of 1, and the other action has the worst possible payoff allowed by $L$, reaching a minimum of $1 - \frac{L}{2f(T)}$. Crucially, both actions result in a payoff of 1 at the boundaries between sections: this allows us to “reset” for the next section. As a result, we can freely toggle the optimal action for each section independently.

4 Avoiding catastrophe with sublinear queries is impossible in general

We first show that in general, any algorithm with sublinear mentor queries has unbounded regret in the worst case, even when inputs are i.i.d. on $[0,1]$ and $\boldsymbol{\mu}$ does not vary over time. The formal proofs are deferred to Appendix A, but we provide intuition and define the construction here.

Theorem 4.1.

Any algorithm with sublinear queries has unbounded worst-case regret (both multiplicative and additive) as $T \to \infty$. Specifically,

\[
\sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[R_T^\times],\ \sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[R_T^+] \in \Omega\left(L\sqrt{\frac{T}{\sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[|Q_T|] + 1}}\right)
\]

Intuitively, the regret decreases as the number of queries increases. However, as long as the number of queries remains sublinear in $T$, the regret is unbounded as $T \to \infty$.

We also have the following corollary of Theorem 4.1:

Corollary 4.1.1.

Even when $\mu_t^m(x) = 1$ for all $t$ and $x$, any algorithm with sublinear queries satisfies

\[
\lim_{T\to\infty}\ \sup_{\boldsymbol{\mu}, \pi^m}\ \mathbb{E}\left[\prod_{t=1}^T \mu_t(x_t, y_t)\right] = 0
\]

In other words, even if the mentor never causes catastrophe, any algorithm with sublinear queries causes catastrophe with probability 1 as $T \to \infty$ in the worst case.

4.1 Intuition

We partition $\mathcal{X}$ into equally-sized sections that are “independent” in the sense that querying an input in section $i$ provides no information about section $j$. There will be $f(T)$ sections, where $f$ is a function that we will choose. If $|Q_T| \in o(f(T))$, most of these sections will never contain a query. When the agent sees an input in a section not containing a query, it essentially must guess, meaning it will be wrong about half the time. We then choose a payoff function (which is the same for all time steps) which makes the wrong guesses as costly as possible, subject to the local generalization constraint. Figure 1 fleshes out this idea.

The choice of $f$ is crucial. One idea is $f(T) = T$. If the agent is wrong about half the time, and the average payoff for wrong actions is $1 - \frac{L}{4T}$, we can estimate the regret as

\[
\begin{aligned}
R_T^\times &= \log \prod_{t=1}^T \mu_t^m(x_t) - \log \prod_{t=1}^T \mu_t(x_t, y_t) \\
&\approx \log 1 - \log\left(1 - \frac{L}{4T}\right)^{T/2} \\
&= -\frac{T}{2} \log\left(1 - \frac{L}{4T}\right) \\
&\approx \frac{T}{2} \cdot \frac{L}{4T} \\
&= \frac{L}{8}
\end{aligned}
\]

Thus $f(T) = T$ can at best give us a constant lower bound on regret. Instead, we choose $f$ such that $|Q_T| \in o(f(T))$ and $f(T) \in o(T)$. Specifically, we choose $f(T) = \max\left(\sqrt{\sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[|Q_T|]\, T},\ 1\right)$. Most sections still will not contain a query, so the agent is still wrong about half the time, but the payoff for wrong actions is worse. Then

\[
\begin{aligned}
R_T^\times &\approx \log 1 - \log\left(1 - \frac{L}{4f(T)}\right)^{T/2} \\
&\approx \frac{LT}{8f(T)} \\
&\in \Omega\left(L\sqrt{\frac{T}{\sup_{\boldsymbol{\mu}, \pi^m} \mathbb{E}[|Q_T|] + 1}}\right)
\end{aligned}
\]

which produces the bound in Theorem 4.1.
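To see the difference between the two choices of $f$ numerically, the following sketch (with arbitrary example values of $T$, $L$, and the query budget) evaluates the estimate $-\frac{T}{2}\log\left(1 - \frac{L}{4f(T)}\right)$ from the argument above.

```python
import math

def regret_estimate(T: int, f_T: float, L: float = 1.0) -> float:
    """Estimate of R_T^x when the agent errs on ~T/2 steps, each with payoff 1 - L/(4 f(T))."""
    return -(T / 2) * math.log(1 - L / (4 * f_T))

T = 10**6
queries = 10**3  # an example sublinear query budget

# f(T) = T: the estimate saturates at roughly L/8.
print(regret_estimate(T, f_T=T))                       # ~0.125

# f(T) = sqrt(queries * T): the estimate grows like (L/8) * sqrt(T / queries).
print(regret_estimate(T, f_T=math.sqrt(queries * T)))  # ~3.95
```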

VC dimension. The class of mentor policies in our construction has VC dimension $f(T)$; across all possible values of $T$, this implies infinite VC (and Littlestone) dimension. We know that this is necessary given our positive results.

4.2 Formal definition of construction

Let $\mathcal{X} = [0,1]$ and $\mathcal{D}_t = U(\mathcal{X})$ for each $t \in [T]$, where $U(\mathcal{X})$ is the uniform distribution on $\mathcal{X}$. Assume that $L \leq 1$; this will simplify the math and only makes the problem easier for the agent. We define a family of payoff functions parameterized by a function $f : \mathbb{N} \to \mathbb{N}$ and a bit string $\boldsymbol{a} = (a_1, a_2, \dots, a_{f(T)}) \in \{0,1\}^{f(T)}$. The bit $a_j$ will denote the optimal action in section $j$. Note that $f(T) \geq 1$.

For each $j \in [f(T)]$, we refer to $X_j = \left[\frac{j-1}{f(T)}, \frac{j}{f(T)}\right]$ as the $j$th section. Let $m_j = \frac{j - 0.5}{f(T)}$ be the midpoint of $X_j$. Assume that each $x_t$ belongs to exactly one $X_j$ (this happens with probability 1, so this assumption does not affect the expected regret). Let $j(x)$ denote the index of the section containing input $x$. Then $\mu_{f,\boldsymbol{a}}$ is defined by

\[
\mu_{f,\boldsymbol{a}}(x, y) = \begin{cases} 1 & \text{if } y = a_{j(x)} \\ 1 - L\left(\dfrac{1}{2f(T)} - |m_{j(x)} - x|\right) & \text{if } y \neq a_{j(x)} \end{cases}
\]

We use this payoff function for all time steps: $\mu_t = \mu_{f,\boldsymbol{a}}$ for all $t \in [T]$. Let $\pi^m$ be any optimal policy for $\mu_{f,\boldsymbol{a}}$. Note that there is a unique optimal action for each $x_t$, since each $x_t$ belongs to exactly one $X_j$; formally, $\pi^m(x_t) = a_{j(x_t)}$.

For any $\boldsymbol{a} \in \{0,1\}^{f(T)}$, $\mu_{f,\boldsymbol{a}}$ is piecewise linear (trivially) and continuous (because both actions have payoff 1 on the boundary between sections). Since the slope of each piece is in $\{-L, 0, L\}$, $\mu_{f,\boldsymbol{a}}$ is Lipschitz continuous. Thus by Proposition E.1, $\pi^m$ satisfies local generalization.
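The construction is simple enough to implement directly. The following sketch instantiates $\mu_{f,\boldsymbol{a}}$ and the corresponding mentor policy for a given number of sections and bit string (the specific values below are only for illustration).

```python
import math

def make_construction(f_T: int, a: list, L: float = 1.0):
    """Build the payoff function mu_{f,a} and mentor policy from Section 4.2."""

    def section(x: float) -> int:
        # Index j(x) of the section containing x, for x in (0, 1].
        return min(max(math.ceil(x * f_T), 1), f_T)

    def mu(x: float, y: int) -> float:
        j = section(x)
        if y == a[j - 1]:
            return 1.0
        midpoint = (j - 0.5) / f_T
        return 1.0 - L * (1.0 / (2 * f_T) - abs(midpoint - x))

    def mentor(x: float) -> int:
        return a[section(x) - 1]

    return mu, mentor

# Example with f(T) = 4 sections and optimal actions a = (0, 1, 1, 0).
mu, mentor = make_construction(4, [0, 1, 1, 0])
print(mentor(0.3), mu(0.3, mentor(0.3)))  # optimal action 1, payoff 1.0
print(mu(0.3, 0))                         # wrong action: 1 - L*(1/8 - |0.375 - 0.3|) = 0.95
```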

5 Avoiding catastrophe given finite VC or Littlestone dimension

Theorem 4.1 shows that avoiding catastrophe is impossible in general. What if we restrict ourselves to settings where standard online learning is possible? Specifically, we assume that $\pi^m$ belongs to a policy class $\Pi$ where either (1) $\Pi$ has finite VC dimension $d$ and $\boldsymbol{x}$ is $\sigma$-smooth or (2) $\Pi$ has finite Littlestone dimension $d$. (Recall from Section 1.3 that standard online learning becomes tractable under either of these assumptions.) This section presents a simple algorithm which guarantees subconstant regret (both multiplicative and additive) and sublinear queries under either of those assumptions. Formal proofs are deferred to Appendix B, but we provide intuition and a proof sketch here.

5.1 Intuition behind the algorithm

Algorithm 1 has two simple components: (1) run a modified version of the Hedge algorithm for online learning, but (2) ask for help for unfamiliar inputs (specifically, when the input is very different from any queried input with the same action under the proposed policy). Hedge ensures that the number of mistakes (i.e., the number of time steps where the agent’s action doesn’t match the mentor’s) is small, and asking for help for unfamiliar inputs ensures that when we do make a mistake, the cost isn’t too high. This algorithmic structure seems quite natural: mostly follow a baseline strategy, but ask for help when out-of-distribution.

Hedge. Hedge (Freund & Schapire, 1997) is a standard online learning algorithm which ensures sublinear regret when the number of hypotheses (in our case, the number of policies in $\Pi$) is finite. (Chapter 5 of Slivkins et al. (2019) and Chapter 21 of Shalev-Shwartz & Ben-David (2014) give modern introductions to Hedge.) We would prefer not to assume that $\Pi$ is finite. Luckily, any policy in $\Pi$ can be approximated within $\varepsilon$ when either (1) $\Pi$ has finite VC dimension and $\boldsymbol{x}$ is $\sigma$-smooth or (2) $\Pi$ has finite Littlestone dimension. Thus we can run Hedge on this approximating policy class instead.

One other modification is necessary. In standard online learning, losses are observed on every time step, but our agent only receives feedback in response to queries. To handle this, we modify Hedge to only perform updates on time steps with queries and to issue a query with probability $p$ on each time step. Continuing our lucky streak, Russo et al. (2024) analyze exactly this modification of Hedge.

5.2 Local generalization

Local generalization is vital: this is what allows us to detect when an input is unfamiliar. Crucially, our algorithm does not need to know how inputs are encoded in $\mathbb{R}^n$ and does not need to know $L$: it only needs to be able to compute the nearest-neighbor distance $\min_{(x,y) \in S : y = \pi_t(x_t)} \|x_t - x\|$. Thus we only need to assume that there exists some encoding satisfying local generalization.

To elaborate, recall the example that a 3 mm spot and a 3.1 mm spot on X-rays likely have similar risk levels (assuming similar density, location, etc.). If the risk level abruptly increases for any spot over 3 mm, then local generalization may not hold for a naive encoding which treats size as a single dimension. However, a more nuanced encoding would recognize that these two situations – a 3 mm vs 3.1 mm spot – are in fact not similar. Constructing a suitable encoding may be challenging, but we do not require the agent to have explicit access to such an encoding: the agent only needs a nearest-neighbor distance oracle.

Inputs: $T \in \mathbb{N}$, $\varepsilon \in \mathbb{R}_{>0}$, $d \in \mathbb{N}$, policy class $\Pi$
if $\Pi$ has VC dimension $d$ then
  $\tilde{\Pi} \leftarrow$ any smooth $\varepsilon$-cover of $\Pi$ of size at most $(41/\varepsilon)^d$  (see Definition 5.3)
else
  $\tilde{\Pi} \leftarrow$ any adversarial cover of size at most $(eT/d)^d$  (see Definition 5.4)
$S \leftarrow \emptyset$
$w(\pi) \leftarrow 1$ for all $\pi \in \tilde{\Pi}$
$p \leftarrow 1/\sqrt{\varepsilon T}$
$\eta \leftarrow \max\left(\sqrt{\frac{p \log|\tilde{\Pi}|}{2T}},\ \frac{p^2}{\sqrt{2}}\right)$
for $t$ from $1$ to $T$ do
  // Run one step of Hedge, which selects policy $\pi_t$
  with probability $p$: hedgeQuery $\leftarrow$ true
  with probability $1 - p$: hedgeQuery $\leftarrow$ false
  if hedgeQuery then
    // (We do not update $S$ in this case simply because those updates are not necessary for the desired bounds, and omitting them simplifies the analysis.)
    Query mentor and observe $\pi^m(x_t)$
    $\ell(t, \pi) \leftarrow \mathbf{1}(\pi(x_t) \neq \pi^m(x_t))$ for all $\pi \in \tilde{\Pi}$
    $\ell^* \leftarrow \min_{\pi \in \tilde{\Pi}} \ell(t, \pi)$
    $w(\pi) \leftarrow w(\pi) \cdot \exp(-\eta(\ell(t, \pi) - \ell^*))$ for all $\pi \in \tilde{\Pi}$
    $\pi_t \leftarrow \operatorname{arg\,min}_{\pi \in \tilde{\Pi}} \ell(t, \pi)$
  else
    $P(\pi) \leftarrow w(\pi) / \sum_{\pi' \in \tilde{\Pi}} w(\pi')$ for all $\pi \in \tilde{\Pi}$
    Sample $\pi_t \sim P$
  // Ask for help if out-of-distribution
  if $S = \emptyset$ or $\min_{(x,y) \in S : y = \pi_t(x_t)} \|x_t - x\| > \varepsilon^{1/n}$ then
    Query mentor (if not already queried this round) and observe $\pi^m(x_t)$
    $S \leftarrow S \cup \{(x_t, \pi^m(x_t))\}$
  else
    // Otherwise, follow Hedge's chosen policy
    Take action $\pi_t(x_t)$

Algorithm 1 successfully avoids catastrophe assuming finite VC or Littlestone dimension.
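For readers who prefer executable code, below is a minimal Python sketch of Algorithm 1's control flow for an explicitly given finite policy class (standing in for the cover $\tilde{\Pi}$). The toy policy class, mentor, and input stream are placeholders of our own; this is a sketch of the pseudocode above, not the authors' implementation, and it omits the construction of the cover.

```python
import math
import random

import numpy as np


def algorithm1(xs, policies, mentor, eps, n_dims, seed=0):
    """Sketch of Algorithm 1 for a finite policy class (the cover is given directly).

    xs       : list of inputs (numpy arrays in R^n)
    policies : finite list of candidate policies (callables x -> action)
    mentor   : callable x -> mentor action (query oracle)
    eps      : the epsilon parameter of Theorem 5.1
    n_dims   : input dimension n
    """
    rng = random.Random(seed)
    T = len(xs)
    p = 1.0 / math.sqrt(eps * T)  # Hedge query probability (assumes eps * T >= 1)
    eta = max(math.sqrt(p * math.log(len(policies)) / (2 * T)), p**2 / math.sqrt(2))
    weights = np.ones(len(policies))
    memory = []                   # S: list of (input, mentor action) pairs
    actions, num_queries = [], 0

    for x in xs:
        # One step of Hedge: sample a policy from the current weights.
        probs = weights / weights.sum()
        i_t = rng.choices(range(len(policies)), weights=probs)[0]

        if rng.random() < p:
            # Hedge query: observe the mentor's action and update all weights.
            y_m = mentor(x)
            num_queries += 1
            losses = np.array([float(pi(x) != y_m) for pi in policies])
            weights *= np.exp(-eta * (losses - losses.min()))
            i_t = int(np.argmin(losses))

        proposed = policies[i_t](x)

        # Out-of-distribution check: is there a stored input within eps^(1/n)
        # whose mentor action matches the proposed action?
        familiar = any(
            y == proposed and np.linalg.norm(x - s) <= eps ** (1.0 / n_dims)
            for (s, y) in memory
        )
        if not familiar:
            y_m = mentor(x)                  # ask for help when out-of-distribution
            num_queries += 1
            memory.append((x, y_m))
            actions.append(y_m)
        else:
            actions.append(proposed)         # follow Hedge's chosen policy

    return actions, num_queries


# Toy usage: threshold policies on [0, 1] with a threshold-0.6 mentor.
np_rng = np.random.default_rng(0)
xs = [np_rng.uniform(0, 1, size=1) for _ in range(2000)]
policies = [lambda x, b=b: int(x[0] > b) for b in np.linspace(0, 1, 21)]
mentor = lambda x: int(x[0] > 0.6)
_, q = algorithm1(xs, policies, mentor, eps=0.01, n_dims=1)
print(f"queries: {q} out of {len(xs)} rounds")
```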

Conceptually, the algorithm only needs to be able to detect when an input is unfamiliar. While this task remains far from trivial, we argue that it is more tractable than fully constructing a suitable encoding. See Section 6 for a discussion of potential future work on this topic.

We note that these encoding-related questions apply similarly to the more standard assumption of Lipschitz continuity. In fact, Lipschitz continuity implies local generalization when the mentor is optimal (Proposition E.1). We also mention that without local generalization, avoiding catastrophe is impossible even when the mentor policy class has finite VC dimension and $\boldsymbol{x}$ is $\sigma$-smooth (Theorem E.2).

5.3 Main result

For simplicity, here we only state our results for $\mathcal{Y} = \{0,1\}$; Appendix C extends our result to many actions using the standard “one versus rest” reduction. We first prove regret and query bounds parameterized by $\varepsilon$:

Theorem 5.1.

Let $\mathcal{Y} = \{0,1\}$. Assume $\pi^m \in \Pi$ where either (1) $\Pi$ has finite VC dimension $d$ and $\boldsymbol{x}$ is $\sigma$-smooth, or (2) $\Pi$ has finite Littlestone dimension $d$. Then for any $T \in \mathbb{N}$ and $\varepsilon \in \left[\frac{1}{T}, \left(\frac{\mu_0^m}{2L}\right)^n\right]$, Algorithm 1 satisfies

\[
\begin{aligned}
\mathbb{E}[R_T^\times] &\in O\left(\frac{dL}{\sigma \mu_0^m}\, T \varepsilon^{1 + 1/n} \log(T + 1/\varepsilon)\right) \\
\mathbb{E}[R_T^+] &\in O\left(\frac{dL}{\sigma}\, T \varepsilon^{1 + 1/n} \log(T + 1/\varepsilon)\right) \\
\mathbb{E}[|Q_T|] &\in O\left(\sqrt{\frac{T}{\varepsilon}} + \frac{d}{\sigma}\, T \varepsilon \log(T + 1/\varepsilon) + \frac{\mathbb{E}[\operatorname{diam}(\boldsymbol{x})^n]}{\varepsilon}\right)
\end{aligned}
\]

In Case 1, the expectation is over the randomness of both $\boldsymbol{x}$ and the algorithm, while in Case 2, the expectation is over only the randomness of the algorithm. The bounds clearly have no dependence on $\sigma$ in Case 2, but we include $\sigma$ anyway to avoid writing two separate sets of bounds.

To obtain subconstant regret and sublinear queries, we can choose $\varepsilon = T^{\frac{-2n}{2n+1}}$. This also satisfies $2L\varepsilon^{1/n} \leq \mu_0^m$ for large enough $T$.
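To see why this choice of $\varepsilon$ yields the rates below, substitute it into the leading terms of Theorem 5.1:
\[
T\varepsilon^{1+1/n} = T \cdot T^{-\frac{2n}{2n+1}\cdot\frac{n+1}{n}} = T^{1 - \frac{2(n+1)}{2n+1}} = T^{\frac{-1}{2n+1}}, \qquad \sqrt{\frac{T}{\varepsilon}} = \sqrt{T \cdot T^{\frac{2n}{2n+1}}} = T^{\frac{4n+1}{4n+2}}.
\]
The remaining query terms $\frac{d}{\sigma} T\varepsilon\log(T + 1/\varepsilon)$ and $\mathbb{E}[\operatorname{diam}(\boldsymbol{x})^n]/\varepsilon$ are of order $T^{\frac{1}{2n+1}}\log T$ and $T^{\frac{2n}{2n+1}}$ respectively, both dominated by $T^{\frac{4n+1}{4n+2}}$ up to the stated factors.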

Theorem 5.2.

Let $\mathcal{Y} = \{0,1\}$. Assume $\pi^m \in \Pi$ where either (1) $\Pi$ has finite VC dimension $d$ and $\boldsymbol{x}$ is $\sigma$-smooth or (2) $\Pi$ has finite Littlestone dimension $d$. Then for any $T \in \mathbb{N}$, Algorithm 1 with $\varepsilon = T^{\frac{-2n}{2n+1}}$ satisfies

\[
\begin{aligned}
\mathbb{E}[R_T^\times] &\in O\left(\frac{dL}{\sigma \mu_0^m}\, T^{\frac{-1}{2n+1}} \log T\right) \\
\mathbb{E}[R_T^+] &\in O\left(\frac{dL}{\sigma}\, T^{\frac{-1}{2n+1}} \log T\right) \\
\mathbb{E}[|Q_T|] &\in O\left(T^{\frac{4n+1}{4n+2}} \left(\frac{d}{\sigma} \log T + \mathbb{E}[\operatorname{diam}(\boldsymbol{x})^n]\right)\right)
\end{aligned}
\]

Before proceeding to the proof sketch, we highlight some advantages of our algorithm.

Limited knowledge required. Our algorithm needs to know $\Pi$, as is standard. However, the algorithm does not need to know $\sigma$ (in the smooth case) or $L$. Although Algorithm 1 as written does require $T$ as an input, it can be converted into an infinite-horizon/anytime algorithm via the standard “doubling trick” (see, e.g., Slivkins et al., 2019).

Unbounded environment. Our algorithm can handle an unbounded input space: the number of queries simply scales with the maximum distance between observed inputs in the form of $\mathbb{E}[\operatorname{diam}(\boldsymbol{x})^n]$.

Simultaneous bounds for all $\boldsymbol{\mu}$. Recall that the agent never observes payoffs and only learns from mentor queries. This means that the agent's behavior does not depend on $\boldsymbol{\mu}$ at all. In other words, the distribution of $(\boldsymbol{x}, \boldsymbol{y})$ depends on $\pi^m$ but not $\boldsymbol{\mu}$. Consequently, a given $\pi^m$ induces a single distribution over $(\boldsymbol{x}, \boldsymbol{y})$ which satisfies the bounds in Theorem 5.2 simultaneously for all $\boldsymbol{\mu}$ satisfying local generalization.

5.4 Proof sketch

The formal proof of Theorem 5.1 can be found in Appendix B, but we outline the key elements here. The regret analysis consists of two ingredients: analyzing the Hedge component and analyzing the “ask for help when out-of-distribution” component. The former will bound the number of mistakes made by the algorithm (i.e., the number of time steps where the agent’s action doesn’t match the mentor’s), and the latter will bound the cost of any single mistake. We must also show that the latter does not result in excessively many queries, which we do via a novel packing argument.

We begin by formalizing two notions of approximating a policy class:

Definition 5.3.

Let $U$ be the uniform distribution over $\mathcal{X}$. For $\varepsilon > 0$, a policy class $\tilde{\Pi}$ is a smooth $\varepsilon$-cover of a policy class $\Pi$ if for every $\pi \in \Pi$, there exists $\tilde{\pi} \in \tilde{\Pi}$ such that $\Pr_{x \sim U}[\pi(x) \neq \tilde{\pi}(x)] \leq \varepsilon$.

Definition 5.4.

A policy class $\tilde{\Pi}$ is an adversarial cover of a policy class $\Pi$ if for every $\boldsymbol{x} \in \mathcal{X}^T$ and $\pi \in \Pi$, there exists $\tilde{\pi} \in \tilde{\Pi}$ such that $\pi(x_t) = \tilde{\pi}(x_t)$ for all $t \in [T]$.

An adversarial cover is a perfect cover by definition. The idea of a smooth $\varepsilon$-cover is that if the probability of disagreement under the uniform distribution is small, then the probability of disagreement under a $\sigma$-smooth distribution cannot be too much larger.

Lemma 5.1.

Let $\tilde{\Pi}$ be a smooth $\varepsilon$-cover of $\Pi$ and let $\mathcal{D}$ be a $\sigma$-smooth distribution. Then for any $\pi \in \Pi$, there exists $\tilde{\pi} \in \tilde{\Pi}$ such that $\Pr_{x \sim \mathcal{D}}[\pi(x) \neq \tilde{\pi}(x)] \leq \varepsilon/\sigma$.

Proof.

Define $S(\tilde{\pi}) = \{x \in \mathcal{X} : \pi(x) \neq \tilde{\pi}(x)\}$. By the definition of a smooth $\varepsilon$-cover, there exists $\tilde{\pi} \in \tilde{\Pi}$ such that $\Pr_{x \sim U}[x \in S(\tilde{\pi})] \leq \varepsilon$. Since $\mathcal{D}$ is $\sigma$-smooth, $\Pr_{x \sim \mathcal{D}}[\pi(x) \neq \tilde{\pi}(x)] = \Pr_{x \sim \mathcal{D}}[x \in S(\tilde{\pi})] \leq \Pr_{x \sim U}[x \in S(\tilde{\pi})]/\sigma \leq \varepsilon/\sigma$, as claimed. ∎

The existence of small covers is crucial:

Lemma 5.2 (Lemma 7.3.2 in Haghtalab (2018); see also Haussler & Long (1995) or Lemma 13.6 in Boucheron et al. (2013) for variants of this lemma).

For all $\varepsilon > 0$, any policy class of VC dimension $d$ admits a smooth $\varepsilon$-cover of size at most $(41/\varepsilon)^d$.

Lemma 5.3 (Lemmas 21.13 and A.5 in Shalev-Shwartz & Ben-David (2014)).

Any policy class of Littlestone dimension $d$ admits an adversarial cover of size at most $(eT/d)^d$.

We will run a variant of Hedge on $\tilde{\Pi}$. The vanilla Hedge algorithm operates in the standard online learning model where on each time step, the agent selects a policy (or more generally, a hypothesis) and observes the loss of every policy. In general the loss function can depend arbitrarily on the time step, the policy, and prior events, but we will only use the indicator loss function $\ell(t, \pi) = \mathbf{1}(\pi(x_t) \neq \pi^m(x_t))$. Crucially, whenever we query and learn $\pi^m(x_t)$, we can compute $\ell(t, \pi)$ for every $\pi \in \tilde{\Pi}$.

We cannot afford to query on every time step, however. Recently, Russo et al. (2024) analyzed a variant of Hedge where losses are observed only in response to queries, which they call “label-efficient feedback”. They proved a regret bound when a query is issued on each time step with fixed probability $p$. Lemma 5.4 restates their result in a form that is more convenient for us. See Appendices B.1 and B.3 for details on our usage of results from Russo et al. (2024). Full pseudocode for HedgeWithQueries can also be found in the appendix (Algorithm 2).

Lemma 5.4 (Lemma 3.5 in Russo et al., 2024).

Assume $\tilde{\Pi}$ is finite. Then for any loss function $\ell : [T] \times \tilde{\Pi} \to [0,1]$ and query probability $p > 0$, HedgeWithQueries enjoys the regret bound

\[
\sum_{t=1}^T \mathbb{E}[\ell(t, \pi_t)] - \min_{\pi \in \tilde{\Pi}} \sum_{t=1}^T \mathbb{E}[\ell(t, \pi)] \leq \frac{\log|\tilde{\Pi}|}{p^2}
\]

where $\pi_t$ is the policy chosen at time $t$ and the expectation is over the randomness of the algorithm.

We apply Lemma 5.4 with $\ell(t, \pi) = \mathbf{1}(\pi(x_t) \neq \pi^m(x_t))$ and combine this with Lemmas 5.2 and 5.1 (in the $\sigma$-smooth case) and with Lemma 5.3 (in the adversarial case). This yields an $O\left(\frac{d}{\sigma} T \varepsilon \log(1/\varepsilon) \log T\right)$ bound on the number of mistakes made by Algorithm 1 (Lemma B.1).
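To sketch where this bound comes from in the smooth case (a rough calculation, suppressing constants): with $p = 1/\sqrt{\varepsilon T}$ and $|\tilde{\Pi}| \leq (41/\varepsilon)^d$, Lemma 5.4 gives $\frac{\log|\tilde{\Pi}|}{p^2} \leq d\,\varepsilon T \log(41/\varepsilon)$, and by Lemmas 5.2 and 5.1 some $\tilde{\pi} \in \tilde{\Pi}$ disagrees with $\pi^m$ with probability at most $\varepsilon/\sigma$ per step. Hence the expected number of time steps on which the Hedge policy disagrees with the mentor is roughly
\[
\sum_{t=1}^T \mathbb{E}[\ell(t, \pi_t)] \;\leq\; \underbrace{\frac{T\varepsilon}{\sigma}}_{\text{best policy in the cover}} + \underbrace{d\,\varepsilon T \log\frac{41}{\varepsilon}}_{\text{Hedge regret}},
\]
which recovers the order of Lemma B.1's bound up to logarithmic factors.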

The other key ingredient of the proof is analyzing the “ask for help when out-of-distribution” component. Combined with the local generalization assumption, this allows us to fairly easily bound the cost of a single mistake (Lemma B.2). The trickier part is bounding the number of resulting queries. It is tempting to claim that the inputs queried in the out-of-distribution case must all be separated by at least $\varepsilon^{1/n}$ and thus form an $\varepsilon^{1/n}$-packing, but this is actually false.

Instead, we bound the number of data points (i.e., queries) needed to cover a set with respect to the realized actions of the algorithm (Lemma B.6). This contrasts with vanilla packing arguments which consider all data points in aggregate. The key to our analysis is that the number of mistakes made by the algorithm – which we already bounded in Lemma B.1 – gives us crucial information about how data points are distributed with respect to the actions of the algorithm. Our technique might be useful in other contexts where a more refined packing argument is needed and a bound on the number of mistakes already exists.

6 Conclusion and future work

In this paper, we proposed a model of avoiding catastrophe in online learning. We showed that achieving subconstant regret in our problem (with the help of a mentor and local generalization) is no harder than achieving sublinear regret in standard online learning.

Remaining technical questions. First, we have not resolved whether our problem is tractable for finite VC dimension and fully adversarial inputs (although Appendix D shows that the problem is tractable for at least some classes with finite VC but infinite Littlestone dimension). Second, the time complexity of Algorithm 1 currently stands at a hefty $\Omega(|\tilde{\Pi}|)$ per time step plus the time to compute $\tilde{\Pi}$. In the standard online learning setting, Block et al. (2022) and Haghtalab et al. (2022) show how to replace discretization approaches like ours with oracle-efficient approaches, where a small number of calls to an optimization oracle are made per round. We are optimistic about leveraging such techniques to obtain efficient algorithms in our setting.

Local generalization. Our algorithm crucially relies on the ability to detect when an input is unfamiliar, i.e., differs significantly from prior observations in a metric space which satisfies local generalization. Without this ability, the practicality of our algorithm would be fundamentally limited. One option is to use out-of-distribution (OOD) detection, which is conceptually similar and well-studied (see Yang et al., 2024 for a survey). However, it is an open question whether standard OOD detection methods are measuring distance in a metric space which satisfies local generalization.

We are also interested in alternatives to local generalization. Theorem E.2 shows that our positive result breaks down if local generalization is removed, so some sort of assumption is necessary. One possible alternative is Bayesian inference. We intentionally avoided Bayesian approaches in this paper due to tractability concerns, but it seems premature to abandon those ideas entirely.

MDPs. Finally, we are excited to apply the ideas in this paper to Markov Decision Processes (MDPs): specifically, MDPs where some actions are irreversible (“non-communicating”) and the agent only gets one attempt (“single-episode”). In such MDPs, the agent must not only avoid catastrophe but also obtain high reward. As discussed in Section 2, very little theory exists for RL in non-communicating single-episode MDPs. Can an agent learn near-optimal behavior in high-stakes environments while becoming self-sufficient over time? Formally, we pose the following open problem:

Is there an algorithm for non-communicating single-episode undiscounted MDPs which ensures that both the regret and the number of mentor queries are sublinear in $T$?

Impact statement

As AI systems become increasingly powerful, we believe that the safety guarantees of such systems should become commensurately robust. Irreversible costs are especially worrisome, and we hope that our work plays a small part in mitigating such risks. We do not believe that our work has any concrete potential risks that should be highlighted here.

Author contributions

B. Plaut conceived the project. B. Plaut designed the mathematical model with feedback from H. Zhu and S. Russell. B. Plaut proved all of the results. H. Zhu participated in proof brainstorming and designed counterexamples for several early conjectures. B. Plaut wrote the paper, with feedback from H. Zhu and S. Russell. S. Russell supervised the project and secured funding.

Acknowledgements

This work was supported in part by a gift from Open Philanthropy to the Center for Human-Compatible AI (CHAI) at UC Berkeley. This paper also benefited from discussions with many other researchers. We would like to especially thank (in alphabetical order) Aly Lidayan, Bhaskar Mishra, Cameron Allen, Juan Liévano-Karim, Matteo Russo, Michael Cohen, Nika Haghtalab, and Scott Emmons. We would also like to thank our anonymous reviewers for helpful feedback.

References

  • Abaimov & Martellini (2020) Abaimov, S. and Martellini, M. Artificial Intelligence in Autonomous Weapon Systems, pp.  141–177. Springer International Publishing, Cham, 2020.
  • Altman (2021) Altman, E. Constrained Markov Decision Processes. Routledge, 2021.
  • Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax Regret Bounds for Reinforcement Learning. In Proceedings of the 34th International Conference on Machine Learning, pp.  263–272. PMLR, July 2017. ISSN: 2640-3498.
  • Barman et al. (2023) Barman, S., Khan, A., Maiti, A., and Sawarni, A. Fairness and welfare quantification for regret in multi-armed bandits. In Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023).
  • Bengio et al. (2024) Bengio, Y., Hinton, G., Yao, A., Song, D., Abbeel, P., Darrell, T., Harari, Y. N., Zhang, Y.-Q., Xue, L., Shalev-Shwartz, S., et al. Managing extreme AI risks amid rapid progress. Science, 384(6698):842–845, 2024.
  • Betancourt (2018) Betancourt, M. A Conceptual Introduction to Hamiltonian Monte Carlo, July 2018. arXiv:1701.02434 [stat].
  • Block et al. (2022) Block, A., Dagan, Y., Golowich, N., and Rakhlin, A. Smoothed online learning is as easy as statistical learning. In Conference on Learning Theory, pp.  1716–1786. PMLR, 2022.
  • Boucheron et al. (2013) Boucheron, S., Lugosi, G., and Massart, P. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 02 2013.
  • Cesa-Bianchi & Lugosi (2006) Cesa-Bianchi, N. and Lugosi, G. Prediction, learning, and games. Cambridge university press, 2006.
  • Cohen & Hutter (2020) Cohen, M. K. and Hutter, M. Pessimism About Unknown Unknowns Inspires Conservatism. In Proceedings of Thirty Third Conference on Learning Theory, pp.  1344–1373. PMLR, July 2020. ISSN: 2640-3498.
  • Cohen et al. (2021) Cohen, M. K., Catt, E., and Hutter, M. Curiosity Killed or Incapacitated the Cat and the Asymptotically Optimal Agent. IEEE Journal on Selected Areas in Information Theory, 2(2):665–677, June 2021. Conference Name: IEEE Journal on Selected Areas in Information Theory.
  • Critch & Russell (2023) Critch, A. and Russell, S. TASRA: a taxonomy and analysis of societal-scale risks from AI. arXiv preprint arXiv:2306.06924, 2023.
  • Esser et al. (2023) Esser, S., Haider, H., Lustig, C., Tanaka, T., and Tanaka, K. Action–effect knowledge transfers to similar effect stimuli. Psychological Research, 87(7):2249–2258, October 2023.
  • Freund & Schapire (1997) Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Grinsztajn et al. (2021) Grinsztajn, N., Ferret, J., Pietquin, O., Preux, P., and Geist, M. There Is No Turning Back: A Self-Supervised Approach for Reversibility-Aware Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 34, pp.  1898–1911. Curran Associates, Inc., 2021.
  • Gu et al. (2024) Gu, S., Yang, L., Du, Y., Chen, G., Walter, F., Wang, J., and Knoll, A. A review of safe reinforcement learning: Methods, theories, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):11216–11235, 2024.
  • Guembe et al. (2022) Guembe, B., Azeta, A., Misra, S., Osamor, V. C., Fernandez-Sanz, L., and Pospelova, V. The Emerging Threat of AI-driven Cyber Attacks: A Review. Applied Artificial Intelligence, 36(1), December 2022.
  • Hadfield-Menell et al. (2017) Hadfield-Menell, D., Milli, S., Abbeel, P., Russell, S., and Dragan, A. D. Inverse reward design. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp.  6768–6777, Red Hook, NY, USA, December 2017. Curran Associates Inc.
  • Haghtalab (2018) Haghtalab, N. Foundation of Machine Learning, by the People, for the People. PhD thesis, Microsoft Research, 2018.
  • Haghtalab et al. (2022) Haghtalab, N., Han, Y., Shetty, A., and Yang, K. Oracle-efficient online learning for smoothed adversaries. Advances in Neural Information Processing Systems, 35:4072–4084, 2022.
  • Haghtalab et al. (2024) Haghtalab, N., Roughgarden, T., and Shetty, A. Smoothed analysis with adaptive adversaries. Journal of the ACM, 71(3):1–34, 2024.
  • Hajian (2019) Hajian, S. Transfer of Learning and Teaching: A Review of Transfer Theories and Effective Instructional Practices. IAFOR Journal of Education, 7(1):93–111, 2019. Publisher: International Academic Forum ERIC Number: EJ1217940.
  • Hanneke (2014) Hanneke, S. Theory of Disagreement-Based Active Learning, volume 7. Now Publishers Inc., Hanover, MA, USA, June 2014.
  • Haussler & Long (1995) Haussler, D. and Long, P. M. A generalization of Sauer’s lemma. Journal of Combinatorial Theory, Series A, 71(2):219–240, August 1995.
  • Hendrycks et al. (2023) Hendrycks, D., Mazeika, M., and Woodside, T. An overview of catastrophic AI risks. arXiv preprint arXiv:2306.12001, 2023.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal Regret Bounds for Reinforcement Learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.
  • Jung (1901) Jung, H. Ueber die kleinste kugel, die eine räumliche figur einschliesst. Journal für die reine und angewandte Mathematik, 123:241–257, 1901.
  • Kohli & Chadha (2020) Kohli, P. and Chadha, A. Enabling pedestrian safety using computer vision techniques: A case study of the 2018 uber inc. self-driving car crash. In Advances in Information and Communication: Proceedings of the 2019 Future of Information and Communication Conference (FICC), Volume 1, pp.  261–279. Springer, 2020.
  • Kosoy (2019) Kosoy, V. Delegative Reinforcement Learning: learning to avoid traps with a little help. SafeML ICLR 2019 Workshop, July 2019.
  • Littlestone (1988) Littlestone, N. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine learning, 2:285–318, 1988.
  • Liu et al. (2021) Liu, T., Zhou, R., Kalathil, D., Kumar, P., and Tian, C. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021.
  • Maillard et al. (2019) Maillard, O.-A., Mann, T., Ortner, R., and Mannor, S. Active Roll-outs in MDP with Irreversible Dynamics. July 2019.
  • Mindermann et al. (2018) Mindermann, S., Shah, R., Gleave, A., and Hadfield-Menell, D. Active Inverse Reward Design. In Proceedings of the 1st Workshop on Goal Specifications for Reinforcement Learning, 2018.
  • Moldovan & Abbeel (2012) Moldovan, T. M. and Abbeel, P. Safe exploration in Markov decision processes. In Proceedings of the 29th International Conference on International Conference on Machine Learning, ICML’12, pp.  1451–1458, Madison, WI, USA, June 2012. Omnipress.
  • Mouton et al. (2024) Mouton, C., Lucas, C., and Guest, E. The operational risks of AI in large-scale biological attacks. Technical report, RAND Corporation, Santa Monica, 2024.
  • Osa et al. (2018) Osa, T., Pajarinen, J., Neumann, G., Bagnell, J., Abbeel, P., and Peters, J. An Algorithmic Perspective on Imitation Learning. Foundations and trends in robotics. Now Publishers, 2018.
  • Quiñonero-Candela et al. (2022) Quiñonero-Candela, J., Sugiyama, M., Schwaighofer, A., and Lawrence, N. D. Dataset shift in machine learning. MIT Press, 2022.
  • Rajpurkar et al. (2022) Rajpurkar, P., Chen, E., Banerjee, O., and Topol, E. J. AI in health and medicine. Nature medicine, 28(1):31–38, 2022.
  • Russo et al. (2024) Russo, M., Celli, A., Colini Baldeschi, R., Fusco, F., Haimovich, D., Karamshuk, D., Leonardi, S., and Tax, N. Online learning with sublinear best-action queries. Advances in Neural Information Processing Systems, 37:40407–40433, 2024.
  • Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 1 edition, May 2014.
  • Slivkins (2011) Slivkins, A. Contextual Bandits with Similarity Information. In Proceedings of the 24th Annual Conference on Learning Theory (COLT), pp.  679–702, December 2011. ISSN: 1938-7228.
  • Slivkins et al. (2019) Slivkins, A. et al. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286, 2019.
  • Spielman & Teng (2004) Spielman, D. A. and Teng, S.-H. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.
  • Stradi et al. (2024) Stradi, F. E., Castiglioni, M., Marchesi, A., and Gatti, N. Learning adversarial MDPs with stochastic hard constraints. arXiv preprint arXiv:2403.03672, 2024.
  • Turchetta et al. (2016) Turchetta, M., Berkenkamp, F., and Krause, A. Safe Exploration in Finite Markov Decision Processes with Gaussian Processes. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
  • Vapnik & Chervonenkis (1971) Vapnik, V. N. and Chervonenkis, A. Y. On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities. Theory of Probability & Its Applications, 16(2):264–280, January 1971. Publisher: Society for Industrial and Applied Mathematics.
  • Villasenor & Foggo (2020) Villasenor, J. and Foggo, V. Artificial intelligence, due process and criminal sentencing. Mich. St. L. Rev., pp.  295, 2020.
  • Wachi et al. (2024) Wachi, A., Shen, X., and Sui, Y. A Survey of Constraint Formulations in Safe Reinforcement Learning. volume 9, pp.  8262–8271, August 2024. ISSN: 1045-0823.
  • Wu (2020) Wu, Y. Lecture notes on: Information-theoretic methods for high-dimensional statistics. 2020.
  • Yang et al. (2024) Yang, J., Zhou, K., Li, Y., and Liu, Z. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, pp.  1–28, 2024.
  • Zhao et al. (2023) Zhao, W., He, T., Chen, R., Wei, T., and Liu, C. State-wise Safe Reinforcement Learning: A Survey. volume 6, pp.  6814–6822, August 2023. ISSN: 1045-0823.

Appendix A Proof of Theorem 4.1

A.1 Proof notation

  1. 1.

    Let MjM_{j} be the set of time steps tTt\leq T where |mjxt|\mfrac14f(T)|m_{j}-x_{t}|\leq\mfrac{1}{4f(T)}. In words, xtx_{t} is relatively close to the midpoint of XjX_{j}. This will imply that the suboptimal action is in fact quite suboptimal. This also implies that xtx_{t} is in XjX_{j}, since each XjX_{j} has length 1/f(T)1/f(T).

  2. 2.

    Let J¬Q={j[f(T)]:xtXjtQT}J_{\neg Q}=\{j\in[f(T)]:x_{t}\not\in X_{j}\ \forall t\in Q_{T}\} be the set of sections that are never queried. Since each query appears in exactly one section (because each input appears in exactly one section), |J¬Q|f(T)|QT||J_{\neg Q}|\geq f(T)-|Q_{T}|.

  3. 3.

    For each jJ¬Qj\in J_{\neg Q}, let yjy^{j} be the most frequent action among time steps in MjM_{j}: yj=argmaxy{0,1}|{tMj:y=yt}|y^{j}=\operatorname*{arg\,max}_{y\in\{0,1\}}|\{t\in M_{j}:y=y_{t}\}|.

  4. 4.

    Let J¬Q={jJ¬Q:ajyj}J_{\neg Q}^{\prime}=\{j\in J_{\neg Q}:a_{j}\neq y^{j}\} be the set of sections where the more frequent action is wrong according to μf,𝒂\mu_{f,\boldsymbol{a}}.

  5. 5.

    Let Mj={tMj:ytaj}M_{j}^{\prime}=\{t\in M_{j}:y_{t}\neq a_{j}\} be the set of time steps where the agent chooses the wrong action according to μf,𝒂\mu_{f,\boldsymbol{a}}, and xtx_{t} is close to the midpoint of section jj.

Since 𝒙,𝒚\boldsymbol{x},\boldsymbol{y}, and 𝒂\boldsymbol{a} are random variables, all variables defined on top of them (such as MjM_{j}) are also random variables. In contrast, the partition 𝒳={X1,,Xf(T)}\mathcal{X}=\{X_{1},\dots,X_{f(T)}\} and properties thereof (like the midpoints mjm_{j}) are not random variables.
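
To make this bookkeeping concrete, the following is a minimal Python sketch (an illustration, not part of the proof) that constructs M_j, J_{\neg Q}, y^j, J'_{\neg Q}, and M'_j for a toy run. It assumes the sections X_j are the intervals [(j-1)/f(T), j/f(T)) with midpoints m_j, and uses 0-based indices in the code.

from collections import Counter

def proof_sets(x, y, a, queried, f_T):
    """Construct the sets used in the proof for one simulated run.
    x: inputs in [0,1]; y: agent actions in {0,1}; a: optimal action per section;
    queried: time steps on which the mentor was queried; f_T: number of sections."""
    T = len(x)
    mid = [(j + 0.5) / f_T for j in range(f_T)]               # midpoints m_j
    section = [min(int(x_t * f_T), f_T - 1) for x_t in x]     # section containing x_t

    # M_j: time steps whose input lies within 1/(4 f(T)) of the midpoint of X_j
    M = {j: [t for t in range(T) if abs(mid[j] - x[t]) <= 1 / (4 * f_T)]
         for j in range(f_T)}
    # J_notQ: sections whose inputs are never queried
    J_notQ = {j for j in range(f_T) if all(section[t] != j for t in queried)}
    # y_hat[j]: most frequent agent action among time steps in M_j (for unqueried j)
    y_hat = {j: Counter(y[t] for t in M[j]).most_common(1)[0][0]
             for j in J_notQ if M[j]}
    # J_notQ_prime: unqueried sections where the frequent action disagrees with a_j
    J_notQ_prime = {j for j in y_hat if y_hat[j] != a[j]}
    # M_prime[j]: time steps in M_j where the agent's action disagrees with a_j
    M_prime = {j: [t for t in M[j] if y[t] != a[j]] for j in range(f_T)}
    return M, J_notQ, y_hat, J_notQ_prime, M_prime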

A.2 Proof roadmap

The proof considers an arbitrary algorithm with sublinear queries, and proceeds via the following steps:

  1. 1.

    Show that multiplicative regret and additive regret are tightly related (Lemma A.1). We will also use this lemma for our positive results.

  2. 2.

    Prove an asymptotic density lemma which we will use to show that f(T)=|QT|Tf(T)=\sqrt{|Q_{T}|T} is asymptotically between |QT||Q_{T}| and TT (Lemma A.2).

  3. 3.

    Prove a simple variant of the Chernoff bound which we will apply multiple times (Lemma A.3).

  4. 4.

    Show that with high probability, jS|Mj|\sum_{j\in S}|M_{j}| is large for any subset of sections SS (Lemma A.4).

  5. 5.

    Prove that |J¬Q||J_{\neg Q}^{\prime}| is large with high probability (Lemma A.5).

  6. 6.

    The key lemma is Lemma A.6, which shows that a randomly sampled 𝒂\boldsymbol{a} produces poor agent performance with high probability. The central idea is that at least f(T)|QT|f(T)-|Q_{T}| sections are never queried (which is large, by Lemma A.2), so the agent has no way of knowing the optimal action in those sections. As a result, the agent picks the wrong answer at least half the time on average (and at least a quarter of the time with high probability). Lemma A.4 implies that a constant fraction of those time steps will have significantly suboptimal payoffs, again with high probability.

  7. 7.

    Apply sup𝝁,πm𝔼𝒙,𝒚RT×(𝒙,𝒚,𝝁,πm)𝔼πm,𝒂U({0,1}f(T))𝔼𝒙,𝒚RT×(𝒙,𝒚,μf,𝒂,πm)\sup\limits_{\boldsymbol{\mu},\pi^{m}}\ \operatorname*{\mathbb{E}}\limits_{\boldsymbol{x},\boldsymbol{y}}\ R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},\boldsymbol{\mu},\pi^{m})\geq\operatorname*{\mathbb{E}}\limits_{\pi^{m},\boldsymbol{a}\sim U(\{0,1\}^{f(T)})}\ \operatorname*{\mathbb{E}}\limits_{\boldsymbol{x},\boldsymbol{y}}\ R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},\mu_{f,\boldsymbol{a}},\pi^{m}). Here U({0,1}f(T))U(\{0,1\}^{f(T)}) is the uniform distribution over bit strings of length f(T)f(T) and we write πm,𝒂U({0,1}f(T))\pi^{m},\boldsymbol{a}\sim U(\{0,1\}^{f(T)}) with slight abuse of notation, since πm\pi^{m} is not drawn from U({0,1}f(T))U(\{0,1\}^{f(T)}) but rather is determined by 𝒂\boldsymbol{a} which is drawn from U({0,1}f(T))U(\{0,1\}^{f(T)}).

  8. 8.

    The analysis above results in a lower bound on RT+R_{T}^{+}. The last step is to use Lemma A.1 to obtain a lower bound on RT×R_{T}^{\times}.

Step 7 is essentially an application of the probabilistic method: if a randomly chosen μf,𝒂\mu_{f,\boldsymbol{a}} has high expected regret, then the worst-case 𝝁\boldsymbol{\mu} also has high expected regret. We have included subscripts in the expectations above to distinguish between the randomness over 𝒂\boldsymbol{a} and 𝒙,𝒚\boldsymbol{x},\boldsymbol{y}. When subscripts are omitted, the expected value is over all randomness, i.e., 𝒂,𝒙,\boldsymbol{a},\boldsymbol{x}, and 𝒚\boldsymbol{y}.

A.3 Proof

Lemma A.1.

If μtm(xt)μt(xt,yt)\mu_{t}^{m}(x_{t})\geq\mu_{t}(x_{t},y_{t}) for all tt, then RT+RT×R_{T}^{+}\leq R_{T}^{\times}. If μt(xt,yt)>0\mu_{t}(x_{t},y_{t})>0 for all tt, then RT×\mfracRT+mint[T]μt(xt,yt)R_{T}^{\times}\leq\mfrac{R_{T}^{+}}{\min_{t\in[T]}\mu_{t}(x_{t},y_{t})}.

Proof.

Recall the standard inequalities 11alogaa11-\frac{1}{a}\leq\log a\leq a-1 for any a>0a>0.

Part 1: RT+RT×R_{T}^{+}\leq R_{T}^{\times}. If μt(xt,yt)=0\mu_{t}(x_{t},y_{t})=0 for any t[T]t\in[T], then RT×=R_{T}^{\times}=\infty and the claim is trivially satisfied. Thus assume μt(xt,yt)>0\mu_{t}(x_{t},y_{t})>0 for all t[T]t\in[T].

RT+=\displaystyle R_{T}^{+}= t=1Tμtm(xt)t=1Tμt(xt,yt)\displaystyle\ \sum_{t=1}^{T}\mu_{t}^{m}(x_{t})-\sum_{t=1}^{T}\mu_{t}(x_{t},y_{t}) (Definition of RT+)\displaystyle(\text{Definition of $R_{T}^{+}$})
\displaystyle\leq\ \sum_{t=1}^{T}\frac{\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t})}{\mu_{t}^{m}(x_{t})}\qquad(\text{$\mu_{t}(x_{t},y_{t})\leq\mu_{t}^{m}(x_{t})$ and $0<\mu_{t}^{m}(x_{t})\leq 1$})
\displaystyle\leq t=1Tlog(μtm(xt)μt(xt,yt))\displaystyle\ \sum_{t=1}^{T}\log\left(\frac{\mu_{t}^{m}(x_{t})}{\mu_{t}(x_{t},y_{t})}\right) (11aloga for any a>0)\displaystyle\Big{(}\text{$1-\frac{1}{a}\leq\log a$ for any $a>0$}\Big{)}
=\displaystyle= logt=1Tμtm(xt)logt=1Tμt(xt,yt)\displaystyle\ \log\prod_{t=1}^{T}\mu_{t}^{m}(x_{t})-\log\prod_{t=1}^{T}\mu_{t}(x_{t},y_{t}) (Properties of logarithms)\displaystyle(\text{Properties of logarithms})
=\displaystyle= RT×\displaystyle\ R_{T}^{\times} (Definition of RT×)\displaystyle(\text{Definition of $R_{T}^{\times}$})

Part 2: RT×RT+/mint[T]μt(xt,yt)R_{T}^{\times}\leq R_{T}^{+}/\min_{t\in[T]}\mu_{t}(x_{t},y_{t}). We have

RT×=\displaystyle R_{T}^{\times}= logt=1Tμtm(xt)logt=1Tμt(xt,yt)\displaystyle\ \log\prod_{t=1}^{T}\mu_{t}^{m}(x_{t})-\log\prod_{t=1}^{T}\mu_{t}(x_{t},y_{t}) (Definition of RT× given μt(xt,yt)>0t[T])\displaystyle(\text{Definition of $R_{T}^{\times}$ given $\mu_{t}(x_{t},y_{t})>0\ \forall t\in[T]$})
=\displaystyle= t=1Tlog(μtm(xt)μt(xt,yt))\displaystyle\ \sum_{t=1}^{T}\log\left(\frac{\mu_{t}^{m}(x_{t})}{\mu_{t}(x_{t},y_{t})}\right) (Properties of logarithms)\displaystyle(\text{Properties of logarithms})
\displaystyle\leq t=1T(μtm(xt)μt(xt,yt)μt(xt,yt))\displaystyle\ \sum_{t=1}^{T}\left(\frac{\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t})}{\mu_{t}(x_{t},y_{t})}\right) (logaa1 for any a>0)\displaystyle(\text{$\log a\leq a-1$ for any $a>0$})
\displaystyle\leq\ \sum_{t=1}^{T}\frac{\mu_{t}(x_{t},y_{t})}{\min_{i\in[T]}\mu_{i}(x_{i},y_{i})}\left(\frac{\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t})}{\mu_{t}(x_{t},y_{t})}\right)\qquad(\text{$\mu_{t}(x_{t},y_{t})\geq\min_{i\in[T]}\mu_{i}(x_{i},y_{i})>0\ \forall t\in[T]$})
=\displaystyle= 1mint[T]μt(xt,yt)t=1T(μtm(xt)μt(xt,yt))\displaystyle\ \frac{1}{\min_{t\in[T]}\mu_{t}(x_{t},y_{t})}\sum_{t=1}^{T}(\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t})) (Arithmetic)\displaystyle(\text{Arithmetic})
=\displaystyle= RT+mint[T]μt(xt,yt)\displaystyle\ \frac{R_{T}^{+}}{\min_{t\in[T]}\mu_{t}(x_{t},y_{t})} (Definition of RT+)\displaystyle(\text{Definition of $R_{T}^{+}$})

as claimed. ∎
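
As a quick numerical sanity check of Lemma A.1 (ours, not part of the proof), the following Python snippet verifies both inequalities on random payoff sequences satisfying \mu_{t}^{m}(x_{t})\geq\mu_{t}(x_{t},y_{t})>0.

import math
import random

def check_lemma_A1(trials=1000, T=50, seed=0):
    """Check R_T^+ <= R_T^x <= R_T^+ / min_t mu_t(x_t, y_t) on random instances."""
    rng = random.Random(seed)
    for _ in range(trials):
        mentor = [rng.uniform(0.5, 1.0) for _ in range(T)]   # mu_t^m(x_t)
        agent = [rng.uniform(0.1, m) for m in mentor]        # mu_t(x_t, y_t) <= mu_t^m(x_t)
        R_add = sum(m - a for m, a in zip(mentor, agent))    # R_T^+
        R_mult = math.log(math.prod(mentor)) - math.log(math.prod(agent))  # R_T^x
        assert R_add <= R_mult + 1e-9
        assert R_mult <= R_add / min(agent) + 1e-9
    return True

print(check_lemma_A1())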

Lemma A.2.

Let a,b:>0>0a,b:\mathbb{R}_{>0}\to\mathbb{R}_{>0} be functions such that a(x)o(b(x))a(x)\in o(b(x)). Then c(x)=a(x)b(x)c(x)=\sqrt{a(x)b(x)} satisfies a(x)o(c(x))a(x)\in o(c(x)) and c(x)o(b(x))c(x)\in o(b(x)).

Proof.

Since aa and bb are strictly positive (and thus cc is as well), we have

a(x)c(x)=a(x)a(x)b(x)=a(x)b(x)=a(x)b(x)b(x)=c(x)b(x)\frac{a(x)}{c(x)}=\frac{a(x)}{\sqrt{a(x)b(x)}}=\sqrt{\frac{a(x)}{b(x)}}=\frac{\sqrt{a(x)b(x)}}{b(x)}=\frac{c(x)}{b(x)}

Then a(x)o(b(x))a(x)\in o(b(x)) implies

limxa(x)c(x)=limxc(x)b(x)=limxa(x)b(x)=0\displaystyle\lim_{x\to\infty}\frac{a(x)}{c(x)}=\lim_{x\to\infty}\frac{c(x)}{b(x)}=\lim_{x\to\infty}\sqrt{\frac{a(x)}{b(x)}}=0

as required. ∎
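
A tiny numerical illustration of Lemma A.2 (the choice a(x)=x^{0.2} and b(x)=x is ours, purely for illustration):

import math

a = lambda x: x ** 0.2        # a(x) in o(b(x))
b = lambda x: x
c = lambda x: math.sqrt(a(x) * b(x))

for x in [1e2, 1e4, 1e6, 1e8]:
    # The two ratios are equal at every x (the identity used in the proof)
    # and both shrink toward 0, i.e., a in o(c) and c in o(b).
    print(f"x={x:.0e}  a/c={a(x) / c(x):.4f}  c/b={c(x) / b(x):.4f}")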

Lemma A.3.

Let z1,,znz_{1},\dots,z_{n} be i.i.d. variables in {0,1}\{0,1\} and let Z=i=1nziZ=\sum_{i=1}^{n}z_{i}. If 𝔼[Z]W\operatorname*{\mathbb{E}}[Z]\geq W, then Pr[ZW/2]exp(W/8)\Pr\big{[}Z\leq W/2\big{]}\leq\exp(-W/8).

Proof.

By the Chernoff bound for i.i.d. binary variables, we have Pr[Z𝔼[Z]/2]exp(𝔼[Z]/8)\Pr[Z\leq\operatorname*{\mathbb{E}}[Z]/2]\leq\exp(-\operatorname*{\mathbb{E}}[Z]/8). Since 𝔼[Z]W-\operatorname*{\mathbb{E}}[Z]\leq-W and exp\exp is an increasing function, we have exp(𝔼[Z]/8)exp(W/8)\exp(-\operatorname*{\mathbb{E}}[Z]/8)\leq\exp(-W/8). Also, W/2E[Z]/2W/2\leq E[Z]/2 implies Pr[ZW/2]Pr[Z𝔼[Z]/2]\Pr[Z\leq W/2]\leq\Pr[Z\leq\operatorname*{\mathbb{E}}[Z]/2]. Combining these inequalities proves the lemma. ∎
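
A short Monte Carlo check of Lemma A.3 (the parameters n=40 and q=0.3 are ours, chosen so that the event Z\leq W/2 is rare but observable):

import math
import random

def check_lemma_A3(n=40, q=0.3, trials=20000, seed=0):
    """Estimate Pr[Z <= W/2] for Z = sum of n i.i.d. Bernoulli(q) variables,
    with W = E[Z] = n*q, and compare against the bound exp(-W/8)."""
    rng = random.Random(seed)
    W = n * q
    hits = sum(
        sum(rng.random() < q for _ in range(n)) <= W / 2
        for _ in range(trials)
    )
    print(f"empirical {hits / trials:.4f}  <=  bound {math.exp(-W / 8):.4f}")

check_lemma_A3()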

Lemma A.4.

Let S[f(T)]S\subseteq[f(T)] be any nonempty subset of sections. Then

Pr[jS|Mj|T|S|4f(T)]exp(T16f(T))\Pr\left[\sum_{j\in S}|M_{j}|\leq\frac{T|S|}{4f(T)}\right]\leq\exp\left(\frac{-T}{16f(T)}\right)
Proof.

For each t\in[T], define the random variable z_{t} by z_{t}=1 if t\in M_{j} for some j\in S and z_{t}=0 otherwise. We have t\in M_{j} iff x_{t} falls within a particular interval of length \frac{1}{2f(T)}. Since these intervals are disjoint for different j's, we have z_{t}=1 iff x_{t} falls within a portion of the input space with total measure \frac{|S|}{2f(T)}. Since x_{t} is uniformly random across [0,1], we have \operatorname*{\mathbb{E}}[z_{t}]=\frac{|S|}{2f(T)}. Then \operatorname*{\mathbb{E}}[\sum_{t=1}^{T}z_{t}]=\operatorname*{\mathbb{E}}[\sum_{j\in S}|M_{j}|]=\frac{T|S|}{2f(T)}. Furthermore, since x_{1},\dots,x_{T} are i.i.d., so are z_{1},\dots,z_{T}. Then by Lemma A.3,

Pr[jS|Mj|T|S|4f(T)]exp(T|S|16f(T))exp(T16f(T))\Pr\left[\sum_{j\in S}|M_{j}|\leq\frac{T|S|}{4f(T)}\right]\leq\exp\left(\frac{-T|S|}{16f(T)}\right)\leq\exp\left(\frac{-T}{16f(T)}\right)

with the last step due to |S|1|S|\geq 1. ∎

Lemma A.5.

We have

Pr[|J¬Q|f(T)𝔼[|QT|]4]exp(f(T)𝔼[|QT|]16)\Pr\left[|J_{\neg Q}^{\prime}|\leq\frac{f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|]}{4}\right]\leq\exp\left(-\frac{f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|]}{16}\right)
Proof.

Define a random variable zj=𝟏jJ¬Qz_{j}=\mathbf{1}_{j\in J_{\neg Q}^{\prime}} for each jJ¬Qj\in J_{\neg Q}. By definition, if jJ¬Qj\in J_{\neg Q}, no input in XjX_{j} is queried. Since queries outside of XjX_{j} provide no information about aja_{j}, the agent’s actions must be independent of aja_{j}. In particular, the random variables aja_{j} and yjy^{j} are independent. Combining that independence with Pr[aj=0]=Pr[aj=1]=0.5\Pr[a_{j}=0]=\Pr[a_{j}=1]=0.5 yields Pr[zj=1]=0.5\Pr[z_{j}=1]=0.5 for all jJ¬Qj\in J_{\neg Q}. Then

𝔼[|J¬Q|]=\displaystyle\operatorname*{\mathbb{E}}\left[|J_{\neg Q}^{\prime}|\right]= 𝔼[jJ¬Qzj]\displaystyle\ \operatorname*{\mathbb{E}}\left[\sum_{j\in J_{\neg Q}}z_{j}\right]
=\displaystyle= |J¬Q|/2\displaystyle\ |J_{\neg Q}|/2
\displaystyle\geq f(T)𝔼[|QT|]2\displaystyle\ \frac{f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|]}{2}

Furthermore, since a1,,af(T)a_{1},\dots,a_{f(T)} are independent, the random variables {zj:jJ¬Q}\{z_{j}:j\in J_{\neg Q}\} are also independent. Applying Lemma A.3 yields the desired bound. ∎

Lemma A.6.

Suppose f:\mathbb{N}\to\mathbb{N} and independently sample \boldsymbol{a}\sim U(\{0,1\}^{f(T)}) and \boldsymbol{x}\sim U(\mathcal{X})^{T}; that is, the entire set \{a_{1},\dots,a_{f(T)},x_{1},\dots,x_{T}\} is mutually independent. Then with probability at least 1-\exp\left(\frac{-T}{16f(T)}\right)-\exp\left(-\frac{f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|]}{16}\right),

RT+LT(f(T)𝔼[|QT|])27f(T)2\displaystyle R_{T}^{+}\geq\frac{LT(f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|])}{2^{7}f(T)^{2}}
Proof.

Consider any jJ¬Qj\in J_{\neg Q}^{\prime} and tMjMjt\in M_{j}^{\prime}\subseteq M_{j}. By definition of MjM_{j}, we have |mjxt|14f(T)|m_{j}-x_{t}|\leq\frac{1}{4f(T)}. Then by the definition of μf,𝒂\mu_{f,\boldsymbol{a}},

μf,𝒂(xt,yt)=\displaystyle\mu_{f,\boldsymbol{a}}(x_{t},y_{t})= 1L(12f(T)|xtmj|)\displaystyle\ 1-L\left(\frac{1}{2f(T)}-|x_{t}-m_{j}|\right)
\displaystyle\leq 1L(12f(T)14f(T))\displaystyle\ 1-L\left(\frac{1}{2f(T)}-\frac{1}{4f(T)}\right)
=\displaystyle= 1L4f(T)\displaystyle\ 1-\frac{L}{4f(T)}

Since μf,𝒂m(xt)μf,𝒂(xt,yt)\mu_{f,\boldsymbol{a}}^{m}(x_{t})\geq\mu_{f,\boldsymbol{a}}(x_{t},y_{t}) always, we can safely restrict ourselves to time steps tMjt\in M_{j}^{\prime} for some jJ¬Qj\in J_{\neg Q}^{\prime} and still obtain a lower bound:

RT+=\displaystyle R_{T}^{+}= t=1T(μf,𝒂m(xt)μf,𝒂(xt,yt))\displaystyle\ \sum_{t=1}^{T}(\mu_{f,\boldsymbol{a}}^{m}(x_{t})-\mu_{f,\boldsymbol{a}}(x_{t},y_{t})) (Definition of RT+)\displaystyle(\text{Definition of $R_{T}^{+}$})
\displaystyle\geq jJ¬QtMj(μf,𝒂m(xt)μf,𝒂(xt,yt))\displaystyle\ \sum_{j\in J_{\neg Q}^{\prime}}\sum_{t\in M_{j}^{\prime}}\big{(}\mu_{f,\boldsymbol{a}}^{m}(x_{t})-\mu_{f,\boldsymbol{a}}(x_{t},y_{t})\big{)} (μf,𝒂m(xt)μf,𝒂(xt,yt))\displaystyle(\mu_{f,\boldsymbol{a}}^{m}(x_{t})\geq\mu_{f,\boldsymbol{a}}(x_{t},y_{t}))
\displaystyle\geq jJ¬QtMj(1μf,𝒂(xt,yt))\displaystyle\ \sum_{j\in J_{\neg Q}^{\prime}}\sum_{t\in M_{j}^{\prime}}(1-\mu_{f,\boldsymbol{a}}(x_{t},y_{t})) (μf,𝒂m(xt)=1 always)\displaystyle(\mu_{f,\boldsymbol{a}}^{m}(x_{t})=1\text{ always})
\displaystyle\geq jJ¬QtMj(11+L4f(T))\displaystyle\ \sum_{j\in J_{\neg Q}^{\prime}}\sum_{t\in M_{j}^{\prime}}\left(1-1+\frac{L}{4f(T)}\right) (bound on μf,𝒂(xt,yt) for tMj)\displaystyle(\text{bound on $\mu_{f,\boldsymbol{a}}(x_{t},y_{t})$ for $t\in M_{j}^{\prime}$})
=\displaystyle= jJ¬QL|Mj|4f(T)\displaystyle\ \sum_{j\in J_{\neg Q}^{\prime}}\frac{L|M_{j}^{\prime}|}{4f(T)} (Simplifying inner sum)\displaystyle(\text{Simplifying inner sum})

Since jJ¬Qj\in J_{\neg Q}, the mentor is not queried on any time step tMjt\in M_{j}, so yt{0,1}y_{t}\in\{0,1\} for all tMjt\in M_{j}. Since the agent chooses one of two actions for each tMjt\in M_{j}, the more frequent action must be chosen at least half of the time: yt=yjy_{t}=y^{j} for at least half of the time steps in MjM_{j}. Since ajyja_{j}\neq y^{j} for jJ¬Qj\in J_{\neg Q}^{\prime}, we have yt=yjajy_{t}=y^{j}\neq a_{j} for those time steps, so |Mj||Mj|/2|M_{j}^{\prime}|\geq|M_{j}|/2. Thus

RT+jJ¬QL|Mj|8f(T)R_{T}^{+}\geq\sum_{j\in J_{\neg Q}^{\prime}}\frac{L|M_{j}|}{8f(T)}

By Lemma A.4, Lemma A.5, and the union bound, with probability at least 1-\exp\left(\frac{-T}{16f(T)}\right)-\exp\left(-\frac{f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|]}{16}\right) we have both \sum_{j\in J_{\neg Q}^{\prime}}|M_{j}|\geq\frac{T|J_{\neg Q}^{\prime}|}{4f(T)} and |J_{\neg Q}^{\prime}|\geq\frac{f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|]}{4}. Assuming those inequalities hold, we have

RT+\displaystyle R_{T}^{+}\geq jJ¬QL|Mj|8f(T)\displaystyle\ \sum_{j\in J_{\neg Q}^{\prime}}\frac{L|M_{j}|}{8f(T)}
\displaystyle\geq L8f(T)T|J¬Q|4f(T)\displaystyle\ \frac{L}{8f(T)}\cdot\frac{T|J_{\neg Q}^{\prime}|}{4f(T)}
\displaystyle\geq L8f(T)T4f(T)f(T)𝔼[|QT|]4\displaystyle\ \frac{L}{8f(T)}\cdot\frac{T}{4f(T)}\cdot\frac{f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|]}{4}
=\displaystyle= LT(f(T)𝔼[|QT|])27f(T)2\displaystyle\ \frac{LT(f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|])}{2^{7}f(T)^{2}}

as required. ∎
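
To see the mechanism numerically, here is a small simulation (ours). It assumes, consistently with the display above, that the mentor's action a_j earns payoff 1 while the other action at input x in section j earns 1-L(\frac{1}{2f(T)}-|x-m_{j}|), and it runs a query-free agent that commits to one random guess per section; its additive regret comfortably exceeds the LT/(2^{7}f(T)) scale of the lower bound.

import random

def simulate_hard_instance(T=20000, f=100, L=1.0, seed=0):
    """Additive regret of a query-free, fixed-guess-per-section agent on the
    hard environment sketched above (payoff structure as assumed in the lead-in)."""
    rng = random.Random(seed)
    a = [rng.randint(0, 1) for _ in range(f)]       # hidden optimal action per section
    guess = [rng.randint(0, 1) for _ in range(f)]   # agent's fixed guess per section
    regret = 0.0
    for _ in range(T):
        x = rng.random()
        j = min(int(x * f), f - 1)
        m_j = (j + 0.5) / f                         # midpoint of section j
        if guess[j] != a[j]:                        # wrong action: payoff below 1
            regret += L * (1 / (2 * f) - abs(x - m_j))
    return regret

print(simulate_hard_instance(), "vs", 1.0 * 20000 / (2 ** 7 * 100))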

For a given f:f:\mathbb{N}\to\mathbb{N}, define αf(T)=exp(T16f(T))+exp(f(T)𝔼[|QT|]16)\alpha_{f}(T)=\exp\big{(}\frac{-T}{16f(T)}\big{)}+\exp\big{(}-\frac{f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|]}{16}\big{)} for brevity. See 4.1

Proof.

If the algorithm has sublinear queries, then there exists g(T)\in o(T) such that \sup_{\boldsymbol{\mu},\pi^{m}}\operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}[|Q_{T}|]\leq g(T). Consider any such g(T) satisfying g(T)>0. Since this bound holds for every \boldsymbol{\mu}, it also holds in expectation over \boldsymbol{a}\sim U(\{0,1\}^{f(T)}), so \operatorname*{\mathbb{E}}_{\boldsymbol{a},\boldsymbol{x},\boldsymbol{y}}[|Q_{T}|]=\operatorname*{\mathbb{E}}[|Q_{T}|]\leq g(T).

Next, Lemma A.2 gives us g(T)\in o(\sqrt{g(T)T}) and \sqrt{g(T)T}\in o(T). Let f(T)=\lceil\sqrt{g(T)T}\rceil: then f(T)\in\Theta(\sqrt{g(T)T}), so g(T)\in o(f(T)) and f(T)\in o(T). First, this implies that \lim_{T\to\infty}\alpha_{f}(T)=0. Second, g(T)\in o(f(T)) implies that there exists T_{0} such that g(T)\leq f(T)/2 for all T\geq T_{0}. We also have R_{T}^{+}\geq 0 always, since \mu_{f,\boldsymbol{a}}^{m}(x_{t})\geq\mu_{f,\boldsymbol{a}}(x_{t},y_{t}) always. Then for all T\geq T_{0} we have

𝔼πm,𝒂U({0,1}f(T))𝔼𝒙,𝒚[RT+(𝒙,𝒚,𝝁,πm)]\displaystyle\ \operatorname*{\mathbb{E}}_{\pi^{m},\boldsymbol{a}\sim U(\{0,1\}^{f(T)})}\ \,\operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \left[R_{T}^{+}(\boldsymbol{x},\boldsymbol{y},\boldsymbol{\mu},\pi^{m})\right]
\displaystyle\geq\ \alpha_{f}(T)\cdot 0+\big{(}1-\alpha_{f}(T)\big{)}\left(\frac{LT(f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|])}{2^{7}f(T)^{2}}\right)\qquad(\text{Lemma A.6 and $R_{T}^{+}\geq 0$})
\displaystyle\geq (1αf(T))(LT(f(T)g(T))27f(T)2)\displaystyle\ \big{(}1-\alpha_{f}(T)\big{)}\left(\frac{LT(f(T)-g(T))}{2^{7}f(T)^{2}}\right) (𝔼[QT]g(T))\displaystyle(\text{$\operatorname*{\mathbb{E}}[Q_{T}]\leq g(T)$})
\displaystyle\geq (1αf(T))(LT28f(T))\displaystyle\ \big{(}1-\alpha_{f}(T)\big{)}\left(\frac{LT}{2^{8}f(T)}\right) (g(T)f(T)/2)\displaystyle(\text{$g(T)\leq f(T)/2$})

Since limTαf(T)=0\lim_{T\to\infty}\alpha_{f}(T)=0,

sup𝝁,πm𝔼𝒙,𝒚[RT+(𝒙,𝒚,𝝁,πm)]\displaystyle\sup_{\boldsymbol{\mu},\pi^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \big{[}R_{T}^{+}(\boldsymbol{x},\boldsymbol{y},\boldsymbol{\mu},\pi^{m})\big{]}\geq 𝔼πm,𝒂U({0,1}f(T))𝔼𝒙,𝒚[RT+(𝒙,𝒚,(μf,𝒂,,μf,𝒂),πm)]\displaystyle\ \operatorname*{\mathbb{E}}_{\pi^{m},\boldsymbol{a}\sim U(\{0,1\}^{f(T)})}\ \,\operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \big{[}R_{T}^{+}(\boldsymbol{x},\boldsymbol{y},(\mu_{f,\boldsymbol{a}},\dots,\mu_{f,\boldsymbol{a}}),\pi^{m})\big{]}
\displaystyle\geq (1αf(T))(LT28f(T))\displaystyle\ \big{(}1-\alpha_{f}(T)\big{)}\left(\frac{LT}{2^{8}f(T)}\right)
\displaystyle\in Ω(LTf(T))\displaystyle\ \Omega\left(\frac{LT}{f(T)}\right)
=\displaystyle= Ω(LTg(T))\displaystyle\ \Omega\left(L\sqrt{\frac{T}{g(T)}}\right)

This holds for any g(T)o(T)g(T)\in o(T) such that sup𝝁𝔼[|QT|]g(T)\sup_{\boldsymbol{\mu}}\operatorname*{\mathbb{E}}[|Q_{T}|]\leq g(T) and g(T)>0g(T)>0. Thus we can simply set g(T)=sup𝝁,πm𝔼[|QT|]+1g(T)=\sup_{\boldsymbol{\mu},\pi^{m}}\operatorname*{\mathbb{E}}[|Q_{T}|]+1, since sup𝝁,πm𝔼[|QT|]\sup_{\boldsymbol{\mu},\pi^{m}}\operatorname*{\mathbb{E}}[|Q_{T}|] is indeed a function of only TT.

Since μf,𝒂m(xt)μf,𝒂(xt,yt)\mu_{f,\boldsymbol{a}}^{m}(x_{t})\geq\mu_{f,\boldsymbol{a}}(x_{t},y_{t}) for all t[T]t\in[T], Lemma A.1 implies that

sup𝝁,πm𝔼𝒙,𝒚[RT×(𝒙,𝒚,𝝁,πm)]\displaystyle\sup_{\boldsymbol{\mu},\pi^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \big{[}R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},\boldsymbol{\mu},\pi^{m})\big{]}\geq 𝔼πm,𝒂U({0,1}f(T))𝔼𝒙,𝒚[RT×(𝒙,𝒚,(μf,𝒂,,μf,𝒂),πm)]\displaystyle\ \operatorname*{\mathbb{E}}_{\pi^{m},\boldsymbol{a}\sim U(\{0,1\}^{f(T)})}\ \,\operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \big{[}R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},(\mu_{f,\boldsymbol{a}},\dots,\mu_{f,\boldsymbol{a}}),\pi^{m})\big{]}
\displaystyle\in Ω(LTsup𝝁,πm𝔼[|QT|]+1)\displaystyle\ \Omega\left(L\sqrt{\frac{T}{\sup_{\boldsymbol{\mu},\pi^{m}}\operatorname*{\mathbb{E}}[|Q_{T}|]+1}}\right)

completing the proof.∎

See 4.1.1

Proof.

We have t=1Tμf,𝒂m(xt)=1\prod_{t=1}^{T}\mu_{f,\boldsymbol{a}}^{m}(x_{t})=1 from our construction. Then Lemma A.1 implies that RT×RT+R_{T}^{\times}\geq R_{T}^{+}, so

exp(RT+)\displaystyle\exp(-R_{T}^{+})\geq exp(RT×)\displaystyle\ \exp(-R_{T}^{\times})
=\displaystyle= exp(logt=1Tμf,𝒂(xt,yt)log1)\displaystyle\ \exp\left(\log\prod_{t=1}^{T}\mu_{f,\boldsymbol{a}}(x_{t},y_{t})-\log 1\right)
=\displaystyle= t=1Tμf,𝒂(xt,yt)\displaystyle\ \prod_{t=1}^{T}\mu_{f,\boldsymbol{a}}(x_{t},y_{t})

Then by Lemma A.6, with probability 1αf(T)1-\alpha_{f}(T),

t=1Tμf,𝒂(xt,yt)exp(LT(f(T)𝔼[|QT|])27f(T)2)exp(LT28f(T))\displaystyle\prod_{t=1}^{T}\mu_{f,\boldsymbol{a}}(x_{t},y_{t})\leq\exp\left(-\frac{LT(f(T)-\operatorname*{\mathbb{E}}[|Q_{T}|])}{2^{7}f(T)^{2}}\right)\leq\exp\left(-\frac{LT}{2^{8}f(T)}\right)

Since t=1Tμf,𝒂(xt,yt)1\prod_{t=1}^{T}\mu_{f,\boldsymbol{a}}(x_{t},y_{t})\leq 1 always,

limT𝔼πm,𝒂U({0,1}f(T))𝔼𝒙,𝒚[t=1Tμf,𝒂(xt,yt)]\displaystyle\lim_{T\to\infty}\operatorname*{\mathbb{E}}_{\pi^{m},\boldsymbol{a}\sim U(\{0,1\}^{f(T)})}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \left[\prod_{t=1}^{T}\mu_{f,\boldsymbol{a}}(x_{t},y_{t})\right]\leq limT𝔼πm,𝒂U({0,1}f(T))𝔼𝒙,𝒚[t=1Tμf,𝒂(xt,yt)]\displaystyle\ \lim_{T\to\infty}\operatorname*{\mathbb{E}}_{\pi^{m},\boldsymbol{a}\sim U(\{0,1\}^{f(T)})}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \left[\prod_{t=1}^{T}\mu_{f,\boldsymbol{a}}(x_{t},y_{t})\right]
\displaystyle\leq limT(1αf(T)+(1αf(T))exp(LT28f(T)))\displaystyle\ \lim_{T\to\infty}\left(1\cdot\alpha_{f}(T)+(1-\alpha_{f}(T))\cdot\exp\left(-\frac{LT}{2^{8}f(T)}\right)\right)
\displaystyle\leq limT(1αf(T))limTexp(LT28f(T))\displaystyle\ \lim_{T\to\infty}(1-\alpha_{f}(T))\cdot\lim_{T\to\infty}\exp\left(-\frac{LT}{2^{8}f(T)}\right)
=\displaystyle= 1exp()\displaystyle\ 1\cdot\exp(-\infty)
=\displaystyle= 0\displaystyle\ 0

Since this upper bound holds for a randomly chosen μf,𝒂,πm\mu_{f,\boldsymbol{a}},\pi^{m}, the same upper bound holds for a worst-case choice of 𝝁,πm\boldsymbol{\mu},\pi^{m} among 𝝁,πm\boldsymbol{\mu},\pi^{m} which satisfy μtm(x)=1\mu_{t}^{m}(x)=1 for all t[T],x𝒳t\in[T],x\in\mathcal{X}. Formally,

limTsup𝝁,πm:μt(x)=1t,x𝔼[t=1Tμt(xt,yt)]0\lim_{T\to\infty}\ \sup_{\boldsymbol{\mu},\pi^{m}:\,\mu_{t}(x)=1\,\forall t,x}\operatorname*{\mathbb{E}}\left[\prod_{t=1}^{T}\mu_{t}(x_{t},y_{t})\right]\leq 0

Since t=1Tμt(xt,yt)0\prod_{t=1}^{T}\mu_{t}(x_{t},y_{t})\geq 0 always, the inequality above holds with equality. ∎

Appendix B Proof of Theorem 5.2

B.1 Context on Lemma 5.4

Before diving into the main proof, we provide some context on Lemma 5.4 from Section 5:

See 5.4

Lemma 5.4 is a restatement and simplification of Lemma 3.5 in Russo et al. (2024) with the following differences:

  1. 1.

    They parametrize their algorithm by the expected number of queries k^\hat{k} instead of the query probability p=k^/Tp=\hat{k}/T.

  2. 2.

    They include a second parameter kk, which is the eventual target number of queries for their unconditional query bound. In our case, an expected query bound is sufficient, so we simply set k=k^k=\hat{k}.

  3. 3.

    They provide a second bound which is tighter for small kk; that bound is less useful for us so we omit it.

  4. 4.

    Their “actions” correspond to policies in our setting, not actions in 𝒴\mathcal{Y}. Their number of actions nn corresponds to |Π~||\tilde{\Pi}|.

  5. 5.

    We include an expectation over both loss terms, while they only include an expectation over the agent’s loss. This is because an adaptive adversary may choose the loss function for time tt in a randomized manner. Since we eventually set (t,π)=𝟏(π(xt)πm(xt))\ell(t,\pi)=\mathbf{1}(\pi(x_{t})\neq\pi^{m}(x_{t})), the randomization in \ell corresponds to the randomization in xtx_{t}.

Altogether, since Russo et al. (2024) set η=max(1Tk^logn2,kk^2T2)\eta=\max\Big{(}\frac{1}{T}\sqrt{\frac{\hat{k}\log n}{2}},\frac{k\hat{k}}{\sqrt{2}T^{2}}\Big{)}, we end up with η=max(plog|Π~|2T,p22)\eta=\max\Big{(}\sqrt{\frac{p\log|\tilde{\Pi}|}{2T}},\>\frac{p^{2}}{\sqrt{2}}\Big{)}. Algorithm 2 provides precise pseudocode for the HedgeWithQueries algorithm to which Lemma 5.4 refers.

function HedgeWithQueries(p(0,1],finite policy class Π~,unknown :[T]×Π~[0,1])(p\in(0,1],\>\text{finite policy class }\tilde{\Pi},\>\text{unknown }\ell:[T]\times\tilde{\Pi}\to[0,1]) 
  w(π)1w(\pi)\leftarrow 1 for all πΠ~\pi\in\tilde{\Pi}
  ηmax(plog|Π~|2T,p22)\eta\leftarrow\max\big{(}\sqrt{\frac{p\log|\tilde{\Pi}|}{2T}},\>\frac{p^{2}}{\sqrt{2}}\big{)}
  for tt from 11 to TT do
   with probability p:p: hedgeQuerytrue\texttt{hedgeQuery}\leftarrow\texttt{true}
   with probability 1p:1-p: hedgeQueryfalse\texttt{hedgeQuery}\leftarrow\texttt{false}
   if hedgeQuery then
    Query and observe (t,π)\ell(t,\pi) for all πΠ~\pi\in\tilde{\Pi}
    minπΠ~(t,π)\ell^{*}\leftarrow\min_{\pi\in\tilde{\Pi}}\ell(t,\pi)
    w(π)w(π)exp(η((t,π)))w(\pi)\leftarrow w(\pi)\cdot\exp(-\eta(\ell(t,\pi)-\ell^{*})) for all πΠ~\pi\in\tilde{\Pi}
    Select policy argminπΠ~(t,π)\arg\min_{\pi\in\tilde{\Pi}}\ell(t,\pi)
   else
    P(π)w(π)/πΠ~w(π)P(\pi)\leftarrow w(\pi)/\sum_{\pi^{\prime}\in\tilde{\Pi}}w(\pi^{\prime}) for all πΠ~\pi\in\tilde{\Pi}
    Sample πtP\pi_{t}\sim P
    Select policy πt\pi_{t}
   
  
Algorithm 2 A variant of the Hedge algorithm which only observes losses in response to queries.
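
For concreteness, here is a minimal executable Python sketch of the pseudocode above (an illustration, not the authors' implementation). The loss oracle loss(t, pi), the policy collection policies, and the horizon T are placeholders supplied by the caller.

import math
import random

def hedge_with_queries(p, policies, loss, T, rng=None):
    """With probability p per step, query the full loss vector, update the
    weights, and play the loss-minimizing policy; otherwise sample a policy
    from the current weights without updating them."""
    rng = rng or random.Random(0)
    w = {pi: 1.0 for pi in policies}
    eta = max(math.sqrt(p * math.log(len(policies)) / (2 * T)), p ** 2 / math.sqrt(2))
    chosen = []
    for t in range(1, T + 1):
        if rng.random() < p:                         # hedgeQuery = true
            losses = {pi: loss(t, pi) for pi in policies}
            best = min(losses.values())
            for pi in policies:
                w[pi] *= math.exp(-eta * (losses[pi] - best))
            chosen.append(min(policies, key=lambda pi: losses[pi]))
        else:                                        # hedgeQuery = false
            total = sum(w.values())
            r, acc = rng.random() * total, 0.0
            for pi in policies:
                acc += w[pi]
                if acc >= r:
                    chosen.append(pi)
                    break
    return chosen

For instance, policies could be a tuple of callables and loss(t, pi) could be 1 if pi disagrees with the mentor on x_{t} and 0 otherwise, matching the choice of \ell in the proof of Lemma B.1.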

B.2 Main proof

We use the following notation throughout the proof:

  1. 1.

    For each t[T]t\in[T], let StS_{t} refer to the value of SS at the start of time step tt.

  2. 2.

    Let MT={t[T]:πt(xt)πm(xt)}M_{T}=\{t\in[T]:\pi_{t}(x_{t})\neq\pi^{m}(x_{t})\} be the set of time steps where Hedge’s proposed action doesn’t match the mentor’s. Note that |MT||M_{T}| upper bounds the number of mistakes the algorithm makes (the number of mistakes could be smaller, since the algorithm sometimes queries instead of taking action πt(xt)\pi_{t}(x_{t})).

  3. 3.

    For S𝒳S\subseteq\mathcal{X}, let vol(S)\operatorname*{vol}(S) denote the nn-dimensional Lebesgue measure of SS.

  4. 4.

    With slight abuse of notation, we will use inequalities of the form f(T)g(T)+O(h(T))f(T)\leq g(T)+O(h(T)) to mean that there exists a constant CC such that f(T)g(T)+Ch(T)f(T)\leq g(T)+Ch(T).

  5. 5.

    We will use “Case 1” to refer to finite VC dimension and σ\sigma-smooth 𝒙\boldsymbol{x} and “Case 2” to refer to finite Littlestone dimension.

Lemma B.1.

Let \mathcal{Y}=\{0,1\}. Assume \pi^{m}\in\Pi where either (1) \Pi has finite VC dimension d and \boldsymbol{x} is \sigma-smooth, or (2) \Pi has finite Littlestone dimension d. Then for any T\in\mathbb{N} and \varepsilon\geq 1/T (note that this lemma omits the assumption \varepsilon\leq(\frac{\mu_{0}^{m}}{2L})^{n}, since we do not need it here and would like to apply the lemma in the multi-action case without it), Algorithm 1 satisfies

𝔼[|MT|]O(dσTεlog(T+1/ε))\operatorname*{\mathbb{E}}[|M_{T}|]\in O\left(\frac{d}{\sigma}T\varepsilon\log(T+1/\varepsilon)\right)
Proof.

Define :[T]×Π~[0,1]\ell:[T]\times\tilde{\Pi}\to[0,1] by (t,π)=𝟏(π(xt)πm(xt))\ell(t,\pi)=\mathbf{1}(\pi(x_{t})\neq\pi^{m}(x_{t})), and let whw^{h} and πth\pi_{t}^{h} denote the values of ww and πt\pi_{t} respectively in HedgeWithQueries, while ww and πt\pi_{t} refer to the variables in Algorithm 1. Then ww and whw^{h} evolve in the exact same way, so the distributions of πt\pi_{t} and πth\pi_{t}^{h} coincide. Also, ε1/T\varepsilon\geq 1/T implies that p=1/εT(0,1]p=1/\sqrt{\varepsilon T}\in(0,1]. Thus by Lemma 5.4,

t=1T𝔼[(t,πt)]minπ~Π~t=1T𝔼[(t,π~)]\displaystyle\sum_{t=1}^{T}\operatorname*{\mathbb{E}}[\ell(t,\pi_{t})]-\min_{\tilde{\pi}\in\tilde{\Pi}}\sum_{t=1}^{T}\operatorname*{\mathbb{E}}[\ell(t,\tilde{\pi})]\leq log|Π~|p2\displaystyle\ \frac{\log|\tilde{\Pi}|}{p^{2}}
=\displaystyle= Tεlog|Π~|\displaystyle\ T\varepsilon\log|\tilde{\Pi}|

Since |MT|=t=1T𝟏(πt(xt)πm(xt))=t=1T(t,πt)|M_{T}|=\sum_{t=1}^{T}\mathbf{1}(\pi_{t}(x_{t})\neq\pi^{m}(x_{t}))=\sum_{t=1}^{T}\ell(t,\pi_{t}), we have

𝔼[|MT|]Tεlog|Π~|+minπ~Π~t=1T𝔼[𝟏(π~(xt)πm(xt))]\operatorname*{\mathbb{E}}[|M_{T}|]\leq T\varepsilon\log|\tilde{\Pi}|+\min_{\tilde{\pi}\in\tilde{\Pi}}\sum_{t=1}^{T}\operatorname*{\mathbb{E}}[\mathbf{1}(\tilde{\pi}(x_{t})\neq\pi^{m}(x_{t}))]

Case 1: Since Π~\tilde{\Pi} is a smooth ε\varepsilon-cover and πmΠ\pi^{m}\in\Pi, Lemma 5.1 implies that 𝔼[𝟏(π~(xt)πm(xt))]ε/σ\operatorname*{\mathbb{E}}[\mathbf{1}(\tilde{\pi}(x_{t})\neq\pi^{m}(x_{t}))]\leq\varepsilon/\sigma for any π~Π~\tilde{\pi}\in\tilde{\Pi}. Since |Π~|(41/ε)d|\tilde{\Pi}|\leq(41/\varepsilon)^{d} by construction (and such a Π~\tilde{\Pi} is guaranteed to exist by Lemma 5.2), we get

𝔼[|MT|]\displaystyle\operatorname*{\mathbb{E}}[|M_{T}|]\leq Tεlog((41/ε)d)+minπ~Π~t=1Tεσ\displaystyle\ T\varepsilon\log((41/\varepsilon)^{d})+\min_{\tilde{\pi}\in\tilde{\Pi}}\sum_{t=1}^{T}\frac{\varepsilon}{\sigma}
=\displaystyle= dTεlog(41/ε)+Tεσ\displaystyle\ dT\varepsilon\log(41/\varepsilon)+\frac{T\varepsilon}{\sigma}
\displaystyle\in O(dσTεlog(T+1/ε))\displaystyle\ O\left(\frac{d}{\sigma}T\varepsilon\log(T+1/\varepsilon)\right)

Case 2: Since Π~\tilde{\Pi} is an adversarial cover of Π\Pi and πmΠ\pi^{m}\in\Pi, there exists π~Π~\tilde{\pi}\in\tilde{\Pi} such that t=1T𝟏(π~(xt)πm(xt))=0\sum_{t=1}^{T}\mathbf{1}(\tilde{\pi}(x_{t})\neq\pi^{m}(x_{t}))=0. Since |Π~|(eT/d)d|\tilde{\Pi}|\leq(eT/d)^{d} (with such a Π~\tilde{\Pi} guaranteed to exist by Lemma 5.3),

𝔼[|MT|]\displaystyle\operatorname*{\mathbb{E}}[|M_{T}|]\leq Tεlog|Π~|+minπ~Π~t=1T𝟏(π~(xt)πm(xt))\displaystyle\ T\varepsilon\log|\tilde{\Pi}|+\min_{\tilde{\pi}\in\tilde{\Pi}}\sum_{t=1}^{T}\mathbf{1}(\tilde{\pi}(x_{t})\neq\pi^{m}(x_{t}))
\displaystyle\leq Tεdln(eT/d)\displaystyle\ T\varepsilon d\ln(eT/d)
\displaystyle\in O(dσTεlog(T+1/ε))\displaystyle\ O\left(\frac{d}{\sigma}T\varepsilon\log(T+1/\varepsilon)\right)

as required. ∎

Lemma B.2.

For all t[T]t\in[T], μt(xt,yt)μtm(xt)Lε1/n\mu_{t}(x_{t},y_{t})\geq\mu_{t}^{m}(x_{t})-L\varepsilon^{1/n}.

Proof.

Consider an arbitrary t[T]t\in[T]. If tQTt\in Q_{T}, then μt(xt,yt)=μtm(xt)\mu_{t}(x_{t},y_{t})=\mu_{t}^{m}(x_{t}) trivially, so assume tQTt\not\in Q_{T}. Let (x,y)=argmin(x,y)St:πt(xt)=yxtx(x^{\prime},y^{\prime})=\operatorname*{arg\,min}_{(x,y)\in S_{t}:\pi_{t}(x_{t})=y}||x_{t}-x||. Since tQTt\not\in Q_{T}, we must have xtxε1/n||x_{t}-x^{\prime}||\leq\varepsilon^{1/n}.

We have y=πm(x)y^{\prime}=\pi^{m}(x^{\prime}) by construction of StS_{t} and πt(xt)=y\pi_{t}(x_{t})=y^{\prime} by construction of yy^{\prime}. Combining these with the local generalization assumption, we get

μt(xt,yt)=\displaystyle\mu_{t}(x_{t},y_{t})= μt(xt,πt(xt))\displaystyle\ \mu_{t}(x_{t},\pi_{t}(x_{t}))
=\displaystyle= μt(xt,πm(x))\displaystyle\ \mu_{t}(x_{t},\pi^{m}(x^{\prime}))
\displaystyle\geq μtm(xt)Lxtx\displaystyle\ \mu_{t}^{m}(x_{t})-L||x_{t}-x^{\prime}||
\displaystyle\geq μtm(xt)Lε1/n\displaystyle\ \mu_{t}^{m}(x_{t})-L\varepsilon^{1/n}

as required. ∎
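
The distance check at the heart of this argument is easy to sketch in code. The following is a simplified illustration of Algorithm 1's query rule as used in the proof (it ignores the independent hedgeQuery coin and treats an empty S as an automatic query); S, proposed, and query_mentor are our placeholder names, and inputs are coordinate tuples.

import math

def act_or_query(x_t, proposed, S, eps, n, query_mentor):
    """Take the proposed action only if some stored (input, mentor action) pair
    with that same action lies within eps**(1/n) of x_t; otherwise query."""
    threshold = eps ** (1.0 / n)
    near = [x for (x, y) in S
            if y == proposed and math.dist(x_t, x) <= threshold]
    if near:
        # Local generalization: the payoff of `proposed` at x_t is within
        # L * eps**(1/n) of the mentor's payoff (Lemma B.2).
        return proposed
    y_star = query_mentor(x_t)     # ask for help and remember the answer
    S.append((x_t, y_star))
    return y_star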

Lemma B.3.

Under the conditions of Theorem 5.1, Algorithm 1 satisfies

𝔼[RT×]\displaystyle\operatorname*{\mathbb{E}}\left[R_{T}^{\times}\right]\in O(dLσμ0mTε1+1/nlog(T+1/ε))\displaystyle\ O\left(\frac{dL}{\sigma\mu_{0}^{m}}T\varepsilon^{1+1/n}\log(T+1/\varepsilon)\right)
𝔼[RT+]\displaystyle\operatorname*{\mathbb{E}}\left[R_{T}^{+}\right]\in O(dLσTε1+1/nlog(T+1/ε))\displaystyle\ O\left(\frac{dL}{\sigma}T\varepsilon^{1+1/n}\log(T+1/\varepsilon)\right)
Proof.

We first claim that yt=πm(xt)y_{t}=\pi^{m}(x_{t}) for all tMTt\not\in M_{T}. If tQTt\in Q_{T}, the claim is immediate. If not, we have yt=πt(xt)y_{t}=\pi_{t}(x_{t}) by the definition of the algorithm and πt(xt)=πm(xt)\pi_{t}(x_{t})=\pi^{m}(x_{t}) by the definition of tMTt\not\in M_{T}. Thus μt(xt,yt)=μtm(xt)\mu_{t}(x_{t},y_{t})=\mu_{t}^{m}(x_{t}) for tMTt\not\in M_{T}. For tMTt\in M_{T}, Lemma B.2 implies that μtm(xt)μt(xt,yt)Lε1/n\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t})\leq L\varepsilon^{1/n}, so

RT+=\displaystyle R_{T}^{+}= tMT(μtm(xt)μt(xt,yt))\displaystyle\ \sum_{t\in M_{T}}(\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t}))
\displaystyle\leq tMTLε1/n\displaystyle\ \sum_{t\in M_{T}}L\varepsilon^{1/n}
=\displaystyle= |MT|Lε1/n\displaystyle\ |M_{T}|L\varepsilon^{1/n} (1)

Since ε(\mfracμ0m2L)n\varepsilon\leq\left(\mfrac{\mu_{0}^{m}}{2L}\right)^{n} by assumption, we have Lε1/nμ0m/2L\varepsilon^{1/n}\leq\mu_{0}^{m}/2 and thus μt(xt,yt)μtm(xt)Lε1/nμ0mμ0m/2=μ0m/2>0\mu_{t}(x_{t},y_{t})\geq\mu_{t}^{m}(x_{t})-L\varepsilon^{1/n}\geq\mu_{0}^{m}-\mu_{0}^{m}/2=\mu_{0}^{m}/2>0 for all t[T]t\in[T]. Then by Lemma A.1,

RT×RT+μ0m/22|MT|Lε1/nμ0m\displaystyle R_{T}^{\times}\leq\frac{R_{T}^{+}}{\mu_{0}^{m}/2}\leq\frac{2|M_{T}|L\varepsilon^{1/n}}{\mu_{0}^{m}} (2)

Taking the expectation and applying Lemma B.1 to Equations 1 and 2 completes the proof. ∎

Definition B.1.

Let (K,||||)(K,||\cdot||) be a normed vector space and let δ>0\delta>0. Then a multiset SKS\subseteq K is a δ\delta-packing of KK if for all a,bSa,b\in S, ab>δ||a-b||>\delta. The δ\delta-packing number of KK, denoted (K,||||,δ)\mathcal{M}(K,||\cdot||,\delta), is the maximum cardinality of any δ\delta-packing of KK.

We only consider the Euclidean norm, so we simply write \mathcal{M}(K,||\cdot||,\delta)=\mathcal{M}(K,\delta).

Lemma B.4 (Theorem 14.2 in (Wu, 2020)).

If KnK\subset\mathbb{R}^{n} is convex, bounded, and contains a ball with radius δ>0\delta>0, then

(K,δ)3nvol(K)δnvol(B)\mathcal{M}(K,\delta)\leq\frac{3^{n}\operatorname*{vol}(K)}{\delta^{n}\operatorname*{vol}(B)}

where BB is a unit ball.

Lemma B.5 (Jung’s Theorem (Jung, 1901)).

If SnS\subset\mathbb{R}^{n} is compact, then there exists a closed ball with radius at most diam(S)n2(n+1)\operatorname*{diam}(S)\sqrt{\frac{n}{2(n+1)}} containing SS.
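
A quick numerical illustration of the packing bound (ours): greedily build a \delta-packing of random points inside a ball of radius R in \mathbb{R}^{n} and compare its size against the Lemma B.4 bound for K a ball of radius \max(R,\delta), for which the unit-ball volumes cancel and the bound becomes (3\max(R,\delta)/\delta)^{n}.

import math
import random

def greedy_packing(points, delta):
    """Keep a point only if it is more than delta away from every kept point."""
    packing = []
    for p in points:
        if all(math.dist(p, q) > delta for q in packing):
            packing.append(p)
    return packing

n, R, delta = 3, 1.0, 0.25
rng = random.Random(0)
pts = []
while len(pts) < 2000:                 # rejection-sample points in the radius-R ball
    p = tuple(rng.uniform(-R, R) for _ in range(n))
    if math.dist(p, (0.0,) * n) <= R:
        pts.append(p)
bound = (3 * max(R, delta) / delta) ** n
print(len(greedy_packing(pts, delta)), "<=", bound)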

Lemma B.6.

Under the conditions of Theorem 5.1, Algorithm 1 satisfies

𝔼[|QT|]O(Tε+dσTεlog(T+1/ε)+𝔼[diam(𝒙)n]ε)\operatorname*{\mathbb{E}}[|Q_{T}|]\in O\left(\sqrt{\frac{T}{\varepsilon}}+\frac{d}{\sigma}T\varepsilon\log(T+1/\varepsilon)+\frac{\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]}{\varepsilon}\right)
Proof.

If t\in Q_{T}, then either \texttt{hedgeQuery}=\texttt{true}, S=\emptyset, or \min_{(x,y)\in S_{t}:\pi_{t}(x_{t})=y}||x_{t}-x||>\varepsilon^{1/n}. The expected number of time steps with \texttt{hedgeQuery}=\texttt{true} is pT=\sqrt{T/\varepsilon}. There is a single time step with S=\emptyset, since we always query on the first time step. It remains to bound the third case; let \hat{Q}=\{t\in Q_{T}:\min_{(x,y)\in S_{t}:\pi_{t}(x_{t})=y}||x_{t}-x||>\varepsilon^{1/n}\}. These three cases are not totally disjoint, so \operatorname*{\mathbb{E}}[|Q_{T}|]\leq\sqrt{T/\varepsilon}+1+\operatorname*{\mathbb{E}}[|\hat{Q}|] may not hold with equality, but this is fine since we only need an upper bound. We further subdivide \hat{Q} into \hat{Q}_{1}=\{t\in\hat{Q}:\pi_{t}(x_{t})\neq\pi^{m}(x_{t})\} and \hat{Q}_{2}=\{t\in\hat{Q}:\pi_{t}(x_{t})=\pi^{m}(x_{t})\}. Since \hat{Q}_{1}\subseteq M_{T}, Lemma B.1 implies that \operatorname*{\mathbb{E}}[|\hat{Q}_{1}|]\in O\left(\frac{d}{\sigma}T\varepsilon\log(T+1/\varepsilon)\right).

Next, fix a y𝒴y\in\mathcal{Y} and let Xy={x𝒙:πm(x)=y}X_{y}=\{x\in\boldsymbol{x}:\pi^{m}(x)=y\} be the multiset of observed inputs whose mentor action is yy. Also let X^2={xt:tQ^2}\hat{X}_{2}=\{x_{t}:t\in\hat{Q}_{2}\} be the multiset of inputs associated with time steps in Q^2\hat{Q}_{2}. Note that |X^2|=|Q^2||\hat{X}_{2}|=|\hat{Q}_{2}|, since X^2\hat{X}_{2} is a multiset. We claim that X^2Xy\hat{X}_{2}\cap X_{y} is a ε1/n\varepsilon^{1/n}-packing of XyX_{y}. Suppose instead that there exists x,xX^2Xyx,x^{\prime}\in\hat{X}_{2}\cap X_{y}, with xxε1/n||x-x^{\prime}||\leq\varepsilon^{1/n}. WLOG assume xx was queried after xx^{\prime} and let tt be the time step on which xx was queried. Then (x,πm(x))St(x^{\prime},\pi^{m}(x^{\prime}))\in S_{t}. Also, since x,xX^2x,x^{\prime}\in\hat{X}_{2} we have πt(xt)=πm(xt)=y=πm(x)\pi_{t}(x_{t})=\pi^{m}(x_{t})=y=\pi^{m}(x^{\prime}). Therefore

min(x′′,y′′)St:y′′=πt(xt)xtx′′xtxε1/n\min_{(x^{\prime\prime},y^{\prime\prime})\in S_{t}:y^{\prime\prime}=\pi_{t}(x_{t})}||x_{t}-x^{\prime\prime}||\leq||x_{t}-x^{\prime}||\leq\varepsilon^{1/n}

which contradicts tQ^t\in\hat{Q}. Thus X^2Xy\hat{X}_{2}\cap X_{y} is a ε1/n\varepsilon^{1/n}-packing of XyX_{y}.

By Lemma B.5, there exists a ball B1B_{1} of radius R:=diam(𝒙)n2(n+1)R:=\operatorname*{diam}(\boldsymbol{x})\sqrt{\frac{n}{2(n+1)}} which contains 𝒙\boldsymbol{x}. Let B2B_{2} be the ball with the same center as B1B_{1} but with radius max(R,ε1/n)\max(R,\varepsilon^{1/n}). Since Xy𝒙B1B2X_{y}\subset\boldsymbol{x}\subset B_{1}\subset B_{2} and X^2Xy\hat{X}_{2}\cap X_{y} is a ε1/n\varepsilon^{1/n}-packing of XyX_{y}, X^2Xy\hat{X}_{2}\cap X_{y} is also a ε1/n\varepsilon^{1/n}-packing of B2B_{2}. Also, B2B_{2} must contain a ball of radius ε1/n\varepsilon^{1/n}, so Lemma B.4 implies that

|X^2Xy|\displaystyle|\hat{X}_{2}\cap X_{y}|\leq (B2,ε1/n)\displaystyle\ \mathcal{M}(B_{2},\varepsilon^{1/n})
\displaystyle\leq 3nvol(B2)εvol(B)\displaystyle\ \frac{3^{n}\operatorname*{vol}(B_{2})}{\varepsilon\operatorname*{vol}(B)}
=\displaystyle= (max(R,ε1/n))n3nvol(B)εvol(B)\displaystyle\ \big{(}\max(R,\varepsilon^{1/n})\big{)}^{n}\,\frac{3^{n}\operatorname*{vol}(B)}{\varepsilon\operatorname*{vol}(B)}
=\displaystyle= max(diam(𝒙)n(n2(n+1))n/2,ε)3nε\displaystyle\ \max\left(\operatorname*{diam}(\boldsymbol{x})^{n}\left(\frac{n}{2(n+1)}\right)^{n/2},\>\varepsilon\right)\frac{3^{n}}{\varepsilon}
\displaystyle\leq O(diam(𝒙)nε+1)\displaystyle\ O\left(\frac{\operatorname*{diam}(\boldsymbol{x})^{n}}{\varepsilon}+1\right)

(The +1+1 is necessary for now since diam(𝒙)\operatorname*{diam}(\boldsymbol{x}) could theoretically be zero.) Since X^2{x1,,xT}y𝒴Xy\hat{X}_{2}\subseteq\{x_{1},\dots,x_{T}\}\subseteq\cup_{y\in\mathcal{Y}}X_{y}, we have |X^2|y𝒴|X^2Xy||\hat{X}_{2}|\leq\sum_{y\in\mathcal{Y}}|\hat{X}_{2}\cap X_{y}| by the union bound. Therefore

𝔼[|QT|]\displaystyle\operatorname*{\mathbb{E}}[|Q_{T}|]\leq Tε+1+𝔼[|Q^|]\displaystyle\ \sqrt{\frac{T}{\varepsilon}}+1+\operatorname*{\mathbb{E}}[|\hat{Q}|]
=\displaystyle= Tε+1+𝔼[|Q^1|]+𝔼[|Q^2|]\displaystyle\ \sqrt{\frac{T}{\varepsilon}}+1+\operatorname*{\mathbb{E}}[|\hat{Q}_{1}|]+\operatorname*{\mathbb{E}}[|\hat{Q}_{2}|]
=\displaystyle= Tε+1+𝔼[|Q^1|]+𝔼[|X^2|]\displaystyle\ \sqrt{\frac{T}{\varepsilon}}+1+\operatorname*{\mathbb{E}}[|\hat{Q}_{1}|]+\operatorname*{\mathbb{E}}[|\hat{X}_{2}|]
\displaystyle\leq Tε+1+𝔼[|Q^1|]+𝔼[y𝒴|X^2Xy|]\displaystyle\ \sqrt{\frac{T}{\varepsilon}}+1+\operatorname*{\mathbb{E}}[|\hat{Q}_{1}|]+\operatorname*{\mathbb{E}}\left[\sum_{y\in\mathcal{Y}}|\hat{X}_{2}\cap X_{y}|\right]
\displaystyle\ \leq Tε+1+O(dσTεlog(T+1/ε))+y𝒴O(𝔼[diam(𝒙)n]ε+1)\displaystyle\ \sqrt{\frac{T}{\varepsilon}}+1+O\left(\frac{d}{\sigma}T\varepsilon\log(T+1/\varepsilon)\right)+\sum_{y\in\mathcal{Y}}O\left(\frac{\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]}{\varepsilon}+1\right)
\displaystyle\leq Tε+1+O(dσTεlog(T+1/ε))+|𝒴|O(𝔼[diam(𝒙)n]ε+1)\displaystyle\ \sqrt{\frac{T}{\varepsilon}}+1+O\left(\frac{d}{\sigma}T\varepsilon\log(T+1/\varepsilon)\right)+|\mathcal{Y}|\cdot O\left(\frac{\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]}{\varepsilon}+1\right)
\displaystyle\leq O(Tε+dσTεlog(T+1/ε)+𝔼[diam(𝒙)n]ε)\displaystyle\ O\left(\sqrt{\frac{T}{\varepsilon}}+\frac{d}{\sigma}T\varepsilon\log(T+1/\varepsilon)+\frac{\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]}{\varepsilon}\right)

as required. ∎

Theorem 5.1 follows from Lemmas B.3 and B.6:

See 5.1

We then perform some arithmetic to obtain Theorem 5.2:

See 5.2

Proof.

We have

𝔼[RT×]\displaystyle\operatorname*{\mathbb{E}}\left[R_{T}^{\times}\right]\in O(dLσμ0mT12n2n+122n+1(logT+log(T2n2n+1)))\displaystyle\ O\left(\frac{dL}{\sigma\mu_{0}^{m}}T^{1-\frac{2n}{2n+1}-\frac{2}{2n+1}}\left(\log T+\log(T^{\frac{2n}{2n+1}})\right)\right)
=\displaystyle= O(dLσμ0mT12n+1logT)\displaystyle\ O\left(\frac{dL}{\sigma\mu_{0}^{m}}T^{\frac{-1}{2n+1}}\log T\right)

and similarly for 𝔼[RT+]\operatorname*{\mathbb{E}}[R_{T}^{+}]. For 𝔼[|QT|]\operatorname*{\mathbb{E}}[|Q_{T}|],

\operatorname*{\mathbb{E}}[|Q_{T}|]\in\displaystyle\ O\left(\sqrt{T^{1+\frac{2n}{2n+1}}}+\frac{d}{\sigma}T^{1-\frac{2n}{2n+1}}\left(\log T+\log(T^{\frac{2n}{2n+1}})\right)+T^{\frac{2n}{2n+1}}\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]\right)
=\displaystyle= O(T2n+0.52n+1+dσT12n+1logT+T2n2n+1𝔼[diam(𝒙)n])\displaystyle\ O\left(T^{\frac{2n+0.5}{2n+1}}+\frac{d}{\sigma}T^{\frac{1}{2n+1}}\log T+T^{\frac{2n}{2n+1}}\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]\right)
\displaystyle\leq O(T4n+14n+2(dσlogT+𝔼[diam(𝒙)n]))\displaystyle\ O\left(T^{\frac{4n+1}{4n+2}}\Big{(}\frac{d}{\sigma}\log T+\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]\Big{)}\right)
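
As a side check (outside the proof), the exponent bookkeeping above can be verified symbolically. A small sympy sketch, assuming \varepsilon=T^{\frac{-2n}{2n+1}} as in the theorem statement; each printed expression simplifies to 0.

import sympy as sp

n = sp.symbols('n', positive=True)
eps_exp = -2 * n / (2 * n + 1)               # exponent of T in eps = T**(-2n/(2n+1))

# Regret term T * eps**(1 + 1/n): exponent equals -1/(2n+1).
print(sp.simplify(1 + (1 + 1 / n) * eps_exp - (-1 / (2 * n + 1))))
# Query terms sqrt(T/eps), T*eps, 1/eps: exponents (4n+1)/(4n+2), 1/(2n+1), 2n/(2n+1).
print(sp.simplify((1 - eps_exp) / 2 - (4 * n + 1) / (4 * n + 2)))
print(sp.simplify(1 + eps_exp - 1 / (2 * n + 1)))
print(sp.simplify(-eps_exp - 2 * n / (2 * n + 1)))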

B.3 Adaptive adversaries

If sts_{t} is allowed to depend on the events of prior time steps, we say that the adversary is adaptive. In contrast, a non-adaptive or “oblivious” adversary must choose the entire input upfront. This distinction is not relevant for deterministic algorithms, since an adversary knows exactly how the algorithm will behave for any input. In other words, the adversary gains no new information during the execution of the algorithm. For randomized algorithms, an adaptive adversary can base the choice of sts_{t} on the results of randomization on previous time steps (but not on the current time step), while an oblivious adversary cannot.

In the standard online learning model, Hedge guarantees sublinear regret against both oblivious and adaptive adversaries (Chapter 5 of Slivkins et al. (2019) or Chapter 21 of Shalev-Shwartz & Ben-David (2014)). However, Russo et al. (2024) state their result only for oblivious adversaries. In order for our overall proof of Theorem 5.1 to hold for adaptive adversaries, Lemma 5.4 (Lemma 3.5 in Russo et al., 2024) must also hold for adaptive adversaries. In this section, we argue that the proof of their Lemma 3.5 goes through for adaptive adversaries as well. For the rest of Section B.3, lemma numbers refer to the numbering in Russo et al. (2024).

The importance of independent queries.

Recall from Section B.1 that Russo et al. (2024) allow two separate parameters k and \hat{k}, which we unify for simplicity. Recall also that Lemma 3.5 refers to the variant of Hedge which queries with probability p=\hat{k}/T=k/T independently on each time step (Algorithm 2). More precisely, on each time step t, the algorithm samples X_{t}\sim\text{Bernoulli}(p) and queries if X_{t}=1. The key idea is that X_{t} is independent of events on previous time steps. Thus even conditioning on the history up to time t, for any random variable Y_{t} we can write

𝔼[Yt]=(1p)𝔼[YtXt=0]+p𝔼[YtXt=1]\operatorname*{\mathbb{E}}[Y_{t}]=(1-p)\operatorname*{\mathbb{E}}[Y_{t}\mid X_{t}=0]+p\operatorname*{\mathbb{E}}[Y_{t}\mid X_{t}=1]

This insight immediately extends Observation 3.3 to adaptive adversaries (with the minor modification that queries are now issued independently with probability pp on each time step instead of issuing kk uniformly distributed queries). Specifically, using the notation from Russo et al. (2024) where iti_{t} is the action chosen at time tt, it0i_{t}^{0} is the action chosen at time tt if a query is not issued, and iti_{t}^{*} is the optimal action at time tt, we have

𝔼[t(it)]=\displaystyle\operatorname*{\mathbb{E}}[\ell_{t}(i_{t})]= (1p)𝔼[t(it0)]+p𝔼[t(it)]\displaystyle\ (1-p)\operatorname*{\mathbb{E}}[\ell_{t}(i_{t}^{0})]+p\operatorname*{\mathbb{E}}[\ell_{t}(i_{t}^{*})]
=\displaystyle= (1kT)𝔼[t(it0)]+kT𝔼[t(it)]\displaystyle\ \left(1-\frac{k}{T}\right)\operatorname*{\mathbb{E}}[\ell_{t}(i_{t}^{0})]+\frac{k}{T}\operatorname*{\mathbb{E}}[\ell_{t}(i_{t}^{*})]

The same logic applies to other statements like 𝔼[^t(i)Xt1,It1]=t(i)t(it)\operatorname*{\mathbb{E}}[\hat{\ell}_{t}(i)\mid X_{\leq t-1},I_{\leq t-1}]=\ell_{t}(i)-\ell_{t}(i_{t}^{*}) and immediately extends those statements to adaptive adversaries as well.
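
A small Monte Carlo check of this identity (ours; the specific loss values are invented for illustration), where the adversary adapts today's losses to yesterday's query coin but, crucially, not to today's:

import random

def check_independent_queries(p=0.3, T=200000, seed=0):
    """Verify E[l_t(i_t)] = (1-p) E[l_t(i_t^0)] + p E[l_t(i_t^*)] empirically
    when l_t depends on the previous query coins (an adaptive adversary)."""
    rng = random.Random(seed)
    realized_total = mixture_total = 0.0
    prev_coin = False
    for t in range(T):
        loss_no_query = 0.8 if prev_coin else 0.5    # l_t(i_t^0), chosen adaptively
        loss_best = 0.1                              # l_t(i_t^*)
        X_t = rng.random() < p                       # independent query coin
        realized_total += loss_best if X_t else loss_no_query
        mixture_total += (1 - p) * loss_no_query + p * loss_best
        prev_coin = X_t
    print(realized_total / T, "~", mixture_total / T)

check_independent_queries()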

Applying Observation 3.3.

The other tricky part of the proof is applying Observation 3.3 using a new loss function ^\hat{\ell} defined by ^t(i)=Tk^(t(i)t(it))𝟏(Xt=1)\hat{\ell}_{t}(i)=\frac{T}{\hat{k}}(\ell_{t}(i)-\ell_{t}(i_{t}^{*}))\mathbf{1}(X_{t}=1). To do so, we must argue that standard Hedge run on ^\hat{\ell} is the “counterpart without queries” of HedgeWithQueries. Specifically, both algorithms must have the same weight vectors on every time step, and the only difference should be that HedgeWithQueries takes the optimal action on each time step independently with probability pp (and otherwise behaves the same as standard Hedge). On time steps with Xt=0X_{t}=0, standard Hedge observes ^t(i)=0\hat{\ell}_{t}(i)=0 for all actions ii and thus makes no updates, and HedgeWithQueries makes no updates by definition. On time steps with Xt=1X_{t}=1, both algorithms perform the typical updates wt+1(i)=wt(i)exp(η(^t(i)^t(it)))w_{t+1}(i)=w_{t}(i)\cdot\exp(-\eta(\hat{\ell}_{t}(i)-\hat{\ell}_{t}(i_{t}^{*}))). Thus the weight vectors are the same for both algorithms on every time step. Furthermore, HedgeWithQueries takes the optimal action at time tt iff Xt=1X_{t}=1, which occurs independently with probability pp on each time step. Thus standard Hedge run on ^\hat{\ell} is the “counterpart without queries” of HedgeWithQueries. Note that since ^\hat{\ell} is itself a random variable, the law of iterated expectation is necessary to formalize this.

The rest of the proof.

The other elements of the proof of Lemma 3.5 are as follows:

  1. 1.

    Lemma 3.1, which analyzes the standard version of Hedge (i.e., no queries and losses are observed on every time step).

  2. 2.

    Applying Lemma 3.1 to ^\hat{\ell}.

  3. 3.

    Arithmetic and rearranging terms.

The proof of Lemma 3.1 relies on simple arithmetic properties of the Hedge weights. Regardless of the adversary’s behavior, ^\hat{\ell} is a well-defined loss function, so Lemma 3.1 can be applied. Step 3 clearly has no dependence on the type of adversary. Thus we conclude that Lemma 3.5 extends to adaptive adversaries.

Appendix C Generalizing Theorem 5.2 to many actions

We use the standard “one versus rest” reduction (see, e.g., Chapter 29 of Shalev-Shwartz & Ben-David, 2014). For each action y, we will learn a binary classifier which predicts whether action y is the mentor’s action. Formally, for each y\in\mathcal{Y}, define the policy class \Pi_{y}=\{\pi_{y}:\pi\in\Pi\text{ and }\pi_{y}(x)=\mathbf{1}(\pi(x)=y)\ \forall x\in\mathcal{X}\}. In words, for each policy \pi:\mathcal{X}\to\mathcal{Y} in \Pi, there exists a policy \pi_{y}:\mathcal{X}\to\{0,1\} in \Pi_{y} such that \pi_{y}(x)=\mathbf{1}(\pi(x)=y) for all x\in\mathcal{X}.

Algorithm 3 runs one copy of our binary-action algorithm (Algorithm 1) for each action y𝒴y\in\mathcal{Y}. At each time step tt, the copy for action yy returns an action btyb_{t}^{y}, with bty=1b_{t}^{y}=1 indicating a belief that y=πm(xt)y=\pi^{m}(x_{t}) and bty=0b_{t}^{y}=0 indicating a belief that yπm(xt)y\neq\pi^{m}(x_{t}). (Note that bty=y~b_{t}^{y}=\tilde{y} is also possible, indicating that the mentor was queried.)

The key idea is that if btyb_{t}^{y} is correct for each action yy, there will be exactly one yy such that bty=1b_{t}^{y}=1, and specifically it will be y=πm(xt)y=\pi^{m}(x_{t}). Thus we are guaranteed to take the mentor’s action on such time steps. The analysis for Theorem 5.2 (specifically, Lemma B.1) bounds the number of time steps when a given copy of Algorithm 1 is incorrect, so by the union bound, the number of time steps where any copy is incorrect is |𝒴||\mathcal{Y}| times that bound. That in turn bounds the number of time steps where Algorithm 3 takes an action other than the mentor’s. Similarly, the number of queries made by Algorithm 3 is at most |𝒴||\mathcal{Y}| times the bound from Theorem 5.2. The result is the following theorem:

 Inputs: T,ε>0,d,T\in\mathbb{N},\>\varepsilon\in\mathbb{R}_{>0},\>d\in\mathbb{N},\, policy class Π\Pi
for y𝒴y\in\mathcal{Y} do
  if Πy\Pi_{y} has VC dimension dd then
   \tilde{\Pi}_{y}\leftarrow any smooth \varepsilon-cover of \Pi_{y} of size at most (41/\varepsilon)^{d}
  else if \Pi_{y} has Littlestone dimension d then
   \tilde{\Pi}_{y}\leftarrow any adversarial cover of \Pi_{y} of size at most (eT/d)^{d}
  
for tt from 11 to TT do
  for y𝒴y\in\mathcal{Y} do
   btyb_{t}^{y}\leftarrow action at time tt from the copy of Algorithm 1 running on Πy\Pi_{y} (with the same T,ε,dT,\varepsilon,d)
  
  if b_{t}^{y}\neq\tilde{y}\ \forall y\in\mathcal{Y} and \exists y\in\mathcal{Y}:b_{t}^{y}=1 then
   Take any action yy with bty=1b_{t}^{y}=1
  else
   Take an arbitrary action in 𝒴\mathcal{Y}
  
Algorithm 3 extends Algorithm 1 to many actions.
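
A minimal Python sketch of the combination step in Algorithm 3 (ours, not the authors' code). Here copies[y] stands for an already-constructed binary learner for action y with a hypothetical propose(x_t) method returning 0, 1, or the sentinel QUERY (playing the role of \tilde{y}), and a hypothetical observe method through which the mentor's answer is passed back to every copy; query_mentor and fallback are placeholders.

QUERY = "query"   # sentinel playing the role of y-tilde

def one_vs_rest_step(x_t, copies, actions, query_mentor, fallback):
    """Combine the per-action binary predictions for one time step."""
    beliefs = {y: copies[y].propose(x_t) for y in actions}
    if QUERY in beliefs.values():
        y_star = query_mentor(x_t)               # some copy asked for help
        for y in actions:                        # each copy learns 1(pi^m(x_t) = y)
            copies[y].observe(x_t, int(y_star == y))
        return y_star                            # take the mentor's action
    positives = [y for y in actions if beliefs[y] == 1]
    if positives:
        return positives[0]   # exactly one positive whenever every copy is correct
    return fallback           # no copy claimed the input: arbitrary action
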
Theorem C.1.

Assume πmΠ\pi^{m}\in\Pi where either (1) Πy\Pi_{y} has finite VC dimension dd and 𝐱\boldsymbol{x} is σ\sigma-smooth or (2) Πy\Pi_{y} has finite Littlestone dimension dd for all y𝒴y\in\mathcal{Y}. Then for any TT\in\mathbb{N}, Algorithm 3 with ε=T2n2n+1\varepsilon=T^{\frac{-2n}{2n+1}} satisfies

𝔼[RT×]\displaystyle\operatorname*{\mathbb{E}}\left[R_{T}^{\times}\right]\in O(|𝒴|dLσμ0mT12n+1logT)\displaystyle\ O\left(\frac{|\mathcal{Y}|dL}{\sigma\mu_{0}^{m}}T^{\frac{-1}{2n+1}}\log T\right)
𝔼[RT+]\displaystyle\operatorname*{\mathbb{E}}\left[R_{T}^{+}\right]\in O(|𝒴|dLσT12n+1logT)\displaystyle\ O\left(\frac{|\mathcal{Y}|dL}{\sigma}T^{\frac{-1}{2n+1}}\log T\right)
𝔼[|QT|]\displaystyle\operatorname*{\mathbb{E}}[|Q_{T}|]\in O(|𝒴|T4n+14n+2(dσlogT+𝔼[diam(𝒙)n]))\displaystyle\ O\left(|\mathcal{Y}|T^{\frac{4n+1}{4n+2}}\left(\frac{d}{\sigma}\log T+\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]\right)\right)

We use the following terminology and notation in the proof of Theorem C.1:

  1. 1.

    We refer to the copy of Algorithm 1 running on Πy\Pi_{y} as “copy yy of Algorithm 1”.

  2. 2.

    Recall that StS_{t} refers to the value of SS at the start of time step tt in Algorithm 1. Let πty\pi_{t}^{y} and StyS_{t}^{y} refer to the values of πt\pi_{t} and StS_{t} for copy yy of Algorithm 1.

  3. 3.

    Let πmy:𝒳{0,1}\pi^{my}:\mathcal{X}\to\{0,1\} be the policy defined by πmy(x)=𝟏(πm(x)=y)\pi^{my}(x)=\mathbf{1}(\pi^{m}(x)=y). Note that querying the mentor tells the agent πm(xt)\pi^{m}(x_{t}), which allows the agent to compute πmy(xt)\pi^{my}(x_{t}): this is necessary when Algorithm 1 queries while running on some Πy\Pi_{y}.

  4. 4.

    Let MTy={t[T]:btyπmy(xt)}M_{T}^{y}=\{t\in[T]:b_{t}^{y}\neq\pi^{my}(x_{t})\} be the set of time steps where copy yy of Algorithm 1 does not correctly determine whether the mentor would take action yy, and let MT={t[T]:ytπm(xt)}M_{T}=\{t\in[T]:y_{t}\neq\pi^{m}(x_{t})\} be the set of time steps where the agent’s action does not match the mentor’s.

Lemma C.1.

We have |MT|y𝒴|MTy||M_{T}|\leq\sum_{y\in\mathcal{Y}}|M_{T}^{y}|.

Proof.

We claim that MTy𝒴MTyM_{T}\subseteq\cup_{y\in\mathcal{Y}}M_{T}^{y}. Suppose the opposite: then there exists tMTt\in M_{T} such that bty=πmy(xt)b_{t}^{y}=\pi^{my}(x_{t}) for all y𝒴y\in\mathcal{Y}. Since πm(xt)𝒴\pi^{m}(x_{t})\in\mathcal{Y}, there is exactly one y𝒴y\in\mathcal{Y} such that 𝟏(πm(xt)=y)=πmy(xt)=bty=1\mathbf{1}(\pi^{m}(x_{t})=y)=\pi^{my}(x_{t})=b_{t}^{y}=1. Specifically, this holds for y=πm(xt)y=\pi^{m}(x_{t}). But then Algorithm 3 takes an action yy with bty=1b_{t}^{y}=1, and the only such action is πm(xt)\pi^{m}(x_{t}), which contradicts tMTt\in M_{T}. Therefore MTy𝒴MTyM_{T}\subseteq\cup_{y\in\mathcal{Y}}M_{T}^{y}, and applying the union bound completes the proof. ∎

Lemma C.2.

For all t[T]t\in[T], μtm(xt)μt(xt,yt)Lε1/n\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t})\leq L\varepsilon^{1/n}.

Proof.

The argument is similar to the proof of Lemma B.2. If μtm(xt)μt(xt,yt)\mu_{t}^{m}(x_{t})\neq\mu_{t}(x_{t},y_{t}), then yt=yy_{t}=y for some y𝒴y\in\mathcal{Y} where bty=1b_{t}^{y}=1. Therefore copy yy of Algorithm 1 did not query at time tt and πty(xt)=1\pi_{t}^{y}(x_{t})=1. Let (x,y)=argmin(x,y)Sty:πty(xt)=yxtx(x^{\prime},y^{\prime})=\operatorname*{arg\,min}_{(x,y)\in S_{t}^{y}:\pi_{t}^{y}(x_{t})=y}||x_{t}-x||. Then xtxε1/n||x_{t}-x^{\prime}||\leq\varepsilon^{1/n} and y=πty(xt)=1y^{\prime}=\pi_{t}^{y}(x_{t})=1.

By construction of StyS_{t}^{y}, y=πmy(x)y^{\prime}=\pi^{my}(x^{\prime}) so πmy(x)=1\pi^{my}(x^{\prime})=1 which implies πm(x)=y\pi^{m}(x^{\prime})=y. Then by the local generalization assumption,

μt(xt,yt)=μt(xt,y)=μt(xt,πm(x))μtm(xt)Lxtxμtm(xt)Lε1/n\mu_{t}(x_{t},y_{t})=\mu_{t}(x_{t},y)=\mu_{t}(x_{t},\pi^{m}(x^{\prime}))\geq\mu_{t}^{m}(x_{t})-L||x_{t}-x^{\prime}||\geq\mu_{t}^{m}(x_{t})-L\varepsilon^{1/n}

as required. ∎

We now proceed to the proof of Theorem C.1.

Proof of Theorem C.1.

Theorem 5.2 implies that each copy of Algorithm 1 makes O(T4n+14n+2(dσlogT+𝔼[diam(𝒙)n]))O\big{(}T^{\frac{4n+1}{4n+2}}\big{(}\frac{d}{\sigma}\log T+\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]\big{)}\big{)} queries in expectation. Thus, by linearity of expectation, the expected number of queries made by Algorithm 3 is O(|𝒴|T4n+14n+2(dσlogT+𝔼[diam(𝒙)n]))O\big{(}|\mathcal{Y}|T^{\frac{4n+1}{4n+2}}\big{(}\frac{d}{\sigma}\log T+\operatorname*{\mathbb{E}}[\operatorname*{diam}(\boldsymbol{x})^{n}]\big{)}\big{)}. (This is an overestimate because the agent makes at most one query per time step, even if multiple copies request a query.) Similar to the proof of Lemma B.3, we have

RT+=\displaystyle R_{T}^{+}= tMT(μtm(xt)μt(xt,yt))\displaystyle\ \sum_{t\in M_{T}}(\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t})) (μtm(xt)=μt(xt,yt) for all tMT)\displaystyle(\text{$\mu_{t}^{m}(x_{t})=\mu_{t}(x_{t},y_{t})$ for all $t\not\in M_{T}$})
\displaystyle\leq tMTLε1/n\displaystyle\ \sum_{t\in M_{T}}L\varepsilon^{1/n} (Lemma C.2)\displaystyle(\text{\lx@cref{creftypecap~refnum}{lem:multi-lipschitz}})
=\displaystyle= |MT|Lε1/n\displaystyle\ |M_{T}|L\varepsilon^{1/n} (Simplifying the sum)\displaystyle(\text{Simplifying the sum})
\displaystyle\leq Lε1/ny𝒴|MTy|\displaystyle\ L\varepsilon^{1/n}\sum_{y\in\mathcal{Y}}|M_{T}^{y}| (Lemma C.1)\displaystyle(\text{\lx@cref{creftypecap~refnum}{lem:multi-errors}})

Since each copy satisfies the conditions of Lemma B.1, we get

𝔼[RT+]Lε1/ny𝒴O(dσTεlog(1/ε)logT)=O(|𝒴|Lε1/ndσTεlog(1/ε)logT)\displaystyle\operatorname*{\mathbb{E}}[R_{T}^{+}]\leq L\varepsilon^{1/n}\sum_{y\in\mathcal{Y}}O\left(\frac{d}{\sigma}T\varepsilon\log(1/\varepsilon)\log T\right)=O\left(|\mathcal{Y}|L\varepsilon^{1/n}\frac{d}{\sigma}T\varepsilon\log(1/\varepsilon)\log T\right)

Since limTε=0\lim_{T\to\infty}\varepsilon=0, there exists T0T_{0} such that Lε1/nμ0m/2L\varepsilon^{1/n}\leq\mu_{0}^{m}/2 for all TT0T\geq T_{0}. Then by Lemma A.1,

𝔼[RT×]𝔼[RT+]μ0m/2O(|𝒴|Lε1/ndσμ0mTεlog(1/ε)logT)\displaystyle\operatorname*{\mathbb{E}}[R_{T}^{\times}]\leq\frac{\operatorname*{\mathbb{E}}[R_{T}^{+}]}{\mu_{0}^{m}/2}\in O\left(|\mathcal{Y}|L\varepsilon^{1/n}\frac{d}{\sigma\mu_{0}^{m}}T\varepsilon\log(1/\varepsilon)\log T\right) (3)

Plugging ε=T2n2n+1\varepsilon=T^{\frac{-2n}{2n+1}} into the bounds above on 𝔼[RT+]\operatorname*{\mathbb{E}}[R_{T}^{+}] and 𝔼[RT×]\operatorname*{\mathbb{E}}[R_{T}^{\times}] yields the desired bounds (see the arithmetic in the proof of Theorem 5.2 in Appendix B). ∎

Appendix D There exist policy classes which are learnable in our setting but not in the standard online model

This section presents another algorithm with subconstant regret and sublinear queries, but under different assumptions. The primary takeaway is that the algorithm in this section can handle the class of thresholds on [0,1][0,1], which is known to have infinite Littlestone dimension and is therefore hard to learn in the standard online learning model (Example 21.4 in Shalev-Shwartz & Ben-David, 2014).

Specifically, we assume a 1D input space and allow the input sequence to be fully adversarially chosen. Instead of VC/Littlestone dimension, we consider the following notion of simplicity:

Definition D.1.

Given a mentor policy πm\pi^{m}, partition the input space 𝒳\mathcal{X} into intervals such that all inputs within each interval share the same mentor action. Let {X1,,Xk}\{X_{1},\dots,X_{k}\} be a partition that minimizes the number of intervals. We call each XjX_{j} a segment. Let S(πm)S(\pi^{m}) denote the number of segments in πm\pi^{m}.

Bounding the number of segments is similar conceptually to VC dimension in that it limits the ability of the policy class to realize arbitrary combinations of labels (i.e., mentor actions) on 𝒙\boldsymbol{x}. For example, if Π\Pi is the class of thresholds on [0,1][0,1], every πΠ\pi\in\Pi has at most two segments, and thus the positive result in this section will apply. This demonstrates the existence of policy classes which are learnable in our setting but not learnable in the standard online learning model, meaning that the two settings do not exactly coincide.
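As a concrete illustration of Definition D.1, the following Python sketch (ours) counts the segments of a policy that are visible on a finite grid of inputs; it only approximates S(π) in general, but for a threshold policy it returns at most two, matching the discussion above.

```python
def segments_on_grid(policy, grid):
    """Count the segments of `policy` visible on a sorted grid of inputs:
    one plus the number of adjacent label changes. This lower-bounds the
    true segment count S(policy) and matches it once the grid is fine
    enough relative to the segments."""
    labels = [policy(x) for x in sorted(grid)]
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)

# A threshold policy on [0, 1] has at most two segments:
threshold = lambda x: int(x >= 0.3)
grid = [i / 1000 for i in range(1001)]
assert segments_on_grid(threshold, grid) == 2
```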

Unlike our primary algorithm (Algorithm 1), this algorithm does require direct access to the input encoding. However, the point of this section is not to present a practical algorithm: it is simply to demonstrate that our setting and the standard online setting do not exactly coincide.

We prove the following regret bound. Like our previous results, this bound applies to both multiplicative and additive regret.

Theorem D.2.

For any 𝐱𝒳T\boldsymbol{x}\in\mathcal{X}^{T}, any πm\pi^{m} with S(πm)KS(\pi^{m})\leq K, and any function g:g:\mathbb{N}\to\mathbb{N} satisfying g(T)2L/μ0mg(T)\geq 2L/\mu_{0}^{m}, Algorithm 4 satisfies

RT×\displaystyle R_{T}^{\times}\leq 4LKTg(T)2μ0m\displaystyle\ \frac{4LKT}{g(T)^{2}\mu_{0}^{m}}
RT+\displaystyle R_{T}^{+}\leq 2LKTg(T)2\displaystyle\ \frac{2LKT}{g(T)^{2}}
|QT|\displaystyle|Q_{T}|\leq (diam(𝒙)+4)g(T)\displaystyle\ (\operatorname*{diam}(\boldsymbol{x})+4)g(T)

Choosing g(T)=Tcg(T)=T^{c} for any c(1/2,1)c\in(1/2,1) suffices for subconstant regret and sublinear queries:

Theorem D.3.

For any c(1/2,1)c\in(1/2,1), Algorithm 4 with g(T)=Tcg(T)=T^{c} satisfies

RT×\displaystyle R_{T}^{\times}\in O(LKT12cμ0m)\displaystyle\ O\left(\frac{LKT^{1-2c}}{\mu_{0}^{m}}\right)
RT+\displaystyle R_{T}^{+}\in O(LKT12c)\displaystyle\ O\left(LKT^{1-2c}\right)
|QT|\displaystyle|Q_{T}|\in O(Tc(diam(𝒙)+1))\displaystyle\ O(T^{c}(\operatorname*{diam}(\boldsymbol{x})+1))
1:function DBWRQ(T,g:)(T\in\mathbb{N},\>g:\mathbb{N}\to\mathbb{N}) 
2:  XQX_{Q}\leftarrow\emptyset  (previously queried inputs)
3:  π\pi\leftarrow\emptyset  (records πm(x)\pi^{m}(x) for each xXQx\in X_{Q})
4:  \mathcal{B}\leftarrow\emptyset  (The set of active buckets)
5:  for tt from 11 to TT do
6:   EvaluateInput(xt)(x_{t})
7:  end for
8:end function
9:function EvaluateInput(x𝒳)(x\in\mathcal{X}) 
10:  if xBx\not\in B for all BB\in\mathcal{B} then
11:   B[j1g(T),jg(T)]B\leftarrow\left[\frac{j-1}{g(T)},\frac{j}{g(T)}\right] for jj\in\mathbb{Z} such that xBx\in B
12:   {B}\mathcal{B}\leftarrow\mathcal{B}\cup\{B\}
13:   nB0n_{B}\leftarrow 0
14:   EvaluateInput(x)(x)
15:  else
16:   BB\leftarrow any bucket containing xx
17:   if XQB=X_{Q}\cap B=\emptyset then
18:    Query mentor, observe πm(x)\pi^{m}(x), and take action πm(x)\pi^{m}(x)
19:    π(x)πm(x)\pi(x)\leftarrow\pi^{m}(x)
20:    XQXQ{x}X_{Q}\leftarrow X_{Q}\cup\{x\}
21:    nBnB+1n_{B}\leftarrow n_{B}+1
22:   else if nB<T/g(T)n_{B}<T/g(T) then
23:    Let xXQBx^{\prime}\in X_{Q}\cap B
24:    Take action π(x)\pi(x^{\prime})
25:    nBnB+1n_{B}\leftarrow n_{B}+1
26:   else
27:    B=[a,b]B=[a,b]
28:    (B1,B2)([a,a+b2],[a+b2,b])(B_{1},B_{2})\leftarrow\Big{(}\left[a,\frac{a+b}{2}\right],\left[\frac{a+b}{2},b\right]\Big{)}
29:    (nB1,nB2)(0,0)(n_{B_{1}},n_{B_{2}})\leftarrow(0,0)
30:    {B1,B2}{B}\mathcal{B}\leftarrow\mathcal{B}\cup\{B_{1},B_{2}\}\setminus\{B\}
31:    EvaluateInput(x)(x)
32:   end if
33:  end if
34:end function
Algorithm 4 achieves subconstant regret when the mentor’s policy has a bounded number of segments.

D.1 Intuition behind the algorithm

We call our algorithm “Dynamic Bucketing With Routine Querying”, or DBWRQ (pronounced “DBWRQ”). The algorithm maintains a set of buckets which partition the observed portion of the input space. Each bucket’s length determines the maximum loss in payoff we will allow from that subset of the input space. As long as the bucket contains a query from a prior time step, local generalization allows us to bound μtm(xt)μt(xt,yt)\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t}) based on the length of the bucket containing xtx_{t}. We always query if the bucket does not contain a prior query; in this sense the querying is “routine”.

The granularity of the buckets is controlled by a function gg, with the initial buckets having length 1/g(T)1/g(T). Since we can expect one query per bucket, we need g(T)o(T)g(T)\in o(T) to ensure sublinear queries.

Regardless of the bucket length, the adversary can still place multiple segments in the same bucket BB. A single query only tells us the optimal action for one of those segments, so we risk a payoff as bad as μtm(xt)O(len(B))\mu_{t}^{m}(x_{t})-O(\operatorname*{len}(B)) whenever we choose not to query. We can endure a limited number of such payoffs, but if we never query again in that bucket, we may suffer Θ(T)\Theta(T) such payoffs. Letting μtm(xt)=1\mu_{t}^{m}(x_{t})=1 for simplicity, that would lead to t=1Tμt(xt,yt)(11O(g(T)))Θ(T)\prod_{t=1}^{T}\mu_{t}(x_{t},y_{t})\leq\big{(}1-\frac{1}{O(g(T))}\big{)}^{\Theta(T)}, which converges to 0 (i.e., guaranteed catastrophe) when g(T)o(T)g(T)\in o(T).
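As a quick numerical illustration (ours, not part of the analysis), take the extreme case where the payoff falls short by 1/g(T) on every one of the T steps, with g(T) = T^{3/4}:

```python
# The product of payoffs collapses when a 1/g(T) shortfall recurs on
# Theta(T) steps and g(T) is sublinear; here g(T) = T ** 0.75.
for T in [10**3, 10**4, 10**5]:
    g = T ** 0.75
    print(T, (1 - 1 / g) ** T)  # roughly 4e-3, 5e-5, 2e-8: tending to 0
```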

This failure mode suggests a natural countermeasure: if we start to suffer significant (potential) losses in the same bucket, then we should probably query there again. One way to structure these supplementary queries is by splitting the bucket in half when enough time steps have involved that bucket. It turns out that splitting after T/g(T)T/g(T) time steps is a sweet spot.
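For concreteness, the following is a minimal Python sketch of this bucketing logic (our paraphrase of Algorithm 4, not the authors' code); `mentor` is a hypothetical oracle for the mentor policy and `g` is the granularity function.

```python
import math

def dbwrq(xs, T, g, mentor):
    """Sketch of Dynamic Bucketing With Routine Querying. `buckets` maps an
    interval (a, b) to its visit count n_B; `queried` maps each previously
    queried input to the mentor's action. Returns the action taken at each
    time step."""
    buckets, queried, actions = {}, {}, []
    for x in xs:
        while True:
            B = next((b for b in buckets if b[0] <= x <= b[1]), None)
            if B is None:                          # open a depth-0 bucket
                j = math.floor(x * g(T))
                B = (j / g(T), (j + 1) / g(T))
                buckets[B] = 0
            inside = [xq for xq in queried if B[0] <= xq <= B[1]]
            if not inside:                         # routine query
                queried[x] = mentor(x)
                buckets[B] += 1
                actions.append(queried[x])
                break
            if buckets[B] < T / g(T):              # reuse a nearby answer
                buckets[B] += 1
                actions.append(queried[inside[0]])
                break
            a, b = B                               # split the bucket in half
            mid = (a + b) / 2
            del buckets[B]
            buckets[(a, mid)], buckets[(mid, b)] = 0, 0
            # loop again and re-evaluate x in the smaller child bucket
    return actions
```

For example, calling `dbwrq(xs, T, g=lambda T: T ** 0.75, mentor=mentor)` corresponds to the choice g(T) = T^c with c = 3/4 from Theorem D.3.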

D.2 Proof notation

We will use the following notation throughout the proof of Theorem D.2:

  1. 1.

    Let MT={t[T]:μt(xt,yt)<μtm(xt)}M_{T}=\{t\in[T]:\mu_{t}(x_{t},y_{t})<\mu_{t}^{m}(x_{t})\} be the set of time steps with a suboptimal payoff.

  2. 2.

    Let BtB_{t} be the bucket that is used on time step tt (as defined on line 16 of Algorithm 4).

  3. 3.

    Let d(B)d(B) be the depth of bucket BB.

    1. (a)

      Buckets created on line 11 are depth 0.

    2. (b)

      We refer to B1,B2B_{1},B_{2} created on line 28 as the children of the bucket BB defined on line 16.

    3. (c)

      If BB^{\prime} is the child of BB, d(B)=d(B)+1d(B^{\prime})=d(B)+1.

    4. (d)

      Note that len(B)=1g(T)2d(B)\operatorname*{len}(B)=\frac{1}{g(T)2^{d(B)}}.

  4. 4.

    Viewing the set of buckets as a binary tree defined by the “child” relation, we use the terms “ancestor” and “descendant” in accordance with their standard tree definitions.

  5. 5.

    Let V={B:tMT s.t. Bt=B}\mathcal{B}_{V}=\{B:\exists t\in M_{T}\text{ s.t. }B_{t}=B\} be the set of buckets that ever produced a suboptimal payoff.

  6. 6.

    Let V={BV:no descendant of B is in V}\mathcal{B}_{V}^{\prime}=\{B\in\mathcal{B}_{V}:\text{no descendant of $B$ is in $\mathcal{B}_{V}$}\}.

D.3 Proof roadmap

The proof proceeds in the following steps:

  1. 1.

    Bound the total number of buckets and therefore the total number of queries (Lemma D.1).

  2. 2.

    Bound the suboptimality on a single time step based on the bucket length and LL (Lemma D.2).

  3. 3.

    Bound the sum of bucket lengths on time steps where we make a mistake (Lemma D.4), with Lemma D.3 as an intermediate step. This captures the “total amount of suboptimality”.

  4. 4.

    Lemma D.5 uses Lemma D.2 and Lemma D.4 to bound the regret.

  5. 5.

    Theorem D.2 directly follows from Lemmas D.1 and D.5.

D.4 Proof

Lemma D.1.

Algorithm 4 performs at most (diam(𝐱)+4)g(T)(\operatorname*{diam}(\boldsymbol{x})+4)g(T) queries.

Proof.

Algorithm 4 performs at most one query per bucket, so the total number of queries is bounded by the total number of buckets. There are two ways to create a bucket: from scratch (line 11), or by splitting an existing bucket (line 28).

Since depth 0 buckets overlap only at their boundaries, and each depth 0 bucket has length 1/g(T)1/g(T), at most g(T)maxt,t[T]|xtxt|=g(T)diam(𝒙)g(T)\max_{t,t^{\prime}\in[T]}|x_{t}-x_{t^{\prime}}|=g(T)\operatorname*{diam}(\boldsymbol{x}) depth 0 buckets are subsets of the interval [mint[T]xt,maxt[T]xt][\min_{t\in[T]}x_{t},\max_{t\in[T]}x_{t}]. At most two depth 0 buckets are not subsets of that interval (one at each end), so the total number of depth 0 buckets is at most g(T)diam(𝒙)+2g(T)\operatorname*{diam}(\boldsymbol{x})+2.

We split a bucket BB when nBn_{B} reaches T/g(T)T/g(T), which creates two new buckets. Since each time step increments nBn_{B} for a single bucket BB, and there are a total of TT time steps, the total number of buckets created via splitting is at most 2TT/g(T)=2g(T)\frac{2T}{T/g(T)}=2g(T). Therefore the total number of buckets ever in existence is (diam(𝒙)+2)g(T)+2(diam(𝒙)+4)g(T)(\operatorname*{diam}(\boldsymbol{x})+2)g(T)+2\leq(\operatorname*{diam}(\boldsymbol{x})+4)g(T), so Algorithm 4 performs at most (diam(𝒙)+4)g(T)(\operatorname*{diam}(\boldsymbol{x})+4)g(T) queries. ∎

Lemma D.2.

For each t[T]t\in[T], μt(xt,yt)μtm(xt)Llen(Bt)\mu_{t}(x_{t},y_{t})\geq\mu_{t}^{m}(x_{t})-L\operatorname*{len}(B_{t}).

Proof.

If we query at time tt, then μt(xt,yt)=μtm(xt)\mu_{t}(x_{t},y_{t})=\mu_{t}^{m}(x_{t}). Thus assume we do not query at time tt: then there exists xBtx^{\prime}\in B_{t} (as defined on line 23 of Algorithm 4) such that yt=π(x)=πm(x)y_{t}=\pi(x^{\prime})=\pi^{m}(x^{\prime}). Since xtx_{t} and xx^{\prime} are both in BtB_{t}, |xtx|len(Bt)|x_{t}-x^{\prime}|\leq\operatorname*{len}(B_{t}). Then by local generalization, μt(xt,yt)=μt(xt,πm(x))μtm(xt)Lxtxμtm(xt)Llen(Bt)\mu_{t}(x_{t},y_{t})=\mu_{t}(x_{t},\pi^{m}(x^{\prime}))\geq\mu_{t}^{m}(x_{t})-L||x_{t}-x^{\prime}||\geq\mu_{t}^{m}(x_{t})-L\operatorname*{len}(B_{t}). ∎

Lemma D.3.

If πm\pi^{m} has at most KK segments, |V|K|\mathcal{B}_{V}^{\prime}|\leq K.

Proof.

Consider any BVB\in\mathcal{B}_{V}^{\prime}. Since VV\mathcal{B}_{V}^{\prime}\subseteq\mathcal{B}_{V}, there exists tMTt\in M_{T} such that Bt=BB_{t}=B, and hence xtBx_{t}\in B. The agent cannot have queried at time tt (otherwise μt(xt,yt)=μtm(xt)\mu_{t}(x_{t},y_{t})=\mu_{t}^{m}(x_{t})), so there exists xBx^{\prime}\in B (as defined in Algorithm 4) such that yt=π(x)=πm(x)y_{t}=\pi(x^{\prime})=\pi^{m}(x^{\prime}). Since tMTt\in M_{T}, we have πm(xt)yt=πm(x)\pi^{m}(x_{t})\neq y_{t}=\pi^{m}(x^{\prime}). Thus xtx_{t} and xx^{\prime} are in different segments, but are both in BB. Therefore any BVB\in\mathcal{B}_{V}^{\prime} must intersect at least two segments. Since BB is an interval, if it intersects two segments, it must intersect two adjacent segments XjX_{j} and Xj+1X_{j+1}. Furthermore, BB must contain an open neighborhood centered on the boundary between XjX_{j} and Xj+1X_{j+1}.

Now consider some BVB^{\prime}\in\mathcal{B}_{V}^{\prime} with BBB\neq B^{\prime}. We have |BB|1|B\cap B^{\prime}|\leq 1: otherwise one must be a descendant of the other, which contradicts the definition of V\mathcal{B}_{V}^{\prime}. Suppose BB^{\prime} also intersects both XjX_{j} and Xj+1X_{j+1}: since BB^{\prime} is also an interval, BB^{\prime} must also contain an open neighborhood centered on the boundary between those two segments. But then |BB|>1|B\cap B^{\prime}|>1, which is a contradiction.

Therefore, for any pair of adjacent segments XjX_{j} and Xj+1X_{j+1}, there is at most one bucket in V\mathcal{B}_{V}^{\prime} which contains an open neighborhood around their boundary. Since there are at most K1K-1 pairs of adjacent segments, we have |V|K1K|\mathcal{B}_{V}^{\prime}|\leq K-1\leq K.∎

Lemma D.4.

We have tMTlen(Bt)2KTg(T)2\sum\limits_{t\in M_{T}}\operatorname*{len}(B_{t})\leq\frac{2KT}{g(T)^{2}}.

Proof.

For every tMTt\in M_{T}, we have Bt=BB_{t}=B for some BVB\in\mathcal{B}_{V}, so

tMTlen(Bt)=BVtMT:B=Btlen(Bt)\sum_{t\in M_{T}}\operatorname*{len}(B_{t})=\sum_{B\in\mathcal{B}_{V}}\ \sum_{t\in M_{T}:B=B_{t}}\operatorname*{len}(B_{t})

Next, observe that every BVVB\in\mathcal{B}_{V}\setminus\mathcal{B}_{V}^{\prime} must have a descendant in V\mathcal{B}_{V}^{\prime}: otherwise we would have BVB\in\mathcal{B}_{V}^{\prime}. Let 𝒜(B)\mathcal{A}(B) denote the set of ancestors of BB, plus BB itself. Then we can write

tMTlen(Bt)\displaystyle\sum_{t\in M_{T}}\operatorname*{len}(B_{t})\leq BVB𝒜(B)tMT:B=Btlen(Bt)\displaystyle\ \sum_{B^{\prime}\in\mathcal{B}_{V}^{\prime}}\ \sum_{B\in\mathcal{A}(B^{\prime})}\ \sum_{t\in M_{T}:B=B_{t}}\operatorname*{len}(B_{t})
=\displaystyle= BVB𝒜(B)|{tMT:B=Bt}|len(B)\displaystyle\ \sum_{B^{\prime}\in\mathcal{B}_{V}^{\prime}}\ \sum_{B\in\mathcal{A}(B^{\prime})}\ |\{t\in M_{T}:B=B_{t}\}|\cdot\operatorname*{len}(B)

For any bucket BB, the number of time steps tt with B=BtB=B_{t} is at most T/g(T)T/g(T). Also recall that len(B)=1g(T)2d(B)\operatorname*{len}(B)=\frac{1}{g(T)2^{d(B)}}, so

B𝒜(B)|{tMT:B=Bt}|g(T)2d(B)\displaystyle\sum_{B\in\mathcal{A}(B^{\prime})}\frac{|\{t\in M_{T}:B=B_{t}\}|}{g(T)2^{d(B)}}\leq Tg(T)2B𝒜(B)12d(B)\displaystyle\ \frac{T}{g(T)^{2}}\sum_{B\in\mathcal{A}(B^{\prime})}\ \frac{1}{2^{d(B)}}
=\displaystyle= Tg(T)2d=0d(B)12d\displaystyle\ \frac{T}{g(T)^{2}}\sum_{d=0}^{d(B^{\prime})}\frac{1}{2^{d}}
\displaystyle\leq Tg(T)2d=012d\displaystyle\ \frac{T}{g(T)^{2}}\sum_{d=0}^{\infty}\frac{1}{2^{d}}
=\displaystyle= 2Tg(T)2\displaystyle\ \frac{2T}{g(T)^{2}}

Then by Lemma D.3,

tMTlen(Bt)BV2Tg(T)2=2T|V|g(T)22KTg(T)2\sum_{t\in M_{T}}\operatorname*{len}(B_{t})\leq\sum_{B^{\prime}\in\mathcal{B}_{V}^{\prime}}\frac{2T}{g(T)^{2}}=\frac{2T|\mathcal{B}_{V}^{\prime}|}{g(T)^{2}}\leq\frac{2KT}{g(T)^{2}}

as claimed. ∎

Lemma D.5.

Under the conditions of Theorem D.2, Algorithm 4 satisfies

RT×\displaystyle R_{T}^{\times}\leq 4LKTμ0mg(T)2\displaystyle\ \frac{4LKT}{\mu_{0}^{m}g(T)^{2}}
RT+\displaystyle R_{T}^{+}\leq 2LKTg(T)2\displaystyle\ \frac{2LKT}{g(T)^{2}}
Proof.

We have

RT+\displaystyle R_{T}^{+}\leq tMT(μtm(xt)μt(xt,yt))\displaystyle\ \sum_{t\in M_{T}}(\mu_{t}^{m}(x_{t})-\mu_{t}(x_{t},y_{t})) (μt(xt,yt)μtm(xt) for tMT)\displaystyle(\text{$\mu_{t}(x_{t},y_{t})\geq\mu_{t}^{m}(x_{t})$ for $t\not\in M_{T}$})
\displaystyle\leq tMTLlen(Bt)\displaystyle\ \sum_{t\in M_{T}}L\operatorname*{len}(B_{t}) (Lemma D.2)\displaystyle(\text{\lx@cref{creftypecap~refnum}{lem:lipschitz-1d}})
\displaystyle\leq 2LKTg(T)2\displaystyle\ \frac{2LKT}{g(T)^{2}} (Lemma D.4)\displaystyle(\text{\lx@cref{creftypecap~refnum}{lem:pos-1d-bucket-lengths}})

Since g(T)2L/μ0mg(T)\geq 2L/\mu_{0}^{m} and every bucket length is at most 1g(T)\frac{1}{g(T)},

μt(xt,yt)\displaystyle\mu_{t}(x_{t},y_{t})\geq μtm(xt)Llen(Bt)\displaystyle\ \mu_{t}^{m}(x_{t})-L\operatorname*{len}(B_{t})
\displaystyle\geq μ0mLg(T)\displaystyle\ \mu_{0}^{m}-\frac{L}{g(T)}
\displaystyle\geq μ0mμ0m/2\displaystyle\ \mu_{0}^{m}-\mu_{0}^{m}/2
\displaystyle\geq μ0m/2\displaystyle\ \mu_{0}^{m}/2
>\displaystyle> 0\displaystyle\ 0

Invoking Lemma A.1 completes the proof:

RT×2RT+μ0m4LKTg(T)2μ0mR_{T}^{\times}\leq\frac{2R_{T}^{+}}{\mu_{0}^{m}}\leq\frac{4LKT}{g(T)^{2}\mu_{0}^{m}} ∎

Theorem D.2 follows from Lemma D.1 and Lemma D.5.

Appendix E Properties of local generalization

Proposition E.1 states that Lipschitz continuity implies local generalization when the mentor is optimal.

Proposition E.1.

Assume that μ\mu satisfies Lipschitz continuity: for all x,x𝒳x,x^{\prime}\in\mathcal{X} and y𝒴y\in\mathcal{Y}, |μ(x,y)μ(x,y)|Lxx|\mu(x,y)-\mu(x^{\prime},y)|\leq L||x-x^{\prime}||. Also assume that μ(x,πm(x))=maxy𝒴μ(x,y)\mu(x,\pi^{m}(x))=\max_{y\in\mathcal{Y}}\mu(x,y) for all x𝒳x\in\mathcal{X}. Then μ\mu satisfies local generalization with constant 2L2L.

Proof.

For any x,x𝒳x,x^{\prime}\in\mathcal{X}, we have

μ(x,πm(x))\displaystyle\mu(x,\pi^{m}(x^{\prime}))\geq μ(x,πm(x))Lxx\displaystyle\ \mu(x^{\prime},\pi^{m}(x^{\prime}))-L||x-x^{\prime}|| (Lipschitz continuity of μ\mu)
\displaystyle\geq μ(x,πm(x))Lxx\displaystyle\ \mu(x^{\prime},\pi^{m}(x))-L||x-x^{\prime}|| (πm\pi^{m} is optimal for xx^{\prime})
\displaystyle\geq μ(x,πm(x))2Lxx\displaystyle\ \mu(x,\pi^{m}(x))-2L||x-x^{\prime}|| (Lipschitz continuity of μ\mu again)
=\displaystyle= μm(x)2Lxx\displaystyle\ \mu^{m}(x)-2L||x-x^{\prime}|| (Definition of μm(x)\mu^{m}(x))

Since πm\pi^{m} is optimal for xx, we have

μm(x)+2Lxxμm(x)μ(x,πm(x))\mu^{m}(x)+2L||x-x^{\prime}||\geq\mu^{m}(x)\geq\mu(x,\pi^{m}(x^{\prime}))

So 2Lxxμ(x,πm(x))μm(x)2Lxx-2L||x-x^{\prime}||\leq\mu(x,\pi^{m}(x^{\prime}))-\mu^{m}(x)\leq 2L||x-x^{\prime}|| and therefore |μ(x,πm(x))μm(x)|2Lxx|\mu(x,\pi^{m}(x^{\prime}))-\mu^{m}(x)|\leq 2L||x-x^{\prime}||.∎
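As a quick numerical sanity check (ours, using a toy payoff function that is not from the paper), the 2L constant can be verified on a grid:

```python
# Toy check of Proposition E.1: mu(x, 0) = 1 - x/2 and mu(x, 1) = 0.5 + x/2
# are L-Lipschitz in x with L = 0.5, and the mentor plays the argmax action.
L = 0.5
mu = lambda x, y: 1 - x / 2 if y == 0 else 0.5 + x / 2
mentor = lambda x: max((0, 1), key=lambda y: mu(x, y))
mu_m = lambda x: mu(x, mentor(x))

grid = [i / 200 for i in range(201)]
worst = max(abs(mu(x, mentor(xp)) - mu_m(x)) - 2 * L * abs(x - xp)
            for x in grid for xp in grid)
assert worst <= 1e-9  # local generalization holds with constant 2L
```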

Theorem E.2 shows that avoiding catastrophe is impossible without local generalization, even when 𝒙\boldsymbol{x} is σ\sigma-smooth and Π\Pi has finite VC dimension. The first insight is that without local generalization, we can define μ(x,y)=𝟏(y=πm(x))\mu(x,y)=\mathbf{1}(y=\pi^{m}(x)) so that a single mistake causes t=1Tμ(xt,yt)=0\prod_{t=1}^{T}\mu(x_{t},y_{t})=0. To lower bound Pr[t=1Tμ(xt,yt)=0]\Pr\big{[}\prod_{t=1}^{T}\mu(x_{t},y_{t})=0\big{]}, we use a similar approach to the proof of Theorem 4.1: divide 𝒳=[0,1]\mathcal{X}=[0,1] into f(T)f(T) independent sections with |QT|<<f(T)<<T|Q_{T}|<<f(T)<<T, so that the agent can only query a small fraction of these sections. However, the proof of Theorem E.2 is a bit easier, since we only need the agent to make a single mistake.

Note that Theorem E.2 as stated only provides a bound on RT×R_{T}^{\times}. A similar bound can be obtained for RT+R_{T}^{+}, but it is more tedious and we do not believe it would add much to the paper.

Theorem E.2.

Let 𝒳=[0,1]\mathcal{X}=[0,1] and 𝒴={0,1}\mathcal{Y}=\{0,1\}. Let each input be sampled i.i.d. from the uniform distribution on 𝒳\mathcal{X} and define the mentor policy class as the set of intervals within 𝒳\mathcal{X}, i.e., Π={π:a,b[0,1] s.t. π(x)=𝟏(x[a,b])x𝒳}\Pi=\{\pi:\exists a,b\in[0,1]\text{ s.t. }\pi(x)=\mathbf{1}(x\in[a,b])\ \forall x\in\mathcal{X}\}. Then without the local generalization assumption, any algorithm with sublinear queries satisfies limTsup𝛍,πm𝔼[RT×]=\lim_{T\to\infty}\sup_{\boldsymbol{\mu},\pi^{m}}\operatorname*{\mathbb{E}}[R_{T}^{\times}]=\infty.

Proof.

Part 1: Setup. Consider any algorithm which makes sublinear worst-case queries: then there exists g:g:\mathbb{N}\to\mathbb{N} where sup𝝁,πm𝔼[|QT|]g(T)\sup_{\boldsymbol{\mu},\pi^{m}}\operatorname*{\mathbb{E}}[|Q_{T}|]\leq g(T) and g(T)o(T)g(T)\in o(T). WLOG assume g(T)1g(T)\geq 1 for all TT; if not, redefine g(T)g(T) to be max(g(T),1)\max(g(T),1).

Define f(T):=g(T)Tf(T):=\lceil\sqrt{g(T)T}\rceil; by Lemma A.2, g(T)o(f(T))g(T)\in o(f(T)) and f(T)o(T)f(T)\in o(T). Divide 𝒳\mathcal{X} into f(T)f(T) equally sized sections X1,,Xf(T)X_{1},\dots,X_{f(T)} in exactly the same way as in Section 4.2; see also Figure 1. Assume that each xtx_{t} is in exactly one section: this assumption holds with probability 1, so it does not affect the regret.

We use the probabilistic method: sample a segment jm[f(T)]j^{m}\in[f(T)] uniformly at random, define πm\pi^{m} by πm(x)=𝟏(xXjm)\pi^{m}(x)=\mathbf{1}(x\in X_{j^{m}}), and define μ\mu by μ(x,y)=𝟏(y=πm(x))\mu(x,y)=\mathbf{1}(y=\pi^{m}(x)). In words, the mentor takes action 1 iff the input is in section jmj^{m}, and the agent receives payoff 1 if its action matches the mentor’s and zero otherwise. Since any choice of jmj^{m} defines a valid μ\mu and πm\pi^{m},

sup𝝁,πm𝔼𝒙,𝒚[RT×(𝒙,𝒚,𝝁,πm)]𝔼jm𝔼𝒙,𝒚[RT×(𝒙,𝒚,(μ,,μ),πm)]\sup_{\boldsymbol{\mu},\pi^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ [R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},\boldsymbol{\mu},\pi^{m})]\geq\operatorname*{\mathbb{E}}_{j^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ [R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},(\mu,\dots,\mu),\pi^{m})]

Let J¬Q={j[f(T)]:xtXjtQT}J_{\neg Q}=\{j\in[f(T)]:x_{t}\not\in X_{j}\ \forall t\in Q_{T}\} be the set of sections which are never queried. Let j1,,jkj_{1},\dots,j_{k} be the sequence of sections queried by the agent: then k=|QT|g(T)k=|Q_{T}|\leq g(T).
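To make the construction tangible, here is a small Python sketch (ours) of the hard instance from Part 1; `f_T` stands for the number of sections f(T).

```python
import random

def hard_instance(f_T, rng=random):
    """Sample the hard instance from the proof of Theorem E.2: [0, 1] is
    split into f_T equal sections, a hidden section j_m is drawn uniformly,
    the mentor plays 1 exactly on that section, and the payoff is 1 iff the
    agent matches the mentor, so a single mismatch zeroes the product."""
    j_m = rng.randrange(f_T)                  # hidden special section
    section = lambda x: min(int(x * f_T), f_T - 1)
    pi_m = lambda x: int(section(x) == j_m)   # mentor policy (an interval indicator)
    mu = lambda x, y: int(y == pi_m(x))       # payoff: match the mentor
    return pi_m, mu
```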

Part 2: The agent is unlikely to determine jmj^{m}. By the chain rule of probability,

Pr[jmJ¬Q]=Pr[jijmi]=i=1kPr[jijmjrjmr<i]\displaystyle\Pr[j^{m}\in J_{\neg Q}]=\Pr\big{[}j_{i}\neq j^{m}\ \forall i\big{]}=\prod_{i=1}^{k}\Pr\big{[}j_{i}\neq j^{m}\mid j_{r}\neq j^{m}\ \forall r<i\big{]}

Now fix ii and assume jrjmr<ij_{r}\neq j^{m}\ \forall r<i. Queries in sections other than jmj^{m} provide no information about the value of jmj^{m}, so jmj^{m} is uniformly distributed across the set of sections not yet queried, i.e., {j[f(T)]:jrjr<i}\{j\in[f(T)]:j_{r}\neq j\ \forall r<i\}. There are at least f(T)i+1f(T)-i+1 such sections, since there are i1i-1 prior queries at this point. Thus Pr[jijmjrjmr<i]f(T)if(T)i+1\Pr[j_{i}\neq j^{m}\mid j_{r}\neq j^{m}\ \forall r<i]\geq\frac{f(T)-i}{f(T)-i+1} (the inequality is because this probability could also be 1 if ji=jrj_{i}=j_{r} for some r<ir<i). Therefore

Pr[jmJ¬Q]\displaystyle\Pr\big{[}j^{m}\in J_{\neg Q}\big{]}\geq i=1kf(T)if(T)i+1\displaystyle\ \prod_{i=1}^{k}\frac{f(T)-i}{f(T)-i+1}
=\displaystyle= f(T)1f(T)f(T)2f(T)1f(T)k+1f(T)k+2f(T)kf(T)k+1\displaystyle\ \frac{f(T)-1}{f(T)}\cdot\frac{f(T)-2}{f(T)-1}\dots\frac{f(T)-k+1}{f(T)-k+2}\cdot\frac{f(T)-k}{f(T)-k+1}
=\displaystyle= f(T)kf(T)\displaystyle\ \frac{f(T)-k}{f(T)}
\displaystyle\geq 1g(T)f(T)\displaystyle\ 1-\frac{g(T)}{f(T)}

Part 3: If the agent fails to determine jmj^{m}, it is likely to make at least one mistake. For each jJ¬Qj\in J_{\neg Q}, let Vj={t[T]:xtXj}V_{j}=\{t\in[T]:x_{t}\in X_{j}\} be the set of time steps with inputs in section jj. By Lemma A.4, Pr[|Vjm|=0]exp(T16f(T))\Pr[|V_{j^{m}}|=0]\leq\exp\big{(}\frac{-T}{16f(T)}\big{)}. Then by the union bound, Pr[jmJ¬Q and |Vjm|>0]1g(T)f(T)exp(T16f(T))\Pr[j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0]\geq 1-\frac{g(T)}{f(T)}-\exp\big{(}\frac{-T}{16f(T)}\big{)}. For the rest of Part 3, assume jmJ¬Q and |Vjm|>0j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0.

Case 1: For all jJ¬Qj\in J_{\neg Q} and tVjt\in V_{j}, we have yt=0y_{t}=0. In particular, this holds for j=jmj=j^{m}, and we know there exists at least one tVjmt\in V_{j^{m}} since |Vjm|>0|V_{j^{m}}|>0. Then ytπm(xt)y_{t}\neq\pi^{m}(x_{t}), so μ(xt,yt)=0\mu(x_{t},y_{t})=0 and thus Pr[r=1Tμ(xr,yr)=0|jmJ¬Q and |Vjm|>0 and yt=0jJ¬Q,tVj]=1\Pr\left[\prod_{r=1}^{T}\mu(x_{r},y_{r})=0\>\Big{|}\>j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\text{ and }y_{t}=0\ \forall j\in J_{\neg Q},t\in V_{j}\right]=1.

Case 2: There exists jJ¬Qj\in J_{\neg Q} and tVjt\in V_{j} with yt=1y_{t}=1. Then μ(xt,yt)=0\mu(x_{t},y_{t})=0 unless j=jmj=j^{m}, so

Pr[r=1Tμ(xr,yr)=0|jmJ¬Q and |Vjm|>0 and jJ¬Q,tVj s.t. yt=1]\displaystyle\ \Pr\left[\prod_{r=1}^{T}\mu(x_{r},y_{r})=0\>\Big{|}\>j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\text{ and }\exists j\in J_{\neg Q},t\in V_{j}\text{ s.t. }y_{t}=1\right]
\displaystyle\geq Pr[μ(xt,yt)=0|jmJ¬Q and |Vjm|>0 and jJ¬Q,tVj s.t. yt=1]\displaystyle\ \Pr\Big{[}\mu(x_{t},y_{t})=0\ \Big{|}\ j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\text{ and }\exists j\in J_{\neg Q},t\in V_{j}\text{ s.t. }y_{t}=1\Big{]}
=\displaystyle= Pr[jjm|jmJ¬Q and |Vjm|>0 and jJ¬Q,tVj s.t. yt=1]\displaystyle\ \Pr\Big{[}j\neq j^{m}\ \Big{|}\ j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\text{ and }\exists j\in J_{\neg Q},t\in V_{j}\text{ s.t. }y_{t}=1\Big{]}

Since jmJ¬Qj^{m}\in J_{\neg Q}, the agent has no information about jmj^{m} other than that it is in J¬QJ_{\neg Q}. This means that jmj^{m} is uniformly distributed across J¬QJ_{\neg Q}, so

Pr[r=1Tμ(xr,yr)=0|jmJ¬Q and |Vjm|>0 and jJ¬Q,tVj s.t. yt=1]11|J¬Q|11f(T)g(T)\displaystyle\Pr\left[\prod_{r=1}^{T}\mu(x_{r},y_{r})=0\>\Big{|}\>j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\text{ and }\exists j\in J_{\neg Q},t\in V_{j}\text{ s.t. }y_{t}=1\right]\geq 1-\frac{1}{|J_{\neg Q}|}\geq 1-\frac{1}{f(T)-g(T)}

Combining Case 1 and Case 2, we get the overall bound of

Pr[t=1Tμ(xt,yt)=0|jmJ¬Q and |Vjm|>0]11f(T)g(T)\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})=0\>\Big{|}\>j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\right]\geq 1-\frac{1}{f(T)-g(T)}

and thus

Pr[t=1Tμ(xt,yt)=0]\displaystyle\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})=0\right]\geq Pr[t=1Tμ(xt,yt)=0 and jmJ¬Q and |Vjm|>0]\displaystyle\ \Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})=0\text{ and }j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\right]
=\displaystyle= Pr[t=1Tμ(xt,yt)=0|jmJ¬Q and |Vjm|>0]Pr[jmJ¬Q and |Vjm|>0]\displaystyle\ \Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})=0\>\Big{|}\>j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\right]\cdot\Pr\Big{[}j^{m}\in J_{\neg Q}\text{ and }|V_{j^{m}}|>0\Big{]}
\displaystyle\geq (11f(T)g(T))(1g(T)f(T)exp(T16f(T)))\displaystyle\ \left(1-\frac{1}{f(T)-g(T)}\right)\left(1-\frac{g(T)}{f(T)}-\exp\left(\frac{-T}{16f(T)}\right)\right)

For brevity, let α(T)\alpha(T) denote this final bound. Since g(T)o(f(T))g(T)\in o(f(T)) and f(T)o(T)f(T)\in o(T), we have

limTα(T)=limT(11f(T)g(T))(1g(T)f(T)exp(T16f(T)))=(10)(100)=1\displaystyle\lim_{T\to\infty}\alpha(T)=\lim_{T\to\infty}\left(1-\frac{1}{f(T)-g(T)}\right)\left(1-\frac{g(T)}{f(T)}-\exp\left(\frac{-T}{16f(T)}\right)\right)=(1-0)(1-0-0)=1

Part 4: Putting it all together. Consider any ε(0,1]\varepsilon\in(0,1]; to avoid dealing with infinite expectations, we will deal with Pr[t=1Tμ(xt,yt)ε]\Pr[\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq\varepsilon] instead of Pr[t=1Tμ(xt,yt)=0]\Pr[\prod_{t=1}^{T}\mu(x_{t},y_{t})=0]. Since t=1Tμ(xt,yt)1\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq 1 always, we have

𝔼jm𝔼𝒙,𝒚[logt=1Tμ(xt,yt)]=\displaystyle\operatorname*{\mathbb{E}}_{j^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \left[\log\prod_{t=1}^{T}\mu(x_{t},y_{t})\right]= 𝔼jm𝔼𝒙,𝒚[logt=1Tμ(xt,yt)|t=1Tμ(xt,yt)ε]Pr[t=1Tμ(xt,yt)ε]\displaystyle\ \operatorname*{\mathbb{E}}_{j^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \left[\log\prod_{t=1}^{T}\mu(x_{t},y_{t})\>\Big{|}\>\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq\varepsilon\right]\cdot\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq\varepsilon\right]
+\displaystyle+ 𝔼jm𝔼𝒙,𝒚[logt=1Tμ(xt,yt)|t=1Tμ(xt,yt)>ε]Pr[t=1Tμ(xt,yt)>ε]\displaystyle\ \operatorname*{\mathbb{E}}_{j^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \left[\log\prod_{t=1}^{T}\mu(x_{t},y_{t})\>\Big{|}\>\prod_{t=1}^{T}\mu(x_{t},y_{t})>\varepsilon\right]\cdot\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})>\varepsilon\right]
\displaystyle\leq logεPr[t=1Tμ(xt,yt)ε]+1(1Pr[t=1Tμ(xt,yt)ε])\displaystyle\log\varepsilon\cdot\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq\varepsilon\right]+1\cdot\left(1-\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq\varepsilon\right]\right)
\displaystyle\leq 1(1logε)Pr[t=1Tμ(xt,yt)ε]\displaystyle\ 1-(1-\log\varepsilon)\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq\varepsilon\right]

Since ε(0,1]\varepsilon\in(0,1], we have 1logε>01-\log\varepsilon>0. Also, Pr[t=1Tμ(xt,yt)ε]Pr[t=1Tμ(xt,yt)=0]\Pr[\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq\varepsilon]\geq\Pr[\prod_{t=1}^{T}\mu(x_{t},y_{t})=0], so

𝔼jm𝔼𝒙,𝒚[logt=1Tμ(xt,yt)]\displaystyle\operatorname*{\mathbb{E}}_{j^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ \left[\log\prod_{t=1}^{T}\mu(x_{t},y_{t})\right]\leq 1(1logε)Pr[t=1Tμ(xt,yt)ε]\displaystyle\ 1-(1-\log\varepsilon)\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})\leq\varepsilon\right]
\displaystyle\leq 1(1logε)Pr[t=1Tμ(xt,yt)=0]\displaystyle\ 1-(1-\log\varepsilon)\Pr\left[\prod_{t=1}^{T}\mu(x_{t},y_{t})=0\right]
\displaystyle\leq 1(1logε)α(T)\displaystyle\ 1-(1-\log\varepsilon)\alpha(T)

Since t=1Tμm(xt)=1\prod_{t=1}^{T}\mu^{m}(x_{t})=1 always, we have

sup𝝁,πm𝔼𝒙,𝒚[RT×(𝒙,𝒚,𝝁,πm)]\displaystyle\sup_{\boldsymbol{\mu},\pi^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ [R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},\boldsymbol{\mu},\pi^{m})]\geq 𝔼jm𝔼𝒙,𝒚[RT×(𝒙,𝒚,(μ,,μ),πm)]\displaystyle\ \operatorname*{\mathbb{E}}_{j^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ [R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},(\mu,\dots,\mu),\pi^{m})]
=\displaystyle= log1𝔼jm𝔼𝒙,𝒚[logt=1Tμ(xt,yt)]\displaystyle\ \log 1-\operatorname*{\mathbb{E}}_{j^{m}}\operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\left[\log\prod_{t=1}^{T}\mu(x_{t},y_{t})\right]
\displaystyle\geq 1+(1logε)α(T)\displaystyle\ -1+(1-\log\varepsilon)\alpha(T)

Therefore

limTsup𝝁,πm𝔼𝒙,𝒚[RT×(𝒙,𝒚,𝝁,πm)]\displaystyle\lim_{T\to\infty}\sup_{\boldsymbol{\mu},\pi^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ [R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},\boldsymbol{\mu},\pi^{m})]\geq 1+(1logε)limTα(T)\displaystyle\ -1+(1-\log\varepsilon)\lim_{T\to\infty}\alpha(T)
\displaystyle\geq 1+(1logε)\displaystyle\ -1+(1-\log\varepsilon)
\displaystyle\geq logε\displaystyle\ -\log\varepsilon

This holds for every ε(0,1]\varepsilon\in(0,1], which is only possible if limTsup𝝁,πm𝔼𝒙,𝒚[RT×(𝒙,𝒚,𝝁,πm)]=\lim_{T\to\infty}\sup_{\boldsymbol{\mu},\pi^{m}}\ \operatorname*{\mathbb{E}}_{\boldsymbol{x},\boldsymbol{y}}\ [R_{T}^{\times}(\boldsymbol{x},\boldsymbol{y},\boldsymbol{\mu},\pi^{m})]=\infty, as desired. ∎