
Sharp bounds on the price of bandit feedback for several models of mistake-bounded online learning

Raymond Feng rfeng2004@gmail.com    Jesse Geneson geneson@gmail.com    Andrew Lee leeandrew1029@gmail.com    Espen Slettnes espen@slett.net
Abstract

We determine sharp bounds on the price of bandit feedback for several variants of the mistake-bound model. The first part of the paper presents bounds on the $r$-input weak reinforcement model and the $r$-input delayed, ambiguous reinforcement model. In both models, the adversary gives $r$ inputs in each round and only indicates a correct answer if all $r$ guesses are correct. The only difference between the two models is that in the delayed, ambiguous model the learner must answer each input before receiving the next input of the round, while in the weak reinforcement model the learner receives all $r$ inputs at once.

In the second part of the paper, we introduce models for online learning with permutation patterns, in which a learner attempts to learn a permutation from a set of permutations by guessing statistics related to sub-permutations. For these permutation models, we prove sharp bounds on the price of bandit feedback.

1 Introduction

We investigate several variants of the mistake-bound model [11]. In the standard model [1, 12] (also called the standard strong reinforcement learning model [1]), a learner attempts to classify inputs (in the set $X$) with labels (in the set $Y$) based on a set $F$ of possible functions $f:X\to Y$. The learning proceeds in rounds: in each round the adversary gives the learner an input, and the learner must then guess the corresponding label. After each round, the adversary informs the learner of the correct answer (and therefore whether the learner was right or wrong). A variant of this model is the standard weak reinforcement learning model [1, 2], where the adversary only tells the learner yes if they were correct and no otherwise. This variant is also commonly called the bandit model [5, 6, 7, 10, 12].

For any learning scenario, we generally let $\operatorname{opt}_{\operatorname{scenario}}(F)$ denote the optimal worst-case number of mistakes that a learning algorithm can achieve [1]. For example, for the bandit model and the standard model, the optimal worst-case performances of learning algorithms are denoted $\operatorname{opt}_{\operatorname{weak}}(F)=\operatorname{opt}_{\operatorname{bandit}}(F)$ and $\operatorname{opt}_{\operatorname{std}}(F)=\operatorname{opt}_{\operatorname{strong}}(F)$ respectively. Some inequalities follow immediately from the definitions, such as $\operatorname{opt}_{\operatorname{strong}}(F)\leq\operatorname{opt}_{\operatorname{weak}}(F)$, which holds because the learner receives strictly more information in the strong reinforcement scenario.

In [1], Auer and Long defined the $r$-input delayed, ambiguous reinforcement model and compared it to a modified version of the standard weak reinforcement model (henceforth called the $r$-input weak reinforcement model). In the delayed, ambiguous reinforcement model, the learner receives a fixed number $r$ of inputs each round, and each input is given to the learner only after they have answered the previous one. In the modified weak reinforcement model, on the other hand, the learner receives all $r$ inputs of each round at once. In both models, at the end of every round of $r$ inputs, the adversary says yes if the learner answered all $r$ inputs correctly and no otherwise. To compare the two models, Auer and Long defined $\operatorname{CART}_{r}(F)$ (where $F$ is a set of functions $f:X\to Y$) to be the set of functions $f^{\prime}:X^{r}\to Y^{r}$ where each $f\in F$ has a corresponding $f^{\prime}\in\operatorname{CART}_{r}(F)$ such that for all $x_{1},x_{2},\dots,x_{r}\in X$, we have $f^{\prime}((x_{1},x_{2},\dots,x_{r}))=(f(x_{1}),f(x_{2}),\dots,f(x_{r}))$. They used $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ to represent the optimal worst-case performance in the modified weak reinforcement setting and $\operatorname{opt}_{\operatorname{amb},r}(F)$ to represent the optimal worst-case performance in the $r$-input delayed, ambiguous reinforcement model. Auer and Long proved in [1] that the two models are not equivalent for the learner by showing that there is some input set $X$ and set $F$ of functions from $X$ to $\{0,1\}$ such that $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{2}(F))<\operatorname{opt}_{\operatorname{amb},2}(F)$.

In Sections 2 and 3, we obtain sharp bounds on the maximum possible multiplicative gap between $\operatorname{opt}_{\operatorname{amb},r}(F)$ and $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$. In particular, we show that this gap can grow exponentially with $r$, generalizing Auer and Long’s result from [1] for $r=2$. Combined with a bound from [1], our new result shows that the maximum possible multiplicative gap between $\operatorname{opt}_{\operatorname{amb},r}(F)$ and $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ is $2^{r(1\pm o(1))}$. Moreover, we give sharp bounds on $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ (in Section 2) and $\operatorname{opt}_{\operatorname{amb},r}(F)$ (in Section 3) for all sets $F$ of non-decreasing functions from $X$ to $\{0,1\}$.

In a different paper, Long [12] determined sharp bounds comparing the standard model and the standard weak reinforcement model for multi-class functions. Long proved the upper bound $\operatorname{opt}_{\operatorname{bandit}}(F)\leq(1+o(1))(|Y|\ln|Y|)\operatorname{opt}_{\operatorname{std}}(F)$ and, for the lower bound, constructed infinitely many $F$ for which $\operatorname{opt}_{\operatorname{bandit}}(F)\geq(1-o(1))(|Y|\ln|Y|)\operatorname{opt}_{\operatorname{std}}(F)$. Geneson corrected an error in the proof of the lower bound [9].

In Section 4, we generalize this result to determine lower and upper bounds on the maximum factor gap between $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ and $\operatorname{opt}_{\operatorname{std}}(F)$ for multi-class functions using probabilistic methods and linear algebra techniques. The proof uses techniques previously used for experimental design [15, 14] and hashing, derandomization, and cryptography [4, 13]. We also determine lower and upper bounds on the maximum factor gap between $\operatorname{opt}_{\operatorname{amb},r}(F)$ and $\operatorname{opt}_{\operatorname{std}}(F)$. The bounds in this section are sharp up to a factor of $r(1+o(1))$.

In Section 5, we define several new models in which the learner tries to identify a permutation from a set of permutations of length $n$. In the order model, the adversary chooses $r$ inputs, and the learner attempts to guess the corresponding sub-permutation. In the comparison model, the adversary instead chooses $r$ pairs of inputs, and the learner attempts to guess for each pair whether or not it is an inversion. In the selection model, the adversary chooses $r$ inputs, and the learner attempts to guess which has the maximum evaluation. In the relative position model, the adversary chooses $r$ inputs and an additional distinguished input $x$, and the learner attempts to guess the relative position of $x$ in the sub-permutation corresponding to all $r+1$ of these inputs. Finally, in the delayed relative position model, the adversary instead gives the $r$ elements to be compared to $x$ one at a time. We first establish general upper bounds, and then we discuss adversary strategies for a few special families of permutations that resemble sorting algorithms.

Finally, in Section 6, we discuss some future directions based on the results in this paper.

2 Bounds on $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$

In this section, we establish upper and lower bounds on $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ for families $F$ of non-decreasing functions, and the two bounds are within a constant factor of each other. We show for such families $F$ that $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))=(1\pm o(1))r\ln(|F|)$.

Definition 1.

Without loss of generality, impose an ordering on the set $X$ and call its elements $\{1,2,\dots,|X|\}$. Let $F=\{f_{1},f_{2},\dots,f_{|F|}\}$ be a subset of the functions from $X$ to $\{0,1\}$. We say that $F$ is non-decreasing if every function in $F$ is non-decreasing.

In other words, there are integers $1\leq a_{1}<a_{2}<\dots<a_{|F|}\leq|X|+1$ which are the minimum numbers such that $f_{i}(a_{i})=1$ (with the convention that $a_{|F|}=|X|+1$ if $f_{|F|}$ is identically $0$) and satisfy the following property for each $1\leq i\leq|F|$:

  • If $x>a_{i}$, then $f_{i}(x)=1$.

  • If $x<a_{i}$, then $f_{i}(x)=0$.

Remark 2.

Let $F$ be non-decreasing. For a function $f\in F$ and any choice of $r$ inputs $x_{1}\leq x_{2}\leq\dots\leq x_{r}$ from the set $X$, there are at most $r+1$ possible values of the corresponding outputs $(f(x_{1}),\dots,f(x_{r}))$, namely $(0,0,\dots,0,0)$, $(0,0,\dots,0,1)$, $\ldots$, $(0,1,\dots,1,1)$, and $(1,1,\dots,1,1)$.
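To illustrate Remark 2, the following sketch (with a hypothetical small domain and the family of all $0/1$ threshold functions, which is not from the paper) enumerates the output tuples that can arise on sorted inputs:

```python
from itertools import combinations

# Hypothetical example: X = {1,...,10} and F = all non-decreasing 0/1
# threshold functions f_a(x) = [x >= a], including the identically-0 one.
X = range(1, 11)
F = [lambda x, a=a: int(x >= a) for a in range(1, 12)]

r = 3
patterns = set()
for xs in combinations(X, r):            # every sorted choice of r inputs
    for f in F:
        patterns.add(tuple(f(x) for x in xs))

# Remark 2: only the "staircase" tuples 0...01...1 can occur.
assert len(patterns) <= r + 1
```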

2.1 Bounds on $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$

The following theorem establishes an upper bound on $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ by exhibiting a learner strategy that achieves it.

Theorem 3.

For non-decreasing $F$, $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))<(r+1)\ln(|F|)$.

Proof.

The learner’s strategy: guess the output tuple that is consistent with the greatest number of remaining functions. By Remark 2, there are at most $r+1$ candidate answers.

Each time the adversary says no, if the learner previously knew that there were $T$ possible functions left, then the learner eliminates at least $\frac{T}{r+1}$ of them, since the guessed tuple was consistent with at least that many functions. Thus, each answer of no multiplies the number of remaining possibilities by a factor of at most $\frac{r}{r+1}$.

Then, the learner will make at most

\log_{\frac{r+1}{r}}(|F|)=\frac{\ln(|F|)}{\ln\left(1+\frac{1}{r}\right)}<\frac{\ln(|F|)}{\frac{1}{r+1}}=(r+1)\cdot\ln(|F|)

mistakes, as desired. ∎
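A minimal simulation of this learner (under an assumed encoding of a non-decreasing family as threshold functions on $\{1,\dots,n\}$, against one crude adversary rather than a worst-case one) confirms the mistake bound:

```python
import math
from collections import Counter

# Thresholds a = 1..n+1 encode the non-decreasing family; a = n+1 is the
# identically-zero function. f_a(x) = [x >= a].
n, r = 50, 3
live = set(range(1, n + 2))
F_size = len(live)

def outputs(a, xs):
    return tuple(int(x >= a) for x in xs)

mistakes = 0
while len(live) > 1:
    # crude adversary: probe r roughly evenly spaced thresholds, then say "no"
    srt = sorted(live)
    step = max(1, len(srt) // (r + 1))
    xs = [srt[min(i * step, len(srt) - 1)] for i in range(1, r + 1)]
    # learner: guess the tuple consistent with the most remaining functions
    guess = Counter(outputs(a, xs) for a in live).most_common(1)[0][0]
    survivors = {a for a in live if outputs(a, xs) != guess}
    if not survivors:            # guess matches every function: adversary must say yes
        break
    live = survivors             # "no": the guessed functions are eliminated
    mistakes += 1

assert mistakes < (r + 1) * math.log(F_size)   # Theorem 3 bound
```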

Next we establish a lower bound on $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ by exhibiting a strategy for the adversary.

Theorem 4.

For non-decreasing $F$, $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))\geq(1-o(1))r\ln(|F|)$ as $|F|\rightarrow\infty$.

Proof.

In each round, the adversary will say no. Each time they do, some functions fail to remain consistent with the answers given by the adversary. Let $F^{\prime}\subseteq F$ be the current set of functions that are consistent with the answers that the adversary has given so far. Furthermore, with $F^{\prime}=\{g_{1},g_{2},\dots,g_{|F^{\prime}|}\}$, define $1\leq b_{1}<b_{2}<\dots<b_{|F^{\prime}|}\leq|X|+1$ as the minimum numbers such that $g_{i}(b_{i})=1$ (with the convention that $b_{|F^{\prime}|}=|X|+1$ if $g_{|F^{\prime}|}$ is identically $0$).

In each round, the adversary will choose the inputs

x_{i}=b_{i\cdot\left\lceil\frac{|F^{\prime}|}{r+1}\right\rceil}

for $1\leq i\leq r$. Then, no matter what the learner says, the adversary says no. These inputs split $F^{\prime}$ into at most $r+1$ blocks of consecutive functions, each of size at most $\left\lceil\frac{|F^{\prime}|}{r+1}\right\rceil$, and any answer of the learner is consistent with at most one block, so the number of remaining consistent functions decreases by at most $\left\lceil\frac{|F^{\prime}|}{r+1}\right\rceil$ per round. Thus, the adversary can continue for at least

(1-o(1))\log_{\frac{r+1}{r}}(|F|)=(1-o(1))\cdot\frac{\ln(|F|)}{\ln(1+\frac{1}{r})}\geq(1-o(1))r\ln(|F|)

turns. Therefore, the adversary guarantees that they can say no at least $(1-o(1))r\ln(|F|)$ times, as desired. ∎
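The counting step at the heart of this proof can be checked directly for a small instance. The sketch below (an assumed threshold encoding of a non-decreasing family, with breakpoints $b_{1}<\dots<b_{|F'|}$) verifies that with inputs at the breakpoints $b_{i\cdot\lceil|F'|/(r+1)\rceil}$, every possible learner answer is consistent with at most $\lceil|F'|/(r+1)\rceil$ of the remaining functions:

```python
import math

# Breakpoints b_1 < ... < b_{|F'|}; the function with breakpoint a outputs
# [x >= a]. The adversary probes r near-evenly spaced breakpoints.
n, r = 100, 3
live = list(range(1, n + 2))
chunk = math.ceil(len(live) / (r + 1))
xs = [live[i * chunk - 1] for i in range(1, r + 1)]   # b_{i*chunk} (1-based)

worst = 0
for g in range(2 ** r):                  # every tuple the learner could answer
    guess = tuple((g >> i) & 1 for i in range(r))
    consistent = [a for a in live if tuple(int(x >= a) for x in xs) == guess]
    worst = max(worst, len(consistent))

# saying "no" therefore removes at most ceil(|F'|/(r+1)) functions per round
assert worst <= chunk
```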

Combining the bounds above, we obtain the following theorem for non-decreasing $F$:

Theorem 5.

For non-decreasing $F$, $\lim_{r\to\infty}\lim_{|F|\to\infty}\frac{\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))}{r\ln(|F|)}=1$.

3 Bounds on $\operatorname{opt}_{\operatorname{amb},r}(F)$

In this section, we show that $\operatorname{opt}_{\operatorname{amb},r}(F)=(1\pm o(1))2^{r}\ln(|F|)$ as $|F|,r\rightarrow\infty$ for non-decreasing families $F$. The upper bound in this section applies to all families of functions $F$, but the lower bound applies only to non-decreasing sets of functions $F$, as with the bounds established in Section 2.

We prove the following theorem, a general upper bound for $\operatorname{opt}_{\operatorname{amb},r}(F)$, using a learner strategy that does not make any assumptions about the set $F$. In particular, $F$ does not have to be non-decreasing.

Theorem 6.

For all $F$, $\operatorname{opt}_{\operatorname{amb},r}(F)<\min(2^{r}\ln(|F|),|F|)$.

Proof.

For each input that the adversary gives, the learner picks the answer that is consistent with the greatest number of functions among those consistent with all feedback from before the round started and with the learner’s earlier guesses in the current round.

Each time the adversary says no, if the learner knew that there were $T$ possible functions remaining before the round started, then they can guarantee that their answers for all $r$ inputs are consistent with at least $\frac{T}{2^{r}}$ of those functions. To see this, note that by induction, at least $\frac{T}{2^{k}}$ of the functions are consistent with the $k$ answers given so far, for each $1\leq k\leq r$. Therefore, each time the adversary says no, the number of remaining possible functions is multiplied by at most $\frac{2^{r}-1}{2^{r}}$. So, the learner makes at most

\log_{\frac{2^{r}}{2^{r}-1}}(|F|)=\frac{\ln(|F|)}{\ln\left(1+\frac{1}{2^{r}-1}\right)}<\frac{\ln(|F|)}{\frac{1}{2^{r}}}=2^{r}\cdot\ln(|F|)

mistakes with this strategy.

The learner’s strategy achieving $\operatorname{opt}_{\operatorname{amb},r}(F)\leq|F|-1$: each time the adversary says no, the learner can eliminate at least one function, and once $|F|-1$ functions have been eliminated, only the true function remains, so no more mistakes are made. ∎
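The induction step in this proof, that answering each input with the majority bit among the still-consistent functions preserves at least a $2^{-k}$ fraction after $k$ answers, can be checked exhaustively on a small hypothetical family (here all $0/1$ functions on a five-element domain):

```python
from itertools import product

funcs = list(product([0, 1], repeat=5))      # all functions {0,...,4} -> {0,1}
r = 3
T = len(funcs)

def play_round(inputs, funcs):
    # answer inputs one at a time with the majority bit among the functions
    # still consistent with this round's earlier answers
    cur = list(funcs)
    answers = []
    for x in inputs:
        ones = sum(f[x] for f in cur)
        b = int(2 * ones >= len(cur))        # majority bit among survivors
        cur = [f for f in cur if f[x] == b]
        answers.append(b)
    return answers, cur

for inputs in product(range(5), repeat=r):   # any adversary input sequence
    _, cur = play_round(inputs, funcs)
    assert len(cur) >= T / 2 ** r            # at least T / 2^r stay consistent
```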

For the following lower bound on $\operatorname{opt}_{\operatorname{amb},r}(F)$, we again assume that $F$ is non-decreasing.

Theorem 7.

For non-decreasing $F$, $\operatorname{opt}_{\operatorname{amb},r}(F)\geq(1-o(1))(2^{r}-1)\ln(|F|)$ as $|F|\rightarrow\infty$.

Proof.

The adversary will say no at the end of each round. For each round, the adversary chooses a series of input values $x_{i}$ based on the answers given by the learner. In each subround, the next input $x_{i}$ is determined as follows: let $S$ be the set of all functions that are consistent with the adversary’s answers from past rounds as well as all of the learner’s answers from the current round.

Since $S\subseteq F$, we can set $S=\{g_{1},g_{2},\dots,g_{|S|}\}$ and define $1\leq b_{1}<b_{2}<\dots<b_{|S|}\leq|X|+1$ as the minimum numbers such that $g_{i}(b_{i})=1$ (with the convention that $b_{|S|}=|X|+1$ if $g_{|S|}$ is identically $0$).

The adversary then chooses $x_{i}=b_{\left\lceil\frac{|S|}{2}\right\rceil}$ for the current subround. Whichever answer the learner gives, at most $\left\lceil\frac{|S|}{2}\right\rceil$ of the functions in $S$ remain consistent with all of the adversary’s previous answers as well as all of the learner’s answers in the current round. Thus, if $T$ functions were consistent with all of the adversary’s previous answers at the beginning of the current round, then at the end of the round, at most $\left\lceil\frac{T}{2^{r}}\right\rceil$ functions become inconsistent when the adversary says no. Here, we repeatedly use the fact that $\left\lceil\frac{\left\lceil x\right\rceil}{n}\right\rceil=\left\lceil\frac{x}{n}\right\rceil$ for all positive reals $x$ and positive integers $n$.

This means that the adversary can continue to say no for at least

(1-o(1))\log_{\frac{2^{r}}{2^{r}-1}}(|F|)=(1-o(1))\frac{\ln(|F|)}{\ln\left(1+\frac{1}{2^{r}-1}\right)}\geq(1-o(1))(2^{r}-1)\ln(|F|)

turns, as desired. ∎
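A quick check of the subround step, again under the hypothetical threshold encoding with breakpoints $b_{1}<\dots<b_{|S|}$: probing the median breakpoint $b_{\lceil|S|/2\rceil}$ guarantees that either answer of the learner leaves at most $\lceil|S|/2\rceil$ functions consistent with it.

```python
import math

# For each size |S|, probe x = b_{ceil(|S|/2)} and check that the two answer
# classes (functions outputting 1 at x, resp. 0) both have size <= ceil(|S|/2).
for size in range(2, 200):
    S = list(range(1, size + 1))             # breakpoints b_1 < ... < b_|S|
    x = S[math.ceil(size / 2) - 1]           # b_{ceil(|S|/2)} (1-based)
    ones = sum(1 for a in S if x >= a)       # functions answering 1 at x
    zeros = size - ones
    assert max(ones, zeros) <= math.ceil(size / 2)
```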

Combining the bounds above, we obtain the following theorem for non-decreasing $F$:

Theorem 8.

For non-decreasing $F$, $\lim_{r\to\infty}\lim_{|F|\to\infty}\frac{\operatorname{opt}_{\operatorname{amb},r}(F)}{2^{r}\ln(|F|)}=1$.

Theorem 5 and Theorem 8 imply that for non-decreasing families of functions $F$ with $r$ and $|F|$ sufficiently large, learners who are given all inputs at the beginning of each round do exponentially better in $r$ than their counterparts who receive the inputs one at a time in each round.

In [1], Auer and Long proved that $\operatorname{opt}_{\operatorname{amb},r}(F)\leq 2\ln(2r)\cdot 2^{r}\cdot\operatorname{opt}_{\operatorname{std}}(F)$. Since $\operatorname{opt}_{\operatorname{std}}(F)\leq\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ for all families of functions $F$, this implies the following result when combined with our Theorem 5 and Theorem 8.

Theorem 9.

The maximum possible value of $\frac{\operatorname{opt}_{\operatorname{amb},r}(F)}{\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))}$ over all families of functions $F$ with $|F|>1$ is $2^{r(1\pm o(1))}$.

4 Comparing $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ and $\operatorname{opt}_{\operatorname{amb},r}(F)$ to $\operatorname{opt}_{\operatorname{std}}(F)$ for multi-class functions

In [12], Long compared the standard and bandit models for families of multi-class functions and determined a bound on the maximum multiplicative gap between them. There was an error in Long’s proof of the lower bound, which Geneson fixed in [9]. In this section, we bound the maximum multiplicative gap between $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ and $\operatorname{opt}_{\operatorname{std}}(F)$ using similar methods, as well as the maximum multiplicative gap between $\operatorname{opt}_{\operatorname{amb},r}(F)$ and $\operatorname{opt}_{\operatorname{std}}(F)$. For each of the gaps, the proof of the lower bound employs techniques previously used for experimental design, hashing, derandomization, and cryptography [13, 4, 15, 14]. We also adapt the proof of the upper bound in [12] to show that our lower bounds are sharp up to a factor of $r(1+o(1))$.

In order to obtain the lower bounds, we prove the following three lemmas, which generalize results from [12] and [9]. For the rest of this section, we assume that $p$ is a prime number.

Lemma 10.

Fix $n\geq 2$, suppose that $z_{1},\dots,z_{r}\in\left\{0,\dots,p-1\right\}$, and let $u_{1},\dots,u_{r}$ each be chosen uniformly at random from $\left\{0,\dots,p-1\right\}^{n}$. For any $s\in\left\{1,\dots,p-1\right\}^{n}$, we have

\Pr(s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r)=\frac{1}{p^{r}}.
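Lemma 10 can be verified by brute force for small parameters. The sketch below (an illustrative check, not part of the proof) exhausts all $s$, $z$, and tuples $(u_{1},u_{2})$ for $p=3$, $n=2$, $r=2$:

```python
from itertools import product

# Brute-force verification of Lemma 10 for p=3, n=2, r=2: for every s with
# nonzero entries and every target vector z, exactly a 1/p^r fraction of the
# tuples (u_1,...,u_r) satisfy s.u_i = z_i (mod p) for all i.
p, n, r = 3, 2, 2
for s in product(range(1, p), repeat=n):
    for z in product(range(p), repeat=r):
        total = hits = 0
        for us in product(product(range(p), repeat=n), repeat=r):
            total += 1
            hits += all(sum(a * b for a, b in zip(s, u)) % p == zi
                        for u, zi in zip(us, z))
        assert hits * p ** r == total    # probability is exactly 1/p^r
```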

Proof.

Fix any coordinate $j$; since $s_{j}\in\{1,\dots,p-1\}$, it is invertible mod $p$. We have

\Pr(s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r)=
\Pr(s_{j}u_{i,j}=z_{i}-\sum_{k\neq j}s_{k}u_{i,k}\mod p\text{ for all }1\leq i\leq r)=
\Pr(u_{i,j}=(z_{i}-\sum_{k\neq j}s_{k}u_{i,k})s_{j}^{-1}\mod p\text{ for all }1\leq i\leq r)=
\frac{1}{p^{r}},

where the last equality holds because each $u_{i,j}$ is uniform on $\{0,\dots,p-1\}$ and independent of the $u_{i,k}$ with $k\neq j$. ∎

Lemma 11.

Fix $n\geq 2$, and let $u_{1},\dots,u_{r}$ each be chosen uniformly at random from $\left\{0,\dots,p-1\right\}^{n}$. For any $s,t\in\left\{1,\dots,p-1\right\}^{n}$ that are not multiples of each other mod $p$ and for any $z_{1},\dots,z_{r}\in\left\{0,\dots,p-1\right\}$, we have

\Pr(t\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r\mid s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r)=\frac{1}{p^{r}}.
Proof.

By Lemma 10 and the definition of conditional probability, we have

\Pr(t\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r\mid s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r)=
\frac{\Pr(t\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r\ \wedge\ s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r)}{\Pr(s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r)}=
p^{r}\Pr(t\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r\ \wedge\ s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r).

Moreover,

\Pr(t\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r\ \wedge\ s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r)=
\frac{|\left\{(u_{1},\dots,u_{r}):\ t\cdot u_{i}=z_{i}\mod p\ \wedge\ s\cdot u_{i}=z_{i}\mod p\ \wedge\ u_{i}\in\left\{0,\dots,p-1\right\}^{n}\text{ for all }1\leq i\leq r\right\}|}{p^{nr}}.

In order to calculate

|\left\{(u_{1},\dots,u_{r}):\ t\cdot u_{i}=z_{i}\mod p\ \wedge\ s\cdot u_{i}=z_{i}\mod p\ \wedge\ u_{i}\in\left\{0,\dots,p-1\right\}^{n}\text{ for all }1\leq i\leq r\right\}|,

we must find the number of solutions $(u_{1},\dots,u_{r})$ to the system of equations $t\cdot u_{i}=z_{i}\mod p$ and $s\cdot u_{i}=z_{i}\mod p$ for all $1\leq i\leq r$.

We form an augmented matrix $M$ with $2r$ rows and $rn+1$ columns from this system of equations. From left to right, the entries of row $i$ are $(i-1)n$ zeroes, then $s$, then $(r-i)n$ zeroes, then $z_{i}$, for each $1\leq i\leq r$. The entries of row $r+i$ are $(i-1)n$ zeroes, then $t$, then $(r-i)n$ zeroes, then $z_{i}$, for each $1\leq i\leq r$.

We row-reduce $M$. Since $s$ and $t$ are not multiples of each other mod $p$, rows $i$ and $r+i$ are linearly independent mod $p$ for each $1\leq i\leq r$, and rows corresponding to different values of $i$ involve disjoint blocks of variables, so $M$ has $2r$ pivot entries. Therefore the system of equations has $2r$ dependent variables and $(n-2)r$ independent variables. There are $p$ choices for each of the independent variables, and the dependent variables are determined by the values of the independent variables, so there are $p^{(n-2)r}$ solutions to the system of equations. Thus

\Pr(t\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r\mid s\cdot u_{i}=z_{i}\mod p\text{ for all }1\leq i\leq r)=p^{r}\cdot\frac{p^{(n-2)r}}{p^{nr}}=\frac{1}{p^{r}}.

∎
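As with Lemma 10, this conditional probability can be confirmed by brute force on tiny hypothetical parameters; the sketch below checks $p=3$, $n=2$, $r=1$:

```python
from itertools import product

# Brute-force check of Lemma 11 for p=3, n=2, r=1: conditioned on s.u = z
# (mod p), the event t.u = z (mod p) still has probability exactly 1/p
# whenever s and t are not multiples of each other mod p.
p, n = 3, 2

def dot(a, b):
    return sum(x * y for x, y in zip(a, b)) % p

def multiple(s, t):
    # is t = c*s (mod p) for some c in {1,...,p-1}?
    return any(tuple((c * x) % p for x in s) == t for c in range(1, p))

for s in product(range(1, p), repeat=n):
    for t in product(range(1, p), repeat=n):
        if multiple(s, t) or multiple(t, s):
            continue
        for z in range(p):
            cond = [u for u in product(range(p), repeat=n) if dot(s, u) == z]
            hits = sum(1 for u in cond if dot(t, u) == z)
            assert hits * p == len(cond)     # conditional probability is 1/p
```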

Lemma 12.

For any subset $S\subset\{1,\dots,p-1\}^{n}$, there exists $u=(u_{1},\dots,u_{r})$ with $u_{1},\dots,u_{r}\in\{0,\dots,p-1\}^{n}$ such that for all $z=(z_{1},\dots,z_{r})\in\{0,\dots,p-1\}^{r}$,

\left\lvert\{x\in S:x\cdot u_{i}\equiv z_{i}\pmod{p}\text{ for all }1\leq i\leq r\}\right\rvert\leq\frac{|S|}{p^{r}}+2\sqrt{|S|}.
Proof.

Suppose that $S$ is any subset of $\{1,\dots,p-1\}^{n}$, and let $u_{1},\dots,u_{r}$ each be chosen uniformly at random from $\{0,\dots,p-1\}^{n}$. For each $z\in\{0,\dots,p-1\}^{r}$, let $T_{z}$ be the set of $x\in S$ for which $x\cdot u_{i}=z_{i}$ for all $i$. By Lemma 10 and linearity of expectation, we have $\mathbb{E}(|T_{z}|)=\frac{|S|}{p^{r}}$ for all $z$.

Consider an arbitrary $z\in\{0,\dots,p-1\}^{r}$. For each $s\in S$, define the indicator random variable $X_{s,z}$ such that $X_{s,z}=1$ if $s\cdot u_{i}=z_{i}$ for all $1\leq i\leq r$, and $X_{s,z}=0$ otherwise. If $s,t\in S$ are not multiples of each other mod $p$, then $\mathrm{Cov}(X_{s,z},X_{t,z})=0$ by Lemmas 10 and 11. If $s$ and $t$ are multiples of each other with $s\neq t$, then $\mathrm{Cov}(X_{s,z},X_{t,z})=\mathbb{E}(X_{s,z}X_{t,z})-\mathbb{E}(X_{s,z})\mathbb{E}(X_{t,z})$.

If $z$ contains any nonzero $z_{i}$, then $\mathbb{E}(X_{s,z}X_{t,z})=0$, since $t=cs$ for some $c\neq 1\mod p$, and $s\cdot u_{i}=t\cdot u_{i}=z_{i}$ would force $(c-1)z_{i}=0\mod p$ for all $i$. This gives $\mathrm{Cov}(X_{s,z},X_{t,z})=-\frac{1}{p^{2r}}$. Thus,

\mathrm{Var}(|T_{z}|)=\mathrm{Var}\left(\sum_{s\in S}X_{s,z}\right)=\sum_{s\in S}\mathrm{Var}(X_{s,z})+\sum_{s\neq t}\mathrm{Cov}(X_{s,z},X_{t,z})
\leq\sum_{s\in S}\mathrm{Var}(X_{s,z})=|S|\left(\frac{1}{p^{r}}-\frac{1}{p^{2r}}\right)<\frac{|S|}{p^{r}}.

By Chebyshev’s inequality, $\mathbb{P}\left(|T_{z}|\geq\frac{|S|}{p^{r}}+2\sqrt{|S|}\right)\leq\frac{1}{4p^{r}}$.

Otherwise, $z$ is the all-zero vector, and $\mathbb{E}(X_{s,z}X_{t,z})=\frac{1}{p^{r}}$, giving $\mathrm{Cov}(X_{s,z},X_{t,z})=\frac{1}{p^{r}}-\frac{1}{p^{2r}}<\frac{1}{p^{r}}$. Note that there are at most $(p-2)|S|$ ordered pairs $(s,t)$ for which $s$ and $t$ are multiples of each other $\pmod{p}$ with $s\neq t$. Thus,

\mathrm{Var}(|T_{z}|)=\mathrm{Var}\left(\sum_{s\in S}X_{s,z}\right)=\sum_{s\in S}\mathrm{Var}(X_{s,z})+\sum_{s\neq t}\mathrm{Cov}(X_{s,z},X_{t,z})
\leq\sum_{s\in S}\mathrm{Var}(X_{s,z})+\frac{(p-2)|S|}{p^{r}}<\frac{|S|}{p^{r}}+\frac{(p-2)|S|}{p^{r}}<\frac{|S|}{p^{r-1}}.

By Chebyshev’s inequality, $\mathbb{P}\left(|T_{z}|\geq\frac{|S|}{p^{r}}+2\sqrt{|S|}\right)\leq\frac{1}{4p^{r-1}}$.

By the union bound,

\mathbb{P}\left(\forall z:|T_{z}|\leq\frac{|S|}{p^{r}}+2\sqrt{|S|}\right)\geq 1-\frac{p^{r}-1}{4p^{r}}-\frac{1}{4p^{r-1}}=\frac{3p^{r}-p+1}{4p^{r}}>\frac{1}{2}.

Thus, the conditions are satisfied with probability greater than $\frac{1}{2}$ when $u$ is chosen uniformly at random, so there must exist a $u$ satisfying the conditions. ∎
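Lemma 12 is non-trivial even for tiny parameters, since a bad choice such as $u=0$ violates the bound. The sketch below (illustrative, not part of the proof) brute-forces $p=5$, $n=2$, $r=1$ with $S=\{1,\dots,4\}^{2}$ and confirms that good $u$ exist while the zero vector fails:

```python
import math
from itertools import product

p, n, r = 5, 2, 1
S = list(product(range(1, p), repeat=n))     # S = {1,...,4}^2, |S| = 16
bound = len(S) / p ** r + 2 * math.sqrt(len(S))

def ok(u):
    # |T_z| = number of x in S with x.u = z (mod p), for each residue z
    counts = [0] * p
    for x in S:
        counts[sum(a * b for a, b in zip(x, u)) % p] += 1
    return max(counts) <= bound

good = [u for u in product(range(p), repeat=n) if ok(u)]
assert good                                  # some u satisfies the bound
assert not ok((0, 0))                        # the zero vector does not
```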

Next we prove the lower bound on the maximum possible multiplicative gap between $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ and $\operatorname{opt}_{\operatorname{std}}(F)$.

Theorem 13.

For all $M>2r$ and infinitely many $k$, there exists a set $F$ of functions from a set $X$ to a set $Y$ with $|Y|=k$ such that $\operatorname{opt}_{\operatorname{std}}(F)=M$ and

\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))\geq(1-o(1))\left(|Y|^{r}\ln|Y|\right)(\operatorname{opt}_{\operatorname{std}}(F)-2r).
Proof.

Fix $n\geq 3$ and $p\geq 5$. For all $a\in\{0,\dots,p-1\}^{n}$, we define $f_{a}:\{0,\dots,p-1\}^{n}\rightarrow\{0,\dots,p-1\}$ so that $f_{a}(x)=a\cdot x\pmod{p}$, and define $F_{L}(p,n)=\{f_{a}:a\in\{0,\dots,p-1\}^{n}\}$. It is known that $\operatorname{opt}_{\operatorname{std}}(F_{L}(p,n))=n$ for all primes $p$ and $n>0$ [17, 2, 3, 12].

We now determine a bound on $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F_{L}(p,n)))$. Let $S=\{1,\dots,p-1\}^{n}$, so $|S|=(p-1)^{n}$. Let $R_{1}=\{f_{a}:a\in S\}\subset F_{L}(p,n)$. In each round $t>1$, the adversary will create a list $R_{t}$ of members of $\{f_{a}:a\in S\}$ that are consistent with its previous answers. They will always answer no and choose $(x_{1},\dots,x_{r})$ to minimize

\max_{(y_{1},\dots,y_{r})}\left|R_{t}\cap\{f:f(x_{i})=y_{i}\text{ for all }1\leq i\leq r\}\right|.

By Lemma 12, we have

|R_{t+1}|\geq|R_{t}|-\frac{|R_{t}|}{p^{r}}-2\sqrt{|R_{t}|}\geq|R_{t}|-\frac{|R_{t}|}{p^{r}}-\frac{2|R_{t}|}{p^{r}\sqrt{\ln{p}}}=\left(1-\frac{1+\frac{2}{\sqrt{\ln{p}}}}{p^{r}}\right)|R_{t}|

as long as $|R_{t}|\geq p^{2r}\ln{p}$. Thus, we have $|R_{t}|\geq\left(1-\frac{1+\frac{2}{\sqrt{\ln{p}}}}{p^{r}}\right)^{t-1}(p-1)^{n}$. Therefore, whenever $\left(1-\frac{1+\frac{2}{\sqrt{\ln{p}}}}{p^{r}}\right)^{t-1}(p-1)^{n}\geq p^{2r}\ln{p}$, the adversary can guarantee $t$ wrong guesses. This holds for $t=(1-o(1))np^{r}\ln{p}$, which gives the desired result. ∎

Remark 14.

Since we have the trivial inequality $\operatorname{opt}_{\operatorname{amb},r}(F)\geq\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$, which holds because the learner has strictly more information in the weak reinforcement scenario, we also have $\operatorname{opt}_{\operatorname{amb},r}(F)\geq(1-o(1))\left(|Y|^{r}\ln|Y|\right)(\operatorname{opt}_{\operatorname{std}}(F)-2r)$ for the families $F=F_{L}(p,n)$.

Next we establish a similar upper bound relating $\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))$ and $\operatorname{opt}_{\operatorname{std}}(F)$. For this bound, we use the fact that for all sets $F$ of functions $f:X\to Y$, we have $\operatorname{opt}_{\operatorname{std}}(\operatorname{CART}_{r}(F))=\operatorname{opt}_{\operatorname{std}}(F)$. We also use the bound $\operatorname{opt}_{\operatorname{weak}}(F)\leq(1+o(1))(|Y|\ln|Y|)\operatorname{opt}_{\operatorname{std}}(F)$, which was proved in [12].

Theorem 15.

For any set $F$ of functions from some set $X$ to $Y=\{0,1,\dots,k-1\}$ and for any $r\geq 1$,

\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))\leq(1+o(1))\left(|Y|^{r}r\ln|Y|\right)\operatorname{opt}_{\operatorname{std}}(F).
Proof.

Substituting $\operatorname{CART}_{r}(F)$ for $F$ (and therefore $Y^{r}$ for $Y$) in the upper bound from [12] and using the fact that $\operatorname{opt}_{\operatorname{std}}(\operatorname{CART}_{r}(F))=\operatorname{opt}_{\operatorname{std}}(F)$, we get

\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))\leq(1+o(1))(|Y|^{r}\ln\left(|Y|^{r}\right))\operatorname{opt}_{\operatorname{std}}(\operatorname{CART}_{r}(F))=(1+o(1))(|Y|^{r}r\ln|Y|)\operatorname{opt}_{\operatorname{std}}(F).

∎

Remark 16.

Theorem 13 demonstrates that the upper bound in Theorem 15 is sharp up to a factor of r(1+o(1))r(1+o(1)).

Next we prove an upper bound for optamb,r(F)\operatorname{opt}_{\operatorname{amb,r}}(F) using a method from [12]. Like the last bound, this one is sharp up to a factor of r(1+o(1))r(1+o(1)).

Theorem 17.

For any set FF of functions from some set XX to Y={0,1,,k1}Y=\{0,1,\dots,k-1\} and for any r1r\geq 1,

optamb,r(F)(1+o(1))(|Y|rrln(|Y|))optstd(F).\operatorname{opt}_{\operatorname{amb,r}}(F)\leq(1+o(1))\left(|Y|^{r}r\ln{|Y|}\right)\operatorname{opt}_{\operatorname{std}}(F).
Proof.

Consider an algorithm BB for the rr-input delayed, ambiguous reinforcement model which uses a worst-case optimal algorithm AsA_{s} for the standard model and maintains copies of AsA_{s} which are given different inputs and answers. In each round, BB chooses a prediction by taking a weighted vote over the predictions of the copies.

Fix α=1krln(k)\alpha=\frac{1}{k^{r}\ln{k}}. Each copy of AsA_{s} gets a weight, where the current weight is αx\alpha^{x} if the copy has contributed to BB making xx mistakes in the earlier weighted votes. BB uses these weights to make a prediction for the current round by taking a weighted vote over the predictions of each copy for the outputs of all rr inputs. In case of ties, the winner is chosen uniformly at random among the predictions that tied for the highest weight.

At the beginning, BB starts with one copy of AsA_{s}. Whenever BB gets a wrong answer for the outputs of the rr inputs in the current round, every copy of AsA_{s} whose prediction did not win the weighted vote is rewound as if the round did not happen, so it forgets the inputs and its answer. Each copy that predicted the wrong answer which won the weighted vote is cloned into kr1k^{r}-1 copies, and each clone is given a different one of the kr1k^{r}-1 answers for the outputs of the rr inputs that differ from that wrong answer.

If WtW_{t} is the total weight of the copies of AsA_{s} before round tt, we must have Wtαoptstd(F)W_{t}\geq\alpha^{\operatorname{opt}_{\operatorname{std}}(F)} since one copy of AsA_{s} always gets the correct answers. Moreover, if BB makes a mistake in round tt, then copies of AsA_{s} with total weight at least Wtkr\frac{W_{t}}{k^{r}} are cloned to make kr1k^{r}-1 copies which each have weight α\alpha times their old weight. This implies that Wt+1(11kr)Wt+1kr(α(kr1)Wt)<(11kr)Wt+αWtW_{t+1}\leq(1-\frac{1}{k^{r}})W_{t}+\frac{1}{k^{r}}(\alpha(k^{r}-1)W_{t})<(1-\frac{1}{k^{r}})W_{t}+\alpha W_{t}.

Thus after BB has made xx mistakes, we have Wt<(11kr+α)x<e(1krα)xW_{t}<(1-\frac{1}{k^{r}}+\alpha)^{x}<e^{-(\frac{1}{k^{r}}-\alpha)x}. Therefore e(1krα)x>αoptstd(F)e^{-(\frac{1}{k^{r}}-\alpha)x}>\alpha^{\operatorname{opt}_{\operatorname{std}}(F)}, so x<ln(1α)optstd(F)1krαx<\frac{\ln(\frac{1}{\alpha})\operatorname{opt}_{\operatorname{std}}(F)}{\frac{1}{k^{r}}-\alpha}, which implies the desired bound. ∎
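As an illustrative numerical check (not part of the formal proof; the function names are ours), the following Python sketch evaluates the final inequality x < ln(1/alpha) opt_std(F) / (1/k^r - alpha) against the leading term k^r r ln(k) opt_std(F). In this sketch we restrict to k >= 3 so that the denominator 1/k^r - alpha is positive:

```python
import math

def mistake_bound(k, r, opt_std):
    """The bound x < ln(1/alpha) * opt_std / (1/k^r - alpha) from the proof,
    with alpha = 1/(k^r * ln k).  This sketch assumes ln k > 1, i.e. k >= 3,
    so that the denominator is positive."""
    alpha = 1.0 / (k**r * math.log(k))
    return math.log(1.0 / alpha) * opt_std / (1.0 / k**r - alpha)

def leading_term(k, r, opt_std):
    """Leading term k^r * r * ln(k) * opt_std of the claimed upper bound."""
    return k**r * r * math.log(k) * opt_std

# The ratio bound / leading term exceeds 1 and shrinks toward 1 as k grows,
# matching the (1 + o(1)) factor in the statement of Theorem 17.
ratios = [mistake_bound(k, 2, 1.0) / leading_term(k, 2, 1.0) for k in (3, 10, 100)]
```

The decrease of the ratios illustrates how the lower-order terms ln ln k and 1/ln k are absorbed into the (1+o(1)) factor.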

5 New Models for Permutation Functions

In this section, we define and explore new models where the family of possible functions FF is a set of permutations of length nn and where the learner tries to guess information about the relative orders of inputs.

5.1 The Order Model

We first define a new variant model called the order model.

Definition 18.

In the order model, for a set FF of permutations of nn numbers, the learner tries to guess a permutation function fF.f\in F. On each turn, the adversary chooses rr distinct inputs to ff, i.e., a set S[n]S\subseteq[n] with |S|=r|S|=r. The learner guesses the permutation of {1,,r}\left\{1,\dots,r\right\} which is order-isomorphic to the outputs of ff on the given inputs.

Under weak reinforcement, the adversary informs the learner if they made a mistake, and under strong reinforcement, the adversary gives the correct answer to the learner. We denote the worst-case number of mistakes for the learner with weak reinforcement as optweak,<,r(F)\operatorname{opt}_{\operatorname{weak,<,r}}(F) and with strong reinforcement as optstrong,<,r(F).\operatorname{opt}_{\operatorname{strong,<,r}}(F). Note that optstrong,<,r(F)optweak,<,r(F).\operatorname{opt}_{\operatorname{strong,<,r}}(F)\leq\operatorname{opt}_{\operatorname{weak,<,r}}(F). If r=2,r=2, then strong and weak reinforcement are identical, since there are only two possible patterns and a no answer reveals the correct one, so equality holds.

We first find an upper bound for the order model by presenting a strategy for the learner which is analogous to Theorem 6.

Theorem 19.

For r>1,r>1, optweak,<,r(F)<r!ln|F|.\operatorname{opt}_{\operatorname{weak,<,r}}(F)<r!\ln|F|.

Proof.

For each input that the adversary gives, the learner can pick the answer that is consistent with the most remaining permutations. After each incorrect guess, at least a 1r!\frac{1}{r!} fraction of the previously possible permutations is eliminated. Therefore, the number of incorrect guesses, and consequently the number of mistakes, is at most logr!r!1|F|<r!ln|F|\log_{\frac{r!}{r!-1}}|F|<r!\ln|F|. ∎
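For small nn, the strategy in this proof can be simulated directly. The following Python sketch (illustrative only; the brute-force enumeration is exponential in nn, and the function names are ours) maintains the set of permutations consistent with all feedback and always guesses the pattern supported by the most survivors:

```python
from itertools import permutations

def pattern(f, inputs):
    """The permutation of {1, ..., r} order-isomorphic to f on the given inputs."""
    outs = [f[i] for i in inputs]
    order = sorted(range(len(outs)), key=lambda t: outs[t])
    pat = [0] * len(outs)
    for rank, idx in enumerate(order):
        pat[idx] = rank + 1
    return tuple(pat)

def plurality_learner(secret, queries, n):
    """Weak reinforcement: the adversary repeats each query until the learner
    is correct; the learner guesses the most popular pattern among the
    surviving permutations.  Returns the total number of mistakes."""
    survivors = list(permutations(range(n)))
    mistakes = 0
    for inputs in queries:
        truth = pattern(secret, inputs)
        while True:
            counts = {}
            for f in survivors:
                p = pattern(f, inputs)
                counts[p] = counts.get(p, 0) + 1
            guess = max(counts, key=counts.get)
            if guess == truth:
                # A correct guess also reveals the pattern, so filter on it.
                survivors = [f for f in survivors if pattern(f, inputs) == truth]
                break
            mistakes += 1
            survivors = [f for f in survivors if pattern(f, inputs) != guess]
    return mistakes
```

For example, with n = 4 and r = 2, the mistake count over all pairs stays below r! ln|F| = 2 ln 24 < 7, as Theorem 19 guarantees.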

When F=SnF=S_{n}, Theorem 19 shows that optweak,<,2(Sn)<2nln(n).\operatorname{opt}_{\operatorname{weak,<,2}}(S_{n})<2n\ln{n}. We find a lower bound on optweak,<,2(Sn)\operatorname{opt}_{\operatorname{weak,<,2}}(S_{n}) which is within a factor of 2+o(1)2+o(1) of the upper bound. In order to prove this bound, we define a function p(n)p(n) and prove a lemma about it.

Definition 20.

For n1,n\geq 1, we let vn:=log2nv_{n}:=\left\lfloor\log_{2}n\right\rfloor and define p(n):=1mnvmp(n):=\displaystyle\sum_{1\leq m\leq n}v_{m}.

Lemma 21.

For n1,n\geq 1, p(n)=(n+1)vn2(2vn1).p(n)=(n+1)v_{n}-2(2^{v_{n}}-1).

Proof.

We prove this by induction. The base case n=1n=1 holds, since p(1)=v1=0p(1)=v_{1}=0 and (1+1)v12(2v11)=0.(1+1)v_{1}-2(2^{v_{1}}-1)=0.

If nn is not a power of two, then vn1=vn,v_{n-1}=v_{n}, so p(n)=p(n1)+vn=(nvn2(2vn1))+vn=(n+1)vn2(2vn1)p(n)=p(n-1)+v_{n}=(nv_{n}-2(2^{v_{n}}-1))+v_{n}=(n+1)v_{n}-2(2^{v_{n}}-1), as desired.

Otherwise, n=2vnn=2^{v_{n}} is a power of two, so vn1=vn1v_{n-1}=v_{n}-1 and p(2vn)=p(2vn1)+vn=(2vn(vn1)2(2vn11))+vn=(2vn+1)vn2(2vn1)p(2^{v_{n}})=p(2^{v_{n}}-1)+v_{n}=(2^{v_{n}}(v_{n}-1)-2(2^{v_{n}-1}-1))+v_{n}=(2^{v_{n}}+1)v_{n}-2(2^{v_{n}}-1), as desired. ∎
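The closed form can also be checked mechanically. This short Python sketch (illustrative; the function names are ours) compares the definition of p(n)p(n) with the formula of Lemma 21:

```python
def v(m):
    """v_m = floor(log2(m)), computed exactly via bit length."""
    return m.bit_length() - 1

def p_direct(n):
    """Definition 20: p(n) = sum of v_m over 1 <= m <= n."""
    return sum(v(m) for m in range(1, n + 1))

def p_closed(n):
    """Lemma 21: p(n) = (n + 1) * v_n - 2 * (2^{v_n} - 1)."""
    return (n + 1) * v(n) - 2 * (2 ** v(n) - 1)
```

Using bit lengths keeps the computation in exact integer arithmetic, avoiding floating-point logarithms.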

Now we present a strategy resembling insertion sort for the adversary which achieves a lower bound of p(n).p(n).

Theorem 22.

Under the order model, the adversary can achieve optweak,<,2(Sn)p(n)\operatorname{opt}_{\operatorname{weak,<,2}}(S_{n})\geq p(n) for all n2n\geq 2.

Proof.

The adversary withholds any inquiries about f(i)f(i) until the order of the smaller inputs j<ij<i is known. We use induction to show the desired bound. The base case of n=2n=2 clearly holds, since p(2)=1p(2)=1.

For n>2,n>2, by the inductive hypothesis, the adversary can force the learner to make at least p(n1)p(n-1) mistakes without learning anything about f(n).f(n). Assume without loss of generality that f(1),,f(n1)f(1),\dots,f(n-1) are in increasing order.

The adversary then delays the learner from finding the position of f(n)f(n) by forcing a binary search. Specifically, if at some point the learner’s bounds on f(n)f(n) are a<f(n)b,a<f(n)\leq b, the adversary asks about (n,m)(n,m) where mm is a+b2\frac{a+b}{2} rounded to the nearest integer, and says no to the learner’s prediction. In this way, the adversary ensures that at least ba2\left\lfloor\frac{b-a}{2}\right\rfloor possible values for f(n)f(n) remain after the learner’s guess. Since the number of possible values starts at n,n, the adversary can guarantee vnv_{n} mistakes before the learner finds the value of f(n).f(n). Thus, optweak,<,2(Sn)optweak,<,2(Sn1)+vnp(n1)+vn=p(n)\operatorname{opt}_{\operatorname{weak,<,2}}(S_{n})\geq\operatorname{opt}_{\operatorname{weak,<,2}}(S_{n-1})+v_{n}\geq p(n-1)+v_{n}=p(n), as desired. ∎
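The counting in this proof can be illustrated with a small Python sketch (not part of the formal argument; the function name is ours): the adversary's halving forces ⌊log2 m⌋ mistakes when the mm-th element is inserted, and summing over the insertions recovers p(n)p(n).

```python
def forced_mistakes(m):
    """Mistakes forced while the learner locates the inserted element: the
    adversary answers no at the midpoint, so at least floor(c/2) of the c
    candidate positions survive each wrong guess."""
    candidates, mistakes = m, 0
    while candidates > 1:
        candidates //= 2  # at least floor(c/2) positions remain
        mistakes += 1
    return mistakes

# Summing over the insertions of elements 1..n recovers p(n).
total = sum(forced_mistakes(m) for m in range(1, 101))
```

The loop count equals ⌊log2 m⌋ exactly, matching the definition of v_m.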

We also get a lower bound on optweak,<,r(Sn)\operatorname{opt}_{\operatorname{weak,<,r}}(S_{n}) for r>2r>2 with a strategy that resembles merge sort. This lower bound is within a (1+o(1))rln(r)(1+o(1))r\ln{r} factor of the upper bound optweak,<,r(Sn)<(1o(1))r!nln(n)\operatorname{opt}_{\operatorname{weak,<,r}}(S_{n})<(1-o(1))r!n\ln{n} which follows from Theorem 19.

Theorem 23.

The adversary can achieve optweak,<,r(Sn)(1o(1))(r1)!nlogrn\operatorname{opt}_{\operatorname{weak,<,r}}(S_{n})\geq(1-o(1))(r-1)!n\log_{r}n as nn\to\infty and then rr\to\infty.

Proof.

We use strong induction on n.n.

First, the adversary divides [n][n] into nr\left\lfloor\frac{n}{r}\right\rfloor sets SiS_{i} each of size rr (ignoring any remaining elements). For each i,i, the adversary repeatedly asks for the ordering of Si,S_{i}, saying no each time until the learner knows the order. This forces a total of nr(r!1)\left\lfloor\frac{n}{r}\right\rfloor(r!-1) mistakes.

Then, the adversary forms sets CjC_{j} for 1jr1\leq j\leq r from the jthj^{th} smallest elements of each SiS_{i}, and applies the induction hypothesis to each CjC_{j}, which has size nr\left\lfloor\frac{n}{r}\right\rfloor; note that knowing the order of any CjC_{j} does not eliminate possibilities for the orders of the other CiC_{i}. This gives a recursion of the form optweak,<,r(Sn)(1o(1))n(r1)!+roptweak,<,r(Snr)=(1o(1))(r1)!nlogr(n)\operatorname{opt}_{\operatorname{weak,<,r}}(S_{n})\geq(1-o(1))n(r-1)!+r\operatorname{opt}_{\operatorname{weak,<,r}}(S_{\left\lfloor\frac{n}{r}\right\rfloor})=(1-o(1))(r-1)!n\log_{r}(n), as desired. ∎
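The recursion in this proof can be evaluated numerically. The following Python sketch (illustrative; the function names are ours) computes M(n) = ⌊n/r⌋(r! − 1) + r·M(⌊n/r⌋) and compares it with (r−1)!·n·log_r n; when nn is an exact power of rr the ratio is exactly 1 − 1/r!, consistent with the (1−o(1)) factor:

```python
import math

def merge_lower_bound(n, r):
    """Evaluate M(n) = floor(n/r) * (r! - 1) + r * M(floor(n/r)),
    with M(m) = 0 for m < r, as in the adversary strategy of Theorem 23."""
    if n < r:
        return 0
    return (n // r) * (math.factorial(r) - 1) + r * merge_lower_bound(n // r, r)

def merge_target(n, r):
    """The closed form (r - 1)! * n * log_r(n)."""
    return math.factorial(r - 1) * n * math.log(n, r)
```

For n = r^L the recursion unrolls to L·r^(L−1)·(r! − 1), and dividing by the closed form gives (r! − 1)/r! = 1 − 1/r!.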

5.2 The Comparison Model

We define a second variant called the comparison model.

Definition 24.

In the comparison model, for a set FF of permutations of nn numbers, the learner tries to guess a permutation function fF.f\in F. On each turn, the adversary chooses rr distinct pairs of inputs to ff, i.e., a set S{(i,j):i,j[n],ij}S\subseteq\{(i,j):i,j\in[n],i\neq j\} with |S|=r|S|=r. The learner guesses for each pair (i,j)(i,j) whether or not f(i)<f(j)f(i)<f(j).

Under weak reinforcement, the adversary informs the learner if they made a mistake, and under strong reinforcement, the adversary gives the correct choices to the learner. We denote the worst-case number of mistakes for the learner with weak reinforcement as optweak,c,r(F)\operatorname{opt}_{\operatorname{weak,c,r}}(F) and with strong reinforcement as optstrong,c,r(F).\operatorname{opt}_{\operatorname{strong,c,r}}(F). Note that optstrong,c,r(F)optweak,c,r(F).\operatorname{opt}_{\operatorname{strong,c,r}}(F)\leq\operatorname{opt}_{\operatorname{weak,c,r}}(F). If r=1,r=1, then both sides of the inequality are equal to optstrong,<,2(F).\operatorname{opt}_{\operatorname{strong,<,2}}(F).

We similarly get the following upper bound.

Theorem 25.

If FSnF\subseteq S_{n}, optweak,c,r(F)<2rln|F|.\operatorname{opt}_{\operatorname{weak,c,r}}(F)<2^{r}\cdot\ln|F|.

Proof.

For each input that the adversary gives, the learner can pick the answer that corresponds to the most possible permutations. After each incorrect guess, at least a 12r\frac{1}{2^{r}} fraction of the previously possible permutations get eliminated. Therefore, the number of incorrect guesses, and consequently the number of mistakes, is at most log2r2r1(|F|)<2rln|F|.\log_{\frac{2^{r}}{2^{r}-1}}(|F|)<2^{r}\ln|F|.

We present a strategy imitating quicksort to obtain the following lower bound.

Theorem 26.

The adversary can achieve optweak,c,r(Sn)(1o(1))2rrnlog2n\operatorname{opt}_{\operatorname{weak,c,r}}(S_{n})\geq(1-o(1))\frac{2^{r}}{r}n\log_{2}n as nn\to\infty and then rr\to\infty.

Proof.

The adversary first splits [n1][n-1] into n1r\left\lfloor\frac{n-1}{r}\right\rfloor sets of size rr with some leftover elements. For every such set SS with |S|=r|S|=r, the adversary repeatedly queries the set of pairs {(s,n):sS}\{(s,n):s\in S\}, responding no each time until the answer is forced. After all queries, the learner has made at least n1r(2r1)=(1o(1))2rrn\left\lfloor\frac{n-1}{r}\right\rfloor(2^{r}-1)=(1-o(1))\frac{2^{r}}{r}n mistakes.

The learner now knows at most the set of inputs ii for which f(i)<f(n)f(i)<f(n). The adversary can therefore force the learner to make at least optweak,c,r(Sf(n)1)+optweak,c,r(Snf(n))\operatorname{opt}_{\operatorname{weak,c,r}}(S_{f(n)-1})+\operatorname{opt}_{\operatorname{weak,c,r}}(S_{n-f(n)}) more mistakes.

Choosing the adversary's permutation so that the split is balanced, the total number of mistakes is thus at least 2((1o(1))2rrn2(log2n1))+(1o(1))2rrn=(1o(1))2rrnlog2n2\left((1-o(1))\frac{2^{r}}{r}\cdot\frac{n}{2}(\log_{2}n-1)\right)+(1-o(1))\frac{2^{r}}{r}n=(1-o(1))\frac{2^{r}}{r}n\log_{2}n, as desired. ∎
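The recurrence behind this count can likewise be evaluated numerically. This Python sketch (illustrative; the function names are ours) takes a balanced split and computes T(n) = ⌊(n−1)/r⌋(2^r − 1) + 2·T(⌊n/2⌋); the ratio against (2^r/r)·n·log2 n stays below 1 and approaches it for large nn and rr:

```python
import math

def quicksort_lower_bound(n, r):
    """T(n) = floor((n - 1)/r) * (2^r - 1) + 2 * T(floor(n/2)), with
    T(m) = 0 for m <= 1: the mistake count forced by the quicksort-style
    adversary under balanced splits."""
    if n <= 1:
        return 0
    return ((n - 1) // r) * (2**r - 1) + 2 * quicksort_lower_bound(n // 2, r)

def qs_target(n, r):
    """The closed form (2^r / r) * n * log2(n)."""
    return (2**r / r) * n * math.log2(n)
```

The ratio falls short of 1 because of the (2^r − 1)/2^r factor and the floors at small subproblem sizes, both absorbed by the (1−o(1)) term.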

5.3 The Selection Model

We define another variant called the selection model.

Definition 27.

In the selection model, for a set FF of permutations of nn numbers, the learner tries to guess a permutation function fF.f\in F. On each turn, the adversary chooses rr (not necessarily distinct) inputs to ff, and the learner guesses the input xx among them that maximizes f(x)f(x).

Under weak reinforcement, the adversary informs the learner if they made a mistake, and under strong reinforcement, the adversary gives the correct choices to the learner. We denote the worst-case number of mistakes for the learner with weak reinforcement as optweak,s,r(F)\operatorname{opt}_{\operatorname{weak,s,r}}(F) and with strong reinforcement as optstrong,s,r(F).\operatorname{opt}_{\operatorname{strong,s,r}}(F). Note that optstrong,s,r(F)optweak,s,r(F).\operatorname{opt}_{\operatorname{strong,s,r}}(F)\leq\operatorname{opt}_{\operatorname{weak,s,r}}(F). If r=2,r=2, then both sides of the inequality are equal to optstrong,<,2(F).\operatorname{opt}_{\operatorname{strong,<,2}}(F).

Note that because the inputs the adversary chooses do not have to be distinct, the adversary can effectively ask about any set of inputs of size at most rr by using duplicates.

Theorem 28.

For r>1r>1,

optweak,s,r(F)<rln|F|.\operatorname{opt}_{\operatorname{weak,s,r}}(F)<r\ln|F|.
Proof.

For each input that the adversary gives, the learner can pick the input that is maximal for the most possible remaining permutations. After each incorrect guess, at least a 1r\frac{1}{r} fraction of the previously possible permutations get eliminated. Therefore, the number of incorrect guesses, and consequently the number of mistakes, is at most logrr1(|F|)<rln|F|.\log_{\frac{r}{r-1}}(|F|)<r\ln|F|.

We again imitate merge sort to provide a strategy for the adversary.

Theorem 29.

The adversary can achieve optweak,s,r(Sn)(1o(1))nrlogr(n)/2\operatorname{opt}_{\operatorname{weak,s,r}}(S_{n})\geq(1-o(1))nr\log_{r}(n)/2 as nn\to\infty and then rr\to\infty.

Proof.

We use strong induction on nn.

First, the adversary divides [n][n] into rr sets SiS_{i} each of size nr\left\lfloor\frac{n}{r}\right\rfloor (ignoring any remaining elements). For each i,i, the adversary uses the optimal strategy for nr\left\lfloor\frac{n}{r}\right\rfloor on SiS_{i}, forcing a total of at least roptweak,s,r(Snr)r\operatorname{opt}_{\operatorname{weak,s,r}}(S_{\left\lfloor\frac{n}{r}\right\rfloor}) mistakes.

Then, the adversary repeatedly queries the set {x1,,xr}\{x_{1},\dots,x_{r}\}, where xix_{i} is the element of SiS_{i} with maximal output f(xi)f(x_{i}), saying no each time until the learner knows the overall maximal element, say xmx_{m}. Then, the adversary deletes xmx_{m} from SmS_{m} and repeats this process until all sets are empty, which takes rnrr\left\lfloor\frac{n}{r}\right\rfloor iterations in total.

The number of tries it takes the learner each time to guess correctly is the number of possibilities for the maximal remaining element, which is the number of nonempty SiS_{i}’s remaining. Therefore, until one of the SiS_{i}’s is empty, which takes at least |Si|=nr|S_{i}|=\left\lfloor\frac{n}{r}\right\rfloor turns, the adversary can force the learner to make at least (r1)(r-1) mistakes per iteration, and after that (r2)(r-2), and so on, down to 0 mistakes when all but one of the SiS_{i} are empty. Therefore, the learner throughout the process makes an average of at least (r1)/2(r-1)/2 mistakes per iteration, for rnr(r1)/2=(1o(1))nr/2r\left\lfloor\frac{n}{r}\right\rfloor\cdot(r-1)/2=(1-o(1))nr/2 mistakes total.

Thus, optweak,s,r(Sn)(1o(1))nr/2+roptweak,s,r(Snr)=(1o(1))nrlogr(n)/2\operatorname{opt}_{\operatorname{weak,s,r}}(S_{n})\geq(1-o(1))nr/2+r\operatorname{opt}_{\operatorname{weak,s,r}}(S_{\left\lfloor\frac{n}{r}\right\rfloor})=(1-o(1))nr\log_{r}(n)/2, as desired. ∎
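The merging phase of this strategy can be simulated. The following Python sketch (illustrative; the deletion schedule shown, which empties the sets as late as possible, is one admissible adversary choice, and the function name is ours) counts the mistakes forced per iteration as the number of nonempty sets minus one:

```python
def merging_mistakes(n, r):
    """Simulate the merging phase: r sets of size floor(n/r); each iteration
    deletes one element and forces (number of nonempty sets - 1) mistakes.
    The adversary deletes from a largest remaining set to keep all sets
    nonempty for as long as possible."""
    sizes = [n // r] * r
    mistakes = 0
    while any(sizes):
        nonempty = sum(1 for s in sizes if s > 0)
        mistakes += nonempty - 1
        sizes[sizes.index(max(sizes))] -= 1
    return mistakes
```

With n = 1000 and r = 10, the simulated count already exceeds the averaged accounting ⌊n/r⌋·r(r−1)/2 used in the proof, while never exceeding (r−1) mistakes per iteration.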

5.4 The Relative Position Model

Next we define a variant called the relative position model.

Definition 30.

In the relative position model, for a set FF of permutations of nn numbers, the learner tries to guess a permutation function fF.f\in F. On each turn, the adversary chooses a set SS of rr distinct inputs to ff and an element xS,x\notin S, and asks about the pair (x,S).(x,S). The learner guesses the relative position of f(x)f(x) in the permutation of {1,,r+1}\left\{1,\dots,r+1\right\} which is order-isomorphic to the outputs of ff on {x}S.\{x\}\cup S.

Under weak reinforcement, the adversary informs the learner if they made a mistake, and under strong reinforcement, the adversary gives the correct position to the learner. We denote the worst-case number of mistakes for the learner with weak reinforcement as optweak,p,r(F)\operatorname{opt}_{\operatorname{weak,p,r}}(F) and with strong reinforcement as optstrong,p,r(F).\operatorname{opt}_{\operatorname{strong,p,r}}(F). Note that optstrong,p,r(F)optweak,p,r(F).\operatorname{opt}_{\operatorname{strong,p,r}}(F)\leq\operatorname{opt}_{\operatorname{weak,p,r}}(F). If r=1,r=1, then both sides of the inequality are equal to optstrong,<,2(F).\operatorname{opt}_{\operatorname{strong,<,2}}(F).

We again imitate Theorem 6 to obtain an upper bound for the relative position model.

Theorem 31.

If FSnF\subseteq S_{n}, optweak,p,r(F)<(r+1)ln|F|.\operatorname{opt}_{\operatorname{weak,p,r}}(F)<(r+1)\ln|F|.

Proof.

For each input that the adversary gives, the learner can pick the answer that corresponds to the most possible permutations. After each incorrect guess, at least a 1r+1\frac{1}{r+1} fraction of the previously possible permutations get eliminated. Therefore, the number of incorrect guesses, and consequently the number of mistakes, is at most logr+1r(|F|)<(r+1)ln|F|.\log_{\frac{r+1}{r}}(|F|)<(r+1)\ln|F|.

As with Theorem 22, we use a strategy resembling insertion sort to obtain the following result.

Theorem 32.

Under the relative position model, limrlimnoptweak,p,r(Sn)rnlnn=1\lim_{r\to\infty}\lim_{n\to\infty}\frac{\operatorname{opt}_{\operatorname{weak,p,r}}(S_{n})}{rn\ln n}=1.

Proof.

The upper bound follows from Theorem 31. For the lower bound, we use strong induction on n.n. The adversary withholds any inquiries about f(i)f(i) until the order of the smaller inputs j<ij<i is known. Assume without loss of generality that f(1),,f(n)f(1),\dots,f(n) are in increasing order. Let gig_{i} be the function on [n][n] that maps mm to 11 if mim\geq i and 0 otherwise, and let GG be the family of these functions. Note that these functions are non-decreasing.

The remaining problem for the learner is equivalent to guessing gf(n+1),g_{f(n+1)}, where the adversary queries rr values at a time. Thus by Theorem 4, the adversary can force the learner to make at least optweak(CARTr(G))=(1o(1))rlnn\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(G))=(1-o(1))r\ln n mistakes on gf(n+1).g_{f(n+1)}. Summing these lower bounds over the successive insertions gives at least (1o(1))rlnn!=(1o(1))rnlnn(1-o(1))r\ln{n!}=(1-o(1))rn\ln n mistakes in total, as desired. ∎

In a similar fashion, we can prove an analogous result for pattern-avoiding permutations.

Theorem 33.

If Sn,πS_{n,\pi} is the set of π\pi-avoiding permutations of length nn, then optweak,p,r(Sn,π)(1o(1))rnln(k1)\operatorname{opt}_{\operatorname{weak,p,r}}(S_{n,\pi})\geq(1-o(1))rn\ln(k-1) as nn\to\infty and then rr\to\infty, where k=|π|k=|\pi| denotes the length of π\pi.

Proof.

The adversary withholds any inquiries about f(i)f(i) until the order of the smaller inputs j<ij<i is known. We use induction to show the desired bound.

Let NN be the set of possible values of f(n+1)f(n+1). Any value less than π(k)\pi(k) or greater than nk+π(k)n-k+\pi(k) must be in NN, since assigning it to f(n+1)f(n+1) cannot complete an occurrence of the pattern π\pi, so |N|(π(k)1)+(kπ(k))=k1|N|\geq(\pi(k)-1)+(k-\pi(k))=k-1. Let the elements of NN in increasing order be s1,,s|N|s_{1},\dots,s_{|N|}.

Let gig_{i} be the function on [|N|][|N|] that maps mm to 11 if msim\geq s_{i} and 0 otherwise, and let GG be the family of these functions. Note that these functions are non-decreasing.

The remaining problem for the learner is equivalent to guessing gf(n+1),g_{f(n+1)}, where the adversary queries rr values at a time. Thus by Theorem 4, the adversary can force the learner to make at least optweak(CARTr(G))=(1o(1))rln|N|(1o(1))rln(k1)\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(G))=(1-o(1))r\ln|N|\geq(1-o(1))r\ln(k-1) mistakes, as desired. ∎

When π=Ik\pi=I_{k}, the identity permutation, the size of Sn,πS_{n,\pi} is (k1)O(n)(k-1)^{O(n)} [16]. The next result follows from combining Theorems 33 and 31.

Corollary 34.

For Sn,IkS_{n,I_{k}} the family of IkI_{k}-avoiding permutations of length nn, optweak,p,r(Sn,Ik)=Θ(rnlnk)\operatorname{opt}_{\operatorname{weak,p,r}}(S_{n,I_{k}})=\Theta(rn\ln k) as nn\to\infty and then rr\to\infty.

5.5 The Delayed Relative Position Model

In this subsection, we define delayed reinforcement for the relative position model.

Definition 35.

In the delayed relative position model, for a set FF of permutations of nn numbers, the learner tries to guess a permutation function fF.f\in F. On each turn, the adversary picks an input xx and proceeds to give the rr elements of a set SS one by one (with the requirement that xSx\notin S). After each of the adversary’s inquiries, the learner guesses whether f(x)f(x) is higher or lower than the output of ff on the element just given. At the end of each round, the learner’s final guess for the relative position of f(x)f(x) among the outputs on {x}S\{x\}\cup S is one plus the number of times they said higher.

Under weak reinforcement, the adversary informs the learner if their final guess is incorrect, and under strong reinforcement, the adversary gives the correct position to the learner. We denote the worst-case number of mistakes for the learner with weak reinforcement as optwrpos,r(F)\operatorname{opt}_{\operatorname{wrpos},r}(F) and with strong reinforcement as optsrpos,r(F).\operatorname{opt}_{\operatorname{srpos},r}(F). Note that optsrpos,r(F)optwrpos,r(F),\operatorname{opt}_{\operatorname{srpos},r}(F)\leq\operatorname{opt}_{\operatorname{wrpos},r}(F), and that if r=1,r=1, both sides of the inequality are equal to optstrong,<,2(F).\operatorname{opt}_{\operatorname{strong,<,2}}(F).

We state the analogs of the relative position model results for the delayed version.

Theorem 36.

Given a set of permutations FSnF\subseteq S_{n} and a positive integer rr, these analogs of the results in Section 5.4 hold:

  • Analog of Theorem 31: optwrpos,r(F)<2rln|F|\operatorname{opt}_{\operatorname{wrpos},r}(F)<2^{r}\ln|F|.

  • Analog of Theorem 32: limrlimnoptwrpos,r(Sn)2rnlnn=1\lim_{r\to\infty}\lim_{n\to\infty}\frac{\operatorname{opt}_{\operatorname{wrpos},r}(S_{n})}{2^{r}n\ln n}=1.

  • Analog of Theorem 33: optwrpos,r(Sn,π)(1o(1))2rnln(k1)\operatorname{opt}_{\operatorname{wrpos},r}(S_{n,\pi})\geq(1-o(1))2^{r}n\ln(k-1) as nn\to\infty and then rr\to\infty, where k=|π|k=|\pi| denotes the length of π\pi.

  • Analog of Corollary 34: optwrpos,r(Sn,Ik)=Θ(2rnlnk)\operatorname{opt}_{\operatorname{wrpos},r}(S_{n,I_{k}})=\Theta(2^{r}n\ln k) as nn\to\infty and then rr\to\infty.

The proofs of these results are nearly identical to those of the previous section, with r+1r+1 replaced by 2r2^{r} and cited results from Section 2 replaced with analogous results from Section 3, so we omit them here.

6 Future Work

In this final section, we outline some possible areas for future work based on the results in our paper.

In Sections 2 and 3, we mainly focused on the price of feedback for non-decreasing FF in the rr-input weak reinforcement model and the rr-input delayed, ambiguous reinforcement model. What bounds can be obtained for general sets of functions FF? Moreover, is it possible to obtain a sharper bound on the maximum possible value of optamb,r(F)optweak(CARTr(F))\frac{\operatorname{opt}_{\operatorname{amb},r}(F)}{\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))} over all families of functions FF with |F|>1|F|>1?

In a preliminary version of this paper ([8]), we found rough bounds for the rr-input weak reinforcement model and the rr-input delayed, ambiguous reinforcement model on non-decreasing multi-class functions. In particular we proved that optweak(CARTr(F))(r+k1k1)ln(|F|)\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))\leq\binom{r+k-1}{k-1}\ln(|F|) and optweak(CARTr(F))12k2(1o(1))rln(|F|)\operatorname{opt}_{\operatorname{weak}}(\operatorname{CART}_{r}(F))\geq\frac{1}{2^{k-2}}(1-o(1))r\ln(|F|) for any subset of functions FF of the non-decreasing functions from XX to {0,1,,k1}\{0,1,\dots,k-1\}. We also showed in [8] that optamb,r(F)<min(krln(|F|),|F|)\operatorname{opt}_{\operatorname{amb},r}(F)<\min(k^{r}\ln(|F|),|F|) and optamb,r(F)(1o(1))2rk+2ln(|F|)\operatorname{opt}_{\operatorname{amb},r}(F)\geq(1-o(1))2^{r-k+2}\ln(|F|) for all XX and FF. Since the preceding upper and lower bounds for multi-class functions have large gaps, a natural problem is to narrow these gaps.

In Theorem 23, we described an adversary strategy for the order model that gives a bound of optweak,<,r(Sn)(1o(1))(r1)!nlogrn\operatorname{opt}_{\operatorname{weak,<,r}}(S_{n})\geq(1-o(1))(r-1)!n\log_{r}n, whereas Theorem 19 gives an upper bound of optweak,<,r(Sn)<r!lnn!=(1o(1))r!nlnn\operatorname{opt}_{\operatorname{weak,<,r}}(S_{n})<r!\ln n!=(1-o(1))r!n\ln n. We conjecture that the latter bound is sharp, i.e., that optweak,<,r(Sn)=(1o(1))r!nlnn\operatorname{opt}_{\operatorname{weak,<,r}}(S_{n})=(1-o(1))r!n\ln n.

Acknowledgments

We would like to thank the MIT PRIMES-USA program and everyone involved for providing us with this research opportunity.

References

  • [1] P. Auer and P.M. Long “Structural Results About On-line Learning Models With and Without Queries” In Machine Learning 36, 1999, pp. 147–181
  • [2] P. Auer, P.M. Long, W. Maass and G.J. Woeginger “On the complexity of function learning” In Machine Learning 18, 1995, pp. 187–230
  • [3] A. Blum “On-Line Algorithms in Machine Learning” In Online Algorithms: The State of the Art, 1998, pp. 306–325
  • [4] J.L. Carter and M.N. Wegman “Universal Classes of Hash Functions (Extended Abstract)” In Proceedings of the Ninth Annual ACM Symposium on Theory of Computing, 1977, pp. 106–112
  • [5] K. Crammer and C. Gentile “Multiclass classification with bandit feedback using adaptive regularization” In Machine Learning 90, 2012, pp. 347–383
  • [6] V. Dani, T. Hayes and S. Kakade “Stochastic Linear Optimization under Bandit Feedback” In 21st Annual Conference on Learning Theory, 2008, pp. 355–366
  • [7] A. Daniely and T. Helbertal “The price of bandit information in multiclass online classification”, 2013 arXiv:1310.8378
  • [8] R. Feng, A. Lee and E. Slettnes “Results on various models of mistake-bounded online learning”, 2021 URL: https://math.mit.edu/research/highschool/primes/papers
  • [9] J. Geneson “A note on the price of bandit feedback for mistake-bounded online learning” In Theoretical Computer Science 874, 2021, pp. 42–45
  • [10] E. Hazan and S. Kale “NEWTRON: An efficient bandit algorithm for online multiclass prediction” In 25th Annual Conference on Neural Information Processing Systems, 2011, pp. 891–899
  • [11] N. Littlestone “Learning Quickly When Irrelevant Attributes Abound: A New Linear-Threshold Algorithm” In Machine Learning 2, 1988, pp. 285–318
  • [12] P.M. Long “New bounds on the price of bandit feedback for mistake-bounded online multiclass learning” In Theoretical Computer Science 808, 2020, pp. 159–163
  • [13] M. Luby and A. Wigderson “Pairwise Independence and Derandomization” In Found. Trends Theor. Comput. Sci. 1, 2006, pp. 237–301
  • [14] C.R. Rao “Factorial Experiments Derivable from Combinatorial Arrangements of Arrays” In Supplement to the Journal of the Royal Statistical Society 9, 1947, pp. 128–139
  • [15] C.R. Rao “Hypercubes of strength d leading to confounded designs in factorial experiments” In Bulletin of the Calcutta Mathematical Society 38, 1946, pp. 67–68
  • [16] A. Regev “Asymptotic values for degrees associated with strips of young diagrams” In Advances in Mathematics 41, 1981, pp. 115–136
  • [17] H. Shvaytser “Linear manifolds are learnable from positive examples” Unpublished manuscript, 1988