
A Formal Solution to the Grain of Truth Problem

Jan Leike
Australian National University
jan.leike@anu.edu.au

Jessica Taylor
Machine Intelligence Research Inst.
jessica@intelligence.org

Benya Fallenstein
Machine Intelligence Research Inst.
benya@intelligence.org
Abstract

A Bayesian agent acting in a multi-agent environment learns to predict the other agents’ policies if its prior assigns positive probability to them (in other words, its prior contains a grain of truth). Finding a reasonably large class of policies that contains the Bayes-optimal policies with respect to this class is known as the grain of truth problem. Only small classes are known to have a grain of truth and the literature contains several related impossibility results. In this paper we present a formal and general solution to the full grain of truth problem: we construct a class of policies that contains all computable policies as well as Bayes-optimal policies for every lower semicomputable prior over the class. When the environment is unknown, Bayes-optimal agents may fail to act optimally even asymptotically. However, agents based on Thompson sampling converge to play $\varepsilon$-Nash equilibria in arbitrary unknown computable multi-agent environments. While these results are purely theoretical, we show that they can be computationally approximated arbitrarily closely.

Keywords. General reinforcement learning, multi-agent systems, game theory, self-reflection, asymptotic optimality, Nash equilibrium, Thompson sampling, AIXI.

1  Introduction

Consider the general setup of multiple reinforcement learning agents interacting sequentially in a known environment with the goal of maximizing discounted reward. (We mostly use the terminology of reinforcement learning; for readers from game theory we provide a dictionary in Table 1.) Each agent knows how the environment behaves, but does not know the other agents’ behavior. The natural (Bayesian) approach would be to define a class of possible policies that the other agents could adopt and to take a prior over this class. During the interaction, this prior gets updated to the posterior as our agent learns the others’ behavior. Our agent then acts optimally with respect to this posterior belief.

Reinforcement learning          Game theory
stochastic policy               mixed strategy
deterministic policy            pure strategy
agent                           player
multi-agent environment         infinite extensive-form game
reward                          payoff/utility
(finite) history                history
infinite history                path of play

Table 1: Terminology dictionary between reinforcement learning and game theory.

A famous result for infinitely repeated games states that as long as each agent assigns positive prior probability to the other agents’ policies (a grain of truth) and each agent acts Bayes-optimally, the agents converge to playing an $\varepsilon$-Nash equilibrium [KL93].

As an example, consider an infinitely repeated prisoner’s dilemma between two agents. In every time step the payoff matrix is as follows, where C means cooperate and D means defect.

        C           D
C    3/4, 3/4    0, 1
D    1, 0        1/4, 1/4

Define the set of policies $\Pi := \{\pi_\infty, \pi_0, \pi_1, \ldots\}$ where policy $\pi_t$ cooperates until time step $t$ or until the opponent defects (whichever happens first) and defects thereafter. The Bayes-optimal behavior is to cooperate until the posterior belief that the other agent defects in the time step after the next is greater than some constant (depending on the discount function) and to defect from then on. Therefore Bayes-optimal behavior leads to a policy from the set $\Pi$ (regardless of the prior). If both agents are Bayes-optimal with respect to some prior, they both have a grain of truth and therefore they converge to a Nash equilibrium: either they both cooperate forever or after some finite time they both defect forever. Alternating strategies like TitForTat (cooperate first, then play the opponent’s last action) are not part of the policy class $\Pi$, and adding them to the class breaks the grain of truth property: the Bayes-optimal behavior is no longer in the class. This is rather typical; a Bayesian agent usually needs to be more powerful than its environment [LH15b].

Until now, classes that admit a grain of truth were known only for small toy examples such as the iterated prisoner’s dilemma above [SLB09, Ch. 7.3]. The quest to find a large class admitting a grain of truth is known as the grain of truth problem [Hut09, Q. 5j]. The literature contains several impossibility results on the grain of truth problem [FY01, Nac97, Nac05] that identify properties that cannot be simultaneously satisfied for classes that allow a grain of truth.

In this paper we present a formal solution to multi-agent reinforcement learning and the grain of truth problem in the general setting (Section 3). We assume that our multi-agent environment is computable, but it does not need to be stationary/Markov, ergodic, or finite-state [Hut05]. Our class of policies is large enough to contain all computable (stochastic) policies, as well as all relevant Bayes-optimal policies. At the same time, our class is small enough to be limit computable. This is important because it allows our result to be computationally approximated.

In Section 4 we consider the setting where the multi-agent environment is unknown to the agents and has to be learned in addition to the other agents’ behavior. A Bayes-optimal agent may not learn to act optimally in unknown multi-agent environments even though it has a grain of truth. This effect occurs in non-recoverable environments where taking one wrong action can mean a permanent loss of future value. In this case, a Bayes-optimal agent avoids taking these dangerous actions and therefore will not explore enough to wash out the prior’s bias [LH15a]. Therefore, Bayesian agents are not asymptotically optimal, i.e., they do not always learn to act optimally [Ors13].

However, asymptotic optimality is achieved by Thompson sampling because the inherent randomness of Thompson sampling leads to enough exploration to learn the entire environment class [LLOH16]. This leads to our main result: if all agents use Thompson sampling over our class of multi-agent environments, then for every $\varepsilon > 0$ they converge to an $\varepsilon$-Nash equilibrium asymptotically.

The central idea of our construction is based on reflective oracles [FST15, FTC15b]. Reflective oracles are probabilistic oracles similar to halting oracles that answer whether the probability that a given probabilistic Turing machine $T$ outputs $1$ is higher than a given rational number $p$. The oracles are reflective in the sense that the machine $T$ may itself query the oracle, so the oracle has to answer queries about itself. This invites issues caused by self-referential liar paradoxes of the form “if the oracle says that I return $1$ with probability $> 1/2$, then return $0$, else return $1$.” Reflective oracles avoid these issues by being allowed to randomize if the machine does not halt or if the rational number equals the probability of outputting $1$. We introduce reflective oracles formally in Section 2 and prove that there is a limit computable reflective oracle.

2  Reflective Oracles

2.1  Preliminaries

Let $\mathcal{X}$ denote a finite set called the alphabet. The set $\mathcal{X}^* := \bigcup_{n=0}^{\infty} \mathcal{X}^n$ is the set of all finite strings over the alphabet $\mathcal{X}$, the set $\mathcal{X}^\infty$ is the set of all infinite strings over the alphabet $\mathcal{X}$, and the set $\mathcal{X}^\sharp := \mathcal{X}^* \cup \mathcal{X}^\infty$ is their union. The empty string is denoted by $\epsilon$, not to be confused with the small positive real number $\varepsilon$. Given a string $x \in \mathcal{X}^\sharp$, we denote its length by $|x|$. For a (finite or infinite) string $x$ of length $\geq k$, we denote with $x_{1:k}$ the first $k$ characters of $x$, and with $x_{<k}$ the first $k-1$ characters of $x$. The notation $x_{1:\infty}$ stresses that $x$ is an infinite string.

A function $f: \mathcal{X}^* \to \mathbb{R}$ is lower semicomputable iff the set $\{(x,p) \in \mathcal{X}^* \times \mathbb{Q} \mid f(x) > p\}$ is recursively enumerable. The function $f$ is computable iff both $f$ and $-f$ are lower semicomputable. Finally, the function $f$ is limit computable iff there is a computable function $\phi$ such that

$$\lim_{k\to\infty} \phi(x,k) = f(x).$$

The program $\phi$ that limit computes $f$ can be thought of as an anytime algorithm for $f$: we can stop $\phi$ at any time $k$ and get a preliminary answer. If the program $\phi$ has run long enough (which we do not know), this preliminary answer will be close to the correct one.
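
As a small illustration (ours, not from the paper), the following Python sketch treats limit computation as an anytime computation: here `phi(x, k)` is a stand-in approximation function with $\lim_{k\to\infty}\phi(x,k) = 1/x$, and the loop may be interrupted at any $k$.

    # Sketch: limit computation as an anytime algorithm (illustrative toy example).
    def phi(x, k):
        # k-th approximation of f(x) = 1/x for 0 < x < 1, via a truncated geometric series
        return sum((1 - x) ** i for i in range(k))

    def limit_compute(x, budget):
        """Run the anytime approximation for `budget` steps and return the last estimate."""
        estimate = None
        for k in range(1, budget + 1):
            estimate = phi(x, k)   # we may stop here at any time k
        return estimate

    print(limit_compute(0.5, 50))  # approaches f(0.5) = 2.0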

We use $\Delta\mathcal{Y}$ to denote the set of probability distributions over $\mathcal{Y}$. A list of notation can be found in Appendix A.

2.2  Definition

A semimeasure over the alphabet $\mathcal{X}$ is a function $\nu: \mathcal{X}^* \to [0,1]$ such that (i) $\nu(\epsilon) \leq 1$, and (ii) $\nu(x) \geq \sum_{a\in\mathcal{X}} \nu(xa)$ for all $x \in \mathcal{X}^*$. In the terminology of measure theory, semimeasures are probability measures on the probability space $\mathcal{X}^\sharp = \mathcal{X}^* \cup \mathcal{X}^\infty$ whose $\sigma$-algebra is generated by the cylinder sets $\Gamma_x := \{xz \mid z \in \mathcal{X}^\sharp\}$ [LV08, Ch. 4.2]. We call a semimeasure a (probability) measure iff equalities hold in (i) and (ii) for all $x \in \mathcal{X}^*$.

Next, we connect semimeasures to Turing machines. The literature uses monotone Turing machines, which naturally correspond to lower semicomputable semimeasures [LV08, Sec. 4.5.2] that describe the distribution that arises when piping fair coin flips into the monotone machine. Here we take a different route.

A probabilistic Turing machine is a Turing machine that has access to an unlimited number of uniformly random coin flips. Let $\mathcal{T}$ denote the set of all probabilistic Turing machines that take some input in $\mathcal{X}^*$ and may query an oracle (formally defined below). We take a Turing machine $T \in \mathcal{T}$ to correspond to a semimeasure $\lambda_T$ where $\lambda_T(a \mid x)$ is the probability that $T$ outputs $a \in \mathcal{X}$ when given $x \in \mathcal{X}^*$ as input. The value of $\lambda_T(x)$ is then given by the chain rule

$$\lambda_T(x) := \prod_{k=1}^{|x|} \lambda_T(x_k \mid x_{<k}). \quad (1)$$

Thus $\mathcal{T}$ gives rise to the set of semimeasures $\mathcal{M}$ whose conditionals $\lambda(a \mid x)$ are lower semicomputable. In contrast, the literature typically considers semimeasures whose joint probability (1) is lower semicomputable. The set $\mathcal{M}$ contains all computable measures. However, $\mathcal{M}$ is a proper subset of the set of all lower semicomputable semimeasures: the product (1) is lower semicomputable, but there are lower semicomputable semimeasures whose conditionals are not lower semicomputable [LH15c, Thm. 6].

In the following we assume that our alphabet is binary, i.e., $\mathcal{X} := \{0,1\}$.

Definition 1 (Oracle).

An oracle is a function $O: \mathcal{T} \times \{0,1\}^* \times \mathbb{Q} \to \Delta\{0,1\}$.

Oracles are understood to be probabilistic: they randomly return $0$ or $1$. Let $T^O$ denote the machine $T \in \mathcal{T}$ when run with the oracle $O$, and let $\lambda_T^O$ denote the semimeasure induced by $T^O$. This means that drawing from $\lambda_T^O$ involves two sources of randomness: one from the distribution induced by the probabilistic Turing machine $T$ and one from the oracle’s answers.

The intended semantics of an oracle are that it takes a query $(T, x, p)$ and returns $1$ if the machine $T^O$ outputs $1$ on input $x$ with probability greater than $p$ when run with the oracle $O$, i.e., when $\lambda_T^O(1 \mid x) > p$. Furthermore, the oracle returns $0$ if the machine $T^O$ outputs $1$ on input $x$ with probability less than $p$ when run with the oracle $O$, i.e., when $\lambda_T^O(1 \mid x) < p$. To fulfill this, the oracle $O$ has to make statements about itself, since the machine $T$ from the query may again query $O$. Therefore we call oracles of this kind reflective oracles. This has to be defined very carefully to avoid the obvious diagonalization issues that are caused by programs that ask the oracle about themselves. We impose the following self-consistency constraint.

Definition 2 (Reflective Oracle).

An oracle $O$ is reflective iff for all queries $(T, x, p) \in \mathcal{T} \times \{0,1\}^* \times \mathbb{Q}$,

  (i) $\lambda_T^O(1 \mid x) > p$ implies $O(T, x, p) = 1$, and

  (ii) $\lambda_T^O(0 \mid x) > 1 - p$ implies $O(T, x, p) = 0$.

If $p$ under- or overshoots the true probability of $\lambda_T^O(\,\cdot \mid x)$, then the oracle must reveal this information. However, in the critical case when $p = \lambda_T^O(1 \mid x)$, the oracle is allowed to return anything and may randomize its result. Furthermore, since $T$ might not output any symbol, it is possible that $\lambda_T^O(0 \mid x) + \lambda_T^O(1 \mid x) < 1$. In this case the oracle can reassign the non-halting probability mass to $0$, to $1$, or randomize; see Figure 1.

Figure 1: Answer options of a reflective oracle $O$ for the query $(T, x, p)$; the rational $p \in [0,1]$ falls into one of three regions: $O$ returns $1$ for $p < \lambda_T^O(1 \mid x)$, $O$ returns $0$ for $p > 1 - \lambda_T^O(0 \mid x)$, and $O$ may randomize in between. The values of $\lambda_T^O(0 \mid x)$ and $\lambda_T^O(1 \mid x)$ are depicted as the lengths of the corresponding line segments of the unit interval.
Example 3 (Reflective Oracles and Diagonalization).

Let $T \in \mathcal{T}$ be a probabilistic Turing machine that outputs $1 - O(T, \epsilon, 1/2)$ ($T$ can know its own source code by quining [Kle52, Thm. 27]). In other words, $T$ queries the oracle about whether it is more likely to output $1$ or $0$, and then does whichever the oracle says is less likely. In this case we can use an oracle with $O(T, \epsilon, 1/2) := 1/2$ (answer $0$ or $1$ with equal probability), which implies $\lambda_T^O(1 \mid \epsilon) = \lambda_T^O(0 \mid \epsilon) = 1/2$, so the conditions of Definition 2 are satisfied. In fact, for this machine $T$ we must have $O(T, \epsilon, 1/2) = 1/2$ for all reflective oracles $O$. ∎
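
The following Python sketch (ours, not from the paper) simulates this diagonalizing machine against the randomizing answer $O(T, \epsilon, 1/2) = 1/2$ and estimates $\lambda_T^O(1 \mid \epsilon)$ empirically; it illustrates why allowing the oracle to randomize dissolves the paradox.

    import random

    # Sketch of Example 3: the machine T outputs 1 - O(T, eps, 1/2).
    # With the randomizing oracle answer O(T, eps, 1/2) = 1/2, the induced
    # output distribution is uniform, which is consistent with reflectivity.

    def oracle_answer():
        # reflective choice for the critical query (T, eps, 1/2): randomize
        return random.randint(0, 1)

    def run_T():
        return 1 - oracle_answer()   # do whatever the oracle says is less likely

    samples = [run_T() for _ in range(100_000)]
    print(sum(samples) / len(samples))   # approx 0.5 = lambda_T^O(1 | eps)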

The following theorem establishes that reflective oracles exist.

Theorem 4 ([FTC15a, App. B]).

There is a reflective oracle.

Definition 5 (Reflective-Oracle-Computable).

A semimeasure is called reflective-oracle-computable iff it is computable on a probabilistic Turing machine with access to a reflective oracle.

For any probabilistic Turing machine $T \in \mathcal{T}$ we can complete the semimeasure $\lambda_T^O(\,\cdot \mid x)$ into a reflective-oracle-computable measure $\overline{\lambda}_T^O(\,\cdot \mid x)$: using the oracle $O$ and a binary search on the parameter $p$, we search for the crossover point where $O(T, x, p)$ goes from returning $1$ to returning $0$. The limit point $p^* \in \mathbb{R}$ of the binary search is random since the oracle’s answers may be random. But the main point is that the expectation of $p^*$ exists, so $\overline{\lambda}_T^O(1 \mid x) = \mathbb{E}[p^*] = 1 - \overline{\lambda}_T^O(0 \mid x)$ for all $x \in \mathcal{X}^*$. Hence $\overline{\lambda}_T^O$ is a measure. Moreover, if the oracle is reflective, then $\overline{\lambda}_T^O(x) \geq \lambda_T^O(x)$ for all $x \in \mathcal{X}^*$. In this sense the oracle $O$ can be viewed as a way of ‘completing’ all semimeasures $\lambda_T^O$ into measures by arbitrarily assigning the non-halting probability mass. If the oracle $O$ is reflective, this is consistent in the sense that Turing machines that run other Turing machines will be completed in the same way. This is especially important for a universal machine that runs all other Turing machines to induce a Solomonoff-style distribution.
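
The binary-search completion can be sketched as follows (a minimal illustration under our own assumptions: `oracle(T, x, p)` is a hypothetical callable returning 0 or 1, and we stop after a fixed number of bisection steps rather than taking a true limit).

    def completed_prob_one(oracle, T, x, steps=30):
        """Approximate lambda_bar_T^O(1 | x) by binary search for the oracle's crossover point.

        `oracle(T, x, p)` returns 1 if it asserts lambda_T^O(1|x) > p and 0 if it
        asserts lambda_T^O(0|x) > 1 - p (it may randomize in between)."""
        lo, hi = 0.0, 1.0
        for _ in range(steps):
            mid = (lo + hi) / 2
            if oracle(T, x, mid) == 1:
                lo = mid          # oracle says the true probability exceeds mid
            else:
                hi = mid          # oracle says the true probability is below mid
        return (lo + hi) / 2      # one sample of p*; averaging samples estimates E[p*]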

2.3  A Limit Computable Reflective Oracle

The proof of Theorem 4 given in [FTC15a, App. B] is nonconstructive and uses the axiom of choice. In Section 2.4 we give a constructive proof of the existence of reflective oracles and show that there is one that is limit computable.

Theorem 6 (A Limit Computable Reflective Oracle).

There is a reflective oracle that is limit computable.

This theorem has the immediate consequence that reflective oracles cannot be used as halting oracles. At first, this result may seem surprising: according to the definition of reflective oracles, they make concrete statements about the output of probabilistic Turing machines. However, the fact that the oracles may randomize some of the time actually removes enough information such that halting can no longer be decided from the oracle output.

Corollary 7 (Reflective Oracles are not Halting Oracles).

There is no probabilistic Turing machine $T$ such that for every prefix program $p$ and every reflective oracle $O$, we have that $\lambda_T^O(1 \mid p) > 1/2$ if $p$ halts and $\lambda_T^O(1 \mid p) < 1/2$ otherwise.

Proof.

Assume there were such a machine $T$ and let $O$ be the limit computable oracle from Theorem 6. Since $O$ is reflective, we can turn $T$ into a deterministic halting oracle by calling $O(T, p, 1/2)$, which deterministically returns $1$ if $p$ halts and $0$ otherwise. Since $O$ is limit computable, we can compute the output of $O$ on any query to arbitrary finite precision using our deterministic halting oracle. We construct a probabilistic Turing machine $T'$ that uses our halting oracle to compute (rather than query) the oracle $O$ on $(T', \epsilon, 1/2)$ to a precision of $1/3$ in finite time. If $O(T', \epsilon, 1/2) \pm 1/3 > 1/2$, the machine $T'$ outputs $0$; otherwise $T'$ outputs $1$. Since our halting oracle is entirely deterministic, the output of $T'$ is entirely deterministic as well (and $T'$ always halts), so $\lambda_{T'}^O(0 \mid \epsilon) = 1$ or $\lambda_{T'}^O(1 \mid \epsilon) = 1$. Therefore $O(T', \epsilon, 1/2) = 1$ or $O(T', \epsilon, 1/2) = 0$ because $O$ is reflective. A precision of $1/3$ is enough to tell these two cases apart, hence $T'$ returns $0$ if $O(T', \epsilon, 1/2) = 1$ and $T'$ returns $1$ if $O(T', \epsilon, 1/2) = 0$. This is a contradiction. ∎

A similar argument can also be used to show that reflective oracles are not computable.

2.4  Proof of Theorem 6

The idea for the proof of Theorem 6 is to construct an algorithm that outputs an infinite sequence of partial oracles converging to a reflective oracle in the limit.

The set of queries is countable, so we can assume that we have some computable enumeration of it:

$$\mathcal{T} \times \{0,1\}^* \times \mathbb{Q} =: \{q_1, q_2, \ldots\}$$
Definition 8 ($k$-Partial Oracle).

A $k$-partial oracle $\tilde{O}$ is a function from the first $k$ queries to the multiples of $2^{-k}$ in $[0,1]$:

$$\tilde{O}: \{q_1, q_2, \ldots, q_k\} \to \{n 2^{-k} \mid 0 \leq n \leq 2^k\}$$
Definition 9 (Approximating an Oracle).

A $k$-partial oracle $\tilde{O}$ approximates an oracle $O$ iff $|O(q_i) - \tilde{O}(q_i)| \leq 2^{-k-1}$ for all $i \leq k$.

Let $k \in \mathbb{N}$, let $\tilde{O}$ be a $k$-partial oracle, and let $T \in \mathcal{T}$ be an oracle machine. The machine $T^{\tilde{O}}$ that we get when we run $T$ with the $k$-partial oracle $\tilde{O}$ is defined as follows (with slight abuse of notation, since $k$ is taken to be understood implicitly).

  1. Run $T$ for at most $k$ steps.

  2. If $T$ calls the oracle on $q_i$ for $i \leq k$,

     (a) return $1$ with probability $\tilde{O}(q_i) - 2^{-k-1}$,

     (b) return $0$ with probability $1 - \tilde{O}(q_i) - 2^{-k-1}$, and

     (c) halt otherwise.

  3. If $T$ calls the oracle on $q_j$ for $j > k$, halt.

Furthermore, we define $\lambda_T^{\tilde{O}}$ analogously to $\lambda_T^O$ as the distribution generated by the machine $T^{\tilde{O}}$.
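
A rough Python sketch of running a machine with a $k$-partial oracle, under our own modeling assumptions: the machine is represented as a generator that yields query indices and eventually returns an output symbol, one simulation step corresponds to one yielded query, and `None` stands for the aborted or non-halting case.

    import random

    def run_with_partial_oracle(machine, partial_oracle, k):
        """Simulate the machine run with a k-partial oracle (sketch).

        `machine()` is a generator yielding 1-based query indices i and returning
        an output in {0, 1}; `partial_oracle` maps indices 1..k to multiples of 2**-k."""
        gen = machine()
        answer = None
        for _ in range(k):                         # run for at most k steps
            try:
                i = gen.send(answer)               # machine asks query q_i
            except StopIteration as stop:
                return stop.value                  # machine halted with an output
            if i > k:
                return None                        # query index too large: halt
            p = partial_oracle[i]
            u = random.random()
            if u < p - 2 ** (-k - 1):
                answer = 1                         # case 2(a)
            elif u < 1 - 2 ** (-k):
                answer = 0                         # case 2(b)
            else:
                return None                        # case 2(c): halt without output
        return None                                # step budget k exhausted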

Lemma 10.

If a $k$-partial oracle $\tilde{O}$ approximates a reflective oracle $O$, then $\lambda_T^O(1 \mid x) \geq \lambda_T^{\tilde{O}}(1 \mid x)$ and $\lambda_T^O(0 \mid x) \geq \lambda_T^{\tilde{O}}(0 \mid x)$ for all $x \in \{0,1\}^*$ and all $T \in \mathcal{T}$.

Proof.

This follows from the definition of $T^{\tilde{O}}$: when running $T$ with $\tilde{O}$ instead of $O$, we can only lose probability mass. If $T$ makes calls whose index is $> k$ or runs for more than $k$ steps, then the execution is aborted and no further output is generated. If $T$ makes calls whose index is $i \leq k$, then $\tilde{O}(q_i) - 2^{-k-1} \leq O(q_i)$ since $\tilde{O}$ approximates $O$. Therefore the return of the call $q_i$ is underestimated as well. ∎

Definition 11 ($k$-Partially Reflective).

A $k$-partial oracle $\tilde{O}$ is $k$-partially reflective iff for the first $k$ queries $(T, x, p)$

  • $\lambda_T^{\tilde{O}}(1 \mid x) > p$ implies $\tilde{O}(T, x, p) = 1$, and

  • $\lambda_T^{\tilde{O}}(0 \mid x) > 1 - p$ implies $\tilde{O}(T, x, p) = 0$.

It is important to note that we can check whether a $k$-partial oracle is $k$-partially reflective in finite time by running all machines $T$ from the first $k$ queries for $k$ steps and tallying up the probabilities to compute $\lambda_T^{\tilde{O}}$.

Lemma 12.

If $O$ is a reflective oracle and $\tilde{O}$ is a $k$-partial oracle that approximates $O$, then $\tilde{O}$ is $k$-partially reflective.

Lemma 12 only holds because we use semimeasures whose conditionals are lower semicomputable.

Proof.

Assuming $\lambda_T^{\tilde{O}}(1 \mid x) > p$, we get from Lemma 10 that $\lambda_T^O(1 \mid x) \geq \lambda_T^{\tilde{O}}(1 \mid x) > p$. Thus $O(T, x, p) = 1$ because $O$ is reflective. Since $\tilde{O}$ approximates $O$, we get $1 = O(T, x, p) \leq \tilde{O}(T, x, p) + 2^{-k-1}$, and since $\tilde{O}$ assigns values on a $2^{-k}$-grid, it follows that $\tilde{O}(T, x, p) = 1$. The second implication is proved analogously. ∎

Definition 13 (Extending Partial Oracles).

A $(k+1)$-partial oracle $\tilde{O}'$ extends a $k$-partial oracle $\tilde{O}$ iff $|\tilde{O}(q_i) - \tilde{O}'(q_i)| \leq 2^{-k-1}$ for all $i \leq k$.

Lemma 14.

There is an infinite sequence of partial oracles $(\tilde{O}_k)_{k\in\mathbb{N}}$ such that for each $k$, $\tilde{O}_k$ is a $k$-partially reflective $k$-partial oracle and $\tilde{O}_{k+1}$ extends $\tilde{O}_k$.

Proof.

By Theorem 4 there is a reflective oracle $O$. For every $k$, there is a canonical $k$-partial oracle $\tilde{O}_k$ that approximates $O$: restrict $O$ to the first $k$ queries and for any such query $q$ pick the value on the $2^{-k}$-grid that is closest to $O(q)$. By construction, $\tilde{O}_{k+1}$ extends $\tilde{O}_k$, and by Lemma 12, each $\tilde{O}_k$ is $k$-partially reflective. ∎

Lemma 15.

If the $(k+1)$-partial oracle $\tilde{O}_{k+1}$ extends the $k$-partial oracle $\tilde{O}_k$, then $\lambda_T^{\tilde{O}_{k+1}}(1 \mid x) \geq \lambda_T^{\tilde{O}_k}(1 \mid x)$ and $\lambda_T^{\tilde{O}_{k+1}}(0 \mid x) \geq \lambda_T^{\tilde{O}_k}(0 \mid x)$ for all $x \in \{0,1\}^*$ and all $T \in \mathcal{T}$.

Proof.

$T^{\tilde{O}_{k+1}}$ runs for one more step than $T^{\tilde{O}_k}$, can answer one more query, and has increased oracle precision. Moreover, since $\tilde{O}_{k+1}$ extends $\tilde{O}_k$, we have $|\tilde{O}_{k+1}(q_i) - \tilde{O}_k(q_i)| \leq 2^{-k-1}$, and thus $\tilde{O}_{k+1}(q_i) - 2^{-k-1} \geq \tilde{O}_k(q_i) - 2^{-k}$. Therefore the probability of successfully answering the oracle calls (cases 2(a) and 2(b)) does not decrease. ∎

Now everything is in place to state the algorithm that constructs a reflective oracle in the limit. It recursively traverses a tree of partial oracles. The tree’s nodes are the partial oracles; level $k$ of the tree contains all $k$-partial oracles. There is an edge in the tree from the $k$-partial oracle $\tilde{O}_k$ to the $i$-partial oracle $\tilde{O}_i$ if and only if $i = k+1$ and $\tilde{O}_i$ extends $\tilde{O}_k$.

For every $k$, there are only finitely many $k$-partial oracles, since they are functions from finite sets to finite sets. In particular, there are exactly two $1$-partial oracles (so the search tree has two roots). Pick one of them to start with, and proceed recursively as follows. Given a $k$-partial oracle $\tilde{O}_k$, there are finitely many $(k+1)$-partial oracles that extend $\tilde{O}_k$ (finite branching of the tree). Pick one that is $(k+1)$-partially reflective (which can be checked in finite time). If there is no $(k+1)$-partially reflective extension, backtrack.

By Lemma 14 our search tree is infinitely deep, and thus the tree search does not terminate. Moreover, it can backtrack to each level only a finite number of times because at each level there is only a finite number of possible extensions. Therefore the algorithm produces an infinite sequence of partial oracles, each extending the previous one. Because of the finite backtracking, the output eventually stabilizes on a sequence of partial oracles $\tilde{O}_1, \tilde{O}_2, \ldots$. By the following lemma, this sequence converges to a reflective oracle, which concludes the proof of Theorem 6.
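
The tree search can be sketched in Python as follows (illustrative only, under our own assumptions: `extensions(oracle, k)` enumerates the $(k+1)$-partial oracles extending `oracle`, with `extensions(None, 0)` yielding the $1$-partial oracles, and `is_partially_reflective(oracle, k)` performs the finite check described above; neither helper is implemented here, and the real algorithm never stops).

    def limit_reflective_oracle(extensions, is_partially_reflective, max_level):
        """Backtracking depth-first search over partial oracles (sketch)."""
        def search(oracle, k):
            print(k, oracle)                  # preliminary output; stabilizes level by level
            if k == max_level:                # cut-off so that the sketch terminates
                return True
            for candidate in extensions(oracle, k):
                if is_partially_reflective(candidate, k + 1) and search(candidate, k + 1):
                    return True
            return False                      # no partially reflective extension: backtrack

        for root in extensions(None, 0):
            if is_partially_reflective(root, 1) and search(root, 1):
                break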

Lemma 16.

Let $\tilde{O}_1, \tilde{O}_2, \ldots$ be a sequence where $\tilde{O}_k$ is a $k$-partially reflective $k$-partial oracle and $\tilde{O}_{k+1}$ extends $\tilde{O}_k$ for all $k \in \mathbb{N}$. Let $O := \lim_{k\to\infty} \tilde{O}_k$ be the pointwise limit. Then

  (a) $\lambda_T^{\tilde{O}_k}(1 \mid x) \to \lambda_T^O(1 \mid x)$ and $\lambda_T^{\tilde{O}_k}(0 \mid x) \to \lambda_T^O(0 \mid x)$ as $k \to \infty$ for all $x \in \{0,1\}^*$ and all $T \in \mathcal{T}$, and

  (b) $O$ is a reflective oracle.

Proof.

First note that the pointwise limit must exist because $|\tilde{O}_k(q_i) - \tilde{O}_{k+1}(q_i)| \leq 2^{-k-1}$ by Definition 13.

  (a) Since $\tilde{O}_{k+1}$ extends $\tilde{O}_k$, each $\tilde{O}_k$ approximates $O$. Let $x \in \{0,1\}^*$ and $T \in \mathcal{T}$ and consider the sequence $a_k := \lambda_T^{\tilde{O}_k}(1 \mid x)$ for $k \in \mathbb{N}$. By Lemma 15, $a_k \leq a_{k+1}$, so the sequence is monotone increasing. By Lemma 10, $a_k \leq \lambda_T^O(1 \mid x)$, so the sequence is bounded. Therefore it must converge. But it cannot converge to anything strictly below $\lambda_T^O(1 \mid x)$ by the definition of $T^O$.

  (b) By definition, $O$ is an oracle; it remains to show that $O$ is reflective. Let $q_i = (T, x, p)$ be some query. If $p < \lambda_T^O(1 \mid x)$, then by (a) there is a $k$ large enough such that $p < \lambda_T^{\tilde{O}_t}(1 \mid x)$ for all $t \geq k$. For any $t \geq \max\{k, i\}$, we have $\tilde{O}_t(T, x, p) = 1$ since $\tilde{O}_t$ is $t$-partially reflective. Therefore $1 = \lim_{k\to\infty} \tilde{O}_k(T, x, p) = O(T, x, p)$. The case $1 - p < \lambda_T^O(0 \mid x)$ is analogous. ∎

3  A Grain of Truth

3.1  Notation

In reinforcement learning, an agent interacts with an environment in cycles: at time step $t$ the agent chooses an action $a_t \in \mathcal{A}$ and receives a percept $e_t = (o_t, r_t) \in \mathcal{E}$ consisting of an observation $o_t \in \mathcal{O}$ and a real-valued reward $r_t \in \mathbb{R}$; the cycle then repeats for $t+1$. A history is an element of $(\mathcal{A} \times \mathcal{E})^*$. In this section, we use $æ \in \mathcal{A} \times \mathcal{E}$ to denote one interaction cycle, and $æ_{<t}$ to denote a history of length $t-1$.

We fix a discount function $\gamma: \mathbb{N} \to \mathbb{R}$ with $\gamma_t \geq 0$ and $\sum_{t=1}^\infty \gamma_t < \infty$. The goal in reinforcement learning is to maximize discounted rewards $\sum_{t=1}^\infty \gamma_t r_t$. The discount normalization factor is defined as $\Gamma_t := \sum_{k=t}^\infty \gamma_k$. The effective horizon $H_t(\varepsilon)$ is a horizon that is long enough to encompass all but an $\varepsilon$-fraction of the discount function’s mass:

$$H_t(\varepsilon) := \min\{k \mid \Gamma_{t+k}/\Gamma_t \leq \varepsilon\} \quad (2)$$
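
For instance, with geometric discounting $\gamma_t = \gamma^t$ we have $\Gamma_{t+k}/\Gamma_t = \gamma^k$, so $H_t(\varepsilon) = \lceil \log\varepsilon / \log\gamma \rceil$ independently of $t$. A small Python check of (2) under this assumption:

    import math

    def effective_horizon(Gamma, t, eps):
        """Smallest k with Gamma(t+k)/Gamma(t) <= eps, cf. equation (2)."""
        k = 0
        while Gamma(t + k) / Gamma(t) > eps:
            k += 1
        return k

    gamma = 0.95
    Gamma = lambda t: gamma ** t / (1 - gamma)          # geometric discounting
    print(effective_horizon(Gamma, t=1, eps=0.01))      # 90
    print(math.ceil(math.log(0.01) / math.log(gamma)))  # 90, closed form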

A policy is a function $\pi: (\mathcal{A} \times \mathcal{E})^* \to \Delta\mathcal{A}$ that maps a history $æ_{<t}$ to a distribution over the actions taken after seeing this history. The probability of taking action $a$ after history $æ_{<t}$ is denoted by $\pi(a \mid æ_{<t})$. An environment is a function $\nu: (\mathcal{A} \times \mathcal{E})^* \times \mathcal{A} \to \Delta\mathcal{E}$ where $\nu(e_t \mid æ_{<t} a_t)$ denotes the probability of receiving the percept $e_t$ when taking the action $a_t$ after the history $æ_{<t}$. Together, a policy $\pi$ and an environment $\nu$ give rise to a distribution $\nu^\pi$ over histories. Throughout this paper, we make the following assumptions.

Assumption 17.

  (a) Rewards are bounded between $0$ and $1$.

  (b) The set of actions $\mathcal{A}$ and the set of percepts $\mathcal{E}$ are both finite.

  (c) The discount function $\gamma$ and the discount normalization factor $\Gamma$ are computable.

Definition 18 (Value Function).

The value of a policy $\pi$ in an environment $\nu$ given history $æ_{<t}$ is defined recursively as $V^\pi_\nu(æ_{<t}) := \sum_{a\in\mathcal{A}} \pi(a \mid æ_{<t}) V^\pi_\nu(æ_{<t} a)$ and

$$V^\pi_\nu(æ_{<t} a_t) := \frac{1}{\Gamma_t} \sum_{e_t \in \mathcal{E}} \nu(e_t \mid æ_{<t} a_t) \Big( \gamma_t r_t + \Gamma_{t+1} V^\pi_\nu(æ_{1:t}) \Big)$$

if $\Gamma_t > 0$, and $V^\pi_\nu(æ_{<t} a_t) := 0$ if $\Gamma_t = 0$. The optimal value is defined as $V^*_\nu(æ_{<t}) := \sup_\pi V^\pi_\nu(æ_{<t})$.
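
As an illustration of Definition 18, the following Python sketch computes a truncated value $V^\pi_\nu$ by direct recursion up to a finite horizon $m$ (our own cut-off; since rewards are in $[0,1]$, the ignored tail is bounded by $\Gamma_{m+1}/\Gamma_t$).

    def value(policy, env, gamma, Gamma, history, t, m):
        """Truncated V^pi_nu at time t, recursing up to horizon m (sketch).

        policy(history) -> {action: probability}
        env(history, action) -> {(observation, reward): probability}
        gamma(t), Gamma(t) -> discount and normalizer, cf. Assumption 17."""
        if t > m or Gamma(t) == 0:
            return 0.0                      # ignore the tail beyond horizon m
        v = 0.0
        for a, pa in policy(history).items():             # V(h) = sum_a pi(a|h) V(ha)
            for e, pe in env(history, a).items():         # expectation over percepts
                reward = e[1]
                cont = value(policy, env, gamma, Gamma, history + [(a, e)], t + 1, m)
                v += pa * pe * (gamma(t) * reward + Gamma(t + 1) * cont) / Gamma(t)
        return v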

Definition 19 (Optimal Policy).

A policy $\pi$ is optimal in environment $\nu$ ($\nu$-optimal) iff for all histories $æ_{<t} \in (\mathcal{A} \times \mathcal{E})^*$ the policy $\pi$ attains the optimal value: $V^\pi_\nu(æ_{<t}) = V^*_\nu(æ_{<t})$.

We assumed that the discount function is summable, rewards are bounded (Assumption 17a), and action and percept spaces are both finite (Assumption 17b). Therefore an optimal deterministic policy exists for every environment [LH14, Thm. 10].

3.2  Reflective Bayesian Agents

Fix $O$ to be a reflective oracle. From now on, we assume that the action space $\mathcal{A} := \{\alpha, \beta\}$ is binary. We can treat computable measures over binary strings as environments: the environment $\nu$ corresponding to a probabilistic Turing machine $T \in \mathcal{T}$ is defined by

$$\nu(e_t \mid æ_{<t} a_t) := \overline{\lambda}_T^O(y \mid x) = \prod_{i=1}^k \overline{\lambda}_T^O(y_i \mid x y_1 \ldots y_{i-1})$$

where $y_{1:k}$ is a binary encoding of $e_t$ and $x$ is a binary encoding of $æ_{<t} a_t$. The actions $a_{1:\infty}$ are only contextual and not part of the environment distribution. We define

$$\nu(e_{<t} \mid a_{<t}) := \prod_{k=1}^{t-1} \nu(e_k \mid æ_{<k}).$$

Let $T_1, T_2, \ldots$ be an enumeration of all probabilistic Turing machines in $\mathcal{T}$. We define the class of reflective environments

$$\mathcal{M}_{\textrm{refl}}^O := \left\{ \overline{\lambda}_{T_1}^O, \overline{\lambda}_{T_2}^O, \ldots \right\}.$$

This is the class of all environments computable on a probabilistic Turing machine with reflective oracle $O$ that have been completed from semimeasures to measures using $O$.

Analogously to AIXI [Hut05], we define a Bayesian mixture over the class $\mathcal{M}_{\textrm{refl}}^O$. Let $w \in \Delta\mathcal{M}_{\textrm{refl}}^O$ be a lower semicomputable prior probability distribution on $\mathcal{M}_{\textrm{refl}}^O$. Possible choices for the prior include the Solomonoff prior $w\big(\overline{\lambda}_T^O\big) := 2^{-K(T)}$, where $K(T)$ denotes the length of the shortest input to some universal Turing machine that encodes $T$ [Sol78]. (Technically, the lower semicomputable prior $2^{-K(T)}$ is only a semidistribution because it does not sum to $1$; this turns out to be unimportant.) We define the corresponding Bayesian mixture

$$\xi(e_t \mid æ_{<t} a_t) := \sum_{\nu \in \mathcal{M}_{\textrm{refl}}^O} w(\nu \mid æ_{<t})\, \nu(e_t \mid æ_{<t} a_t) \quad (3)$$

where $w(\nu \mid æ_{<t})$ is the (renormalized) posterior,

$$w(\nu \mid æ_{<t}) := w(\nu) \frac{\nu(e_{<t} \mid a_{<t})}{\overline{\xi}(e_{<t} \mid a_{<t})}. \quad (4)$$

The mixture $\xi$ is lower semicomputable on an oracle Turing machine because the posterior $w(\,\cdot \mid æ_{<t})$ is lower semicomputable. Hence there is an oracle machine $T$ such that $\xi = \lambda_T^O$. We define its completion $\overline{\xi} := \overline{\lambda}_T^O$ as the completion of $\lambda_T^O$. This is the distribution that is used to compute the posterior. There are no cyclic dependencies since $\overline{\xi}$ is only called on the shorter history $æ_{<t}$. We arrive at the following statement.
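
A minimal sketch of the mixture (3) and posterior (4) over a finite stand-in class (our own toy setup; in the paper the class is countably infinite and the completion $\overline{\xi}$ is obtained via the oracle, so normalizing by the total likelihood below is only an approximation of (4)):

    def mixture_and_posterior(prior, envs, history):
        """Return the predictive mixture xi and the posterior w(. | history) (sketch).

        prior: {name: w(nu)}; envs: {name: nu} with nu(history, action) -> {percept: prob};
        history: list of (action, percept) pairs."""
        likelihood = {}
        for name, nu in envs.items():
            like = 1.0
            for i, (a, e) in enumerate(history):
                like *= nu(history[:i], a).get(e, 0.0)    # nu(e_<t | a_<t), chain rule
            likelihood[name] = prior[name] * like         # numerator of equation (4)
        norm = sum(likelihood.values()) or 1.0
        posterior = {name: l / norm for name, l in likelihood.items()}

        def xi(action, percept):
            # predictive mixture, equation (3)
            return sum(posterior[name] * envs[name](history, action).get(percept, 0.0)
                       for name in envs)
        return xi, posterior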

Proposition 20 (Bayes is in the Class).

$\overline{\xi} \in \mathcal{M}_{\textrm{refl}}^O$.

Moreover, since $O$ is reflective, $\overline{\xi}$ dominates all environments $\nu \in \mathcal{M}_{\textrm{refl}}^O$:

$$\begin{aligned}
\overline{\xi}(e_{1:t} \mid a_{1:t})
&= \overline{\xi}(e_t \mid æ_{<t} a_t)\, \overline{\xi}(e_{<t} \mid a_{<t}) \\
&\geq \xi(e_t \mid æ_{<t} a_t)\, \overline{\xi}(e_{<t} \mid a_{<t}) \\
&= \overline{\xi}(e_{<t} \mid a_{<t}) \sum_{\nu \in \mathcal{M}_{\textrm{refl}}^O} w(\nu \mid æ_{<t})\, \nu(e_t \mid æ_{<t} a_t) \\
&= \overline{\xi}(e_{<t} \mid a_{<t}) \sum_{\nu \in \mathcal{M}_{\textrm{refl}}^O} w(\nu) \frac{\nu(e_{<t} \mid a_{<t})}{\overline{\xi}(e_{<t} \mid a_{<t})}\, \nu(e_t \mid æ_{<t} a_t) \\
&= \sum_{\nu \in \mathcal{M}_{\textrm{refl}}^O} w(\nu)\, \nu(e_{1:t} \mid a_{1:t}) \\
&\geq w(\nu)\, \nu(e_{1:t} \mid a_{1:t}).
\end{aligned}$$

This property is crucial for on-policy value convergence.

Lemma 21 (On-Policy Value Convergence [Hut05, Thm. 5.36]).

For any policy $\pi$ and any environment $\mu \in \mathcal{M}_{\textrm{refl}}^O$ with $w(\mu) > 0$,

$$V^\pi_\mu(æ_{<t}) - V^\pi_{\overline{\xi}}(æ_{<t}) \to 0 \quad \mu^\pi\text{-almost surely as } t \to \infty.$$

3.3  Reflective-Oracle-Computable Policies

This subsection is dedicated to the following result, which was previously stated but not proved in [FST15, Alg. 6]. It contrasts with results on arbitrary semicomputable environments, where optimal policies are not limit computable [LH15b, Sec. 4].

Theorem 22 (Optimal Policies are Oracle Computable).

For every $\nu \in \mathcal{M}_{\textrm{refl}}^O$, there is a $\nu$-optimal (stochastic) policy $\pi^*_\nu$ that is reflective-oracle-computable.

Note that even though deterministic optimal policies always exist, those policies are typically not reflective-oracle-computable.

To prove Theorem 22 we need the following lemma.

Lemma 23 (Reflective-Oracle-Computable Optimal Value Function).

For every environment $\nu \in \mathcal{M}_{\textrm{refl}}^O$, the optimal value function $V^*_\nu$ is reflective-oracle-computable.

Proof.

This proof follows the proof of [LH15b, Cor. 13]. We write the optimal value explicitly as

$$V^*_\nu(æ_{<t}) = \frac{1}{\Gamma_t} \lim_{m\to\infty} \operatorname*{\max\!\textstyle\sum}_{æ_{t:m}} \; \sum_{k=t}^m \gamma_k r_k \prod_{i=t}^k \nu(e_i \mid æ_{<i}), \quad (5)$$

where $\operatorname*{\max\!\textstyle\sum}$ denotes the expectimax operator:

$$\operatorname*{\max\!\textstyle\sum}_{æ_{t:m}} := \max_{a_t\in\mathcal{A}} \sum_{e_t\in\mathcal{E}} \ldots \max_{a_m\in\mathcal{A}} \sum_{e_m\in\mathcal{E}}.$$

For a fixed $m$, all involved quantities are reflective-oracle-computable. Moreover, this quantity is monotone increasing in $m$, and the tail sum from $m+1$ to $\infty$ is bounded by $\Gamma_{m+1}$, which is computable according to Assumption 17c and converges to $0$ as $m \to \infty$. Therefore we can enumerate all rationals above and below $V^*_\nu$. ∎

Proof of Theorem 22.

According to Lemma 23 the optimal value function $V^*_\nu$ is reflective-oracle-computable. Hence there is a probabilistic Turing machine $T$ such that

$$\lambda_T^O(1 \mid æ_{<t}) = \big( V^*_\nu(æ_{<t}\alpha) - V^*_\nu(æ_{<t}\beta) + 1 \big)/2.$$

We define a policy $\pi$ that takes action $\alpha$ if $O(T, æ_{<t}, 1/2) = 1$ and action $\beta$ if $O(T, æ_{<t}, 1/2) = 0$. (This policy is stochastic because the answer of the oracle $O$ is stochastic.)

It remains to show that $\pi$ is a $\nu$-optimal policy. If $V^*_\nu(æ_{<t}\alpha) > V^*_\nu(æ_{<t}\beta)$, then $\lambda_T^O(1 \mid æ_{<t}) > 1/2$, thus $O(T, æ_{<t}, 1/2) = 1$ since $O$ is reflective, and hence $\pi$ takes action $\alpha$. Conversely, if $V^*_\nu(æ_{<t}\alpha) < V^*_\nu(æ_{<t}\beta)$, then $\lambda_T^O(1 \mid æ_{<t}) < 1/2$, thus $O(T, æ_{<t}, 1/2) = 0$ since $O$ is reflective, and hence $\pi$ takes action $\beta$. Lastly, if $V^*_\nu(æ_{<t}\alpha) = V^*_\nu(æ_{<t}\beta)$, then both actions are optimal and it does not matter which action is returned by policy $\pi$. (This is the case where the oracle may randomize.) ∎
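
The construction in this proof can be sketched in Python as follows (assuming a hypothetical `oracle` callable and a machine `T_value_gap` encoding the value difference as above):

    def optimal_action(oracle, T_value_gap, history):
        """Choose an action via a single reflective-oracle query (sketch of Theorem 22).

        `T_value_gap` is the machine T with lambda_T^O(1 | history)
        = (V*(history, alpha) - V*(history, beta) + 1) / 2, and
        `oracle(T, x, p)` returns 0 or 1 (possibly randomized when p = 1/2 is critical)."""
        return "alpha" if oracle(T_value_gap, history, 0.5) == 1 else "beta"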

3.4  Solution to the Grain of Truth Problem

Together, Proposition 20 and Theorem 22 provide the necessary ingredients to solve the grain of truth problem.

Corollary 24 (Solution to the Grain of Truth Problem).

For every lower semicomputable prior $w \in \Delta\mathcal{M}_{\textrm{refl}}^O$, the Bayes-optimal policy $\pi^*_{\overline{\xi}}$ is reflective-oracle-computable, where $\xi$ is the Bayes-mixture corresponding to $w$ defined in (3).

Proof.

Immediate from Proposition 20 and Theorem 22. ∎

Hence the environment class $\mathcal{M}_{\textrm{refl}}^O$ contains any reflective-oracle-computable modification of the Bayes-optimal policy $\pi^*_{\overline{\xi}}$. In particular, this includes computable multi-agent environments that contain other Bayesian agents over the class $\mathcal{M}_{\textrm{refl}}^O$. So any Bayesian agent over the class $\mathcal{M}_{\textrm{refl}}^O$ has a grain of truth even though the environment may contain other Bayesian agents of equal power. We proceed to sketch the implications for multi-agent environments in the next section.

4  Multi-Agent Environments

This section summarizes our results for multi-agent systems. The proofs can be found in [Lei16].

4.1  Setup

In a multi-agent environment there are $n$ agents, each taking sequential actions from the finite action space $\mathcal{A}$. In each time step $t = 1, 2, \ldots$, the environment receives action $a_t^i$ from agent $i$ and outputs $n$ percepts $e_t^1, \ldots, e_t^n \in \mathcal{E}$, one for each agent. Each percept $e_t^i = (o_t^i, r_t^i)$ contains an observation $o_t^i$ and a reward $r_t^i \in [0,1]$. Importantly, agent $i$ only sees its own action $a_t^i$ and its own percept $e_t^i$ (see Figure 2). We use the shorthand notation $a_t := (a_t^1, \ldots, a_t^n)$ and $e_t := (e_t^1, \ldots, e_t^n)$, and denote $æ_{<t}^i = a_1^i e_1^i \ldots a_{t-1}^i e_{t-1}^i$ and $æ_{<t} = a_1 e_1 \ldots a_{t-1} e_{t-1}$.

Figure 2: Agents $\pi_1, \ldots, \pi_n$ interacting in a multi-agent environment $\sigma$: in each time step, agent $i$ sends action $a_t^i$ to $\sigma$ and receives percept $e_t^i$.

We define a multi-agent environment as a function

$$\sigma: (\mathcal{A}^n \times \mathcal{E}^n)^* \times \mathcal{A}^n \to \Delta(\mathcal{E}^n).$$

The agents are given by $n$ policies $\pi_1, \ldots, \pi_n$ where $\pi_i: (\mathcal{A} \times \mathcal{E})^* \to \Delta\mathcal{A}$. Together they specify the history distribution

$$\begin{aligned}
\sigma^{\pi_{1:n}}(\epsilon) &:= 1 \\
\sigma^{\pi_{1:n}}(æ_{1:t}) &:= \sigma^{\pi_{1:n}}(æ_{<t} a_t)\, \sigma(e_t \mid æ_{<t} a_t) \\
\sigma^{\pi_{1:n}}(æ_{<t} a_t) &:= \sigma^{\pi_{1:n}}(æ_{<t}) \prod_{i=1}^n \pi_i(a_t^i \mid æ_{<t}^i).
\end{aligned}$$

Each agent $i$ acts in a subjective environment $\sigma_i$ given by joining the multi-agent environment $\sigma$ with the policies $\pi_1, \ldots, \pi_{i-1}, \pi_{i+1}, \ldots, \pi_n$ and marginalizing over the histories that $\pi_i$ does not see. Together with the policy $\pi_i$, the environment $\sigma_i$ yields a distribution over the histories of agent $i$,

$$\sigma_i^{\pi_i}(æ_{<t}^i) := \sum_{æ_{<t}^j,\, j \neq i} \sigma^{\pi_{1:n}}(æ_{<t}).$$

We get the definition of the subjective environment $\sigma_i$ from the identity $\sigma_i(e_t^i \mid æ_{<t}^i a_t^i) := \sigma_i^{\pi_i}(e_t^i \mid æ_{<t}^i a_t^i)$. It is crucial to note that the subjective environment $\sigma_i$ and the policy $\pi_i$ are ordinary environments and policies, so we can use the formalism from Section 3.

Our definition of a multi-agent environment is very general and encompasses most of game theory. It allows for cooperative, competitive, and mixed games; infinitely repeated games or any (infinite-length) extensive form games with finitely many players.

The policy $\pi_i$ is an $\varepsilon$-best response after history $æ_{<t}^i$ iff

$$V^*_{\sigma_i}(æ_{<t}^i) - V^{\pi_i}_{\sigma_i}(æ_{<t}^i) < \varepsilon.$$

If at some time step $t$ all agents’ policies are $\varepsilon$-best responses, we have an $\varepsilon$-Nash equilibrium. The property of multi-agent systems that is analogous to asymptotic optimality is convergence to an $\varepsilon$-Nash equilibrium.

4.2  Informed Reflective Agents

Let $\sigma$ be a multi-agent environment and let $\pi^*_{\sigma_1}, \ldots, \pi^*_{\sigma_n}$ be such that for each $i$ the policy $\pi^*_{\sigma_i}$ is an optimal policy in agent $i$’s subjective environment $\sigma_i$. At first glance this seems ill-defined: the subjective environment $\sigma_i$ depends on each other policy $\pi^*_{\sigma_j}$ for $j \neq i$, which depends on the subjective environment $\sigma_j$, which in turn depends on the policy $\pi^*_{\sigma_i}$. However, this circular definition actually has a well-defined solution.

Theorem 25 (Optimal Multi-Agent Policies).

For any reflective-oracle-computable multi-agent environment $\sigma$, the optimal policies $\pi^*_{\sigma_1}, \ldots, \pi^*_{\sigma_n}$ exist and are reflective-oracle-computable.

Note the strength of Theorem 25: each of the policies $\pi^*_{\sigma_i}$ acts optimally given the knowledge of everyone else’s policies. Hence optimal policies play $0$-best responses by definition, so if every agent plays an optimal policy, we have a Nash equilibrium. Moreover, this Nash equilibrium is also a subgame perfect Nash equilibrium, because each agent also acts optimally on the counterfactual histories that do not end up being played. In other words, Theorem 25 states the existence and reflective-oracle-computability of a subgame perfect Nash equilibrium in any reflective-oracle-computable multi-agent environment. From Theorem 6 we then get that these subgame perfect Nash equilibria are limit computable.

Corollary 26 (Solution to Computable Multi-Agent Environments).

For any computable multi-agent environment $\sigma$, the optimal policies $\pi^*_{\sigma_1}, \ldots, \pi^*_{\sigma_n}$ exist and are limit computable.

4.3  Learning Reflective Agents

Since our class $\mathcal{M}_{\textrm{refl}}^O$ solves the grain of truth problem, the result by Kalai and Lehrer [KL93] immediately implies the following: for any Bayesian agents $\pi_1, \ldots, \pi_n$ interacting in an infinitely repeated game, for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$, there is almost surely a $t_0 \in \mathbb{N}$ such that for all $t \geq t_0$ the policy $\pi_i$ is an $\varepsilon$-best response. However, this hinges on the important fact that every agent has to know the game and that all other agents are Bayesian agents. Otherwise the convergence to an $\varepsilon$-Nash equilibrium may fail, as illustrated by the following example.

At the core of the following construction is a dogmatic prior [LH15a, Sec. 3.2]. A dogmatic prior assigns very high probability to going to hell (reward $0$ forever) if the agent deviates from a given computable policy $\pi$. For a Bayesian agent it is thus only worth deviating from the policy $\pi$ if the agent thinks that the prospects of following $\pi$ are very poor already. This implies that for general multi-agent environments and without additional assumptions on the prior, we cannot prove any meaningful convergence result about Bayesian agents acting in an unknown multi-agent environment.

Example 27 (Reflective Bayesians Playing Matching Pennies).

In the game of matching pennies there are two agents ($n = 2$) and two actions $\mathcal{A} = \{\alpha, \beta\}$ representing the two sides of a penny. In each time step agent $1$ wins if the two actions are identical and agent $2$ wins if the two actions are different. The payoff matrix is as follows.

        α        β
α     1, 0     0, 1
β     0, 1     1, 0

We use $\mathcal{E} = \{0,1\}$ as the set of rewards (observations are vacuous) and define the multi-agent environment $\sigma$ to give reward $1$ to agent $1$ iff $a_t^1 = a_t^2$ (and $0$ otherwise) and reward $1$ to agent $2$ iff $a_t^1 \neq a_t^2$ (and $0$ otherwise). Note that neither agent knows a priori that they are playing matching pennies, nor that they are playing an infinitely repeated game with one other player.

Let $\pi_1$ be the policy that takes the action sequence $(\alpha\alpha\beta)^\infty$ and let $\pi_2 := \pi_\alpha$ be the policy that always takes action $\alpha$. The average reward of policy $\pi_1$ is $2/3$ and the average reward of policy $\pi_2$ is $1/3$. Let $\xi$ be a universal mixture as in (3). By Lemma 21, $V^{\pi_1}_{\overline{\xi}} \to c_1 \approx 2/3$ and $V^{\pi_2}_{\overline{\xi}} \to c_2 \approx 1/3$ almost surely when following the policies $(\pi_1, \pi_2)$. Therefore there is an $\varepsilon > 0$ such that $V^{\pi_1}_{\overline{\xi}} > \varepsilon$ and $V^{\pi_2}_{\overline{\xi}} > \varepsilon$ for all time steps. Now we can apply [LH15a, Thm. 7] to conclude that there are (dogmatic) mixtures $\xi_1'$ and $\xi_2'$ such that $\pi^*_{\xi_1'}$ always follows policy $\pi_1$ and $\pi^*_{\xi_2'}$ always follows policy $\pi_2$. This does not converge to an ($\varepsilon$-)Nash equilibrium. ∎

A policy $\pi$ is asymptotically optimal in mean in an environment class $\mathcal{M}$ iff for all $\mu \in \mathcal{M}$

$$\mathbb{E}_\mu^\pi\big[ V^*_\mu(æ_{<t}) - V^\pi_\mu(æ_{<t}) \big] \to 0 \text{ as } t \to \infty, \quad (6)$$

where $\mathbb{E}_\mu^\pi$ denotes the expectation with respect to the probability distribution $\mu^\pi$ over histories generated by policy $\pi$ acting in environment $\mu$.

Asymptotic optimality stands out because it is currently the only known nontrivial objective notion of optimality in general reinforcement learning [LH15a].

The following theorem is the main convergence result. It states that for asymptotically optimal agents we get convergence to $\varepsilon$-Nash equilibria in any reflective-oracle-computable multi-agent environment.

Theorem 28 (Convergence to Equilibrium).

Let $\sigma$ be a reflective-oracle-computable multi-agent environment and let $\pi_1, \ldots, \pi_n$ be reflective-oracle-computable policies that are asymptotically optimal in mean in the class $\mathcal{M}_{\textrm{refl}}^O$. Then for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$, the $\sigma^{\pi_{1:n}}$-probability that the policy $\pi_i$ is an $\varepsilon$-best response converges to $1$ as $t \to \infty$.

In contrast to Theorem 25, which yields policies that play a subgame perfect equilibrium, this is not the case for Theorem 28: the agents typically do not learn to predict off-policy and thus will generally not play $\varepsilon$-best responses in the counterfactual histories that they never see. This weaker form of equilibrium is unavoidable if the agents do not know the environment, because it is impossible to learn the parts of the environment that they do not interact with.

Together with Theorem 6 and the asymptotic optimality of the Thompson sampling policy [LLOH16, Thm. 4], which is reflective-oracle-computable, we get the following corollary.

Corollary 29 (Convergence to Equilibrium).

There are limit computable policies $\pi_1, \ldots, \pi_n$ such that for any computable multi-agent environment $\sigma$, for all $\varepsilon > 0$ and all $i \in \{1, \ldots, n\}$, the $\sigma^{\pi_{1:n}}$-probability that the policy $\pi_i$ is an $\varepsilon$-best response converges to $1$ as $t \to \infty$.
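
For illustration, here is a minimal Thompson-sampling-style agent loop over a finite stand-in class (our own simplification; the policy of [LLOH16] resamples at the end of effective horizons and works over the full class $\mathcal{M}_{\textrm{refl}}^O$):

    import random

    def thompson_agent(prior, envs, horizon_fn, optimal_policy, env_step, steps):
        """Sketch: sample an environment from the posterior, follow its optimal policy
        for one effective horizon, then resample.

        prior: {name: weight}; envs: {name: nu}; optimal_policy(nu) -> (history -> action);
        env_step(history, action) -> percept; horizon_fn(t) -> effective horizon H_t(eps)."""
        history = []
        posterior = dict(prior)
        t = 1
        while t <= steps:
            # sample rho from the posterior and commit to pi*_rho for H_t(eps) steps
            names, weights = zip(*posterior.items())
            rho = random.choices(names, weights=weights)[0]
            policy = optimal_policy(envs[rho])
            for _ in range(horizon_fn(t)):
                a = policy(history)
                e = env_step(history, a)
                history.append((a, e))
                # Bayesian update of the posterior with the new percept
                for name in posterior:
                    posterior[name] *= envs[name](history[:-1], a).get(e, 0.0)
                norm = sum(posterior.values()) or 1.0
                posterior = {n: wgt / norm for n, wgt in posterior.items()}
                t += 1
        return history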

5  Discussion

This paper introduced the class $\mathcal{M}_{\textrm{refl}}^O$ of all reflective-oracle-computable environments. This class solves the grain of truth problem because it contains (any computable modification of) Bayesian agents defined over $\mathcal{M}_{\textrm{refl}}^O$: the optimal policies and the Bayes-optimal policies over this class are all reflective-oracle-computable (Theorem 22 and Corollary 24).

If the environment is unknown, then a Bayesian agent may end up playing suboptimally (Example 27). However, if each agent uses a policy that is asymptotically optimal in mean (such as the Thompson sampling policy [LLOH16]), then for every $\varepsilon > 0$ the agents converge to an $\varepsilon$-Nash equilibrium (Theorem 28 and Corollary 29).

Our solution to the grain of truth problem is purely theoretical. However, Theorem 6 shows that our class $\mathcal{M}_{\textrm{refl}}^O$ allows for computable approximations. This suggests that practical approaches can be derived from this result, and reflective oracles have already seen applications in one-shot games [FTC15b].

Acknowledgements

We thank Marcus Hutter and Tom Everitt for valuable comments.

References

  • [FST15] Benja Fallenstein, Nate Soares, and Jessica Taylor. Reflective variants of Solomonoff induction and AIXI. In Artificial General Intelligence. Springer, 2015.
  • [FTC15a] Benja Fallenstein, Jessica Taylor, and Paul F Christiano. Reflective oracles: A foundation for classical game theory. Technical report, Machine Intelligence Research Institute, 2015. http://arxiv.org/abs/1508.04145.
  • [FTC15b] Benja Fallenstein, Jessica Taylor, and Paul F Christiano. Reflective oracles: A foundation for game theory in artificial intelligence. In Logic, Rationality, and Interaction, pages 411–415. Springer, 2015.
  • [FY01] Dean P Foster and H Peyton Young. On the impossibility of predicting the behavior of rational agents. Proceedings of the National Academy of Sciences, 98(22):12848–12853, 2001.
  • [Hut05] Marcus Hutter. Universal Artificial Intelligence. Springer, 2005.
  • [Hut09] Marcus Hutter. Open problems in universal induction & intelligence. Algorithms, 3(2):879–906, 2009.
  • [KL93] Ehud Kalai and Ehud Lehrer. Rational learning leads to Nash equilibrium. Econometrica, pages 1019–1045, 1993.
  • [Kle52] Stephen Cole Kleene. Introduction to Metamathematics. Wolters-Noordhoff Publishing, 1952.
  • [Lei16] Jan Leike. Nonparametric General Reinforcement Learning. PhD thesis, Australian National University, 2016.
  • [LH14] Tor Lattimore and Marcus Hutter. General time consistent discounting. Theoretical Computer Science, 519:140–154, 2014.
  • [LH15a] Jan Leike and Marcus Hutter. Bad universal priors and notions of optimality. In Conference on Learning Theory, pages 1244–1259, 2015.
  • [LH15b] Jan Leike and Marcus Hutter. On the computability of AIXI. In Uncertainty in Artificial Intelligence, pages 464–473, 2015.
  • [LH15c] Jan Leike and Marcus Hutter. On the computability of Solomonoff induction and knowledge-seeking. In Algorithmic Learning Theory, pages 364–378, 2015.
  • [LLOH16] Jan Leike, Tor Lattimore, Laurent Orseau, and Marcus Hutter. Thompson sampling is asymptotically optimal in general environments. In Uncertainty in Artificial Intelligence, 2016.
  • [LV08] Ming Li and Paul M. B. Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science. Springer, 3rd edition, 2008.
  • [Nac97] John H Nachbar. Prediction, optimization, and learning in repeated games. Econometrica, 65(2):275–309, 1997.
  • [Nac05] John H Nachbar. Beliefs in repeated games. Econometrica, 73(2):459–480, 2005.
  • [Ors13] Laurent Orseau. Asymptotic non-learnability of universal agents with computable horizon functions. Theoretical Computer Science, 473:149–156, 2013.
  • [SLB09] Yoav Shoham and Kevin Leyton-Brown. Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations. Cambridge University Press, 2009.
  • [Sol78] Ray Solomonoff. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transactions on Information Theory, 24(4):422–432, 1978.

Appendix A List of Notation

$:=$   defined to be equal
$\mathbb{N}$   the natural numbers, starting with 0
$\mathbb{Q}$   the rational numbers
$\mathbb{R}$   the real numbers
$t$   (current) time step, $t \in \mathbb{N}$
$k, n, i$   time steps, natural numbers
$p$   a rational number
$\mathcal{X}^*$   the set of all finite strings over the alphabet $\mathcal{X}$
$\mathcal{X}^\infty$   the set of all infinite strings over the alphabet $\mathcal{X}$
$\mathcal{X}^\sharp$   the set of all finite and infinite strings over the alphabet $\mathcal{X}$
$O$   a reflective oracle
$\tilde{O}$   a partial oracle
$q$   a query to a reflective oracle
$\mathcal{T}$   the set of all probabilistic Turing machines that can query an oracle
$T, T'$   probabilistic Turing machines that can query an oracle, $T, T' \in \mathcal{T}$
$K(x)$   the Kolmogorov complexity of a string $x$
$\lambda_T$   the semimeasure corresponding to the probabilistic Turing machine $T$
$\lambda_T^O$   the semimeasure corresponding to the probabilistic Turing machine $T$ with reflective oracle $O$
$\overline{\lambda}_T^O$   the completion of $\lambda_T^O$ into a measure using the reflective oracle $O$
$\mathcal{A}$   the finite set of possible actions
$\mathcal{O}$   the finite set of possible observations
$\mathcal{E}$   the finite set of possible percepts, $\mathcal{E} \subset \mathcal{O} \times \mathbb{R}$
$\alpha, \beta$   two different actions, $\alpha, \beta \in \mathcal{A}$
$a_t$   the action in time step $t$
$o_t$   the observation in time step $t$
$r_t$   the reward in time step $t$, bounded between 0 and 1
$e_t$   the percept in time step $t$; we use $e_t = (o_t, r_t)$ implicitly
$æ_{<t}$   the first $t-1$ interactions, $a_1 e_1 a_2 e_2 \ldots a_{t-1} e_{t-1}$ (a history of length $t-1$)
$\epsilon$   the empty string / the history of length 0
$\varepsilon$   a small positive real number
$\gamma$   the discount function $\gamma: \mathbb{N} \to \mathbb{R}_{\geq 0}$
$\Gamma_t$   the discount normalization factor, $\Gamma_t := \sum_{k=t}^\infty \gamma_k$
$\nu, \mu$   environments/semimeasures
$\sigma$   a multi-agent environment
$\sigma^{\pi_{1:n}}$   the history distribution induced by policies $\pi_1, \ldots, \pi_n$ acting in the multi-agent environment $\sigma$
$\sigma_i$   the subjective environment of agent $i$
$\pi$   a policy, $\pi: (\mathcal{A} \times \mathcal{E})^* \to \Delta\mathcal{A}$
$\pi^*_\nu$   an optimal policy for environment $\nu$
$V^\pi_\nu$   the $\nu$-expected value of the policy $\pi$
$V^*_\nu$   the optimal value in environment $\nu$
$\mathcal{M}$   a countable class of environments
$\mathcal{M}_{\textrm{refl}}^O$   the class of all reflective-oracle-computable environments
$w$   a universal prior, $w \in \Delta\mathcal{M}_{\textrm{refl}}^O$
$\xi$   the universal mixture over all environments in $\mathcal{M}_{\textrm{refl}}^O$, a semimeasure
$\overline{\xi}$   the completion of the mixture $\xi = \lambda_T^O$ into a measure using the reflective oracle $O$