
Agent Spaces

John C. Raisbeck (corresponding author, jraisbeck@mgh.harvard.edu), Center for Systems Biology, Massachusetts General Hospital, Boston, Massachusetts
Matthew W. Allen, Center for Systems Biology, Massachusetts General Hospital, Boston, Massachusetts
Hakho Lee, Center for Systems Biology, Massachusetts General Hospital, Boston, Massachusetts
Abstract

Exploration is one of the most important tasks in Reinforcement Learning, but it is not well-defined beyond finite problems in the Dynamic Programming paradigm (see Subsection 2.4). We provide a reinterpretation of exploration which can be applied to any online learning method.

We come to this definition by approaching exploration from a new direction. After finding that concepts of exploration created to solve simple Markov decision processes with Dynamic Programming are no longer broadly applicable, we reexamine exploration. Instead of extending the ends of dynamic exploration procedures, we extend their means. That is, rather than repeatedly sampling every state-action pair possible in a process, we define the act of modifying an agent to itself be explorative. The resulting definition of exploration can be applied in infinite problems and non-dynamic learning methods, which the dynamic notion of exploration cannot tolerate.

To understand the way that modifications of an agent affect learning, we describe a novel structure on the set of agents: a collection of distances (see footnote 8) $d_{a}$, $a\in A$, which represent the perspectives of each agent possible in the process. Using these distances, we define a topology and show that many important structures in Reinforcement Learning are well behaved under the topology induced by convergence in the agent space.

0 Introduction

Reinforcement Learning (RL) is the study of stochastic processes which are composed of a decision process $\mathcal{P}$ and an agent $a$. The Reinforcement Learning problem of a decision process $\mathcal{P}$ with reward $R$ is to find an optimal agent $a^{*}$ which maximizes the expectation of a reward function $R:\Phi\rightarrow\mathbb{R}$ with respect to the distribution of paths $\Phi^{a}$ drawn from the process $(\mathcal{P},a)$. Typically, Reinforcement Learning methods seek $a^{*}$ or otherwise high-reward agents via iterative learning processes, sometimes described as trial-and-error [1].

In an online learning algorithm, learners send one or more agents to interact with the process and collect information on those interactions. After studying these interactions, the learner develops new agents to experiment with, seeking to improve the reward of successive generations of agents (see Definition 1.15). In this pursuit, there is a trade-off between seeking high-quality agents during learning (exploitation) and seeking new information about the decision process (exploration).

Exploitation, being closely related to general iterative optimization algorithms, is well-understood. In simple problems, its trade-off with exploration has been extensively researched [1, 2]. However, in more complex problems the conclusions reached by studying simple problems seem to have little bearing; some of the exploration methods which are least efficient in simple problems have been used in the most impressive demonstrations of the power of Reinforcement Learning to date [3, 4].

This paper focuses on exploration, especially on methods of exploration which ignore a certain set of the tenets of Dynamic Programming [5, 6]. We call these methods naïve, and class among them Novelty Search [7], which we discuss in Subsection 4.1. We begin in Section 1 and Section 2, rigorously describing the problem of Reinforcement Learning and introducing the necessity for exploration. In Subsection 2.4 and Section 3, we study the notions of exploration coming from the study of Dynamic Programming, and consider their efficacy in modern Reinforcement Learning. In Section 4, we investigate the properties of a class of spaces based on what we call primitive behavior, and in Section 5 introduce a general form of these spaces, which we call agent spaces. The agent space can be defined in a broad class of decision processes and efficiently describes several important features of an agent. In Section 6, it is demonstrated that the distributions of truncated paths $\phi_{t}$ are continuous in the agent, and in Section 7 it is demonstrated that certain functions of those paths (chiefly, expected reward) are continuous in the agent space.

1 Reinforcement Learning

1.1 Definitions

Definition 1.1 (Decision Process).

A discrete time decision process $\mathcal{P}$ is a controlled stochastic process indexed by $\mathbb{N}$. Associated with the process are a set of states $S$, a set of actions $\mathpzc{A}$, and a state-transition function $\sigma$.

Definition 1.2 (State).

A state $s\in S$ is a state possible in the decision process $\mathcal{P}$.

Definition 1.3 (Action).

An action $\mathpzc{a}\in\mathpzc{A}$ is a control which an agent can exert on $\mathcal{P}$.

Definition 1.4 (Path).

A path (sometimes called a trajectory) $\phi\in\Phi$ in a decision process $\mathcal{P}$ is a sequence of state-action pairs generated by the interaction between the process and an agent $a$:

$\phi=(s_{0},\mathpzc{a}_{0}),(s_{1},\mathpzc{a}_{1}),(s_{2},\mathpzc{a}_{2}),\ldots$ (1.1.1)
Remark 1.1.

We will sometimes need to refer to truncated paths ($\phi_{t}$), i.e. paths which contain only the state-action pairs associated with indices $i\leq t$. This is common when referring to the domains of agents $a$ and the state-transition function $\sigma$. We will see that truncated paths suffice to describe the domain of $\sigma$, but because an agent acts only when a new state has been generated, its domain is the set of truncated prime paths ($\phi_{t}^{\prime}$): paths which contain the $t$-th state, but not the $t$-th action.

  1. $\phi_{t}=\{(s_{0},\mathpzc{a}_{0}),(s_{1},\mathpzc{a}_{1}),\ldots,(s_{t-1},\mathpzc{a}_{t-1}),(s_{t},\mathpzc{a}_{t})\}$ (see Figure 1)

  2. $\phi_{t}^{\prime}=\{(s_{0},\mathpzc{a}_{0}),(s_{1},\mathpzc{a}_{1}),\ldots,(s_{t-1},\mathpzc{a}_{t-1}),s_{t}\}$ (see Figure 2)

Sometimes (as in equation (1.1.118)), it is necessary to refer to the $t$-th state or action of a path $\phi$. We denote the $t$-th state $\phi_{t}(s)$ and the $t$-th action $\phi_{t}(\mathpzc{a})$, using $s_{t}$ and $\mathpzc{a}_{t}$ when unambiguous.

Because $s_{0}$ has no antecedent, it is determined by the initial state distribution $s_{0}\sim\sigma_{0}$.

Definition 1.5 (Initial State Distribution).

The initial state distribution $\sigma_{0}$ of a process $\mathcal{P}$ is the probability distribution over $S$ which determines the first state in a path, $s_{0}=\phi_{0}(s)\sim\sigma_{0}$.

Definition 1.6 (State-Transition Function).

The state-transition function $\sigma$ of a process $\mathcal{P}$ is the function which takes the truncated path $\phi_{t-1}$ to the distribution of states $\sigma(\phi_{t-1})$ from which the next state $s_{t}=\phi_{t}(s)$ is drawn.

Thus, the state at $t$ is determined by

$s_{t}\sim\begin{cases}\sigma_{0},&\text{if }t=0,\\ \sigma(\phi_{t-1}),&\text{if }t>0.\end{cases}$ (1.1.2)

Definition 1.7 (Agent).

An agent (footnote 2: Some authors call this function a policy, referring to its embodiment as an agent. See Subsection 1.2.1, Definition 5.6, and Subsection 7.1.) $a\in A$ is a function from the set of truncated prime paths $\Phi^{\prime}=\bigcup_{t\in\mathbb{N}}\Phi_{t}^{\prime}$ (notice that $\Phi^{\prime}$ contains only truncated paths, and $\Phi$ contains only infinite paths) of a process into the set of actions $\mathpzc{A}$ of the process:

$a:\bigcup_{t\in\mathbb{N}}\Phi_{t}^{\prime}\rightarrow\mathpzc{A}.$ (1.1.3)
Figure 1: Truncated paths $\phi_{t}$ are drawn from a distribution defined by iteratively drawing states and new actions from $\sigma$ and $a$. The complete path $\phi$ is defined by the collection of these.

Figure 2: The structure of a decision process.
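The interaction depicted in Figures 1 and 2 can be made concrete with a short sketch. The following minimal example generates a truncated path by alternately drawing states from $\sigma$ and actions from an agent $a$; the toy two-state process, the agent, and all names are illustrative assumptions, not part of the paper.

import random

def sample_truncated_path(sigma0, sigma, agent, t_max, rng=random.Random(0)):
    """Draw a truncated path phi_t = [(s_0, a_0), ..., (s_t, a_t)]."""
    path = []
    # s_0 is drawn from the initial state distribution sigma_0.
    state = rng.choices(list(sigma0), weights=list(sigma0.values()))[0]
    for _ in range(t_max + 1):
        # The agent acts on the truncated prime path (the history plus the new state).
        action = agent(path, state)
        path.append((state, action))
        # The next state is drawn from sigma applied to the truncated path.
        dist = sigma(path)
        state = rng.choices(list(dist), weights=list(dist.values()))[0]
    return path

# Toy two-state process: the chosen action shifts the next-state distribution.
sigma0 = {"s0": 1.0}
def sigma(path):
    s, a = path[-1]
    return {"s0": 0.9, "s1": 0.1} if a == "stay" else {"s0": 0.1, "s1": 0.9}
def agent(path, state):
    return "stay" if state == "s0" else "move"

print(sample_truncated_path(sigma0, sigma, agent, t_max=5))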

Agents are judged by the quality of the control which they exert, as measured by the expectation of the reward of their paths.

Definition 1.8 (Reward).

A reward function is a function from the set of paths $\Phi$ of a process $\mathcal{P}$ into the real numbers $\mathbb{R}$:

$R:\Phi\rightarrow\mathbb{R}$ (1.1.117)

Often [1], $R$ can be described as a sum:

$R(\phi)=\sum_{t\in\mathbb{N}}r(\phi_{t}(s),\phi_{t}(\mathpzc{a}))$, or (1.1.118)

$R(\phi)=\sum_{t\in\mathbb{N}}r(\phi_{t}(s),\phi_{t}(\mathpzc{a}))\,\omega(t),$ (1.1.119)

where $r:S\times\mathpzc{A}\rightarrow\mathbb{R}$ is an immediate [8, 49] reward function, and $\omega:\mathbb{N}\rightarrow[0,1]$ is a discount function such that

$\Omega=\sum_{t\in\mathbb{N}}\omega(t)<\infty.$ (1.1.120)

Definition 1.9 (Expected Reward).

The expected reward $J$ is the expectation of the reward of paths $\phi$ drawn from the distribution of paths $\Phi^{a}$ generated by the process $(\mathcal{P},a)$:

$J(a)=\mathbb{E}_{\phi\sim\Phi^{a}}\left[R(\phi)\right]$ (1.1.121)
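As a small illustration of Definitions 1.8 and 1.9, the following sketch computes a discounted reward of the form (1.1.119) on a truncated path and estimates $J(a)$ by Monte Carlo averaging over sampled paths; the toy path sampler, immediate reward $r$, and discount are illustrative assumptions.

import random

def discounted_reward(path, r, gamma=0.9):
    """Truncated form of (1.1.119): sum over t of r(s_t, a_t) * gamma^t."""
    return sum(r(s, a) * gamma**t for t, (s, a) in enumerate(path))

def estimate_J(sample_path, r, n_paths=1000, gamma=0.9):
    """Monte Carlo estimate of J(a): average the reward of sampled paths."""
    return sum(discounted_reward(sample_path(), r, gamma) for _ in range(n_paths)) / n_paths

rng = random.Random(0)
def sample_path():      # toy sampler: i.i.d. states in {0, 1}; the agent echoes the state
    return [(s, s) for s in (rng.randint(0, 1) for _ in range(20))]
def r(s, a):            # toy immediate reward
    return 1.0 if s == a else 0.0

print(estimate_J(sample_path, r))  # ~8.8, the truncated geometric sum of 0.9^t for t < 20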

Reinforcement Learning is concerned with the optimal control of decision processes, pursuing an optimal agent $a^{*}$ which achieves the greatest possible expected reward $J(a)$. To this end, learning algorithms (see Subsection 1.3) are employed. Many algorithms pursue such agents by searching the agents which are “near” the agents which they have recently considered. In some contexts, this suffices to guarantee a solution in finitely many steps (see Theorem 2.1). When this cannot be guaranteed, the problem is often restated in terms of discovering “satisfactory” agents, or discovering the highest quality agent possible under certain constraints.

Remark 1.2 (Table of Notation).

We will frequently reference states, actions, paths, and sets and distributions of the same (footnote 3: Without superscripts, $\Phi$ represents the set of possible paths. Note that only truncated paths can be prime.):

             Individual            Set                    Distribution
State        $s$                   $S$                    $\sigma(\phi_{t})$
Action       $\mathpzc{a}$         $\mathpzc{A}$          $a(\phi_{t}^{\prime})$
Path         $\phi$                $\Phi$                 $\Phi^{a}$
Prime Path   $\phi_{t}^{\prime}$   $\Phi_{t}^{\prime}$    $\Phi_{t}^{\prime a}$

Often, the decision processes in Reinforcement Learning are assumed to be Markov Decision Processes (MDPs): decision processes which satisfy the Markov property.

Definition 1.10 (Markov Property).

A decision process has the Markov Property if, given the prior state-action pair $(s_{t-1},\mathpzc{a}_{t-1})$, the distribution of the state-action pair $(s_{t},\mathpzc{a}_{t})$ is independent of the pairs $(s,\mathpzc{a})\in\{(s_{i},\mathpzc{a}_{i})\,|\,i<t-1\}$.

Remark 1.3 (A Simple Condition for the Markov Property).

Sometimes a slightly stricter definition of the Markov property is used: a decision process is guaranteed to be Markov if $\sigma$ is a function of $(s_{t},\mathpzc{a}_{t})$ and $a$ is a function of $s_{t+1}$ alone (which is itself a function of $(s_{t},\mathpzc{a}_{t})$). We call such agents strictly Markov:

Definition 1.11 (Strictly Markov Agent).

A strictly Markov agent $a\in A$ is a function from the set of states $S$ into the set of actions $\mathpzc{A}$:

$a:S\rightarrow\mathpzc{A}.$ (1.1.122)

1.2 Computational considerations

When studying modern Reinforcement Learning, it is important to keep the constraints of practice in mind. Among the most important of these is the quantification of states and actions. In practice, the sets of states and actions are typically real ($S\subset\mathbb{R}^{m}$, $\mathpzc{A}\subset\mathbb{R}^{n}$), finite ($|S|\in\mathbb{N}$, $|\mathpzc{A}|\in\mathbb{N}$), or a combination thereof.

1.2.1 The Agent

Agents can be represented in a variety of ways, but most modern algorithms represent agents with real function approximators (often neural networks) parameterized by a set of real numbers $\theta$,

$a_{\theta}:\mathbb{R}^{m}\rightarrow\mathbb{R}^{n}.$ (1.2.1)

This is true even in cases where the sets of states or actions are not real. To reconcile an approximator’s domain and range with such sets, the approximator is fit with an input function $I$ and an output function $O$, which mediate the interaction between a process and a function approximator:

$I:\bigcup_{t\in\mathbb{N}}\Phi_{t}^{\prime}\rightarrow\mathbb{R}^{m},$ (1.2.2)

$O:\mathbb{R}^{n}\rightarrow\mathpzc{A}.$ (1.2.3)

When composed with these functions, the approximator forms an agent:

$a=O\circ a_{\theta}\circ I.$ (1.2.4)
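The composition in (1.2.4) can be sketched directly. The following minimal example assumes a process whose states are strings and whose actions form the finite set {"left", "right"}; the tiny linear "approximator" and the particular $I$ and $O$ are illustrative stand-ins for a real parameterization.

def make_agent(I, a_theta, O):
    """Return a = O o a_theta o I, mapping truncated prime paths to actions."""
    return lambda phi_prime: O(a_theta(I(phi_prime)))

def I(phi_prime):                  # encode only the most recent state as a real vector
    s = phi_prime[-1]
    return [1.0, 0.0] if s == "red" else [0.0, 1.0]

theta = [[0.2, 1.3], [0.9, -0.4]]  # a 2x2 weight matrix standing in for a neural network
def a_theta(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in theta]

def O(y):                          # greedy output: pick the action of greatest magnitude
    return ["left", "right"][max(range(len(y)), key=lambda i: abs(y[i]))]

agent = make_agent(I, a_theta, O)
print(agent(["green", "red"]))     # the action chosen in response to this prime path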

1.2.2 The Output Function

The choice of output function is important and requires consideration of several aspects of the process, including the learning process and the action set $\mathpzc{A}$. One of the most consequential roles of an output function $O$ occurs in processes where $|\mathpzc{A}|$ is finite. In such processes, the output function must map a real vector to a discrete action.

Some learning algorithms, such as $\mathcal{Q}$-Learning, invoke the action-value function $\mathcal{Q}$ (see Definition 2.2) in their learning process, training an agent to estimate the expected reward of each action at each state, given an agent $a$ [9, 81]. Because the object of approximation in $\mathcal{Q}$-Learning is defined analytically, some output functions are less sensible than others. Other algorithms, like policy gradients, have outputs with less fixed meanings. Consequently, they can accommodate a variety of output functions, and the output function has a substantial effect on the agent itself. Let us consider some of the most common output functions. For further discussion of output functions, see Appendix A.1.

Definition 1.12 (Greedy Action Sampling).

In Greedy Action Sampling, the function approximator’s output is taken to indicate a single “greedy” action, and this greedy action is taken [1]. In problems with finite sets of actions, the range of the parameterized agent is real ($a_{\theta}(\phi_{t}^{\prime})\in\mathbb{R}^{|\mathpzc{A}|}$) and the action associated with the dimension of greatest magnitude (the greedy action) is taken.

Definition 1.13 ($\varepsilon$-Greedy Sampling).

In $\varepsilon$-Greedy Sampling, $\varepsilon\in[0,1]$, the greedy action is taken with probability $1-\varepsilon$. Otherwise, an action is drawn from the uniform distribution on $\mathpzc{A}$ [1].

Unlike the greedy sampling methods above, Thompson sampling requires a finite set of actions.

Definition 1.14 (Thompson Sampling).

In Thompson Sampling, the range of the parametrized agent $a_{\theta}$ is $\{x\in\mathbb{R}^{|\mathpzc{A}|}\,|\,\sum_{i\leq|\mathpzc{A}|}x_{i}=1,\ x_{i}>0\}$, and the action is selected by drawing from the random variable which gives action $\mathpzc{a}_{i}$ probability $x_{i}$ [10].
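A minimal sketch of the three output functions of Definitions 1.12 through 1.14 for a finite action set follows. The softmax used to map a raw approximator output onto the simplex before Thompson sampling is an assumption of the example, as are the action names.

import math, random

rng = random.Random(0)
ACTIONS = ["up", "down", "left", "right"]

def greedy(y):
    # Take the action whose output dimension has the greatest magnitude.
    return ACTIONS[max(range(len(y)), key=lambda i: abs(y[i]))]

def epsilon_greedy(y, eps=0.1):
    # With probability 1 - eps take the greedy action, otherwise a uniform action.
    return rng.choice(ACTIONS) if rng.random() < eps else greedy(y)

def thompson(y):
    # Interpret the softmax-normalized output as probabilities and sample an action.
    z = [math.exp(v) for v in y]
    probs = [v / sum(z) for v in z]
    return rng.choices(ACTIONS, weights=probs)[0]

y = [0.5, -1.2, 2.0, 0.1]   # an illustrative approximator output
print(greedy(y), epsilon_greedy(y), thompson(y))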

1.3 Learning Algorithms

We are concerned primarily with online learning algorithms, in which exploration is most sensible. Whereas most Dynamic Programming-based methods study only a single agent in each epoch of training, we consider a broader class of learning algorithms:

Definition 1.15 (Learning Algorithm).

A learning algorithm (sometimes optimizer) is an algorithm which generates sets of candidate agents $A_{n}$, observes their interactions with a process, and uses this information to generate a set of candidates $A_{n+1}$ in the next epoch. This procedure is repeated for each epoch $n\in I$ to improve the greatest expected reward among considered agents,

$\sup_{n\in I}\sup_{a\in A_{n}}J(a).$ (1.3.1)

Remark 1.4 (Loci).

Many learning algorithms have an additional property which can simplify their discussion: they center generations of agents $A_{n}$ around a locus agent, or locus of optimization, denoted $a_{\text{locus}}$.

Not every interesting algorithm falls within this definition (e.g. evolutionary methods with multiple loci), but many of the discussions in this paper apply, mutatis mutandis, to a broader class of methods.
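The following sketch shows the shape of a locus-centered learning algorithm in the sense of Definition 1.15 and Remark 1.4: each epoch samples a candidate set $A_{n}$ around the locus, evaluates the candidates, and updates the locus. The Gaussian perturbation, the stand-in reward $J$, and all parameters are illustrative assumptions.

import random

rng = random.Random(0)

def J(theta):                      # stand-in for the expected reward of agent a_theta
    return -(theta - 3.0) ** 2

def learn(locus=0.0, epochs=50, pop=20, radius=0.5):
    best = (J(locus), locus)
    for _ in range(epochs):
        # Generate the candidate set A_n around the locus.
        candidates = [locus + rng.gauss(0.0, radius) for _ in range(pop)]
        scored = [(J(c), c) for c in candidates]
        best = max(best, max(scored))
        # Move the locus to the best candidate of this epoch (exploitative update).
        locus = max(scored)[1]
    return best

print(learn())  # approaches the optimum J = 0 at theta = 3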

1.4 Interpreting the Reinforcement Learning Problem

Many learning algorithms are influenced by a philosophy of optimization called Dynamic Programming (DP) [5]. Dynamic Programming approaches control problems state by state, determining the best action for each state according to an approximation of the state-action value function $\mathcal{Q}^{a}(s,\mathpzc{a})$ (see Definition 2.2). DP has compelling guarantees in finite (footnote 4: An MDP is called finite if $|\mathpzc{A}|<\aleph_{0}$ and $|S|<\aleph_{0}$.) strictly Markov decision processes [11]. However, because DP methods require that every action value be approximated during each step of optimization [8, 12], these guarantees cannot transfer to infinite problems.

Although DP is dominant in Reinforcement Learning [1, 73], several trends indicate that it may not be required for effective learning. Broadly, the problems of interest in Reinforcement Learning have shifted substantially from those Bellman considered in the early 1950s [9]. Among other changes, the field now studies many infinite problems, agents are not generally tabular, and the problems of interest are not in general Markov. Each of these changes independently prevents a problem from satisfying the requirements of Dynamic Programming [12]. Simultaneously, approaches not based on DP, like Evolution Strategies [13], have shown that DP methods are not necessary to achieve performance on par with modern Dynamic Programming-based methods in common Reinforcement Learning problems. Taken together, these trends indicate that DP may not be uniquely effective.

This paper considers exploration in the light of these developments. We distinguish DP methods like $\mathcal{Q}$-Learning, which treat Reinforcement Learning problems as collections of subproblems, one for each state, from methods like Evolution Strategies [13], which treat Reinforcement Learning problems as unitary. We call the former class dynamic, and the latter class naïve.

2 Exploitation and its discontents

The study of Reinforcement Learning employs several abstractions to describe and compare the dynamics of learning algorithms which may have little in common. One particularly important concept is a division between two supposed tasks of learning: exploitation and exploration. Exploitation designates behavior which targets greater reward in the short term and exploration designates behavior which seeks to improve the learner’s understanding of the process. Frequently, these tasks conflict; learning more (exploration) can exclude experimenting with agents which are estimated to obtain high reward (exploitation) and vice versa. The problem of balancing these tasks is known as the Exploration versus Exploitation problem. Let us consider exploitation, exploration, and their conflicts.

2.1 Exploitation

Exploitation can be defined in a version of the Reinforcement Learning problem in which the goal is to maximize the cumulative total of rewards obtained by all sampled agents: let $\phi^{a}$ be a path sampled from $(\mathcal{P},a)$,

$\sum_{a\in\left(\bigcup_{n\in I}A_{n}\right)}R(\phi^{a}).$ (2.1.1)

Definition 2.1 (Exploitation).

A learning method is exploitative if its purpose is to increase the reward accumulated by the agents considered in the epochs $I$.

Fortunately, this definition of exploitation remains relevant to our problem because it informs the cumulative version of the problem. Exploitation advises the dispatch of agents which are expected to be high-quality, rather than those whose behavior and quality are less certain. In effect, exploitation is conservative, preferring “safer” experimentation and incremental improvement to potentially destructive exploration. While exploitation is crucial to reinforcement learning, explorative experiments are often necessary because sometimes, exploitation alone fails.

2.2 Exploitation Fails

When exploitation fails, it is because its conservatism causes it to become stuck in local optima. Because Reinforcement Learning methods change the locus agent gradually, and exploitative methods typically generate sets of agents $A_{n}$ near the locus, exploitative learners can become trapped in local optima. Figure 3 provides a unidimensional representation of this problem: because the agents $a\in A_{n}$ fall within the sampling radius, and updates to the locus agent are generally interpolative, the product of the learning process depends exclusively on the initial locus and the sampling distance.

Figure 3: A depiction of the reward $J(x)$ of agents parameterized by a real number $x$. Even though c has higher expected reward, it is isolated from the locus. Within the sampling radius, the trend is clear: lower values of $x$ produce greater reward. Thus, a purely exploitative method will end at b.

2.2.1 The Policy Improvement Theorem

No discussion of local optima in Reinforcement Learning would be complete without a discussion of the Policy Improvement Theorem [12, 8]. Let us begin with a definition:

Definition 2.2 (Action Value Function (𝒬\mathcal{Q})).

Let $\mathcal{P}$ be a strictly Markov decision process (Remark 1.3). Then, the Action Value Function is the expectation of reward (footnote 5: Reward is assumed to be summable (1.1.118), and $\omega(t)$ is assumed to be exponential, $\omega(t)=\gamma^{t}$, $0\leq\gamma<1$.), starting at state $s$, taking an action $\mathpzc{a}$, and continuing with an agent $a$ [8]:

$\mathcal{Q}^{a}(s,\mathpzc{a})=\mathbb{E}\left[R(\phi)\;\middle|\;s_{0}=s,\ \mathpzc{a}_{0}=\mathpzc{a},\ \forall t\in\mathbb{Z}^{+}\,(\mathpzc{a}_{t}\sim a(s_{t-1}))\right].$ (2.2.1)

Theorem 2.1 (Policy Improvement Theorem).

For any agent $a$ in a finite, discounted, bounded-reward strictly Markov decision process, either there is an agent $a^{\prime}$ such that

$\forall s\in S\big(\mathcal{Q}^{a}(s,a^{\prime}(s))\geq\mathcal{Q}^{a}(s,a(s))\big)\ \land\ \exists s\in S\left(\mathcal{Q}^{a}(s,a^{\prime}(s))>\mathcal{Q}^{a}(s,a(s))\right)$ (2.2.2)

or $a$ is an optimal agent.

This theorem also holds if this condition is appended:

$\exists!\,s\in S\,(a^{\prime}(s)\neq a(s)).$ (2.2.3)

With this condition, Theorem 2.1 shows that every agent $a$ either has a superior neighbor $a^{\prime}$ (an agent which differs from $a$ in its response to exactly one state), or it is optimal.

This means that any finite strictly Markov decision process has a discrete “convexity”; every imperfect agent has a superior neighbor. Unfortunately, many of the problems of interest in modern Reinforcement Learning are not finite or, if finite, are too large for the Policy Improvement Algorithm, which exploits this theorem, to be tractable [8, 54].
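A minimal sketch of the neighbor-improvement step licensed by Theorem 2.1 with condition (2.2.3) follows, in a tiny finite, discounted, strictly Markov process. The two-state process, its numbers, and the iterative evaluation routine are illustrative assumptions, not the Policy Improvement Algorithm of the cited texts.

GAMMA = 0.9
S, A = [0, 1], [0, 1]    # here S is the state set and A is the (finite) action set
# P[s][a] is the distribution over next states; r[s][a] is the immediate reward.
P = {0: {0: [1.0, 0.0], 1: [0.2, 0.8]},
     1: {0: [0.7, 0.3], 1: [0.0, 1.0]}}
r = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.5, 1: 2.0}}

def evaluate(agent, sweeps=500):
    """Iterative policy evaluation of Q^a for a strictly Markov agent (a dict s -> a)."""
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(sweeps):
        Q = {s: {a: r[s][a] + GAMMA * sum(P[s][a][s2] * Q[s2][agent[s2]] for s2 in S)
                 for a in A} for s in S}
    return Q

def improve_one_state(agent):
    """Return a superior neighbor (differing at exactly one state) if one exists."""
    Q = evaluate(agent)
    for s in S:
        best = max(A, key=lambda a: Q[s][a])
        if Q[s][best] > Q[s][agent[s]]:
            return {**agent, s: best}
    return None  # no superior neighbor: the agent is optimal (Theorem 2.1)

agent = {0: 0, 1: 0}
while (better := improve_one_state(agent)) is not None:
    agent = better
print(agent)  # converges to the optimal strictly Markov agent, here {0: 1, 1: 1}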

2.3 The Assumptions of Dynamic Programming

Modern Reinforcement Learning confronts many problems which do not satisfy the assumptions of the Policy Improvement Theorem. Many modern problems fail directly, using infinite decision processes. Many others technically satisfy the requirements of the theorem, but are so large as to render the guarantees of policy improvement notional with modern computational techniques. Still others fail the qualitative conditions, for example, by not being strictly Markov.

The calculations necessary to guarantee that one policy is an improvement upon another require that the learner have knowledge of each state and action in the process, as well as of the dynamics of the decision process. While the excessive size of $S$ or $\mathpzc{A}$ is easiest to exhibit, the interplay of the sizes of these sets tends to be the true source of the problem. Difficulties caused by this interplay are known as the curse of dimensionality, a term which Bellman used to describe the way that problems with many aspects can be more difficult than the sum of their parts [6, ix]. For example, the size of the set of agents grows exponentially in the number of states and polynomially in the number of actions:

$|A|=|\mathpzc{A}|^{|S|}.$ (2.3.1)

In large problems, the curse of dimensionality makes the guarantees of Dynamic Programming almost impossible to achieve in practice. Curiously, the performance of DP methods seems to degrade slowly, even as the information necessary for the guarantees of DP quickly becomes unachievable.

2.4 Incomplete Information and Dynamic Programming

One way to address the size of a Reinforcement Learning problem is to collect the information necessary for Dynamic Programming more efficiently. Dynamic Programming methods which approximate $\mathcal{Q}^{a}$ rely on several types of information, all based in the superiority condition:

$\forall s\in S\big(\mathcal{Q}^{a}(s,a^{\prime}(s))\geq\mathcal{Q}^{a}(s,a(s))\big).$ (2.4.1)

$\mathcal{Q}$-based methods rely on the learner’s ability to approximate the $\mathcal{Q}^{a}$-function. There are two ways to approach this problem: $\mathcal{Q}^{a}$ can be approximated directly (i.e. separately for each state), or it can be calculated from knowledge or approximations of several aspects of the process:

  1. The set of states, $S$,

  2. The set of actions, $\mathpzc{A}$,

  3. The immediate reward function, $r$,

  4. The state-transition function, $\sigma$, and

  5. The agent $a$.

Calculating the $\mathcal{Q}^{a}$ function is, in a sense, more efficient: if the quantities above are known to an acceptable degree of precision, then a consistent action-value function can be imputed without repeated sampling.

The complexity of this operation is a function of $|S|\times|\mathpzc{A}|$. In a sense, this method is inexpensive; the number of state-action pairs is much smaller than the number of possible sample paths, $|\Phi_{t}|=|S\times\mathpzc{A}|^{t}>|S\times\mathpzc{A}|$. It is also typically much smaller than the number of agents, $|\mathpzc{A}|^{|S|}$.

This kind of approximation is fundamental to Dynamic Programming methods, and serves as the basis for many exploration methods. In general, Reinforcement Learning algorithms are assumed to have full knowledge of $\mathpzc{A}$ and $a$, but they may not have complete information about $S$, $r$, or $\sigma$. Thus, exploration has sometimes been defined as pursuing experiences of state-action pairs $(s,\mathpzc{a})$, specifically the quadruplets which can be associated with them in a path, $(s_{t},\mathpzc{a},r(s_{t},\mathpzc{a}),s_{t+1})$. If enough of these are collected, it is possible to approximate the expected reward of each state-action pair, as well as the transition probabilities $\sigma(s,\mathpzc{a})$.
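As a small illustration of this formulation, the following sketch estimates immediate rewards and transition probabilities from a collection of observed quadruplets $(s_{t},\mathpzc{a},r(s_{t},\mathpzc{a}),s_{t+1})$ by simple counting; the data and names are illustrative assumptions.

from collections import defaultdict

def estimate_model(quadruplets):
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visit count}
    reward_sum = defaultdict(float)                  # (s, a) -> summed observed reward
    for s, a, rew, s_next in quadruplets:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += rew
    sigma_hat, r_hat = {}, {}
    for sa, nxt in counts.items():
        n = sum(nxt.values())
        sigma_hat[sa] = {s2: c / n for s2, c in nxt.items()}   # empirical sigma(s, a)
        r_hat[sa] = reward_sum[sa] / n                         # empirical r(s, a)
    return sigma_hat, r_hat

data = [("s0", "go", 1.0, "s1"), ("s0", "go", 1.0, "s0"), ("s0", "go", 1.0, "s1")]
print(estimate_model(data))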

Importantly, because these quadruplets can be induced by particular actions (i.e., $(s,\mathpzc{a})$ can be generated by taking the action $\mathpzc{a}$ at a time when $s$ is visited), the task of collecting the information about these quadruplets can be reduced to a simple two-step formula: first, visit each state $s\in S$, and second, attempt each action in that state. Under the right circumstances, proceeding in this way results in experience of every state-action pair, $\{(s,\mathpzc{a})\,|\,s\in S,\ \mathpzc{a}\in\mathpzc{A}\}$.

This simple notion of exploration is both efficient and effective in some circumstances: if $S$ is small and all of its states may be easily visited, and $\mathpzc{A}$ is small, then it is easy to consider every possible state-action pair. After enough sampling, this allows the $\mathcal{Q}$-function to be approximated. Outside of problems which satisfy those conditions, it is natural to consider other notions of exploration. Sometimes, these definitions take a more descriptive form, for example in Thrun: “exploration seeks to minimize learning time” [14]. Even without sufficient information about the process to guarantee policy improvement, Dynamic Programming methods have performed admirably [3, 4].

These results could be seen as a testament to the efficacy of Dynamic Programming under non-ideal circumstances. However, there is a curious countervailing trend: in many of these problems, simple or black-box methods such as Evolution Strategies [13] (an implementation of finite-differences gradient approximation) have been able to match the performance of modern Dynamic Programming methods. This conflicts with the present theory in two ways: first, it challenges the idea that Dynamic Programming is uniquely efficient, or uniquely suited to Reinforcement Learning problems. Second, because these methods are not dynamic, they lack the usual information requirements which exploration is supposed to resolve, yet they are useless without exploration (as seen in Figure 3). If information about state-action pairs is not directly employed, what is the role of exploration?

In the next section, we begin by describing the properties of the methods of Subsection 1.2.2 which guarantee that, eventually, every state-action pair will be experienced. We then discuss methods which address circumstances where certain states are difficult to handle.

3 Exploration and contentment

Dynamic exploration can be divided into two categories: directed exploration and undirected exploration [14]. Undirected methods explore by using random output functions like $\varepsilon$-Greedy Sampling and Thompson Sampling (see Subsection 1.2.2) to experience paths which would be impossible with the corresponding greedy agent. These are called undirected because the changes which the output functions make to the greedy version of the agent are not intended to cause the agent to visit particular states. Instead, these methods explore through a sort of serendipity.

Directed methods have specific goals and mechanisms more narrowly tailored to the optimization paradigm they support. Some directed methods, like #Exploration [15] (see Definition A.5), seek to experience particular state-action pairs by directing the learner, through exploration bonuses, to consider agents which lead to state-action pairs which have been visited less in the learning history. In order to apply the state-action pair formulation of exploration to large and even infinite problems, #Exploration employs a hashing function to simplify and discretize the set of states.

Other directed exploration methods, like that of Stadie et al. [16] (see Definition A.6), employ exploration bonuses to incentivize agents which visit states which are poorly understood. Stadie et al. begin by modeling the state-transition function $\sigma$ (in a strictly Markov process) with a function $\mathcal{M}$. In each step of the process, the model $\mathcal{M}(s_{t},\mathpzc{a}_{t})$ guesses the next state $s_{t+1}$. After guessing, the model is trained on the transition from $(s_{t},\mathpzc{a}_{t})$ to $s_{t+1}$. When the model is less accurate (i.e. when the distance (footnote 6: Stadie et al. assume that $S$ is a metric space.) between $s_{t+1}$ and $\mathcal{M}(s_{t},\mathpzc{a}_{t})$ is large), Stadie et al. reason, there is more to be learned about $s_{t}$, and their method assigns exploration bonuses to encourage agents which visit $s_{t}$. Unfortunately, none of these methods can recover the guarantees of Dynamic Programming in problems which do not satisfy the requirements of the theory of Dynamic Programming. More detailed descriptions of these methods may be found in Appendix A.
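The following sketch is written in the spirit of the prediction-error bonus described above; it is not the implementation of Stadie et al. [16]. A model predicts the next state, is trained on the observed transition, and the size of its error is added to the immediate reward as an exploration bonus. The linear model, the omission of the action from the model's input, the learning rate, and the bonus scale are all illustrative assumptions.

class TransitionModel:
    def __init__(self, dim, lr=0.1):
        self.w = [[0.0] * dim for _ in range(dim)]   # predicts s_{t+1} from s_t (action omitted for brevity)
        self.lr = lr

    def predict(self, s):
        return [sum(wi * si for wi, si in zip(row, s)) for row in self.w]

    def update(self, s, s_next):
        """Train on the observed transition and return the prediction error."""
        pred = self.predict(s)
        err = [p - t for p, t in zip(pred, s_next)]
        for i, row in enumerate(self.w):
            for j in range(len(row)):
                row[j] -= self.lr * err[i] * s[j]    # gradient step on squared error
        return sum(e * e for e in err) ** 0.5        # distance between prediction and s_{t+1}

def bonused_reward(rew, model, s, s_next, beta=0.5):
    return rew + beta * model.update(s, s_next)      # poorly modeled states earn larger bonuses

model = TransitionModel(dim=2)
print(bonused_reward(1.0, model, [1.0, 0.0], [0.0, 1.0]))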

3.1 The Essence of Exploration

The brief survey above and in Appendix A is far from complete, but it contains the core strains of most modern exploration methods. A more thorough discussion of exploration may be found in [14] or [17]. In spite of its incompleteness, our survey suffices for us to reason generally about exploration, and about its greatest mystery: why do methods which were developed to collect exactly the information necessary for dynamic programming appear to help naïve methods succeed?

Because naïve methods do not make use of action-values, nor do they collect information about particular state-action pairs, the dynamic motivations of these exploration methods cannot explain their efficacy when paired with naïve methods of exploitation. Instead, there must be something about the process of exploration itself which aids naïve methods. That poses a further challenge to the dynamic paradigm: if exploration is effective in naïve methods for non-dynamic reasons, to what extent do those reasons contribute to their effect in dynamic methods?

These exploration methods seem to share little beyond their motivations. One other thing which they share, and which they by necessity share with every reinforcement learning algorithm, is that their mechanism is, ultimately, aiding in the selection of the next set of agents $A_{n}$. Undirected methods accomplish this by selecting an output function, and directed methods go slightly further in influencing the parameters of the agents, but this is their shared fundamental mechanism.

What differentiates exploration methods from other reinforcement learning methods is that they influence the selection of agents not to improve reward in the next epoch, as is standard in exploitation methods, but to collect a more diverse range of information about the process. It is the combination of this mechanism and purpose which makes a method explorative:

Definition 3.1 (Exploration).

A reinforcement learning method is explorative if it influences the agents $a\in A$ for the purpose of information collection.

We now know two things about exploration: in Section 2 we established that exploration was a process which sought additional information about the process. We have now added that exploration is accomplished by changing the agents which the learner considers. This, however, does not resolve our question: under the dynamic programming paradigm, these changes are made so as to collect the information necessary to calculate action-values. What is the information which naïve methods require, and how does dynamic exploration collect it? To what extent does that other information contribute to the effectiveness of those methods in dynamic programming?

To continue our study of exploration in naïve learning, we begin in the next section with a discussion of Novelty Search, an algorithm which uses a practitioner-defined behavior function $B$ to explore behavior spaces. In Section 5, we describe a general substrate for naïve exploration which is general enough to contain other exploration substrates, and is equipped with a useful topological structure.

4 Naïve Exploration

One of the most prominent examples of naïve exploration is an algorithm called Novelty Search [7]. In contrast to the other methods which we discuss in this work, its creators do not describe it as a learning algorithm. Instead, they call novelty search an “open-ended” method. Nonetheless, methods which incorporate Novelty Search can usually be analyzed as learning algorithms, since they typically satisfy Definition 1.15, with the possible exception of the “purpose” of the method.

4.1 Novelty Search

Novelty Search is an algorithmic component of many learning algorithms which was introduced by Joel Lehman and Kenneth O. Stanley [7, 18]. Unlike other learning methods, Novelty Search works to encourage novel, rather than high-reward agents.

Definition 4.1 (Novelty Search).

Novelty Search is a component which can be incorporated into many learning algorithms; it defines the behavior $B(a)$ and novelty $N(a)$ of the agents $a\in A_{n}$ which the learning algorithm considers.

Because Novelty Search does not specify an optimizer, the details of implementation can vary, but the “search” in Novelty Search refers to the way that Novelty Search methods seek agents with higher novelty scores $N(a)$. These scores are based upon the scarcity of an agent’s behavior $B(a)$ within a behavior archive $\chi$ (4.1.2).

Definition 4.2 (Behavior).

The behavior of an agent is the image of that agent (footnote 7: In practice, many behavior functions are functions of a sampled path of the agent, rather than the agent itself.) under a behavior function (or behavior characterization)

$B:A\rightarrow X;$ (4.1.1)

a function from the set of agents $A$ into a space of behaviors $X$ equipped with a distance $d_{X}$. (footnote 8: The literature on Novelty Search is most sensible when the range of the behavior function is assumed to be a metric space, and Novelty Search is usually discussed under that pretense. However, the “novelty metrics” employed in Abandoning Objectives are not metrics in the mathematical sense (see Definition 5.1). Instead, they are squared Euclidean distances, which are symmetric but do not satisfy the triangle inequality [19]. We use the word distance to refer to a broader class of functions, and use the word metric and its derivatives in the formal sense.)

The behavior archive $\chi$ in Novelty Search is a subset of the behaviors which have been observed in the learning process,

$\chi\subset\bigcup_{m\leq n}\{B(a)\,|\,a\in A_{m}\}.$ (4.1.2)

In general, the archive is meant to summarize the behaviors which have been observed so far using as few representative behaviors as possible, so as to minimize computational requirements.

Definition 4.3.

The novelty $N$ of a behavior $B(a)$ is a measure of the sparsity of the behavior archive around $B(a)$. In Abandoning Objectives, Lehman and Stanley use the average distance (see footnote 8) from that behavior to its $k$ nearest neighbors in the archive, $K\subset\chi$:

$N(B(a))=\frac{1}{k}\sum_{B(c)\in K}d_{X}(B(a),B(c)).$ (4.1.3)
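A minimal sketch of the novelty score (4.1.3) follows: the average distance from a behavior to its $k$ nearest neighbors in the archive. The Euclidean behavior space and the archive contents are illustrative assumptions.

import math

def novelty(behavior, archive, k=3):
    dists = sorted(math.dist(behavior, b) for b in archive)
    neighbors = dists[:k]                     # the k nearest neighbors in the archive
    return sum(neighbors) / len(neighbors)

archive = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0)]
print(novelty((0.0, 0.1), archive))   # small: the archive is dense near this behavior
print(novelty((4.0, 4.0), archive))   # large: the archive is sparse near this behavior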

Novelty Search reveals something about naïve methods as a class: because they do not operate under the dynamic paradigm, individually manipulating the ways that agents respond to each situation possible in the process, they must employ another structure to understand the agents which they consider, and to determine $A_{n+1}$. For this purpose, Novelty Search uses the structure of the chosen behavior space $X$. Let us consider the structures other naïve methods might use.

In the case of exploitation, a simple structure is available: reward. The purpose of exploitation is to improve reward, so reward is the relevant structure. Under the dynamic framework, reward is decomposed into the immediate rewards $r(s,\mathpzc{a})$, and agents are specified in relation to these. In the naïve framework, that decomposition is not available, so only the coarser reward of an agent, $R(a)$, can be used. In some problems, other structures may correlate with reward, but these correlations can be inverted by changes to the state-transition function or reward function, so the only a priori justifiable structure for exploitation is reward.

Exploration is more complex. In Dynamic Programming, the information necessary to solve a Reinforcement Learning problem is well defined, but in the naïve framework, there is not a general notion of information “sufficient” to solve a problem (except for exhaustion of the set of agents). That is, naïve exploration does not have a natural definition in the same way as naïve exploitation or dynamic exploration. Let us then consider a definition of exclusion:

Definition 4.4 (Naïve Exploration).

A learning method is a method of naïve exploration if

  i. The method itself is naïve, and

  ii. The set of agents is explored using a structure that is not induced by the expected reward.

Under this definition, Novelty Search is clearly a method of naïve exploration. Let us consider it further. Rather than treat it as a unique algorithm, we can consider Novelty Search to be a family of exploration methods, each characterized by the way that it projects the set of agents $A$ into a space of behaviors $X$.

Novelty Search can be analyzed with respect to several goals: it may be viewed as an explorative method, or, when paired with a method of optimization, it may be seen as an open-ended or learning method in and of itself. Unfortunately, the capacity of Novelty Search to accomplish any of these goals is compromised by the subjectivity of the behavior function. Because the behavior function relies on human input, the exploration which is undertaken, the diversity which Novelty Search achieves, and the reward at the end of a learning process involving Novelty Search all rely on the beliefs of the practitioner. Instead of viewing this subjectivity as a problem, Lehman and Stanley embrace it, suggesting that behavior functions must be determined manually for each problem, writing:

There are many potential ways to measure novelty by analyzing and quantifying behaviors to characterize their differences. Importantly, like the fitness function, this measure must be fitted to the domain. -Lehman and Stanley [20]

If followed, this advice would make it virtually impossible to disentangle Novelty Search as an algorithm from either the problems to which it is applied or the practitioners applying it. Fortunately, some authors have rejected this suggestion, pursuing more general notions of behavior.

We now present a brief overview of some of the behavior functions in the literature, including those described in Lehman and Stanley’s pioneering Novelty Search papers [7, 18]. Appendix B presents a more detailed discussion of these functions as well as some other behavior functions which could not be included in this summary.

In their flagship paper on Novelty Search, Lehman and Stanley [7] consider as their primary example problem a two-dimensional maze. As a secondary example, they take a similar navigation problem in three dimensions. Importantly for our discussion, both environments admit a concept of the agent’s position. Lehman and Stanley introduce two behavior functions in their study of Novelty Search, both of which are functions of the position of the agent throughout a sampled truncated path $\phi_{t}^{a}\sim\Phi^{a}$.

The simpler of these behavior functions takes the agent’s final position (i.e. the position of the agent when the path is truncated), and the more complex behavior function takes as behavior a list of positions throughout the path, taken at temporal intervals. These functions provide insight into Lehman and Stanley’s intuitions about behavior: to them, behavior relates to the state of the process, rather than to the agent or its actions. Because the state of the process depends upon the interaction of the process and the agent, these definitions assure that behavior reflects the interaction between the process and the agent.

This is important. Notice that under the function-approximation framework (Subsection 1.2.2), any function with an appropriate range and domain could be treated as an agent. As a result, if behavior were taken to be a matter of the function approximator alone, one would be forced to accept the premise that the structure of the set of agents should be identical for any pair of processes with the same sets of states and actions, identical even when the state-transition functions differ. In other words, all three-dimensional navigation tasks would have the same space of agents. This issue certainly suffices to explain Lehman and Stanley’s attitude toward behavior functions. It does not, however, necessitate that approach.

Other authors have considered near-totally general notions of behavior. One group of these focuses on collating as much information as possible about every point in a path. In the case of Gomez et al. [21], this involves concatenating some number of observed states. Conti et al. [22] take a similar approach, replacing the states of the decision process with RAM states, the version of the state of the decision process stored by the computer.

Another class of general notions of behavior focuses instead on the actions of the agent itself, as viewed across a subset of $S$. We call this class of functions Primitive Behavior.

4.2 Primitive Behavior

Primitive behavior functions define the behavior of a strictly Markov agent $a$ as the restriction of the agent to a subset of $S$:

Definition 4.5 (Primitive Behavior).

A behavior function $B:A\rightarrow M$ is said to be primitive if it is a collection of an agent’s actions in response to a finite set of states $X\subset S$; $B$ is primitive iff

$B(a)=\{(s,a(s))\,|\,s\in X\},$ and (4.2.1)

the distance on the set of behaviors is given by a weighted (by $w_{s}$) sum of distances between the actions of the agents on $X$ (footnote 9: This assumes that the set of actions $\mathpzc{A}$ is a metric space with $d_{\mathpzc{A}}$.):

$d(B(a),B(b))=\sum_{s\in X}w_{s}\,d_{\mathpzc{A}}(a(s),b(s)).$ (4.2.2)
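A minimal sketch of the weighted primitive-behavior distance (4.2.2) for strictly Markov agents over a finite state set $X$ follows, assuming a metric $d_{\mathpzc{A}}$ on a real action set; the agents, weights, and states are illustrative.

def primitive_distance(a, b, X, w, d_A):
    """d(B(a), B(b)) = sum over s in X of w_s * d_A(a(s), b(s))."""
    return sum(w[s] * d_A(a(s), b(s)) for s in X)

X = ["s0", "s1", "s2"]
w = {"s0": 1.0, "s1": 0.5, "s2": 0.25}
d_A = lambda x, y: abs(x - y)          # actions are real numbers here
a = lambda s: {"s0": 0.0, "s1": 1.0, "s2": 1.0}[s]
b = lambda s: {"s0": 0.0, "s1": 0.0, "s2": 2.0}[s]
print(primitive_distance(a, b, X, w, d_A))   # 1.0*0 + 0.5*1 + 0.25*1 = 0.75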

This notion of behavior, with slight modifications, has appeared in several papers in the Reinforcement Learning literature [23, 24, 25, 26]. At least one existing work uses this notion of behavior in Novelty Search [23]. Another [24] uses it for optimization with an algorithm other than Novelty Search. [23, 25, 26] weight the constituent distances (i.e. $w_{s}$ is not constant), and [25] uses primitive behavior to study the relationship between behavior and reward.

A number of important properties of primitive behavior have been described. For example, [24] notes that agents with the same behavior may have different parameters. [23] takes implicit advantage of the fact that agents which do not encounter a state do not meaningfully have a response to it (a fact we address in Item iib of Section 5), and [25] considers the states which an agent encounters an important aspect of the agent, using them to create equivalence classes of agents.

We call this notion of behavior primitive because it is the simplest notion of behavior which completely describes the interaction between the agent and the process [on $X$]. Thus, for an appropriate set $X$, the primitive behavior contains all information relevant to $(\mathcal{P},a)$. So long as a definition of behavior depends only on the interaction between $\mathcal{P}$ and $a$, primitive behavior suffices to determine every other notion of behavior.

Clearly, this does not follow the advice of Lehman and Stanley [7]; it is completely unfitted to the underlying problem. In exchange for this lack of fit, primitive behavior is fully general. Further, because primitive behavior is simply a restriction of the agent itself, every other notion of behavior is downstream of primitive behavior, provided that an appropriate set $X$ is used.

However, because the selection of $X$ and of the weights $w_{s}$ is itself a matter of choice, primitive behavior in general remains somewhat subjective. In finite problems, it is possible to assign to every state a non-zero weight, which produces a sort of objective distance, but this remains problematic; two agents might differ on a state which neither of them ever visits. Should agents $a$ and $b$ which produce identical processes $(\mathcal{P},a)$ and $(\mathcal{P},b)$ really be described as different? We contend in Definition 5.6 that the answer is “no”.

Our task in the next several sections is to resolve this and other issues with primitive behavior. In the next section we approach the matter of a general substrate for exploration from the ground up. We begin with the simplest version of primitive behavior (that associated with a single state) and proceed to a “complete” notion of behavior: a distance between agents which properly discriminates between agents (see Definition 5.6). In Section 6, we demonstrate some properties of this completed space, which we call the agent space.

5 Seeking a Structure for Naïve Exploration

Under Definition 4.4, naïve exploration is a category of exclusion. Any naïve method which is not exploitative, i.e. which does not use the structure induced by an agent’s expected reward, is a method of naïve exploration. Our task in this section is to develop a good structure for exploration. That is, to develop a structure on the space of agents, other than reward, which captures important aspects of the relationships of agents to one another and to the process. We seek a structure which:

  i. Exists in every discrete-time decision process,

  ii. Correctly describes equivalence relations between agents (see Definitions 5.1 and 5.6), i.e.

    (a) Identifies agents which differ in aspects irrelevant to their processes,

    (b) Distinguishes agents which differ in aspects which are relevant to their processes, and

  iii. Naturally describes important relations on the set of agents.

Such a structure would allow us to compare structures which are used in naïve exploration methods, including, for example, the various behavior functions which have been used in Novelty Search. If computationally tractable, such a structure could also provide the basis for a new exploration method, or perhaps even a new definition of exploration. Let us begin by considering one possible kind of structure for this.

5.1 Prototyping the Agent Space

In contending with a generic discrete-time decision process, few assets are available to define the structure of an agent space. At a basic level, there are only two types of interaction between an agent and a process: the generation of a state, in which the decision process acts on the agent ($\mathcal{P}\rightarrow a$), and the action of an agent, in which the agent acts on the decision process ($a\rightarrow\mathcal{P}$). Every other aspect of a decision process may be regarded as a function of those interactions. Let us begin by using these basic interactions to define a metric space, following the behavior functions used in Novelty Search.

Definition 5.1 (Metric).

A metric on a set $X$ is a function

$d:X\times X\rightarrow\mathbb{R}^{+}\cup\{0\}$ (5.1.1)

which satisfies the metric axioms:

  1. $d(x,y)=0\iff x=y$ (Identity of Indiscernibles)

  2. $d(x,y)=d(y,x)$ (Symmetry)

  3. $d(x,y)\leq d(x,z)+d(z,y)$ (Triangle Inequality)

Let us begin with a simple case, comparing strictly Markov agents on their most basic elements: their actions on the process in response to a single state $s$.

5.2 The Distance on $s$

Let $\mathpzc{A}$ be a metric space with metric $d_{\mathpzc{A}}$. Then, we define the distance between agents $a$ and $b$ on a single state $s$:

Definition 5.2 (Distance on $s$).

The distance on $s$ between $a$ and $b$ is the distance of their actions on $s$:

$d_{s}(a,b)=d_{\mathpzc{A}}(a(s),b(s)).$ (5.2.1)

Importantly, this distance is not a metric (see footnote 8). Instead, it is a pseudometric; it cannot distinguish agents which act identically on $s$ but differently on another state $s^{\prime}$.

Definition 5.3 (Pseudometric).

A pseudometric on a set $X$ is a function

$d:X\times X\rightarrow\mathbb{R}^{+}\cup\{0\}$ (5.2.2)

which satisfies the pseudometric axioms

  1. $d(x,y)=0\impliedby x=y$ (Indiscernibility of Identicals)

  2. $d(x,y)=d(y,x)$ (Symmetry)

  3. $d(x,y)\leq d(x,z)+d(z,y)$ (Triangle Inequality)

This distance describes the differences between $a$ and $b$ on the state $s$, but decision processes involve many states, potentially infinitely many. Certainly, the action which agents take in response to a single state does not suffice to explain the differences between agents. Let us begin to resolve this by comparing agents on a finite set of states $X$ ($|X|>1$).

5.3 The Distance on $X$

Having defined the distance between agents on a single state $s$, we can define the distance on a set of states $X$ by summation. Denote the distance between agents on a finite set of states $X$ as $d_{X}(a,b)$.

Definition 5.4 (Distance on $X$).

The distance on $X\subset S$ between $a$ and $b$ is the sum of the distances between $a$ and $b$ on each element of the set:

$d_{X}(a,b)=\sum_{s\in X}d_{s}(a,b)$ (5.3.1)

$=\sum_{s\in X}d_{\mathpzc{A}}(a(s),b(s)).$ (5.3.2)

Depending upon the process and the set $X$ itself, this could contain all of the states in a process (for decision processes with finite sets of states), or a set of important states. We can also extend this notion of distance by weighting the distance at each state $s\in X$ with a weight $w_{s}$, producing the primitive behavior of Subsection 4.2,

$d(B(a),B(b))=\sum_{s\in X}w_{s}\,d_{\mathpzc{A}}(a(s),b(s)).$ (4.2.2)

Consider a special case for $X$: let $X$ be the set of states observed before time $t$ in a path $\phi$:

Definition 5.5 (The distance on $\phi_{t}$).

$d_{\phi_{t}}(a,b)=\sum_{i\leq t}d_{\phi_{i}(s)}(a,b),$ (5.3.3)

which is the distance between $a$ and $b$ over a truncated path. When $\phi$ is drawn from $\Phi^{a}$, $d_{\phi_{t}}$ is especially interesting; it is a description of the way that $b$ would have acted differently from $a$ over a path that $a$ actually encountered, a sort of backseat-driver metric.

The distance on $\phi_{t}$ is powerful in a number of respects. Let us consider the case $d_{\phi_{t}}(a,b)=0$. Clearly, this implies that $a$ and $b$ do not differ at all on this path; presented with the same initial state, they would produce exactly the same truncated path. Unfortunately, the distance on $\phi_{t}$ still fails to satisfactorily distinguish agents; it says nothing about paths in which other states are encountered, or longer paths, or about the stochastic nature of decision processes. Let us state these failings directly so that we may address them:

Remark 5.1 (Three Properties).

  I. The distance on a truncated path $\phi_{t}$ drawn from $\Phi^{a}$ is not reciprocal; it describes how $b$ differs from $a$, but not how $a$ differs from $b$,

  II. This distance ignores the stochasticity of the process; the ways in which $a$ and $b$ differ on $\phi_{t}$ do not necessarily imply anything about the other paths which $a$ experiences,

  III. The distance on a truncated path cannot account for infinite paths; if the agents are not assumed to be strictly Markov, one can easily construct a pair of agents $a$ and $b$ which differ only on longer paths. Even in the strictly Markov case, some states might only be possible after $t$.

5.4 The Role of Time

It is common in Reinforcement Learning to treat problems with infinite time horizons by weighting sums with a discount function $\omega:\mathbb{N}\rightarrow[0,1]$. If the space of actions $\mathpzc{A}$ is bounded (footnote 10: A metric space $\mathpzc{A}$ is bounded iff $\sup\{d_{\mathpzc{A}}(x,y)\,|\,x,y\in\mathpzc{A}\}<\infty$.) and

$\sum_{t\in\mathbb{N}}\omega(t)<\infty,$ (5.4.1)

then we may define the distance between strictly Markov agents $a$ and $b$ on a complete path:

$d_{\phi}(a,b)=\sum_{t\in\mathbb{N}}\omega(t)\,d_{\phi_{t}(s)}(a,b).$ (5.4.2)

Clearly, if $d_{\mathpzc{A}}$ is bounded, then [27, 60]

$d_{\phi}(a,b)\leq\sup_{\mathpzc{a}_{1},\mathpzc{a}_{2}\in\mathpzc{A}}\{d_{\mathpzc{A}}(\mathpzc{a}_{1},\mathpzc{a}_{2})\}\cdot\sum_{t\in\mathbb{N}}\omega(t)<\infty.$ (5.4.3)

Thus, this pseudometric resolves the problem of Item III of Remark 5.1, describing the differences in the action of agents over an infinite path.

Our definition of $d_{\phi}(a,b)$ readily admits a change that allows the distance on a path to be defined for agents which are not strictly Markov. All we must do is remove the symbols “$(s)$”:

$d_{\phi}(a,b)=\sum_{t\in\mathbb{N}}\omega(t)\,d_{\phi_{t}}(a,b).$ (5.4.4)

Let us use the distance on $\phi$ to define a notion of distance which incorporates the stochastic aspects of the interaction between an agent and a process, resolving the problem of Item II of Remark 5.1.

5.5 The Distance at aa

Consider Φa\Phi^{a}, the distribution of paths generated by the interaction of an agent aa with 𝒫\mathcal{P}. Given a discount function ω\omega satisfying (5.4.1), [27, 318]

𝔼ϕΦa[dϕ(b,c)]=\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\left[d_{\phi}(b,c)\right]= 𝔼ϕΦa[tω(t)dϕt(b,c)]\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\left[\sum_{t\in\mathbb{N}}\omega(t)d_{\phi_{t}}(b,c)\right] (5.5.1)
=\displaystyle= 𝔼ϕΦa[limnt<nω(t)dϕt(b,c)]\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\left[\lim_{n\rightarrow\infty}\sum_{t<n}\omega(t)d_{\phi_{t}}(b,c)\right] (5.5.2)
=limn\displaystyle=\lim_{n\rightarrow\infty} 𝔼ϕΦa[t<nω(t)dϕt(b,c)]\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\left[\sum_{t<n}\omega(t)d_{\phi_{t}}(b,c)\right] (5.5.3)
=limnt<n\displaystyle=\lim_{n\rightarrow\infty}\sum_{t<n} 𝔼ϕΦa[ω(t)dϕt(b,c)]\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\,\left[\omega(t)d_{\phi_{t}}(b,c)\right] (5.5.4)

exists and is bounded above. We call this quantity the distance at aa.

Because the expectation integrates the distance on ϕ\phi over all of the paths of aa, dad_{a} compares bb and cc on every part of the process which aa experiences. This guarantees that the stochasticity involved in the interaction of aa and 𝒫\mathcal{P} is considered in the comparison. However, the stochasticity involved in the processes (𝒫,b)(\mathcal{P},b) and (𝒫,c)(\mathcal{P},c) may be more relevant to their comparison than that produced by aa. In the next section, we begin to address this by considering the case a=ba=b.
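In practice this expectation can only be estimated by sampling paths from (𝒫,a)(\mathcal{P},a). The sketch below is a Monte Carlo estimate under the assumptions of the earlier sketch (strictly Markov, stochastic agents; ω(t)=γ^t); sample_states_under_a is a hypothetical rollout helper that returns the states visited on one truncated path drawn from Φa\Phi^{a}, and action_tvd is the helper defined above.

```python
def estimate_distance_at_a(sample_states_under_a, agent_b, agent_c,
                           gamma=0.99, n_paths=64):
    # Monte Carlo estimate of d_a(b, c): average, over paths drawn from
    # (P, a), of the discounted sum of action distances between b and c.
    total = 0.0
    for _ in range(n_paths):
        states = sample_states_under_a()
        total += sum(gamma**t * action_tvd(agent_b(s), agent_c(s))
                     for t, s in enumerate(states))
    return total / n_paths
```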

5.6 da(a,b)d_{a}(a,b)

In order to state our next result, we must introduce agent identity. Agent identity collapses two artificial distinctions between representations of agents caused by the use of function approximators. First, it treats approximators which are differently parameterized but identical as functions (e.g. because of a permutation of the order of parameters, as noted by [24]) as identical. Second, in keeping with the methods of [25], it treats functions which differ only on a set of probability 0 as identical.

Definition 5.6 (Agent Identity (aba\equiv b)).

Let us say that the agents aa and bb are identical as agents (aba\equiv b) if and only if the set of paths where they differ has probability 0 in Φa\Phi^{a};

Φa[{ϕ|t(a(ϕt)b(ϕt))}]\displaystyle\mathbb{P}_{\Phi^{a}}\left[\left\{\phi\,|\operatorname{\exists}t(a(\phi_{t})\neq b(\phi_{t}))\right\}\right] =0.\displaystyle=0. (5.6.1)

That is, aa and bb are identical as agents if and only if the probability of aa encountering a path where they differ is 0.

Remark 5.2 (Agent Identity and da(a,b)d_{a}(a,b)).

Notice that this implies da(a,b)=0d_{a}(a,b)=0.
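As a rough empirical proxy for this definition, one can test sampled paths of aa for any state at which the two agents' action distributions differ. The sketch below is only a Monte Carlo check (it cannot certify probability 0); it reuses the hypothetical rollout helper and the action_tvd helper from the earlier sketches.

```python
def identical_as_agents_on_samples(sample_states_under_a, agent_a, agent_b,
                                   n_paths=128, tol=1e-12):
    # Returns False as soon as a sampled a-path contains a state where the
    # agents' action distributions differ by more than `tol`.
    for _ in range(n_paths):
        for s in sample_states_under_a():
            if action_tvd(agent_a(s), agent_b(s)) > tol:
                return False
    return True
```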

Theorem 5.1 (Identical agents have the same local distance).

If aa and bb are identical as agents, then

da=db.\displaystyle d_{a}=d_{b}. (5.6.2)
Proof by induction.

[Base case:] Suppose da(a,b)=0d_{a}(a,b)=0. Then,

\displaystyle d_{a}(a,b)=\lim_{n\rightarrow\infty}\sum_{t<n}\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\left[\omega(t)d_{\phi_{t}}(a,b)\right]=0 (5.6.3)
\displaystyle\implies t(\displaystyle\operatorname{\forall}t\in\mathbb{N}\,\biggl{(} 𝔼ϕΦa[dϕt(a,b)]=0)\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\left[d_{\phi_{t}}(a,b)\right]=0\biggr{)} (5.6.4)
\displaystyle\implies 𝔼ϕΦa[dϕ0(a,b)]=dσ0(a,b)=0.\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\left[d_{\phi_{0}}(a,b)\right]=d_{\sigma_{0}}(a,b)=0. (5.6.5)

That is, if da(a,b)=0d_{a}(a,b)=0 (aa and bb are identical), then dσ0(a,b)=0d_{\sigma_{0}}(a,b)=0 (they act identically on truncated prime paths of length 0). Then, the joint distributions (Φ0a,a(ϕ0))=(σ0,a(ϕ0))(\Phi_{0}^{\prime a},a(\phi_{0}^{\prime}))=(\sigma_{0},a(\phi_{0}^{\prime})) and (Φ0b,b(ϕ0))=(σ0,b(ϕ0))(\Phi_{0}^{\prime b},b(\phi_{0}^{\prime}))=(\sigma_{0},b(\phi_{0}^{\prime})) are identical.

Thus, the joint distributions of these and the next state, given by σ\sigma, are also identical: the total variation distance of ((Φ0a,a(ϕ0)),σ(Φ0a,a(ϕ0)))=Φ1a((\Phi_{0}^{\prime a},a(\phi_{0}^{\prime})),\sigma(\Phi_{0}^{\prime a},a(\phi_{0}^{\prime})))=\Phi^{\prime a}_{1} and ((Φ0b,b(ϕ0)),σ(Φ0b,b(ϕ0)))=Φ1b((\Phi_{0}^{\prime b},b(\phi_{0}^{\prime})),\sigma(\Phi_{0}^{\prime b},b(\phi_{0}^{\prime})))=\Phi^{\prime b}_{1} is 0;

dΦ0a(a,b)\displaystyle\,d_{\Phi^{\prime a}_{0}}(a,b) =0\displaystyle=0 (5.6.6)
and TVD(Φ0a,Φ0b)\displaystyle\mathop{\text{TVD}}(\Phi^{\prime a}_{0},\Phi^{\prime b}_{0}) (5.6.7)
=\displaystyle\hskip 6.02773pt=\hskip 7.74998pt TVD(σ0,σ0)\displaystyle\mathop{\text{TVD}}(\sigma_{0},\sigma_{0}) =0\displaystyle=0 (5.6.8)
\displaystyle\implies TVD(Φ0a,Φ0b)\displaystyle\mathop{\text{TVD}}(\Phi^{a}_{0},\Phi^{b}_{0}) =0\displaystyle=0 (5.6.9)
\displaystyle\implies TVD(Φ1a,Φ1b)\displaystyle\mathop{\text{TVD}}(\Phi^{\prime a}_{1},\Phi^{\prime b}_{1})\, =0.\displaystyle=0. (5.6.10)

The distribution Φ1a\Phi^{a\prime}_{1} determines the component of dad_{a} at t=1t=1. By (5.6.4), we have dΦ1a(a,b)=0d_{\Phi_{1}^{\prime a}}(a,b)=0.

Let tt\in\mathbb{N} and suppose that TVD(Φta,Φtb)=0\mathop{\text{TVD}}(\Phi^{a\prime}_{t},\Phi^{b\prime}_{t})=0. By (5.6.4), we have dΦta(a,b)=0d_{\Phi_{t}^{\prime a}}(a,b)=0. Then, since Φta\Phi^{a}_{t} and Φtb\Phi^{b}_{t} are the joint distributions of these, they also have total variation 0, and since σ\sigma is fixed, the resulting distributions Φt+1a\Phi^{\prime a}_{t+1} and Φt+1b\Phi^{\prime b}_{t+1} also have total variation 0;

dΦta(a,b)\displaystyle d_{\Phi^{\prime a}_{t}}(a,b) =0\displaystyle=0 (5.6.11)
and TVD(Φta,Φtb)\displaystyle\mathop{\text{TVD}}(\Phi^{\prime a}_{t},\Phi^{\prime b}_{t}) =0\displaystyle=0 (5.6.12)
\displaystyle\implies TVD(Φta,Φtb)\displaystyle\mathop{\text{TVD}}(\Phi^{a}_{t},\Phi^{b}_{t}) =0\displaystyle=0 (5.6.13)
\displaystyle\implies TVD(Φt+1a,Φt+1b)\displaystyle\mathop{\text{TVD}}(\Phi^{\prime a}_{t+1},\Phi^{\prime b}_{t+1}) =0.\displaystyle=0. (5.6.14)

By assumption (5.6.4), we have

dΦt+1a(a,b)=0,\displaystyle d_{\Phi_{t+1}^{\prime a}}(a,b)=0, (5.6.15)

and thus, the joint distributions also have total variation 0:

\displaystyle\mathop{\text{TVD}}(\Phi^{a}_{t+1},\Phi^{b}_{t+1})=0. (5.6.16)

The same holds for all tt\in\mathbb{N}, so we have

da(a,b)=0\displaystyle d_{a}(a,b)=0\iff TVD(Φa,Φb)=0\displaystyle\mathop{\text{TVD}}(\Phi^{a},\Phi^{b})=0 (5.6.17)
\displaystyle\implies x,yA(dΦa(x,y)=dΦb(x,y))\displaystyle\operatorname{\forall}x,y\in A(d_{\Phi^{a}}(x,y)=d_{\Phi^{b}}(x,y)) (5.6.18)
\displaystyle\iff da=db.\displaystyle d_{a}=d_{b}. (5.6.19)

Corollary 5.1 (Identical agents are indiscernible under their shared local distance.).

While the identity of indiscernibles (see Definition 5.1) does not generally hold for dad_{a}, it does hold if

  1. 1.

    We consider the agents as agents, and

  2. 2.

    The distance is taken at one of the considered agents, i.e. da(a,)d_{a}(a,\cdot);

da(a,b)=0ab.\displaystyle d_{a}(a,b)=0\iff a\equiv b. (5.6.20)
Proof.

By Theorem 5.1,

da(a,b)=0\displaystyle d_{a}(a,b)=0\iff x,yA(da(x,y)=db(x,y))\displaystyle\operatorname{\forall}x,y\in A\left(d_{a}(x,y)=d_{b}(x,y)\right) (5.6.21)
\displaystyle\implies db(a,b)=da(a,b)\displaystyle d_{b}(a,b)=d_{a}(a,b) (5.6.22)
\displaystyle\implies db(a,b)=0\displaystyle d_{b}(a,b)=0 (5.6.23)

Remark 5.3 (Symmetry of dad_{a}).

Notice that because dad_{a} is an integral of distances between agents, which are symmetric, dad_{a} is symmetric;

da(b,c)=da(c,b).\displaystyle d_{a}(b,c)=d_{a}(c,b). (5.6.24)

Corollary 5.1 demonstrates that the agent identity relation of Definition 5.6 is symmetric; for every pair of agents a,ba,b, abbaa\equiv b\iff b\equiv a.

5.7 dc(a,b)=0d_{c}(a,b)=0: When Distance 0 Does Not Imply Agent Identity

The picture provided above when the distance is taken at one of the agents being compared is complicated when the comparison is made from a different vantage point. Let us consider some of these cases:

  1. 1.

    Sometimes, identical agents may be distinguished by a local distance,

    abdc(a,b)>0.a\equiv b\land d_{c}(a,b)>0.
  2. 2.

    Sometimes, different agents will not be distinguished,

    abdc(a,b)=0.a\not\equiv b\land d_{c}(a,b)=0.
  3. 3.

    Sometimes, identical distances imply identical agents,

    da=dbab.d_{a}=d_{b}\implies a\equiv b.
  4. 4.

    Sometimes, identical distances don’t,

    da=dbab.d_{a}=d_{b}\land a\not\equiv b.

All of these problems stem from the issues of state visitation mentioned in Subsection 4.2 and [25]: unless aba\equiv b, aa may visit paths which bb does not, and vice versa.

Example 5.1 (aba\equiv b, but dc(a,b)>0d_{c}(a,b)>0).

In general, the functions aa and bb can be identical as agents, while they differ in their responses to unvisited paths. Then, from the perspective of an agent which visits such paths, aa and bb appear different.

Example 5.2 (aba\not\equiv b, but dc(a,b)=0d_{c}(a,b)=0).

Similarly, it is possible for aa and bb to be identical on every path which cc visits, but differ when aa or bb control the process.

Example 5.3 (da=dbabd_{a}=d_{b}\implies a\equiv b).

In general, local distances form a bijection with the distributions Φa\Phi^{a} and the processes (𝒫,a)(\mathcal{P},a), and thus da=dbΦa=Φb(𝒫,a)=(𝒫,b)abd_{a}=d_{b}\iff\Phi^{a}=\Phi^{b}\iff(\mathcal{P},a)=(\mathcal{P},b)\iff a\equiv b.

Example 5.4 (da=dbd_{a}=d_{b}, but aba\not\equiv b).

However, when the set of agents is restricted, this is not necessarily true. If, for example, agents are assumed to be strictly Markov (see Remark 1.3), then all that matters to the equivalence of the functions dad_{a} and dbd_{b} is the states which they visit; if σ(ϕt,𝒶1)\sigma(\phi_{t}^{\prime},\mathpzc{a}_{1}) = σ(ϕt,𝒶2)\sigma(\phi_{t}^{\prime},\mathpzc{a}_{2}), 𝒶1𝒶2\mathpzc{a}_{1}\neq\mathpzc{a}_{2}, then agents aa and bb which differ in their response to ϕt\phi_{t}^{\prime} could nonetheless produce “identical” distances, when the range of the distance is restricted to pairs of strictly Markov agents.

With so few assurances, the local distances may seem pointless. Are they nothing more than markers of identity? No, they are much more; Subsection 6.1 demonstrates that da(a,b)=0abd_{a}(a,b)=0\iff a\equiv b is not a special case. Instead, the local distances themselves are continuous in the agent space: as aa and bb approach one another under either local distance, i.e. as |da(a,b)||d_{a}(a,b)| or |db(a,b)||d_{b}(a,b)| goes to 0, so does supx,yA|da(x,y)db(x,y)|\sup_{x,y\in A}|d_{a}(x,y)-d_{b}(x,y)|.

6 The Agent Space

The collection of local distances described in Section 5 is an odd basis for the structure of an agent space; rather than a single, objective notion of distance, each agent aa defines its own local distance dad_{a}. When paired with the set of agents, each local distance defines a pseudometric space (A,da)(A,d_{a}), which describes the ways that agents differ on Φa\Phi^{a}.

Subsection 5.6 establishes relationships between local distances, but only in the case of identical agents which differ as functions. We have not yet related the local distances of non-identical agents. In particular, we have not established that a collection of local distances defines a single space.

One interpretation of the collection of local distances is as a premetric (see Subsection 6.2), in a manner analogous to the Kullback-Leibler Divergence. However, dad_{a} can also be treated as more than a premetric; it need not be asymmetrical, nor need it violate the triangle inequality, because each agent defines a local distance that describes an internally-consistent pseudometric space. We continue this discussion in greater detail in Subsection 6.2, employing the premetric to provide a simple topology equivalent to that defined by convergence in the agent space (Definition 6.1).

In the next section, we unify the pseudometric spaces produced by each local distance to create an objective agent space, whose topology is compatible with many important aspects of Reinforcement Learning, including standard function approximators (e.g. neural networks) and standard formulations of reward (see Equation 7.2.4).

6.1 Convergence in Agent Spaces

Theorem 5.1 proves that identical agents have the same local distance,

ab\displaystyle a\equiv b\implies da=db.\displaystyle d_{a}=d_{b}. (6.1.1)

Corollary 5.1 gives an important condition for equivalence: agents are identical, and thus have identical local distances, whenever da(a,b)=0d_{a}(a,b)=0. The next step in our analysis of the local distance is to consider the case where aa and bb are close to one another, but their distance is greater than 0. Consider the case

0<da(a,b)<δ,\displaystyle 0<d_{a}(a,b)<\delta, (6.1.2)

with δ>0\delta>0. In order to simplify the remainder of this section, we restrict ourselves to stochastic agents. Let the metric on 𝒜\mathpzc{A}, d𝒜d_{\mathpzc{A}}, be the total variation distance TVD(𝒶1,𝒶2)\mathop{\text{TVD}}(\mathpzc{a}_{1},\mathpzc{a}_{2}). In this case, the logic of Theorem 5.1 can be extended. Theorem 5.1 demonstrates that two agents which are at every time identical must produce identical distributions of paths, and, as a result, identical local distances. Consider a pair of agents aa and bb, which have a distance less than δ\delta on Φa\Phi^{a}, da(a,b)<δd_{a}(a,b)<\delta, with ω(t)=1\omega(t)=1. Then,

0<dΦ0a(a,b)da(a,b)\displaystyle 0<d_{\Phi^{\prime a}_{0}}(a,b)\leq\hskip 4.19998ptd_{a}(a,b) <δ\displaystyle<\delta (6.1.3)
\displaystyle\implies dΦ0a(a,b)=dσ0(a,b)\displaystyle d_{\Phi^{\prime a}_{0}}(a,b)=d_{\sigma_{0}}(a,b) <δ\displaystyle<\delta (6.1.4)
\displaystyle\iff 𝔼ϕ0σ0[TVD(a(ϕ0),b(ϕ0))]\displaystyle\mathop{\mathbb{E}}_{\phi_{0}^{\prime}\sim\sigma_{0}}\left[\mathop{\text{TVD}}(a(\phi_{0}^{\prime}),b(\phi_{0}^{\prime}))\right] <δ\displaystyle<\delta (6.1.5)

Since σ0\sigma_{0} does not vary with the agent, we have

Φ0a=Φ0b𝔼ϕ0σ0[TVD(a(ϕ0),b(ϕ0))]<δ\displaystyle\Phi^{\prime a}_{0}=\Phi^{\prime b}_{0}\land\mathop{\mathbb{E}}_{\phi_{0}^{\prime}\sim\sigma_{0}}\left[\mathop{\text{TVD}}(a(\phi_{0}^{\prime}),b(\phi_{0}^{\prime}))\right]<\delta (6.1.7)
TVD(Φ1a,Φ1b)<δ.\displaystyle\implies\mathop{\text{TVD}}(\Phi_{1}^{\prime a},\Phi_{1}^{\prime b})<\delta. (6.1.8)

Likewise,

dΦ1a(a,b)da(a,b)<δ\displaystyle d_{\Phi_{1}^{\prime a}}(a,b)\leq d_{a}(a,b)<\delta (6.1.9)

should imply that

TVD(Φ2a,Φ2b)<δ,\displaystyle\mathop{\text{TVD}}(\Phi_{2}^{\prime a},\Phi_{2}^{\prime b})<\delta, (6.1.10)

except that since

TVD(Φ1a,Φ1b)<δ,\displaystyle\mathop{\text{TVD}}(\Phi_{1}^{\prime a},\Phi_{1}^{\prime b})<\delta, (6.1.8)

we must start from a baseline of δ\delta, giving the bound 2δ2\delta. In general, we have

TVD(Φta,Φtb)<tδ.\displaystyle\mathop{\text{TVD}}(\Phi_{t}^{\prime a},\Phi_{t}^{\prime b})<t\delta. (6.1.11)

This bound can be improved by noting that the total variation TVD(Φ1a,Φ1b)\mathop{\text{TVD}}(\Phi_{1}^{\prime a},\Phi_{1}^{\prime b}) can be bounded above by (and is in fact equal to) the smaller quantity

dΦ0a(a,b)=dσ0(a,b),\displaystyle d_{\Phi_{0}^{\prime a}}(a,b)=d_{\sigma_{0}}(a,b), (6.1.12)

yielding in the general case

TVD(Φta,Φtb)<itdΦia(a,b).\displaystyle\mathop{\text{TVD}}(\Phi_{t}^{\prime a},\Phi_{t}^{\prime b})<\sum_{i\in\mathbb{Z}_{t}}d_{\Phi_{i}^{\prime a}}(a,b). (6.1.13)

With our assumption that ω(t)=1\omega(t)=1, we can bound the right side of this inequality above, giving the looser inequality

TVD(Φta,Φtb)<itdΦia(a,b)<da(a,b)\displaystyle\mathop{\text{TVD}}(\Phi_{t}^{\prime a},\Phi_{t}^{\prime b})<\sum_{i\in\mathbb{Z}_{t}}d_{\Phi_{i}^{\prime a}}(a,b)<d_{a}(a,b) <δ\displaystyle<\delta (6.1.14)
TVD(Φta,Φtb)\displaystyle\implies\mathop{\text{TVD}}(\Phi_{t}^{\prime a},\Phi_{t}^{\prime b}) <δ.\displaystyle<\delta. (6.1.15)

If the discount function is not the constant value 11 assumed above (i.e. if ω(t)<1\omega(t)<1 for some tt), then the sum above gains a factor of 1ω(i)\frac{1}{\omega(i)}:

TVD(Φta,Φtb)<it1ω(i)dΦia(a,b).\displaystyle\mathop{\text{TVD}}(\Phi_{t}^{\prime a},\Phi_{t}^{\prime b})<\sum_{i\in\mathbb{Z}_{t}}\frac{1}{\omega(i)}d_{\Phi_{i}^{\prime a}}(a,b). (6.1.16)

For simplicity we now assume that ω(t)=γt,0<γ<1\omega(t)=\gamma^{t},0<\gamma<1, though the following results apply to a more general family of functions (for example, they apply to all monotonic super-exponential decay functions).

Lemma 6.1 (TVD(Φta,Φtb)\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{b}) can be bounded above by a function of da(a,b)d_{a}(a,b)).

Notice that

s<tω(s)ω(t)1ω(s)1ω(t),\displaystyle s<t\implies\omega(s)\geq\omega(t)\implies\frac{1}{\omega(s)}\leq\frac{1}{\omega(t)}, (6.1.17)

so for a fixed distance da(a,b)d_{a}(a,b), the maximal total variation TVD(Φta,Φtb)\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{b}) is achieved when

\displaystyle\operatorname{\forall}s<t\left(d_{\Phi^{\prime a}_{s}}(a,b)=0\right)\land d_{\Phi^{\prime a}_{t}}(a,b)=d_{a}(a,b). (6.1.18)

Thus, we can bound the total variation TVD(Φta,Φtb)\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{b}) above by 1γtda(a,b)\frac{1}{\gamma^{t}}d_{a}(a,b). Hence,

da(a,b)<δTVD(Φta,Φtb)<1γtδ.\displaystyle d_{a}(a,b)<\delta\implies\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{b})<\frac{1}{\gamma^{t}}\delta. (6.1.19)

Lemma 6.1 enables us to prove our next theorem, the limit equivalent of Corollary 5.1. Let us begin with a definition.

Definition 6.1 (Convergence in the Agent Space).

We say that a sequence of agents xnx_{n} converges to an agent aa if and only if the local distance between the agents in the sequence and aa goes to 0;

xnaε>0m(n>mda(a,xn)<ε).\displaystyle x_{n}\rightarrow a\iff\operatorname{\forall}\varepsilon>0\,\operatorname{\exists}m\in\mathbb{N}\,(n>m\implies d_{a}(a,x_{n})<\varepsilon). (6.1.20)
Theorem 6.1 (The Limit Behavior of Local Distances).

Let xnx_{n} be a sequence of agents converging to aa. Then,

  1. 1.

    t(limxnaTVD(Φtxn,Φta)=0)\operatorname{\forall}t\in\mathbb{N}(\lim_{x_{n}\rightarrow a}\mathop{\text{TVD}}(\Phi_{t}^{x_{n}},\Phi_{t}^{a})=0),

  2. 2.

    dxndad_{x_{n}}\rightarrow d_{a}, and

  3. 3.

    dxn(xn,a)0d_{x_{n}}(x_{n},a)\rightarrow 0.

Proof of 1.

By (6.1.19), we have for any fixed tt and any agent xnx_{n}

da(a,xn)<δTVD(Φta,Φtxn)<1γtδ.\displaystyle d_{a}(a,x_{n})<\delta\implies\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{x_{n}})<\frac{1}{\gamma^{t}}\delta. (6.1.21)

By assumption, for every δ>0\delta>0 there is an mm\in\mathbb{N} with n>mda(a,xn)<δn>m\implies d_{a}(a,x_{n})<\delta. For any ε>0\varepsilon>0, there is a δε>0\delta_{\varepsilon}>0 with 1γtδε<ε\frac{1}{\gamma^{t}}\delta_{\varepsilon}<\varepsilon. Thus, we can select an mm\in\mathbb{N} with

\displaystyle n>m\implies d_{a}(a,x_{n})<\delta_{\varepsilon} (6.1.22)
\displaystyle\implies\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{x_{n}})<\varepsilon. (6.1.23)

Proof of 2.

Per the proof of 1, we have for any ε>0\varepsilon>0 and any tt\in\mathbb{N} an mm giving n>mTVD(Φta,Φtxn)<εn>m\implies\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{x_{n}})<\varepsilon. Let 𝒜max\mathpzc{A}_{\text{max}} be the bound on 𝒜\mathpzc{A} (in the case of total variation, 𝒜max=1\mathpzc{A}_{\text{max}}=1). For each path ϕ\phi, we have

dϕ(a,b)dϕt(a,b)\displaystyle d_{\phi}(a,b)-d_{\phi_{t}}(a,b)\leq i=t+1γi𝒜max\displaystyle\sum_{i=t+1}^{\infty}\gamma^{i}\,\mathpzc{A}_{\text{max}} (6.1.24)
=γt+11γ𝒜max=\displaystyle=\frac{\gamma^{t+1}}{1-\gamma}\mathpzc{A}_{\text{max}}= γt+11γ,\displaystyle\frac{\gamma^{t+1}}{1-\gamma}, (6.1.25)

and we have the analogous bound for Φa\Phi^{a}

dai=0i=tdΦiaγt+11γ.\displaystyle d_{a}-\sum_{i=0}^{i=t}d_{\Phi_{i}^{a}}\leq\frac{\gamma^{t+1}}{1-\gamma}. (6.1.26)
Remark 6.1 (Notation for the Distance on Distributions of Truncated Paths).

We now need to manipulate terms of this type, for which a bit of notation will be useful: Let

dat\displaystyle d_{a}^{t}\! =i=0i=tdΦia, and\displaystyle=\sum_{i=0}^{i=t}d_{\Phi_{i}^{a}},\text{ and} (6.1.27)
dat+\displaystyle d_{a}^{t+}\! =i=t+1i=dΦia.\displaystyle=\hskip-3.00003pt\sum_{i=t+1}^{i=\infty}\hskip-3.50006ptd_{\Phi_{i}^{a}}. (6.1.28)

Further, for any agent aa we can decompose dad_{a} into

\displaystyle d_{a}=\sum_{i=0}^{t}d_{\Phi_{i}^{a}}+\sum_{i=t+1}^{\infty}d_{\Phi_{i}^{a}} (6.1.29)
da\displaystyle d_{a} =\displaystyle= dat\displaystyle d_{a}^{t} +(dadat)\displaystyle+(d_{a}-d_{a}^{t}) (6.1.30)
da\displaystyle d_{a} =\displaystyle= dat\displaystyle d_{a}^{t} +dat+.\displaystyle+d_{a}^{t+}. (6.1.31)

Notice that the maximum value of datd_{a}^{t} is

\displaystyle\sum_{i=0}^{t}\gamma^{i}\mathpzc{A}_{\text{max}}=\frac{1-\gamma^{t+1}}{1-\gamma}\mathpzc{A}_{\text{max}}, (6.1.32)

and we can bound dat+d_{a}^{t+} from above as well,

dat+\displaystyle d_{a}^{t+}\leq max(da)max(dat)\displaystyle\,\max(d_{a})-\max(d_{a}^{t}) (6.1.33)
=\displaystyle= 11γ𝒜max1γ𝓉+11γ𝒜max\displaystyle\,\frac{1}{1-\gamma}\mathpzc{A}_{\text{max}}-\frac{1-\gamma^{t+1}}{1-\gamma}\mathpzc{A}_{\text{max}} (6.1.34)
=\displaystyle= γt+11γ𝒜max.\displaystyle\,\frac{\gamma^{t+1}}{1-\gamma}\mathpzc{A}_{\text{max}}. (6.1.35)

Clearly, as tt\rightarrow\infty, this bound goes to 0.

Let ε>0\varepsilon>0, ε=ε(1γ)2𝒜max(1γ𝓉+1)\varepsilon^{\prime}=\frac{\varepsilon(1-\gamma)}{2\mathpzc{A}_{\text{max}}(1-\gamma^{t+1})}, and let δε\delta_{\varepsilon^{\prime}} be as above. Then, we have

da(a,b)\displaystyle d_{a}(a,b) <δε\displaystyle<\delta_{\varepsilon^{\prime}} (6.1.36)
\displaystyle\implies\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{b})<\varepsilon^{\prime}=\frac{\varepsilon(1-\gamma)}{2\mathpzc{A}_{\text{max}}(1-\gamma^{t+1})} (6.1.37)
\displaystyle\implies datdbt\displaystyle d_{a}^{t}-d_{b}^{t} <1γt+11γ𝒜maxTVD(Φ𝓉𝒶,Φ𝓉𝒷)\displaystyle<\frac{1-\gamma^{t+1}}{1-\gamma}\mathpzc{A}_{\text{max}}\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{b}) (6.1.38)
<1γt+11γ𝒜maxε=ε2.\displaystyle<\frac{1-\gamma^{t+1}}{1-\gamma}\mathpzc{A}_{\text{max}}\varepsilon^{\prime}=\frac{\varepsilon}{2}. (6.1.39)

Thus, we have

dadb\displaystyle d_{a}-d_{b} =dat+dat+(dbt+dbt+)\displaystyle=d_{a}^{t}+d_{a}^{t+}-(d_{b}^{t}+d_{b}^{t+}) (6.1.40)
datdbt+2[𝒜maxγ𝓉1γ]\displaystyle\leq d_{a}^{t}-d_{b}^{t}+2\left[\mathpzc{A}_{\text{max}}\frac{\gamma^{t}}{1-\gamma}\right] (6.1.41)
<ε2+2[𝒜maxγ𝓉1γ].\displaystyle<\frac{\varepsilon}{2}+2\left[\mathpzc{A}_{\text{max}}\frac{\gamma^{t}}{1-\gamma}\right]. (6.1.42)

Now, set tt great enough that 2[𝒜maxγ𝓉1γ]<ε22\left[\mathpzc{A}_{\text{max}}\frac{\gamma^{t}}{1-\gamma}\right]<\frac{\varepsilon}{2}, and set b=xnb=x_{n}. Per part 1, we can select an mm with n>mda(a,xn)<δεn>m\implies d_{a}(a,x_{n})<\delta_{\varepsilon^{\prime}}. Then, combining lines (6.1.40) and (6.1.42), we have

\displaystyle d_{a}-d_{x_{n}}<\varepsilon. (6.1.43)

Proof of 3.

Applying 2, we have

dxnda.\displaystyle d_{x_{n}}\rightarrow d_{a}. (6.1.44)

Since da(a,xn)0d_{a}(a,x_{n})\rightarrow 0 by Definition 6.1, we have

\displaystyle d_{x_{n}}(a,x_{n})\rightarrow 0. (6.1.45)

This theorem demonstrates that agents which are close in the agent space have close perspectives and produce close local distances. In fact, the proof of Item 2 of Theorem 6.1 demonstrates that the local distances dad_{a} which represent those perspectives are uniformly continuous in the agent. Further, Item 1 of Theorem 6.1 demonstrates that similar agents produce similar distributions of truncated paths - not just similar distance functions.

In the next section we consider a loose method of interpreting the local distances: the interpretation of the local distance as a function of two, rather than three, agents, fixing the vantage point at the first agent being compared. This allows us to describe the local distance as a premetric. We use this fact to define the topology of the agent space in Subsection 6.3.

6.2 dx(x,y)d_{x}(x,y) as a Premetric

A premetric is a generalization of a metric which relaxes several properties, giving the very general definition

Definition 6.2 (Premetric).

A function D:X×X+{0}D:X\times X\rightarrow\mathbb{R}^{+}\cup\{0\} is called a premetric if [19, 23]

D(x,x)=0.\displaystyle D(x,x)=0. (6.2.1)

Such a premetric is called separating if it also satisfies

D(x,y)=0x=y.\displaystyle D(x,y)=0\iff x=y. (6.2.2)

Many important functions satisfy this definition, including the Kullback-Leibler Divergence.

Remark 6.2 (The Local Distance is a Premetric).

Notice that the function

D(a,b)=da(a,b)\displaystyle D(a,b)=d_{a}(a,b) (6.2.3)

is a separating premetric.

Importantly for the practical use of the local distances, this premetric (along with the other structures of the agent space) is able to describe the differences between agents and between the distributions of paths which they produce without actually sampling those distributions; da(b,c)d_{a}(b,c) compares the distributions Φb\Phi^{b} and Φc\Phi^{c} but only requires information about the distribution Φa\Phi^{a} and the functions bb and cc. This is valuable because in Reinforcement Learning it is typically simple to calculate the actions which an agent would take from that agent’s parameters, whereas information about a distribution of paths usually needs to be sampled - an expensive operation. This is especially valuable if many nearby agents need to be compared (e.g. because the agents being considered are based on a single locus agent). Operations which compare a pair of agents using the standard of a third in this way are common in Reinforcement Learning. For example, the 𝒬a\mathcal{Q}^{a} function (Definition 2.2) is often used to judge the quality of the actions of other agents bab\neq a.
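A sketch of this use case: compare a population of candidate agents from the vantage point of a single locus agent aa, reusing one cached batch of state sequences sampled from Φa\Phi^{a} rather than rolling out each candidate. The helper action_tvd and the ω(t)=γ^t discount are as in the earlier sketches; this is an illustration, not part of the formalism.

```python
import numpy as np

def pairwise_distances_at_locus(cached_paths, agents, gamma=0.99):
    # cached_paths: list of state sequences sampled once from (P, a).
    # agents: list of strictly Markov stochastic policies to compare.
    n = len(agents)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            est = np.mean([sum(gamma**t * action_tvd(agents[i](s), agents[j](s))
                               for t, s in enumerate(states))
                           for states in cached_paths])
            D[i, j] = D[j, i] = est  # d_a is symmetric (Remark 5.3)
    return D
```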

In the next section we describe a topology on the agent space which we will take as canonical (i.e. as the topology of the agent space). There are two basic ways to understand the topology: it may be understood as the topology of the premetric space given by the premetric on the agent space described above, or it may be understood as the topology given by the convergence relation of Definition 6.1. These are identical. In fact, Definition 6.1 can be defined using only the premetric description of the local distance.

6.3 The Topology of the Agent Space

Let us start by providing two equivalent definitions of the topology of the agent space: one definition of its open sets, and another definition of its closed sets.

Definition 6.3 (The Topology of (A,d)(A,d): Open Sets).

We say that a set UAU\subseteq A is open if and only if about every point xUx\in U, UU admits an open disk of positive radius:

xUr>0({y|dx(x,y)<r}U).\displaystyle\operatorname{\forall}x\in U\operatorname{\exists}r>0\,(\{y\,|\,d_{x}(x,y)<r\}\subset U). (6.3.1)
Definition 6.4 (The Topology of (A,d)(A,d): Closed Sets).

We say that a set XAX\subseteq A is closed if and only if it contains its limit points; that is, for every convergent sequence xnx_{n}:

xn(n(xnX)limxnX).\displaystyle\operatorname{\forall}x_{n}(\operatorname{\forall}n\in\mathbb{N}\,(x_{n}\in X)\implies\lim x_{n}\in X). (6.3.2)

These definitions suffice, in fact, to define the topology of any premetric (or metric). It may be demonstrated that these definitions produce the same topology (for example, by remarking that open sets in metric spaces may be characterized by the criterion (6.3.1)). It is important to note that this, along with the premetric version of the agent space, represents a sort of lower-bound on the structure which the local distances describe on the set of agents. In particular, the local distances may prove useful beyond simple problems of limits. In [28], we employ the local distances for exploration in an implementation of Novelty Search.

In the next section, we demonstrate that the topology of the agent space is compatible with many of the most important aspects of Reinforcement Learning. In particular, we show that standard formulations of reward are continuous in the agent space, and that the agent space itself is continuous in the parameters of most agent approximators, demonstrating that the agent space is a valid structure for the set of agents, and for Reinforcement Learning more generally.

7 Functions of the Agent

The topology of the agent space carries information about many important aspects of the decision process and its interaction with agents, including the distributions of truncated paths. However, we have not yet demonstrated any relationship between the agent space and the object of Reinforcement Learning: the expected reward of the agent, J(a)J(a). In this section we demonstrate that an important class of reward functions (summable reward functions, (1.1.118)) are continuous functions of the agent in the topology of the agent space. We begin with a simple condition for the agent to be a continuous function of the parameters of a function approximator. We then use the continuity of finite distributions of paths established in Item 1 of Theorem 6.1 in the agent space to prove that the expectation of reward is a continuous function of the agent.

7.1 Parameterized Agents

Let ff be a function approximator parameterized by a set of real numbers θ\theta, taking truncated prime paths into a set of actions. Then,

f:n×Φ𝒜.\displaystyle f:\mathbb{R}^{n}\times\Phi^{\prime}\rightarrow\mathpzc{A}. (7.1.1)

If we delay the selection of the truncated path, we may understand ff as a function from n\mathbb{R}^{n} into the set of agents:

f:n𝒜Φ𝒻:𝓃𝒜.\displaystyle f:\mathbb{R}^{n}\rightarrow\mathpzc{A}^{\Phi^{\prime}}\iff f:\mathbb{R}^{n}\rightarrow A. (7.1.2)

Notice that we have returned to the pre-quotient notion of an agent - the set of functions from the set of truncated prime paths to the set of actions before the equivalence relation of Definition 5.6 is applied. To better distinguish these functions, let us denote the pre-quotient set FF. The matter of demonstrating that a particular function approximator is a continuous function from its parameters to the agent space may be divided into two parts: it must be demonstrated that the function approximator is a continuous function from the set of parameters to the set of functions, and it must be demonstrated that the quotient operation itself is a continuous function from the set of functions to the set of agents. We begin by demonstrating the continuity of the quotient operation.

Let us assume the LL^{\infty} metric on the set of functions and denote the map taking a function ff to an agent aa by Q:(F,L)(A,d)Q:(F,L^{\infty})\rightarrow(A,d). Then, the quotient operation which takes the set of functions to the space of agents is continuous if and only if for every sequence xnx_{n} in (F,L)(F,L^{\infty}) converging to some xx, Q(xn)Q(x_{n}) converges to Q(x)Q(x).

Theorem 7.1 (The Agent Identity Quotient Operation is Continuous).

The quotient operation defined by Definition 5.6 is a continuous function from FF to AA.

Proof.

Let xnx_{n} be an LL^{\infty}-convergent sequence of functions converging to xx

xn:Φ𝒜,\displaystyle x_{n}:\Phi^{\prime}\rightarrow\mathpzc{A}, (7.1.3)
ε>0mn>m(d(xn,x)<ε).\displaystyle\operatorname{\forall}\varepsilon>0\operatorname{\exists}m\in\mathbb{N}\operatorname{\forall}n>m\,(d^{\infty}(x_{n},x)<\varepsilon). (7.1.4)

Then, we must show that

ε>0mn>m(dQ(x)(Q(xn),Q(x))<ε).\displaystyle\operatorname{\forall}\varepsilon>0\operatorname{\exists}m\in\mathbb{N}\operatorname{\forall}n>m\,(d_{Q(x)}(Q(x_{n}),Q(x))<\varepsilon). (7.1.5)

Consider the definition of dxd_{x}:

\displaystyle d_{x}(x_{n},x)=\sum_{t\in\mathbb{N}}\omega(t)d_{\Phi^{\prime x}_{t}}(x_{n},x) (7.1.6)
\displaystyle d_{\Phi^{\prime x}_{t}}(x_{n},x)=\mathop{\mathbb{E}}_{\phi^{\prime}_{t}\sim\Phi^{\prime x}_{t}}\left[d_{\phi^{\prime}_{t}}(x_{n},x)\right]. (7.1.7)

Clearly, we have

dΦtx(xn,x)\displaystyle d_{\Phi^{\prime x}_{t}}(x_{n},x) supϕΦdϕ(xn,x)\displaystyle\leq\sup_{\phi^{\prime}\in\Phi^{\prime}}d_{\phi^{\prime}}(x_{n},x) (7.1.8)
=d(xn,x).\displaystyle=d^{\infty}(x_{n},x). (7.1.9)

Thus, dx(xn,x)tω(t)d(xn,x)d_{x}(x_{n},x)\leq\sum_{t\in\mathbb{N}}\omega(t)d^{\infty}(x_{n},x). Recall that Ω=tω(t)<\Omega=\sum_{t\in\mathbb{N}}\omega(t)<\infty. Thus,

ε>0mn>m(dx(xn,x)ε)\displaystyle\operatorname{\forall}\varepsilon>0\operatorname{\exists}m\in\mathbb{N}\,\operatorname{\forall}n>m(d_{x}(x_{n},x)\leq\varepsilon) (7.1.10)

holds because an integer mm which satisfies

εΩ>0mn>m(d(xn,x)<εΩ)\displaystyle\operatorname{\forall}\frac{\varepsilon}{\Omega}>0\operatorname{\exists}m\in\mathbb{N}\,\operatorname{\forall}n>m\left(d^{\infty}(x_{n},x)<\frac{\varepsilon}{\Omega}\right) (7.1.11)

exists, by assumption. ∎

To finish the demonstration that a particular function approximator gives agents continuous in its parameters, then, it remains only to show that the function approximator is LL^{\infty}-continuous (uniformly continuous) in the parameters of the approximator. One class of function approximators which satisfies this is feedforward neural networks, such as those discussed in [29].111111Specifically, neural networks with continuous, bounded activation functions are uniformly continuous in their parameters.
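As an illustration (not the paper's construction), the sketch below defines a hypothetical parameterized agent built from a bounded, continuous activation and a softmax output over a finite action set; for such approximators a small perturbation of θ\theta produces a uniformly small change in the action distributions, which is the LL^{\infty}-continuity required above. All names here are ours.

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, float)
    e = np.exp(z - z.max())
    return e / e.sum()

def make_agent(theta, n_actions, state_dim):
    # f(theta): a strictly Markov stochastic agent; states are feature
    # vectors of length `state_dim`, outputs are action distributions.
    W = np.asarray(theta, float).reshape(n_actions, state_dim)
    return lambda s: softmax(np.tanh(W @ np.asarray(s, float)))

# A small perturbation of theta yields a uniformly small change in the
# agent's outputs, and hence (by Theorem 7.1) a small step in the agent space.
theta = np.random.randn(4 * 3)
a, a_eps = make_agent(theta, 4, 3), make_agent(theta + 1e-4, 4, 3)
```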

7.2 Reward and the Agent Space

In order for the agent space to be useful for the problem of Reinforcement Learning, it must be related to the object of Reinforcement Learning: reward.

R:Φ\displaystyle R:\Phi\rightarrow\mathbb{R} (1.1.117)

We noted in Definition 1.8 that reward can frequently be described by a sum,

R(ϕ)=\displaystyle R(\phi)= tr(ϕt(s),ϕt(𝒶)).\displaystyle\sum_{t\in\mathbb{N}}r(\phi_{t}(s),\phi_{t}(\mathpzc{a})). (1.1.118)

We also noted that this sum is often weighted by a discount function ω(t)\omega(t). Discount functions are employed because they offer general conditions under which the reward of a path (and thus its expectation) is bounded: so long as Ω\Omega is finite and the immediate reward rr is bounded, so too is the sum (1.1.118).

This formulation of reward has several valuable properties which can be extracted: the reward function can be extended from Φ\Phi to include truncated paths:

R:ΦtΦt\displaystyle R:\Phi\cup\bigcup_{t\in\mathbb{N}}\Phi_{t}\rightarrow\mathbb{R} (7.2.1)
\displaystyle R(\phi_{t})=\sum_{i<t}r(\phi_{i}(s),\phi_{i}(\mathpzc{a})). (7.2.2)

Clearly, for any path ϕ\phi for which R(ϕ)R(\phi) exists we have

limtR(ϕt)=R(ϕ).\displaystyle\lim_{t\rightarrow\infty}R(\phi_{t})=R(\phi). (7.2.3)

If the immediate reward rr is bounded and the sum is weighted by a discount function ω\omega with finite sum Ω\Omega then RR is bounded and we have the stronger condition

ε>0tϕΦ(|R(ϕ)R(ϕt)|<ε).\displaystyle\operatorname{\forall}\varepsilon>0\operatorname{\exists}t\in\mathbb{N}\operatorname{\forall}\phi\in\Phi\left(|R(\phi)-R(\phi_{t})|<\varepsilon\right). (7.2.4)

That is, with such a summable discount function the reward of a truncated path converges uniformly to the reward of the full path as tt\rightarrow\infty.
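A small sketch of this uniform bound, assuming ω(t)=γ^t and an immediate reward bounded by r_max: the truncation error |R(ϕ)−R(ϕ_t)| is at most the tail of the geometric series. The function names are ours.

```python
def truncated_discounted_reward(rewards, gamma=0.99):
    # R(phi_t) for a truncated path, with omega(t) = gamma**t.
    return sum(gamma**i * r for i, r in enumerate(rewards))

def truncation_error_bound(t, r_max, gamma=0.99):
    # Uniform bound on |R(phi) - R(phi_t)| when |r| <= r_max:
    # the geometric tail r_max * gamma**t / (1 - gamma).
    return r_max * gamma**t / (1 - gamma)
```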

Theorem 7.2 (Functions Continuous in the Agent Space).

Let RR be a bounded real function of the set of paths and truncated paths, and let JJ be the expectation of RR on the distribution of paths Φa\Phi^{a},

J(a)=𝔼ϕΦa[R(ϕ)],\displaystyle J(a)=\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\left[R(\phi)\right], (7.2.5)
R:ΦtΦt,\displaystyle R:\Phi\cup\bigcup_{t\in\mathbb{N}}\Phi_{t}\rightarrow\mathbb{R}, (7.2.6)
supϕ,φΦtΦt|R(ϕ)R(φ)|=R¯,\displaystyle\sup_{\phi,\varphi\in\Phi\cup\bigcup_{t\in\mathbb{N}}\Phi_{t}}|R(\phi)-R(\varphi)|=\overline{R}, (7.2.7)

and let RR satisfy (7.2.4)

ε>0tϕΦ(t>t|R(ϕ)R(ϕt)|<ε).\displaystyle\operatorname{\forall}\varepsilon>0\operatorname{\exists}t\in\mathbb{N}\operatorname{\forall}\phi\in\Phi\left(t^{\prime}>t\implies|R(\phi)-R(\phi_{t^{\prime}})|<\varepsilon\right). (7.2.4)

Then, JJ is a continuous function with respect to the agent space (A,d)(A,d).

Proof.

Let us demonstrate that JJ is a continuous function of aa by showing that for any convergent sequence xnx_{n} converging to aa,

limnJ(xn)=J(a).\displaystyle\lim_{n\rightarrow\infty}J(x_{n})=J(a). (7.2.8)

Thus, our goal is to demonstrate that

\displaystyle\operatorname{\forall}\varepsilon>0\operatorname{\exists}m\in\mathbb{N}\left(n>m\implies|J(a)-J(x_{n})|<\varepsilon\right). (7.2.9)

By assumption of (7.2.4),

\displaystyle\operatorname{\exists}t\in\mathbb{N}\,\operatorname{\forall}\phi\in\Phi\left(|R(\phi)-R(\phi_{t})|<\frac{\varepsilon}{3}\right). (7.2.10)

Consider the expectation of the reward of the truncated paths of aa,

𝔼ϕΦaR(ϕt).\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t}). (7.2.11)

By (7.2.10), for an appropriate value of tt we have

|R(ϕ)R(ϕt)|<ε3.\displaystyle|R(\phi)-R(\phi_{t})|<\frac{\varepsilon}{3}. (7.2.12)

Thus, we have

|J(a)𝔼ϕΦaR(ϕt)|=|𝔼ϕΦaR(ϕ)𝔼ϕΦaR(ϕt)|\displaystyle\left|J(a)-\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})\right|=\left|\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi)-\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})\right| (7.2.13)
\displaystyle\leq\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}|R(\phi)-R(\phi_{t})| (7.2.14)
<\displaystyle< 𝔼ϕΦaε3=ε3\displaystyle\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}\frac{\varepsilon}{3}=\frac{\varepsilon}{3} (7.2.15)

Now, let us consider Item 1 of Theorem 6.1, which demonstrates that for any tt\in\mathbb{N} and any sequence of agents xnx_{n} converging to aa, TVD(Φta,Φtxn)\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{x_{n}}) goes to 0 as nn\rightarrow\infty, so we have

tm(n>mTVD(Φta,Φtxn)<ε3R¯).\displaystyle\operatorname{\forall}t\in\mathbb{N}\operatorname{\exists}m\in\mathbb{N}\left(n>m\implies\mathop{\text{TVD}}(\Phi_{t}^{a},\Phi_{t}^{x_{n}})<\frac{\varepsilon}{3\overline{R}}\right). (7.2.16)

Then we have

|𝔼ϕΦxnR(ϕt)𝔼ϕΦaR(ϕt)|<R¯ε3R¯=ε3.\displaystyle\left|\mathop{\mathbb{E}}_{\phi\sim\Phi^{x_{n}}}R(\phi_{t})-\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})\right|<\overline{R}\frac{\varepsilon}{3\overline{R}}=\frac{\varepsilon}{3}. (7.2.17)

Thus, for sufficiently large tt and mm, we have for n>mn>m

|J(a)J(xn)|\displaystyle\left|J(a)-J(x_{n})\right| (7.2.18)
=\displaystyle= |J(a)J(xn)+(𝔼ϕΦaR(ϕt)𝔼ϕΦaR(ϕt))+(𝔼ϕΦxnR(ϕt)𝔼ϕΦxnR(ϕt))|\displaystyle\left|J(a)-J(x_{n})+\left(\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})-\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})\right)+\left(\mathop{\mathbb{E}}_{\phi\sim\Phi^{x_{n}}}R(\phi_{t})-\mathop{\mathbb{E}}_{\phi\sim\Phi^{x_{n}}}R(\phi_{t})\right)\right| (7.2.19)
=\displaystyle= |(J(a)𝔼ϕΦaR(ϕt))(J(xn)𝔼ϕΦxnR(ϕt))+(𝔼ϕΦaR(ϕt)𝔼ϕΦxnR(ϕt))|\displaystyle\left|\left(J(a)-\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})\right)-\left(J(x_{n})-\mathop{\mathbb{E}}_{\phi\sim\Phi^{x_{n}}}R(\phi_{t})\right)+\left(\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})-\mathop{\mathbb{E}}_{\phi\sim\Phi^{x_{n}}}R(\phi_{t})\right)\right| (7.2.20)
\displaystyle\leq\left|J(a)-\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})\right|+\left|J(x_{n})-\mathop{\mathbb{E}}_{\phi\sim\Phi^{x_{n}}}R(\phi_{t})\right|+\left|\mathop{\mathbb{E}}_{\phi\sim\Phi^{a}}R(\phi_{t})-\mathop{\mathbb{E}}_{\phi\sim\Phi^{x_{n}}}R(\phi_{t})\right| (7.2.21)
<\displaystyle< ε3+ε3+ε3=ε.\displaystyle\,\,\frac{\varepsilon}{3}+\frac{\varepsilon}{3}+\frac{\varepsilon}{3}=\varepsilon. (7.2.22)

8 Conclusion

In this work we consider the problem of exploration in Reinforcement Learning. We find that exploration is understood and well-defined in the dynamic paradigm of Richard Bellman [5], but that it is not well-defined for other optimization paradigms used in Reinforcement Learning. In dynamic Reinforcement Learning, exploration serves to collect the information necessary for dynamic programming, as described in Subsection 2.4. In non-dynamic Reinforcement Learning - what we call naïve Reinforcement Learning - the situation is more complex. We find that dynamic methods of exploration are effective in naïve methods, but that the explanation of their effect offered by dynamic programming cannot account for this efficacy, because naïve methods do not use the information required by dynamic programming.

This leads us to several questions: Why are exploration methods designed to provide information useless to naïve Reinforcement Learning nonetheless effective for naïve methods? What should the definition of exploration be for naïve methods? To what extent does this more general kind of exploration contribute to the effectiveness of dynamic exploration in dynamic methods? To resolve these questions, we consult the commonalities of several dynamic methods of exploration, finding two: first, their dynamic justification, and second, their mechanism: considering different agents which are deemed likely to demonstrate different distributions of paths.

Of these, only the mechanism might serve to explain dynamic exploration’s efficacy in naïve methods, and we take this mechanism as the definition of naïve exploration. This definition, however, leaves a gap: under it, totally random experimentation with agents is explorative. This may be effective in small problems, but it is unprincipled. We find a principle in Novelty Search [7]: in exploration one should consider agents which are novel relative to the agents which have already been considered. To determine novelty, Novelty Search uses the distance between the behavior of an agent and the behaviors of those considered in the past.

However, we find their notion of novelty deficient for the purpose of defining naïve exploration; they require that the function which determines the behavior of an agent be separately and manually determined for each Reinforcement Learning problem. Fortunately, this view is not held universally in the literature. We consider a cluster of behavior functions which we call primitive behavior [23, 24, 25]. Primitive behavior is powerful: because it is composed of the actions of an agent, it is possible in some processes for primitive behavior to fully determine the distribution of paths, and thus to determine every notion of behavior derived from (𝒫,a)(\mathcal{P},a).

Unfortunately, primitive behavior has several flaws. First, only in certain finite processes may the primitive behavior of an agent fully determine Φa\Phi^{a}. Second, primitive behavior can inappropriately distinguish between agents (see Definition 5.6). Third, it necessarily retains the manual selection requirement in decision processes with infinitely many states. In Section 5, we describe a more general notion of the distance between agents - one which does not require a behavior function. Instead, we define a structure on the set of agents itself. We call the resulting structure an agent space.

In Section 6 and Section 7, we describe the topology of the agent space, demonstrating that it carries information about many important aspects of Reinforcement Learning, including the distribution of paths produced by an agent and standard formulations of the reward of an agent. Using these facts, we demonstrate that, for many function approximators, reward is a continuous function of the parameters of an agent.

In a future work [28], we use techniques described in Appendix C to join the agent space with Novelty Search to perform Reinforcement Learning using a naïve, scalable learning system similar to Evolution Strategies [13]. We test this method in a variety of processes and find that it performs similarly to ES in problems which require little exploration, and is strictly superior to ES in problems in which exploration is necessary.

References

  • [1] Richard S Sutton and Andrew G Barto “Reinforcement Learning: An Introduction” Cambridge, Massachusetts: MIT Press, 2020 URL: https://mitpress.mit.edu/books/reinforcement-learning-second-edition
  • [2] J.C. Gittins and D.M. Jones “A Dynamic Allocation Index for the Discounted Multiarmed Bandit Problem” In Biometrika 66.3, 1979, pp. 561–565 URL: https://www.jstor.org/stable/2335176
  • [3] Oriol Vinyals et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learning” In Nature 575.7782 Springer US, 2019, pp. 350–354 DOI: 10.1038/s41586-019-1724-z
  • [4] OpenAI et al. “Dota 2 with Large Scale Deep Reinforcement Learning”, 2019 arXiv: http://arxiv.org/abs/1912.06680
  • [5] Richard Bellman “The Theory of Dynamic Programming” In Summer Meeting of the American Mathematical Society, 1954 URL: https://apps.dtic.mil/sti/citations/AD0604386
  • [6] Richard Bellman “Dynamic programming” Princeton, New Jersey: Princeton University Press, 1957
  • [7] Joel Lehman and Kenneth O Stanley “Abandoning Objectives: Evolution Through the Search for Novelty Alone” In Evolutionary Computation 19.2, 2011, pp. 189–233 DOI: 10.1162/EVCO_a_00025
  • [8] Christopher J C H Watkins “Learning From Delayed Rewards”, 1989 URL: https://www.researchgate.net/publication/33784417
  • [9] Richard E. Bellman and Stewart E. Dreyfus “Applied Dynamic Programming” In Journal of Mathematical Analysis and Applications, 1962 URL: https://www.rand.org/pubs/reports/R352.html
  • [10] Shipra Agrawal and Navin Goyal “Further Optimal Regret Bounds for Thompson Sampling” In Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics 31 Scottsdale, Arizona: PMLR, 2013, pp. 99–107 arXiv: http://proceedings.mlr.press/v31/agrawal13a.html
  • [11] Christopher J C H Watkins and Peter Dayan “Technical Note: Q-Learning” In Machine Learning 8, 1992, pp. 279–292 DOI: 10.1023/A:1022676722315
  • [12] Ronald A. Howard “Dynamic Programming and Markov Processes” MIT Press / John Wiley & Sons, Inc., 1960. Library of Congress 60-11030
  • [13] Tim Salimans et al. “Evolution Strategies as a Scalable Alternative to Reinforcement Learning”, 2017, pp. 1–13 arXiv: https://openai.com/blog/evolution-strategies/
  • [14] Sebastian B Thrun “The Role of Exploration in Learning Control” In Handbook for Intelligent Control: Neural, Fuzzy and Adaptive Approaches, 1992 URL: http://www.cs.cmu.edu/%7B~%7Dthrun/papers/thrun.exploration-overview.html
  • [15] Haoran Tang et al. “#Exploration: A Study of Count-Based Exploration for Deep Reinforcement Learning” In 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017 DOI: 10.5555/3294996.3295035
  • [16] Bradly C. Stadie, Sergey Levine and Pieter Abbeel “Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models”, 2015 arXiv: https://ui.adsabs.harvard.edu/abs/2015arXiv150700814S/abstract
  • [17] Lilian Weng “Exploration Strategies in Deep Reinforcement Learning” In lilianweng.github.io/lil-log, 2020 URL: https://lilianweng.github.io/lil-log/2020/06/07/exploration-strategies-in-deep-reinforcement-learning.html
  • [18] Joel Lehman and Kenneth O Stanley “Exploiting Open-Endedness to Solve Problems Through the Search for Novelty” In Proceedings of the Eleventh International Conference on Artificial Life XI Cambridge, Massachusetts: MIT Press, 2008 URL: http://eplex.cs.ucf.edu/papers/lehman%7B%5C_%7Dalife08.pdf
  • [19] A.V. Arkhangel’skiǐ and L.S. Pontryagin “General Topology I”, Encyclopaedia of Mathematical Sciences Springer-Verlag Berlin Heidelberg, 1990, pp. 202 DOI: 10.1007/978-3-642-61265-7_1
  • [20] Joel Lehman and Kenneth O Stanley “Novelty Search and the Problem with Objectives” In Genetic Programming Theory and Practice IX Springer-Verlag, 2011, pp. 37–56 DOI: 10.1007/978-1-4614-1770-5_3
  • [21] Faustino J. Gomez “Sustaining diversity using behavioral information distance” In GECCO ’09: Proceedings of the 11th Annual conference on Genetic and evolutionary computation Montréal, Québec, Canada: Association for Computing Machinery, 2009, pp. 113–120 DOI: 10.1145/1569901.1569918
  • [22] Edoardo Conti et al. “Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents” In Advances in Neural Information Processing Systems 31, 2017 arXiv: https://proceedings.neurips.cc/paper/2018/hash/b1301141feffabac455e1f90a7de2054-Abstract.html
  • [23] Elliot Meyerson, Joel Lehman and Risto Miikkulainen “Learning Behavior Characterizations for Novelty Search” In GECCO 2016 - Proceedings of the 2016 Genetic and Evolutionary Computation Conference Association for Computing Machinery, Inc, 2016, pp. 149–156 DOI: 10.1145/2908812.2908929
  • [24] Jack Parker-Holder, Aldo Pacchiano, Krzysztof Choromanski and Stephen Roberts “Effective Diversity in Population Based Reinforcement Learning”, 2020 arXiv: https://research.google/pubs/pub49976/
  • [25] Jörg Stork, Martin Zaefferer, Thomas Bartz-Beielstein and A E Eiben “Understanding the Behavior of Reinforcement Learning Agents” In International Conference on Bioinspired Methods and Their Applications 2020 Brussels, Belgium: Springer, 2020, pp. 148–160 DOI: 10.1007/978-3-030-63710-1_12
  • [26] Aldo Pacchiano et al. “Learning to score behaviors for guided policy optimization” In 37th International Conference on Machine Learning, ICML 2020, 2020, pp. 7445–7454 arXiv: https://proceedings.mlr.press/v119/pacchiano20a.html
  • [27] Walter Rudin “Principles of Mathematical Analysis” McGraw-Hill, 1976 URL: https://www.maa.org/press/maa-reviews/principles-of-mathematical-analysis
  • [28] Matthew W. Allen, John C. Raisbeck and Hakho Lee “Distributed Policy Reward & Strategy Optimization” In Unpublished, 2021 DOI: TBD.
  • [29] Kurt Hornik, Maxwell Stinchcombe and Halbert White “Multilayer feedforward networks are universal approximators” In Neural Networks 2.5, 1989, pp. 359–366 DOI: 10.1016/0893-6080(89)90020-8
  • [30] Mathias Edman and Neil Dhir “Boltzmann Exploration Expectation–Maximisation” In arXiv, 2019 arXiv: https://arxiv.org/abs/1912.08869
  • [31] Ronald J Williams and Jing Peng “Function Optimization Using Connectionist Reinforcement Learning Algorithms” In Connection Science 3.3 Taylor & Francis, 1991, pp. 241–268 DOI: 10.1080/09540099108946587
  • [32] Marc G. Bellemare et al. “Unifying Count-Based Exploration and Intrinsic Motivation” In Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016, pp. 1479–1487 DOI: 10.5555/3157096.3157262
  • [33] Achim Klenke “Probability Theory” Springer-Verlag, 2014 DOI: 10.1007/978-1-4471-5361-0

Appendix A Exploration Methods

This appendix contains a brief review of the exploration methods mentioned in Section 3.

A.1 Undirected Exploration

A.1.1 ε\varepsilon-Greedy

One of the simplest undirected exploration algorithms is the ε\varepsilon-greedy algorithm described in Definition 1.13.

In Definition 1.13, we assumed that aθa_{\theta} was a real function approximator, but ε\varepsilon-Greedy can be applied to a broader range of intermediates. All that is necessary is that the underlying function approximator aθa_{\theta} indicate a single action - referred to as the “greedy” action, a reference to Reinforcement Learning algorithms which explicitly predict the value of actions (see Definition 2.2). In such algorithms, the “greedy action” is the one which is predicted to have the highest value. We call functions with this property deterministic, and the greedy action their deterministic action.

Definition A.1 (ε\varepsilon-Greedy).

An ε\varepsilon-greedy output function renders a deterministic agent stochastic by changing its action with probability ε[0,1]\varepsilon\in[0,1] to one drawn from a uniform distribution UU over the set of actions 𝒜\mathpzc{A}, and retaining the deterministic action with probability 1ε1-\varepsilon:

Oε(aθ(s))={Ogreedy(aθ(s))with probability 1εU(𝒜)with probability ε\displaystyle O_{\varepsilon}(a_{\theta}(s))=\begin{cases}O_{\text{greedy}}(a_{\theta}(s))&\text{with probability }1-\varepsilon\\ U(\mathpzc{A})&\text{with probability }\hskip 8.80005pt\varepsilon\\ \end{cases} (A.1.1)

where OgreedyO_{\text{greedy}} is the output function which takes aθa_{\theta}’s greedy action.121212For notational simplicity we assume that the function underlying aa is parameterized (aθa_{\theta}). This is not necessary to apply the methods of this section.

Remark A.1 (ε=0\varepsilon=0).

Notice that the case ε=0\varepsilon=0 collapses to the deterministic agent aθa_{\theta}, and the case ε=1\varepsilon=1 is the uniformly random agent.
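A minimal Python sketch of such an output function for a finite action set, where action_values stands in for aθ(s)a_{\theta}(s) (one real score per action) and the greedy action is the argmax; the function name is ours.

```python
import numpy as np

def epsilon_greedy_output(action_values, epsilon=0.1,
                          rng=np.random.default_rng()):
    # With probability epsilon, draw a uniformly random action;
    # otherwise take the greedy (deterministic) action.
    if rng.random() < epsilon:
        return int(rng.integers(len(action_values)))
    return int(np.argmax(action_values))
```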

The major benefit of ε\varepsilon-greedy sampling is that in a finite decision process, every path has a non-zero probability (provided ε>0\varepsilon>0). Unfortunately, that can only be accomplished by assigning a diminutive probability to each of those paths. As a path deviates further from the paths generated by the agent aθa_{\theta}, its probability decreases exponentially with each action which deviates from aθa_{\theta}.

That restriction is not necessarily bad for optimization; by visiting paths which require few changes to the actions of aθa_{\theta}, the newly discovered states are nearly accessible to aθa_{\theta}, which may make them more salient to learning algorithms, which typically change locus agents only by small amounts in each epoch.

While ε\varepsilon-Greedy can be applied to processes with finite or infinite sets of actions, the next method, Thompson Sampling, can only be defined for processes with finite sets of actions.

A.1.2 Thompson Sampling and Related Methods

Other major undirected exploration methods operate using a similar mechanism to ε\varepsilon-Greedy, to very different effect. Just like ε\varepsilon-Greedy sampling, Thompson Sampling acts as OO, taking the range of a function aθa_{\theta} to the set of probability distributions of actions. Whereas ε\varepsilon-Greedy produces a distribution which varies only in the agent’s deterministic131313Different learning algorithms approximate different objects; in 𝒬\mathcal{Q}-Learning (Subsection 2.4), aθa_{\theta} approximates the action-value of a state-action pair (s,𝒶)(s,\mathpzc{a}); in policy gradients, its meaning is dependent on the output function OO. action, Thompson Sampling produces a distribution which varies with the agent’s output for each action; unlike ε\varepsilon-Greedy, Thompson Sampling is continuous in the output of aθa_{\theta}.

Definition A.2 (Thompson Sampling).

A Thompson Sampling output function produces a distribution of actions from the output of a real function approximator aθa_{\theta}. Thompson Sampling requires that aθ(s)a_{\theta}(s) be a real vector of dimension |𝒜||\mathpzc{A}|, whose elements are nonnegative and have sum 1; aθ:S{x|𝒜||xi0,xi=1}a_{\theta}:S\rightarrow\{x\in\mathbb{R}^{|\mathpzc{A}|}\,|\,x_{i}\geq 0,\sum x_{i}=1\}. The Thompson Output Function produces the distribution of actions

OThompson(aθ(s))={𝒶1with probability aθ(s)1,𝒶2with probability aθ(s)2,𝒶3with probability aθ(s)3,𝒶|𝒜|with probability aθ(s)|𝒜|.\displaystyle O_{\text{Thompson}}(a_{\theta}(s))=\begin{cases}\mathpzc{a}_{1}&\text{with probability }a_{\theta}(s)_{1},\\ \mathpzc{a}_{2}&\text{with probability }a_{\theta}(s)_{2},\\ \mathpzc{a}_{3}&\text{with probability }a_{\theta}(s)_{3},\\ \vdots\\ \mathpzc{a}_{|\mathpzc{A}|}&\text{with probability }a_{\theta}(s)_{|\mathpzc{A}|}.\\ \end{cases} (A.1.2)

Many function approximators do not naturally produce values which fall in the set of acceptable inputs to OThompsonO_{\text{Thompson}}. Several methods may be employed to make these approximators compatible with Thompson Sampling. One common method is known as Boltzmann Exploration (or as a softmax layer) [1, 37]:

Definition A.3 (Boltzmann Exploration).

A Boltzmann Exploration output function produces a distribution of actions from the output of a real function approximator aθa_{\theta}. It is most easily understood as a “pre-processing” function for Thompson Sampling. Let ρ\rho be a real parameter (ρ\rho is sometimes called temperature [30]). Then,

\displaystyle\text{softmax}:\mathbb{R}^{|\mathpzc{A}|}\rightarrow\{x\in\mathbb{R}^{|\mathpzc{A}|}\,|\,x_{i}\geq 0,\sum x_{i}=1\}, (A.1.3)
\displaystyle\text{softmax}(a_{\theta}(s))_{i}=\frac{e^{a_{\theta}(s)_{i}\rho}}{\sum_{k}e^{a_{\theta}(s)_{k}\rho}}. (A.1.4)

This can be composed with the regular Thompson output function:

OBoltzmann(aθ(s))=OThompson(softmax(aθ(s))).\displaystyle O_{\text{Boltzmann}}(a_{\theta}(s))=O_{\text{Thompson}}(\text{softmax}(a_{\theta}(s))). (A.1.5)

Boltzmann Exploration is among the most common methods of creating functions which are compatible with Thompson Sampling because of its beneficial analytical properties: it is continuous, has a simple derivative (especially important for back-propagation), and guarantees that every action has a non-zero probability.
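A sketch of the composition OBoltzmann = OThompson ∘ softmax for a finite action set; action_values again stands in for aθ(s)a_{\theta}(s), and the max-subtraction is only for numerical stability (it does not change the resulting distribution).

```python
import numpy as np

def boltzmann_thompson_output(action_values, rho=1.0,
                              rng=np.random.default_rng()):
    # softmax(rho * a_theta(s)) gives a distribution with every action at
    # non-zero probability; Thompson Sampling then draws an action from it.
    z = rho * np.asarray(action_values, float)
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```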

A different kind of augmentation of Thompson Sampling and other stochastic output functions is Entropy Maximization [31]. In contrast with Boltzmann Exploration, entropy maximization modifies the learning process itself through the immediate reward function.

Definition A.4 (Entropy Maximization).

Entropy Maximization is a method used with stochastic agents which adds the conditional entropy h(𝒶𝓉|𝒶,𝓈𝓉)h(\mathpzc{a}_{t}|a,s_{t}) of the action with respect to the distribution a(s)a(s) to the immediate reward,

rentropy(ϕt)=r(ϕt)+ρh(𝒶𝓉|𝒶,𝓈𝓉),\displaystyle r_{\text{entropy}}(\phi_{t})=r(\phi_{t})+\rho h(\mathpzc{a}_{t}|a,s_{t}), (A.1.6)

where ρ\rho is a positive real parameter of the optimizer [31].

These entropy bonuses cause the learner to consider both the reward which an agent attains and its propensity to select a diversity of actions. The learner is thus encouraged to consider agents which express greater “uncertainty” in their actions, slowing the convergence of the locus agent.
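
As a minimal sketch of the bonus in (A.1.6) for a finite action set, assuming the agent's action distribution at s_t is available as a probability vector (the name probs and the use of natural logarithms are our assumptions):

import numpy as np

def entropy_bonus_reward(immediate_reward, probs, rho):
    """Augment r(phi_t) with rho * h(a_t | a, s_t), as in (A.1.6).

    probs is the agent's action distribution at s_t; the entropy is
    computed in nats, and rho is the optimizer's parameter.
    """
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                         # terms with zero probability contribute nothing
    entropy = -np.sum(p * np.log(p))
    return immediate_reward + rho * entropy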

With respect to dynamic exploration, there is little difference between ε\varepsilon-Greedy and Thompson Sampling. Both algorithms explore the process by selecting agents which allow them to experience unexplored aspects of the process. From the naïve perspective, this similarity is overshadowed by a difference in their analytical properties: in finite problems, Thompson Sampling agents act according to distributions drawn from a continuous set, whereas ε\varepsilon-Greedy agents act according to one of a [modified] finite set of distributions.

The exploration methods discussed in this section are fairly homogeneous, precisely because they are undirected; the only way to explore without direction is to inject stochasticity into the optimization process. Conversely, the methods of the next section are considerably more diverse; there are many ways to direct an explorative process.

A.2 Directed Exploration

The variety of directed exploration methods makes the genre difficult to summarize. Perhaps the simplest description of directed methods is as the complement of undirected methods. Undirected exploration methods use exclusively stochastic means to explore; they do not incorporate any information specific to the process. Directed exploration methods thus include any exploration method which does incorporate such information [14]. This section describes two major families of directed exploration methods, count-based and prediction-based, through a pair of representative methods [17]. We begin with count-based exploration, exemplified by #Exploration [15].

Count-based methods [15, 32] count the number of times that each state (or state-action pair, see Subsection 2.4) has been observed in the course of learning, and use that count to encourage visitation of rarely visited states. Count-based algorithms have appealing guarantees in finite processes [32], but lose those guarantees in infinite settings. Despite this, count-based exploration continues to inspire exploration techniques in the infinite setting. #Exploration is a recent method which discretizes infinite problems, imitating traditional count-based methods.

Definition A.5 (#Exploration).

#Exploration [15] is an algorithm which augments the immediate reward function with an exploration bonus in the same manner as the entropy bonus of Definition A.4. However, instead of counting visits to states directly, #Exploration counts visits to hash codes. The hash codes are generated by a hashing function H(s)H(s) which discretizes an unmanageable (i.e. large or infinite) set of states into a manageable finite set of hash codes. Using these hash codes as a proxy for states, #Exploration assigns its exploration bonus in much the same way as a traditional count-based method:

r#(a,s)=r(a,s)+ρn(H(s)).\displaystyle r_{\#}(a,s)=r(a,s)+\frac{\rho}{\sqrt{n(H(s))}}. (A.2.1)

Here, r#r_{\#} is the combination of the immediate reward function and the exploration bonus for that state, nn is the state-count function, a tally of the number of times that a state with the same hash code as ss, H(s)H(s), has been visited, and ρ\rho is a positive real number. #Exploration pursues its goal as a count-based method by assigning greater exploration bonuses to states which have been visited fewer times.
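
The following sketch illustrates a count-based bonus of the form (A.2.1); the rounding-based hash code is a simple stand-in for the static or learned hashes of [15], and the class structure is our own.

import numpy as np
from collections import defaultdict

class HashCountBonus:
    """Count-based exploration bonus over hash codes, in the spirit of (A.2.1)."""

    def __init__(self, rho, precision=1):
        self.rho = rho
        self.precision = precision       # coarseness of the stand-in discretization
        self.counts = defaultdict(int)   # n(H(s)), the per-code visit tally

    def hash_code(self, state):
        # Stand-in for H(s): round a real-valued state vector to a hashable tuple.
        return tuple(np.round(np.asarray(state, dtype=float), self.precision))

    def reward(self, immediate_reward, state):
        code = self.hash_code(state)
        self.counts[code] += 1           # count the visit before computing the bonus
        return immediate_reward + self.rho / np.sqrt(self.counts[code])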

The next class of exploration methods in this section is called prediction-based exploration. Whereas count-based methods estimate the new information that an action will collect with a measurement of the learner’s experience of each state, prediction-based methods attempt to estimate the quality, rather than the mere quantity of the collected information. To do this, they employ a separate modeling method which predicts the next state of the process. The better that prediction is, the higher the quality of the information which the learner has about that part of the process.

Definition A.6 (Incentivizing Exploration [16]).

In Incentivizing Exploration[16], Stadie et al. estimate the quality of information which the executor has gathered by using that information to train a dynamics model \mathcal{M} to estimate the next state st+1s_{t+1} from (st,𝒶𝓉)(s_{t},\mathpzc{a}_{t})141414This algorithm uses “state encodings”, similar to the hash codes of #Exploration, rather than states.. They reason that if \mathcal{M} accurately estimates the next state (as measured by the distance between (st,𝒶𝓉)\mathcal{M}(s_{t},\mathpzc{a}_{t}) and st+1s_{t+1}), then the executor has gathered better information about that state-action pair. Thus, they assign exploration bonuses so as to encourage consideration of agents which visit state-action pairs for which the distance (st,𝒶𝓉)𝓈𝓉+1||\mathcal{M}(s_{t},\mathpzc{a}_{t})-s_{t+1}|| is large:

rStadie(st,𝒶𝓉)=𝓇(𝓈𝓉,𝒶𝓉)+ρ(𝓈𝓉,𝒶𝓉)𝓈𝓉+1𝓉𝒞,\displaystyle r_{\text{Stadie}}(s_{t},\mathpzc{a}_{t})=r(s_{t},\mathpzc{a}_{t})+\rho\frac{||\mathcal{M}(s_{t},\mathpzc{a}_{t})-s_{t+1}||}{tC}, (A.2.2)

where ρ\rho is a positive constant, and CC is a decay constant (i.e. an increasing function of the learning epoch).
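
A schematic sketch of a prediction-error bonus of this kind appears below; the model and decay interfaces are placeholders of ours, not the implementation of [16].

import numpy as np

def prediction_error_bonus(immediate_reward, model, state, action,
                           next_state, rho, epoch, decay):
    """Schematic bonus in the spirit of (A.2.2).

    model(state, action) is any next-state predictor standing in for M,
    and decay(epoch) stands in for the growing denominator t * C; both
    interfaces are assumptions of this sketch.
    """
    error = np.linalg.norm(np.asarray(model(state, action)) - np.asarray(next_state))
    return immediate_reward + rho * error / decay(epoch)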

Appendix B Behavior Functions in the Literature

Many of the behavior functions which have been proposed have been influenced by the behavior functions of Lehman and Stanley’s initial work, and by their advice on the subject in Abandoning Objectives: Evolution Through the Search for Novelty Alone:

Although generally applicable, novelty search is best suited to domains with deceptive fitness landscapes, intuitive behavioral characterizations, and domain constraints on possible expressible behaviors. - Lehman and Stanley [7, 200]

This passage provides important insight for those who wish to apply Novelty Search to new domains in the tradition of Lehman and Stanley, but their suggestions also make it difficult to analyze Novelty Search independently of the choice of behavior function. This is especially problematic for the open-ended use of Novelty Search; without a general notion of behavior, there are few options for comparing behavior functions to one another or evaluating them in absolute terms. In a given decision process, one can compare the outcomes of Novelty Search processes with different behavior functions by considering the diversity of behaviors which they produce, but that diversity must itself be measured by one of the behavior functions under comparison, or by yet another notion of behavior. One could instead consider each behavior function’s propensity to find high-quality agents, but this is a return to the just-abandoned objectives.

Lehman and Stanley are forced to compare their behavior functions in the maze environment along these lines. In a discussion of the degrees of “conflation” (assignment of the same behavior to different agents) present in their behavior functions, they write: “[I]f some dimensions of behavior are completely orthogonal to the objective, it is likely better to conflate all such orthogonal dimensions of behavior together rather than explore them.” To address these issues and make it possible to apply Novelty Search to a wider range of decision processes, several authors have considered general behavior functions [21, 22, 23, 24, 25].

The rest of this appendix provides a brief survey of some behavior functions present in the literature, beginning with the specific functions of Abandoning Objectives [7], and proceeding to general behavior functions, including those of [21, 22].

B.1 The Behavior Functions of Abandoning Objectives

The main decision processes in Abandoning Objectives are two-dimensional mazes. Lehman and Stanley consider several behavior functions in this environment, all of which are based on the position of the agent. The primary behavior function they consider is what we call the final position behavior function:

Definition B.1 (Final Position Behavior).
B(a)=ptmax\displaystyle B(a)=p_{t_{\max}} (B.1.1)

Where ptmaxp_{t_{\max}} is the position at the final time in a sampled truncated trajectory ϕtmax\phi_{t_{\max}}.

They consider another positional behavior function: the position of the agent over time.

Definition B.2 (Position Over Time Behavior).
B(a)=(pt1,pt2,,ptN1,ptN)\displaystyle B(a)=(p_{t_{1}},p_{t_{2}},...,p_{t_{N-1}},p_{t_{N}}) (B.1.2)

For 0ti<ti+1tmax0\leq t_{i}<t_{i+1}\leq t_{\max}

As noted in footnote 8, while these are functions into 2\mathbb{R}^{2} in the case of final position behavior, and 2N\mathbb{R}^{2N} in the case of position over time behavior, neither 2\mathbb{R}^{2} nor 2N\mathbb{R}^{2N} is treated as a metric space. Instead, both are equipped with a symmetric [19, 23]: the square of the usual Euclidean distance.
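
As a concrete sketch, these two behavior functions and their squared-Euclidean symmetric might be written as follows; the positions list (one planar position per time step of a sampled truncated trajectory) is an assumed interface of ours.

import numpy as np

def final_position_behavior(positions):
    """B(a) = p_{t_max}: the position at the final sampled time, as in (B.1.1)."""
    return np.asarray(positions[-1], dtype=float)

def position_over_time_behavior(positions, sample_times):
    """B(a) = (p_{t_1}, ..., p_{t_N}): positions at chosen times t_i, as in (B.1.2)."""
    return np.concatenate([np.asarray(positions[t], dtype=float) for t in sample_times])

def squared_euclidean(b1, b2):
    """The symmetric used on these behaviors: the square of the Euclidean distance."""
    diff = np.asarray(b1, dtype=float) - np.asarray(b2, dtype=float)
    return float(np.dot(diff, diff))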

These examples reveal the intuitions about behavior which Lehman and Stanley relied upon to implement Novelty Search. First, rather than reflecting the actions of an agent alone, both of these behavior functions reflect the results of the agent’s interaction with the process; in fact, they reflect the position of the agent, a function of the state. Second, these behavior functions are distilled, considering only one or a few points in time tit_{i}.

In the other environment, Biped Locomotion, Lehman and Stanley take a different approach to selecting the times tit_{i}, opting to collect spatial information once per simulated second. Explaining that difference, they write: “Unlike in the maze domain, temporal sampling is necessary because the temporal pattern is fundamental to walking.” This is a strange argument, since reinforcement learning problems are defined by their temporality (see Definition 1.1).

B.2 General Behavior Functions

Since the publication of [18], Lehman and Stanley’s first paper on Novelty Search, many authors have sought general notions of behavior [21, 22, 23, 24, 25]. This section analyzes several of these behavior functions. Let us begin with a simple notion of distance on the set of agents which is defined for any method using a parameterized agent:

Example B.1 (The distance of θ\theta).

Consider two agents, aa and bb, represented by function approximators of the same form. Assume that they are parameterized by an ordered list of real numbers, and let their parameters be θa\theta_{a} and θb\theta_{b}. Then,

B(aθa)=θa\displaystyle B(a_{\theta_{a}})=\theta_{a} (B.2.1)

is a behavior function and

d(B(a),B(b))=θaθb\displaystyle d(B(a),B(b))=||\theta_{a}-\theta_{b}|| (B.2.2)

is a metric on this set of behaviors.

Although it is a metric on the set of parameter vectors, this distance is unsatisfactory in several ways; in particular, it does not satisfy the indiscernibility of identicals under the quotient operation of Definition 5.6: it can assign an agent a non-zero distance from itself if the agent can be parameterized by two different sets of parameters (see Subsection 5.6 and [24]).

An early work of Gomez et al. [21] introduced a behavior function which maps agents to a concatenation of truncated trajectories. They then use the normalized compression distance (NCD) as a metric on this set of finite sequences.

Definition B.3 (Gomez et al. Behavior).

Gomez et al. [21] define the behavior of an agent as a concatenation of a number of observed truncated paths,

B(a)=ϕa,1n1|ϕa,2n2|\displaystyle B(a)=\phi_{a,1}^{n_{1}}|\phi_{a,2}^{n_{2}}|... (B.2.3)

As the distance on this set of behaviors B(A)B(A), Gomez et al. use the normalized compression distance NCD(B(a),B(b))\text{NCD}(B(a),B(b)), which approximates the normalized information distance between a pair of strings:

NCD(a,b)=C(B(a)|B(b))min{C(B(a)),C(B(b))}max{C(B(a)),C(B(b))}.\displaystyle\text{NCD}(a,b)=\frac{C(B(a)|B(b))-\min\{C(B(a)),C(B(b))\}}{\max\{C(B(a)),C(B(b))\}}. (B.2.4)

Where CC is the length of the compressed sequence.
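
A minimal sketch of a compression-based distance in this spirit is given below, using zlib and the common concatenation approximation in place of the conditional term C(B(a)|B(b)); it is an illustration of ours, not the implementation of [21].

import zlib

def compressed_length(data: bytes) -> int:
    """C(x): the length of x after compression."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_length(x), compressed_length(y)
    cxy = compressed_length(x + y)          # concatenation in place of the conditional term
    return (cxy - min(cx, cy)) / max(cx, cy)

# Usage: serialize each concatenated truncated path B(a) to bytes, then compare.
# d = ncd(serialize(B_a), serialize(B_b))   # serialize() is a hypothetical encoder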

Appendix C Agent Spaces In Practice

While the agent space is in general not a metric space (see Section 5, Subsection 5.7, and Subsection C.1), this does not proscribe its use in e.g. Novelty Search, which has long used metric-adjacent spaces (see footnote 8). In an upcoming work [28], we describe an approach to the Reinforcement Learning problem based on an extension of Evolution Strategies [13] which combines naïve reward and Novelty Search [7] of the agent space, selecting the locus agent as the perspective for comparisons during each epoch, to solve a variety of reinforcement learning problems (see Subsection C.3).

We consider Novelty Search and the Agent Space to be naïve artifacts, but we cannot a priori restrict them to naïve learning methods. For example, [25] uses several versions of primitive behavior (see Subsection 4.2) to explain “reward behavior correlation[s]”, showing that agents which are similar under certain primitive behavior functions perform similarly. In Section 7 we complete this line of inquiry by demonstrating analytically that reward is a continuous function of the agent in the agent space. The reasoning behind the agent space in Definition 5.6 also provides a clean explanation for the observation of [25] that certain states may be totally unimportant to performance.

C.1 When dad_{a} is Equivalent to dbd_{b}

While local distances do not generally produce homeomorphic topologies, it is worth noting that in many basic problems in the literature the local distances of all agents possible in the process do produce homeomorphic topologies, especially in the Markov and strictly Markov cases. Let us begin by considering the equivalence of the measures underlying local distances:

Definition C.1 (Equivalence of Measures).

A pair of measures μ\mu, ν\nu on a common measurable space (M,Σ)(M,\Sigma) are said to be equivalent iff [33, 157]

XΣ(μ(X)=0ν(X)=0)\displaystyle\operatorname{\forall}X\in\Sigma(\mu(X)=0\iff\nu(X)=0) (C.1.1)

When the measures underlying dad_{a} and dbd_{b}, Φa\Phi^{a} and Φb\Phi^{b}, are equivalent, they produce equivalent topologies. In general, this is rare. In most decision processes, some paths can only be visited by a subset of agents. However, there are certain circumstances where these distances are guaranteed to be equivalent.

Clearly, if two agents differ with non-zero probability, then the distributions of truncated paths which they produce must also differ. However, this does not apply when only Markov agents are considered (see Definition 1.10). In this case, if the distributions of states, rather than truncated paths, are equivalent, then the distances produce equivalent topologies.

Remark C.1 (Notation for the Probability of a State).

The next few results require a simple notation for the probability of a state occurring in a distribution of paths. This is complicated by the fact that any path contains infinitely many states, so that the sum of the probabilities of a state occurring at each time tt\in\mathbb{N} might be infinite. To resolve this, we weight the probability of a state at time tt by ω(t)\omega(t) and then normalize these weighted probabilities with Ω\Omega. Let

(s|a)=t(ϕt(s)=s)ω(t)Ω.\displaystyle\mathbb{P}(s\,|\,a)=\frac{\sum_{t\in\mathbb{N}}\mathbb{P}(\phi_{t}(s)=s)\omega(t)}{\Omega}. (C.1.2)
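
As an illustration, the probability defined in (C.1.2) can be estimated by Monte Carlo sampling of truncated paths; the sketch below assumes hashable states, geometric weights ω(t) = γ^t truncated at a finite horizon, and hypothetical reset, step, and act interfaces.

import numpy as np
from collections import defaultdict

def estimate_state_probability(reset, step, act, gamma=0.99, t_max=200, n_paths=1000):
    """Monte Carlo estimate of P(s | a), as in (C.1.2), with omega(t) = gamma**t.

    reset() samples an initial state, step(s, action) samples the next state,
    and act(s) samples the agent's action; all three interfaces are hypothetical.
    The weights are truncated at t_max.
    """
    weights = gamma ** np.arange(t_max)
    omega = weights.sum()                    # the normalizer Omega (truncated)
    mass = defaultdict(float)
    for _ in range(n_paths):
        s = reset()
        for t in range(t_max):
            mass[s] += weights[t]
            s = step(s, act(s))
    return {state: m / (omega * n_paths) for state, m in mass.items()}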

With this notation, we can state the following.

Theorem C.1 (A Condition for the Equivalence of dad_{a} and dbd_{b}).

In a Markov decision process with a finite set of states, the local distances dad_{a} and dbd_{b} are equivalent whenever the distributions of states in Φa\Phi^{a} and Φb\Phi^{b} are equivalent.

This theorem has several important manifestations. Let us consider the case where the local distances of all agents possible in the process are equivalent:

Lemma C.1 (Conditions for the Equivalence of all Local Distances).

The most general condition for the equivalence of all local distances is

aAsS((s|a)>0).\displaystyle\operatorname{\forall}a\in A\operatorname{\forall}s\in S(\mathbb{P}(s\,|\,a)>0). (C.1.3)

There are two common conditions which are more specific but easier to verify, which may help with the application of this result. First, if every transition probability is greater than zero, then certainly the probability of each state in the distribution of paths is greater than 0:

sSϕttΦt((s|σ(ϕt))>0).\displaystyle\operatorname{\forall}s\in S\,\operatorname{\forall}\phi_{t}\in\bigcup_{t\in\mathbb{N}}\Phi_{t}(\mathbb{P}(s\,|\,\sigma(\phi_{t}))>0). (C.1.4)

Even more specifically, but also easier to test, if the probability of each state at the beginning of the process is non-zero, then the probability of each state in the distribution of paths is greater than 0:

sS((s|σ0)>0).\displaystyle\operatorname{\forall}s\in S(\mathbb{P}(s\,|\,\sigma_{0})>0). (C.1.5)

Of course, this result is of little importance if these local distances cease to be useful for the analysis of the underlying decision process. By construction, these equivalent local distances are relevant only to Markov processes. Importantly, these local distances cannot detect differences in distributions of paths which are not caused by differences in distributions of states, or more specifically of state-action pairs. Thus, these local distances cannot guarantee the continuity of all of the functions considered in Section 7, but they do apply to the cumulative reward functions described by (1.1.118).

C.2 Deterministic Agents

The theorems of Section 6 rely on stochastic agents to justify the topology of the Agent Space. This reliance stems from the fact that we are concerned in that section primarily with distributions of paths. Because paths consist of sequences of states and actions, distributions of paths can only approach one another (with respect to total variation) if, in response to a single truncated prime path ϕt\phi^{\prime}_{t}, a distribution of actions is taken.

However, this does not mean that the agent space is useless when deterministic agents are considered. For example, the distributions of states mentioned in Subsection C.1 may be continuous even in an agent space composed entirely of deterministic agents, via the stochasticity of the state-transition function. Thus, if the set of actions is connected and for every truncated prime path ϕt\phi^{\prime}_{t} the state-transition distribution is a continuous function of the final action 𝒶𝓉\mathpzc{a}_{t}, then the distributions of states are continuous in the agent space. Then, an immediate reward function which considers only the state would also be continuous in the agent space.

C.3 Sketch of the Novelty Search Methods of [28]

In an upcoming work, we use the Agent Space in conjunction with Novelty Search to develop a distributed optimization algorithm for Reinforcement Learning problems. Our method evaluates candidate agents with a path collected by the locus agent at each training epoch, resulting in a non-stationary objective that encourages the agent to behave in ways that it has not yet behaved, on states that it can currently encounter. The following pseudo-code is a summary of this method.

Algorithm 1 Strategy & Reward Optimization Algorithm
1:for each epoch do
2:  Set batch At:=A_{t}:=\emptyset
3:  for desired batch size do
4:   Set ε𝒩(0,I)\varepsilon\sim\mathcal{N}(0,I)
5:   Gather a path ϕ\phi and cumulative reward R(ϕ)R(\phi) with at+εa_{t}+\varepsilon
6:   Set N(at+ε)min0<it[dat(at+ε,ati)]N(a_{t}+\varepsilon)\leftarrow\underset{0<i\leq t}{\min}[d_{a_{t}}(a_{t}+\varepsilon,a_{t-i})] for prior epochs tit-i
7:   Append (R(ϕ),N(at+ε))(R(\phi),N(a_{t}+\varepsilon)) to AtA_{t}   
8:  Compute atG(at):=atR(at)+atN(at)\nabla_{a_{t}}G(a_{t}):=\nabla_{a_{t}}R(a_{t})+\nabla_{a_{t}}N(a_{t}) via Finite Differences with AtA_{t}.
9:  Update ata_{t} by following atG(at)\nabla_{a_{t}}G(a_{t})

We approximate dat(at+ε,ati)d_{a_{t}}(a_{t}+\varepsilon,a_{t-i}) by evaluating candidate agents on a set of states that we gather every epoch. To do this, we follow the agent ata_{t} in the decision process until KK total states have been encountered, then store them in a set denoted ζ\zeta. The distance dat(at+ε,ati)d_{a_{t}}(a_{t}+\varepsilon,a_{t-i}) can then be approximated by evaluating the responses of at+εa_{t}+\varepsilon and atia_{t-i} on only ζ\zeta, rather than by integration on Φat\Phi^{a_{t}}.
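
The sketch below is a schematic Python rendering of Algorithm 1 together with the ζ-based approximation of the local distance; parameter vectors stand in for agents, and episode_reward, collect_states, and respond are hypothetical helpers of ours, not part of [28]. The gradient estimate follows the usual Evolution Strategies finite-difference convention [13], and the score standardization is a common ES stabilization rather than a detail of the method.

import numpy as np

def approx_local_distance(theta_a, theta_b, zeta, respond):
    """Approximate d_{a_t}(a, b) by comparing the agents' responses on the stored states zeta."""
    return float(np.mean([np.linalg.norm(respond(theta_a, s) - respond(theta_b, s))
                          for s in zeta]))

def train(theta, episode_reward, collect_states, respond,
          epochs=100, batch=50, sigma=0.1, alpha=0.01, K=64):
    """Schematic version of Algorithm 1 (Evolution Strategies with a novelty term)."""
    history = []                                  # locus agents of prior epochs
    for _ in range(epochs):
        zeta = collect_states(theta, K)           # K states visited by the locus agent a_t
        perturbations, scores = [], []
        for _ in range(batch):
            eps = sigma * np.random.randn(*theta.shape)
            candidate = theta + eps               # a_t + epsilon
            R = episode_reward(candidate)         # cumulative reward R(phi)
            N = min((approx_local_distance(candidate, old, zeta, respond)
                     for old in history), default=0.0)
            perturbations.append(eps)
            scores.append(R + N)                  # combined objective G = R + N
        scores = np.asarray(scores)
        scores = (scores - scores.mean()) / (scores.std() + 1e-8)   # standardize (an assumption)
        # Finite-difference (ES-style) estimate of grad_theta G(theta).
        grad = sum(w * e for w, e in zip(scores, perturbations)) / (batch * sigma)
        history.append(theta.copy())
        theta = theta + alpha * grad
    return theta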