
Intrinsic Exploration as Multi-Objective RL

\namePhilippe Morere \emailphilippe.morere@sydney.edu.au
\addrThe University of Sydney,
Sydney, Australia
\AND\nameFabio Ramos \emailfabio.ramos@sydney.edu.au
\addrThe University of Sydney & NVIDIA,
Sydney, Australia
Abstract

Intrinsic motivation enables reinforcement learning (RL) agents to explore when rewards are very sparse, where traditional exploration heuristics such as Boltzmann or ϵ-greedy would typically fail. However, intrinsic exploration is generally handled in an ad-hoc manner, where exploration is not treated as a core objective of the learning process; this weak formulation leads to sub-optimal exploration performance. To overcome this problem, we propose a framework based on multi-objective RL in which exploration and exploitation are optimized as separate objectives. This formulation brings the balance between exploration and exploitation to the policy level, resulting in advantages over traditional methods. It also allows exploration to be controlled while learning, at no extra cost, achieving a degree of control over agent exploration that was previously unattainable with classic or intrinsic rewards. We demonstrate scalability to continuous state-action spaces by presenting a method (EMU-Q) based on our framework, which guides exploration towards regions of higher value-function uncertainty. EMU-Q is experimentally shown to outperform classic exploration techniques and other intrinsic RL methods on a continuous control benchmark and on a robotic manipulator.

Keywords: Reinforcement Learning, Robotics, Exploration, Online Learning, Multi-Objective

1 Introduction

In Reinforcement Learning (RL), data-efficiency and learning speed are paramount. (This paper extends previous work published as Morere and Ramos (2018).) Indeed, when interacting with robots, humans, or the real world, data can be extremely scarce and expensive to collect. Improving data-efficiency is therefore of the utmost importance for applying RL to interesting and realistic applications. Learning from little data is easier when rewards are dense, as these can be used to guide exploration. In most realistic problems however, defining dense reward functions is non-trivial, requires expert knowledge, and demands much fine-tuning. In some cases (e.g. when dealing with humans), definitions for dense rewards are unclear and remain an open problem. This greatly hinders the applicability of RL to many interesting problems.

It appears more natural to reward robots only when reaching a goal, termed goal-only rewards, which becomes trivial to define Reinke et al. (2017). Goal-only rewards, defined as a unit reward for reaching a goal and zero elsewhere, cause classic exploration techniques based on random walks such as ϵ-greedy and control input noise Schulman et al. (2015), or optimistic initialization, to become highly inefficient. For example, Boltzmann exploration Kaelbling et al. (1996) requires training time exponential in the number of states Osband et al. (2014). Such a data requirement is unacceptable in real-world applications. Most solutions to this problem rely on redesigning rewards to avoid dealing with the problem of exploration. Reward shaping helps learning Ng et al. (1999), and translating rewards to negative values triggers optimism in the face of uncertainty Kearns and Singh (2002); Brafman and Tennenholtz (2002); Jaksch et al. (2010). This approach suffers from two shortcomings: proper reward design is difficult and requires expert knowledge, and improper reward design often leads to unexpected learned behaviour.

Figure 1: Left: classic intrinsic exploration setup as proposed in Chentanez et al. (2005). Right: intrinsic exploration formulated as multi-objective RL

Intrinsic motivation proposes a different approach to exploration by defining an additional guiding reward; see Figure 1 (left). The exploration reward is typically added to the original reward, which makes rewards dense from the agent’s perspective. This approach has had many successes Bellemare et al. (2016); Fox et al. (2018) but suffers from several limitations. For example, the weighting between exploration and exploitation must be chosen before learning and remains fixed thereafter. Furthermore, in the model-free setting, state-action value functions are learned from non-stationary targets mixing exploration and exploitation, making learning less data-efficient.

To solve the problem of data-efficient exploration in goal-only reward settings, we propose to leverage advances in multi-objective RL Roijers et al. (2013). We formulate exploration as one of the core objectives of RL by explicitly integrating it to the loss being optimized. Following the multi-objective RL framework, agents optimize for both exploration and exploitation as separate objectives. This decomposition can be seen as two different RL agents, as shown in Figure 1 (right). Contrary to most intrinsic RL approaches, this formulation keeps the exploration-exploitation trade-off at a policy level, as in traditional RL. This allows for several advantages: (i) Weighting between objectives can be adapted while learning, and strategies can be developed to change exploration online; (ii) Exploration can be stopped at any time at no extra cost, yielding purely exploratory behaviour immediately; (iii) Inspection of exploration status is possible, and experimenters can easily generate trajectories for exploration or exploitation only.

Our contributions are the following:

  • We propose a framework based on multi-objective RL for treating exploration as an explicit objective, making it core to the optimization problem.

  • This framework is experimentally shown to perform better than classic additive exploration bonuses on several key exploration characteristics.

  • Drawing inspiration from the fields of bandits and Bayesian optimization, we give strategies for taking advantage of and tuning the exploration-exploitation balance online. These strategies achieve a degree of control over agent exploration that was previously unattainable with classic additive intrinsic rewards.

  • We present a data-efficient model-free RL method (EMU-Q) for continuous state-action goal-only MDPs based on the proposed framework, guiding exploration towards regions of higher value-function uncertainty.

  • EMU-Q is experimentally shown to outperform classic exploration techniques and other methods with additive intrinsic rewards on a continuous control benchmark.

In the following, Section 2 reviews background on Markov decision processes, intrinsic motivation RL, multi-objective RL and related work. Section 3 defines a framework for explicit exploration-exploitation balance at a policy level, based on multi-objective RL. Section 4 presents advantages and strategies for controlling this balance during the agent learning process. Section 5 formulates EMU-Q, a model-free data-efficient RL method for continuous state-action goal-only MDPs, based on the proposed framework. Section 6 presents experiments that evaluate EMU-Q’s exploration capabilities on classic RL problems and a simulated robotic manipulator. EMU-Q is further evaluated against other intrinsic RL methods on a continuous control benchmark. We conclude with a summary in Section 7.

2 Preliminaries

This section reviews basics on Markov decision processes, intrinsic motivation RL, multi-objective RL and related work.

2.1 Markov Decision Processes

A Markov decision process (MDP) is defined by the tuple ⟨𝒮, 𝒜, T, R, γ⟩. 𝒮 and 𝒜 are spaces of states s and actions a respectively. The transition function T: 𝒮 × 𝒜 × 𝒮 → [0,1] encodes the probability of transitioning to state s′ when executing action a in state s, i.e. T(s,a,s′) = p(s′|s,a). The reward distribution R of support 𝒮 × 𝒜 × 𝒮 defines the reward r associated with transition (s,a,s′). In the simplest case, goal-only rewards are deterministic: a unit reward is given for absorbing goal states, a potential negative unit reward for penalized absorbing states, and zero reward elsewhere. γ ∈ [0,1) is a discount factor. Solving an MDP is equivalent to finding the optimal policy π* starting from s₀:

\pi^{*} = \arg\max_{\pi}\mathbb{E}_{T,R,\pi}\left[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\right], \qquad (1)

with a_{i}\sim\pi(s_{i}), s_{i+1}\sim T(s_{i},a_{i},\cdot), and r_{i}\sim R(s_{i},a_{i},s_{i+1}). Model-free RL learns an action-value function Q, which encodes the expected long-term discounted value of a state-action pair

Q(s,a) = \mathbb{E}_{T,R,\pi}\left[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\right]. \qquad (2)

Equation 2 can be rewritten recursively, also known as the Bellman equation

Q(s,a) = \mathbb{E}_{R}[R(s,a,s^{\prime})] + \gamma\mathbb{E}_{s^{\prime},a^{\prime}|s,a}[Q(s^{\prime},a^{\prime})], \qquad (3)

with s^{\prime}\sim p(s^{\prime}|s,a) and a^{\prime}\sim\pi(s^{\prime}), which is used to iteratively refine models of Q based on transition data.
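As a hedged illustration of such an iterative refinement, the sketch below applies a standard tabular Q-learning update to a single transition; the tabular representation and the learning rate are assumptions made purely for illustration, not part of the formulation above.

```python
import numpy as np

# Minimal sketch: one tabular Q-learning update derived from the Bellman
# equation (3). Q is a (num_states, num_actions) array; s, a, s_next are indices.
def q_update(Q, s, a, r, s_next, gamma=0.99, lr=0.1):
    td_target = r + gamma * np.max(Q[s_next])  # bootstrapped target
    Q[s, a] += lr * (td_target - Q[s, a])      # move the estimate towards it
    return Q
```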

2.2 Intrinsic RL

While classic RL typically carries out exploration by adding randomness at a policy level (e.g. random actions, posterior sampling), intrinsic RL focuses on augmenting rewards with an exploration bonus. This approach was presented in Chentanez et al. (2005), in which agents aim to maximize a total reward r_{total} for transition (s,a,r,s^{\prime}):

r_{total} = r + \xi r^{e}, \qquad (4)

where r^{e} is the exploration bonus and ξ a user-defined parameter weighting exploration. The second term encourages agents to select state-action pairs for which they previously received high exploration bonuses. The definition of r^{e} has been the focus of much recent theoretical and applied work; examples include model prediction error Stadie et al. (2015) or information gain Little and Sommer (2013).
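As a concrete sketch of this additive scheme, the snippet below folds a simple count-based bonus into the reward as in Equation 4; the specific bonus is an illustrative assumption, since Equation 4 is agnostic to how r^e is computed.

```python
from collections import defaultdict

# Sketch of the additive intrinsic-reward scheme of Equation 4, with a simple
# count-based bonus as an illustrative (hypothetical) choice of r^e.
class AdditiveIntrinsicReward:
    def __init__(self, xi=0.1):
        self.xi = xi
        self.counts = defaultdict(int)

    def __call__(self, s, a, r):
        self.counts[(s, a)] += 1
        r_explore = 1.0 / self.counts[(s, a)] ** 0.5  # decays with visitation
        return r + self.xi * r_explore                # r_total = r + xi * r^e
```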

While this formulation enables exploration in well behaved scenarios, it suffers from multiple limitations:

  • Exploration bonuses are designed to reflect the information gain at a given time of the learning process. They are initially high, and typically decrease as more transitions are experienced, making them a non-stationary target. Updating Q with non-stationary targets results in higher data requirements, especially when environment rewards are stationary.

  • The exploration bonus given for reaching new areas of the state-action space persists in the estimate of Q. As a consequence, agents tend to over-explore and may be stuck oscillating between neighbouring states.

  • There is no dynamic control over the exploration-exploitation balance, as changing parameter ξ only affects future total rewards. Furthermore, it would be desirable to control generating trajectories for pure exploration or pure exploitation, as these two quantities may conflict.

This work presents a framework for enhancing intrinsic exploration, which does not suffer from the previously stated limitations.

2.3 Multi-Objective RL

Multi-objective RL seeks to learn policies solving multiple competing objectives by learning how to solve for each objective individually Roijers et al. (2013). In multi-objective RL, the reward function describes a vector of n rewards instead of a scalar. The value function also becomes a vector 𝑸 defined as

\bm{Q}(s,a) = \mathbb{E}_{T,R,\pi}\left[\sum_{i=0}^{\infty}\gamma^{i}\bm{r}_{i}\right], \qquad (5)

where 𝒓ᵢ is the vector of rewards at step i, in which each coordinate corresponds to one objective. For simplicity, the overall objective is often expressed as the sum of all individual objectives; 𝑸 can be converted to a scalar state-action value function with a linear scalarization function: Q_{\bm{\omega}}(s,a) = \bm{\omega}^{T}\bm{Q}(s,a), where 𝝎 are weights governing the relative importance of each objective.

The advantage of the multi-objective RL formulation is to allow learning policies for all combinations of 𝝎, even if the balance between each objective is not explicitly defined prior to learning. Moreover, if 𝝎 is a function of time, policies for new values of 𝝎 are available without additional computation. Conversely, with traditional RL methods, a pass through the whole dataset of transitions would be required.
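A minimal sketch of linear scalarization is given below; the array shapes and the discrete state-action setting are assumptions made purely for illustration.

```python
import numpy as np

# Sketch of linear scalarization: Q_vec holds one value function per objective
# (shape: num_objectives x num_states x num_actions); the weights omega can be
# changed at any time without relearning the per-objective value functions.
def scalarized_greedy_action(Q_vec, s, omega):
    q_w = np.tensordot(omega, Q_vec[:, s, :], axes=1)  # omega^T Q(s, .)
    return int(np.argmax(q_w))
```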

2.4 Related Work

Enhancing exploration with additional rewards can be traced back to the work of Storck et al. (1995) and Meuleau and Bourgine (1999), in which information acquisition is dealt with in an active manner. This type of exploration was later termed intrinsic motivation and studied in Chentanez et al. (2005). This field has recently received much attention, especially in the context of very sparse or goal-only rewards Reinke et al. (2017); Morere and Ramos (2018) where traditional reward functions give too little guidance to RL algorithms.

Extensive intrinsic motivation RL work has focused on domains with simple or discrete spaces, proposing various definitions for exploration bonuses. Starting from a review of intrinsic motivation in psychology, the work of Oudeyer and Kaplan (2008) presents a definition based on information theory. Maximizing the predicted information gain from taking specific actions is the focus of Little and Sommer (2013), applied to learning in the absence of external reward feedback. Using approximate value function variance as an exploration bonus was proposed in Osband et al. (2016). In the context of model-based RL, exploration based on model learning progress Lopes et al. (2012) and model prediction error Stadie et al. (2015); Pathak et al. (2017) was proposed. State visitation counts have been widely investigated, in which an additional model counting previous state-action pair occurrences guides agents towards less visited regions. Recent successes include Bellemare et al. (2016); Fox et al. (2018). An attempt at generalizing counter-based exploration to continuous state spaces was made in Nouri and Littman (2009), by using regression trees to achieve multi-resolution coverage of the state space. Another pursuit of scaling visitation counters to large and continuous state spaces was made in Bellemare et al. (2016) by using density models.

Little work attempted to extend intrinsic exploration to continuous action spaces. A policy gradient RL method was presented in Houthooft et al. (2016). Generalization of visitation counters is proposed in Fox et al. (2018), and interpreted as exploration values. Exploration values are also presented as an alternative to additive rewards in Szita and Lőrincz (2008), where exploration balance at a policy level is mentioned.

Most of these methods typically suffer from high data requirements. One of the reasons for such requirements is that exploration is treated as an ad-hoc problem instead of being the focus of the optimization method. More principled ways to deal with exploration can be found in other related fields. In bandits, the balance between exploration and exploitation is central to the formulation Kuleshov and Precup (2014). For example with upper confidence bound Auer et al. (2002), actions are selected based on the balance between action values and a visitation term measuring the variance in the estimate of the action value. In the bandits setting, the balance is defined at a policy level, and the exploration term is not incorporated into action values like in intrinsic RL.

Similarly to bandits, Bayesian Optimization Jones et al. (1998) (BO) brings exploration at the core of its framework, extending the problem to continuous action spaces. BO provides a data-efficient approach for finding the optimum of an unknown objective. Exploration is achieved by building a probabilistic model of the objective from samples, and exploiting its posterior variance information. An acquisition function such as UCB Cox and John (1992) balances exploration and exploitation, and is at the core of the optimization problem. BO was successfully applied to direct policy search Brochu et al. (2010); Wilson et al. (2014) by searching over the space of policy parameters, casting RL into a supervised learning problem. Searching the space of policy parameters is however not data-efficient as recently acquired step information is not used to improve exploration. Furthermore, using BO as global search over policy parameters greatly restricts parameter dimensionality, hence typically imposes using few expressive and hand-crafted features.

In both bandits and BO formulations, exploration is brought to a policy level where it is a central goal of the optimization process. In this work, we treat exploration and exploitation as two distinct objectives to be optimized. Multi-objective RL Roijers et al. (2013) provides tools which we utilize for defining these two distinct objectives, and balancing them at a policy level. Multi-objective RL allows for making exploration central to the optimization process. While there exist Multi-objective RL methods to find several viable objective weightings such as finding Pareto fronts Perny and Weng (2010), our work focuses on two well defined objectives whose weighting changes during learning. As such, we are mostly interested in the ability to change the relative importance of objectives without requiring training.

Modelling state-action values using a probabilistic model enables reasoning about the whole distribution instead of just its expectation, giving opportunities for better exploration strategies. Bayesian Q-learning Dearden et al. (1998) was first proposed to provide value function posterior information in the tabular case, then extended to more complicated domains by using Gaussian processes to model the state-action value function Engel et al. (2005). In this work, the authors also discuss decomposing returns into several terms separating intrinsic and extrinsic uncertainty, which could later be used for exploration. Distributions over returns were proposed to design risk-sensitive algorithms Morimura et al. (2010), and approximated to enhance RL stability in Bellemare et al. (2017). In recent work, Bayesian linear regression is combined with a deep network to provide a posterior on Q-values Azizzadenesheli et al. (2018). Thompson sampling is then used for action selection, but can only guarantee local exploration. Indeed, if all actions have been experienced in a given state, the uncertainty of Q in this state is not sufficient to drive the agent towards unexplored regions.

To the best of our knowledge, there exists no model-free RL framework treating exploration as a core objective. We present such a framework, building on theory from multi-objective RL, bandits and BO. We also present EMU-Q, a solution to exploration in fully continuous goal-only domains based on the proposed framework, relying on reducing the posterior variance of value functions.

This paper extends our earlier work Morere and Ramos (2018). It formalizes a new framework for treating exploration and exploitation as two objectives, provides strategies for online exploration control and new experimental results.

3 Explicit Balance for Exploration and Exploitation

Traditional RL aims at finding a policy maximizing the expected sum of future discounted rewards, as formulated in Equation 1. Exploration is then typically achieved by adding a perturbation to rewards or behaviour policies in an ad-hoc way. We propose making the trade-off between exploration and exploitation explicit and at a policy level, by formulating exploration as a multi-objective RL problem.

3.1 Framework Overview

Multi-objective RL extends the classic RL framework by allowing value functions or policies to be learned for individual objectives. Exploitation and exploration are two distinct objectives for RL agents, for which separate value functions Q and U (respectively) can be learned. Policies then need to make use of information from two separate models for Q and U. While the exploitation value function Q is learned from external rewards, the exploration value function U is modelled using exploration rewards.

Aiming to define policies which combine exploration and exploitation, we draw inspiration from Bayesian Optimization Brochu et al. (2010), which seeks to find the maximum of an expensive function using very few samples. It relies on an acquisition function to determine the most promising locations to sample next, based on the model's posterior mean and variance. The Upper-Confidence Bound (UCB) acquisition function Cox and John (1992) is popular for its explicit balance between exploitation and exploration, controlled by a parameter κ ∈ [0,∞). Adapting UCB to our framework leads to policies balancing Q and U. Contrary to most intrinsic RL approaches, our formulation keeps the exploration-exploitation trade-off at a policy level, as in traditional RL. This allows for adapting the exploration-exploitation balance during the learning process without sacrificing data-efficiency, as would be the case with a balance at a reward level. Furthermore, policy-level balance can be used to design methods to control the agent's learning process, e.g. stop exploration after a budget is reached, or encourage more exploration if the agent converged to a sub-optimal solution; see Section 4. Lastly, generating trajectories resulting from exploration or exploitation only grants experimenters insight into learning status.

3.2 Exploration Values

We propose to redefine the objective optimized by RL methods to incorporate both exploration and exploitation at its core. To do so, we consider the following expected balanced return for policy π:

D^{\pi}(s,a) = \mathbb{E}_{T,R,R^{e},\pi}\left[\sum_{i=0}^{\infty}\gamma^{i}(r_{i}+\kappa r^{e}_{i})\right], \qquad (6)

where we introduced exploration rewards r^{e}_{i}\sim R^{e}(s_{i},a_{i},s_{i+1}) and a parameter κ ∈ [0,∞) governing the exploration-exploitation balance. Note that we recover Equation 1 by setting κ to 0, hence disabling exploration.

Equation 6 can be further decomposed into

D^{\pi}(s,a) = \mathbb{E}_{T,R,\pi}\left[\sum_{i=0}^{\infty}\gamma^{i}r_{i}\right] + \kappa\,\mathbb{E}_{T,R^{e},\pi}\left[\sum_{i=0}^{\infty}\gamma^{i}r^{e}_{i}\right] \qquad (7)
= Q^{\pi}(s,a) + \kappa U^{\pi}(s,a), \qquad (8)

where we have defined the exploration state-action value function U, akin to Q. Exploration behaviour is achieved by maximizing the expected discounted exploration return U. Note that, if r^{e} depends on Q, then U is a function of Q. For clarity, we omit this potential dependency in notations.

Bellman-type updates for both Q and U can be derived by unrolling the first term in both sums:

D^{\pi}(s,a) = \mathbb{E}_{R}[r] + \mathbb{E}_{a^{\prime},s^{\prime}|s,a}\left[\mathbb{E}_{T,R,\pi}\left[\sum_{i=1}^{\infty}\gamma^{i}r_{i}\right]\right] + \kappa\left(\mathbb{E}_{R^{e}}[r^{e}] + \mathbb{E}_{a^{\prime},s^{\prime}|s,a}\left[\mathbb{E}_{T,R^{e},\pi}\left[\sum_{i=1}^{\infty}\gamma^{i}r^{e}_{i}\right]\right]\right) \qquad (9)
= \mathbb{E}_{R}[r] + \gamma\,\mathbb{E}_{a^{\prime},s^{\prime}|s,a}[Q^{\pi}(s^{\prime},a^{\prime})] + \kappa\left(\mathbb{E}_{R^{e}}[r^{e}] + \gamma\,\mathbb{E}_{a^{\prime},s^{\prime}|s,a}[U^{\pi}(s^{\prime},a^{\prime})]\right). \qquad (10)

By identification we recover the update for Q given by Equation 3 and the following update for U:

U(s,a) = \mathbb{E}_{R^{e}}[r^{e}] + \gamma\,\mathbb{E}_{s^{\prime},a^{\prime}|s,a}[U(s^{\prime},a^{\prime})], \qquad (11)

which is similar to that of Q. Learning both U and Q can be seen as combining two agents solving separate MDPs for goal reaching and exploration, as shown in Figure 1 (right). This formulation is general in that any reinforcement learning algorithm can be used to learn Q and U, combined with any exploration bonus. Both state-action value functions can be learned from transition data using existing RL algorithms.
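As a hedged illustration of this decomposition, the sketch below updates tabular exploitation and exploration value functions from the same transition, using Q-learning as an example base learner (an assumption; any RL algorithm could be substituted):

```python
import numpy as np

# Sketch of learning Q (Eq. 3) and U (Eq. 11) side by side with tabular
# Q-learning as the base learner. Both tables share the next action, chosen
# greedily with respect to the balanced objective of Eq. 13.
def update_q_and_u(Q, U, s, a, r, r_e, s_next, kappa=1.0, gamma=0.99, lr=0.1):
    a_next = int(np.argmax(Q[s_next] + kappa * U[s_next]))
    Q[s, a] += lr * (r + gamma * Q[s_next, a_next] - Q[s, a])
    U[s, a] += lr * (r_e + gamma * U[s_next, a_next] - U[s, a])
    return Q, U
```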

3.3 Exploration Rewards

The presented formulation is independent from the choice of exploration rewards, hence many reward definitions from the intrinsic RL literature can directly be applied here.

Note that in the special case R^{e} = 0 for all states and actions, we recover exploration values from DORA Fox et al. (2018), and if state and action spaces are discrete, we recover visitation counters Bellemare et al. (2016).

Another approach to defining R^{e} consists in considering the amount of exploration left at a given state. The exploration reward r^{e} for a transition is then defined as the amount of exploration left in the resulting state of the transition, to favour transitions that result in discovery. It can be computed by taking an expectation over all actions:

R^{e}(s^{\prime}) = \mathbb{E}_{a^{\prime}\sim\mathcal{U}(\mathcal{A})}[\sigma(s^{\prime},a^{\prime})]. \qquad (12)

Here σ is a function accounting for the uncertainty associated with a state-action pair. This formulation favours transitions that arrive at states of higher uncertainty. An obvious choice for σ is the variance of Q-values, to guide exploration towards parts of the state-action space where Q-values are uncertain; this choice is discussed in Section 5. Another choice for σ is a visitation count or its continuous equivalent. Compared to classic visitation counts, this formulation focuses on visitations of the resulting transition state s′ instead of on the state-action pair of origin (s,a).

Exploration rewards are often constrained to negative values so that, by combining an optimistic model for U with negative rewards, optimism in the face of uncertainty guarantees efficient exploration Kearns and Singh (2002). The resulting model creates a gradient of U values; trajectories generated by following this gradient reach unexplored areas of the state-action space. With continuous actions, Equation 12 might not have a closed-form solution, and the expectation can be estimated with approximate integration or sampling techniques. In domains with discrete actions however, the expectation is replaced by a sum over all possible actions.
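For continuous actions, the sampling approximation mentioned above could look like the following sketch, where `uncertainty(s, a)` stands in for σ(s,a) and is an assumed callable rather than something defined in the paper:

```python
import numpy as np

# Sketch of a Monte-Carlo estimate of the exploration reward in Equation 12:
# the expectation over uniformly sampled actions is replaced by a sample mean.
def exploration_reward(s_next, uncertainty, action_low, action_high, n_samples=32):
    actions = np.random.uniform(action_low, action_high,
                                size=(n_samples, len(action_low)))
    return float(np.mean([uncertainty(s_next, a) for a in actions]))
```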

3.4 Action Selection

Goal-only rewards are often defined as deterministic, as they simply reflect goal and penalty states. Because our framework handles exploration in a deterministic way, we simply focus on deterministic policies. Although state-action values Q are still non-stationary (because π is), they are learned from a stationary objective r. This makes learning policies for exploitation easier.

Following the definition in Equation 6, actions are selected to maximize the expected balanced return D^{\pi} at a given state s:

\pi(s) = \arg\max_{a} D^{\pi}(s,a) = \arg\max_{a} Q^{\pi}(s,a) + \kappa U^{\pi}(s,a). \qquad (13)

Notice the similarity between the policy given in Equation 13 and UCB acquisition functions from the Bayesian optimization and bandits literature. No additional exploration term is needed, as this policy explicitly balances exploration and exploitation with parameter κ. This parameter can be tuned at any time to generate trajectories for pure exploration or exploitation, which can be useful to assess agent learning status. Furthermore, strategies can be devised to control κ manually or automatically during the learning process; we propose a few such strategies in Section 4.
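In continuous action spaces the argmax in Equation 13 has no closed form; one simple (assumed) approximation is to score sampled candidate actions, as sketched below, with Q and U standing for the two learned value models:

```python
import numpy as np

# Sketch of the policy in Equation 13 for continuous actions: the argmax is
# approximated by scoring uniformly sampled candidate actions.
def select_action(s, Q, U, kappa, action_low, action_high, n_candidates=256):
    candidates = np.random.uniform(action_low, action_high,
                                   size=(n_candidates, len(action_low)))
    scores = [Q(s, a) + kappa * U(s, a) for a in candidates]
    return candidates[int(np.argmax(scores))]
```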

The policy from Equation 13 can further be decomposed into present and future terms:

\pi(s) = \arg\max_{a}\ \underbrace{\mathbb{E}_{R}[r] + \kappa\,\mathbb{E}_{R^{e}}[r^{e}]}_{\text{myopic}} + \gamma\,\mathbb{E}_{s^{\prime}|s,a}\Big[\underbrace{\mathbb{E}_{a^{\prime}\sim\pi(s^{\prime})}[Q^{\pi}(s^{\prime},a^{\prime}) + \kappa U^{\pi}(s^{\prime},a^{\prime})]}_{\text{future}}\Big], \qquad (14)

where the term denoted future is effectively D^{\pi}(s^{\prime}). This decomposition highlights the link between this framework and other active learning methods; by setting γ to 0, only the myopic term remains, and we recover the traditional UCB acquisition function from bandits or Bayesian optimization. This decomposition can be seen as an extension of these techniques to a non-myopic setting: future discounted exploration and exploitation are also considered within the action selection process. Drawing this connection opens up new avenues for leveraging exploration techniques from the bandits literature.

The method presented in this section for explicitly balancing exploration and exploitation at a policy level is concisely summed up in Algorithm 1. The method is general enough to allow learning both Q and U with any RL algorithm, and does not make assumptions on the choice of exploration reward used. Section 5 presents a practical method implementing this framework, while the next section presents advantages and strategies for controlling the exploration balance during the agent's learning process.

Algorithm 1 Explicit Exploration-Exploitation
1:  Input: parameter κ.
2:  Output: Policy π.
3:  for episode l = 1, 2, ... do
4:     for step h = 1, 2, ... do
5:        π(s) = \arg\max_{a\in\mathcal{A}} Q(s,a) + κ U(s,a)
6:        Execute a = π(s), observe s′ and r, and store (s, a, r, s′) in D.
7:        Generate r^{e} (with Equation 12, for example).
8:        Update Q with the Bellman equation and r.
9:        Update U with the Bellman equation and r^{e}.
10:     end for
11:  end for

4 Preliminary Experiments on Classic RL Problems

In this section, a series of preliminary experiments on goal-only classic RL domains is presented to highlight the advantages of exploration values over additive rewards. Strategies for taking advantage of variable exploration rates are then provided.

The comparisons use goal-only versions of simple and fully discrete domains. We compare all methods using strictly the same learning algorithm and reward bonuses. Learning algorithms are tabular implementations of Q-Learning with the learning rate fixed to 0.1. Reward bonuses are computed from a table of state-action visitation counts, where experiencing a state-action pair for the first time grants 0 reward and revisiting yields a reward of -1. We denote by additive rewards a learning algorithm where reward bonuses are used as in classic intrinsic RL (Equation 4), and by exploration values one where reward bonuses are used as in the proposed framework (Equation 11, with action selection defined by Equation 13). A Q-learning agent with no reward bonuses and ϵ-greedy exploration is included as a baseline.
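A minimal sketch of this count-based bonus (0 on a first visit, -1 on revisits) is:

```python
from collections import defaultdict

# Sketch of the tabular reward bonus used in these experiments: a first visit
# of a state-action pair yields 0, any revisit yields -1.
class VisitBonus:
    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, s, a):
        bonus = 0.0 if self.counts[(s, a)] == 0 else -1.0
        self.counts[(s, a)] += 1
        return bonus
```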

Problem 1: The Cliff Walking domain Sutton et al. (1998) is adapted to the goal-only setting: a negative unit reward is given for falling off the cliff (triggering agent teleportation to the starting state), and a positive unit reward for reaching the terminal goal state. Transitions allow the agent to move in the four cardinal directions, where a random direction is chosen with low probability 0.01.

Problem 2: The traditional Taxi domain Dietterich (2000) is also adapted to the goal-only setting. This domain features a 5×5 grid-world with walls and four special locations. In each episode, the agent starts randomly and two of the special locations are designated as passenger and destination. The goal is for the agent to move to the passenger's location, pick up the passenger, drive them to the destination, and drop them off. A unit reward is given for dropping off the passenger at the destination (ending the episode), and -0.1 rewards are given for pick-up and drop-off actions in wrong locations.

4.1 Analyzing the Advantages of an Explicit Exploration-Exploitation Balance

We first present simple pathological cases in which using exploration values provides advantages over additive rewards for exploration, on the Cliff Walking domain.

Figure 2: Mean return and variance of 100 runs of the Cliff Walking domain with goal-only rewards. (a) Stopping exploration after 30 episodes. (b) Stopping exploration after 20 episodes and continuing exploration after another 10 episodes. (c) Stochastic transitions with 0.1 probability of random action. (d) Rewards are scaled by 100, and exploration parameters are also scaled to keep an equal magnitude between exploration and exploitation terms.

The first two experiments show that exploration values allow for direct control over exploration, such as stopping and continuing exploration. Stopping exploration after a budget is reached is simulated by setting exploration parameters (e.g. κ) to 0 after 30 episodes and stopping model learning. While exploration values maintain high performance after exploration stops, returns achieved with additive rewards drop dramatically and yield a degenerate policy. When exploration is enabled once again, the two methods continue improving at a similar rate; see Figures 2a and 2b. Note that when exploration is disabled, there is a jump in returns with exploration values, as performance for pure exploitation is evaluated. In contrast, it is never possible to be sure a policy is generated from pure exploitation when using additive rewards, as part of the bonus exploration rewards is encoded within the learned Q-values.

The third experiment demonstrates that stochastic transitions with a higher probability of random action (p = 0.1) lead to increased return variance and poor performance with additive rewards, while exploration values seem only mildly affected. As shown in Figure 2c, even ϵ-greedy appears to solve the task, suggesting stochastic transitions provide additional random exploration. It is unclear why the additive rewards method is affected negatively.

Figure 3: Decreasing the exploration parameter over time to control the exploration level on sparse Cliff Walking (a) and sparse Taxi (b) domains. Results show return mean and variance over 100 runs.

Lastly, the fourth experiment shows that the magnitude of environment rewards is paramount to achieving good performance with additive rewards; see Figure 2d. Even though the exploration parameters balancing environment and exploration bonus rewards are scaled to maintain equal amplitude between the two terms, additive rewards suffer from degraded performance. This is due to two reward quantities being incorporated into a single model for Q, which also needs to be initialized optimistically with respect to both quantities; when the two types of rewards have different amplitudes, this causes a problem. Exploration values do not suffer from this drawback, as separate models are learned for these two quantities, resulting in unchanged performance.

4.2 Automatic Control of Exploration-Exploitation Balance

We now present strategies for automatically controlling the exploration-exploitation balance during the learning process. The following experiments also make use of the Taxi domain.

Exploration parameter κ is decreased over time according to the schedule κ(t) = 1/(1 + ct), where c governs the decay rate. Higher values of c result in reduced exploration after only a few episodes, whereas lower values translate to almost constant exploration. Results displayed in Figure 3 indicate that decreasing exploration leads to fast convergence to returns relatively close to the maximum return, as shown when setting c = 10^{5}. However, choosing a more moderate value c = 0.1 first results in lower performance, but enables finding a policy with higher returns later. Such behaviour is more visible with very small values such as c = 10^{-3}, which corresponds to an almost constant κ.
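The schedule itself is straightforward; a sketch, with the decay rates used in the figure noted as examples, is:

```python
# Sketch of the exploration decay schedule kappa(t) = 1 / (1 + c * t).
def kappa_schedule(t, c):
    return 1.0 / (1.0 + c * t)

# Example decay rates from the experiments: c = 1e5 (near-immediate decay),
# c = 0.1 (gradual decay), c = 1e-3 (almost constant exploration).
```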

Method              Times target reached   Episodes to target   Performance after target
ϵ-greedy            0/10                   –                    –
Exploration values  9/10                   111.33               0.08 (4.50)
Additive rewards    9/10                   242.11               -33.26 (67.41)
Table 1: Stopping exploration after a target test return of 0.1 is reached on 5 consecutive episodes in the sparse Taxi domain. Results averaged over 100 runs.
Figure 4: Taxi domain with goal-only rewards. (a) Exploration stops after a fixed budget of episodes b ∈ {100, 300, 500} is exhausted. (b) Exploration stops after a test target return of 0.1 is reached on 5 consecutive runs. Results show return mean and variance over 100 runs.

We now show how direct control over the exploration parameter κ can be taken advantage of to stop learning automatically once a predefined target is met. On the Taxi domain, exploration is first stopped after an exploration budget is exhausted. Results comparing additive rewards to exploration values for budgets of 100, 300 and 500 episodes are given in Figure 4a. These clearly show that when stopping exploration after the budget is reached, exploration value agents can generate purely exploiting trajectories achieving near-optimal return, whereas additive reward agents fail to converge to an acceptable policy.

Lastly, we investigate stopping exploration automatically once a target return is reached. After each learning episode, 5 test episodes with pure exploitation are run to score the current policy. If all 5 test episodes score returns above 0.1, the target return is reached and exploration stops. Results for this experiment are shown in Figure 4b and Table 1. Compared to additive rewards, exploration values display better performance after the target is reached, as well as faster target reaching.

Exploration values were experimentally shown to provide exploration advantages over additive rewards on simple RL domains. The next section presents an algorithm built on the proposed framework, which extends to fully continuous state and action spaces and is applicable to more advanced problems.

5 EMU-Q: Exploration by Minimizing Uncertainty of Q Values

Following the framework defined in Section 3, we propose learning exploration values with a specific reward driving trajectories towards areas of the state-action space where the agent's uncertainty about Q-values is high.

5.1 Reward Definition

Modelling Q-values with a probabilistic model gives access to variance information representing model uncertainty in expected discounted returns. Because the probabilistic model is learned from expected discounted returns, discounted return variance is not considered; the model variance only reflects epistemic uncertainty, which can be used to drive exploration. This formulation was explored in EMU-Q Morere and Ramos (2018), extending Equation 12 as follows:

R^{e}(s^{\prime}) = \mathbb{E}_{a^{\prime}\sim\mathcal{U}(\mathcal{A})}[\mathbb{V}[Q(s^{\prime},a^{\prime})]] - \mathbb{V}_{max}, \qquad (15)

where \mathbb{V}_{max} is the maximum possible variance of Q, guaranteeing always-negative rewards. In practice, \mathbb{V}_{max} depends on the model used to learn Q and its hyper-parameters, and can often be computed analytically. In Equation 15, the variance operator \mathbb{V} computes the epistemic uncertainty of Q-values, that is, it assesses how confident the model is that it can predict Q-values correctly. Note that the MDP stochasticity emerging from transitions, rewards and policy is absorbed by the expectation operator in Equation 2, and so no assumptions are required on the MDP components in this reward definition.

5.2 Bayesian Linear Regression for Q-Learning

We now seek to obtain a model-free RL algorithm able to explore with few environment interactions, and providing a full predictive distribution on state-action values to fit the exploration reward definition given by Equation 15. Kernel methods such as Gaussian Process TD (GPTD) Engel et al. (2005) and Least-Squares TD (LSTD) Lagoudakis and Parr (2003) are among the most data-efficient model-free techniques. While the former suffers from prohibitive computation requirements, the latter offers an appealing trade-off between data-efficiency and complexity. We now derive a Bayesian RL algorithm that combines the strengths of both kernel methods and LSTD.

The distribution of long-term discounted exploitation returns G can be defined recursively as:

G(s,a) = R(s,a,s^{\prime}) + \gamma G(s^{\prime},a^{\prime}), \qquad (16)

which is an equality in distribution between the two sides of the equation. Note that, so far, no assumptions are made on the nature of the distribution of returns. Let us decompose the discounted return G into its mean Q and a random zero-mean residual q, so that Q(s,a) = \mathbb{E}[G(s,a)]. Substituting and rearranging Equation 16 yields

\underbrace{R(s,a,s^{\prime}) + \gamma Q(s^{\prime},a^{\prime})}_{t} = Q(s,a) + \underbrace{q(s,a) - \gamma q(s^{\prime},a^{\prime})}_{\epsilon}. \qquad (17)

The only extrinsic uncertainty left in this equation comes from the reward distribution R and the residuals q. Assuming rewards are disturbed by zero-mean Gaussian noise implies that the difference of residuals ϵ is Gaussian with zero mean and precision β. By modelling Q as a linear function of a feature map \bm{\phi}_{s,a}, so that Q(s,a) = \bm{w}^{T}\bm{\phi}_{s,a}, estimation of state-action values becomes a linear regression problem with targets 𝒕 and weights 𝒘. The likelihood function takes the form

p(\bm{t}|\bm{x},\bm{w},\beta) = \prod_{i=1}^{N}\mathcal{N}(t_{i}\,|\,r_{i} + \gamma\bm{w}^{T}\bm{\phi}_{s^{\prime}_{i},a^{\prime}_{i}},\ \beta^{-1}), \qquad (18)

where independent transitions are denoted x_{i} = (s_{i},a_{i},r_{i},s^{\prime}_{i},a^{\prime}_{i}). We now treat the weights as random variables with a zero-mean Gaussian prior p(\bm{w}) = \mathcal{N}(\bm{w}|\bm{0},\alpha^{-1}\bm{I}). The weight posterior distribution is

p(\bm{w}|\bm{t}) = \mathcal{N}(\bm{w}|\bm{m}_{Q},\bm{S}) \qquad (19)
\bm{m}_{Q} = \beta\bm{S}\bm{\Phi}_{\bm{s},\bm{a}}^{T}(\bm{r} + \gamma\bm{Q}^{\prime}) \qquad (20)
\bm{S} = (\alpha\bm{I} + \beta\bm{\Phi}_{\bm{s},\bm{a}}^{T}\bm{\Phi}_{\bm{s},\bm{a}})^{-1}, \qquad (21)

where \bm{\Phi}_{\bm{s},\bm{a}} = \{\bm{\phi}_{s_{i},a_{i}}\}^{N}_{i=1}, \bm{Q}^{\prime} = \{Q(s^{\prime}_{i},a^{\prime}_{i})\}^{N}_{i=1}, and \bm{r} = \{r_{i}\}^{N}_{i=1}. The predictive distribution is also Gaussian, yielding

Q(s,a) = \mathbb{E}[p(t|\bm{x},\bm{t},\alpha,\beta)] = \bm{\phi}_{s,a}^{T}\bm{m}_{Q}, \qquad (22)
\text{and } \mathbb{V}[p(t|\bm{x},\bm{t},\alpha,\beta)] = \beta^{-1}\bm{\phi}_{s,a}^{T}\bm{S}\bm{\phi}_{s,a}. \qquad (23)

The predictive variance \mathbb{V}[p(t|\bm{x},\bm{t},\alpha,\beta)] encodes the intrinsic uncertainty in Q(s,a), due to the subjective understanding of the MDP's model; it is used to compute r^{e} in Equation 15.

The derivation for U is similar, replacing r with r^{e} and t with t^{e} = R^{e}(s,a,s^{\prime}) + \gamma U(s^{\prime},a^{\prime}). Note that because \bm{S} does not depend on rewards, it can be shared by both models. Hence, with \bm{U}^{\prime} = \{U(s^{\prime}_{i},a^{\prime}_{i})\}^{N}_{i=1},

U(s,a) = \bm{\phi}_{s,a}^{T}\bm{m}_{U}, \text{ with } \bm{m}_{U} = \beta\bm{S}\bm{\Phi}_{\bm{s},\bm{a}}^{T}(\bm{r}^{e} + \gamma\bm{U}^{\prime}). \qquad (24)
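A batch-form sketch of these posterior computations (Equations 20 to 24), sharing the covariance S between the two models, is given below; parameter values are placeholders:

```python
import numpy as np

# Sketch of the shared Bayesian linear regression posterior (Eqs. 20-24):
# Phi is the N x M matrix of state-action features, Q_next and U_next hold the
# bootstrapped next-state values, r and r_e the extrinsic/exploration rewards.
def posterior_weights(Phi, r, r_e, Q_next, U_next, alpha=0.01, beta=1.0, gamma=0.99):
    S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)  # Eq. 21
    m_Q = beta * S @ Phi.T @ (r + gamma * Q_next)                         # Eq. 20
    m_U = beta * S @ Phi.T @ (r_e + gamma * U_next)                       # Eq. 24
    return m_Q, m_U, S

def q_mean_var(phi_sa, m_Q, S, beta=1.0):
    return phi_sa @ m_Q, (1.0 / beta) * phi_sa @ S @ phi_sa               # Eqs. 22-23
```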

This model gracefully adapts to iterative updates at each step, by substituting the current prior with the previous posterior. Furthermore, the Sherman-Morrison formula is used to compute rank-1 updates of matrix \bm{S} with each new data point \bm{\phi}_{s,a}:

\bm{S}_{t+1} = \bm{S}_{t} - \beta\frac{(\bm{S}_{t}\bm{\phi}_{s,a})(\bm{\phi}_{s,a}^{T}\bm{S}_{t})}{1 + \beta\bm{\phi}_{s,a}^{T}\bm{S}_{t}\bm{\phi}_{s,a}}. \qquad (25)

This update only requires matrix-to-vector multiplications and saves the cost of inverting a matrix at every step. Hence the complexity is reduced from O(M^{3}) to O(M^{2}) in the number of features M. An optimized implementation of EMU-Q is given in Algorithm 2.

Algorithm 2 EMU-Q
1:  Input: initial state s, parameters α, β, κ.
2:  Output: Policy π parametrized by \bm{m}_{Q} and \bm{m}_{U}.
3:  Initialize \bm{S} = \alpha^{-1}\bm{I}, \bm{t}_{Q} = \bm{t}_{U} = \bm{m}_{Q} = \bm{m}_{U} = \bm{0}
4:  for episode l = 1, 2, ... do
5:     for step h = 1, 2, ... do
6:        π(s) = \arg\max_{a} \bm{\phi}_{s,a}\bm{m}_{Q} + \kappa\bm{\phi}_{s,a}\bm{m}_{U}
7:        Execute a = π(s), observe s′ and r, and store \bm{\phi}_{s,a}, r, s′ in D.
8:        Generate r^{e} from Equation 15 with s′.
9:        \bm{S} = \bm{S} - \beta\frac{(\bm{S}\bm{\phi}_{s,a})(\bm{\phi}_{s,a}^{T}\bm{S})}{1 + \beta\bm{\phi}_{s,a}^{T}\bm{S}\bm{\phi}_{s,a}}
10:        \bm{t}_{Q} = \bm{t}_{Q} + \beta\bm{\phi}_{s,a}^{T}(r + \gamma\bm{\phi}_{s,a}\bm{m}_{Q})
11:        \bm{t}_{U} = \bm{t}_{U} + \beta\bm{\phi}_{s,a}^{T}(r^{e} + \gamma\bm{\phi}_{s,a}\bm{m}_{U})
12:        \bm{m}_{Q} = \bm{S}\bm{t}_{Q}, \bm{m}_{U} = \bm{S}\bm{t}_{U}
13:     end for
14:     From D, draw \bm{\Phi}_{\bm{s},\bm{a}}, \bm{r}, \bm{s}^{\prime}, and compute \bm{\Phi}_{\bm{s}^{\prime},\pi(\bm{s}^{\prime})}.
15:     Update \bm{m}_{Q} = \beta\bm{S}\bm{\Phi}_{\bm{s},\bm{a}}^{T}(\bm{r} + \gamma\bm{\Phi}_{\bm{s}^{\prime},\pi(\bm{s}^{\prime})}\bm{m}_{Q}) until the change in \bm{m}_{Q} is below ϵ.
16:     Compute \bm{r}^{e} with Equation 15 and \bm{s}^{\prime}.
17:     Update \bm{m}_{U} = \beta\bm{S}\bm{\Phi}_{\bm{s},\bm{a}}^{T}(\bm{r}^{e} + \gamma\bm{\Phi}_{\bm{s}^{\prime},\pi(\bm{s}^{\prime})}\bm{m}_{U}) until the change in \bm{m}_{U} is below ϵ.
18:  end for

End-of-episode updates for \bm{m}_{Q} and \bm{m}_{U} (line 15 onward) are analogous to policy iteration and, although not mandatory, greatly improve convergence speed. Note that because r^{e} is a non-stationary target, recomputing it after each episode with the updated posterior on Q provides the model of U with more accurate targets, thereby improving learning speed.
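A sketch of this end-of-episode refit (line 15 of Algorithm 2) as a simple fixed-point iteration is shown below, assuming the episode's features, targets and next-state features have already been stacked into matrices:

```python
import numpy as np

# Sketch of the end-of-episode weight refit (line 15 of Algorithm 2): iterate
# the batch posterior-mean update until the weights stop changing.
def refit_weights(S, Phi, targets, Phi_next, beta=1.0, gamma=0.99, tol=1e-4, max_iter=200):
    m = np.zeros(Phi.shape[1])
    for _ in range(max_iter):
        m_new = beta * S @ Phi.T @ (targets + gamma * Phi_next @ m)
        if np.linalg.norm(m_new - m) < tol:
            break
        m = m_new
    return m_new
```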

5.3 Kernel Approximation Features for RL

We presented a simple method to learn Q and U as linear functions of state-action features. While powerful when using a good feature map, linear models typically require experimenters to define meaningful features on a problem-specific basis. In this section, we introduce random Fourier features (RFF) Rahimi and Recht (2008), a kernel approximation technique which allows linear models to enjoy the expressivity of kernel methods. It should be noted that these features are different from Fourier basis features Konidaris et al. (2011) (detailed in the supplementary material), which do not approximate kernel functions. Although RFF were recently used to learn policy parametrizations Rajeswaran et al. (2017), to the best of our knowledge, this is the first time RFF are applied to the value function approximation problem in RL.

For any shift-invariant kernel, which can be written as k(τ) with τ = 𝒙 - 𝒙′, a representation based on the Fourier transform can be computed with Bochner's theorem Gihman and Skorohod (1974).

Theorem 1 (Bochner's Theorem) Any shift-invariant kernel k(τ), τ ∈ \mathbb{R}^{D}, with a positive finite measure dμ(𝝎) can be represented in terms of its Fourier transform as

k(\tau) = \int_{\mathbb{R}^{D}}e^{-i\bm{\omega}\tau}d\mu(\bm{\omega}). \qquad (26)

 

Assuming measure μ has a density p(𝝎), p is the spectral density of k and we have

k(\tau) = \int_{\mathbb{R}^{D}}e^{-i\tau\bm{\omega}}p(\bm{\omega})d\bm{\omega} \approx \frac{1}{M}\sum_{j=1}^{M}e^{-i\tau\bm{\omega}_{j}} = \langle\bm{\phi}(\bm{x}),\bm{\phi}(\bm{x}^{\prime})\rangle, \qquad (27)

where \bm{\phi}(\bm{x}) is an approximate feature map, and M the number of spectral samples drawn from p. In practice, the feature map approximating k(\bm{x},\bm{x}^{\prime}) is

\bm{\phi}(\bm{x}) = \frac{1}{\sqrt{M}}\left[\cos(\bm{x}^{T}\bm{\omega}_{1}),\ldots,\cos(\bm{x}^{T}\bm{\omega}_{M}),\ \sin(\bm{x}^{T}\bm{\omega}_{1}),\ldots,\sin(\bm{x}^{T}\bm{\omega}_{M})\right], \qquad (28)

where the imaginary part was set to zero, as required for real kernels. In the case of the RBF kernel, defined as k(\bm{x},\bm{x}^{\prime}) = \exp(-\frac{1}{2\sigma^{2}}||\bm{x}-\bm{x}^{\prime}||_{2}^{2}), the kernel spectral density is Gaussian, p = \mathcal{N}(0, 2\sigma^{-2}I). Feature maps can be computed by drawing M/2 × d samples from p once, and computing Equation 28 on new inputs 𝒙 using these samples. The resulting features are not domain specific and require no feature engineering. Users only need to choose a kernel that represents an adequate distance measure in the state-action space, and can benefit from the numerous kernels already provided in the literature. Using these features in conjunction with Bayesian linear regression provides an efficient method to approximate a Gaussian process.
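A minimal sketch of constructing such a feature map is given below; the frequencies are drawn from an isotropic Gaussian whose scale is left as a parameter, so the exact constant should be matched to the kernel convention in use:

```python
import numpy as np

# Sketch of a random Fourier feature map (Equation 28). Frequencies are drawn
# once from the kernel's spectral density (here an isotropic Gaussian with
# per-dimension standard deviation `freq_scale`) and reused for every new input.
def make_rff(dim, n_spectral, freq_scale, seed=0):
    rng = np.random.default_rng(seed)
    omega = rng.normal(0.0, freq_scale, size=(dim, n_spectral))  # d x M frequencies

    def features(x):
        proj = np.atleast_2d(x) @ omega
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(n_spectral)

    return features
```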

As the number of features increases, the kernel approximation error decreases Sutherland and Schneider (2015); approximating popular shift-invariant kernels to within ϵ can be achieved with only M = O(dϵ^{-2}\log\frac{1}{\epsilon^{2}}) features. Additionally, sampling frequencies according to a quasi-random scheme (used in our experiments) reduces kernel approximation error compared to classic Monte-Carlo sampling with the same number of features Yang et al. (2014).

EMU-Q with RFF combines the ease-of-use and expressivity of kernel methods brought by RFF with the convergence properties and speed of linear models.

5.3.1 Comparison of Random Fourier Features and Fourier Basis Features

Figure 5: Return mean and standard deviation for Q-learning with random Fourier features (RFF) or Fourier basis features on SinglePendulum (a), MountainCar (b), and DoublePendulum (c) domain with classic rewards. Results are computed using classic Q-learning with ϵ\epsilon-greedy policy, and averaged over 20 runs for each method.

For completeness, a comparison between RFF and the better known Fourier Basis Features Konidaris et al. (2011) is provided on classic RL domains using Q-learning. A short overview on Fourier Basis Features is given in Appendix A.

Three relatively simple environments were considered: SinglePendulum, MountainCar and DoublePendulum (details on these environments are given in Section 6). The same Q-learning algorithm was used for both methods, with equal parameters. As few as 300 random Fourier features are sufficient in these domains, while the order of Fourier basis features was set to 5 for SinglePendulum and MountainCar and to 3 for DoublePendulum. The higher state and action space dimensions of DoublePendulum make using Fourier basis features prohibitively expensive, as the number of generated features increases exponentially with space dimensions. For example, in DoublePendulum, Fourier basis features of order 3 lead to more than 2000 features.

Results displayed in Figure 5 show RFF outperforms Fourier basis both in terms of learning speed and asymptotic performance, while using a lower number of features. In DoublePendulum, the number of Fourier basis features seems insufficient to solve the problem, even though it is an order of magnitude higher than that of RFF.

6 Experiments

EMU-Q’s exploration performance is qualitatively and quantitatively evaluated on a toy chain MDP example, 7 widely-used continuous control domains and a robotic manipulator problem. Experiments aim at measuring exploration capabilities in domains with goal-only rewards. Unless specified otherwise, domains feature one absorbing goal state with positive unit reward, and potential penalizing absorbing states with reward -1. All other rewards are zero, resulting in very sparse reward functions and rendering guidance from reward gradient information inapplicable.

6.1 Synthetic Chain Domain

We investigate EMU-Q’s exploration capabilities on a classic domain known to be hard to explore. It is composed of a chain of N states and two actions, displayed in Figure 6a. Action right (dashed) has probability 1 - 1/N of moving right and probability 1/N of moving left. Action left (solid) is deterministic.

Figure 6: (a) Chain domain described in Osband et al. (2014). (b) Steps to goal (mean and standard deviation) in chain domain, for increasing chain lengths, averaged over 30 runs. (c) Steps to goal in semi-sparse 10-state chain, as a function of reward sparsity, with maximum of 1000 steps (averaged over 100 runs).

6.1.1 Goal-only Rewards

We first consider the case of goal-only rewards, where goal state S_N yields a unit reward and all other transitions result in nil reward. Classic exploration such as ϵ-greedy was shown to have regret exponential in the number of states in this domain Osband et al. (2014). Achieving better performance on this domain is therefore essential to any advanced exploration technique. We compare EMU-Q to ϵ-greedy exploration for increasing chain lengths, in terms of the number of steps before goal state S_N is found. Results in Figure 6b illustrate the exponential regret of ϵ-greedy, while EMU-Q achieves much lower exploration time, scaling linearly with chain length.

6.1.2 Semi-Sparse Rewards

We now investigate the impact of reward structure by decreasing the chain domain's reward sparsity. In this experiment only, agents are given additional -1 rewards with probability 1 - p for every non-goal state, effectively guiding them towards the goal state (goal-only rewards are recovered for p = 1). The average number of steps before the goal is reached, as a function of p, is compared for ϵ-greedy and EMU-Q in Figure 6c. Results show that ϵ-greedy performs very poorly for high p, but improves as guiding reward density increases. Conversely, EMU-Q seems unaffected by reward density and performs equally well for all values of p. When p = 0, agents receive a -1 reward in every non-goal state, and ϵ-greedy performs similarly to EMU-Q.

6.2 Classic Control

EMU-Q is further evaluated on more challenging RL domains. These feature fully continuous state and action spaces, and are adapted to the goal-only reward setting. In these standard control problems Brockman et al. (2016), classic exploration methods are unable to reach goal states.

Figure 7: Goal-only MountainCar. (a,b,c) Exploration value function U (for action 0) after 1, 2, and 3 episodes. State trajectories for the 3 episodes are shown as plain lines (yellow, black and white respectively). (d) Steps to goal (x > 0.9), with the policy refined after the goal state was found (averaged over 30 runs).

6.2.1 Exploration Behaviour on goal-only MountainCar

We first provide intuition behind what EMU-Q learns and illustrate its typical behaviour on a continuous goal-only version of MountainCar. In this domain, the agent needs to drive an under-actuated car up a hill by building momentum. The state space consists of car position and velocity, and actions range from -1 to 1, describing car wheel torque (absolute value) and direction (sign). The agent is granted a unit reward for reaching the top of the right hill, and zero elsewhere.

Figure 7 displays the state-action exploration value function U at different stages of learning, overlaid with the state-space trajectories followed during learning. The first episode (yellow line) exemplifies action babbling, and the car does not exit the valley (around x = 0.4). In the next episode (black line), the agent finds sequences of actions that allow exiting the valley and exploring further areas of the state-action space. Lastly, in episode three (white line), the agent finds the goal (x > 0.9). This is done by adopting a strategy that quickly leads to unexplored areas, as shown by the increased gap between white lines. The exploration value function U reflects high uncertainty about unexplored areas (yellow), which shrinks as more data is gathered, and low and decreasing uncertainty for often-visited areas such as starting states (purple). Function U also features a gradient which can be followed from any state to find new areas of the state-action space to explore. Figure 7d shows that EMU-Q's exploration capabilities enable it to find the goal state within one or two episodes.

6.2.2 Continuous Control Benchmark

We now compare our algorithm on the complete benchmark of 7 continuous control goal-only tasks. All domains are based on OpenAI Gym Brockman et al. (2016) and are modified to feature goal-only rewards and continuous state-action spaces, with dimensions detailed in Table 2. More specifically, the domains considered are MountainCar and the following:

  • SinglePendulum: The agent needs to balance an under-actuated pendulum upwards by controlling the torque of a motor at the base of the pendulum. A unit reward is granted when the pole (of angle θ with the vertical) is upwards: θ < 0.05 rad.

  • DoublePendulum: Similarly to SinglePendulum, the agent’s goal is to balance a double pendulum upwards. Only the base joint can be actuated, while the joint between the two segments moves freely. The agent is given a unit reward when the tip of the pendulum is within a distance d < 1 of the tallest point it can reach.

  • CartpoleSwingUp: This domain features a single pole mounted on a cart. The goal is to balance the pole upwards by controlling the torque of the under-actuated cart’s wheels. Driving the cart too far off centre (|x| > 2.4) results in episode failure with reward −1, and managing to balance the pole (cos(θ) > 0.8, with θ the pole angle with the vertical axis) yields a unit reward. Note that contrary to classic Cartpole, this domain starts with the pole hanging down, and episodes terminate when balance is achieved.

  • LunarLander: The agent controls a landing pod, which needs to be landed on a designated platform, by applying lateral and vertical thrust. A positive unit reward is given for reaching the landing pad within a distance d < 0.05 of its centre point, and a negative unit reward is given for crashing or exiting the flying area.

  • Reacher: A two-segment robotic manipulator is actuated at each of its two joints to reach a predefined position in a two-dimensional space. Bringing the manipulator tip within a distance d < 0.015 of a random target results in a unit reward.

  • Hopper: This domain features a single-legged robot composed of three segments, which needs to propel itself to a predefined height. A unit reward is given for successfully jumping to height h > 1.3, and a negative unit reward when the leg falls past angle |θ| > 0.2 with the vertical axis.

Domain | d_S | d_A | l_S | l_A | M | α | β
SinglePendulum | 3 | 1 | 0.3 | 0.3 | 300 | 0.001 | 1.0
MountainCar | 2 | 1 | 0.3 | 10 | 300 | 0.1 | 1.0
DoublePendulum | 6 | 1 | 0.3 | 0.3 | 500 | 0.01 | 1.0
CartpoleSwingUp | 4 | 1 | 0.8 | 1.0 | 500 | 0.01 | 1.0
LunarLander | 8 | 2 | 0.5 | 0.3 | 500 | 0.01 | 1.0
Reacher | 11 | 2 | 0.3 | 0.3 | 500 | 0.001 | 1.0
Hopper | 11 | 3 | 0.3 | 0.3 | 500 | 0.01 | 1.0
Table 2: Experimental parameters for all 7 domains.

Most methods in the sparse-reward literature address domains with discrete state and/or action spaces, making it difficult to find baselines to compare EMU-Q to. Furthermore, classic exploration techniques such as ϵ-greedy fail on these domains. We compare our algorithm to three baselines: VIME, DORA, and RFF-Q. VIME Houthooft et al. (2016) defines exploration as maximizing information gain about the agent’s belief of environment dynamics. DORA Fox et al. (2018), which we run on discretized action spaces, extends visitation counts to continuous state spaces. Both VIME and DORA use additive exploration rewards, as opposed to EMU-Q which uses separate exploration values. Q-Learning with ϵ-greedy exploration and RFF features is denoted RFF-Q. Because it would fail in domains with goal-only rewards, it is run with classic (dense) rewards; see Brockman et al. (2016) for details.

We are interested in comparing exploration performance, favouring fast discovery of goal states. To this end, we measure the number of episodes required before the first positive goal reward is obtained. This metric reflects how long pure exploration is required before goal-reaching information can be exploited to refine policies, and hence directly reflects exploration capabilities. The discount factor γ is set to 0.99 for all domains, and episodes are capped at 500 steps. State spaces are normalized, and random Fourier features approximating squared exponential kernels are used for both state and action spaces with EMU-Q and RFF-Q. The state and action kernel lengthscales are denoted l_S and l_A respectively. The exploration-exploitation trade-off parameter κ is set to 1/𝕍_max for all experiments. Other algorithm parameters were manually fixed to the reasonable values given in Table 2.
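The sketch below shows one way to construct such random Fourier features (Rahimi and Recht, 2008) over normalized state-action pairs, with separate lengthscales for state and action dimensions. How EMU-Q exactly composes state and action features is not specified above, so the joint feature map, function name, and the example values (taken from Table 2 for MountainCar) are assumptions for illustration only.

```python
import numpy as np

def make_state_action_rff(d_s, d_a, l_s, l_a, n_features, seed=0):
    """Random Fourier features approximating a squared exponential kernel over
    concatenated, normalized (state, action) vectors, with lengthscale l_s on
    state dimensions and l_a on action dimensions."""
    rng = np.random.default_rng(seed)
    lengthscales = np.concatenate([np.full(d_s, l_s), np.full(d_a, l_a)])
    # Spectral frequencies of the RBF kernel, scaled per dimension.
    omega = rng.normal(size=(d_s + d_a, n_features)) / lengthscales[:, None]
    phase = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def phi(s, a):
        x = np.concatenate([np.atleast_1d(s), np.atleast_1d(a)])
        return np.sqrt(2.0 / n_features) * np.cos(x @ omega + phase)

    return phi

# Example with MountainCar's settings from Table 2: d_S=2, d_A=1, l_S=0.3, l_A=10, M=300.
phi = make_state_action_rff(d_s=2, d_a=1, l_s=0.3, l_a=10.0, n_features=300)
features = phi([0.1, -0.02], [0.5])  # feature vector of shape (300,)
```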

Domain | EMU-Q | VIME | DORA (discrete) | RFF-Q (classic reward)
SinglePendulum | 100% / 1.80 (1.07) | 95% / 2.05 (2.04) | 35% / 3.00 (4.11) | 100% / 1.00 (0.00)
MountainCar | 100% / 2.95 (0.38) | 65% / 5.08 (2.43) | 0% / – | 100% / 8.60 (8.05)
DoublePendulum | 100% / 1.10 (0.30) | 90% / 3.61 (2.75) | 0% / – | 100% / 4.20 (2.25)
CartpoleSwingUp | 90% / 12.40 (16.79) | 65% / 3.23 (2.66) | 35% / 48.71 (30.44) | 100% / 9.70 (12.39)
LunarLander | 100% / 28.75 (29.57) | 75% / 4.47 (2.47) | 30% / 35.17 (31.38) | 95% / 19.15 (24.06)
Reacher | 100% / 19.70 (20.69) | 95% / 3.68 (2.03) | 35% / 1.00 (0.00) | 95% / 26.55 (25.58)
Hopper | 60% / 52.85 (39.32) | 40% / 5.62 (3.35) | 20% / 30.50 (11.80) | 80% / 41.15 (35.72)
Table 3: Results for all 7 domains. Each entry shows the success rate of finding the goal within 100 episodes, followed by the mean (and standard deviation) number of episodes before the goal is found, computed on successful runs only. Success rate is more important than the number of episodes to goal. Results are averaged over 20 runs. DORA was run with discretized actions, and RFF-Q with ϵ-greedy exploration on domains with classic rewards.

Results displayed in Table 3 indicate that EMU-Q is more consistent than VIME or DORA at finding goal states on all domains, illustrating better exploration capabilities. The average number of episodes needed to reach goal states is computed only on successful runs. EMU-Q finds goals faster on lower-dimensional domains, while VIME tends to find goals faster on higher-dimensional domains but fails more often. Observing similar results for EMU-Q and RFF-Q confirms that EMU-Q can deal with goal-only rewards without sacrificing performance.

6.3 Jaco Manipulator

Figure 8: (a) Manipulator task: learning to reach a randomly located target (red ball). (b) EMU-Q’s directed exploration yields higher performance on this task compared to RFF-Q with ϵ-greedy exploration.

In this final experiment, we show the applicability of EMU-Q to realistic problems by demonstrating its efficacy on an advanced robotics simulator. In this robotics problem, the agent needs to learn to control a Jaco manipulator solely from observing joint configurations. Given a target position in 3D space, the agent’s goal is to bring the manipulator fingertips to this location by sending torque commands to each of the manipulator joints; see Figure 8a. Designing such target-reaching policies, also known as inverse kinematics for robotic arms, has been studied extensively. Instead, we focus here on learning a mapping from joint configuration to joint torques on a damaged manipulator. When a manipulator is damaged, previously computed inverse kinematics are no longer valid, so being able to learn a new target-reaching policy is important.

We model damage by immobilizing four of the arm joints, making previous inverse kinematics invalid. The target position is chosen at random from locations across the reachable space. Episodes terminate with unit reward when the target is reached within 50 steps; zero reward is given otherwise. We compare EMU-Q and RFF-Q on this domain, both using 600 random Fourier features approximating an RBF kernel. Parameters α, β, and κ were manually set to reasonable values of 0.1, 1.0, and 0.1 respectively. Figure 8b displays results averaged over 10 runs. The difference in the number of episodes solved shows that EMU-Q learns to complete the task more consistently than RFF-Q. This confirms that directed exploration is beneficial, even in a more realistic robotics scenario.

7 Conclusion

We proposed a novel framework for exploration in RL domains with very sparse or goal-only rewards. The framework makes use of multi-objective RL to define exploration and exploitation as two key objectives, bringing the balance between the two to the policy level. This formulation has several advantages over traditional exploration methods. It allows direct and online control over exploration, without additional computation or training. Strategies for such control were shown to experimentally outperform classic intrinsic RL in several respects. We demonstrated scalability to continuous state-action spaces by presenting EMU-Q, a method based on our framework, guiding exploration towards regions of higher value-function uncertainty. EMU-Q was experimentally shown to outperform classic exploration techniques and other intrinsic RL methods on a continuous control benchmark and on a robotic manipulator.

As future work, we would like to investigate how exploration as multi-objective RL can be brought to other types of RL methods such as policy gradient. This extension would enable control over exploration in domains with larger state-action spaces, and potentially in numerous real-life applications. Other interesting extensions include bringing the online control over exploration achieved in this work to life-long RL, where it would be beneficial: exploration could be tuned down in critical situations where high performance is necessary, or increased when learning new behaviours is required.

A Fourier Basis Features

Fourier basis features are described in Konidaris et al. (2011) as a linear function approximation scheme based on the Fourier series decomposition. Formally, the order-n feature map for state 𝒔 is defined as follows:

\bm{\phi}(\bm{s})=\cos(\pi\bm{s}^{T}\bm{C}), \qquad (29)

where 𝑪 is the Cartesian product of all c_j ∈ {0, …, n} for j = 1, …, d_S. Note that Fourier basis features do not scale well: the number of features grows exponentially with the state-space dimension.
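A direct transcription of Equation (29) is sketched below. It assumes states are normalized to [0, 1]^d and represents 𝑪 by its rows of coefficient vectors; the function name is illustrative.

```python
import itertools
import numpy as np

def fourier_basis(s, order):
    """Order-n Fourier basis features (Konidaris et al., 2011) for a state s
    normalized to [0, 1]^d. The number of features is (order + 1) ** d,
    hence the poor scaling with dimension noted above."""
    s = np.asarray(s, dtype=float)
    # All coefficient vectors c with entries in {0, ..., order}: the Cartesian product C,
    # stored row-wise (equivalent to Eq. (29) up to a transpose).
    C = np.array(list(itertools.product(range(order + 1), repeat=s.shape[0])))
    return np.cos(np.pi * C @ s)

features = fourier_basis([0.2, 0.7], order=3)  # (3 + 1) ** 2 = 16 features
```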

While Fourier basis features approximate value functions directly with periodic basis functions, random Fourier features are designed to approximate a kernel function with similar basis functions. As such, they recover the properties of kernel methods in the limit of the number of features. Additionally, random Fourier features scale better to higher dimensions.

B Derivation of Bayesian Linear Regression for Q-Learning

The likelihood function is defined as follows

p(\bm{t}|\bm{x},\bm{w},\beta)=\prod_{i=1}^{N}\mathcal{N}(t_{i}|r_{i}+\gamma\bm{w}^{T}\bm{\phi}_{s^{\prime}_{i},a^{\prime}_{i}},\beta^{-1}), \qquad (30)

where independent transitions are denoted x_i = (s_i, a_i, r_i, s′_i, a′_i). We treat the linear regression weights 𝒘 as random variables and introduce a Gaussian prior

p(\bm{w})=\mathcal{N}(\bm{w}|\bm{0},\alpha^{-1}\bm{I}) \qquad (31)

The weight posterior can be computed analytically with Bayes’ rule, resulting in a normal distribution

p(\bm{w}|\bm{t})=\frac{p(\bm{t}|\bm{w})p(\bm{w})}{p(\bm{t})}=\mathcal{N}(\bm{w}|\bm{m}_{Q},\bm{S}) \qquad (32)

Expressions for the mean 𝒎_Q and covariance 𝑺 follow from general results on products of normal distributions Bishop (2006):

\bm{m}_{Q}=\beta\bm{S}\bm{\Phi}_{\bm{s},\bm{a}}^{T}(\bm{r}+\gamma\bm{Q}^{\prime}) \qquad (33)
\bm{S}=(\alpha\bm{I}+\beta\bm{\Phi}_{\bm{s},\bm{a}}^{T}\bm{\Phi}_{\bm{s},\bm{a}})^{-1}, \qquad (34)

where 𝚽_{𝒔,𝒂} = {ϕ_{s_i,a_i}}_{i=1}^N, 𝑸′ = {Q(s′_i, a′_i)}_{i=1}^N, and 𝒓 = {r_i}_{i=1}^N. The predictive distribution p(t|𝒙, 𝒕, α, β) can be obtained by marginalizing the weights and is also normal

p(t|\bm{x},\bm{t},\alpha,\beta)=\int p(t|\bm{w},\beta)p(\bm{w}|\bm{x},\bm{t},\alpha,\beta)d\bm{w}=\mathcal{N}(t|Q,\sigma^{2}) \qquad (35)

Expressions for Q and σ² follow from general results in Bishop (2006), yielding

Q(s,a)=\mathbb{E}[p(t|\bm{x},\bm{t},\alpha,\beta)]=\bm{\phi}_{s,a}^{T}\bm{m}_{Q}, \qquad (36)
\sigma^{2}(s,a)=\mathbb{V}[p(t|\bm{x},\bm{t},\alpha,\beta)]=\beta^{-1}+\bm{\phi}_{s,a}^{T}\bm{S}\bm{\phi}_{s,a}. \qquad (37)
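The batch form of Equations (33)-(37) can be transcribed directly as follows. This is a plain illustration of the update under that batch assumption, not the authors' (possibly incremental) implementation, and the function name is hypothetical.

```python
import numpy as np

def blr_q_update(Phi, r, Q_next, alpha, beta, gamma):
    """Posterior and predictive moments of the Bayesian linear Q-model,
    following Eqs. (33)-(37). Phi is the N x M design matrix of features
    phi(s_i, a_i), r the rewards, and Q_next the bootstrapped values
    Q(s'_i, a'_i)."""
    M = Phi.shape[1]
    S = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)   # Eq. (34)
    m_Q = beta * S @ Phi.T @ (r + gamma * Q_next)               # Eq. (33)

    def predict(phi_sa):
        q = phi_sa @ m_Q                                        # Eq. (36), posterior mean
        var = 1.0 / beta + phi_sa @ S @ phi_sa                  # Eq. (37), predictive variance
        return q, var

    return m_Q, S, predict
```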

References

  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 2002.
  • Azizzadenesheli et al. (2018) Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. arXiv preprint arXiv:1802.04412, 2018.
  • Bellemare et al. (2016) Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Neural Information Processing Systems, 2016.
  • Bellemare et al. (2017) Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 2017.
  • Bishop (2006) Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.
  • Brafman and Tennenholtz (2002) Ronen I Brafman and Moshe Tennenholtz. R-max – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002.
  • Brochu et al. (2010) Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym, 2016.
  • Chentanez et al. (2005) Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, 2005.
  • Cox and John (1992) Dennis D Cox and Susan John. A statistical method for global optimization. In International Conference on Systems, Man and Cybernetics, 1992.
  • Dearden et al. (1998) Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Association for the Advancement of Artificial Intelligence, 1998.
  • Dietterich (2000) Thomas G Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 2000.
  • Engel et al. (2005) Yaakov Engel, Shie Mannor, and Ron Meir. Reinforcement learning with Gaussian processes. In International Conference on Machine Learning, 2005.
  • Fox et al. (2018) Lior Fox, Leshem Choshen, and Yonatan Loewenstein. DORA the explorer: Directed outreaching reinforcement action-selection. In International Conference on Learning Representations, 2018.
  • Gihman and Skorohod (1974) I Gihman and A Skorohod. The theory of stochastic processes, Vol. I, 1974.
  • Houthooft et al. (2016) Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. Vime: Variational information maximizing exploration. In Neural Information Processing Systems, 2016.
  • Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 2010.
  • Jones et al. (1998) Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 1998.
  • Kaelbling et al. (1996) Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 1996.
  • Kearns and Singh (2002) Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 2002.
  • Konidaris et al. (2011) George Konidaris, Sarah Osentoski, and Philip S Thomas. Value function approximation in reinforcement learning using the Fourier basis. In Association for the Advancement of Artificial Intelligence, 2011.
  • Kuleshov and Precup (2014) Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. arXiv preprint arXiv:1402.6028, 2014.
  • Lagoudakis and Parr (2003) Michail G Lagoudakis and Ronald Parr. Least-squares policy iteration. Journal of Machine Learning Research, 2003.
  • Little and Sommer (2013) Daniel Ying-Jeh Little and Friedrich Tobias Sommer. Learning and exploration in action-perception loops. Frontiers in neural circuits, 2013.
  • Lopes et al. (2012) Manuel Lopes, Tobias Lang, Marc Toussaint, and Pierre-Yves Oudeyer. Exploration in model-based reinforcement learning by empirically estimating learning progress. In Neural Information Processing Systems, 2012.
  • Meuleau and Bourgine (1999) Nicolas Meuleau and Paul Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 1999.
  • Morere and Ramos (2018) Philippe Morere and Fabio Ramos. Bayesian RL for goal-only rewards. In Conference on Robot Learning, 2018.
  • Morimura et al. (2010) Tetsuro Morimura, Masashi Sugiyama, Hisashi Kashima, Hirotaka Hachiya, and Toshiyuki Tanaka. Nonparametric return distribution approximation for reinforcement learning. In International Conference on Machine Learning, 2010.
  • Ng et al. (1999) Andrew Y Ng, Daishi Harada, and Stuart Russell. Policy invariance under reward transformations: Theory and application to reward shaping. In International Conference on Machine Learning, 1999.
  • Nouri and Littman (2009) Ali Nouri and Michael L Littman. Multi-resolution exploration in continuous spaces. In Neural Information Processing Systems, 2009.
  • Osband et al. (2014) Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.
  • Osband et al. (2016) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in neural information processing systems, 2016.
  • Oudeyer and Kaplan (2008) Pierre-Yves Oudeyer and Frederic Kaplan. How can we define intrinsic motivation? In International Conference on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, 2008.
  • Pathak et al. (2017) Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International Conference on Machine Learning, 2017.
  • Perny and Weng (2010) Patrice Perny and Paul Weng. On finding compromise solutions in multiobjective Markov decision processes. In European Conference on Artificial Intelligence, 2010.
  • Rahimi and Recht (2008) Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Neural Information Processing Systems, 2008.
  • Rajeswaran et al. (2017) Aravind Rajeswaran, Kendall Lowrey, Emanuel V. Todorov, and Sham M Kakade. Towards generalization and simplicity in continuous control. In Neural Information Processing Systems, 2017.
  • Reinke et al. (2017) Chris Reinke, Eiji Uchibe, and Kenji Doya. Average reward optimization with multiple discounting reinforcement learners. In International Conference on Neural Information Processing, 2017.
  • Roijers et al. (2013) Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 2013.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, 2015.
  • Stadie et al. (2015) Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.
  • Storck et al. (1995) Jan Storck, Sepp Hochreiter, and Jürgen Schmidhuber. Reinforcement driven information acquisition in non-deterministic environments. In International Conference on Artificial Neural Networks, 1995.
  • Sutherland and Schneider (2015) Dougal J Sutherland and Jeff Schneider. On the error of random Fourier features. In Conference on Uncertainty in Artificial Intelligence, 2015.
  • Sutton et al. (1998) Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction. MIT press, 1998.
  • Szita and Lőrincz (2008) István Szita and András Lőrincz. The many faces of optimism: A unifying approach. In International Conference on Machine Learning, 2008.
  • Wilson et al. (2014) Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using trajectory data to improve Bayesian optimization for reinforcement learning. Journal of Machine Learning Research, 2014.
  • Yang et al. (2014) Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In International Conference on Machine Learning, 2014.