
Active Exploration for Learning
Symbolic Representations

Garrett Andersen
PROWLER.io
Cambridge, United Kingdom
garrett@prowler.io
George Konidaris
Department of Computer Science
Brown University
gdk@cs.brown.edu
Abstract

We introduce an online active exploration algorithm for data-efficiently learning an abstract symbolic model of an environment. Our algorithm is divided into two parts: the first part quickly generates an intermediate Bayesian symbolic model from the data that the agent has collected so far, which the agent can then use along with the second part to guide its future exploration towards regions of the state space that the model is uncertain about. We show that our algorithm outperforms random and greedy exploration policies on two different computer game domains. The first domain is an Asteroids-inspired game with complex dynamics but basic logical structure. The second is the Treasure Game, with simpler dynamics but more complex logical structure.

1 Introduction

Much work has been done in artificial intelligence and robotics on how high-level state abstractions can be used to significantly improve planning [21]. However, building these abstractions is difficult, and consequently, they are typically hand-crafted [16, 14, 8, 4, 5, 6, 22, 10].

A major open question is then the problem of abstraction: how can an intelligent agent learn high-level models that can be used to improve decision making, using only noisy observations from its high-dimensional sensor and actuation spaces? Recent work [12, 13] has shown how to automatically generate symbolic representations suitable for planning in high-dimensional, continuous domains. This work is based on the hierarchical reinforcement learning framework [1], where the agent has access to high-level skills that abstract away the low-level details of control. The agent then learns representations for the (potentially abstract) effect of using these skills. For instance, opening a door is a high-level skill, while knowing that opening a door typically allows one to enter a building would be part of the representation for this skill. The key result of that work was that the symbols required to determine the probability of a plan succeeding are directly determined by characteristics of the skills available to an agent. The agent can learn these symbols autonomously by exploring the environment, which removes the need to hand-design symbolic representations of the world.

It is therefore possible to learn the symbols by naively collecting samples from the environment, for example through random exploration. However, in an online setting the agent should be able to use its previously collected data to compute an exploration policy that leads to better data efficiency. We introduce such an algorithm, which is divided into two parts: the first part quickly generates an intermediate Bayesian symbolic model from the data that the agent has collected so far, while the second part uses the model together with Monte-Carlo tree search to guide the agent's future exploration towards regions of the state space that the model is uncertain about. We show that our algorithm is significantly more data-efficient than more naive methods in two different computer game domains. The first domain is an Asteroids-inspired game with complex dynamics but basic logical structure. The second is the Treasure Game, with simpler dynamics but more complex logical structure.

2 Background

As a motivating example, imagine deciding the route you are going to take to the grocery store; instead of planning over the various sequences of muscle contractions that you would use to complete the trip, you would consider a small number of high-level alternatives, such as whether to take one route or another. You would also avoid considering how your exact low-level state affected your decision making, and instead use an abstract (symbolic) representation of your state with components such as whether you are at home or at work, whether you have to get gas, whether there is traffic, etc. This simplification reduces computational complexity and allows for increased generalization over past experiences. In the following sections, we introduce the frameworks that we use to represent the agent's high-level skills, and symbolic models for those skills.

2.1 Semi-Markov Decision Processes

We assume that the agent's environment can be described by a semi-Markov decision process (SMDP), given by a tuple $D=(S,O,R,P,\gamma)$, where $S\subseteq\mathbb{R}^{d}$ is a $d$-dimensional continuous state space, $O(s)$ returns the set of temporally extended actions, or options [21], available in state $s\in S$, $R(s^{\prime},t,s,o)$ and $P(s^{\prime},t\mid s,o)$ are the reward received and the probability of terminating in state $s^{\prime}\in S$ after $t$ time steps following the execution of option $o\in O(s)$ in state $s\in S$, and $\gamma\in(0,1]$ is a discount factor. In this paper, we are not concerned with the time taken to execute $o$, so we use $P(s^{\prime}\mid s,o)=\int P(s^{\prime},t\mid s,o)\,\mathrm{d}t$.

An option $o$ is given by three components: $\pi_{o}$, the option policy that is executed when the option is invoked; $I_{o}$, the initiation set consisting of the states from which the option can be executed; and $\beta_{o}(s)\rightarrow[0,1]$, the termination condition, which returns the probability that the option terminates upon reaching state $s$. Learning models for the initiation set, rewards, and transitions of each option allows the agent to reason about the effect of its actions in the environment. To learn these option models, the agent can collect observations of the forms $(s,O(s))$ when entering a state $s$ and $(s,o,s^{\prime},r,t)$ upon executing option $o$ from $s$.
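To make the two observation types concrete, the following is a minimal sketch of how they might be recorded; the class and field names are our own illustration, not part of the framework.

```python
# Minimal sketch (illustrative names) of the two observation types described above.
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[float, ...]  # a point in the d-dimensional continuous state space

@dataclass
class AvailabilityObservation:
    """Recorded on entering state s: the set of options O(s) available there."""
    s: State
    available_options: List[str]

@dataclass
class TransitionObservation:
    """Recorded after executing option o from state s."""
    s: State
    o: str          # option executed
    s_next: State   # termination state s'
    r: float        # reward received
    t: int          # time steps taken
```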

2.2 Abstract Representations for Planning

We are specifically interested in learning option models that allow the agent to easily evaluate the success probability of plans. A plan is a sequence of options to be executed from some starting state, and it succeeds if and only if it can be run to completion (regardless of the reward). Thus, a plan $\{o_{1},o_{2},\ldots,o_{n}\}$ with starting state $s$ succeeds if and only if $s\in I_{o_{1}}$ and the termination state of each option (except the last) lies in the initiation set of the following option, i.e. $s^{\prime}\sim P(s^{\prime}\mid s,o_{1})\in I_{o_{2}}$, $s^{\prime\prime}\sim P(s^{\prime\prime}\mid s^{\prime},o_{2})\in I_{o_{3}}$, and so on.

Recent work [12, 13] has shown how to automatically generate a symbolic representation that supports such queries, and is therefore suitable for planning. This work is based on the idea of a probabilistic symbol, a compact representation of a distribution over infinitely many continuous, low-level states. For example, a probabilistic symbol could be used to classify whether or not the agent is currently in front of a door, or one could be used to represent the state that the agent would find itself in after executing its ‘open the door’ option. In both cases, using probabilistic symbols also allows the agent to be uncertain about its state.

The following two probabilistic symbols are provably sufficient for evaluating the success probability of any plan [13]: the probabilistic precondition, $\mathrm{Pre}(o)=P(s\in I_{o})$, which expresses the probability that an option $o$ can be executed from each state $s\in S$, and the probabilistic image operator:

$$\mathrm{Im}(o,Z)=\frac{\int_{S}P(s^{\prime}\mid s,o)Z(s)P(I_{o}\mid s)\,\mathrm{d}s}{\int_{S}Z(s)P(I_{o}\mid s)\,\mathrm{d}s},$$

which represents the distribution over termination states if an option $o$ is executed from a distribution over starting states $Z$. These symbols can be used to compute the probability that each successive option in the plan can be executed, and these probabilities can then be multiplied to compute the overall success probability of the plan; see Figure 1 for a visual demonstration of a plan of length 2.

Figure 1: Determining the probability that a plan consisting of two options can be executed from a starting distribution $Z_{0}$. (a): $Z_{0}$ is contained in $\mathrm{Pre}(o_{1})$, so $o_{1}$ can definitely be executed. (b): Executing $o_{1}$ from $Z_{0}$ leads to the distribution over states $\mathrm{Im}(o_{1},Z_{0})$. (c): $\mathrm{Im}(o_{1},Z_{0})$ is not completely contained in $\mathrm{Pre}(o_{2})$, so the probability of being able to execute $o_{2}$ is less than 1. Note that $\mathrm{Pre}$ is a set and $\mathrm{Im}$ is a distribution, and the agent typically has uncertain models for them.
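To illustrate how $\mathrm{Pre}$ and $\mathrm{Im}$ combine into a plan success probability, here is a small sketch that assumes a finite, discretized state set, so that $Z$ and $\mathrm{Pre}(o)$ become vectors and $P(s^{\prime}\mid s,o)$ becomes a matrix; the function names are ours.

```python
# Sketch under a finite-state approximation: pre[o][s] ~ P(s in I_o),
# trans[o][s, s'] ~ P(s' | s, o), and Z is a probability vector over states.
import numpy as np

def image(trans_o: np.ndarray, pre_o: np.ndarray, Z: np.ndarray) -> np.ndarray:
    """Im(o, Z): distribution over termination states, given that o was executable."""
    weighted = Z * pre_o                   # Z(s) * P(I_o | s)
    if weighted.sum() == 0.0:
        return np.zeros_like(Z)
    start = weighted / weighted.sum()      # normalize, as in the Im(o, Z) definition
    return start @ trans_o                 # sum_s start(s) * P(s' | s, o)

def plan_success_probability(plan, Z0, pre, trans) -> float:
    """Multiply the probability that each successive option can be executed."""
    Z, prob = Z0, 1.0
    for o in plan:
        p_exec = float(Z @ pre[o])         # probability o can be executed from Z
        prob *= p_exec
        if p_exec == 0.0:
            return 0.0
        Z = image(trans[o], pre[o], Z)
    return prob
```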

Subgoal Options  Unfortunately, it is difficult to model $\mathrm{Im}(o,Z)$ for arbitrary options, so we focus on restricted types of options. A subgoal option [19] is a special class of option where the distribution over termination states (referred to as the subgoal) is independent of the distribution over starting states from which it was executed; e.g., if you decide to walk to your kitchen, the end result will be the same regardless of where you started.

For subgoal options, the image operator can be replaced with the effects distribution, $\mathrm{Eff}(o)=\mathrm{Im}(o,Z),\ \forall Z(S)$: the resulting distribution over states after executing $o$ from any start distribution $Z(S)$. Planning with a set of subgoal options is simple because for each ordered pair of options $o_{i}$ and $o_{j}$, it is possible to compute and store $G(o_{i},o_{j})$, the probability that $o_{j}$ can be executed immediately after executing $o_{i}$: $$G(o_{i},o_{j})=\int_{S}\mathrm{Pre}(o_{j},s)\,\mathrm{Eff}(o_{i})(s)\,\mathrm{d}s.$$
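Under the subgoal assumption $\mathrm{Eff}(o)$ is a single distribution, so in the same discretized sketch as above the pairwise table $G$ reduces to dot products (again with illustrative names):

```python
# Sketch: G[(oi, oj)] = sum_s Pre(oj, s) * Eff(oi)(s), with both stored as vectors.
import numpy as np

def pairwise_executability(pre: dict, eff: dict) -> dict:
    """Probability that each option oj can be executed immediately after oi."""
    return {(oi, oj): float(eff_oi @ pre_oj)
            for oi, eff_oi in eff.items()
            for oj, pre_oj in pre.items()}
```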

We use the following two generalizations of subgoal options. Abstract subgoal options model the more general case where executing an option leads to a subgoal for a subset of the state variables (called the mask), leaving the rest unchanged. For example, walking to the kitchen leaves the amount of gas in your car unchanged. More formally, the state vector can be partitioned into two parts $s=[a,b]$, such that executing $o$ leaves the agent in state $s^{\prime}=[a,b^{\prime}]$, where $P(b^{\prime})$ is independent of the distribution over starting states. The second generalization is the (abstract) partitioned subgoal option, which can be partitioned into a finite number of (abstract) subgoal options. For instance, an option for opening doors is not a subgoal option because there are many doors in the world; however, it can be partitioned into a set of subgoal options, one for every door.

The subgoal (and abstract subgoal) assumptions state that the exact state from which option execution starts does not substantially affect the options that can be executed next. This is somewhat restrictive and often does not hold for options as given, but it can hold once the options have been partitioned. Additionally, the assumptions need only hold approximately in practice.

3 Online Active Symbol Acquisition

Previous approaches for learning symbolic models from data [12, 13] used random exploration. However, real world data from high-level skills is very expensive to collect, so it is important to use a more data-efficient approach. In this section, we introduce a new method for learning abstract models data-efficiently. Our approach maintains a distribution over symbolic models which is updated after every new observation. This distribution is used to choose the sequence of options that in expectation maximally reduces the amount of uncertainty in the posterior distribution over models. Our approach has two components: an active exploration algorithm which takes as input a distribution over symbolic models and returns the next option to execute, and an algorithm for quickly building a distribution over symbolic models from data. The second component is an improvement upon previous approaches in that it returns a distribution and is fast enough to be updated online, both of which we require.

3.1 Fast Construction of a Distribution over Symbolic Option Models

Now we show how to construct a more general model than $G$ that can be used for planning with abstract partitioned subgoal options. The advantages of our approach versus previous methods are that our algorithm is much faster and the resulting model is Bayesian, both of which are necessary for the active exploration algorithm introduced in the next section.

Recall that the agent can collect observations of the forms $(s,o,s^{\prime})$ upon executing option $o$ from $s$, and $(s,O(s))$ when entering a state $s$, where $O(s)$ is the set of available options in state $s$. Given a sequence of observations of this form, the first step of our approach is to find the factors [13], partitions of state variables that always change together in the observed data. For example, consider a robot which has options for moving to the nearest table and for picking up a glass on an adjacent table. Moving to a table changes the $x$ and $y$ coordinates of the robot without changing the joint angles of the robot's arms, while picking up a glass does the opposite. Thus, the $x$ and $y$ coordinates and the arm joint angles of the robot belong to different factors. Splitting the state space into factors reduces the number of potential masks (see the end of Section 2.2) because we assume that if state variables $i$ and $j$ always change together in the observations, then this will always occur; e.g., we assume that moving to the table will never move the robot's arms. (The factors assumption is not strictly necessary, as we can assign each state variable to its own factor. However, using this uncompressed representation can lead to an exponential increase in the size of the symbolic state space and a corresponding increase in the sample complexity of learning the symbolic models.)

Finding the Factors   Compute the set of observed masks $M$ from the $(s,o,s^{\prime})$ observations: each observation's mask is the subset of state variables that differ substantially between $s$ and $s^{\prime}$. Since we work in continuous, stochastic domains, we must distinguish minor random noise (independent of the action) from a substantial change in a state variable caused by executing the action. In principle this requires modeling action-independent and action-dependent differences and distinguishing between them, but this is difficult to implement. Fortunately, we have found that in practice allowing some noise and applying a simple threshold is often effective, even in noisier and more complex domains. For each state variable $i$, let $M_{i}\subseteq M$ be the subset of the observed masks that contain $i$. Two state variables $i$ and $j$ belong to the same factor $f\in F$ if and only if $M_{i}=M_{j}$. Each factor $f$ is given by a set of state variables and thus corresponds to a subspace $S_{f}$. The factors are updated after every new observation.
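A minimal sketch of this step, using the simple threshold mentioned above (variable and function names are ours):

```python
# Compute each transition's mask with a noise threshold, then group state
# variables that appear in exactly the same set of masks into one factor.
import numpy as np

def observed_mask(s, s_next, threshold=1e-3):
    """Indices of state variables that changed substantially in one transition."""
    diff = np.abs(np.asarray(s_next) - np.asarray(s))
    return frozenset(np.flatnonzero(diff > threshold).tolist())

def find_factors(transitions, num_vars, threshold=1e-3):
    """transitions: iterable of (s, o, s_next). Returns a list of factors."""
    masks = {observed_mask(s, s_next, threshold) for (s, o, s_next) in transitions}
    M = {i: frozenset(m for m in masks if i in m) for i in range(num_vars)}
    factors = {}
    for i, Mi in M.items():
        factors.setdefault(Mi, []).append(i)   # i and j share a factor iff M_i == M_j
    return list(factors.values())
```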

Let $S^{*}$ be the set of states that the agent has observed and let $S^{*}_{f}$ be the projection of $S^{*}$ onto the subspace $S_{f}$ for some factor $f$; e.g., in the previous example there is an $S^{*}_{f}$ which consists of the set of observed robot $(x,y)$ coordinates. It is important to note that the agent's observations come only from executing partitioned abstract subgoal options. This means that $S^{*}_{f}$ consists only of abstract subgoals, because for each $s\in S^{*}$, $s_{f}$ was either unchanged from the previous state or changed to another abstract subgoal. In the robot example, all $(x,y)$ observations must be adjacent to a table because the robot can only execute an option that terminates with it adjacent to a table or one that does not change its $(x,y)$ coordinates. Thus, the states in $S^{*}$ can be imagined as a collection of abstract subgoals for each of the factors. Our next step is to build a set of symbols for each factor to represent its abstract subgoals, which we do using unsupervised clustering.

Finding the Symbols  For each factor $f\in F$, we find the set of symbols $Z^{f}$ by clustering $S^{*}_{f}$. Let $Z^{f}(s_{f})$ be the corresponding symbol for state $s$ and factor $f$. We then map the observed states $s\in S^{*}$ to their corresponding symbolic states $s^{d}=\{Z^{f}(s_{f}),\forall f\in F\}$, and the observations $(s,O(s))$ and $(s,o,s^{\prime})$ to $(s^{d},O(s))$ and $(s^{d},o,s^{\prime d})$, respectively.

In the robot example, the $(x,y)$ observations would be clustered around tables that the robot can travel to, so there would be a symbol corresponding to each table.
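A sketch of the symbol-finding step; it uses DBSCAN, which is the clustering algorithm reported in Appendix A.3, with illustrative parameter values:

```python
# Cluster the observed states projected onto each factor; each cluster label
# becomes one symbol Z^f(s_f) for that factor.
import numpy as np
from sklearn.cluster import DBSCAN

def find_symbols(observed_states: np.ndarray, factors, eps=1.5):
    """observed_states: (N, d) array of S*; factors: list of variable-index lists."""
    symbols = {}
    for f, variables in enumerate(factors):
        projected = observed_states[:, variables]               # S*_f
        symbols[f] = DBSCAN(eps=eps, min_samples=1).fit_predict(projected)
    return symbols

# The symbolic state of observation i is the tuple of its per-factor symbols:
# s_d[i] = tuple(symbols[f][i] for f in range(len(factors)))
```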

We want to build our models within the symbolic state space $S^{d}$. We therefore define the symbolic precondition, $\mathit{Pre}(o,s^{d})$, which returns the probability that the agent can execute an option from some symbolic state, and the symbolic effects distribution $\mathit{Eff}(o)$ of a subgoal option $o$, which maps to a subgoal distribution over symbolic states. For example, the robot's 'move to the nearest table' option maps the robot's current $(x,y)$ symbol to the one corresponding to the nearest table.

The next step is to partition the options into abstract subgoal options (in the symbolic state space), e.g. we want to partition the ‘move to the nearest table’ option in the symbolic state space so that the symbolic states in each partition have the same nearest table.

Partitioning the Options  For each option $o$, we initialize the partitioning $P^{o}$ so that each symbolic state starts in its own partition. We use independent Bayesian sparse Dirichlet-categorical models [20] for the symbolic effects distribution of each option partition. (We use sparse Dirichlet-categorical models because there are a combinatorial number of possible symbolic state transitions, but we expect each partition to have non-zero probability for only a small number of them.) We then perform Bayesian Hierarchical Clustering [9] to merge partitions that have similar symbolic effects distributions, using the closed-form solutions for Dirichlet-multinomial models provided by that paper.
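The quantity underlying these merge decisions is the Dirichlet-multinomial marginal likelihood of a partition's observed symbolic-effect counts. The sketch below shows only that evidence term and a pairwise merge score, not the full Bayesian Hierarchical Clustering recursion of [9]:

```python
# Dirichlet(alpha)-categorical evidence for a vector of symbolic-effect counts,
# and the log evidence ratio for merging two partitions' effects models.
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(counts, alpha=0.5):
    """log p(counts) under a symmetric Dirichlet(alpha) prior on the categorical."""
    counts = np.asarray(counts, dtype=float)
    K = len(counts)
    return (gammaln(K * alpha) - gammaln(K * alpha + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def log_merge_evidence_ratio(counts_a, counts_b, alpha=0.5):
    """Positive values favor merging the two partitions into one effects model."""
    merged = np.asarray(counts_a) + np.asarray(counts_b)
    return (log_marginal_likelihood(merged, alpha)
            - log_marginal_likelihood(counts_a, alpha)
            - log_marginal_likelihood(counts_b, alpha))
```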

There is a special case where the agent has observed that an option $o$ was available in some symbolic states $S^{d}_{a}$, but has yet to actually execute it from any $s^{d}\in S^{d}_{a}$. These states are not included in the Bayesian Hierarchical Clustering; instead, we use a special prior for the partition of $o$ that they belong to. After completing the merge step, the agent has a partitioning $P^{o}$ for each option $o$. Our prior is that, with probability $q_{o}$ (a user-specified parameter), each $s^{d}\in S^{d}_{a}$ belongs to the partition $p^{o}\in P^{o}$ which contains the symbolic states most similar to $s^{d}$, and with probability $1-q_{o}$ each $s^{d}$ belongs to its own partition. To determine the partition most similar to some symbolic state, we first find $A^{o}$, the smallest subset of factors which can still be used to correctly classify $P^{o}$. We then map each $s^{d}\in S^{d}_{a}$ to the most similar partition by trying to match $s^{d}$ masked by $A^{o}$ with a masked symbolic state already in one of the partitions. If there is no match, $s^{d}$ is placed in its own partition.

Our final consideration is how to model the symbolic preconditions. The main concern is that many factors are often irrelevant for determining if some option can be executed. For example, whether or not you have keys in your pocket does not affect whether you can put on your shoe.

Modeling the Symbolic Preconditions  Given an option $o$ and a subset of factors $F^{o}$, let $S^{d}_{F^{o}}$ be the symbolic state space projected onto $F^{o}$. We use independent Bayesian Beta-Bernoulli models for the symbolic precondition of $o$ in each masked symbolic state $s^{d}_{F^{o}}\in S^{d}_{F^{o}}$. For each option $o$, we use Bayesian model selection to find the subset of factors $F_{*}^{o}$ that maximizes the likelihood of the symbolic precondition models.
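A sketch of this model-selection step: for a candidate factor subset, group the availability observations by the masked symbolic state and score the grouping by its total Beta-Bernoulli evidence. Exhaustively searching only small subsets is our simplification; the paper does not specify the search procedure.

```python
# Score candidate factor subsets F^o by the Beta-Bernoulli marginal likelihood
# of the availability data grouped by the masked symbolic state.
from collections import defaultdict
from itertools import combinations
from scipy.special import betaln

def log_beta_bernoulli_evidence(successes, failures, a=1.0, b=1.0):
    """log p(data) under a Beta(a, b) prior on the availability probability."""
    return betaln(a + successes, b + failures) - betaln(a, b)

def score_factor_subset(availability_obs, subset):
    """availability_obs: list of (symbolic_state_tuple, available_bool)."""
    counts = defaultdict(lambda: [0, 0])
    for s_d, available in availability_obs:
        masked = tuple(s_d[f] for f in subset)
        counts[masked][0 if available else 1] += 1
    return sum(log_beta_bernoulli_evidence(succ, fail) for succ, fail in counts.values())

def select_factors(availability_obs, num_factors, max_size=2):
    """Keep the best-scoring subset among all subsets up to max_size (a simplification)."""
    candidates = [c for k in range(max_size + 1)
                  for c in combinations(range(num_factors), k)]
    return max(candidates, key=lambda c: score_factor_subset(availability_obs, c))
```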

Algorithm 1 Fast Construction of a Distribution over Symbolic Option Models
1: Find the set of observed masks $M$.
2: Find the factors $F$.
3: $\forall f\in F$, find the set of symbols $Z^{f}$.
4: Map the observed states $s\in S^{*}$ to symbolic states $s^{d}\in S^{*d}$.
5: Map the observations $(s,O(s))$ and $(s,o,s^{\prime})$ to $(s^{d},O(s))$ and $(s^{d},o,s^{\prime d})$.
6: $\forall o\in O$, initialize $P^{o}$ and perform Bayesian Hierarchical Clustering on it.
7: $\forall o\in O$, find $A^{o}$ and $F_{*}^{o}$.

The final result is a distribution over symbolic option models $H$, which consists of the combined sets of independent symbolic precondition models $\{\mathit{Pre}(o,s^{d}_{F_{*}^{o}});\ \forall o\in O,\ \forall s^{d}_{F_{*}^{o}}\in S^{d}_{F_{*}^{o}}\}$ and independent symbolic effects distribution models $\{\mathit{Eff}(o,p^{o});\ \forall o\in O,\ \forall p^{o}\in P^{o}\}$.

The complete procedure is given in Algorithm 1. A symbolic option model $h\sim H$ can be sampled by drawing parameters for each of the Bernoulli and categorical distributions from the corresponding Beta and sparse Dirichlet distributions, and drawing outcomes for each $q_{o}$. It is also possible to consider distributions over other parts of the model, such as the symbolic state space and/or a more complicated model for the option partitionings, which we leave for future work.
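A minimal sketch of this sampling step, keeping only the Beta and Dirichlet draws and omitting the sparse-support prior and the $q_{o}$ outcomes for brevity (the dictionary formats are our own):

```python
# Draw one symbolic option model h ~ H from posterior hyperparameters.
import numpy as np

rng = np.random.default_rng()

def sample_model(precondition_counts, effect_counts, a=1.0, b=1.0, alpha=0.5):
    """precondition_counts[(o, masked_state)] = (successes, failures);
       effect_counts[(o, partition)] = array of symbolic-transition counts."""
    h_pre = {key: rng.beta(a + succ, b + fail)
             for key, (succ, fail) in precondition_counts.items()}
    h_eff = {key: rng.dirichlet(alpha + np.asarray(counts, dtype=float))
             for key, counts in effect_counts.items()}
    return h_pre, h_eff
```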

3.2 Optimal Exploration

In the previous section we showed how to efficiently compute a distribution over symbolic option models $H$. Recall that the ultimate purpose of $H$ is to compute the success probabilities of plans (see Section 2.2). Thus, the quality of $H$ is determined by the accuracy of its predicted plan success probabilities, and efficiently learning $H$ corresponds to selecting the sequence of observations that maximizes the expected accuracy of $H$. However, it is difficult to calculate the expected accuracy of $H$ over all possible plans, so we instead define and optimize a proxy measure that is intended to represent the amount of uncertainty in $H$. In this section, we introduce our proxy measure, followed by an algorithm for finding the exploration policy that optimizes it. The algorithm operates in an online manner: building $H$ from the data collected so far, using $H$ to select an option to execute, updating $H$ with the new observation, and so on.

First we define the standard deviation $\sigma_{H}$, the quantity we use to represent the amount of uncertainty in $H$. To define the standard deviation, we need to also define the distance and mean.

We define the distance $K$ from $h_{2}\in H$ to $h_{1}\in H$ to be the sum of the Kullback-Leibler (KL) divergences between their individual symbolic effects distributions plus the sum of the KL divergences between their individual symbolic precondition distributions (the KL divergence has previously been used in other active exploration scenarios [17, 15]; as in that work, we define the distance to depend only on the transition models and not the reward models):

$$K(h_{1},h_{2})=\sum_{o\in O}\Bigl[\sum_{s^{d}_{F_{*}^{o}}\in S^{d}_{F_{*}^{o}}}D_{KL}\bigl(\mathit{Pre}^{h_{1}}(o,s^{d}_{F_{*}^{o}})\,\|\,\mathit{Pre}^{h_{2}}(o,s^{d}_{F_{*}^{o}})\bigr)+\sum_{p^{o}\in P^{o}}D_{KL}\bigl(\mathit{Eff}^{h_{1}}(o,p^{o})\,\|\,\mathit{Eff}^{h_{2}}(o,p^{o})\bigr)\Bigr].$$

We define the mean, $\mathbb{E}[H]$, to be the symbolic option model in which each Bernoulli symbolic precondition and categorical symbolic effects distribution is equal to the mean of the corresponding Beta or sparse Dirichlet distribution:

$$\forall o\in O,\ \forall p^{o}\in P^{o},\quad \mathit{Eff}^{\mathbb{E}[H]}(o,p^{o})=\mathbb{E}_{h\sim H}[\mathit{Eff}^{h}(o,p^{o})],$$
$$\forall o\in O,\ \forall s^{d}_{F_{*}^{o}}\in S^{d}_{F_{*}^{o}},\quad \mathit{Pre}^{\mathbb{E}[H]}(o,s^{d}_{F_{*}^{o}})=\mathbb{E}_{h\sim H}[\mathit{Pre}^{h}(o,s^{d}_{F_{*}^{o}})].$$

The standard deviation $\sigma_{H}$ is then simply $\sigma_{H}=\mathbb{E}_{h\sim H}[K(h,\mathbb{E}[H])]$. This represents the expected amount of information that is lost if $\mathbb{E}[H]$ is used to approximate $H$.
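The sketch below estimates $\sigma_{H}$ by Monte Carlo, reusing the dictionary format of the sampling sketch above: each model is a pair of a Bernoulli-parameter dictionary and a categorical-distribution dictionary.

```python
# K sums KL divergences between precondition (Bernoulli) and effects
# (categorical) models; sigma_H is approximated by averaging K(h, E[H]).
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_categorical(p, q, eps=1e-12):
    p, q = np.clip(p, eps, None), np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def distance_K(h1, h2):
    pre1, eff1 = h1
    pre2, eff2 = h2
    return (sum(kl_bernoulli(pre1[k], pre2[k]) for k in pre1)
            + sum(kl_categorical(eff1[k], eff2[k]) for k in eff1))

def estimate_sigma(sample_model_fn, mean_model, n_samples=100):
    """Monte-Carlo estimate of sigma_H = E_{h~H}[K(h, E[H])]."""
    return float(np.mean([distance_K(sample_model_fn(), mean_model)
                          for _ in range(n_samples)]))
```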

Now we define the optimal exploration policy for the agent, which aims to maximize the expected reduction in $\sigma_{H}$ after $H$ is updated with new observations. Let $H(w)$ be the posterior distribution over symbolic models when $H$ is updated with symbolic observations $w$ (the partitioning is not updated, only the symbolic effects distribution and symbolic precondition models), and let $W(H,i,\pi)$ be the distribution over symbolic observations drawn from the posterior of $H$ if the agent follows policy $\pi$ for $i$ steps. We define the optimal exploration policy $\pi^{*}$ as:

$$\pi^{*}=\underset{\pi}{\operatorname{argmax}}\ \ \sigma_{H}-\mathbb{E}_{w\sim W(H,i,\pi)}[\sigma_{H(w)}].$$

For the convenience of our algorithm, we rewrite the second term by switching the order of the expectations: $\mathbb{E}_{w\sim W(H,i,\pi)}\bigl[\mathbb{E}_{h\sim H(w)}[K(h,\mathbb{E}[H(w)])]\bigr]=\mathbb{E}_{h\sim H}\bigl[\mathbb{E}_{w\sim W(h,i,\pi)}[K(h,\mathbb{E}[H(w)])]\bigr]$.

Note that the objective function is non-Markovian because $H$ is continuously updated with the agent's new observations, which changes $\sigma_{H}$. This means that $\pi^{*}$ is non-stationary, so Algorithm 2 approximates $\pi^{*}$ in an online manner using Monte-Carlo tree search (MCTS) [3] with the UCT tree policy [11]. $\pi_{T}$ is the combined tree and rollout policy for MCTS, given tree $T$.

There is a special case when the agent simulates the observation of a previously unobserved transition, which can occur under the sparse Dirichlet-categorical model. In this case the amount of information gained is very large, and furthermore, the agent is likely to transition to a novel symbolic state. Rather than modeling the unexplored state space, if an unobserved transition is encountered during an MCTS update, the update immediately terminates with a large bonus to the score, an approach similar to that of the R-max algorithm [2]. The bonus takes the form $-zg$, where $g$ is the depth at which the update terminated and $z$ is a constant; terminating early on a novel transition therefore incurs a smaller penalty, which acts as the bonus. The bonus reflects the opportunity cost of not experiencing something novel as quickly as possible, and in practice it tends to dominate (as it should).

Algorithm 2 Optimal Exploration
1: Input: the number of remaining option executions $i$.
2: while $i\geq 0$ do
3:   Build $H$ from observations (Algorithm 1).
4:   Initialize tree $T$ for MCTS.
5:   while number of updates $<$ threshold do
6:     Sample a symbolic model $h\sim H$.
7:     Do an MCTS update of $T$ with dynamics given by $h$.
8:     Terminate the current update if its depth $g$ is $\geq i$, or if an unobserved transition is encountered.
9:     Store the simulated observations $w\sim W(h,g,\pi_{T})$.
10:    Score $= K(h,\mathbb{E}[H])-K(h,\mathbb{E}[H(w)])-zg$.
11:  end while
12:  Select the most visited child of the root node.
13:  Execute the corresponding option; update observations; $i\leftarrow i-1$.
14: end while
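To make line 10 of Algorithm 2 concrete, here is a sketch of the score assigned to one simulated trajectory, reusing distance_K from the previous sketch; the depth term is the novelty bonus described above.

```python
def rollout_score(h, mean_H, mean_H_given_w, depth, z=10.0):
    """Line 10 of Algorithm 2: information gained about h minus the depth term.
    A rollout that stops early on an unobserved transition has a smaller depth,
    so the -z * depth term penalizes it less; that difference is the bonus."""
    return distance_K(h, mean_H) - distance_K(h, mean_H_given_w) - z * depth
```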

4 The Asteroids Domain

The Asteroids domain is shown in Figure 2(a) and was implemented using the physics simulator pybox2d. The agent controls a ship by either applying thrust in the direction it is facing or applying torque in either direction. The goal of the agent is to navigate the environment without colliding with any of the four "asteroids." The agent's starting location is next to asteroid 1. The agent is given the following 6 options (see Appendix A for additional details):

  1. move-counterclockwise and move-clockwise: the ship moves from the face it is currently adjacent to, to the midpoint of the next face counterclockwise/clockwise on the same asteroid. Only available if the ship is at an asteroid.

  2. move-to-asteroid-1, move-to-asteroid-2, move-to-asteroid-3, and move-to-asteroid-4: the ship moves to the midpoint of the closest face of asteroid 1-4 to which it has an unobstructed path. Only available if the ship is not already at that asteroid and an unobstructed path to some face exists.

Exploring with these options results in only one factor (for the entire state space), with symbols corresponding to each of the 35 asteroid faces as shown in Figure 2(a).

Figure 2: (a): The Asteroids Domain, and the 35 symbols which can be encountered while exploring with the provided options. (b): The Treasure Game domain. Although the game screen is drawn using large image tiles, sprite movement is at the pixel level.

Results  We tested the performance of three exploration algorithms: random, greedy, and our algorithm. For the greedy algorithm, the agent first computes the symbolic state space using steps 1-5 of Algorithm 1, and then chooses the option with the lowest execution count from its current symbolic state. The hyperparameter settings that we use for our algorithm are given in Appendix A.

Figures 3(a), 3(b), and 3(c) show the percentage of time that the agent spends exploring asteroids 1, 3, and 4, respectively. The random and greedy policies have difficulty escaping asteroid 1 and are rarely able to reach asteroid 4. On the other hand, our algorithm allocates its time much more proportionally. Figure 3(d) shows the number of symbolic transitions that the agent has not observed (out of 115 possible), where we used Algorithm 1 to build symbolic models from the data gathered by each exploration algorithm. As we discussed in Section 3, the number of unobserved symbolic transitions is a good representation of the amount of information that the models are missing from the environment.

Our algorithm significantly outperforms random and greedy exploration. Note that these results use an uninformative prior, and the performance of our algorithm could be significantly improved by starting with more information about the environment. To give additional intuition, in Appendix A we show heatmaps of the $(x,y)$ coordinates visited by each of the exploration algorithms.

Figure 3: Simulation results for the Asteroids domain. Each bar represents the average of 100 runs. The error bars represent a 99% confidence interval for the mean. (a), (b), (c): The fraction of time that the agent spends on asteroids 1, 3, and 4, respectively. The greedy and random exploration policies spend significantly more time than our algorithm exploring asteroid 1 and significantly less time exploring asteroids 3 and 4. (d): The number of symbolic transitions that the agent has not observed (out of 115 possible). The greedy and random policies require 2-3 times as many option executions to match the performance of our algorithm.

5 The Treasure Game Domain

The Treasure Game [13], shown in Figure 2(b), features an agent in a 2D, $528\times 528$ pixel video-game-like world, whose goal is to obtain treasure and return to its starting position on a ladder at the top of the screen. The 9-dimensional state space is given by the $x$ and $y$ positions of the agent, key, and treasure, the angles of the two handles, and the state of the lock.

The agent is given 9 options: go-left, go-right, up-ladder, down-ladder, jump-left, jump-right, down-right, down-left, and interact. See Appendix A for a more detailed description of the options and the environment dynamics. Given these options, the 7 factors and their corresponding numbers of symbols are: player-$x$, 10; player-$y$, 9; handle1-angle, 2; handle2-angle, 2; key-$x$ and key-$y$, 3; bolt-locked, 2; and goldcoin-$x$ and goldcoin-$y$, 2.

Results  We tested the performance of the same three algorithms: random, greedy, and our algorithm. Figure 4(a) shows the fraction of time that the agent spends without having the key and with the lock still locked. Figures 4(b) and 4(c) show the number of times that the agent obtains the key and treasure, respectively. Figure 4(d) shows the number of unobserved symbolic transitions (out of 240 possible). Again, our algorithm performs significantly better than random and greedy exploration. The data from our algorithm has much better coverage, and thus leads to more accurate symbolic models. For instance in Figure 4(c) you can see that random and greedy exploration did not obtain the treasure after 200 executions; without that data the agent would not know that it should have a symbol that corresponds to possessing the treasure.

6 Conclusion

We have introduced a two-part algorithm for data-efficiently learning an abstract symbolic representation of an environment which is suitable for planning with high-level skills. The first part of the algorithm quickly generates an intermediate Bayesian symbolic model directly from data. The second part guides the agent’s exploration towards areas of the environment that the model is uncertain about. This algorithm is useful when the cost of data collection is high, as is the case in most real world artificial intelligence applications. Our results show that the algorithm is significantly more data efficient than using more naive exploration policies.

Figure 4: Simulation results for the Treasure Game domain. Each bar represents the average of 100 runs. The error bars represent a 99% confidence interval for the mean. (a): The fraction of time that the agent spends without having the key and with the lock still locked. The greedy and random exploration policies spend significantly more time than our algorithm exploring without the key and with the lock still locked. (b), (c): The average number of times that the agent obtains the key and treasure, respectively. Our algorithm obtains both the key and treasure significantly more frequently than the greedy and random exploration policies. There is a discrepancy between the number of times that our agent obtains the key and the treasure because there are more symbolic states where the agent can try the option that leads to a reset than where it can try the option that leads to obtaining the treasure. (d): The number of symbolic transitions that the agent has not observed (out of 240 possible). The greedy and random policies require 2-3 times as many option executions to match the performance of our algorithm.

7 Acknowledgements

This research was supported in part by the National Institutes of Health under award number R01MH109177. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

References

  • [1] A.G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.
  • [2] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
  • [3] C.B. Browne, E. Powley, D. Whitehouse, S.M. Lucas, P.I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte-Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
  • [4] S. Cambon, R. Alami, and F. Gravot. A hybrid approach to intricate motion, manipulation and task planning. International Journal of Robotics Research, 28(1):104–126, 2009.
  • [5] J. Choi and E. Amir. Combining planning and motion planning. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 4374–4380, 2009.
  • [6] Christian Dornhege, Marc Gissler, Matthias Teschner, and Bernhard Nebel. Integrating symbolic and geometric planning for mobile manipulation. In IEEE International Workshop on Safety, Security and Rescue Robotics, November 2009.
  • [7] Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 226–231, 1996.
  • [8] E. Gat. On three-layer architectures. In D. Kortenkamp, R.P. Bonnasso, and R. Murphy, editors, Artificial Intelligence and Mobile Robots. AAAI Press, 1998.
  • [9] K.A. Heller and Z. Ghahramani. Bayesian hierarchical clustering. In Proceedings of the 22nd international conference on Machine learning, pages 297–304. ACM, 2005.
  • [10] L. Kaelbling and T. Lozano-Pérez. Hierarchical planning in the Now. In Proceedings of the IEEE Conference on Robotics and Automation, 2011.
  • [11] L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Springer, 2006.
  • [12] G.D. Konidaris, L.P. Kaelbling, and T. Lozano-Perez. Constructing symbolic representations for high-level planning. In Proceedings of the Twenty-Eighth Conference on Artificial Intelligence, pages 1932–1940, 2014.
  • [13] G.D. Konidaris, L.P. Kaelbling, and T. Lozano-Perez. Symbol acquisition for probabilistic high-level planning. In Proceedings of the Twenty Fourth International Joint Conference on Artificial Intelligence, pages 3619–3627, 2015.
  • [14] C. Malcolm and T. Smithers. Symbol grounding via a hybrid architecture in an autonomous assembly system. Robotics and Autonomous Systems, 6(1-2):123–144, 1990.
  • [15] S.A. Mobin, J.A. Arnemann, and F. Sommer. Information-based learning by agents in unbounded state spaces. In Advances in Neural Information Processing Systems, pages 3023–3031, 2014.
  • [16] N.J. Nilsson. Shakey the robot. Technical report, SRI International, April 1984.
  • [17] L. Orseau, T. Lattimore, and M. Hutter. Universal knowledge-seeking agents for stochastic environments. In International Conference on Algorithmic Learning Theory, pages 158–172. Springer, 2013.
  • [18] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.
  • [19] D. Precup. Temporal Abstraction in Reinforcement Learning. PhD thesis, Department of Computer Science, University of Massachusetts Amherst, 2000.
  • [20] N. Friedman and Y. Singer. Efficient Bayesian parameter estimation in large discrete domains. In Advances in Neural Information Processing Systems 11: Proceedings of the 1998 Conference, volume 11, page 417. MIT Press, 1999.
  • [21] R.S. Sutton, D. Precup, and S.P. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
  • [22] J. Wolfe, B. Marthi, and S.J. Russell. Combined task and motion planning for mobile manipulation. In International Conference on Automated Planning and Scheduling, 2010.

Appendix A Environment Descriptions

A.1 Asteroids Domain Cont.

The agent's 6 options are implemented using PD controllers for the torque and thrust. The options do not always work as intended: sometimes the ship crashes during the execution of an option, which resets the environment. If the ship tries to move from asteroid $a$ to asteroid $b>a$, it crashes with probability $0.5$. As designed, these options do not have the subgoal property, because the outcome of executing each option depends on which face of which asteroid the option was executed from. However, they are partitioned subgoal options, because their outcome depends only on which asteroid face they were executed from.

Figure 5: Heatmaps of the $(x,y)$ coordinates visited by each exploration algorithm in the Asteroids domain. The plot was generated by the hexbin function in matplotlib. Our algorithm explores the state space much more uniformly than the random and greedy exploration algorithms.

Results Cont.  Figure 5 shows heatmaps of the $(x,y)$ coordinates visited by each exploration algorithm in the Asteroids domain. Our algorithm explores the state space much more uniformly than the random and greedy exploration algorithms.

A.2 Treasure Game Domain Cont.

The low-level actions available to the agent are move up, down, left, and right, jump, and interact. The 4 movement actions move the agent between 2 and 4 pixels, uniformly at random, in the appropriate direction. There are three doors which may block the path of the agent. The top two doors are always in opposite states (one open, one closed); flipping one of the two handles switches their status. The bottom door, which guards the treasure, can be opened by obtaining the key and using it on the lock. The interact action is available when the agent is standing in front of a handle, or when it possesses the key and is standing in front of the lock. In the first case, executing the interact action flips the handle's position with probability 0.8; in the second case, the lock is unlocked and the agent loses the key. Whenever the agent has possession of the key and/or the treasure, they are displayed in the lower-right corner of the screen. Returning to the top ladder resets the environment. The agent's 9 options are implemented using simple control loops:

  1. go-right and go-left: the agent moves continuously right/left until it reaches a wall, an edge, an object it can interact with, or a ladder. Only available when the agent's way is not directly blocked.

  2. up-ladder and down-ladder: the agent ascends/descends a ladder. Only available when the agent is directly below/above a ladder.

  3. down-left and down-right: the agent falls off an edge onto the nearest solid cell on its left/right. Only available when they would succeed.

  4. jump-left and jump-right: the agent jumps and moves left/right for about 48 pixels. Only available when the area above the agent's head, and above its head and to the left/right, are clear.

  5. interact: same as the low-level interact action.

These options, like the low-level actions they are composed of, all have at least a small amount of stochasticity in their outcomes. Additionally, when the agent executes one of the jump options to reach a faraway ledge, for instance when it is trying to get the key, it succeeds with probability 0.53, and misses the ledge and lands directly below with probability 0.47. These are abstract partitioned subgoal options.

A.3 Hyperparameter Settings

In each run the agent had access to the exact number of option executions it had remaining to explore with. For MCTS, we used the UCT tree policy with $c=2$ and a random rollout policy, and performed $1000$ updates. Also, during UCT option selection, we normalized a node's score using the highest and lowest scores seen so far. For the sparse Dirichlet-multinomial models, we used hyperparameter $\alpha=0.5$, and the prior probability over the size of the support was given by a geometric distribution with parameter $0.5$. For the state clustering (step 3 of Algorithm 1), we used the DBSCAN algorithm [7] implemented in scikit-learn [18] with parameters min-samples $=1$, and $\epsilon=1.5$ for the Asteroids domain and $\epsilon=0.05$ for the Treasure Game domain. For Algorithm 2, we set $z=10$. For all options $o$, we set $q_{o}=0.3$.
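For reference, the clustering calls corresponding to these settings would look as follows (DBSCAN [7] in scikit-learn [18]); this is only a restatement of the parameters listed above.

```python
from sklearn.cluster import DBSCAN

asteroids_clustering = DBSCAN(eps=1.5, min_samples=1)    # Asteroids domain
treasure_clustering = DBSCAN(eps=0.05, min_samples=1)    # Treasure Game domain
```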