
Learning on a Budget via Teacher Imitation

Ercüment İlhan, Jeremy Gow and Diego Perez-Liebana School of Electronic Engineering and Computer Science
Queen Mary University of London
London, United Kingdom
{e.ilhan, jeremy.gow, diego.perez}@qmul.ac.uk
Abstract

Deep Reinforcement Learning (RL) techniques can benefit greatly from leveraging prior experience, which can be either self-generated or acquired from other entities. Action advising is a framework that provides a flexible way to transfer such knowledge in the form of actions between teacher-student peers. However, due to realistic concerns, the number of these interactions is limited by a budget; therefore, it is crucial to perform them at the most appropriate moments. Several promising recent studies address this problem setting, especially from the student's perspective. Despite their success, they have shortcomings in practical applicability and in offering a complete solution to the learning-from-advice challenge. In this paper, we extend the idea of advice reuse via teacher imitation to construct a unified approach that addresses both the advice collection and the advice utilisation problems. We also propose a method to automatically tune the relevant hyperparameters of these components on-the-fly, so that the approach can adapt to any task with minimal human intervention. The experiments we performed in 5 different Atari games verify that our algorithm either surpasses or performs on par with its top competitors while being far simpler to employ. Furthermore, its individual components are also found to provide significant advantages on their own.

Index Terms:
reinforcement learning, deep q-networks, action advising, teacher-student framework
978-1-6654-3886-5/21/$31.00 ©2021 IEEE

I Introduction

Deep Reinforcement Learning (RL) has proven to be a successful approach for solving decision-making problems in a variety of difficult domains such as video games [1], board games [2] and robot manipulation [3]. However, achieving the reported performances is not entirely straightforward. One of the most critical setbacks in deep RL is the exhaustive training process that usually requires many interactions with the environment. This occurs mainly due to RL's inherent exploration challenges as well as the complexity of the incorporated function approximators, e.g. deep neural networks. To this date, there has been a remarkable amount of research effort to overcome this sample inefficiency by devising advanced exploration strategies [4]. In addition, other lines of work that leverage some form of legacy knowledge to tackle these issues have also been studied extensively with great success.

The ability to learn by utilising prior experience instead of starting from scratch is an essential component of intelligence. In RL, this idea has been investigated in various forms tailored for different problem settings. Imitation Learning (IL) [5] studies the concept of mimicking an expert behaviour presented in a pre-recorded dataset, without any RL rewards from the environment itself. Similarly, Learning from Demonstrations (LfD) [6][7] extends this idea by incorporating RL interactions and rewards to eventually surpass the experts. Other approaches such as Policy Reuse [8] study ways of speeding up the agent's learning by leveraging past policies directly, instead of datasets.

In this paper, we study a different problem setup where it is not possible to have any pre-generated datasets or to obtain the useful policies themselves; instead, the learning agent only has access to some peer(s) over a limited communication channel. Such a setting is especially relevant for scenarios with unknown task specifications (prior to learning and deployment) and non-transferable yet beneficial policies, e.g. humans or non-stationary agents in the loop. A flexible framework tailored for this setting is Action Advising [9]. In this framework, agents exchange knowledge with each other in the form of actions to speed up their learning. However, the number of these peer-to-peer interactions is limited by a budget to resemble real-world limitations, which essentially turns the problem into determining the best possible way to utilise the available budget. Based on the peer that is in charge of driving these interactions, action advising algorithms can be student-initiated, teacher-initiated or jointly-initiated.

Action advising algorithms in deep RL have obtained promising results with an emphasis on student-initiated strategies [10][11][12]. While the majority of these focus on the question of when to ask for advice, some recent approaches [13] have also investigated ways to further utilise the collected advice by imitating and reusing the teacher policy. Despite these developments, several significant shortcomings remain. These techniques often employ threshold hyperparameters to control the decisions to initiate advice exchange, which play a key role in their efficiency. However, these parameters are sensitive to the learning state of the models as well as to the domain properties. Therefore, they need to be tuned very carefully prior to execution, which involves unrealistic access to the target tasks for trial runs. Furthermore, the studies that leverage teacher advice beyond collection are currently in their early stages and do not provide a complete solution to the problem beyond addressing the advice reusing aspect.

In this paper, we present an all-in-one student-initiated approach that is capable of collecting and reusing advice in a budget-efficient manner, extending [13] in multiple ways. First, we propose a method for automatically determining the threshold parameters responsible for the decisions to request and reuse advice. This greatly alleviates the burden of task-specific hyperparameter tuning. Secondly, we follow a decaying advice reuse schedule that is not tied to the student's exploration strategy. Finally, instead of using the imitated policy only for reusing advice as in [13], we incorporate this policy to determine and collect more diverse advice, in order to construct a more universal imitation policy.

The rest of this paper is structured as follows: Section II outlines the most relevant previous work. In Section III, the background knowledge that is required to understand the paper is provided. Afterwards, we describe our approach in detail in Section IV. Then, Section V presents our evaluation domain and experimental procedure. In Section VI, we share and analyse the experiment results. Finally, Section VII wraps up this study with conclusions and future remarks.

II Related Work

Action advising techniques with budget constraints were originally proposed and studied extensively in tabular domains. In [9], the teacher-student learning procedure was formalised for the first time and some solutions from the teacher's perspective were proposed. This was later extended with new heuristics [14] as well as with a meta-learning approach [15]. Later on, action advising interactions were also studied in student-initiated and jointly-initiated forms [16], which were then adopted in multi-agent problems where the agents take student and teacher roles interchangeably [17]. In a recent work [18], the idea of reusing previously collected advice was investigated to improve learning performance and to make more efficient use of the available budget.

Deep RL is a considerably newer domain for action advising studies. [19] introduced a novel LfD setup in which the demonstrations dataset is built interactively, as in action advising. To do so, they employed models capable of uncertainty estimation, with LfD loss terms integrated in the learning stage. [10] extended jointly-initiated action advising [17] to be applicable in multi-agent deep RL for the first time. This study replaced the state counters that were used to assess uncertainty in the tabular version [17] with state novelty measurements obtained with the aid of Random Network Distillation (RND) [20]. In [11], an uncertainty-based advice collection strategy was proposed, in which the student adopts a multi-headed neural network architecture to access epistemic uncertainty estimations, as in [19]. Later, [12] further studied the state novelty-based idea of [10] to devise a better student-initiated advice collection method. Specifically, they made the RND module update only on the states that are involved in advice collection. That way, the student is ensured to benefit from the teacher regardless of its own knowledge about the states, which tackles the special case of belated inclusion of the teacher. More recently, [13] studied the advice reuse idea in deep RL. In this work, an imitation model of the teacher is constructed partially with the collected advice. Furthermore, Dropout regularisation [21] is incorporated in this model to enable it to make uncertainty-aware decisions when reusing the self-generated advice.

Among these studies in deep RL, only [19] and [13] investigated the concept of advice reuse. Our paper differs from them in several ways. The idea in [19] requires using models capable of uncertainty estimation in the student's RL algorithm. Moreover, the RL algorithm also needs to be modified to include the LfD loss terms. In contrast, the algorithm we propose does not require the student to have any specific RL models or loss functions. Thus, the student agent can be treated as a black box, which lets our method be applied to a wider range of agent types. [13] performs advice reuse as a separate module, as described here. However, it relies on previously proposed advice collection strategies instead of taking advantage of its own imitation module to manage the advice collection process via uncertainty. Furthermore, some of the hyperparameters in [13] limit its practical applicability due to being difficult to tune, which we also address in this work.

III Background

III-A Reinforcement Learning and Deep Q-Networks

Reinforcement Learning (RL) [22] is a trial-and-error learning paradigm for decision-making problems in which an agent learns to accomplish a task. The interaction with the environment is commonly formalised as a Markov Decision Process (MDP). An MDP is defined by the tuple $\langle\mathcal{S},\mathcal{A},R,\mathcal{T},\gamma\rangle$ where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of available actions, $R(s,a)$ is the reward function, $\mathcal{T}(s'\mid s,a)$ defines the transition probabilities, and $\gamma\in[0,1]$ is the discount factor. At every timestep $t$, the agent applies an action $a_t$ in state $s_t$ to advance to the next state $s_{t+1}$ while receiving a reward $r_t$. These actions are determined by the agent's policy $\pi\colon\mathcal{S}\rightarrow\mathcal{A}$, and RL's goal is to learn a policy $\pi_\theta$ (with parameters $\theta$) that maximises the cumulative discounted sum of rewards $\sum_{t=0}^{\infty}\gamma^{t}r_{t}$. There are various approaches in the RL literature to achieve this. For instance, the well-known Q-learning algorithm does so by learning the state-action values $Q(s,a)$ via the Bellman equations [22] and making the agent follow the policy $\pi(s)=\operatorname*{argmax}_{a}Q(s,a)$.
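
To make the update rule concrete, the following is a minimal tabular Q-learning sketch; the Gymnasium-style environment interface and the step size `alpha` are illustrative assumptions, not part of the paper.

```python
import numpy as np

# Minimal tabular Q-learning episode (sketch; assumes a Gymnasium-style
# environment with discrete states/actions and Q as a 2-D numpy array).
def q_learning_episode(env, Q, alpha=0.1, gamma=0.99, epsilon=0.1):
    s, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behaviour policy
        a = env.action_space.sample() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
        s_next, r, terminated, truncated, _ = env.step(a)
        done = terminated or truncated
        # Bellman backup towards r + gamma * max_a' Q(s', a')
        target = r + (0.0 if terminated else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])
        s = s_next
    return Q
```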

In recent years, RL algorithms have been studied extensively in a branch referred to as deep RL, which deals with non-tabular state spaces with the aid of non-linear function approximation. Deep Q-Networks (DQN) [23] is a substantial algorithm among these, serving as a strong baseline in domains with discrete actions. In this off-policy algorithm, $Q(s,a)$ values are approximated with a deep neural network with weights $\theta$. By using the transitions stored in a replay buffer, $\theta$ is learned by minimising the loss $(r_{k+1}+\gamma\max_{a'}Q_{\bar{\theta}}(s_{k+1},a')-Q_{\theta}(s_{k},a_{k}))^{2}$ with stochastic gradient descent, where $\bar{\theta}$ stands for a periodically updated copy of $\theta$. This is the target network trick that DQN incorporates to battle the RL-induced non-stationarity in approximator learning. Another critical component is the replay buffer, which lets the agent save samples to learn from over a long course while also breaking the non-i.i.d. property of sequential collection. DQN's success has led it to be studied and enhanced further over the years; the most prominent of these enhancements are summarised in Rainbow DQN [24]. Among them, we employ Dueling Networks [25] and Double Q-learning [26] in this paper.
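
As an illustration of the loss described above, here is a minimal PyTorch sketch of one DQN update on a sampled minibatch; the networks, optimiser and replay buffer sampling are assumed to exist elsewhere.

```python
import torch
import torch.nn.functional as F

# One DQN update on a sampled minibatch (sketch). Implements the squared
# target-network loss given above; q_net, target_net, optimizer and the
# replay buffer sampling are assumed to exist elsewhere.
def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = batch                      # tensors from the replay buffer
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values  # Double DQN would instead index
                                                       # the target net with the online
                                                       # network's argmax action
        target = r + gamma * (1.0 - done) * q_next
    loss = F.mse_loss(q_sa, target)                    # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```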

III-B Action Advising

Action advising [9] is a peer-to-peer knowledge exchange framework that requires only a common set of actions and a communication protocol between the peers. In this framework, a learning peer (student) receives advice in the form of actions from a more knowledgeable peer (teacher) to accelerate its learning. These advice actions are generated directly from the teacher's decision-making policy; therefore, it is important for the student and the teacher to share the same goal in the task at hand. A key property that distinguishes this approach from similar frameworks such as Policy Reuse [8] is the notion of a budget constraint. To account for realistic scenarios where it is not possible to exchange information without restriction, the number of interactions in this framework is limited by a budget. Consequently, algorithms that operate in this problem setup should be capable of determining the most appropriate moments to exchange advice, either from the perspective of the teacher (teacher-initiated), the student (student-initiated) or both (jointly-initiated).

IV Proposed Algorithm

We adopt the MDP formalisation presented in Section III in our problem definition. The setup in this study includes an off-policy deep RL agent (student) with policy $\pi_S$ learning to perform some task $T$ in an environment with a continuous state space and discrete actions. There is also another agent with policy $\pi_T$ (teacher) that is competent in $T$. The teacher is isolated from the environment itself, but is reachable by the student via a communication channel for a limited number of times defined by the advising budget $b$. By using this mechanism, the student can request action advice $a_T=\pi_T(s)$ for its current state $s$. The objective of the student in this problem is to maximise its learning performance in $T$ by timing these interactions to make the most efficient use of $\pi_T$.

Our approach provides a unified solution that addresses both the when to ask for advice and the how to leverage the advice questions. In addition to the RL algorithm, the student is equipped with a neural network $G_\omega$ with weights $\omega$ that are not shared with the RL model in any way. The student also has a transitions buffer $D$ with no capacity limit that holds the collected state-advice pairs. By using the samples in $D$, $G_\omega$ is trained periodically to provide the student with an up-to-date imitation model of $\pi_T$, making it possible to reuse the previously provided advice. Moreover, $G_\omega$ is also used to determine what advice to collect, by being regarded as a representation of $D$'s contents. Obviously, making these decisions requires $G_\omega$ to have a form of awareness of what it has been trained on (in terms of samples). Therefore, $G_\omega$ employs Dropout regularisation in its fully-connected layers to obtain an estimation of epistemic uncertainty, denoted by $G_\omega^\mu(s)$ for any state $s$, as is done in [21][13]. None of these components share anything with or require access to the student's RL algorithm. This is especially advantageous when it comes to pairing our approach with different RL methods.
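
For concreteness, the snippet below sketches how the epistemic uncertainty $G_\omega^\mu(s)$ can be estimated with Monte Carlo Dropout in PyTorch, as in [21]; keeping Dropout active at inference time is standard, while averaging the per-action predictive variance into a single scalar is our assumed aggregation, and the imitation network object is hypothetical.

```python
import torch

# Epistemic uncertainty of the imitation network G_omega for a state (sketch):
# keep Dropout active at inference time and measure the variance of the
# predicted action distribution across stochastic forward passes.
def mc_dropout_uncertainty(imitation_net, state, n_passes=100):
    imitation_net.train()  # keep Dropout layers stochastic
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(imitation_net(state.unsqueeze(0)), dim=1)
            for _ in range(n_passes)
        ])                                   # shape: (n_passes, 1, n_actions)
    mean_probs = probs.mean(dim=0)           # averaged policy, used for the argmax advice
    uncertainty = probs.var(dim=0).mean()    # scalar G^mu(s): mean predictive variance
    return uncertainty.item(), mean_probs.squeeze(0)
```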

At the beginning of the student's learning process, $\omega$ is initialised randomly and $D=\emptyset$. Then, at every timestep $t$ in state $s_t$, the student goes through the 3 stages of our algorithm: Collection, Imitation, Reuse. The remainder of this section describes these stages, with line number references to the complete flow of our algorithm summarised in Algorithm 1.

The collection stage (lines 13-19) remains active from the beginning until the student runs out of its advising budget $b$. At this step, the student attempts to collect advice if its current state has not been advised before. This is determined by the value of $G_\omega^\mu(s_t)$: if it is higher than the uncertainty threshold $\tau$ (which is set automatically in the imitation stage), it is decided that $s_t$ has not been advised before and the student proceeds with requesting advice. However, if $\tau$ is undetermined, the request is carried out without performing any uncertainty check.

The imitation module is responsible for training $G_\omega$ and tuning $\tau$ accordingly. This stage (lines 20-24) is always active, but it is only triggered when either of the following conditions, checked at every timestep $t$, is met: the student has collected $n_{min}$ new samples in $D$ since the last imitation, or the student has taken $t_{min}$ steps since the last imitation with at least $n_{min}/2$ new samples in $D$. Here, $n_{min}$ and $t_{min}$ are hyperparameters. They are set to keep the number of imitation processes within a reasonable range while also ensuring that $G_\omega$ remains up-to-date with the collected advice. On one hand, if $G_\omega$ were updated for every new state-advice pair, it would be a very accurate model of $D$'s contents, but the total training time would be a significant computational burden. On the other hand, if $G_\omega$ were updated infrequently, it would not cause any computational setbacks; however, it would not be a good representation of the collected advice either.
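
The trigger logic can be summarised in a few lines of Python; this is a sketch of the two conditions above, with variable names taken from the text and default values taken from Section V-B.

```python
# Imitation-trigger check performed at every timestep (sketch of the two
# conditions described above; default values are the ones used in Section V-B).
def should_trigger_imitation(n_collected, n_last, t, t_last, n_min=2500, t_min=50000):
    new_samples = n_collected - n_last
    cond_samples = new_samples >= n_min                              # enough new advice since last imitation
    cond_time = (t - t_last) >= t_min and new_samples >= n_min / 2   # enough time passed with some new advice
    return cond_samples or cond_time
```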

Once imitation is triggered, $G_\omega$ is trained for $k_{init}$ iterations (if it is the first ever training; otherwise, for $k_{periodic}$ iterations) with minibatches of samples drawn randomly from $D$. This process resembles the simplest form of behavioural cloning, where the supervised negative log-likelihood loss $\mathcal{L}(\omega)=-\sum_{(s,a)\in D}\log G_\omega(a\mid s)$ is minimised. Afterwards, $\tau$ is updated automatically to be compatible with the new state of this imitation network. This is done by measuring $G_\omega^\mu(s_i)$ and storing it in a set $U$ for each $s_i$ in $D$ that satisfies $a_i=\operatorname*{argmax}_{a}G_\omega(a\mid s_i)$. Then, the uncertainty value that corresponds to the $p^{th}$ percentile (a hyperparameter) of the ascending-order sorted $U$ is assigned to $\tau$. We do this to pick a threshold $\tau$ such that, when $G_\omega^\mu$ is compared with $\tau$, $G_\omega$ considers the samples it classifies correctly as "known" while leaving out a small portion that are likely to be outliers. This approach could be developed further by also considering true-positive and false-positive rates; however, we opted for a simpler approach in this study.
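
Below is a minimal sketch of the automatic threshold tuning step; `uncertainty_fn` stands for the Monte Carlo Dropout estimate sketched earlier, and iterating over the full buffer rather than a subsample is a simplification.

```python
import numpy as np

# Automatic tuning of the uncertainty threshold tau after (re)training G_omega
# (sketch). For every stored state whose advice G_omega reproduces correctly,
# record the model's uncertainty; tau is the p-th percentile of these values.
def tune_threshold(imitation_net, dataset, uncertainty_fn, percentile=90):
    correct_uncertainties = []
    for state, advised_action in dataset:              # dataset D of (state, advice) pairs
        uncertainty, mean_probs = uncertainty_fn(imitation_net, state)
        if int(mean_probs.argmax()) == advised_action:
            correct_uncertainties.append(uncertainty)
    if not correct_uncertainties:
        return None                                    # keep tau undetermined
    return float(np.percentile(correct_uncertainties, percentile))
```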

Finally, the reuse stage (lines 25-27) handles the execution of the imitated advice whenever appropriate, to aid the student in efficient exploration. It becomes active as soon as the imitation model $G_\omega$ is trained for the first time. Then, whenever $G_\omega^\mu(s_t)<\tau$ (i.e. $G_\omega$ is familiar with $s_t$), no advice was collected at $t$ and reusing is enabled for the current episode, the student executes the imitated advice $\operatorname*{argmax}_{a}G_\omega(a\mid s_t)$. Unlike [13], we do not limit advice reusing to the exploration stage of learning, e.g. the period in which $\epsilon$ is annealed to its final value in $\epsilon$-greedy. Instead, we define a reuse schedule that is independent of the underlying RL algorithm's exploration strategy. At the beginning of each episode, the agent enables the reuse module with probability $\rho$ (set to $\rho_{init}$ initially). This value is decayed until it reaches its final value $\rho_{final}$ over $t_\rho$ steps, similarly to $\epsilon$-greedy annealing. This further eliminates the dependency of our algorithm on the RL algorithm's exploration strategy.
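
A linear decay analogous to $\epsilon$-greedy annealing can be implemented as below; the step boundaries shown are the ones we use in Section V and are otherwise arbitrary.

```python
import random

# Linearly decaying advice-reuse probability rho, independent of the RL
# exploration schedule (sketch; step boundaries follow Section V-B).
def reuse_probability(t, rho_init=0.5, rho_final=0.1, t_start=500_000, t_end=2_000_000):
    if t <= t_start:
        return rho_init
    if t >= t_end:
        return rho_final
    frac = (t - t_start) / (t_end - t_start)
    return rho_init + frac * (rho_final - rho_init)

# At the start of each episode:
# reuse_enabled = random.random() < reuse_probability(t)
```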

Algorithm 1 Learning on a Budget via Teacher Imitation
1:  Input: action advising budget $b$, student policy $\pi_S$, teacher policy $\pi_T$, number of training iterations $t_{max}$, initial reuse probability $\rho_{init}$, final reuse probability $\rho_{final}$, $\rho$ decaying steps $t_\rho$, imitation network $G_\omega$, number of imitation training iterations $k_{init}$ and $k_{periodic}$, number of new samples and steps to trigger imitation $n_{min}$ and $t_{min}$
2:  $D\leftarrow\emptyset$ ▷ initialise empty state-advice buffer
3:  reuse_enabled $\leftarrow$ False ▷ disable reuse by default
4:  $\tau\leftarrow$ None ▷ no valid threshold
5:  $\rho\leftarrow\rho_{init}$ ▷ set reuse probability to its initial value
6:  $n_{last}\leftarrow 0$; $t_{last}\leftarrow 0$
7:  for training steps $t\in\{1,2,\ldots,t_{max}\}$ do
8:     if $Env$ is reset then
9:        Set reuse_enabled $\leftarrow$ True with probability $\rho$
10:       Get observation $s_t\sim Env$
11:    end if
12:    $a_t\leftarrow$ None ▷ action is not determined yet
       ▷ Collection
13:    if reuse_enabled is True and $b>0$ then
14:       if $G_\omega$ is not trained or $G_\omega^\mu(s_t)>\tau$ then
15:          $a_t\leftarrow\pi_T(s_t)$ ▷ collect advice
16:          Add $\langle s_t,a_t\rangle$ to $D$
17:          $b\leftarrow b-1$ ▷ decrease the budget
18:       end if
19:    end if
       ▷ Imitation
20:    if $|D|-n_{last}\geqslant n_{min}$ or ($|D|-n_{last}\geqslant n_{min}/2$ and $t-t_{last}\geqslant t_{min}$) then
21:       Train $G_\omega$ with $D$ for $k_{init}$ or $k_{periodic}$ iterations
22:       $n_{last}\leftarrow|D|$; $t_{last}\leftarrow t$
23:       Determine $\tau$ as described in Section IV
24:    end if
       ▷ Reuse
25:    if reuse_enabled is True and $a_t$ is None and $G_\omega$ is trained and $G_\omega^\mu(s_t)<\tau$ then
26:       $a_t\leftarrow\operatorname*{argmax}_{a}G_\omega(a\mid s_t)$
27:    end if
28:    Decay $\rho$ w.r.t. the pre-defined schedule if $\rho>\rho_{final}$
29:    if $a_t$ is None then
30:       $a_t\leftarrow\pi_S(s_t)$ ▷ self policy
31:    end if
32:    Execute $a_t$ and obtain $r_t$, $s_{t+1}\sim Env$
33:    Update the RL algorithm, e.g. DQN
34:    $s_t\leftarrow s_{t+1}$
35: end for
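
For readers who prefer executable pseudocode, the sketch below renders one timestep of Algorithm 1 in Python, reusing the helper functions from the earlier sketches; the `student`, `teacher` and `imitation` objects, their method names and the gym-style environment interface are all illustrative assumptions.

```python
# One timestep of the collection / imitation / reuse loop (sketch of Algorithm 1).
# `st` is a bookkeeping dict with keys: b, D, tau, reuse_enabled, n_last, t_last.
# reuse_enabled is (re)sampled at episode resets using reuse_probability(t).
def air_step(t, s_t, env, student, teacher, imitation, st):
    a_t = None

    # Collection (Algorithm 1, lines 13-19)
    if st["reuse_enabled"] and st["b"] > 0:
        unc = mc_dropout_uncertainty(imitation.net, s_t)[0] if imitation.trained else None
        if not imitation.trained or st["tau"] is None or unc > st["tau"]:
            a_t = teacher.act(s_t)                 # collect advice
            st["D"].append((s_t, a_t))
            st["b"] -= 1

    # Imitation (lines 20-24)
    if should_trigger_imitation(len(st["D"]), st["n_last"], t, st["t_last"]):
        imitation.train_on(st["D"])                # behavioural cloning on D
        st["n_last"], st["t_last"] = len(st["D"]), t
        st["tau"] = tune_threshold(imitation.net, st["D"], mc_dropout_uncertainty)

    # Reuse (lines 25-27)
    if a_t is None and st["reuse_enabled"] and imitation.trained:
        unc, mean_probs = mc_dropout_uncertainty(imitation.net, s_t)
        if st["tau"] is not None and unc < st["tau"]:
            a_t = int(mean_probs.argmax())         # execute imitated advice

    if a_t is None:
        a_t = student.act(s_t)                     # self policy

    s_next, r, done, info = env.step(a_t)          # gym-style step
    student.update(s_t, a_t, r, s_next, done)      # e.g. DQN update
    return s_next, r, done
```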

V Experimental Setup

We designed our experiments to answer the following questions about our proposal:

  • How does our automatic threshold tuning perform against manually-set thresholds in terms of reuse accuracy and learning performance?

  • How much does using the advice imitation model to drive the advice collection process help with collecting a more diverse state-advice dataset?

  • Does collecting a dataset with more diverse samples make any significant impact on the learning performance?

  • How much does each particular modification contribute to the final performance?

In the remainder of this section, we first describe our evaluation domain. Then, we provide the details of our experimental process along with the most substantial implementation details. The code for our experiments can be found at https://github.com/ercumentilhan/advice-imitation-reuse.

V-A Evaluation Domain

Figure 1: Example RGB frames (observations) taken from the Arcade Learning Environment games Enduro (a), Freeway (b), Pong (c), Q*bert (d) and Seaquest (e).

In order to have an amount of difficulty adequate for modern deep RL algorithms, we chose the widely experimented Arcade Learning Environment (ALE) [27], which contains more than 60 Atari games, as our testbed. We picked 5 well-known games among these, namely Enduro, Freeway, Pong, Q*bert and Seaquest, which involve different mechanics and present various learning challenges. In each of these games, the agent receives observations as RGB images with a size of 160×210×3. These observations are converted into 84×84×1 grayscale images to reduce the representational complexity, and the 4 most recent frames are stacked to alleviate the effect of partial observability. Furthermore, since these games are originally processed at high frame-per-second rates with very little difference between two consecutive frames, agent actions are repeated for 4 frames by skipping frames. Consequently, the final 84×84×4 sized observations the agent gets are built from the most recent 16 game frames. The range of rewards in these games is also different and unbounded; therefore, to facilitate the stability of learning, rewards are clipped to [-1,1] before they are provided to the agent. The game episodes are limited to last for at most 108k frames (27k agent steps).
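
The preprocessing pipeline described above can be sketched as a thin wrapper around a gym-style ALE environment; the OpenCV-based resizing and the exact wrapper interface are implementation choices rather than part of the paper.

```python
import collections
import numpy as np
import cv2

# Observation preprocessing (sketch): grayscale + downscale each RGB frame,
# repeat every action for 4 emulator frames, stack the 4 most recent processed
# frames, and clip rewards to [-1, 1].
class AtariPreprocessor:
    def __init__(self, env, frame_skip=4, stack=4, size=84):
        self.env, self.frame_skip, self.size = env, frame_skip, size
        self.frames = collections.deque(maxlen=stack)

    def _process(self, rgb_frame):
        gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (self.size, self.size), interpolation=cv2.INTER_AREA)

    def reset(self):
        frame = self._process(self.env.reset())
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames, axis=-1)           # size x size x 4

    def step(self, action):
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.frame_skip):                # action repeat / frame skip
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.frames.append(self._process(obs))
        clipped = float(np.clip(total_reward, -1.0, 1.0))
        return np.stack(self.frames, axis=-1), clipped, done, info
```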

V-B Settings and Procedure

We experiment with an extensive set of agents to be able to determine the most beneficial enhancements included in our algorithm. The student agent variants we compare in our experiments are as follows:

  • No Advising (NA): No form of action advising is employed; the agent relies on its RL algorithm only.

  • Early Advising (EA): The student asks for advice greedily until its budget runs out. There is no further utilisation of advice beyond its execution at the time of collection. This is a simple yet well-performing heuristic.

  • Random Advising (RA): The student asks for advice randomly with 0.5 probability. This heuristic uses the intuition that spacing out requests may yield more diverse and information-rich advice.

  • EA + Advice Reuse (AR): The agent employs the previously proposed advice reuse approach [13]. Advice is collected with the early advising strategy, and the teacher is imitated from this advice. Then, advice is reused in place of the random exploration actions in approximately 0.5 of the episodes.

  • AR + Automatic Threshold Tuning (AR+A): AR is combined with our automatic threshold tuning technique.

  • AR+A + Extended Reuse (AR+A+E): AR is combined with both our automatic threshold tuning technique and the extended reusing scheme.

  • Advice Imitation & Reuse (AIR): This agent mode incorporates all of our proposed enhancements (as detailed in Section IV). On top of AR+A+E, this mode also uses the imitation module’s uncertainty to drive the advice collection process instead of relying on early advising.

We test the agents in learning sessions of 5M steps (equal to 20M game frames due to frame skipping) with an advising budget of 25k, which corresponds to only 0.5% of the total number of steps in a session. At every 50k-th step, the agents are evaluated in a separate set of 10 episodes with their action advising and exploration mechanisms disabled. The cumulative rewards obtained in these episodes are averaged and recorded as the evaluation score for the corresponding learning session step. This lets us measure the actual learning progress of the agents as the main performance metric.
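
The evaluation protocol amounts to the following sketch; the `agent.act(s, greedy=True)` interface (which disables exploration and advising) is a hypothetical convenience, not an API from our implementation.

```python
import numpy as np

# Periodic evaluation (sketch): run 10 separate episodes with exploration and
# action advising disabled, and record the average episode return as the score.
def evaluate(agent, make_env, n_episodes=10):
    returns = []
    for _ in range(n_episodes):
        env = make_env()
        s, done, ep_return = env.reset(), False, 0.0
        while not done:
            a = agent.act(s, greedy=True)   # greedy policy, no advising/reuse
            s, r, done, _ = env.step(a)
            ep_return += r                  # un-clipped game score
        returns.append(ep_return)
    return float(np.mean(returns))
```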

The deep RL algorithm of the student agents is Double DQN with a neural network comprised of 3 convolutional layers (32 8×8 filters with a stride of 4, followed by 64 4×4 filters with a stride of 2, followed by 64 3×3 filters with a stride of 1) and fully-connected layers with a single hidden layer (512 units) and dueling stream outputs. For exploration, an $\epsilon$-greedy strategy with linearly decaying $\epsilon$ is adopted. The teacher agents were generated separately for each of the games prior to the experiments, using the identical DQN algorithm and structure as the student. Even though the resulting agents are not necessarily at the super-human levels achievable by DQN, they have competent policies that achieve evaluation scores of 1556, 28.8, 12, 3705 and 8178 in Enduro, Freeway, Pong, Q*bert and Seaquest, respectively.
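
The student's network can be written down as the following PyTorch sketch; the flattened convolution size assumes 84×84 inputs, and whether the dueling streams share the 512-unit hidden layer is an implementation detail we fix arbitrarily here.

```python
import torch
import torch.nn as nn

# Dueling DQN architecture described above (sketch): 3 convolutional layers
# followed by a 512-unit hidden layer and dueling value/advantage streams.
class DuelingDQN(nn.Module):
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        conv_out = 64 * 7 * 7                       # for 84x84 inputs
        self.hidden = nn.Sequential(nn.Linear(conv_out, 512), nn.ReLU())
        self.value = nn.Linear(512, 1)              # state-value stream
        self.advantage = nn.Linear(512, n_actions)  # advantage stream

    def forward(self, x):
        h = self.hidden(self.conv(x / 255.0))       # assumes uint8-scaled frame stacks
        v, adv = self.value(h), self.advantage(h)
        return v + adv - adv.mean(dim=1, keepdim=True)  # dueling aggregation
```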

As described in Section IV, our approach requires the student to be equipped with an additional behavioural cloning module that includes a neural network. We used a neural network structure identical to the student's DQN model, except for the dueling streams. The fully-connected layers of this network are enhanced with Dropout regularisation with a dropout rate of 0.35, and the number of forward passes used to measure the epistemic uncertainty via variance is set to 100.

The uncertainty threshold for AR is set to 0.01 for every game. Determining a reasonable value for this parameter requires brief access to the tasks, which we performed prior to the experiments; even though this is not reflected in the numerical results, it should be noted that this is a critical disadvantage of AR. The automatic threshold tuning percentile used in AR+A, AR+A+E and AIR is set to 90. This is a very straightforward hyperparameter to adjust compared to the (manual) uncertainty threshold itself and can potentially be valid in a wide variety of tasks. For the extended reuse scheme in AR+A+E and AIR, we set $\rho_{init}$ and $\rho_{final}$ to 0.5 and 0.1, respectively. We defined the annealing schedule to begin at the 500k-th step and last until the 2M-th step. For the imitation triggering conditions in AIR, $n_{min}$ is set to 2.5k (samples) and $t_{min}$ is set to 50k (timesteps). Finally, the number of imitation network training iterations is set to 200k for the initial training (applies to all modes but NA, EA and RA) and 50k for the periodic ones (only applies to AIR).
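
For reference, the advising-related hyperparameters listed in this section (together with the imitation network settings above) can be gathered into a single configuration; the key names are illustrative, and the values are kept the same across all five games.

```python
# Advising-component hyperparameters as listed above (sketch; key names are illustrative).
ADVISING_CONFIG = {
    "uncertainty_threshold_manual": 0.01,   # tau for AR (hand-tuned)
    "threshold_percentile": 90,             # p for AR+A, AR+A+E, AIR
    "rho_init": 0.5,                        # initial reuse probability
    "rho_final": 0.1,                       # final reuse probability
    "rho_decay_start": 500_000,             # step where rho starts annealing
    "rho_decay_end": 2_000_000,             # step where rho reaches rho_final
    "n_min": 2_500,                         # new samples needed to trigger imitation (AIR)
    "t_min": 50_000,                        # steps needed to trigger imitation (AIR)
    "imitation_iters_init": 200_000,        # initial imitation training iterations
    "imitation_iters_periodic": 50_000,     # periodic imitation training iterations (AIR)
    "dropout_rate": 0.35,                   # Dropout rate in the imitation network
    "uncertainty_forward_passes": 100,      # Monte Carlo Dropout forward passes
}
```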

All of the aforementioned hyperparameters reported in this section are set empirically prior to the experiments and are kept the same across every game. The most significant ones among the unmentioned hyperparameters of the student’s learning components are presented in Table I.

TABLE I: Hyperparameters of the student's DQN and imitation module (for AR, AR+A, AR+A+E, AIR).
DQN hyperparameters:
  Discount factor $\gamma$: 0.99
  Learning rate: $625\times 10^{-7}$
  Minibatch size: 32
  Replay memory min. size and capacity: 50k, 500k
  Target network update period: 7500
  $\epsilon$ initial, $\epsilon$ final, $\epsilon$ decay steps: 1.0, 0.01, 500k
Imitation module hyperparameters:
  Learning rate: 0.0001
  Minibatch size: 32

VI Results

Figure 2: Evaluation scores obtained by the agent modes NA, EA, RA, AR, AR+A, AR+A+E, AIR in the ALE games Enduro, Freeway, Pong, Q*bert, Seaquest. Shaded areas show the standard deviation across 3 runs.
Figure 3: Top and middle rows show the number of advice reuses and advice collections (for the first 5M and 150k steps, respectively) per 100 steps performed by the relevant student modes. The legend for colours is the same as in Figure 2 and the shaded areas show the standard deviation across 3 different runs. Bottom row shows the values of $\tau$ across a single learning session, with different timestep scopes determined by the length of AIR's advice collection stage. Blue represents AIR, the dashed grey lines represent AR, and the dashed red lines represent AR+A and AR+A+E.

Figure 4: UMAP [28] embeddings of the advice-collected states in Seaquest by AIR (blue) vs. EA (red) in (a) and AIR (blue) vs. RA (pink) in (b), where the samples in common are shown in grey. Areas denoted with larger circles are the outliers covered only by either AIR (blue), EA (red) or RA (pink).
TABLE II: Final evaluation scores, percentages of advice reuses in the total environment steps, and advice reuse accuracies achieved by NA, EA, RA, AR, AR+A, AR+A+E, AIR (+ signs are omitted) in 5 ALE games, aggregated over 3 runs. The standard deviations across runs are indicated with ±. The best score and best reuse accuracy per game are marked with *.

Game / Mode   Final Evaluation Score   Reuse Ratio (%)   Reuse Accuracy (%)
Enduro
  NA      1066.34 ± 37.3
  EA      1131.13 ± 69.0
  RA      1170.60 ± 19.0
  AR      1127.72 ± 26.9      0.49 ± 0.03      70.06 ± 0.6
  ARA     1117.93 ± 59.0      0.45 ± 0.04      72.36 ± 0.3
  ARAE    1102.44 ± 62.4      4.97 ± 0.44      75.24 ± 0.5*
  AIR     1184.02 ± 19.6*     4.79 ± 0.44      73.63 ± 0.2
Freeway
  NA      31.98 ± 0.1
  EA      32.09 ± 0.1
  RA      32.02 ± 0.2
  AR      32.06 ± 0.1         1.60 ± 0.11      93.02 ± 0.3
  ARA     32.26 ± 0.1*        1.53 ± 0.10      93.48 ± 0.1
  ARAE    32.14 ± 0.0         15.76 ± 0.08     95.43 ± 0.1*
  AIR     32.14 ± 0.1         17.89 ± 0.47     95.39 ± 0.2
Pong
  NA      0.95 ± 2.4
  EA      3.73 ± 4.9
  RA      11.48 ± 0.2
  AR      9.41 ± 3.5          0.52 ± 0.02      79.80 ± 0.5
  ARA     10.48 ± 0.4         0.93 ± 0.01      75.07 ± 0.7
  ARAE    11.21 ± 1.2         10.4 ± 0.59      79.00 ± 0.1
  AIR     11.81 ± 1.2*        9.98 ± 0.36      80.46 ± 0.8*
Q*bert
  NA      3154.91 ± 408.9
  EA      2277.98 ± 300.5
  RA      2528.70 ± 505.4
  AR      2434.70 ± 54.8      0.93 ± 0.04      80.37 ± 0.6
  ARA     2359.39 ± 371.9     0.93 ± 0.02      78.86 ± 0.7
  ARAE    3763.72 ± 340.3     14.57 ± 0.25     92.87 ± 0.7*
  AIR     3814.34 ± 134.6*    15.16 ± 0.21     92.83 ± 0.2
Seaquest
  NA      4496.41 ± 1101.0
  EA      6538.18 ± 1445.1
  RA      5033.04 ± 1413.3
  AR      8053.93 ± 935.9     0.61 ± 0.03      72.96 ± 0.8
  ARA     7851.03 ± 556.7     0.56 ± 0.03      76.29 ± 0.7*
  ARAE    8082.69 ± 1105.2    5.93 ± 0.49      74.18 ± 1.5
  AIR     8614.04 ± 268.7*    4.79 ± 0.20      74.87 ± 0.5

The results of our experiments in Enduro, Freeway, Pong, Q*bert and Seaquest are presented in several plots that let us analyse the performance of the student modes NA, EA, RA, AR, AR+A, AR+A+E and AIR in different aspects. Figure 2 contains the evaluation score plots. Figure 3 shows the number of advice reuses per 100 steps (top row) performed by AR, AR+A, AR+A+E and AIR; the number of advice collections per 100 steps (middle row) performed by EA, RA, AR, AR+A, AR+A+E and AIR; and the values of the uncertainty threshold $\tau$ used by AR, AR+A, AR+A+E and AIR in a single run (bottom row). The shaded areas in these plots show the standard deviation across 3 runs. The final evaluation scores, the percentage of advice reuses in the total number of environment steps, as well as their accuracies, are presented in Table II. Finally, in order to highlight the differences in the advice-collected state diversity of EA (identical collection strategy to AR, AR+A, AR+A+E), RA and AIR, these states from a single Seaquest run are visualised with the aid of the UMAP [28] dimensionality reduction technique in Figure 4. Here, the scatter plot on the left compares AIR (blue) vs. EA (red) and the one on the right compares AIR (blue) vs. RA (pink); the samples collected in common are shown in grey.

We first analyse the learning performances via the evaluation scores. In Enduro, all methods but NA have very similar learning speeds and final scores, with AIR being slightly ahead of the rest. In Freeway, all modes achieve nearly the same final scores, but they are distinguishable by small differences in learning speed, where AR+A+E is on top, followed by the other advanced student modes AIR, AR and AR+A. When we move to Pong, Q*bert and Seaquest, the student mode performances finally become more distinctive. Even though the basic heuristics (RA and EA) show that a small amount of advice from a competent policy can substantially boost learning, these modes fall behind those that employ advice reuse and fail to be a reliable choice, i.e. performing worse than NA in Freeway and Q*bert. Overall, the best mode AIR and the runner-up AR+A+E are ahead of all, with AR and AR+A following them.

Among the advice reuse approaches, the most beneficial modification is the extended reuse schedule (+E), as highlighted by the difference between AR+A and AR+A+E. Defining such a schedule independently from the student's RL exploration strategy involves some extra hyperparameters; nevertheless, they are rather trivial to set. The trends in the advice reuse plots (Figure 3, top row) show how these schedules differ. The versions with +E (AIR and AR+A+E) yield around 10× more reusing, which apparently plays an important role in the performance improvement. However, it is still not clear how to define the optimal reuse schedule.

Automatic threshold tuning (+A) also performs comparably, if not better, than the manual tuning approach (AR), as we can observe in the evaluation scores. Additionally, AR+A managed to achieve very similar reuse accuracies to AR, which also supports its success. When we examine the $\tau$ values, we see that AR+A determined values close to the hand-tuned ones, except in Pong where the difference is more significant. This is reflected in the reuse trend and the evaluation performance, giving AR+A a very advantageous head start. These results support the idea that +A is a far more preferable approach, considering how problematic it can be to tune the sensitive $\tau$ threshold manually. For instance, if these methods were to be deployed as-is in significantly different domains, we could potentially see AR+A coming far ahead of AR with a poorly tuned $\tau$. Furthermore, the periodic imitation model updates incorporated in AIR make +A an essential component. In the bottom row of Figure 3, we also show how AIR changes its $\tau$ values over time as it collects more advice samples and updates its imitation model accordingly. Clearly, it would be very difficult to manage these changes manually.

Finally, we also see that collecting advice by utilising the imitation model's uncertainty (as done by AIR) contributes to the agent's learning. When we look at the advice collection plots, we see 3 different types of behaviour: early collection (EA, AR, AR+A, AR+A+E), random collection (RA), and AIR. Even though these mostly occur in the same time windows, AIR collects in an uncertainty-aware fashion; hence its decreasing collection rate over time. Freeway is the case in which AIR is very selective. This is possibly due to the fact that in Freeway the agent can traverse only a limited space, which consequently reduces the diversity of the acquired observations. Nonetheless, this is not reflected as dramatically in the evaluation scores, due to this game being rather trivial to solve. Another interesting observation comes from analysing the advice-collected states in Seaquest by AIR, EA (which is identical to AR, AR+A, AR+A+E in terms of collection strategy) and RA in a reduced dimensionality, as seen in Figure 4. We chose Seaquest since it is the game where AIR is significantly ahead of AR+A+E, which can be credited to AIR's only difference from it (the collection strategy). We also include RA here mainly because it can potentially do better than EA in acquiring different samples. Here, the large circles denote the outliers (diverse samples) that are covered only by either AIR, EA or RA. These are the important bits to pay attention to and compare. As can be seen, AIR yields larger coverage, i.e. a more diverse dataset of advice, against both the EA and RA collection strategies.

VII Conclusions and Future Work

In this study, we proposed an automatic threshold tuning technique, an extended advice reuse schedule and an imitation model uncertainty-based advice collection procedure, extending the previously proposed advice reuse algorithm. We also developed a combined approach incorporating these components, which is able to collect a diverse set of advice to build a more widely applicable advice imitation model for advice reuse.

The experiments in 5 different Atari games from the ALE domain have shown that our enhancements provide significant improvements over the baseline advice reuse method as well as the basic action advising heuristics. First, tuning the uncertainty thresholds on-the-fly was observed to match the learning performance of carefully tuned thresholds, which require unrealistic access to the tasks and extra adjustment effort. Secondly, we found that having the advice reuse process span a larger portion of the learning session, rather than just the steps that involve random exploration, can yield superior performance. However, defining the best schedule for maximum advice utilisation efficiency remains an open question. Thirdly, the uncertainty-driven advice collection method was found to be a successful way to improve the imitation module's dataset diversity. Nevertheless, the periodic training process could be improved with better incremental learning techniques to make better use of this simultaneous collection-imitation idea. Finally, our unified algorithm demonstrated state-of-the-art performance across the 5 Atari games by performing either on par with or better than its closest competitors.

Future extensions of this work can involve experimenting with more principled Policy Reuse approaches from the literature to further improve the advice reuse strategy. Furthermore, it would be worthwhile to make the teacher imitation better at learning online from the new samples it acquires. Finally, even though a core motivation of our approach is not to access or modify the agent's RL components, it would be beneficial to investigate Learning from Demonstrations techniques and their possible contributions to our framework.

Acknowledgment

This research utilised Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT. http://doi.org/10.5281/zenodo.438045

References

  • [1] Oriol Vinyals et al. “Grandmaster level in StarCraft II using multi-agent reinforcement learning” In Nature 575.7782, 2019, pp. 350–354
  • [2] Julian Schrittwieser et al. “Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model” In CoRR abs/1911.08265, 2019
  • [3] OpenAI et al. “Solving Rubik’s Cube with a Robot Hand” In CoRR abs/1910.07113, 2019
  • [4] Adrien Ali Taïga, William Fedus, Marlos C. Machado, Aaron C. Courville and Marc G. Bellemare “Benchmarking Bonus-Based Exploration Methods on the Arcade Learning Environment” In CoRR abs/1908.02388, 2019
  • [5] Dean Pomerleau “Efficient Training of Artificial Neural Networks for Autonomous Navigation” In Neural Com. 3.1, 1991, pp. 88–97
  • [6] Stefan Schaal “Learning from Demonstration” In Advances in Neural Information Processing Systems 9, NIPS, Denver, CO, USA, December 2-5, 1996 MIT Press, 1996, pp. 1040–1046
  • [7] Todd Hester et al. “Deep Q-learning From Demonstrations” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18) AAAI Press, 2018, pp. 3223–3230
  • [8] Fernando Fernández and Manuela M. Veloso “Probabilistic policy reuse in a reinforcement learning agent” In 5th International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2006), Hakodate, Japan, May 8-12, 2006 ACM, 2006, pp. 720–727
  • [9] Lisa Torrey and Matthew E. Taylor “Teaching on a budget: agents advising agents in reinforcement learning” In International conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, Saint Paul, MN, USA, May 6-10, 2013, 2013, pp. 1053–1060
  • [10] Ercüment Ilhan, Jeremy Gow and Diego Pérez-Liébana “Teaching on a Budget in Multi-Agent Deep Reinforcement Learning” In IEEE Conference on Games, CoG 2019, London, United Kingdom, August 20-23, 2019, 2019, pp. 1–8
  • [11] Felipe Leno Silva, Pablo Hernandez-Leal, Bilal Kartal and Matthew E. Taylor “Uncertainty-Aware Action Advising for Deep Reinforcement Learning Agents” In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020 AAAI Press, 2020, pp. 5792–5799
  • [12] Ercument Ilhan, Jeremy Gow and Diego Perez-Liebana “Student-Initiated Action Advising via Advice Novelty”, 2021 arXiv:2010.00381
  • [13] Ercüment Ilhan, Jeremy Gow and Diego Perez-Liebana “Action Advising with Advice Imitation in Deep Reinforcement Learning” In Proceedings of the 20th Conference on Autonomous Agents and Multi-Agent Systems, AAMAS 2021, May 3-7, 2021 IFAAMAS, 2021
  • [14] Matthew E. Taylor, Nicholas Carboni, Anestis Fachantidis, Ioannis P. Vlahavas and Lisa Torrey “Reinforcement learning agents providing advice in complex video games” In Connect. Sci. 26.1, 2014, pp. 45–63
  • [15] Matthieu Zimmer, Paolo Viappiani and Paul Weng “Teacher-Student Framework: A Reinforcement Learning Approach” In AAMAS Workshop Autonomous Robots and Multirobot Systems, 2014
  • [16] Ofra Amir, Ece Kamar, Andrey Kolobov and Barbara J. Grosz “Interactive Teaching Strategies for Agent Training” In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016 IJCAI/AAAI Press, 2016, pp. 804–811
  • [17] Felipe Leno Silva, Ruben Glatt and Anna Helena Reali Costa “Simultaneously Learning and Advising in Multiagent Reinforcement Learning” In Proceedings of the 16th Conference on Autonomous Agents and Multi-Agent Systems, AAMAS 2017 ACM, 2017, pp. 1100–1108
  • [18] Changxi Zhu, Yi Cai, Ho-fung Leung and Shuyue Hu “Learning by Reusing Previous Advice in Teacher-Student Paradigm” In Proceedings of the 19th International Conference on Autonomous Agents and Multiagent Systems, AAMAS ’20 International Foundation for Autonomous Agents and Multiagent Systems, 2020, pp. 1674–1682
  • [19] Si-An Chen, Voot Tangkaratt, Hsuan-Tien Lin and Masashi Sugiyama “Active Deep Q-learning with Demonstration” In CoRR abs/1812.02632, 2018
  • [20] Yuri Burda, Harrison Edwards, Amos J. Storkey and Oleg Klimov “Exploration by Random Network Distillation” In CoRR abs/1810.12894, 2018
  • [21] Yarin Gal and Zoubin Ghahramani “Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning” In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016 48, 2016, pp. 1050–1059
  • [22] Richard S Sutton and Andrew G Barto “Reinforcement learning: An introduction” MIT press, 2018
  • [23] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin A. Riedmiller “Playing Atari with Deep Reinforcement Learning” In CoRR abs/1312.5602, 2013
  • [24] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Gheshlaghi Azar and David Silver “Rainbow: Combining Improvements in Deep Reinforcement Learning” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18) AAAI Press, 2018, pp. 3215–3222
  • [25] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot and Nando de Freitas “Dueling Network Architectures for Deep Reinforcement Learning” In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016 48, JMLR Workshop and Conference Proceedings, 2016, pp. 1995–2003
  • [26] Hado van Hasselt, Arthur Guez and David Silver “Deep Reinforcement Learning with Double Q-Learning” In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence AAAI Press, 2016, pp. 2094–2100
  • [27] Marc G. Bellemare, Yavar Naddaf, Joel Veness and Michael Bowling “The Arcade Learning Environment: An Evaluation Platform for General Agents” In J. Artif. Intell. Res. 47, 2013, pp. 253–279
  • [28] Leland McInnes and John Healy “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction” In CoRR abs/1802.03426, 2018