
Convergence of a Human-in-the-Loop Policy-Gradient Algorithm With Eligibility Trace Under Reward, Policy, and Advantage Feedback

Ishaan Shah    David Halpern    Kavosh Asadi    Michael L. Littman
Abstract

Fluid human–agent communication is essential for the future of human-in-the-loop reinforcement learning. An agent must respond appropriately to feedback from its human trainer even before they have significant experience working together. Therefore, it is important that learning agents respond well to various feedback schemes human trainers are likely to provide. This work analyzes the COnvergent Actor–Critic by Humans (COACH) algorithm under three different types of feedback—policy feedback, reward feedback, and advantage feedback. For these three feedback types, we find that COACH can behave sub-optimally. We propose a variant of COACH, episodic COACH (E-COACH), which we prove converges for all three types. We compare our COACH variant with two other reinforcement-learning algorithms: Q-learning and TAMER.

Reinforcement Learning, ICML, Machine Learning, Convergence, COACH

1 Introduction

We study the algorithm COACH (MacGlashan et al., 2017a), designed to learn from evaluative feedback. We would like for the algorithm to find an optimal policy under different feedback schemes, since a human trainer is apt to select from several possible approaches and we do not know which will be chosen a priori.

We present a proof of convergence for three natural feedback schemes. 1) Feedback can take the form of an economic incentive in which the learner gets an immediate reward for moving into a state based on the state’s desirability—one-step reward. 2) Feedback can be a binary signal that tells the learner whether the action it took was correct (1) or not (0) with respect to the trainer’s intended policy. And, 3) feedback can reveal how good an action was relative to the agent’s recent behavior—the action’s advantage. It is desirable for a learning algorithm to perform appropriately in all three of these settings.

E-COACH (Algorithm 1) is such a learning algorithm. It takes as input a policy \pi_{\theta}, a discount factor \gamma, and a learning rate \alpha.

Algorithm 1 E-COACH \langle\pi_{\theta},\gamma,\alpha\rangle
  \theta_{0}\leftarrow 0
  for episode =0,1,2,\ldots do
     e_{0}\leftarrow 0
     for t=0,1,2,\ldots do
        a_{t}\sim\pi_{\theta}(s_{t},\cdot)
        observe state s_{t+1} and human feedback f_{t+1}
        e_{t+1}\leftarrow e_{t}+\frac{1}{\pi_{\theta}(s_{t},a_{t})}\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})
        \theta_{t+1}\leftarrow\theta_{t}+\alpha\gamma^{t}e_{t+1}f_{t+1}
     end for
  end for

E-COACH (Episodic COACH) is a close variant of the original COACH with a few differences. 1) It explicitly keeps track of the number of steps t elapsed in the current episode. 2) E-COACH's most notable difference from COACH is its use of a \gamma^{t} decay factor, which emphasizes information from temporally closer decisions over more distant ones. 3) In addition, E-COACH does not use a \lambda parameter to decay the eligibility trace e_{t}; its treatment of eligibility traces corresponds to setting \lambda=1 in the original COACH algorithm.
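To make the updates concrete, the following is a minimal sketch of Algorithm 1 for a tabular softmax policy, for which \frac{1}{\pi}\nabla_{\theta}\pi=\nabla_{\theta}\log\pi. The softmax parameterization, the env interface (reset/step), and the human_feedback function are illustrative assumptions, not part of the algorithm itself.

import numpy as np

def softmax_probs(theta, s):
    # Action probabilities of a tabular softmax policy pi_theta(s, .)
    prefs = theta[s] - theta[s].max()          # subtract max for numerical stability
    exp_prefs = np.exp(prefs)
    return exp_prefs / exp_prefs.sum()

def grad_log_pi(theta, s, a):
    # grad_theta log pi_theta(s, a) = (1 / pi_theta(s, a)) grad_theta pi_theta(s, a) for the softmax policy
    grad = np.zeros_like(theta)
    grad[s, a] += 1.0
    grad[s] -= softmax_probs(theta, s)
    return grad

def e_coach_episode(theta, env, human_feedback, gamma=0.95, alpha=0.01):
    # One episode of E-COACH (Algorithm 1) with a |S| x |A| parameter table theta.
    e = np.zeros_like(theta)                               # eligibility trace e_0 <- 0
    s, t, done = env.reset(), 0, False
    while not done:
        a = np.random.choice(theta.shape[1], p=softmax_probs(theta, s))  # a_t ~ pi_theta(s_t, .)
        s_next, done = env.step(a)                         # observe s_{t+1}
        f = human_feedback(s, a)                           # human feedback f_{t+1}
        e = e + grad_log_pi(theta, s, a)                   # e_{t+1} <- e_t + (1/pi) grad pi
        theta = theta + alpha * (gamma ** t) * f * e       # theta_{t+1} <- theta_t + alpha gamma^t e_{t+1} f_{t+1}
        s, t = s_next, t + 1
    return theta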

We propose E-COACH instead of COACH because COACH does not make use of the discount factor \gamma^{t}. This omission causes COACH to incorrectly estimate the expected return and, as a result, to perform poorly on some environments. We provide an example of such a scenario in Section 6.1. In contrast to COACH, we show that E-COACH converges under all three feedback schemes described above.

2 Background

A Markov Decision Process (MDP) is a five-tuple \langle S,A,T,R,\gamma\rangle. Here, S is a set of reachable states, A is the set of actions an agent might take, T(s^{\prime}\,|\,s,a) is the probability that the agent moves to state s^{\prime} from state s after taking action a, R(s,a) is the reward obtained for taking action a in state s, and \gamma\in[0,1) is a discount factor indicating the importance of immediate rewards as opposed to rewards received in the distant future.

A stochastic policy \pi_{\theta}:S\times A\rightarrow[0,1], where \sum_{a\in A}\pi_{\theta}(s,a)=1\ \forall\,s\in S, defines an agent's behavior via \pi_{\theta}(s,a)=\mathbb{P}\{a_{t}=a\,|\,s_{t}=s,\theta\},\ \forall\,s\in S,a\in A. Note that \theta is a vector parameter of the policy, and we assume that \pi is differentiable with respect to this parameter. For brevity, we write \pi_{\theta}(s,a) as \pi(s,a) when the parameter vector is clear from context.

The value functions Q^{\pi} and V^{\pi} measure the performance of policy \pi:

Q^{\pi}(s,a)=\mathbb{E}\big[\sum_{k=1,2,\ldots}\gamma^{k-1}r_{t+k}\,|\,s_{t}=s,a_{t}=a,\pi\big]

and

V^{\pi}(s)=\mathbb{E}_{a\sim\pi(s,\cdot)}[Q^{\pi}(s,a)\,|\,s,\pi].

A policy \pi^{*} is optimal when V^{\pi^{*}}(s)=\max_{\pi}V^{\pi}(s),\ \forall\,s\in S. We denote optimal policies as \pi^{*} and use the shorthand V^{*}(s)=V^{\pi^{*}}(s). Also, for the sake of brevity, we write \mathbb{E}[\,\cdot\,|\,\pi] simply as \mathbb{E}[\cdot] from now on; all expectations we consider are conditioned on the policy. Unless specified otherwise, \mathbb{E}[\cdot] is an expectation over s_{1},a_{1},s_{2},a_{2},\ldots where s_{t+1}\sim T(\cdot\,|\,s_{t},a_{t}) and a_{t+1}\sim\pi(s_{t+1},\cdot).

3 E-COACH Under Reward Feedback

A simple form of feedback a trainer may choose to give a learner is the one-step reward obtained from the MDP for the action the agent just took. Such reward feedback is convenient because it is myopic and does not require the trainer to consider future rewards. It assumes a direct analogy between the rewards that define the task and the feedback provided by the trainer—it is the simplest extension of standard reinforcement learning to the interactive setting. For the following theoretical results to hold, we assume the human trainer gives feedback consistent with the definition of the feedback function f, which is redefined for each setting in Sections 3, 4.2, and 5.

We consider an MDP M=\langle S,A,R,T,\gamma\rangle. Under reward feedback, when an agent takes an action a in state s, the trainer gives feedback

f(s,a)=R(s,a).

Theorem 3: E-COACH converges under reward feedback f(s,a)=R(s,a),\ \forall\,(s,a)\in S\times A.

Proof: Consider the sequence of updates on e_{t} and \theta_{t} at each time step t:

e_{t+1}\leftarrow e_{t}+\frac{1}{\pi_{\theta}(s_{t},a_{t})}\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})
\theta_{t+1}\leftarrow\theta_{t}+\gamma^{t}e_{t+1}r_{t+1}

where, for brevity, we define r_{t+1}=R(s_{t},a_{t}).

To better understand what the updates mean, consider some terminal time L. The value L may refer to the time at which the agent reaches the goal or a pre-decided time at which the trainer stops the agent. We use L only for purposes of exposition; the analysis below extends to the infinite-horizon case in which L is unbounded. We omit the learning rate \alpha in the \theta update above for clarity; by linearity of the updates, it is trivial to incorporate \alpha into the calculations below.

\theta_{L+1}=\sum_{\tau=0}^{L}\gamma^{\tau}e_{\tau+1}r_{\tau+1}
=\sum_{\tau=0}^{L}\gamma^{\tau}r_{\tau+1}\Big(\sum_{t=0}^{\tau}\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}\Big)
=\sum_{\tau=0}^{L}\sum_{t=0}^{\tau}\gamma^{\tau}r_{\tau+1}\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}

Rearranging the order of summation,

\theta_{L+1}=\sum_{t=0}^{L}\sum_{\tau=t}^{L}\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}\gamma^{\tau}r_{\tau+1}
=\sum_{t=0}^{L}\gamma^{t}\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}\Big(\sum_{\tau=0}^{L-t}\gamma^{\tau}r_{\tau+t+1}\Big)

Taking the expectation,

\mathbb{E}[\theta_{L+1}]=\sum_{t=0}^{L}\gamma^{t}\,\mathbb{E}\Big[\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}Q^{\pi}(s_{t},a_{t})\Big]

3.1 E-COACH Objective Function

In this section, we show that \sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}\big[\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}Q^{\pi}(s_{t},a_{t})\big] is the gradient of the objective function on which REINFORCE performs gradient ascent. Note that this quantity is what E-COACH estimates (via the accumulated parameter \theta_{L+1}) and performs gradient ascent on.

Consider the REINFORCE algorithm (Algorithm 2). Here, G_{t} is a Monte Carlo estimate of Q^{\pi}(s_{t},a_{t}), and \nabla_{\theta}\log\pi_{\theta}(s_{t},a_{t})=\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}, so the two algorithms apply the same per-step update in expectation. Hence, at any terminal time L, the expected value of \theta_{L} obtained by REINFORCE equals that of E-COACH.

Algorithm 2 REINFORCE \langle\pi_{\theta},\gamma,\alpha\rangle
  Generate an episode s_{0},a_{0},r_{1},\ldots,s_{L},a_{L},r_{L+1}
  for t=0,1,2,\ldots do
     G_{t}\leftarrow return from step t
     \theta_{t+1}\leftarrow\theta_{t}+\gamma^{t}G_{t}\nabla_{\theta}\log(\pi_{\theta}(s_{t},a_{t}))
  end for
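For comparison, here is a minimal sketch of Algorithm 2 operating on a recorded episode, reusing the tabular softmax helper grad_log_pi from the E-COACH sketch above; the list-based episode format is an illustrative assumption, and the learning rate, omitted in the pseudocode for clarity, is included here. The backward pass computes G_{t}, the discounted return from step t, and the \gamma^{t} factor in the update mirrors E-COACH's decay.

def reinforce_episode(theta, states, actions, rewards, gamma=0.95, alpha=0.01):
    # One REINFORCE pass over an episode s_0, a_0, r_1, ..., s_L, a_L, r_{L+1};
    # rewards[t] holds r_{t+1}, the reward received after (states[t], actions[t]).
    L = len(states)
    returns = [0.0] * L
    g = 0.0
    for t in reversed(range(L)):                  # G_t = r_{t+1} + gamma * G_{t+1}
        g = rewards[t] + gamma * g
        returns[t] = g
    for t in range(L):
        grad = grad_log_pi(theta, states[t], actions[t])
        theta = theta + alpha * (gamma ** t) * returns[t] * grad   # gamma^t G_t grad log pi
    return theta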

Consider the unnormalised state visitation distribution d^{\pi}, defined for all s\in S by d^{\pi}(s)=\mathbb{P}_{0}^{\pi}(s)+\gamma\mathbb{P}_{1}^{\pi}(s)+\ldots+\gamma^{i}\mathbb{P}_{i}^{\pi}(s)+\ldots, where \mathbb{P}_{t}^{\pi}(s) denotes the probability of arriving in state s at time t following policy \pi. The objective to maximize, as described in the Policy Gradient Theorem (Sutton et al., 2000), is

\rho^{\pi}=\sum_{s}d^{\pi}(s)\sum_{a}\pi(s,a)R(s,a),

and its gradient is

\nabla_{\theta}\rho^{\pi}=\sum_{s}d^{\pi}(s)\sum_{a}\nabla_{\theta}\pi(s,a)Q^{\pi}(s,a).

The gradient can be rewritten as \nabla_{\theta}\rho^{\pi}=\mathbb{E}_{s\sim d^{\pi},a\sim\pi(s,\cdot)}\big[\frac{\nabla_{\theta}\pi_{\theta}(s,a)}{\pi_{\theta}(s,a)}Q^{\pi}(s,a)\big].

Expanding with respect to d^{\pi}(s) yields

\mathbb{E}_{s\sim d^{\pi},a\sim\pi(s,\cdot)}\Big[\frac{\nabla_{\theta}\pi_{\theta}(s,a)}{\pi_{\theta}(s,a)}Q^{\pi}(s,a)\Big]=\sum_{t=0}^{\infty}\gamma^{t}\,\mathbb{E}_{s_{t}\sim\mathbb{P}^{\pi}_{t},a_{t}\sim\pi_{\theta}(s_{t},\cdot)}\Big[\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}Q^{\pi}(s_{t},a_{t})\Big]

which is exactly what E-COACH and REINFORCE estimate. Since E-COACH performs gradient ascent on the policy-gradient objective, results from Agarwal et al. (2020) imply that E-COACH converges to local optima or saddle points. Moreover, recent work has shown that policy-gradient methods can escape saddle points under mild assumptions on the rewards and minor modifications to existing algorithms (Zhang et al., 2020; Jin et al., 2017).
\Box

Note, however, that behavioral evidence indicates that one-step reward is not a typical choice of human trainers (Ho et al., 2019).

4 E-COACH Under Policy Feedback

To argue that E-COACH converges under policy feedback, we first consider a more general form of feedback and then show policy feedback is a special case.

4.1 E-COACH with a More General Type of Feedback

Let us start by considering two similar MDPs M_{1}=\langle S,A,R,T,\gamma\rangle and M_{2}=\langle S,A,f,T,\gamma\rangle. Note the differing reward functions R and f in the two MDPs.

We denote the value functions for M_{1} and M_{2} as V_{1} and V_{2}, respectively, and take s_{0} to be the starting state for both MDPs. Define V_{1}^{\min}=\min_{\pi\in\Pi}V_{1}^{\pi}(s_{0}) and V_{1}^{*}=\max_{\pi\in\Pi}V_{1}^{\pi}(s_{0}).

The following theorem relies on two assumptions:

  1. E-COACH (Algorithm 1) returns a policy \pi_{2}(s,a) such that \mathbb{E}_{s\sim d^{\pi_{2}^{*}}}\big[\sum_{a}\big|\pi_{2}^{*}(s,a)-\pi_{2}(s,a)\big|\big]\leq\delta for some optimal policy \pi_{2}^{*} on the domain M_{2}. The proof in Section 3 supports this assumption by showing that E-COACH optimizes the policy-gradient objective. Note that \pi_{2}^{*} need not be the only optimal policy; it is simply one optimal policy.

  2. We also assume that \gamma\neq 1 in the case where the MDP has an infinite horizon, which we will justify later on.

Theorem 4.1 requires the condition that all optimal policies for M_{2} are also optimal for M_{1}. We later show, in Theorem 4.2, that this condition holds for policy feedback, allowing us to leverage these results.

Theorem 4.1: If all optimal policies for M_{2} are also optimal for M_{1} (the optimal policies of M_{2} are a subset of those for M_{1}), then running E-COACH on M_{2} results in a policy that is close to an optimal policy on M_{2}, and hence also close in value to an optimal policy for M_{1}. Define W=\max(|V_{1}^{*}|,|V_{1}^{\min}|). Then,

0\leq V_{1}^{*}-V_{1}^{\pi_{2}}\leq W\delta

Proof: We must show that running E-COACH in M_{2} yields a policy that is not far from optimal for M_{1}. That is, we run E-COACH on M_{2}, using the alternate form of feedback as the reward function, and we want any good policy returned by E-COACH on M_{2} (in the sense of assumption 1) to also be good on M_{1}, the original MDP we are trying to solve.

The lower-bound in the theorem statement is immediate.

For the upper bound, let \pi^{(n)} denote a policy that follows \pi_{2}^{*} for the first n-1 time steps and \pi_{2} for the rest. Hence, on the n^{\textnormal{th}} time step, \pi^{(n)} follows \pi_{2} and not \pi_{2}^{*}. Let V^{(n)} denote the value of policy \pi^{(n)} in M_{1}. We can then write V_{1}^{\pi_{2}}=V^{(0)} and V_{1}^{*}=V^{(\infty)}. Recall that \pi_{2}^{*} is optimal on both M_{1} and M_{2} by the condition above, and thus has value V_{1}^{*} on M_{1}.

We start by considering V^{(t)}-V^{(t-1)}. Both \pi^{(t)} and \pi^{(t-1)} accumulate the same expected reward for the first t-2 steps, so these rewards cancel out. Note that the \mathbb{P} used below is the same as that defined in Section 3.1. We find the following:

V^{(t)}-V^{(t-1)}=\gamma^{t-1}\sum_{s}\mathbb{P}_{t-1}^{\pi_{2}^{*}}(s)\sum_{a}\pi_{2}^{*}(s,a)Q^{\pi_{2}}(s,a)-\gamma^{t-1}\sum_{s}\mathbb{P}_{t-1}^{\pi_{2}^{*}}(s)\sum_{a}\pi_{2}(s,a)Q^{\pi_{2}}(s,a)
=\gamma^{t-1}\sum_{s}\mathbb{P}_{t-1}^{\pi_{2}^{*}}(s)\sum_{a}(\pi_{2}^{*}(s,a)-\pi_{2}(s,a))Q^{\pi_{2}}(s,a)
\leq\gamma^{t-1}\sum_{s}\mathbb{P}_{t-1}^{\pi_{2}^{*}}(s)\sum_{a}|\pi_{2}^{*}(s,a)-\pi_{2}(s,a)|\,W

We now use the above fact when considering V_{1}^{*}-V_{1}^{\pi_{2}}.

V_{1}^{*}-V_{1}^{\pi_{2}}=(V^{(1)}-V^{(0)})+(V^{(2)}-V^{(1)})+(V^{(3)}-V^{(2)})+\cdots
\leq\sum_{i=0}^{\infty}\gamma^{i}\sum_{s}\mathbb{P}_{i}^{\pi^{*}_{2}}(s)\sum_{a}|\pi_{2}^{*}(s,a)-\pi_{2}(s,a)|\,W
=\sum_{s}\sum_{i=0}^{\infty}\gamma^{i}\mathbb{P}_{i}^{\pi^{*}_{2}}(s)\sum_{a}|\pi_{2}^{*}(s,a)-\pi_{2}(s,a)|\,W
=\sum_{s}d^{\pi_{2}^{*}}(s)\sum_{a}|\pi_{2}^{*}(s,a)-\pi_{2}(s,a)|\,W
=W\,\mathbb{E}_{s\sim d^{\pi_{2}^{*}}}\big[\sum_{a}|\pi_{2}^{*}(s,a)-\pi_{2}(s,a)|\big]
\leq W\delta

As \delta\rightarrow 0, we have that V_{1}^{*}-V_{1}^{\pi_{2}}\rightarrow 0.
\Box

Note that our theorem says something different from the Simulation Lemma (Kearns & Singh, 1998), as we make no assumptions about how close the reward functions of M_{1} and M_{2} are. Instead, our theorem requires that optimal policies in M_{2} be optimal in M_{1}, and it bounds the return of a policy learnt by E-COACH in M_{2}.

4.2 E-COACH Under Policy Feedback

Let M_{1}=\langle S,A,R,T,\gamma\rangle be an MDP without any specific reward function. Under policy feedback, a trainer has a target stationary deterministic policy \pi^{*}_{1} in mind and delivers feedback based on whether the agent's decision agrees with \pi^{*}_{1}. When an agent takes an action a in state s, the trainer gives feedback

f(s,a)=I(s,a),

with I(s,a) defined as

I(s,a)=\begin{cases}1,&\textnormal{if }\pi^{*}_{1}(s)=a,\\ 0,&\textnormal{otherwise.}\end{cases}

Theorem 4.2: E-COACH converges under feedback f(s,a)=I(s,a),\ \forall\,(s,a)\in S\times A.

Proof: Consider replacing the reward function R(s,a) with I(s,a) in MDP M_{1}, constructing a new MDP M_{2}=\langle S,A,f,T,\gamma\rangle. We would like to show that, in this setting, the E-COACH algorithm converges to the optimal solution. We argue next that M_{1} and M_{2} satisfy the prerequisite for Theorem 4.1.

Consider an optimal policy for M_{2}. Such a policy selects, in every state, the action that receives feedback of one, so V^{*}_{2}(s_{0})=\sum_{i=0}^{\infty}1\cdot\gamma^{i}. (This is where assumption 2 is used: with an infinite horizon, the series converges because \gamma<1.) To see that no other policy can do better, suppose a policy achieves V_{2}^{\prime}(s_{0})=\sum_{i=0}^{\infty}t(i)\cdot 1\cdot\gamma^{i}, where t(i)\in\{0,1\}\ \forall\,i and t(j)=0 for some j. Take the smallest value k\in\mathbb{N} such that t(k)=0; then V^{*}_{2}(s_{0})-V_{2}^{\prime}(s_{0})\geq\gamma^{k}, so the policy achieving V_{2}^{\prime} is sub-optimal. We conclude that V^{*}_{2}(s_{0}) is the value of the optimal policy, and therefore the policy that always chooses the action yielding feedback of one is optimal. By the construction of f(s,a), always choosing the action that results in feedback of one corresponds exactly to the decision of \pi_{1}^{*}. So we obtain \pi_{1}^{*}(s,a)=\pi_{2}^{*}(s,a),\ \forall\,(s,a)\in S\times A. In other words, an optimal policy in the new domain is equivalent to the target policy from the original one.
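For reference, the two quantities compared in this argument can be written as

V^{*}_{2}(s_{0})=\sum_{i=0}^{\infty}\gamma^{i}=\frac{1}{1-\gamma},\qquad V^{*}_{2}(s_{0})-V_{2}^{\prime}(s_{0})=\sum_{i:\,t(i)=0}\gamma^{i}\geq\gamma^{k},

where k is the first time step at which the policy achieving V_{2}^{\prime} deviates from the feedback-one action.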

We can leverage Theorem 4.1 to show that the algorithm converges under policy feedback.
\Box

5 E-COACH Under Advantage Feedback

COACH (MacGlashan et al., 2017a) was originally motivated by the observation that human feedback tends to be policy dependent—if a decision improves over the agent's recent decisions, trainers provide positive feedback; if it is worse, the trainer is more likely to provide negative feedback. As such, feedback is well modeled by the advantage function of the agent's current policy.

In advantage feedback, when an agent takes action a in state s, the trainer gives feedback

f(s,a)=A^{\pi}(s,a),

with A^{\pi}(s,a) defined as

A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s).
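For intuition, a simulated trainer that gives advantage feedback can be written directly from this definition. A minimal sketch, assuming Q and pi are |S| x |A| arrays holding Q^{\pi} and the current policy (these array representations are illustrative assumptions):

import numpy as np

def advantage_feedback(Q, pi, s, a):
    # f(s, a) = A^pi(s, a) = Q^pi(s, a) - V^pi(s), with V^pi(s) = sum_a pi(s, a) Q^pi(s, a)
    v = float(np.dot(pi[s], Q[s]))    # V^pi(s): pi-weighted average of Q^pi(s, .)
    return Q[s, a] - v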

Theorem 5: E-COACH converges under feedback f(s,a)=A^{\pi}(s,a),\ \forall\,(s,a)\in S\times A.

Proof: Since V^{\pi}(s)=\mathbb{E}_{a\sim\pi(s,\cdot)}[Q^{\pi}(s,a)\,|\,s], we have that

\mathbb{E}_{a\sim\pi(s,\cdot)}[A^{\pi}(s,a)\,|\,s]=\mathbb{E}_{a\sim\pi(s,\cdot)}[Q^{\pi}(s,a)\,|\,s]-V^{\pi}(s)=0

By the equation above, we can say the following:

\mathbb{E}_{s_{t+1},a_{t+1},s_{t+2},\ldots}[A^{\pi}(s_{t+\tau},a_{t+\tau})\,|\,s_{t},a_{t}]=0\quad\forall\,\tau>0 (1)

We also know that

\mathbb{E}_{s_{t+1},a_{t+1},s_{t+2},\ldots}[A^{\pi}(s_{t},a_{t})\,|\,s_{t},a_{t}]=A^{\pi}(s_{t},a_{t}) (2)

We will use equations 1 and 2 later on in this proof.

Using the same approach as in Theorem 3, we look at the sequence of updates made to the policy parameter \theta until some terminal time L.

\theta_{L+1}=\sum_{\tau=0}^{L}\gamma^{\tau}e_{\tau+1}A^{\pi}(s_{\tau},a_{\tau})
=\sum_{\tau=0}^{L}\gamma^{\tau}A^{\pi}(s_{\tau},a_{\tau})\Big(\sum_{t=0}^{\tau}\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}\Big)
=\sum_{\tau=0}^{L}\sum_{t=0}^{\tau}\gamma^{\tau}A^{\pi}(s_{\tau},a_{\tau})\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}

Rearranging the order of summation,

\theta_{L+1}=\sum_{t=0}^{L}\sum_{\tau=t}^{L}\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}\gamma^{\tau}A^{\pi}(s_{\tau},a_{\tau})
=\sum_{t=0}^{L}\gamma^{t}\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}\Big(\sum_{\tau=0}^{L-t}\gamma^{\tau}A^{\pi}(s_{\tau+t},a_{\tau+t})\Big)

Therefore, taking the expectation,

\mathbb{E}[\theta_{L+1}]=\sum_{t=0}^{L}\gamma^{t}\,\mathbb{E}\Big[\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}\Big(\sum_{\tau=0}^{L-t}\gamma^{\tau}A^{\pi}(s_{\tau+t},a_{\tau+t})\Big)\Big]
=\sum_{t=0}^{L}\gamma^{t}\,\mathbb{E}_{s_{t},a_{t}}\Big[\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}\sum_{\tau=0}^{L-t}\gamma^{\tau}\,\mathbb{E}_{s_{t+1},a_{t+1},s_{t+2},\ldots}[A^{\pi}(s_{\tau+t},a_{\tau+t})\,|\,s_{t},a_{t}]\Big]

Using equations 1 and 2, we can say that

\mathbb{E}[\theta_{L+1}]=\sum_{t=0}^{L}\gamma^{t}\,\mathbb{E}\Big[\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}A^{\pi}(s_{t},a_{t})\Big]

Using the fact that \mathbb{E}\big[\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}V^{\pi}(s_{t})\big]=0 (Thomas & Brunskill, 2017),

\mathbb{E}[\theta_{L+1}]=\sum_{t=0}^{L}\gamma^{t}\,\mathbb{E}\Big[\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})}Q^{\pi}(s_{t},a_{t})\Big]

Thus, using the argument in Section 3.1, we can state that E-COACH converges under advantage feedback.
\Box

Figure 1: Performance of E-COACH, TAMER, and Q-learning under each feedback scheme. The domain is a 10-by-10 GridWorld coded using simple_rl (Abel, 2019). Each agent was run for 150 episodes, and each episode was cut short after 1000 steps. We ran 10 instances of each agent and plotted the reward averaged over these instances. E-COACH and Q-learning maximize rewards in all three settings, while TAMER falters under reward feedback and has difficulty with advantage feedback. However, other experimental results show TAMER doing well with advantage feedback; see MacGlashan et al. (2017b). These experimental results are meant to support our proofs of convergence for E-COACH. In addition, they support arguments made in Section 7.

6 Original COACH

This section assesses the convergence of the original COACH algorithm (MacGlashan et al., 2017a) under the three different types of feedback defined in this paper. Recall the main differences between E-COACH and COACH:

  1. COACH makes use of an eligibility decay factor \lambda.

  2. COACH does not discount the feedback by \gamma^{t} as part of the algorithm.

Note that \lambda and \gamma are not interchangeable: \lambda can only be used to decay stored gradients and thereby discount future feedback, whereas \gamma both discounts future rewards and determines the unnormalised state visitation distribution d^{\pi}(s) described in Section 3.1.
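Written side by side, the two per-step updates are as follows (the COACH update is a sketch based only on the two differences listed above; see MacGlashan et al. (2017a) for its exact real-time form):

E-COACH: e_{t+1}\leftarrow e_{t}+\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})},\qquad\theta_{t+1}\leftarrow\theta_{t}+\alpha\,\gamma^{t}f_{t+1}e_{t+1}

COACH: e_{t+1}\leftarrow\lambda e_{t}+\frac{\nabla_{\theta}\pi_{\theta}(s_{t},a_{t})}{\pi_{\theta}(s_{t},a_{t})},\qquad\theta_{t+1}\leftarrow\theta_{t}+\alpha\,f_{t+1}e_{t+1}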

As a result, the original COACH cannot estimate the state visitation distribution; for instance, it weights the updates made at t=0 and at t=10 equally.

This property runs counter to what policy-gradient algorithms do. By not using \gamma, COACH effectively samples from a state visitation distribution different from the distribution d^{\pi}(s) that appears in the objective function \rho described in Section 3.1. As a result, the updates made by COACH do not estimate the gradient of \rho. Although COACH may learn to do reasonably well, we cannot say that it will behave optimally.

6.1 COACH Under One-Step Reward

The algorithm will converge, but the policy it converges to will be suboptimal for \gamma\neq 1 because COACH does not estimate the gradient of the policy-gradient objective, \nabla_{\theta}\rho. Because COACH does not incorporate the discount factor, it behaves as if the domain has \gamma=1, even when that is not the case. If the domain has \gamma\in[0,1), COACH nonetheless takes an undiscounted, long-term view because it ignores the discount factor, and ignoring the discount can lead to suboptimal behavior.

Consider the five-state domain in Figure 2. It shows how the optimal decision in a state can change with the discount factor.
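Concretely, under the rewards described in the caption of Figure 2 (going left yields 4 and then 10, while going right yields 10 immediately), the two choices at the start state compare as

V_{\textnormal{left}}=4+10\gamma,\qquad V_{\textnormal{right}}=10,

so the left action is preferred exactly when 4+10\gamma>10, that is, when \gamma>0.6.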

[Figure 2: a five-state chain; the number in each circle is that state's reward.]
Figure 2: Example of the impact of the discount factor on optimal policies. The number in each circle represents that state's reward. The state in the middle is the starting state. For \gamma\approx 1, the optimal policy is to go left to obtain a value of \approx 14. For \gamma\approx 0, the optimal action is instead to go right, obtaining a value of \approx 10 rather than \approx 4. In general, the left action is preferred for \gamma>0.6 and the right action is preferred for \gamma<0.6. The choice of \gamma impacts the optimal policy.

6.2 Policy Feedback

COACH will converge, but to a poorly performing policy, for the same reason given in Section 6.1.

6.3 Advantage Feedback

COACH should converge under this feedback type, as per the argument given by MacGlashan et al. (2017a).

7 Comparison With Other Algorithms

We now know that E-COACH converges under several types of feedback; the three highlighted in this paper are policy, advantage, and reward feedback. In this section, we compare E-COACH to TAMER (Knox & Stone, 2008) and Q-learning under these three types of feedback.

7.1 TAMER

TAMER expects the human trainer to take each action’s long-term implications into account when providing feedback. TAMER learns the trainer’s feedback function, then returns the policy that maximizes one-step feedback in each state.

Algorithm 3 gives pseudocode for TAMER; see Knox & Stone (2008) for more details. Here, t represents the time, \overrightarrow{w} is the weight vector of the reward model, and \overrightarrow{f_{t-2}} and \overrightarrow{f_{t-1}} are state feature vectors. The algorithm takes as input a learning rate \alpha.

Note that there are several different versions of TAMER. The one we are analyzing is the original by Knox & Stone (2008).

Algorithm 3 TAMER \langle\alpha\rangle
  t\leftarrow 0
  \overrightarrow{w}\leftarrow\overrightarrow{0}
  \overrightarrow{f_{t-2}}\leftarrow\overrightarrow{0}
  \overrightarrow{f_{t-1}}\leftarrow\overrightarrow{0}
  a\leftarrow ChooseAction(s_{t},\overrightarrow{w})
  takeAction(a)
  while true do
     t\leftarrow t+1
     if t\geq 2 then
        r_{t-2}\leftarrow getHumanFeedback()
        if r_{t-2}\neq 0 then
           \overrightarrow{w}\leftarrow UpdateRewModel(r_{t-2},\overrightarrow{f_{t-2}},\overrightarrow{f_{t-1}},\overrightarrow{w},\alpha)
        end if
     end if
     a\leftarrow ChooseAction(s_{t},\overrightarrow{w})
     takeAction(a)
     s_{t}\leftarrow getState()
     \overrightarrow{f_{t-2}}\leftarrow\overrightarrow{f_{t-1}}
     \overrightarrow{f_{t-1}}\leftarrow getFeatureVec(s_{t})
  end while
Algorithm 4 UpdateRewModel \langle r_{t-2},\overrightarrow{f_{t-2}},\overrightarrow{f_{t-1}},\overrightarrow{w},\alpha\rangle
  \overrightarrow{\Delta f_{t-1,t-2}}\leftarrow\overrightarrow{f_{t-1}}-\overrightarrow{f_{t-2}}
  projectedRew_{t-2}\leftarrow\sum_{i}(w_{i}\times\Delta f_{t-1,t-2,i})
  error\leftarrow r_{t-2}-projectedRew_{t-2}
  for i in range(0,length(\overrightarrow{w})) do
     w_{i}\leftarrow w_{i}+\alpha\times error\times\Delta f_{t-1,t-2,i}
  end for
  return \overrightarrow{w}
Algorithm 5 ChooseAction \langle s_{t},\overrightarrow{w}\rangle
  \overrightarrow{f_{t}}\leftarrow getFeatureVec(s_{t})
  for each a\in getAction(s_{t}) do
     s_{t+1,a}\leftarrow T(s_{t},a)
     \overrightarrow{f_{t+1,a}}\leftarrow getFeatureVec(s_{t+1,a})
     \overrightarrow{\Delta f_{t+1,t}}\leftarrow\overrightarrow{f_{t+1,a}}-\overrightarrow{f_{t}}
     projectedRew_{a}\leftarrow\sum_{i}(w_{i}\times\Delta f_{t+1,t,i})
  end for
  return argmax_{a}(projectedRew_{a})
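A minimal sketch of the core of Algorithms 4 and 5 (the linear reward-model update and the greedy action choice); the feature_vec and transition helpers, and the handling of the two-step feedback delay, are illustrative assumptions here:

import numpy as np

def update_rew_model(r, f_old, f_new, w, alpha):
    # Algorithm 4: one gradient step on the linear reward model given delayed human feedback r
    delta_f = f_new - f_old                      # feature change credited with the feedback
    error = r - float(np.dot(w, delta_f))        # feedback minus the model's prediction
    return w + alpha * error * delta_f           # w_i <- w_i + alpha * error * delta_f_i

def choose_action(s, w, actions, transition, feature_vec):
    # Algorithm 5: pick the action whose predicted one-step human feedback is largest
    f_now = feature_vec(s)
    def projected_rew(a):
        s_next = transition(s, a)                # assumes access to a (possibly simulated) model T
        return float(np.dot(w, feature_vec(s_next) - f_now))
    return max(actions, key=projected_rew)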

7.1.1 Reward Feedback

Because TAMER maximizes over the learned feedback function, it results in a poor policy for this form of feedback: it does not take future rewards into account and instead greedily maximizes immediate reward, since it assumes the trainer has already taken future rewards into account. See Figure 1.

7.1.2 Policy Feedback

TAMER expects policy feedback and chooses correct actions assuming sufficient exploration. See Figure 1.

7.1.3 Advantage Feedback

It is not known precisely how TAMER responds to advantage feedback. Knox & Stone (2008) claim that TAMER should work under moving feedback; that is, TAMER should behave properly even when feedback changes over time, because the algorithm expects the human trainer to be inconsistent and continues to update its estimates in the face of changes. The advantage function assigns different values to actions as the policy is updated, so at different times an action receives different feedback values. Assuming TAMER is able to learn this moving function, a greedy one-step policy should be optimal, because the action that maximizes the advantage function is always the optimal action for the given state. See Figure 1.

7.2 Q-learning

Q-learning (Watkins, 1989) is an algorithm that expects feedback in the form of immediate reward and calculates long-term value from these signals. Specifically, Q_{k}(s,a) is its estimate of long-term value and, when it is informed of a transition from s_{t} to s_{t+1} via action a_{t} with feedback f_{t}, it makes the update

Q_{k+1}(s_{t},a_{t})\leftarrow(1-\alpha)Q_{k}(s_{t},a_{t})+\alpha(f_{t}+\gamma V(s_{t+1})),

where V(s_{t+1})=\max_{a}Q_{k}(s_{t+1},a).
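A minimal tabular sketch of this update, treating the trainer's feedback as the reward signal; the |S| x |A| numpy Q-table representation is an illustrative assumption:

def q_learning_update(Q, s, a, f, s_next, alpha=0.1, gamma=0.95):
    # One Q-learning update with human feedback f standing in for the one-step reward;
    # Q is assumed to be a |S| x |A| numpy array.
    v_next = Q[s_next].max()                                       # V(s_{t+1}) = max_a Q(s_{t+1}, a)
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (f + gamma * v_next)
    return Q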

7.2.1 Reward Feedback

Q-learning is typically defined to expect the feedback to be the one-step reward, f_{t}=R(s_{t},a_{t}), or a value whose expectation is R(s_{t},a_{t}). It has been proven to converge to optimal behavior under this type of feedback (Watkins & Dayan, 1992; Littman & Szepesvári, 1996; Singh et al., 2000; Melo, 2001). See Figure 1.

7.2.2 Policy Feedback

Given policy feedback, Q-learning will optimize the expected sum of future "rewards," which, in this case, are indicators of whether the agent's selected actions match the trainer's target policy.

Policy feedback depends on only the previous state and action, and, as such, Q-learning can treat this feedback as a reward function and converge on the behavior that optimizes the sum of these feedbacks.

Interestingly, the policy that optimizes the sum of policy feedbacks is exactly the target policy. This observation follows from the fact that matching the trainer's target policy results in a value of 1+\gamma+\gamma^{2}+\gamma^{3}+\cdots. On the other hand, selecting even a single action that does not match the trainer's target policy removes one of these terms and therefore yields a lower value. Under policy feedback, Q-learning thus converges to the policy that matches the trainer's target policy. See Figure 1.

7.2.3 Advantage Feedback

Q-learning is not designed to work with advantage feedback because the advantage function is policy dependent and can cause its reward signals to change as it updates its value. Nevertheless, advantage feedback does provide a signal for how values should change and, empirically, we often see Q-learning handling advantage feedback well. The analytical challenge is that the changes in the policy influence the reward and the changes in the reward influence the policy, so these two functions need to converge together for Q-learning to handle advantage feedback successfully.

We conjecture that careful annealing of Q-learning's learning rate could provide a mechanism for stabilizing these two adaptive processes; resolving this question is a topic for future work. We believe the work of Konda & Borkar (1999) could provide greater insight. See Figure 1, where Q-learning appears to converge in a simple GridWorld domain.

8 Conclusion

In this paper, we analyzed the convergence of COnvergent Actor-Critic by Humans (MacGlashan et al., 2017a) under three types of feedback—one-step reward, policy, and advantage feedback. These are all examples of feedback a human trainer might give.

We defined a COACH variant called E-COACH and demonstrated its convergence under these types of feedback. The original COACH, unfortunately, does not necessarily converge to an optimal policy under all of the feedback types defined in this paper. In addition, we compared E-COACH with two algorithms: Q-learning and TAMER. TAMER does poorly under one-step-reward feedback, and Q-learning appears to converge to optimal behavior under one-step-reward and policy feedback, but future work is required to determine its performance under advantage feedback.

References

  • Abel (2019) Abel, D. simple_rl: Reproducible reinforcement learning in python. In ICLR Workshop on Reproducibility in Machine Learning, 2019.
  • Agarwal et al. (2020) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. 2020. arXiv:1908.00251v5.
  • Ho et al. (2019) Ho, M. K., Cushman, F., Littman, M. L., and Austerweil, J. L. People teach with rewards and punishments as communication, not reinforcements. Journal of Experimental Psychology: General, pp.  520–549, 2019.
  • Jin et al. (2017) Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. 2017. arXiv:1703.00887v1.
  • Kearns & Singh (1998) Kearns, M. and Singh, S. Near-optimal reinforcement learning in polynomial time. pp.  12–16, 1998.
  • Knox & Stone (2008) Knox, W. B. and Stone, P. TAMER: Training an agent manually via evaluative reinforcement. In 2008 7th IEEE International Conference on Development and Learning, pp.  292–297. IEEE, 2008.
  • Konda & Borkar (1999) Konda, V. R. and Borkar, V. S. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 1999.
  • Littman & Szepesvári (1996) Littman, M. L. and Szepesvári, C. A generalized reinforcement-learning model: Convergence and applications. In Saitta, L. (ed.), Proceedings of the Thirteenth International Conference on Machine Learning, pp.  310–318, 1996.
  • MacGlashan et al. (2017a) MacGlashan, J., Ho, M. K., Loftin, R., Peng, B., Wang, G., Roberts, D. L., Taylor, M. E., and Littman, M. L. Interactive learning from policy-dependent human feedback. In Proceedings of the Thirty-Fourth International Conference on Machine Learning, 2017a.
  • MacGlashan et al. (2017b) MacGlashan, J., Ho, M. K., Loftin, R., Peng, B., Wang, G., Roberts, D. L., Taylor, M. E., and Littman, M. L. Interactive learning from policy-dependent human feedback. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 2285–2294. JMLR.org, 2017b.
  • Melo (2001) Melo, F. S. Convergence of Q-learning: A simple proof. Institute Of Systems and Robotics, Tech. Rep, pp.  1–4, 2001.
  • Singh et al. (2000) Singh, S., Jaakkola, T., Littman, M. L., and Szepesvári, C. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 39:287–308, 2000.
  • Sutton et al. (2000) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Neural Information Processing Systems, pp.  1057–1063, 2000.
  • Thomas & Brunskill (2017) Thomas, P. S. and Brunskill, E. Policy gradient methods for reinforcement learning with function approximation and action-dependent baselines. arXiv:1706.06643v1, 2017.
  • Watkins (1989) Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK, 1989.
  • Watkins & Dayan (1992) Watkins, C. J. C. H. and Dayan, P. Q-learning. Machine Learning, 8(3):279–292, 1992.
  • Zhang et al. (2020) Zhang, K., Koppel, A., Zhu, H., and Basar, T. Global convergence of policy gradient methods to (almost) locally optimal policies. 2020. arXiv:1906.08383v3.