
The Reinforce Policy Gradient Algorithm Revisited

Shalabh Bhatnagar
Department of Computer Science and Automation,
Indian Institute of Science,
Bengaluru 560012, India
shalabh@iisc.ac.in
The author was supported by a J. C. Bose Fellowship, Project No. DFTM/02/3125/M/04/AIR-04 from DRDO under DIA-RCOE, a project from DST-ICPS, and the RBCCPS, IISc.
(October 2023)
Abstract

We revisit the Reinforce policy gradient algorithm from the literature. This algorithm typically works with cost returns obtained over random-length episodes, resulting either from termination upon reaching a goal state (as with episodic tasks) or from instants of visit to a prescribed recurrent state (in the case of continuing tasks). We propose a major enhancement to the basic algorithm. We estimate the policy gradient using a function measurement over a perturbed parameter by appealing to a class of random search approaches. This has advantages in the case of systems with infinite state and action spaces, as it relaxes some of the regularity requirements that would otherwise be needed for proving convergence of the Reinforce algorithm. Nonetheless, we observe that even though we estimate the gradient of the performance objective using the performance objective itself (and not via the sample gradient), the algorithm converges to a neighborhood of a local minimum. We also provide a proof of convergence for this new algorithm.

Index Terms:
Reinforce policy gradient algorithm, smoothed functional technique, stochastic gradient search, stochastic shortest path Markov decision processes.

I Introduction

Policy gradient methods [25, 26] are a popular class of approaches in reinforcement learning. The policy in these approaches is parameterized, and one updates the policy parameter along a gradient search direction, where the gradient is of a performance objective, normally the value function. The policy gradient theorem [26, 20, 10], a fundamental result underlying these approaches, relies on an interchange of the gradient and expectation operators; the policy gradient then turns out to be the expectation of the gradient of noisy performance functions, much like the previously studied perturbation analysis based sensitivity approaches for optimization via simulation [13, 11].

The Reinforce algorithm [27, 25] is a noisy gradient scheme for which the expectation of the gradient estimate is the policy gradient, i.e., the gradient of the expected objective w.r.t. the policy parameters. The policy parameter is, however, updated only once the full return on an episode has been obtained. Actor-critic algorithms [25, 16, 17, 7, 6] have been presented in the literature as alternatives to the Reinforce algorithm, as they perform incremental parameter updates at every instant, but they do so using two-timescale stochastic approximation algorithms.

In this paper, we revisit the Reinforce algorithm and present a new algorithm for the case of episodic tasks or the stochastic shortest path setting. Our algorithm performs parameter updates upon termination of episodes, that is, when goal or terminal states are reached. In this setting, as mentioned above, updates are performed only at instants of visit to a prescribed recurrent state [10, 20]. The algorithm is based on a single function measurement or simulation at a perturbed parameter value, where the perturbations are obtained using independent Gaussian random variates.

Gradient estimation in this algorithm is performed using the smoothed functional (SF) technique [22, 4, 3, 5]. The basic problem in this setting is the following: Given an objective function $J:\mathcal{R}^{d}\rightarrow\mathcal{R}$ such that $J(\theta)=E_{\xi}[h(\theta,\xi)]$, where $\theta\in\mathcal{R}^{d}$ is the parameter to be tuned and $\xi$ is the noise element, the goal is to find $\theta^{*}\in\mathcal{R}^{d}$ such that

J(\theta^{*})=\min_{\theta\in\mathcal{R}^{d}}J(\theta).

Since the objective function $J(\cdot)$ can be highly nonlinear, one often settles for a lesser goal – that of finding a local instead of a global minimum. In this setting, the Kiefer-Wolfowitz [15] finite difference estimates for the gradient of $J$ would correspond to the following: For $i=1,\ldots,d$,

\nabla_{i}J(\theta)=\lim_{0<\delta\downarrow 0}E_{\xi}\left[\frac{h(\theta+\delta e_{i},\xi^{+}_{i})-h(\theta-\delta e_{i},\xi^{-}_{i})}{2\delta}\right],

where $e_{i}$ is the unit vector with 1 in the $i$th place and 0 elsewhere. Further, $\xi^{+}_{i}$ and $\xi^{-}_{i}$, $i=1,\ldots,d$, are independent noise random variables having a common distribution. The expectation above is taken w.r.t. this common distribution on the noise random variables. This approach does not perform well in practice when $d$ is large, since one needs $2d$ function measurements or simulations for a parameter update.
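As an illustration, a minimal sketch of this two-sided finite-difference estimator is given below. The measurement function h(theta, rng), which returns a single noisy sample of $h(\theta,\xi)$, is a hypothetical interface assumed only for the sketch.

```python
import numpy as np

def kiefer_wolfowitz_gradient(h, theta, delta, rng):
    """Two-sided finite-difference estimate of the gradient of J at theta.

    Requires 2*d noisy measurements, one pair h(theta +/- delta*e_i, .) per
    coordinate, which is what makes the scheme expensive for large d.
    """
    d = len(theta)
    grad = np.zeros(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0
        h_plus = h(theta + delta * e_i, rng)   # noisy measurement with xi_i^+
        h_minus = h(theta - delta * e_i, rng)  # noisy measurement with xi_i^-
        grad[i] = (h_plus - h_minus) / (2.0 * delta)
    return grad
```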

Random search methods such as simultaneous perturbation stochastic approximation (SPSA) [23, 24, 2], smoothed functional (SF) [14, 4, 3] or random directions stochastic approximation (RDSA) [18, 21] typically require far fewer simulations. For instance, the gradient-based algorithms in these approaches require only one or two simulations regardless of the parameter dimension $d$, while their Newton-based counterparts usually involve one to four system simulations for any parameter update (again regardless of $d$). A textbook treatment of random search approaches for stochastic optimization is available in [5].

We consider here a one-simulation SF algorithm where the gradient of $J(\theta)$ is estimated using a noisy function measurement at the parameter $\theta+\delta\Delta$, where $\delta>0$ and $\Delta\stackrel{\triangle}{=}(\Delta_{1},\ldots,\Delta_{d})^{T}$ with each $\Delta_{i}\sim N(0,1)$ and $\Delta_{i}$ independent of $\Delta_{j}$, $\forall j\not=i$. Further, $\Delta$ is independent of the measurement noise as well. The gradient estimate in this setting is the following: For $i=1,\ldots,d$,

\nabla_{i}J(\theta)=\lim_{0<\delta\downarrow 0}E_{\xi,\Delta}\left[\Delta_{i}\frac{h(\theta+\delta\Delta,\xi)}{\delta}\right].

In the above, $\xi$ denotes the measurement noise random variable. The expectation above is with respect to the joint distribution of $\xi$ and $\Delta$. Before we proceed further, we present the basic Markov decision process (MDP) setting as well as recall the Reinforce algorithm that we consider for the episodic setting. We remark here that there are not many analyses of Reinforce type algorithms in the literature in the episodic or stochastic shortest path setting.
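For comparison, here is a sketch of the one-measurement SF estimate above: a single noisy measurement at the Gaussian-perturbed point $\theta+\delta\Delta$ yields an estimate of the entire gradient vector. The measurement function h is the same hypothetical interface assumed in the previous sketch.

```python
def sf_gradient_one_measurement(h, theta, delta, rng):
    """One-measurement smoothed functional (SF) gradient estimate.

    A single noisy measurement at theta + delta*Delta, with Delta a vector of
    independent N(0,1) variates, estimates all d partial derivatives at once.
    """
    Delta = rng.standard_normal(len(theta))      # Delta_i ~ N(0,1), independent
    measurement = h(theta + delta * Delta, rng)  # one simulation only
    return Delta * measurement / delta
```

In contrast with the Kiefer-Wolfowitz sketch, the cost per update here does not grow with the dimension $d$.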

II The Basic MDP Setting

By a Markov decision process, we mean a controlled stochastic process $\{X_{n}\}$ whose evolution is governed by an associated control-valued sequence $\{Z_{n}\}$. It is assumed that $X_{n}$, $n\geq 0$, takes values in a set $S$ called the state space. Let $A(s)$ be the set of feasible actions in state $s\in S$ and $A\stackrel{\triangle}{=}\cup_{s\in S}A(s)$ denote the set of all actions. When the state is, say, $s$ and a feasible action $a$ is chosen, the next state is $s^{\prime}$ with probability $p(s^{\prime}|s,a)\stackrel{\triangle}{=}P(X_{n+1}=s^{\prime}\mid X_{n}=s,Z_{n}=a)$, $\forall n$. We assume these probabilities do not depend on $n$. Such a process satisfies the controlled Markov property, i.e.,

P(X_{n+1}=s^{\prime}\mid X_{n},Z_{n},\ldots,X_{0},Z_{0})=p(s^{\prime}\mid X_{n},Z_{n})\mbox{ a.s.}

By an admissible policy or simply a policy, we mean a sequence of functions $\pi=\{\mu_{0},\mu_{1},\mu_{2},\ldots\}$, with $\mu_{i}:S\rightarrow A$, $i\geq 0$, such that $\mu_{i}(s)\in A(s)$, $\forall s\in S$. The policy $\pi$ is a decision rule which specifies that if at instant $k$, the state is $i$, then the action chosen under $\pi$ by the decision maker would be $\mu_{k}(i)$. A stationary policy $\pi$ is one for which $\mu_{k}=\mu_{l}\stackrel{\triangle}{=}\mu$, $\forall k,l=0,1,\ldots$. In other words, under a stationary policy, the function that decides the action choice in a given state does not depend on the instant $n$ at which the action is chosen.

Associated with any transition to a state $s^{\prime}$ from a state $s$ under action $a$ is a ‘single-stage’ cost $g(s,a,s^{\prime})$, where $g:S\times A\times S\rightarrow\mathcal{R}$ is called the cost function. The goal of the decision maker is to select actions $a_{k}$, $k\geq 0$, in response to the system states $s_{k}$, $k\geq 0$, observed one at a time, so as to minimize a long-term cost objective. We assume here that the number of states and actions is finite.

II-A The Episodic or Stochastic Shortest Path Setting

We consider here the episodic or the stochastic shortest path problem where decision making terminates once a goal or terminal state is reached. We let $1,\ldots,p$ denote the set of non-terminal or regular states and $t$ be the terminal state. Thus, $S=\{1,2,\ldots,p,t\}$ denotes the state space for this problem.

Our basic setting here is similar to Chapter 3 of [1], where it is assumed that under any policy there is a positive probability of hitting the goal state $t$ in at most $p$ steps starting from any initial (non-terminal) state, which in turn signifies that the problem terminates in a finite, though random, amount of time.

Under a given policy $\pi$, define

V_{\pi}(s)=E_{\pi}\left[\sum_{k=0}^{T}g(X_{k},\mu_{k}(X_{k}),X_{k+1})\mid X_{0}=s\right], (1)

where $T>0$ is a finite random time at which the process enters the terminal state $t$. Here $E_{\pi}[\cdot]$ indicates that all actions are chosen according to policy $\pi$ depending on the system state at any instant. We assume that there is no action that is feasible in the state $t$ and so the process terminates once it reaches $t$.

Let $\Pi$ denote the set of all admissible policies. The goal here is to find the optimal value function $V^{*}(i)$, $i\in S$, where

V^{*}(i)=\min_{\pi\in\Pi}V_{\pi}(i)=V_{\pi^{*}}(i),\mbox{ }i\in S. (2)

Here $\pi^{*}$ denotes the optimal policy, i.e., the one that minimizes $V_{\pi}(i)$ over all policies $\pi$. A related goal then would be to find the optimal policy $\pi^{*}$. It turns out that in these problems, there exist stationary policies that are optimal, and so it is sufficient to search for an optimal policy within the class of stationary policies.

A stationary policy $\pi$ is called a proper policy (cf. pp. 174 of [1]) if

\hat{p}_{\pi}\stackrel{\triangle}{=}\max_{s=1,\ldots,p}P(X_{p}\not=t\mid X_{0}=s,\pi)<1.

In other words, regardless of the initial state $i$, there is a positive probability of termination after at most $p$ stages when using a proper policy $\pi$, and moreover $P(T<\infty)=1$ under such a policy.

An admissible policy (and so also a stationary policy) can be randomized as well. A randomized admissible policy or simply a randomized policy is a sequence of distributions $\psi=\{\phi_{0},\phi_{1},\ldots\}$ with each $\phi_{i}:S\rightarrow P(A)$. In other words, given a state $s$, a randomized policy would provide a distribution $\phi_{i}(s)=(\phi_{i}(s,a),a\in A(s))$ for the action to be chosen in the $i$th stage. A stationary randomized policy is one for which $\phi_{j}=\phi_{k}\stackrel{\triangle}{=}\phi$, $\forall j,k=0,1,\ldots$. In this case, we simply call $\phi$ a stationary randomized policy. Here and in the rest of the paper, we shall assume that the policies are stationary randomized and are parameterized via a certain parameter $\theta\in C\subset\mathcal{R}^{d}$, a compact and convex set.

We make the following assumption:

Assumption 1

All stationary randomized policies $\phi_{\theta}$ parameterized by $\theta\in C$ are proper.

In practice, one might be able to relax this assumption (as with the model-based analysis of [1]) by assuming that (a) for policies that are not proper, $V_{\pi}(i)=\infty$ for at least one non-terminal state $i$, and (b) there exists a proper policy. The optimal value function satisfies in this case the following Bellman equation: For $s=1,\ldots,p$,

V^{*}(s)=\min_{a\in A(s)}\left(\bar{g}(s,a)+\sum_{j=1}^{p}p(j\mid s,a)V^{*}(j)\right), (3)

where $\bar{g}(s,a)=\sum_{j=1}^{p}p(j|s,a)g(s,a,j)+p(t|s,a)g(s,a,t)$ is the expected single-stage cost in a non-terminal state $s$ when a feasible action $a$ is chosen. It can be shown, see [1], that an optimal stationary proper policy exists.

II-B The Policy Gradient Theorem

Policy gradient methods perform a gradient search within the prescribed class of parameterized policies. Let $\phi_{\theta}(s,a)$ denote the probability of selecting action $a\in A(s)$ when the state is $s\in S$ and the policy parameter is $\theta\in C$. We assume that $\phi_{\theta}(s,a)$ is continuously differentiable in $\theta$. A common example here is the class of parameterized Boltzmann or softmax policies; a sketch of such a parameterization is given below. Let $\phi_{\theta}(s)\stackrel{\triangle}{=}(\phi_{\theta}(s,a),a\in A(s))$, $s\in S$, and $\phi_{\theta}\stackrel{\triangle}{=}(\phi_{\theta}(s),s\in S)$.
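The sketch below shows a minimal softmax (Boltzmann) parameterization over a finite action set. The per state-action feature vectors x(s,a) are an assumption made for the example and are not part of the paper's setup.

```python
import numpy as np

def softmax_policy(theta, features_s):
    """phi_theta(s, .) for a Boltzmann/softmax parameterization.

    features_s: array of shape (|A(s)|, d) whose rows are feature vectors x(s, a).
    The resulting probabilities are continuously differentiable in theta.
    """
    prefs = features_s @ theta        # action preferences theta^T x(s, a)
    prefs = prefs - prefs.max()       # shift for numerical stability
    weights = np.exp(prefs)
    return weights / weights.sum()

def sample_action(theta, features_s, rng):
    """Draw an action index according to phi_theta(s, .)."""
    probs = softmax_policy(theta, features_s)
    return rng.choice(len(probs), p=probs)
```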

We assume that trajectories of states and actions are available either as real data or from a simulation device. Let $G_{k}=\sum_{j=k}^{T-1}g_{j}$ denote the sum of costs until termination (i.e., when a goal state is reached) on a trajectory starting from instant $k$. Note that if all actions are chosen according to a policy $\phi$, then the value and Q-value functions (under $\phi$) would be $V_{\phi}(s)=E_{\phi}[G_{k}\mid X_{k}=s]$ and $Q_{\phi}(s,a)=E_{\phi}[G_{k}\mid X_{k}=s,Z_{k}=a]$, respectively. In what follows, for ease of notation, we let $V_{\theta}\equiv V_{\phi_{\theta}}$.

The policy gradient theorem for episodic problems has the following form, cf. [26, 25]:

\nabla V_{\theta}(s_{0})=\sum_{s\in S}\mu(s)\sum_{a\in A(s)}\nabla_{\theta}\pi(s,a)Q_{\theta}(s,a), (4)

where $\mu(s)$, $s\in S$, is defined as $\mu(s)=\frac{\eta(s)}{\sum_{s^{\prime}\in S}\eta(s^{\prime})}$ with $\eta(s)=\sum_{k=0}^{\infty}p^{k}(s|s_{0},\pi_{\theta})$, $s\in S$, where $p^{k}(s|s_{0},\pi_{\theta})$ is the $k$-step transition probability of going to state $s$ from $s_{0}$ under the policy $\pi_{\theta}$. A similar result holds for the long-run average cost setting with $\mu(s)=d^{\pi}(s)$ (the stationary distribution of $\{X_{n}\}$ under policy $\pi$), where $Q_{\pi}(s,a)$ is then the state-action differential value function. Further, in the discounted cost setting too, a similar result holds but with $\eta(s)$ replaced by $\eta(s)=\sum_{k=0}^{\infty}\gamma^{k}p^{k}(s|s_{0},\pi_{\theta})$, where $0<\gamma<1$ is the discount factor. Proving the policy gradient theorem when the state and action spaces are finite is relatively straightforward [26, 25]. However, one would require strong regularity assumptions on the system dynamics and the performance function, as with IPA or likelihood ratio approaches [13], if the state and action spaces are non-finite, i.e., countably infinite or continuously valued.

The Reinforce algorithm [25, 27] makes use of the policy gradient theorem, as the latter indicates that the gradient of the value function is the expectation of the gradient of a function of the noisy returns obtained from episodes. In what follows, we present an alternative algorithm based on Reinforce that incorporates a one-measurement SF gradient estimator. Our algorithm does not incorporate the policy gradient theorem and thus does not rely on an interchange between the gradient and expectation operators. It incorporates a zeroth-order gradient approximation using the smoothed functional method and yet, like the Reinforce algorithm, requires only one sample trajectory per update. This is a clear advantage of our algorithm. However, since our algorithm caters to episodic tasks, it performs updates at the instants of visit to a certain prescribed recurrent state, as considered in [10, 20]. It is important to mention that such instants can be highly sparse in practice, since in most practical systems the state and action spaces can be very large. We refer to our algorithm as the SF Reinforce algorithm.

III The SF Reinforce Algorithm

We consider here the case of episodic problems and the model-free setting, whereby we do not assume any knowledge of the system model, i.e., the transition probabilities $p(s^{\prime}\mid s,a)$; in their place, we assume that we have access to data (either real or simulated). The data that is available is over trajectories of states, actions, single-stage costs and next states until termination. We assume that multiple trajectories of data can be made available, and the data on the $m$th trajectory can be represented in the form of the tuples $(s^{m}_{k},a^{m}_{k},g^{m}_{k},s^{m}_{k+1})$, $k=0,1,\ldots,T_{m}$, with $T_{m}$ being the termination instant on the $m$th trajectory, $m\geq 1$. Also, $s^{m}_{j}$ is the state at instant $j$, $j=k,k+1$, in the $m$th trajectory. Further, $a^{m}_{k}$ and $g^{m}_{k}$ are the action chosen and the cost incurred, respectively, at instant $k$ in the $m$th trajectory. As mentioned before, we consider a class of stationary randomized policies that are parameterized by $\theta$ and satisfy Assumption 1.

Let $\Gamma:{\cal R}^{d}\rightarrow C$ denote a projection operator that projects any $x=(x_{1},\ldots,x_{d})^{T}\in{\cal R}^{d}$ to its nearest point in $C$. Thus, if $x\in C$, then $\Gamma(x)=x$. For ease of exposition, we assume that $C$ is a $d$-dimensional rectangle having the form $C=\prod_{i=1}^{d}[a_{i,\min},a_{i,\max}]$, where $-\infty<a_{i,\min}<a_{i,\max}<\infty$, $\forall i=1,\ldots,d$. Then $\Gamma(x)=(\Gamma_{1}(x_{1}),\ldots,\Gamma_{d}(x_{d}))^{T}$ with $\Gamma_{i}:{\cal R}\rightarrow[a_{i,\min},a_{i,\max}]$ such that $\Gamma_{i}(x_{i})=\min(a_{i,\max},\max(a_{i,\min},x_{i}))$, $i=1,\ldots,d$. Also, let ${\cal C}(C)$ denote the space of all continuous functions from $C$ to ${\cal R}^{d}$.
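For the rectangular $C$ assumed above, the projection $\Gamma$ reduces to componentwise clipping; a minimal sketch, with a_min and a_max holding the bounds $a_{i,\min}$ and $a_{i,\max}$, is as follows.

```python
import numpy as np

def project(x, a_min, a_max):
    """Gamma(x) for the rectangle C = prod_i [a_min[i], a_max[i]]:
    Gamma_i(x_i) = min(a_max[i], max(a_min[i], x_i)), applied componentwise."""
    return np.minimum(a_max, np.maximum(a_min, x))
```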

In what follows, we present a procedure that incrementally updates the parameter $\theta$. Let $\theta(n)$ denote the parameter value obtained after the $n$th update of this procedure, which depends on the $n$th episode; the latter is run using the policy parameter $\Gamma(\theta(n)+\delta_{n}\Delta(n))$, for $n\geq 0$, where $\theta(n)=(\theta_{1}(n),\ldots,\theta_{d}(n))^{T}\in\mathcal{R}^{d}$, $\delta_{n}>0$ $\forall n$ with $\delta_{n}\rightarrow 0$ as $n\rightarrow\infty$, and $\Delta(n)=(\Delta_{1}(n),\ldots,\Delta_{d}(n))^{T}$, $n\geq 0$, where $\Delta_{i}(n)$, $i=1,\ldots,d$, $n\geq 0$, are independent random variables distributed according to the $N(0,1)$ distribution.

Algorithm (5) below is used to update the parameter $\theta\in C\subset\mathcal{R}^{d}$. Let $\chi^{n}$ denote the $n$th state-action trajectory $\chi^{n}=\{s^{n}_{0},a^{n}_{0},s^{n}_{1},a^{n}_{1},\ldots,s^{n}_{T-1},a^{n}_{T-1},s^{n}_{T}\}$, $n\geq 0$, where the actions $a^{n}_{0},\ldots,a^{n}_{T-1}$ in $\chi^{n}$ are obtained using the policy parameter $\theta(n)+\delta_{n}\Delta(n)$. The instant $T$ denotes the termination instant in the trajectory $\chi^{n}$ and corresponds to the instant when the terminal or goal state $t$ is reached. Note that the various actions in the trajectory $\chi^{n}$ are chosen according to the policy $\phi_{(\theta(n)+\delta_{n}\Delta(n))}$. The initial state is assumed to be sampled from a given initial distribution $\nu=(\nu(i),i\in S)$ over states.

Let $G^{n}=\sum_{k=0}^{T-1}g^{n}_{k}$ denote the sum of costs until termination on the trajectory $\chi^{n}$, with $g^{n}_{k}\equiv g(X^{n}_{k},Z^{n}_{k},X^{n}_{k+1})$ and where the superscript $n$ on the various quantities indicates that these correspond to the $n$th episode. The update rule that we consider here is the following: For $n\geq 0$, $i=1,\ldots,d$,

\theta_{i}(n+1)=\Gamma_{i}\left(\theta_{i}(n)-a(n)\left(\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\right)\right). (5)

We assume that the step-sizes $a(n)$, $n\geq 0$, satisfy the following requirement:

Assumption 2

The step-size sequence $\{a(n)\}$ satisfies $a(n)>0$, $\forall n$. Further,

\sum_{n}a(n)=\infty,\mbox{ }\sum_{n}\left(\frac{a(n)}{\delta_{n}}\right)^{2}<\infty.
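For instance, one admissible choice of sequences (given here purely for illustration and not prescribed by the algorithm) is

a(n)=\frac{1}{n+1},\qquad\delta_{n}=\frac{1}{(n+1)^{1/4}},

for which $\sum_{n}a(n)=\infty$ and $\sum_{n}\left(\frac{a(n)}{\delta_{n}}\right)^{2}=\sum_{n}(n+1)^{-3/2}<\infty$, while $\delta_{n}\rightarrow 0$ as required of the perturbation sequence.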

Once the $n$th update of the parameter (i.e., $\theta(n)$) is obtained, the perturbed parameter $\theta(n)+\delta_{n}\Delta(n)$ is obtained after sampling $\Delta(n)$ from the multivariate Gaussian distribution as explained previously and thereafter a new trajectory governed by this perturbed parameter is generated with the initial state in each episode sampled according to a given distribution $\nu$. This is because each episode ends in the terminal state $t$ and any fresh episode starts from a non-terminal initial state sampled from the distribution $\nu$.
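Putting the pieces together, a minimal sketch of the resulting SF Reinforce loop is given below. The episodic environment interface (env.reset() returning an initial state drawn from $\nu$, and env.step(s, a) returning a tuple (cost, next_state, done)) and the state-feature map features(s) are hypothetical choices made only for this sketch; softmax_policy and project are the sketches given earlier.

```python
import numpy as np

def sf_reinforce(env, features, theta0, a_min, a_max, num_episodes, rng,
                 a=lambda n: 1.0 / (n + 1),
                 delta=lambda n: 1.0 / (n + 1) ** 0.25):
    """Sketch of the SF Reinforce update (5) with Gaussian perturbations."""
    theta = np.array(theta0, dtype=float)
    d = len(theta)
    for n in range(num_episodes):
        Delta = rng.standard_normal(d)           # Delta(n), i.i.d. N(0,1) components
        theta_pert = theta + delta(n) * Delta    # perturbed policy parameter
        # Run one episode under phi_{theta(n) + delta_n Delta(n)}; accumulate G^n.
        G, s, done = 0.0, env.reset(), False
        while not done:
            probs = softmax_policy(theta_pert, features(s))
            action = rng.choice(len(probs), p=probs)
            cost, s, done = env.step(s, action)
            G += cost
        # Update (5): theta_i(n+1) = Gamma_i(theta_i(n) - a(n) Delta_i(n) G^n / delta_n).
        theta = project(theta - a(n) * Delta * G / delta(n), a_min, a_max)
    return theta
```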

IV Convergence Analysis

We begin by rewriting the recursion (5) as follows:

\theta_{i}(n+1)=\Gamma_{i}\left(\theta_{i}(n)-a(n)\left(E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\,\Big|\,\mathcal{F}_{n}\right]+M^{i}_{n+1}\right)\right), (6)

where $M^{i}_{n+1}=\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}-E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\,\Big|\,\mathcal{F}_{n}\right]$, $n\geq 0$. Here, $\mathcal{F}_{n}\stackrel{\triangle}{=}\sigma(\theta(m),m\leq n,\Delta(m),\chi^{m},m<n)$, $n\geq 1$, is a sequence of increasing sigma fields with $\mathcal{F}_{0}=\sigma(\theta(0))$. Let $M_{n}\stackrel{\triangle}{=}(M^{1}_{n},\ldots,M^{d}_{n})^{T}$, $n\geq 0$.

Lemma 1

$(M_{n},\mathcal{F}_{n})$, $n\geq 0$, is a martingale difference sequence.

Proof:

Notice that

M^{i}_{n}=\Delta_{i}(n-1)\frac{G^{n-1}}{\delta_{n-1}}-E\left[\Delta_{i}(n-1)\frac{G^{n-1}}{\delta_{n-1}}\mid\mathcal{F}_{n-1}\right].

The first term on the RHS above is clearly $\mathcal{F}_{n}$-measurable, while the second term is $\mathcal{F}_{n-1}$-measurable and hence $\mathcal{F}_{n}$-measurable as well. Further, from Assumption 1, each $M_{n}$ is integrable. Finally, it is easy to verify that

E[M^{i}_{n+1}\mid\mathcal{F}_{n}]=0,\mbox{ }\forall i.

The claim follows. ∎

Proposition 1

We have

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{F}_{n}\right]=\sum_{s\in S}\nu(s)\nabla_{i}V_{\theta(n)}(s)+o(\delta_{n})\mbox{ a.s.}
Proof:

Note that

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{F}_{n}\right]=E\left[E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{G}_{n}\right]\mid\mathcal{F}_{n}\right],

where $\mathcal{G}_{n}\stackrel{\triangle}{=}\sigma(\theta(m),\Delta(m),m\leq n,\chi^{m},m<n)$, $n\geq 1$, is a sequence of increasing sigma fields with $\mathcal{G}_{0}=\sigma(\theta(0),\Delta(0))$. It is clear that $\mathcal{F}_{n}\subset\mathcal{G}_{n}$, $\forall n\geq 0$. Now,

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{G}_{n}\right]=\frac{\Delta_{i}(n)}{\delta_{n}}E[G^{n}\mid\mathcal{G}_{n}].

Let $s^{n}_{0}=s$ denote the initial state in the trajectory $\chi^{n}$. Recall that the initial state $s$ is chosen randomly from the distribution $\nu$. Thus,

E[G^{n}\mid\mathcal{G}_{n}]=\sum_{s}\nu(s)E[G^{n}\mid s^{n}_{0}=s,\phi_{\theta(n)+\delta_{n}\Delta(n)}]=\sum_{s}\nu(s)V_{\theta(n)+\delta_{n}\Delta(n)}(s).

Thus, with probability one,

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{G}_{n}\right]=\sum_{s}\nu(s)\left(\Delta_{i}(n)\frac{V_{\theta(n)+\delta_{n}\Delta(n)}(s)}{\delta_{n}}\right).

Hence, it follows almost surely that

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{F}_{n}\right]=\sum_{s}\nu(s)E\left[\Delta_{i}(n)\frac{V_{\theta(n)+\delta_{n}\Delta(n)}(s)}{\delta_{n}}\mid\mathcal{F}_{n}\right].

Using a Taylor’s expansion of $V_{\theta(n)+\delta_{n}\Delta(n)}(s)$ around $\theta(n)$ gives us

V_{\theta(n)+\delta_{n}\Delta(n)}(s_{n})=V_{\theta(n)}(s_{n})+\delta_{n}\Delta(n)^{T}\nabla V_{\theta(n)}(s_{n})+\frac{\delta_{n}^{2}}{2}\Delta(n)^{T}\nabla^{2}V_{\theta(n)}(s_{n})\Delta(n)+o(\delta_{n}^{2}).

Now recall that $\Delta(n)=(\Delta_{i}(n),i=1,\ldots,d)^{T}$. Thus,

\Delta(n)\frac{V_{\theta(n)+\delta_{n}\Delta(n)}(s_{n})}{\delta_{n}}=\frac{1}{\delta_{n}}\Delta(n)V_{\theta(n)}(s_{n})+\Delta(n)\Delta(n)^{T}\nabla V_{\theta(n)}(s_{n})+\frac{\delta_{n}}{2}\Delta(n)\Delta(n)^{T}\nabla^{2}V_{\theta(n)}(s_{n})\Delta(n)+o(\delta_{n}).

Now observe from the properties of $\Delta_{i}(n)$, $\forall i,n$, that
(i) $E[\Delta(n)]=0$ (the zero vector), $\forall n$, since $\Delta_{i}(n)\sim N(0,1)$, $\forall i,n$.
(ii) $E[\Delta(n)\Delta(n)^{T}]=I$ (the identity matrix), $\forall n$.
(iii) $E\left[\sum_{i,j,k=1}^{d}\Delta_{i}(n)\Delta_{j}(n)\Delta_{k}(n)\right]=0$.
Property (iii) follows from the facts that (a) $E[\Delta_{i}(n)\Delta_{j}(n)\Delta_{k}(n)]=0$, $\forall i\not=j\not=k$, (b) $E[\Delta_{i}(n)\Delta_{j}^{2}(n)]=0$, $\forall i\not=j$ (this pertains to the case where $i\not=j$ but $j=k$ above) and (c) $E[\Delta_{i}^{3}(n)]=0$ (for the case when $i=j=k$ above). These properties follow from the independence of the random variables $\Delta_{i}(n)$, $i=1,\ldots,d$, $n\geq 0$, as well as the fact that they are all distributed $N(0,1)$. The claim now follows from (i)-(iii) above. ∎
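As a quick numerical sanity check (purely illustrative and not part of the proof), properties (i)-(iii) can be verified by Monte Carlo sampling of standard Gaussian vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_samples = 4, 200_000
Delta = rng.standard_normal((num_samples, d))   # rows are i.i.d. N(0, I_d) vectors

print(Delta.mean(axis=0))                       # (i): close to the zero vector
print(Delta.T @ Delta / num_samples)            # (ii): close to the identity matrix
third_moments = np.einsum('ni,nj,nk->ijk', Delta, Delta, Delta) / num_samples
print(third_moments.sum())                      # (iii): close to zero
```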

In the light of Proposition 1, we can rewrite (5) as follows:

\theta(n+1)=\Gamma\left(\theta(n)-a(n)\left(\sum_{s}\nu(s)\nabla V_{\theta(n)}(s)+M_{n+1}+\beta(n)\right)\right), (7)

where $\beta_{i}(n)=E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{F}_{n}\right]-\sum_{s}\nu(s)\nabla_{i}V_{\theta(n)}(s)$ and $\beta(n)=(\beta_{1}(n),\ldots,\beta_{d}(n))^{T}$. From Proposition 1, it follows that $\beta(n)=o(\delta_{n})$.

Lemma 2

The function $\nabla V_{\theta}(s)$ is Lipschitz continuous in $\theta$. Further, $\exists$ a constant $K_{1}>0$ such that $\parallel\nabla V_{\theta}(s)\parallel\leq K_{1}(1+\parallel\theta\parallel)$.

Proof:

It can be seen from (4) that $V_{\theta}(s)$ is continuously differentiable in $\theta$. It can also be shown as in Theorem 3 of [12] that $\nabla^{2}V_{\theta}(s)$ exists and is continuous. Since $\theta$ takes values in $C$, a compact set, it follows that $\nabla^{2}V_{\theta}(s)$ is bounded and thus $\nabla V_{\theta}(s)$ is Lipschitz continuous.

Finally, let $L_{1}^{s}>0$ denote the Lipschitz constant for the function $\nabla V_{\theta}(s)$. Then, for a given $\theta_{0}\in C$,

\parallel\nabla V_{\theta}(s)\parallel-\parallel\nabla V_{\theta_{0}}(s)\parallel\leq\parallel\nabla V_{\theta}(s)-\nabla V_{\theta_{0}}(s)\parallel\leq L_{1}^{s}\parallel\theta-\theta_{0}\parallel\leq L_{1}^{s}\parallel\theta\parallel+L_{1}^{s}\parallel\theta_{0}\parallel.

Thus, $\parallel\nabla V_{\theta}(s)\parallel\leq\parallel\nabla V_{\theta_{0}}(s)\parallel+L^{s}_{1}\parallel\theta_{0}\parallel+L^{s}_{1}\parallel\theta\parallel$. Let $K_{s}\stackrel{\triangle}{=}\|\nabla V_{\theta_{0}}(s)\|+L^{s}_{1}\|\theta_{0}\|$ and $K_{1}\stackrel{\triangle}{=}\max(K_{s},L^{s}_{1},s\in S)$. Thus, $\parallel\nabla V_{\theta}(s)\parallel\leq K_{1}(1+\parallel\theta\parallel)$. Note here that since $|S|<\infty$, $K_{1}<\infty$ as well. The claim follows. ∎

Lemma 3

The sequence $(M_{n},\mathcal{F}_{n})$, $n\geq 0$, satisfies $E[\|M_{n+1}\|^{2}\mid\mathcal{F}_{n}]\leq\frac{\hat{L}}{\delta_{n}^{2}}$, for some constant $\hat{L}>0$.

Proof:

Note that

\|M_{n+1}\|^{2}=\sum_{i=1}^{d}(M^{i}_{n+1})^{2}=\sum_{i=1}^{d}\Big(\Delta_{i}^{2}(n)\frac{(G^{n})^{2}}{\delta_{n}^{2}}+\frac{1}{\delta_{n}^{2}}E\left[\Delta_{i}(n)G^{n}\mid\mathcal{F}_{n}\right]^{2}-2\Delta_{i}(n)\frac{G^{n}}{\delta_{n}^{2}}E\left[\Delta_{i}(n)G^{n}\mid\mathcal{F}_{n}\right]\Big).

Thus,

E[\|M_{n+1}\|^{2}\mid\mathcal{F}_{n}]=\frac{1}{\delta_{n}^{2}}\sum_{i=1}^{d}\Big(E[\Delta_{i}^{2}(n)(G^{n})^{2}\mid\mathcal{F}_{n}]-E^{2}[\Delta_{i}(n)G^{n}\mid\mathcal{F}_{n}]\Big).

The claim now follows from Assumption 1 and the fact that all single-stage costs are bounded (cf. pp.174, Chapter 3 of [1]).∎

Define now a sequence $Z_{n}$, $n\geq 0$, according to

Z_{n}=\sum_{m=0}^{n-1}a(m)M_{m+1},

$n\geq 1$, with $Z_{0}=0$.

Lemma 4

$(Z_{n},\mathcal{F}_{n})$, $n\geq 0$, is an almost surely convergent martingale sequence.

Proof:

It is easy to see that $Z_{n}$ is $\mathcal{F}_{n}$-measurable $\forall n$. Further, it is integrable for each $n$ and moreover $E[Z_{n+1}\mid\mathcal{F}_{n}]=Z_{n}$ almost surely since $(M_{n+1},\mathcal{F}_{n})$, $n\geq 0$, is a martingale difference sequence by Lemma 1. It is also square integrable from Lemma 3. The quadratic variation process of this martingale will be convergent almost surely if

\sum_{n=0}^{\infty}E[\|Z_{n+1}-Z_{n}\|^{2}\mid\mathcal{F}_{n}]<\infty\mbox{ a.s.} (8)

Note that

E[\|Z_{n+1}-Z_{n}\|^{2}\mid\mathcal{F}_{n}]=a(n)^{2}E[\|M_{n+1}\|^{2}\mid\mathcal{F}_{n}].

Thus,

\sum_{n=0}^{\infty}E[\|Z_{n+1}-Z_{n}\|^{2}\mid\mathcal{F}_{n}]=\sum_{n=0}^{\infty}a(n)^{2}E[\|M_{n+1}\|^{2}\mid\mathcal{F}_{n}]\leq\hat{L}\sum_{n=0}^{\infty}\left(\frac{a(n)}{\delta_{n}}\right)^{2},

by Lemma 3. (8) now follows as a consequence of Assumption 2. Now $(Z_{n},\mathcal{F}_{n})$, $n\geq 0$, can be seen to be convergent from the martingale convergence theorem for square integrable martingales [8]. ∎

Consider now the following ODE:

\dot{\theta}(t)=\bar{\Gamma}\left(-\sum_{s}\nu(s)\nabla V_{\theta}(s)\right), (9)

where $\bar{\Gamma}:{\cal C}(C)\rightarrow{\cal C}({\cal R}^{d})$ is defined according to

\bar{\Gamma}(v(x))=\lim_{\eta\rightarrow 0}\left(\frac{\Gamma(x+\eta v(x))-x}{\eta}\right). (10)

Let $H\stackrel{\triangle}{=}\{\theta\mid\bar{\Gamma}(-\sum_{s}\nu(s)\nabla V_{\theta}(s))=0\}$ denote the set of all equilibria of (9). By Lemma 11.1 of [9], the only possible $\omega$-limit sets that can occur as invariant sets for the ODE (9) are subsets of $H$. Let $\bar{H}\subset H$ be the set of all internally chain recurrent points of the ODE (9). Our main result below is based on Theorem 5.3.1 of [18] for projected stochastic approximation algorithms. Before we proceed further, we recall that result below.

Let $C\subset{\cal R}^{d}$ be a compact and convex set as before and $\Gamma:{\cal R}^{d}\rightarrow C$ denote the projection operator that projects any $x=(x_{1},\ldots,x_{d})^{T}\in{\cal R}^{d}$ to its nearest point in $C$.

Consider now the following $d$-dimensional stochastic recursion

X_{n+1}=\Gamma(X_{n}+a(n)(h(X_{n})+\xi_{n}+\beta_{n})), (11)

under the assumptions listed below. Also, consider the following ODE associated with (11):

\dot{X}(t)=\bar{\Gamma}(h(X(t))). (12)

Let ${\cal C}(C)$ denote the space of all continuous functions from $C$ to ${\cal R}^{d}$. The operator $\bar{\Gamma}:{\cal C}(C)\rightarrow{\cal C}({\cal R}^{d})$ is defined according to

\bar{\Gamma}(v(x))=\lim_{\eta\rightarrow 0}\left(\frac{\Gamma(x+\eta v(x))-x}{\eta}\right), (13)

for any continuous $v:C\rightarrow{\cal R}^{d}$. The limit in (13) exists and is unique since $C$ is a convex set. In case this limit is not unique, one may consider the set of all limit points of (13). Note also that from its definition, $\bar{\Gamma}(v(x))=v(x)$ if $x\in C^{o}$ (the interior of $C$). This is because for such an $x$, one can find $\eta>0$ sufficiently small so that $x+\eta v(x)\in C^{o}$ as well and hence $\Gamma(x+\eta v(x))=x+\eta v(x)$. On the other hand, if $x\in\partial C$ (the boundary of $C$) is such that $x+\eta v(x)\not\in C$ for any small $\eta>0$, then $\bar{\Gamma}(v(x))$ is the projection of $v(x)$ to the tangent space of $\partial C$ at $x$.
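As a small illustration, for the rectangular $C$ used in our setting, $\bar{\Gamma}(v(x))$ in (13) can be approximated with a small but nonzero $\eta$. This is a finite-$\eta$ approximation made for illustration; project is the componentwise clipping sketch from Section III.

```python
def gamma_bar_approx(v_x, x, a_min, a_max, eta=1e-8):
    """Finite-eta approximation of bar{Gamma}(v(x)) = lim_{eta->0} (Gamma(x + eta v(x)) - x)/eta.

    For x in the interior of C this returns (approximately) v(x) itself; on the
    boundary, components of v(x) that point outside C are zeroed out.
    """
    return (project(x + eta * v_x, a_min, a_max) - x) / eta
```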

Consider now the assumptions listed below.

  • (B1)

    The function $h:{\cal R}^{d}\rightarrow{\cal R}^{d}$ is continuous.

  • (B2)

    The step-sizes $a(n)$, $n\geq 0$, satisfy

    a(n)>0\;\forall n,\mbox{ }\sum_{n}a(n)=\infty,\mbox{ }a(n)\rightarrow 0\mbox{ as }n\rightarrow\infty.
  • (B3)

    The sequence $\beta_{n}$, $n\geq 0$, is a bounded random sequence with $\beta_{n}\rightarrow 0$ almost surely as $n\rightarrow\infty$.

  • (B4)

    There exists $T>0$ such that $\forall\epsilon>0$,

    \lim_{n\rightarrow\infty}P\left(\sup_{j\geq n}\max_{t\leq T}\left|\sum_{i=m(jT)}^{m(jT+t)-1}a(i)\xi_{i}\right|\geq\epsilon\right)=0.
  • (B5)

    The ODE (12) has a compact subset $K$ of ${\cal R}^{d}$ as its set of asymptotically stable equilibrium points.

Let $t(n)$, $n\geq 0$, be a sequence of nonnegative real numbers defined according to $t(0)=0$ and, for $n\geq 1$, $t(n)=\sum_{j=0}^{n-1}a(j)$. By Assumption (B2), $t(n)\rightarrow\infty$ as $n\rightarrow\infty$. Let $m(t)=\max\{n\mid t(n)\leq t\}$. Thus, $m(t)\rightarrow\infty$ as $t\rightarrow\infty$. Assumptions (B1)-(B3) correspond to A5.1.3-A5.1.5 of [18] while (B4)-(B5) correspond to A5.3.1-A5.3.2 there.

[18, Theorem 5.3.1 (pp. 191-196)] essentially says the following:

Theorem 1 (Kushner and Clark Theorem)

Under Assumptions (B1)–(B5), almost surely, $X_{n}\rightarrow K$ as $n\rightarrow\infty$.

Theorem 2

The iterates $\theta(n)$, $n\geq 0$, governed by (5) converge almost surely to $\bar{H}$.

Proof:

In view of the foregoing, we rewrite (5) according to

\theta_{i}(n+1)=\Gamma_{i}\Big(\theta_{i}(n)-a(n)\Big(\sum_{s}\nu(s)\nabla_{i}V_{\theta(n)}(s)+\beta_{i}(n)+M^{i}_{n+1}\Big)\Big), (14)

where $\beta_{i}(n)$ is as in (7). We shall proceed by verifying Assumptions (B1)-(B5) and subsequently appeal to Theorem 5.3.1 of [18] (i.e., Theorem 1 above) to claim convergence of the scheme. Note that Lemma 2 ensures Lipschitz continuity of $\nabla V_{\theta}(s)$, implying (B1). Next, from Assumption 2, $\sum_{n}(a(n)/\delta_{n})^{2}<\infty$, so that $a(n)/\delta_{n}\rightarrow 0$; since $\delta_{n}\rightarrow 0$, it follows that $a(n)\rightarrow 0$ as $n\rightarrow\infty$. Thus, Assumption (B2) holds as well. Now from Lemma 2, it follows that $\sum_{s}\nu(s)\nabla V_{\theta}(s)$ is uniformly bounded since $\theta\in C$, a compact set. Assumption (B3) is now verified from Proposition 1, since $\beta(n)=o(\delta_{n})\rightarrow 0$. Assumption (B4) follows as a consequence of Lemma 4, since the almost sure convergence of the martingale $(Z_{n})$ ensures that its tail sums vanish. Finally, for Assumption (B5), note that for the ODE (9), $F(\theta)=\sum_{s}\nu(s)V_{\theta}(s)$ serves as an associated Lyapunov function and in fact

\nabla F(\theta)^{T}\bar{\Gamma}\Big(-\sum_{s}\nu(s)\nabla V_{\theta}(s)\Big)=\Big(\sum_{s}\nu(s)\nabla_{\theta}V_{\theta}(s)\Big)^{T}\bar{\Gamma}\Big(-\sum_{s}\nu(s)\nabla V_{\theta}(s)\Big)\leq 0.

For $\theta\in C^{o}$ (the interior of $C$), it is easy to see that $\bar{\Gamma}(-\sum_{s}\nu(s)\nabla V_{\theta}(s))=-\sum_{s}\nu(s)\nabla V_{\theta}(s)$, and

\nabla F(\theta)^{T}\bar{\Gamma}\Big(-\sum_{s}\nu(s)\nabla V_{\theta}(s)\Big)\;\begin{cases}<0&\mbox{ if }\theta\in H^{c}\cap C,\\ =0&\mbox{ otherwise.}\end{cases}

For $\theta\in\partial C$ (the boundary of $C$), there can additionally be spurious attractors, see [19], that are also contained in $H$. The claim now follows from Theorem 5.3.1 of [18]. ∎

V Conclusions

We presented a version of the Reinforce algorithm for the setting of episodic tasks that incorporates a one-simulation SF algorithm and proved its convergence using the ODE approach. In a longer version of this paper, we shall consider also the average cost case and present an algorithm based on [4] that does not rely on regeneration epochs for performing updates. Moreover, we shall show detailed empirical results comparing our proposed algorithms with the policy gradient scheme and other algorithms.

References

  • [1] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol.II. Athena Scientific, 2012.
  • [2] S. Bhatnagar. Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation, 15(1):74–107, 2005.
  • [3] S. Bhatnagar. Adaptive Newton-based smoothed functional algorithms for simulation optimization. ACM Transactions on Modeling and Computer Simulation, 18(1):2:1–2:35, 2007.
  • [4] S. Bhatnagar and V.S. Borkar. Multiscale chaotic SPSA and smoothed functional algorithms for simulation optimization. Simulation, 79(10):568–580, 2003.
  • [5] S. Bhatnagar, H. L. Prasad, and L. A. Prashanth. Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods (Lecture Notes in Control and Information Sciences), volume 434. Springer, 2013.
  • [6] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, and M. Lee. Incremental natural actor-critic algorithms. Advances in Neural Information Processing Systems, 20, 2007.
  • [7] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
  • [8] V. S. Borkar. Probability Theory: An Advanced Course. Springer, New York, 1995.
  • [9] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2nd Edition. Cambridge University Press, 2022.
  • [10] X.-R. Cao. Stochastic Learning and Optimization: A Sensitivity-Based Approach. Springer, 2007.
  • [11] E. K. P. Chong and P. J. Ramadge. Optimization of queues using an infinitesimal perturbation analysis-based stochastic algorithm with general update times. SIAM J. Cont. and Optim., 31(3):698–732, 1993.
  • [12] T. Furmston, G. Lever, and D. Barber. Approximate Newton methods for approximate policy search in Markov decision processes. Journal of Machine Learning Research, 17:1–51, 2016.
  • [13] Y. C. Ho and X. R. Cao. Perturbation Analysis of Discrete Event Dynamical Systems. Kluwer, Boston, 1991.
  • [14] V. Ya Katkovnik and Yu Kulchitsky. Convergence of a class of random search algorithms. Automation and Remote Control, 8:1321–1326, 1972.
  • [15] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23:462–466, 1952.
  • [16] V.R. Konda and V.S. Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
  • [17] V.R. Konda and J.N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
  • [18] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Verlag, New York, 1978.
  • [19] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer Verlag, New York, 1997.
  • [20] P. Marbach and J.N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
  • [21] L. A. Prashanth, S. Bhatnagar, M.C. Fu, and S.I. Marcus. Adaptive system optimization using random directions stochastic approximation. IEEE Transactions on Automatic Control, 62(5):2223–2238, 2017.
  • [22] R. Y. Rubinstein. Simulation and the Monte Carlo Method. Wiley, New York, 1981.
  • [23] J.C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.
  • [24] J.C. Spall. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica, 33(1):109–112, 1997.
  • [25] R. S. Sutton and A. W. Barto. Reinforcement Learning, 2nd Edition. MIT Press, 2018.
  • [26] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 99, pages 1057–1063, 1999.
  • [27] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning, pages 5–32, 1992.