
The Reinforce Policy Gradient Algorithm Revisited

Shalabh Bhatnagar
Department of Computer Science and Automation,
Indian Institute of Science,
Bengaluru 560012, India
shalabh@iisc.ac.in
The author was supported by a J. C. Bose Fellowship, Project No. DFTM/02/3125/M/04/AIR-04 from DRDO under DIA-RCOE, a project from DST-ICPS, and the RBCCPS, IISc.
(October 2023)
Abstract

We revisit the Reinforce policy gradient algorithm from the literature. This algorithm typically works with cost returns obtained over random-length episodes, resulting either from termination upon reaching a goal state (as with episodic tasks) or from instants of visit to a prescribed recurrent state (in the case of continuing tasks). We propose a major enhancement to the basic algorithm. We estimate the policy gradient using a function measurement over a perturbed parameter by appealing to a class of random search approaches. This has advantages in the case of systems with infinite state and action spaces, as it relaxes some of the regularity requirements that would otherwise be needed for proving convergence of the Reinforce algorithm. Nonetheless, we observe that even though we estimate the gradient of the performance objective using the performance objective itself (and not via the sample gradient), the algorithm converges to a neighborhood of a local minimum. We also provide a proof of convergence for this new algorithm.

Index Terms:
Reinforce policy gradient algorithm, smoothed functional technique, stochastic gradient search, stochastic shortest path Markov decision processes.

I Introduction

Policy gradient methods [25, 26] are a popular class of approaches in reinforcement learning. The policy in these approaches is parameterized, and one updates the policy parameter along a gradient search direction, where the gradient is of a performance objective, normally the value function. The policy gradient theorem [26, 20, 10], a fundamental result underlying these approaches, relies on an interchange of the gradient and expectation operators; the policy gradient then turns out to be the expectation of the gradient of noisy performance functions, much like the previously studied perturbation analysis based sensitivity approaches for optimization via simulation [13, 11].

The Reinforce algorithm [27, 25] is a noisy gradient scheme for which the expectation of the gradient estimate is the policy gradient, i.e., the gradient of the expected objective w.r.t. the policy parameters. The policy parameter is, however, updated only once the full return on an episode has been obtained. Actor-critic algorithms [25, 16, 17, 7, 6] have been presented in the literature as alternatives to the Reinforce algorithm, as they perform incremental parameter updates at every instant, but they do so using two-timescale stochastic approximation algorithms.

In this paper, we revisit the Reinforce algorithm and present a new algorithm for the case of episodic tasks or the stochastic shortest path setting. Our algorithm performs parameter updates upon termination of episodes, that is, when goal or terminal states are reached. In this setting, as mentioned above, updates are performed only at instants of visit to a prescribed recurrent state [10, 20]. The algorithm is based on a single function measurement or simulation at a perturbed parameter value, where the perturbations are obtained using independent Gaussian random variates.

Gradient estimation in this algorithm is performed using the smoothed functional (SF) technique [22, 4, 3, 5]. The basic problem in this setting is the following: Given an objective function $J:\mathcal{R}^{d}\rightarrow\mathcal{R}$ such that $J(\theta)=E_{\xi}[h(\theta,\xi)]$, where $\theta\in\mathcal{R}^{d}$ is the parameter to be tuned and $\xi$ is the noise element, the goal is to find $\theta^{*}\in\mathcal{R}^{d}$ such that

J(\theta^{*})=\min_{\theta\in\mathcal{R}^{d}}J(\theta).

Since the objective function $J(\cdot)$ can be highly nonlinear, one often settles for a lesser goal – that of finding a local instead of a global minimum. In this setting, the Kiefer-Wolfowitz [15] finite difference estimates for the gradient of $J$ would correspond to the following: For $i=1,\ldots,d$,

\nabla_{i}J(\theta)=\lim_{0<\delta\downarrow 0}E_{\xi}\left[\frac{h(\theta+\delta e_{i},\xi^{+}_{i})-h(\theta-\delta e_{i},\xi^{-}_{i})}{2\delta}\right],

where $e_{i}$ is the unit vector with 1 in the $i$th place and 0 elsewhere. Further, $\xi^{+}_{i}$ and $\xi^{-}_{i}$, $i=1,\ldots,d$, are independent noise random variables having a common distribution. The expectation above is taken w.r.t. this common distribution on the noise random variables. This approach does not perform well in practice when $d$ is large, since one needs $2d$ function measurements or simulations for a parameter update.
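As an illustration, a minimal sketch of this two-sided finite-difference estimator is given below. The measurement function h(theta, rng), which returns a single noisy sample of $h(\theta,\xi)$, is a hypothetical interface assumed only for the sketch.

```python
import numpy as np

def kiefer_wolfowitz_gradient(h, theta, delta, rng):
    """Two-sided finite-difference estimate of the gradient of J at theta.

    Requires 2*d noisy measurements, one pair h(theta +/- delta*e_i, .) per
    coordinate, which is what makes the scheme expensive for large d.
    """
    d = len(theta)
    grad = np.zeros(d)
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = 1.0
        h_plus = h(theta + delta * e_i, rng)   # noisy measurement with xi_i^+
        h_minus = h(theta - delta * e_i, rng)  # noisy measurement with xi_i^-
        grad[i] = (h_plus - h_minus) / (2.0 * delta)
    return grad
```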

Random search methods such as simultaneous perturbation stochastic approximation (SPSA) [23, 24, 2], smoothed functional (SF) [14, 4, 3] or random directions stochastic approximation (RDSA) [18, 21] typically require far fewer simulations. For instance, the gradient-based algorithms in these approaches require only one or two simulations regardless of the parameter dimension $d$, while their Newton-based counterparts usually involve one to four system simulations for any parameter update (again regardless of $d$). A textbook treatment of random search approaches for stochastic optimization is available in [5].

We consider here a one-simulation SF algorithm where the gradient of $J(\theta)$ is estimated using a noisy function measurement at the parameter $\theta+\delta\Delta$, where $\delta>0$ and $\Delta\stackrel{\triangle}{=}(\Delta_{1},\ldots,\Delta_{d})^{T}$ with each $\Delta_{i}\sim N(0,1)$ and $\Delta_{i}$ independent of $\Delta_{j}$, $\forall j\not=i$. Further, $\Delta$ is independent of the measurement noise as well. The gradient estimate in this setting is the following: For $i=1,\ldots,d$,

\nabla_{i}J(\theta)=\lim_{0<\delta\downarrow 0}E_{\xi,\Delta}\left[\Delta_{i}\frac{h(\theta+\delta\Delta,\xi)}{\delta}\right].

In the above, $\xi$ denotes the measurement noise random variable. The expectation above is with respect to the joint distribution of $\xi$ and $\Delta$. Before we proceed further, we present the basic Markov decision process (MDP) setting as well as recall the Reinforce algorithm that we consider for the episodic setting. We remark here that there are not many analyses of Reinforce type algorithms in the literature in the episodic or stochastic shortest path setting.
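For comparison, here is a sketch of the one-measurement SF estimate above: a single noisy measurement at the Gaussian-perturbed point $\theta+\delta\Delta$ yields an estimate of the entire gradient vector. The measurement function h is the same hypothetical interface assumed in the previous sketch.

```python
def sf_gradient_one_measurement(h, theta, delta, rng):
    """One-measurement smoothed functional (SF) gradient estimate.

    A single noisy measurement at theta + delta*Delta, with Delta a vector of
    independent N(0,1) variates, estimates all d partial derivatives at once.
    """
    Delta = rng.standard_normal(len(theta))      # Delta_i ~ N(0,1), independent
    measurement = h(theta + delta * Delta, rng)  # one simulation only
    return Delta * measurement / delta
```

In contrast with the Kiefer-Wolfowitz sketch, the cost per update here does not grow with the dimension $d$.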

II The Basic MDP Setting

By a Markov decision process, we mean a controlled stochastic process $\{X_{n}\}$ whose evolution is governed by an associated control-valued sequence $\{Z_{n}\}$. It is assumed that $X_{n}$, $n\geq 0$, takes values in a set $S$ called the state space. Let $A(s)$ be the set of feasible actions in state $s\in S$ and $A\stackrel{\triangle}{=}\cup_{s\in S}A(s)$ denote the set of all actions. When the state is, say, $s$ and a feasible action $a$ is chosen, the next state is $s^{\prime}$ with probability $p(s^{\prime}|s,a)\stackrel{\triangle}{=}P(X_{n+1}=s^{\prime}\mid X_{n}=s,Z_{n}=a)$, $\forall n$. We assume these probabilities do not depend on $n$. Such a process satisfies the controlled Markov property, i.e.,

P(X_{n+1}=s^{\prime}\mid X_{n},Z_{n},\ldots,X_{0},Z_{0})=p(s^{\prime}\mid X_{n},Z_{n})\mbox{ a.s.}

By an admissible policy or simply a policy, we mean a sequence of functions $\pi=\{\mu_{0},\mu_{1},\mu_{2},\ldots\}$, with $\mu_{i}:S\rightarrow A$, $i\geq 0$, such that $\mu_{i}(s)\in A(s)$, $\forall s\in S$. The policy $\pi$ is a decision rule which specifies that if at instant $k$, the state is $i$, then the action chosen under $\pi$ by the decision maker would be $\mu_{k}(i)$. A stationary policy $\pi$ is one for which $\mu_{k}=\mu_{l}\stackrel{\triangle}{=}\mu$, $\forall k,l=0,1,\ldots$. In other words, under a stationary policy, the function that decides the action choice in a given state does not depend on the instant $n$ at which the action is chosen.

Associated with any transition to a state $s^{\prime}$ from a state $s$ under action $a$ is a ‘single-stage’ cost $g(s,a,s^{\prime})$, where $g:S\times A\times S\rightarrow\mathcal{R}$ is called the cost function. The goal of the decision maker is to select actions $a_{k}$, $k\geq 0$, in response to the system states $s_{k}$, $k\geq 0$, observed one at a time, so as to minimize a long-term cost objective. We assume here that the number of states and actions is finite.

II-A The Episodic or Stochastic Shortest Path Setting

We consider here the episodic or the stochastic shortest path problem where decision making terminates once a goal or terminal state is reached. We let $1,\ldots,p$ denote the set of non-terminal or regular states and $t$ be the terminal state. Thus, $S=\{1,2,\ldots,p,t\}$ denotes the state space for this problem.

Our basic setting here is similar to Chapter 3 of [1], where it is assumed that under any policy there is a positive probability of hitting the goal state $t$ in at most $p$ steps starting from any initial (non-terminal) state, which in turn signifies that the problem terminates in a finite, though random, amount of time.

Under a given policy $\pi$, define

V_{\pi}(s)=E_{\pi}\left[\sum_{k=0}^{T}g(X_{k},\mu_{k}(X_{k}),X_{k+1})\mid X_{0}=s\right], (1)

where $T>0$ is a finite random time at which the process enters the terminal state $t$. Here $E_{\pi}[\cdot]$ indicates that all actions are chosen according to policy $\pi$ depending on the system state at any instant. We assume that there is no action that is feasible in the state $t$ and so the process terminates once it reaches $t$.

Let $\Pi$ denote the set of all admissible policies. The goal here is to find the optimal value function $V^{*}(i)$, $i\in S$, where

V^{*}(i)=\min_{\pi\in\Pi}V_{\pi}(i)=V_{\pi^{*}}(i),\mbox{ }i\in S. (2)

Here $\pi^{*}$ denotes the optimal policy, i.e., the one that minimizes $V_{\pi}(i)$ over all policies $\pi$. A related goal then would be to find the optimal policy $\pi^{*}$. It turns out that in these problems, there exist stationary policies that are optimal, and so it is sufficient to search for an optimal policy within the class of stationary policies.

A stationary policy $\pi$ is called a proper policy (cf. pp. 174 of [1]) if

\hat{p}_{\pi}\stackrel{\triangle}{=}\max_{s=1,\ldots,p}P(X_{p}\not=t\mid X_{0}=s,\pi)<1.

In other words, regardless of the initial state $i$, there is a positive probability of termination after at most $p$ stages when using a proper policy $\pi$, and moreover $P(T<\infty)=1$ under such a policy.

An admissible policy (and so also a stationary policy) can be randomized as well. A randomized admissible policy or simply a randomized policy is a sequence of distributions $\psi=\{\phi_{0},\phi_{1},\ldots\}$ with each $\phi_{i}:S\rightarrow P(A)$. In other words, given a state $s$, a randomized policy would provide a distribution $\phi_{i}(s)=(\phi_{i}(s,a),a\in A(s))$ for the action to be chosen in the $i$th stage. A stationary randomized policy is one for which $\phi_{j}=\phi_{k}\stackrel{\triangle}{=}\phi$, $\forall j,k=0,1,\ldots$. In this case, we simply call $\phi$ a stationary randomized policy. Here and in the rest of the paper, we shall assume that the policies are stationary randomized and are parameterized via a certain parameter $\theta\in C\subset\mathcal{R}^{d}$, a compact and convex set.

We make the following assumption:

Assumption 1

All stationary randomized policies $\phi_{\theta}$ parameterized by $\theta\in C$ are proper.

In practice, one might be able to relax this assumption (as with the model-based analysis of [1]) by assuming that (a) for policies that are not proper, $V_{\pi}(i)=\infty$ for at least one non-terminal state $i$, and (b) there exists a proper policy. The optimal value function satisfies in this case the following Bellman equation: For $s=1,\ldots,p$,

V^{*}(s)=\min_{a\in A(s)}\left(\bar{g}(s,a)+\sum_{j=1}^{p}p(j\mid s,a)V^{*}(j)\right), (3)

where $\bar{g}(s,a)=\sum_{j=1}^{p}p(j|s,a)g(s,a,j)+p(t|s,a)g(s,a,t)$ is the expected single-stage cost in a non-terminal state $s$ when a feasible action $a$ is chosen. It can be shown, see [1], that an optimal stationary proper policy exists.

II-B The Policy Gradient Theorem

Policy gradient methods perform a gradient search within the prescribed class of parameterized policies. Let $\phi_{\theta}(s,a)$ denote the probability of selecting action $a\in A(s)$ when the state is $s\in S$ and the policy parameter is $\theta\in C$. We assume that $\phi_{\theta}(s,a)$ is continuously differentiable in $\theta$. A common example here is the class of parameterized Boltzmann or softmax policies; a sketch of such a parameterization is given below. Let $\phi_{\theta}(s)\stackrel{\triangle}{=}(\phi_{\theta}(s,a),a\in A(s))$, $s\in S$, and $\phi_{\theta}\stackrel{\triangle}{=}(\phi_{\theta}(s),s\in S)$.
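The sketch below shows a minimal softmax (Boltzmann) parameterization over a finite action set. The per state-action feature vectors x(s,a) are an assumption made for the example and are not part of the paper's setup.

```python
import numpy as np

def softmax_policy(theta, features_s):
    """phi_theta(s, .) for a Boltzmann/softmax parameterization.

    features_s: array of shape (|A(s)|, d) whose rows are feature vectors x(s, a).
    The resulting probabilities are continuously differentiable in theta.
    """
    prefs = features_s @ theta        # action preferences theta^T x(s, a)
    prefs = prefs - prefs.max()       # shift for numerical stability
    weights = np.exp(prefs)
    return weights / weights.sum()

def sample_action(theta, features_s, rng):
    """Draw an action index according to phi_theta(s, .)."""
    probs = softmax_policy(theta, features_s)
    return rng.choice(len(probs), p=probs)
```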

We assume that trajectories of states and actions are available either as real data or from a simulation device. Let $G_{k}=\sum_{j=k}^{T-1}g_{j}$ denote the sum of costs until termination (i.e., when a goal state is reached) on a trajectory starting from instant $k$. Note that if all actions are chosen according to a policy $\phi$, then the value and Q-value functions (under $\phi$) would be $V_{\phi}(s)=E_{\phi}[G_{k}\mid X_{k}=s]$ and $Q_{\phi}(s,a)=E_{\phi}[G_{k}\mid X_{k}=s,Z_{k}=a]$, respectively. In what follows, for ease of notation, we let $V_{\theta}\equiv V_{\phi_{\theta}}$.

The policy gradient theorem for episodic problems has the following form, cf. [26, 25]:

\nabla V_{\theta}(s_{0})=\sum_{s\in S}\mu(s)\sum_{a\in A(s)}\nabla_{\theta}\pi(s,a)Q_{\theta}(s,a), (4)

where $\mu(s)$, $s\in S$, is defined as $\mu(s)=\frac{\eta(s)}{\sum_{s^{\prime}\in S}\eta(s^{\prime})}$ with $\eta(s)=\sum_{k=0}^{\infty}p^{k}(s|s_{0},\pi_{\theta})$, $s\in S$, where $p^{k}(s|s_{0},\pi_{\theta})$ is the $k$-step transition probability of going to state $s$ from $s_{0}$ under the policy $\pi_{\theta}$. A similar result holds for the long-run average cost setting with $\mu(s)=d^{\pi}(s)$ (the stationary distribution of $\{X_{n}\}$ under policy $\pi$), where $Q_{\pi}(s,a)$ is then the state-action differential value function. Further, in the discounted cost setting too, a similar result holds but with $\eta(s)$ replaced by $\eta(s)=\sum_{k=0}^{\infty}\gamma^{k}p^{k}(s|s_{0},\pi_{\theta})$, where $0<\gamma<1$ is the discount factor. Proving the policy gradient theorem when the state and action spaces are finite is relatively straightforward [26, 25]. However, one would require strong regularity assumptions on the system dynamics and the performance function, as with IPA or likelihood ratio approaches [13], if the state and action spaces are non-finite, i.e., countably infinite or continuously valued.

The Reinforce algorithm [25, 27] makes use of the policy gradient theorem, as the latter indicates that the gradient of the value function is the expectation of the gradient of a function of the noisy returns obtained from episodes. In what follows, we present an alternative algorithm based on Reinforce that incorporates a one-measurement SF gradient estimator. Our algorithm does not incorporate the policy gradient theorem and thus does not rely on an interchange between the gradient and expectation operators. It incorporates a zeroth-order gradient approximation using the smoothed functional method and yet, like the Reinforce algorithm, requires only one sample trajectory per update. This is a clear advantage of our algorithm. However, since our algorithm caters to episodic tasks, it performs updates at the instants of visit to a certain prescribed recurrent state, as considered in [10, 20]. It is important to mention that such instants can be highly sparse in practice, since in most practical systems the state and action spaces can be very large. We refer to our algorithm as the SF Reinforce algorithm.

III The SF Reinforce Algorithm

We consider here the case of episodic problems and the model-free setting, whereby we do not assume any knowledge of the system model, i.e., the transition probabilities $p(s^{\prime}\mid s,a)$; in their place, we assume that we have access to data (either real or simulated). The data that is available is over trajectories of states, actions, single-stage costs and next states until termination. We assume that multiple trajectories of data can be made available, and the data on the $m$th trajectory can be represented in the form of the tuples $(s^{m}_{k},a^{m}_{k},g^{m}_{k},s^{m}_{k+1})$, $k=0,1,\ldots,T_{m}$, with $T_{m}$ being the termination instant on the $m$th trajectory, $m\geq 1$. Also, $s^{m}_{j}$ is the state at instant $j$, $j=k,k+1$, in the $m$th trajectory. Further, $a^{m}_{k}$ and $g^{m}_{k}$ are the action chosen and the cost incurred, respectively, at instant $k$ in the $m$th trajectory. As mentioned before, we consider a class of stationary randomized policies that are parameterized by $\theta$ and satisfy Assumption 1.

Let $\Gamma:{\cal R}^{d}\rightarrow C$ denote a projection operator that projects any $x=(x_{1},\ldots,x_{d})^{T}\in{\cal R}^{d}$ to its nearest point in $C$. Thus, if $x\in C$, then $\Gamma(x)=x$. For ease of exposition, we assume that $C$ is a $d$-dimensional rectangle having the form $C=\prod_{i=1}^{d}[a_{i,\min},a_{i,\max}]$, where $-\infty<a_{i,\min}<a_{i,\max}<\infty$, $\forall i=1,\ldots,d$. Then $\Gamma(x)=(\Gamma_{1}(x_{1}),\ldots,\Gamma_{d}(x_{d}))^{T}$ with $\Gamma_{i}:{\cal R}\rightarrow[a_{i,\min},a_{i,\max}]$ such that $\Gamma_{i}(x_{i})=\min(a_{i,\max},\max(a_{i,\min},x_{i}))$, $i=1,\ldots,d$. Also, let ${\cal C}(C)$ denote the space of all continuous functions from $C$ to ${\cal R}^{d}$.
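For the rectangular $C$ assumed above, the projection $\Gamma$ reduces to componentwise clipping; a minimal sketch, with a_min and a_max holding the bounds $a_{i,\min}$ and $a_{i,\max}$, is as follows.

```python
import numpy as np

def project(x, a_min, a_max):
    """Gamma(x) for the rectangle C = prod_i [a_min[i], a_max[i]]:
    Gamma_i(x_i) = min(a_max[i], max(a_min[i], x_i)), applied componentwise."""
    return np.minimum(a_max, np.maximum(a_min, x))
```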

In what follows, we present a procedure that incrementally updates the parameter $\theta$. Let $\theta(n)$ denote the parameter value obtained after the $n$th update of this procedure, which depends on the $n$th episode; the latter is run using the policy parameter $\Gamma(\theta(n)+\delta_{n}\Delta(n))$, for $n\geq 0$, where $\theta(n)=(\theta_{1}(n),\ldots,\theta_{d}(n))^{T}\in\mathcal{R}^{d}$, $\delta_{n}>0$ $\forall n$ with $\delta_{n}\rightarrow 0$ as $n\rightarrow\infty$, and $\Delta(n)=(\Delta_{1}(n),\ldots,\Delta_{d}(n))^{T}$, $n\geq 0$, where $\Delta_{i}(n)$, $i=1,\ldots,d$, $n\geq 0$, are independent random variables distributed according to the $N(0,1)$ distribution.

Algorithm (5) below is used to update the parameter $\theta\in C\subset\mathcal{R}^{d}$. Let $\chi^{n}$ denote the $n$th state-action trajectory $\chi^{n}=\{s^{n}_{0},a^{n}_{0},s^{n}_{1},a^{n}_{1},\ldots,s^{n}_{T-1},a^{n}_{T-1},s^{n}_{T}\}$, $n\geq 0$, where the actions $a^{n}_{0},\ldots,a^{n}_{T-1}$ in $\chi^{n}$ are obtained using the policy parameter $\theta(n)+\delta_{n}\Delta(n)$. The instant $T$ denotes the termination instant in the trajectory $\chi^{n}$ and corresponds to the instant when the terminal or goal state $t$ is reached. Note that the various actions in the trajectory $\chi^{n}$ are chosen according to the policy $\phi_{(\theta(n)+\delta_{n}\Delta(n))}$. The initial state is assumed to be sampled from a given initial distribution $\nu=(\nu(i),i\in S)$ over states.

Let $G^{n}=\sum_{k=0}^{T-1}g^{n}_{k}$ denote the sum of costs until termination on the trajectory $\chi^{n}$, with $g^{n}_{k}\equiv g(X^{n}_{k},Z^{n}_{k},X^{n}_{k+1})$ and where the superscript $n$ on the various quantities indicates that these correspond to the $n$th episode. The update rule that we consider here is the following: For $n\geq 0$, $i=1,\ldots,d$,

\theta_{i}(n+1)=\Gamma_{i}\left(\theta_{i}(n)-a(n)\left(\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\right)\right). (5)

We assume that the step-sizes $a(n)$, $n\geq 0$, satisfy the following requirement:

Assumption 2

The step-size sequence $\{a(n)\}$ satisfies $a(n)>0$, $\forall n$. Further,

\sum_{n}a(n)=\infty,\mbox{ }\sum_{n}\left(\frac{a(n)}{\delta_{n}}\right)^{2}<\infty.
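For instance, one admissible choice of sequences (given here purely for illustration and not prescribed by the algorithm) is

a(n)=\frac{1}{n+1},\qquad\delta_{n}=\frac{1}{(n+1)^{1/4}},

for which $\sum_{n}a(n)=\infty$ and $\sum_{n}\left(\frac{a(n)}{\delta_{n}}\right)^{2}=\sum_{n}(n+1)^{-3/2}<\infty$, while $\delta_{n}\rightarrow 0$ as required of the perturbation sequence.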

Once the $n$th update of the parameter (i.e., $\theta(n)$) is obtained, the perturbed parameter $\theta(n)+\delta_{n}\Delta(n)$ is obtained after sampling $\Delta(n)$ from the multivariate Gaussian distribution as explained previously and thereafter a new trajectory governed by this perturbed parameter is generated with the initial state in each episode sampled according to a given distribution $\nu$. This is because each episode ends in the terminal state $t$ and any fresh episode starts from a non-terminal initial state sampled from the distribution $\nu$.
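Putting the pieces together, a minimal sketch of the resulting SF Reinforce loop is given below. The episodic environment interface (env.reset() returning an initial state drawn from $\nu$, and env.step(s, a) returning a tuple (cost, next_state, done)) and the state-feature map features(s) are hypothetical choices made only for this sketch; softmax_policy and project are the sketches given earlier.

```python
import numpy as np

def sf_reinforce(env, features, theta0, a_min, a_max, num_episodes, rng,
                 a=lambda n: 1.0 / (n + 1),
                 delta=lambda n: 1.0 / (n + 1) ** 0.25):
    """Sketch of the SF Reinforce update (5) with Gaussian perturbations."""
    theta = np.array(theta0, dtype=float)
    d = len(theta)
    for n in range(num_episodes):
        Delta = rng.standard_normal(d)           # Delta(n), i.i.d. N(0,1) components
        theta_pert = theta + delta(n) * Delta    # perturbed policy parameter
        # Run one episode under phi_{theta(n) + delta_n Delta(n)}; accumulate G^n.
        G, s, done = 0.0, env.reset(), False
        while not done:
            probs = softmax_policy(theta_pert, features(s))
            action = rng.choice(len(probs), p=probs)
            cost, s, done = env.step(s, action)
            G += cost
        # Update (5): theta_i(n+1) = Gamma_i(theta_i(n) - a(n) Delta_i(n) G^n / delta_n).
        theta = project(theta - a(n) * Delta * G / delta(n), a_min, a_max)
    return theta
```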

IV Convergence Analysis

We begin by rewriting the recursion (5) as follows:

\theta_{i}(n+1)=\Gamma_{i}\left(\theta_{i}(n)-a(n)\left(E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\,\Big|\,\mathcal{F}_{n}\right]+M^{i}_{n+1}\right)\right), (6)

where $M^{i}_{n+1}=\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}-E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\,\Big|\,\mathcal{F}_{n}\right]$, $n\geq 0$. Here, $\mathcal{F}_{n}\stackrel{\triangle}{=}\sigma(\theta(m),m\leq n,\Delta(m),\chi^{m},m<n)$, $n\geq 1$, is a sequence of increasing sigma fields with $\mathcal{F}_{0}=\sigma(\theta(0))$. Let $M_{n}\stackrel{\triangle}{=}(M^{1}_{n},\ldots,M^{d}_{n})^{T}$, $n\geq 0$.

Lemma 1

$(M_{n},\mathcal{F}_{n})$, $n\geq 0$, is a martingale difference sequence.

Proof:

Notice that

M^{i}_{n}=\Delta_{i}(n-1)\frac{G^{n-1}}{\delta_{n-1}}-E\left[\Delta_{i}(n-1)\frac{G^{n-1}}{\delta_{n-1}}\mid\mathcal{F}_{n-1}\right].

The first term on the RHS above is clearly $\mathcal{F}_{n}$-measurable, while the second term is $\mathcal{F}_{n-1}$-measurable and hence $\mathcal{F}_{n}$-measurable as well. Further, from Assumption 1, each $M_{n}$ is integrable. Finally, it is easy to verify that

E[M^{i}_{n+1}\mid\mathcal{F}_{n}]=0,\mbox{ }\forall i.

The claim follows. ∎

Proposition 1

We have

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{F}_{n}\right]=\sum_{s\in S}\nu(s)\nabla_{i}V_{\theta(n)}(s)+o(\delta_{n})\mbox{ a.s.}
Proof:

Note that

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{F}_{n}\right]=E\left[E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{G}_{n}\right]\mid\mathcal{F}_{n}\right],

where $\mathcal{G}_{n}\stackrel{\triangle}{=}\sigma(\theta(m),\Delta(m),m\leq n,\chi^{m},m<n)$, $n\geq 1$, is a sequence of increasing sigma fields with $\mathcal{G}_{0}=\sigma(\theta(0),\Delta(0))$. It is clear that $\mathcal{F}_{n}\subset\mathcal{G}_{n}$, $\forall n\geq 0$. Now,

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{G}_{n}\right]=\frac{\Delta_{i}(n)}{\delta_{n}}E[G^{n}\mid\mathcal{G}_{n}].

Let $s^{n}_{0}=s$ denote the initial state in the trajectory $\chi^{n}$. Recall that the initial state $s$ is chosen randomly from the distribution $\nu$. Thus,

E[G^{n}\mid\mathcal{G}_{n}]=\sum_{s}\nu(s)E[G^{n}\mid s^{n}_{0}=s,\phi_{\theta(n)+\delta_{n}\Delta(n)}]=\sum_{s}\nu(s)V_{\theta(n)+\delta_{n}\Delta(n)}(s).

Thus, with probability one,

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{G}_{n}\right]=\sum_{s}\nu(s)\left(\Delta_{i}(n)\frac{V_{\theta(n)+\delta_{n}\Delta(n)}(s)}{\delta_{n}}\right).

Hence, it follows almost surely that

E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{F}_{n}\right]=\sum_{s}\nu(s)E\left[\Delta_{i}(n)\frac{V_{\theta(n)+\delta_{n}\Delta(n)}(s)}{\delta_{n}}\mid\mathcal{F}_{n}\right].

Using a Taylor’s expansion of $V_{\theta(n)+\delta_{n}\Delta(n)}(s)$ around $\theta(n)$ gives us

V_{\theta(n)+\delta_{n}\Delta(n)}(s_{n})=V_{\theta(n)}(s_{n})+\delta_{n}\Delta(n)^{T}\nabla V_{\theta(n)}(s_{n})+\frac{\delta_{n}^{2}}{2}\Delta(n)^{T}\nabla^{2}V_{\theta(n)}(s_{n})\Delta(n)+o(\delta_{n}^{2}).

Now recall that $\Delta(n)=(\Delta_{i}(n),i=1,\ldots,d)^{T}$. Thus,

\Delta(n)\frac{V_{\theta(n)+\delta_{n}\Delta(n)}(s_{n})}{\delta_{n}}=\frac{1}{\delta_{n}}\Delta(n)V_{\theta(n)}(s_{n})+\Delta(n)\Delta(n)^{T}\nabla V_{\theta(n)}(s_{n})+\frac{\delta_{n}}{2}\Delta(n)\Delta(n)^{T}\nabla^{2}V_{\theta(n)}(s_{n})\Delta(n)+o(\delta_{n}).

Now observe from the properties of $\Delta_{i}(n)$, $\forall i,n$, that
(i) $E[\Delta(n)]=0$ (the zero vector), $\forall n$, since $\Delta_{i}(n)\sim N(0,1)$, $\forall i,n$.
(ii) $E[\Delta(n)\Delta(n)^{T}]=I$ (the identity matrix), $\forall n$.
(iii) $E\left[\sum_{i,j,k=1}^{d}\Delta_{i}(n)\Delta_{j}(n)\Delta_{k}(n)\right]=0$.
Property (iii) follows from the facts that (a) $E[\Delta_{i}(n)\Delta_{j}(n)\Delta_{k}(n)]=0$, $\forall i\not=j\not=k$, (b) $E[\Delta_{i}(n)\Delta_{j}^{2}(n)]=0$, $\forall i\not=j$ (this pertains to the case where $i\not=j$ but $j=k$ above) and (c) $E[\Delta_{i}^{3}(n)]=0$ (for the case when $i=j=k$ above). These properties follow from the independence of the random variables $\Delta_{i}(n)$, $i=1,\ldots,d$, $n\geq 0$, as well as the fact that they are all distributed $N(0,1)$. The claim now follows from (i)-(iii) above. ∎
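As a quick numerical sanity check (purely illustrative and not part of the proof), properties (i)-(iii) can be verified by Monte Carlo sampling of standard Gaussian vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_samples = 4, 200_000
Delta = rng.standard_normal((num_samples, d))   # rows are i.i.d. N(0, I_d) vectors

print(Delta.mean(axis=0))                       # (i): close to the zero vector
print(Delta.T @ Delta / num_samples)            # (ii): close to the identity matrix
third_moments = np.einsum('ni,nj,nk->ijk', Delta, Delta, Delta) / num_samples
print(third_moments.sum())                      # (iii): close to zero
```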

In the light of Proposition 1, we can rewrite (5) as follows:

\theta(n+1)=\Gamma\left(\theta(n)-a(n)\left(\sum_{s}\nu(s)\nabla V_{\theta(n)}(s)+M_{n+1}+\beta(n)\right)\right), (7)

where $\beta_{i}(n)=E\left[\Delta_{i}(n)\frac{G^{n}}{\delta_{n}}\mid\mathcal{F}_{n}\right]-\sum_{s}\nu(s)\nabla_{i}V_{\theta(n)}(s)$ and $\beta(n)=(\beta_{1}(n),\ldots,\beta_{d}(n))^{T}$. From Proposition 1, it follows that $\beta(n)=o(\delta_{n})$.

Lemma 2

The function $\nabla V_{\theta}(s)$ is Lipschitz continuous in $\theta$. Further, $\exists$ a constant $K_{1}>0$ such that $\parallel\nabla V_{\theta}(s)\parallel\leq K_{1}(1+\parallel\theta\parallel)$.

Proof:

It can be seen from (4) that $V_{\theta}(s)$ is continuously differentiable in $\theta$. It can also be shown as in Theorem 3 of [12] that $\nabla^{2}V_{\theta}(s)$ exists and is continuous. Since $\theta$ takes values in $C$, a compact set, it follows that $\nabla^{2}V_{\theta}(s)$ is bounded and thus $\nabla V_{\theta}(s)$ is Lipschitz continuous.

Finally, let $L_{1}^{s}>0$ denote the Lipschitz constant for the function $\nabla V_{\theta}(s)$. Then, for a given $\theta_{0}\in C$,

\parallel\nabla V_{\theta}(s)\parallel-\parallel\nabla V_{\theta_{0}}(s)\parallel\leq\parallel\nabla V_{\theta}(s)-\nabla V_{\theta_{0}}(s)\parallel\leq L_{1}^{s}\parallel\theta-\theta_{0}\parallel\leq L_{1}^{s}\parallel\theta\parallel+L_{1}^{s}\parallel\theta_{0}\parallel.

Thus, $\parallel\nabla V_{\theta}(s)\parallel\leq\parallel\nabla V_{\theta_{0}}(s)\parallel+L^{s}_{1}\parallel\theta_{0}\parallel+L^{s}_{1}\parallel\theta\parallel$. Let $K_{s}\stackrel{\triangle}{=}\|\nabla V_{\theta_{0}}(s)\|+L^{s}_{1}\|\theta_{0}\|$ and $K_{1}\stackrel{\triangle}{=}\max(K_{s},L^{s}_{1},s\in S)$. Thus, $\parallel\nabla V_{\theta}(s)\parallel\leq K_{1}(1+\parallel\theta\parallel)$. Note here that since $|S|<\infty$, $K_{1}<\infty$ as well. The claim follows. ∎

Lemma 3

The sequence $(M_{n},\mathcal{F}_{n})$, $n\geq 0$, satisfies $E[\|M_{n+1}\|^{2}\mid\mathcal{F}_{n}]\leq\frac{\hat{L}}{\delta_{n}^{2}}$, for some constant $\hat{L}>0$.

Proof:

Note that

\|M_{n+1}\|^{2}=\sum_{i=1}^{d}(M^{i}_{n+1})^{2}=\sum_{i=1}^{d}\Big(\Delta_{i}^{2}(n)\frac{(G^{n})^{2}}{\delta_{n}^{2}}+\frac{1}{\delta_{n}^{2}}E\left[\Delta_{i}(n)G^{n}\mid\mathcal{F}_{n}\right]^{2}-2\Delta_{i}(n)\frac{G^{n}}{\delta_{n}^{2}}E\left[\Delta_{i}(n)G^{n}\mid\mathcal{F}_{n}\right]\Big).

Thus,

E[\|M_{n+1}\|^{2}\mid\mathcal{F}_{n}]=\frac{1}{\delta_{n}^{2}}\sum_{i=1}^{d}\Big(E[\Delta_{i}^{2}(n)(G^{n})^{2}\mid\mathcal{F}_{n}]-E^{2}[\Delta_{i}(n)G^{n}\mid\mathcal{F}_{n}]\Big).

The claim now follows from Assumption 1 and the fact that all single-stage costs are bounded (cf. pp.174, Chapter 3 of [1]).∎

Define now a sequence $Z_{n}$, $n\geq 0$, according to

Z_{n}=\sum_{m=0}^{n-1}a(m)M_{m+1},

$n\geq 1$, with $Z_{0}=0$.

Lemma 4

$(Z_{n},\mathcal{F}_{n})$, $n\geq 0$, is an almost surely convergent martingale sequence.

Proof:

It is easy to see that $Z_{n}$ is $\mathcal{F}_{n}$-measurable $\forall n$. Further, it is integrable for each $n$ and moreover $E[Z_{n+1}\mid\mathcal{F}_{n}]=Z_{n}$ almost surely since $(M_{n+1},\mathcal{F}_{n})$, $n\geq 0$, is a martingale difference sequence by Lemma 1. It is also square integrable from Lemma 3. The quadratic variation process of this martingale will be convergent almost surely if

\sum_{n=0}^{\infty}E[\|Z_{n+1}-Z_{n}\|^{2}\mid\mathcal{F}_{n}]<\infty\mbox{ a.s.} (8)

Note that

E[\|Z_{n+1}-Z_{n}\|^{2}\mid\mathcal{F}_{n}]=a(n)^{2}E[\|M_{n+1}\|^{2}\mid\mathcal{F}_{n}].

Thus,

\sum_{n=0}^{\infty}E[\|Z_{n+1}-Z_{n}\|^{2}\mid\mathcal{F}_{n}]=\sum_{n=0}^{\infty}a(n)^{2}E[\|M_{n+1}\|^{2}\mid\mathcal{F}_{n}]\leq\hat{L}\sum_{n=0}^{\infty}\left(\frac{a(n)}{\delta_{n}}\right)^{2},

by Lemma 3. (8) now follows as a consequence of Assumption 2. Now $(Z_{n},\mathcal{F}_{n})$, $n\geq 0$, can be seen to be convergent from the martingale convergence theorem for square integrable martingales [8]. ∎

Consider now the following ODE:

\dot{\theta}(t)=\bar{\Gamma}\left(-\sum_{s}\nu(s)\nabla V_{\theta}(s)\right), (9)

where $\bar{\Gamma}:{\cal C}(C)\rightarrow{\cal C}({\cal R}^{d})$ is defined according to

\bar{\Gamma}(v(x))=\lim_{\eta\rightarrow 0}\left(\frac{\Gamma(x+\eta v(x))-x}{\eta}\right). (10)

Let $H\stackrel{\triangle}{=}\{\theta\mid\bar{\Gamma}(-\sum_{s}\nu(s)\nabla V_{\theta}(s))=0\}$ denote the set of all equilibria of (9). By Lemma 11.1 of [9], the only possible $\omega$-limit sets that can occur as invariant sets for the ODE (9) are subsets of $H$. Let $\bar{H}\subset H$ be the set of all internally chain recurrent points of the ODE (9). Our main result below is based on Theorem 5.3.1 of [18] for projected stochastic approximation algorithms. Before we proceed further, we recall that result below.

Let $C\subset{\cal R}^{d}$ be a compact and convex set as before and $\Gamma:{\cal R}^{d}\rightarrow C$ denote the projection operator that projects any $x=(x_{1},\ldots,x_{d})^{T}\in{\cal R}^{d}$ to its nearest point in $C$.

Consider now the following $d$-dimensional stochastic recursion

X_{n+1}=\Gamma(X_{n}+a(n)(h(X_{n})+\xi_{n}+\beta_{n})), (11)

under the assumptions listed below. Also, consider the following ODE associated with (11):

\dot{X}(t)=\bar{\Gamma}(h(X(t))). (12)

Let ${\cal C}(C)$ denote the space of all continuous functions from $C$ to ${\cal R}^{d}$. The operator $\bar{\Gamma}:{\cal C}(C)\rightarrow{\cal C}({\cal R}^{d})$ is defined according to

\bar{\Gamma}(v(x))=\lim_{\eta\rightarrow 0}\left(\frac{\Gamma(x+\eta v(x))-x}{\eta}\right), (13)

for any continuous $v:C\rightarrow{\cal R}^{d}$. The limit in (13) exists and is unique since $C$ is a convex set. In case this limit is not unique, one may consider the set of all limit points of (13). Note also that from its definition, $\bar{\Gamma}(v(x))=v(x)$ if $x\in C^{o}$ (the interior of $C$). This is because for such an $x$, one can find $\eta>0$ sufficiently small so that $x+\eta v(x)\in C^{o}$ as well and hence $\Gamma(x+\eta v(x))=x+\eta v(x)$. On the other hand, if $x\in\partial C$ (the boundary of $C$) is such that $x+\eta v(x)\not\in C$ for any small $\eta>0$, then $\bar{\Gamma}(v(x))$ is the projection of $v(x)$ to the tangent space of $\partial C$ at $x$.
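As a small illustration, for the rectangular $C$ used in our setting, $\bar{\Gamma}(v(x))$ in (13) can be approximated with a small but nonzero $\eta$. This is a finite-$\eta$ approximation made for illustration; project is the componentwise clipping sketch from Section III.

```python
def gamma_bar_approx(v_x, x, a_min, a_max, eta=1e-8):
    """Finite-eta approximation of bar{Gamma}(v(x)) = lim_{eta->0} (Gamma(x + eta v(x)) - x)/eta.

    For x in the interior of C this returns (approximately) v(x) itself; on the
    boundary, components of v(x) that point outside C are zeroed out.
    """
    return (project(x + eta * v_x, a_min, a_max) - x) / eta
```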

Consider now the assumptions listed below.

  • (B1)

    The function $h:{\cal R}^{d}\rightarrow{\cal R}^{d}$ is continuous.

  • (B2)

    The step-sizes $a(n)$, $n\geq 0$, satisfy

    a(n)>0\;\forall n,\mbox{ }\sum_{n}a(n)=\infty,\mbox{ }a(n)\rightarrow 0\mbox{ as }n\rightarrow\infty.
  • (B3)

    The sequence $\beta_{n}$, $n\geq 0$, is a bounded random sequence with $\beta_{n}\rightarrow 0$ almost surely as $n\rightarrow\infty$.

  • (B4)

    There exists $T>0$ such that $\forall\epsilon>0$,

    \lim_{n\rightarrow\infty}P\left(\sup_{j\geq n}\max_{t\leq T}\left|\sum_{i=m(jT)}^{m(jT+t)-1}a(i)\xi_{i}\right|\geq\epsilon\right)=0.
  • (B5)

    The ODE (12) has a compact subset $K$ of ${\cal R}^{d}$ as its set of asymptotically stable equilibrium points.

Let $t(n)$, $n\geq 0$, be a sequence of nonnegative real numbers defined according to $t(0)=0$ and, for $n\geq 1$, $t(n)=\sum_{j=0}^{n-1}a(j)$. By Assumption (B2), $t(n)\rightarrow\infty$ as $n\rightarrow\infty$. Let $m(t)=\max\{n\mid t(n)\leq t\}$. Thus, $m(t)\rightarrow\infty$ as $t\rightarrow\infty$. Assumptions (B1)-(B3) correspond to A5.1.3-A5.1.5 of [18] while (B4)-(B5) correspond to A5.3.1-A5.3.2 there.

[18, Theorem 5.3.1 (pp. 191-196)] essentially says the following:

Theorem 1 (Kushner and Clark Theorem)

Under Assumptions (B1)–(B5), almost surely, $X_{n}\rightarrow K$ as $n\rightarrow\infty$.

Theorem 2

The iterates $\theta(n)$, $n\geq 0$, governed by (5) converge almost surely to $\bar{H}$.

Proof:

In view of the foregoing, we rewrite (5) according to

\theta_{i}(n+1)=\Gamma_{i}\Big(\theta_{i}(n)-a(n)\Big(\sum_{s}\nu(s)\nabla_{i}V_{\theta(n)}(s)+\beta_{i}(n)+M^{i}_{n+1}\Big)\Big), (14)

where $\beta_{i}(n)$ is as in (7). We shall proceed by verifying Assumptions (B1)-(B5) and subsequently appeal to Theorem 5.3.1 of [18] (i.e., Theorem 1 above) to claim convergence of the scheme. Note that Lemma 2 ensures Lipschitz continuity of $\nabla V_{\theta}(s)$, implying (B1). Next, from Assumption 2, $\sum_{n}(a(n)/\delta_{n})^{2}<\infty$, so that $a(n)/\delta_{n}\rightarrow 0$; since $\delta_{n}\rightarrow 0$, it follows that $a(n)\rightarrow 0$ as $n\rightarrow\infty$. Thus, Assumption (B2) holds as well. Now from Lemma 2, it follows that $\sum_{s}\nu(s)\nabla V_{\theta}(s)$ is uniformly bounded since $\theta\in C$, a compact set. Assumption (B3) is now verified from Proposition 1, since $\beta(n)=o(\delta_{n})\rightarrow 0$. Assumption (B4) follows as a consequence of Lemma 4, since the almost sure convergence of the martingale $(Z_{n})$ ensures that its tail sums vanish. Finally, for Assumption (B5), note that for the ODE (9), $F(\theta)=\sum_{s}\nu(s)V_{\theta}(s)$ serves as an associated Lyapunov function and in fact

\nabla F(\theta)^{T}\bar{\Gamma}\Big(-\sum_{s}\nu(s)\nabla V_{\theta}(s)\Big)=\Big(\sum_{s}\nu(s)\nabla_{\theta}V_{\theta}(s)\Big)^{T}\bar{\Gamma}\Big(-\sum_{s}\nu(s)\nabla V_{\theta}(s)\Big)\leq 0.

For $\theta\in C^{o}$ (the interior of $C$), it is easy to see that $\bar{\Gamma}(-\sum_{s}\nu(s)\nabla V_{\theta}(s))=-\sum_{s}\nu(s)\nabla V_{\theta}(s)$, and

\nabla F(\theta)^{T}\bar{\Gamma}\Big(-\sum_{s}\nu(s)\nabla V_{\theta}(s)\Big)\;\begin{cases}<0&\mbox{ if }\theta\in H^{c}\cap C,\\ =0&\mbox{ otherwise.}\end{cases}

For $\theta\in\partial C$ (the boundary of $C$), there can additionally be spurious attractors, see [19], that are also contained in $H$. The claim now follows from Theorem 5.3.1 of [18]. ∎

V Conclusions

We presented a version of the Reinforce algorithm for the setting of episodic tasks that incorporates a one-simulation SF algorithm and proved its convergence using the ODE approach. In a longer version of this paper, we shall consider also the average cost case and present an algorithm based on [4] that does not rely on regeneration epochs for performing updates. Moreover, we shall show detailed empirical results comparing our proposed algorithms with the policy gradient scheme and other algorithms.

References

  • [1] D. P. Bertsekas. Dynamic Programming and Optimal Control, Vol.II. Athena Scientific, 2012.
  • [2] S. Bhatnagar. Adaptive multivariate three-timescale stochastic approximation algorithms for simulation based optimization. ACM Transactions on Modeling and Computer Simulation, 15(1):74–107, 2005.
  • [3] S. Bhatnagar. Adaptive Newton-based smoothed functional algorithms for simulation optimization. ACM Transactions on Modeling and Computer Simulation, 18(1):2:1–2:35, 2007.
  • [4] S. Bhatnagar and V.S. Borkar. Multiscale chaotic SPSA and smoothed functional algorithms for simulation optimization. Simulation, 79(10):568–580, 2003.
  • [5] S. Bhatnagar, H. L. Prasad, and L. A. Prashanth. Stochastic Recursive Algorithms for Optimization: Simultaneous Perturbation Methods (Lecture Notes in Control and Information Sciences), volume 434. Springer, 2013.
  • [6] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, and M. Lee. Incremental natural actor-critic algorithms. Advances in Neural Information Processing Systems, 20, 2007.
  • [7] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
  • [8] V. S. Borkar. Probability Theory: An Advanced Course. Springer, New York, 1995.
  • [9] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, 2nd Edition. Cambridge University Press, 2022.
  • [10] X.-R. Cao. Stochastic Learning and Optimization: A Sensitivity-Based Approach. Springer, 2007.
  • [11] E. K. P. Chong and P. J. Ramadge. Optimization of queues using an infinitesimal perturbation analysis-based stochastic algorithm with general update times. SIAM J. Cont. and Optim., 31(3):698–732, 1993.
  • [12] T. Furmston, G. Lever, and D. Barber. Approximate Newton methods for approximate policy search in Markov decision processes. Journal of Machine Learning Research, 17:1–51, 2016.
  • [13] Y. C. Ho and X. R. Cao. Perturbation Analysis of Discrete Event Dynamical Systems. Kluwer, Boston, 1991.
  • [14] V. Ya Katkovnik and Yu Kulchitsky. Convergence of a class of random search algorithms. Automation and Remote Control, 8:1321–1326, 1972.
  • [15] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23:462–466, 1952.
  • [16] V.R. Konda and V.S. Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
  • [17] V.R. Konda and J.N. Tsitsiklis. On actor-critic algorithms. SIAM Journal on Control and Optimization, 42(4):1143–1166, 2003.
  • [18] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Verlag, New York, 1978.
  • [19] H. J. Kushner and G. G. Yin. Stochastic Approximation Algorithms and Applications. Springer Verlag, New York, 1997.
  • [20] P. Marbach and J.N. Tsitsiklis. Simulation-based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2):191–209, 2001.
  • [21] L. A. Prashanth, S. Bhatnagar, M.C. Fu, and S.I. Marcus. Adaptive system optimization using random directions stochastic approximation. IEEE Transactions on Automatic Control, 62(5):2223–2238, 2017.
  • [22] R. Y. Rubinstein. Simulation and the Monte Carlo Method. Wiley, New York, 1981.
  • [23] J.C. Spall. Multivariate stochastic approximation using a simultaneous perturbation gradient approximation. IEEE Transactions on Automatic Control, 37(3):332–341, 1992.
  • [24] J.C. Spall. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica, 33(1):109–112, 1997.
  • [25] R. S. Sutton and A. W. Barto. Reinforcement Learning, 2nd Edition. MIT Press, 2018.
  • [26] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, volume 99, pages 1057–1063, 1999.
  • [27] R.J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Reinforcement learning, pages 5–32, 1992.