Gradient Temporal Difference with Momentum: Stability and Convergence

Rohan Deb, Shalabh Bhatnagar
Corresponding Author

Abstract

Gradient temporal difference (Gradient TD) algorithms are a popular class of stochastic approximation (SA) algorithms used for policy evaluation in reinforcement learning. Here, we consider Gradient TD algorithms with an additional heavy ball momentum term and provide choice of step size and momentum parameter that ensures almost sure convergence of these algorithms asymptotically. In doing so, we decompose the heavy ball Gradient TD iterates into three separate iterates with different step sizes. We first analyze these iterates under one-timescale SA setting using results from current literature. However, the one-timescale case is restrictive and a more general analysis can be provided by looking at a three-timescale decomposition of the iterates. In the process we provide the first conditions for stability and convergence of general three-timescale SA. We then prove that the heavy ball Gradient TD algorithm is convergent using our three-timescale SA analysis. Finally, we evaluate these algorithms on standard RL problems and report improvement in performance over the vanilla algorithms.

1 Introduction

In reinforcement learning (RL), the goal of the learner or the agent is to maximize its long term accumulated reward by interacting with the environment. An important task in most RL algorithms is policy evaluation, which predicts the expected accumulated reward an agent would receive from a state (called the value function) if it follows the given policy. In model-free learning, the agent does not have access to the underlying dynamics of the environment and has to learn the value function from samples of the form (state, action, reward, next-state). Two very popular algorithms in the model-free setting are Monte-Carlo (MC) and temporal difference (TD) learning (see Sutton and Barto (2018), Sutton (1988)). It is well known that TD learning can diverge in the off-policy setting (see Baird (1995)). A class of algorithms called gradient temporal difference (Gradient TD) was introduced in (Sutton, Maei, and Szepesvári 2009) and (Sutton et al. 2009); these algorithms are convergent even in the off-policy setting. They fall under a larger class of algorithms called linear stochastic approximation (SA) algorithms.

A lot of literature is dedicated to studying the asymptotic behaviour of SA algorithms starting from the work of (Robbins and Monro 1951). In recent times, the ODE method to analyze asymptotic behaviour of SA (Ljung 1977; Kushner and Clark 1978; Borkar 2008b; Borkar and Meyn 2000) has become quite popular in the RL community. The Gradient TD methods were shown to be convergent using the ODE approach. A generic one-timescale (One-TS) SA iterate has the following form:

x_{n+1}=x_{n}+a(n)\left(h(x_{n})+M_{n+1}\right),

(1)

where $x_{n}\in\mathbb{R}^{d_{1}}$ are the iterates. The function $h:\mathbb{R}^{d_{1}}\rightarrow\mathbb{R}^{d_{1}}$ is assumed to be Lipschitz continuous, $M_{n+1}$ is a martingale difference noise sequence and $a(n)$ is the step-size at time-step $n$ . Under some mild assumptions, the iterate given by (1) converges (see Borkar 2008b; Borkar and Meyn 2000). When $h$ is a linear map of the form $b-Ax_{n}$ , the matrix $A$ is often called the driving matrix. The three Gradient TD algorithms, GTD (Sutton, Maei, and Szepesvári 2009), GTD2 and TDC (Sutton et al. 2009), consist of two iterates of the following form:

x_{n+1}=x_{n}+a(n)(h(x_{n},y_{n})+M_{n+1}^{(1)}),

(2)

y_{n+1}=y_{n}+b(n)(g(x_{n},y_{n})+M_{n+1}^{(2)}),

(3)

where $x_{n}\in\mathbb{R}^{d_{1}}$ , $y_{n}\in\mathbb{R}^{d_{2}}$ . See section 2 for the exact form of the iterates. The two iterates still form a One-TS SA scheme if $\lim_{n\rightarrow\infty}\frac{b(n)}{a(n)}=c$ , where $c>0$ is a constant, and a two-timescale (two-TS) scheme if $\lim_{n\rightarrow\infty}\frac{b(n)}{a(n)}=0$ .

Separately, adding a momentum term to accelerate the convergence of iterates is a popular technique in stochastic gradient descent (SGD). The two most popular schemes are Polyak’s heavy ball method (Polyak 1964) and Nesterov’s accelerated gradient method (Nesterov 1983). A lot of literature is dedicated to studying momentum with SGD. Some recent works include (Ghadimi, Feyzmahdavian, and Johansson 2014; Loizou and Richtárik 2020; Gitman et al. 2019; Ma and Yarats 2019; Assran and Rabbat 2020). In contrast, very few works study the effect of momentum in the SA setting, which is the focus of the current work. A recent work by (Mou et al. 2020) briefly studies SA with momentum and shows an improvement in the mixing rate. However, the setting considered is restricted to linear SA and the driving matrix is assumed to be symmetric. Further, the iterates involve an additional Polyak-Ruppert averaging (Polyak 1990). Here, in contrast, we analyze the asymptotic behaviour of the algorithm and make none of the above assumptions. A somewhat distant paper is (Devraj, Bušíć, and Meyn 2019), which introduces matrix momentum in SA; this is not equivalent to heavy ball momentum.

A very recent work by (Avrachenkov, Patil, and Thoppe 2020) studied One-TS SA with heavy ball momentum in the univariate case (i.e., $d_{1}=1$ in iterate (1)) in the context of web-page crawling. The iterates took the following form:

x_{n+1}=x_{n}+a(n)\left(h(x_{n})+M_{n+1}\right)+\eta_{n}(x_{n}-x_{n-1}).

(4)

The momentum parameter $\eta_{n}$ was chosen to decompose the iterate into two recursions of the form given by (2) and (3). We use such a decomposition for Gradient TD methods with momentum. This leads to three separate iterates with three step-sizes. We analyze these three iterates and provide stability (iterates remain bounded throughout) and almost sure (a.s.) convergence guarantees.

1.1 Our Contribution

•

We first consider the One-TS decomposition of the Gradient TD with momentum iterates and show that the driving matrix in this case is Hurwitz (all eigenvalues have negative real parts). Thereafter we use the theory of One-TS SA to show that the iterates are stable and converge to the same TD solution.
•

Next, we consider the Three-TS decomposition. We provide the first stability and convergence conditions for general Three-TS recursions. We then show that the iterates under consideration satisfy these conditions.
•

Finally, we evaluate these algorithms for different choice of step-size and momentum parameters on standard RL problems and report an improvement in performance over their vanilla counterparts.

2 Preliminaries

In the standard RL setup, an agent interacts with the environment which is a Markov Decision Process (MDP). At each discrete time step $t$ , the agent is in state $s_{t}\in\mathcal{S},$ takes an action $a_{t}\in\mathcal{A},$ receives a reward $r_{t+1}\equiv r(s_{t},a_{t},s_{t+1})\in\mathbb{R}$ and moves to another state $s_{t+1}\in\mathcal{S}$ . Here $\mathcal{S}$ and $\mathcal{A}$ are finite sets of possible states and actions respectively. The transitions are governed by a kernel $\mathbb{P}$ . A policy $\pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ is a mapping that defines the probability of picking an action in a state. We let $P^{\pi}(s^{\prime}|s)$ be the transition probability matrix induced by $\pi$ . Also, $\{d^{\pi}(s)\}_{s\in\mathcal{S}}$ represents the steady-state distribution for the Markov chain induced by $\pi$ and the matrix $D$ is a diagonal matrix of dimension $n\times n$ , where $n=|\mathcal{S}|$ , with the entries $d^{\pi}(s)$ on its diagonal. The state-value function associated with a policy $\pi$ for state $s$ is

V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}|s_{0}=s\right],

where $\gamma$ $\in[0,1)$ is the discount factor.

In the linear architecture setting, policy evaluation deals with estimating $V^{\pi}(s)$ through a linear model $V_{\theta}(s)=\theta^{T}\phi(s)$ , where $\phi(s)\equiv\phi_{s}$ is a feature associated with the state $s$ and $\theta$ is the parameter vector. We define the TD-error as $\delta_{t}=r_{t+1}+\gamma\theta_{t}^{T}\phi_{t+1}-\theta_{t}^{T}\phi_{t}$ and $\Phi$ as an $n\times d$ matrix where the $s^{th}$ row is $\phi(s)^{T}$ . In the i.i.d setting it is assumed that the tuple $(\phi_{t},\phi_{t}^{\prime}$ ) (where $\phi_{t+1}\equiv\phi_{t}^{\prime}$ ) is drawn independently from the stationary distribution of the Markov chain induced by $\pi$ . Let $\bar{A}=\mathbb{E}[\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}]$ and $\bar{b}=\mathbb{E}[r_{t+1}\phi_{t}]$ , where the expectations are w.r.t. the stationary distribution of the induced chain. The matrix $\bar{A}$ is negative definite (see Maei (2011); Tsitsiklis and Van Roy (1997)). In the off-policy case, the importance weight is given by $\rho_{t}=\frac{\pi(a_{t}|s_{t})}{\mu(a_{t}|s_{t})}$ , where $\pi$ and $\mu$ are the target and behaviour policies respectively. Introduced in (Sutton, Maei, and Szepesvári 2009), Gradient TD are a class of TD algorithms that are convergent even in the off-policy setting. Next, we present the iterates associated with the algorithms GTD (Sutton, Maei, and Szepesvári 2009), GTD2, TDC (Sutton et al. 2009).

•

GTD:

	$\displaystyle\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t},$		(5)
	$\displaystyle u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-u_{t}).$		(6)

•

GTD2:

	$\displaystyle\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t},$		(7)
	$\displaystyle u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}.$		(8)

•

TDC:

	$\displaystyle\theta_{t+1}=\theta_{t}+\alpha_{t}\delta_{t}\phi_{t}-\alpha_{t}\gamma\phi_{t}^{\prime}(\phi_{t}^{T}u_{t}),$		(9)
	$\displaystyle u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}.$		(10)

The objective function for GTD is the Norm of the Expected TD Update, defined as $NEU(\theta)=\mathbb{E}[\delta\phi]^{T}\mathbb{E}[\delta\phi]$ . The GTD algorithm is derived by expressing the gradient direction as $-\frac{1}{2}\nabla NEU(\theta)=\mathbb{E}\left[(\phi-\gamma\phi^{\prime})\phi^{T}\right]\mathbb{E}[\delta(\theta)\phi]$ . Here $\phi^{\prime}\equiv\phi(s^{\prime})$ . If both the expectations are sampled together, then the term would be biased by their correlation. Instead, an estimate of the second expectation is maintained as a long-term quasi-stationary estimate $u_{t}$ (see (6)) and the first expectation is sampled (see (5)). For GTD2 and TDC, a similar approach is used on the objective function Mean Square Projected Bellman Error, defined as $MSPBE(\theta)=||V_{\theta}-\Pi T^{\pi}V_{\theta}||_{D}^{2}$ . Here, $\Pi$ is the projection operator that projects vectors to the subspace $\{\Phi\theta|\theta\in\mathbb{R}^{d}\}$ and $T^{\pi}$ is the Bellman operator defined as $T^{\pi}V=R^{\pi}+\gamma P^{\pi}V$ . As originally presented, GTD and GTD2 are one-timescale algorithms ( $\frac{\alpha_{t}}{\beta_{t}}$ is constant) while TDC is a two-timescale algorithm ( $\frac{\alpha_{t}}{\beta_{t}}\rightarrow 0$ ). It was shown in all three cases that $\theta_{t}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ .
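The update rules (5)-(10) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`dot`, `td_error`, `gtd_step`, `gtd2_step`, `tdc_step`) are hypothetical, features are plain Python lists, and importance weights are omitted (on-policy sampling is assumed).

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def td_error(theta, phi, phi_next, r, gamma):
    """delta_t = r_{t+1} + gamma * theta^T phi' - theta^T phi."""
    return r + gamma * dot(theta, phi_next) - dot(theta, phi)


def _theta_gtd(theta, u, phi, phi_next, gamma, alpha):
    # Shared theta-update of GTD and GTD2, equations (5) and (7).
    s = dot(phi, u)
    return [th + alpha * (p - gamma * q) * s
            for th, p, q in zip(theta, phi, phi_next)]


def gtd_step(theta, u, phi, phi_next, r, gamma, alpha, beta):
    delta = td_error(theta, phi, phi_next, r, gamma)
    theta_new = _theta_gtd(theta, u, phi, phi_next, gamma, alpha)
    # Equation (6): u tracks E[delta * phi].
    u_new = [ui + beta * (delta * p - ui) for ui, p in zip(u, phi)]
    return theta_new, u_new


def gtd2_step(theta, u, phi, phi_next, r, gamma, alpha, beta):
    delta = td_error(theta, phi, phi_next, r, gamma)
    theta_new = _theta_gtd(theta, u, phi, phi_next, gamma, alpha)
    # Equation (8).
    c = delta - dot(phi, u)
    u_new = [ui + beta * c * p for ui, p in zip(u, phi)]
    return theta_new, u_new


def tdc_step(theta, u, phi, phi_next, r, gamma, alpha, beta):
    delta = td_error(theta, phi, phi_next, r, gamma)
    s = dot(phi, u)
    # Equation (9): TD step plus a gradient correction term.
    theta_new = [th + alpha * delta * p - alpha * gamma * q * s
                 for th, p, q in zip(theta, phi, phi_next)]
    # Equation (10), identical to the GTD2 update of u.
    u_new = [ui + beta * (delta - s) * p for ui, p in zip(u, phi)]
    return theta_new, u_new
```

Each function consumes one sample $(\phi_{t},r_{t+1},\phi_{t}^{\prime})$ and returns the updated pair $(\theta_{t+1},u_{t+1})$ ; both updates use the old values of $\theta_{t}$ and $u_{t}$ .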

3 Gradient TD with Momentum

Although Gradient TD starts with a gradient descent based approach, it ends up with two-TS SA recursions. Momentum methods are known to accelerate the convergence of SGD iterates. Motivated by this, we examine momentum in the SA setting, and ask whether the SA recursions for Gradient TD with momentum still converge to the same TD solution. We study the heavy ball extension of the three Gradient TD algorithms, where we keep an accumulation of the previous gradient values in $\zeta_{t}$ . Then, at time step $t+1$ , the new gradient value multiplied by the step size is added to the current accumulation vector $\zeta_{t}$ multiplied by the momentum parameter $\eta_{t}$ , as below:

\zeta_{t+1}=\eta_{t}\zeta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}.

The parameter $\theta$ is then updated along $\zeta_{t+1}$ , i.e., $\theta_{t+1}=\theta_{t}+\zeta_{t+1}.$ (The term $\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}$ is already the sampled descent step of (5), so no further sign change is needed.) Since $u_{t+1}$ is computed as a long-term estimate of $\mathbb{E}[\delta(\theta)\phi]$ , its update rule remains the same. The momentum parameter $\eta_{t}$ is usually set to a constant in the stochastic gradient setting. An exception to this can however be found in (Gitman et al. 2019; Gadat, Panloup, and Saadane 2016), where $\eta_{t}\rightarrow 1$ . Here, we consider the latter case. Substituting $\zeta_{t+1}$ into the iteration of $\theta_{t+1}$ and noting that $\zeta_{t}=\theta_{t}-\theta_{t-1}$ , the iterates for GTD with Momentum (GTD-M) can be written as:

\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}+\eta_{t}(\theta_{t}-\theta_{t-1}),

(11)

u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-u_{t}).

(12)

Similarly the iterates for GTD2-M are given by:

\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}+\eta_{t}(\theta_{t}-\theta_{t-1}),

(13)

u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}.

(14)

Finally, the iterates for TDC-M are given by:

\begin{split}\theta_{t+1}=\theta_{t}+\alpha_{t}(\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}(\phi_{t}^{T}u_{t}))+\eta_{t}(\theta_{t}-\theta_{t-1}),\end{split}

(15)

u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}.

(16)

We choose the momentum parameter $\eta_{t}$ as in (Avrachenkov, Patil, and Thoppe 2020) as follows: $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}$ , where $\{\varrho_{t}\}$ is a positive sequence s.t. $\varrho_{t}\rightarrow 0$ as $t\rightarrow\infty$ and $w\in\mathbb{R}$ is a constant. Note that $\eta_{t}\rightarrow 1$ as $t\rightarrow\infty$ . We later provide conditions on $\varrho_{t}$ and $w$ to ensure a.s. convergence. As we will see in section 4, the condition on $w$ in the One-TS setting is restrictive. Specifically, it depends on the norm of the driving matrix $\bar{A}$ . This motivates us to look at the Three-TS setting, where the corresponding condition on $w$ is less restrictive. Using the momentum parameter as above,

\begin{split}\theta_{t+1}&=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}+\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}(\theta_{t}-\theta_{t-1})\end{split}

Rearranging the terms and dividing by $\varrho_{t}$ , we get:

\begin{split}\frac{\theta_{t+1}-\theta_{t}}{\varrho_{t}}&=\frac{\theta_{t}-\theta_{t-1}}{\varrho_{t-1}}\\ &+\frac{\alpha_{t}}{\varrho_{t}}\Bigg{(}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-w\left(\frac{\theta_{t}-\theta_{t-1}}{\varrho_{t-1}}\right)\Bigg{)}.\end{split}

We let

\frac{\theta_{t+1}-\theta_{t}}{\varrho_{t}}=v_{t+1},\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}}\mbox{ and }\varepsilon_{t}=v_{t+1}-v_{t}.

Then, the GTD-M iterates in (11) and (12) can be re-written with the following three iterates:

	$\displaystyle v_{t+1}=v_{t}+\xi_{t}\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\right)$		(17)
	$\displaystyle u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-u_{t})$		(18)
	$\displaystyle\theta_{t+1}=\theta_{t}+\varrho_{t}(v_{t}+\varepsilon_{t})$		(19)

A similar decomposition can be done for the GTD2-M and TDC-M iterates.
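The equivalence between the heavy ball form (11)-(12) and the three-iterate form (17)-(19) can be checked numerically. The sketch below is only an illustration under assumptions that are not in the paper: synthetic i.i.d. features and rewards drawn uniformly from $[-1,1]$ , polynomial step-sizes with hypothetical exponents, $d=2$ , and hypothetical function names. It runs both recursions on the same samples, starting from $v_{1}=(\theta_{1}-\theta_{0})/\varrho_{0}$ .

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gtd_direction(phi, phi_next, u, gamma):
    # (phi - gamma * phi') * (phi^T u), the sampled GTD direction.
    s = dot(phi, u)
    return [(p - gamma * q) * s for p, q in zip(phi, phi_next)]


def u_update(u, theta, phi, phi_next, r, gamma, beta):
    # Equations (12) and (18): u <- u + beta * (delta * phi - u).
    delta = r + gamma * dot(theta, phi_next) - dot(theta, phi)
    return [ui + beta * (delta * p - ui) for ui, p in zip(u, phi)]


def run_both(steps=200, w=1.0, gamma=0.9, seed=0):
    rng = random.Random(seed)
    rho = lambda t: (t + 1) ** -0.9       # varrho_t (assumed exponent)
    xi = lambda t: (t + 1) ** -0.6        # xi_t (assumed exponent)
    beta = lambda t: (t + 1) ** -0.75     # beta_t (assumed exponent)
    alpha = lambda t: xi(t) * rho(t)      # since xi_t = alpha_t / varrho_t

    theta_prev, theta = [0.0, 0.0], [0.1, -0.2]   # theta_0 and theta_1
    u = [0.0, 0.0]
    theta_d, u_d = list(theta), list(u)           # decomposed copies
    v = [(a - b) / rho(0) for a, b in zip(theta, theta_prev)]  # v_1

    for t in range(1, steps + 1):
        phi = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        phi_next = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        r = rng.uniform(-1, 1)

        # Heavy ball form, equations (11) and (12).
        d = gtd_direction(phi, phi_next, u, gamma)
        eta = (rho(t) - w * alpha(t)) / rho(t - 1)
        theta_new = [th + alpha(t) * di + eta * (th - thp)
                     for th, thp, di in zip(theta, theta_prev, d)]
        u = u_update(u, theta, phi, phi_next, r, gamma, beta(t))
        theta_prev, theta = theta, theta_new

        # Three-iterate form, equations (17), (18) and (19).
        d2 = gtd_direction(phi, phi_next, u_d, gamma)
        v = [vi + xi(t) * (di - w * vi) for vi, di in zip(v, d2)]
        u_d = u_update(u_d, theta_d, phi, phi_next, r, gamma, beta(t))
        # theta_{t+1} = theta_t + varrho_t * v_{t+1}, i.e. v_t + eps_t.
        theta_d = [th + rho(t) * vi for th, vi in zip(theta_d, v)]

    return theta, theta_d
```

The two returned vectors should agree up to floating-point round-off, since $\varrho_{t}v_{t+1}=(\varrho_{t}-w\alpha_{t})v_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}$ is exactly the increment in (11).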

4 Convergence Analysis

In this section we analyze the asymptotic behaviour of the GTD-M iterates given by (17), (18) and (19). Throughout the section, we consider $v_{t},u_{t},\theta_{t}\in\mathbb{R}^{d}$ . We first consider the One-TS case when $\beta_{t}=c_{1}\xi_{t}$ and $\varrho_{t}=c_{2}\xi_{t}$ $\forall t$ , for some real constants $c_{1},c_{2}>0$ . Subsequently, we consider the Three-TS setting where $\frac{{\beta_{t}}}{\xi_{t}}\rightarrow 0$ and $\frac{{\varrho_{t}}}{\beta_{t}}\rightarrow 0$ as $t\rightarrow\infty$ .

4.1 One-Timescale Setting

We begin by analyzing GTD-M using a one-timescale SA setting. We let $c_{1}=c_{2}=1$ for simplicity. The iterates of GTD-M can then be re-written as:

\psi_{t+1}=\psi_{t}+\xi_{t}(G_{t}\psi_{t}+g_{t}+\bar{\varepsilon}_{t}),

(20)

where,

\displaystyle\psi_{t}=\begin{pmatrix}v_{t}\\ u_{t}\\ \theta_{t}\end{pmatrix},g_{t}=\begin{pmatrix}0\\ r_{t+1}\phi_{t}\\ 0\end{pmatrix},\bar{\varepsilon}_{t}=\begin{pmatrix}0\\ 0\\ \varepsilon_{t}\end{pmatrix},

\displaystyle G_{t}=\begin{pmatrix}-wI&(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}&0\\ 0&-I&\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}\\ I&0&0\end{pmatrix}.

Equation (20) can be re-written in the general SA scheme as:

\psi_{t+1}=\psi_{t}+\xi_{t}(h(\psi_{t})+M_{t+1}+\bar{\varepsilon}_{t}).

(21)

Here $h(\psi)=g+G\psi,g=\mathbb{E}[g_{t}],G=\mathbb{E}[G_{t}]$ , where the expectations are w.r.t. the stationary distribution of the Markov chain induced by the target policy $\pi$ , and $M_{t+1}=(G_{t}-G)\psi_{t}+(g_{t}-g)$ . In particular,

G=\begin{pmatrix}-wI&-\bar{A}^{T}&0\\ 0&-I&\bar{A}\\ I&0&0\end{pmatrix},g=\begin{pmatrix}0\\ \bar{b}\\ 0\end{pmatrix},

where recall that $\bar{A}=\mathbb{E}[\phi(\gamma\phi^{\prime}-\phi)^{T}]$ and $\bar{b}=\mathbb{E}[r\phi]$ .

Lemma 1.

Assume $w>0$ and $w(w+1)>||\bar{A}||^{2}$ . Then, the matrix $G$ is Hurwitz.

Proof.

Let $\lambda$ be an eigenvalue of $G$ . The characteristic equation of the matrix $G$ is given by:

	$\displaystyle\begin{vmatrix}-wI-\lambda I&-\bar{A}^{T}&0\\ 0&-I-\lambda I&\bar{A}\\ I&0&-\lambda I\end{vmatrix}=0$
	$\displaystyle\begin{vmatrix}wI+\lambda I&\bar{A}^{T}&0\\ 0&I+\lambda I&-\bar{A}\\ -I&0&\lambda I\end{vmatrix}=0$

Using the following formula for determinant of block matrices

	$\displaystyle\begin{vmatrix}A_{11}&A_{12}&A_{13}\\ A_{21}&A_{22}&A_{23}\\ A_{31}&A_{32}&A_{33}\end{vmatrix}=$
	$\displaystyle\begin{vmatrix}A_{11}\end{vmatrix}\begin{vmatrix}\begin{pmatrix}A_{22}&A_{23}\\ A_{32}&A_{33}\end{pmatrix}-\begin{pmatrix}A_{21}\\ A_{31}\end{pmatrix}A_{11}^{-1}\begin{pmatrix}A_{12}&A_{13}\end{pmatrix}\end{vmatrix}$

we have,

	$\displaystyle\begin{vmatrix}wI+\lambda I&\bar{A}^{T}&0\\ 0&I+\lambda I&-\bar{A}\\ -I&0&\lambda I\end{vmatrix}=$
	$\displaystyle\begin{vmatrix}(w+\lambda)I\end{vmatrix}\begin{vmatrix}\begin{pmatrix}I+\lambda I&-\bar{A}\\ 0&\lambda I\end{pmatrix}-\frac{1}{w+\lambda}\begin{pmatrix}0\\ -I\end{pmatrix}\begin{pmatrix}\bar{A}^{T}&0\end{pmatrix}\end{vmatrix}$
	$\displaystyle=(w+\lambda)^{d}\begin{vmatrix}I+\lambda I&-\bar{A}\\ \frac{\bar{A}^{T}}{w+\lambda}&\lambda I\end{vmatrix}$
	$\displaystyle=(w+\lambda)^{d}\begin{vmatrix}(1+\lambda)I\end{vmatrix}\begin{vmatrix}\lambda I+\frac{1}{(1+\lambda)(w+\lambda)}\bar{A}^{T}\bar{A}\end{vmatrix}$
	$\displaystyle=\frac{(w+\lambda)^{d}(1+\lambda)^{d}}{(w+\lambda)^{d}(1+\lambda)^{d}}\begin{vmatrix}\lambda(1+\lambda)(w+\lambda)I+\bar{A}^{T}\bar{A}\end{vmatrix}$
	$\displaystyle=\begin{vmatrix}\lambda(1+\lambda)(w+\lambda)I+\bar{A}^{T}\bar{A}\end{vmatrix}$

Therefore, from the characteristic equation of $G$ , we have that

\begin{vmatrix}\lambda(1+\lambda)(w+\lambda)I+\bar{A}^{T}\bar{A}\end{vmatrix}=0.

There must exist a non-zero vector $x\in\mathbb{C}^{d}$ , such that

x^{*}(\lambda(1+\lambda)(w+\lambda)I+\bar{A}^{T}\bar{A})x=0,

where $x^{*}$ is the conjugate transpose of the vector $x$ and $x^{*}x=||x||^{2}>0$ . The above equation reduces to the following cubic-polynomial equation:

\lambda^{3}||x||^{2}+(w+1)\lambda^{2}||x||^{2}+w\lambda||x||^{2}+||\bar{A}x||^{2}=0,

where $||\bar{A}x||^{2}=x^{*}\bar{A}^{T}\bar{A}x$ . By the Routh-Hurwitz criterion, a cubic polynomial $a_{3}\lambda^{3}+a_{2}\lambda^{2}+a_{1}\lambda+a_{0}$ has all roots with negative real parts iff $a_{3},a_{2},a_{1},a_{0}>0$ and $a_{1}a_{2}>a_{0}a_{3}$ . In our case, $a_{3}=||x||^{2}>0,a_{2}=(w+1)||x||^{2}>0,a_{1}=w||x||^{2}>0\mbox{ and }a_{0}=||\bar{A}x||^{2}>0$ . The last inequality follows from the fact that $\bar{A}$ is negative definite, hence non-singular, and therefore $x^{*}\bar{A}^{T}\bar{A}x>0$ . Finally, $a_{1}a_{2}=w(w+1)||x||^{4},a_{0}a_{3}=||x||^{2}||\bar{A}x||^{2}$ and $a_{1}a_{2}>a_{0}a_{3}$ follows from $\frac{||\bar{A}x||^{2}}{||x||^{2}}\leq||\bar{A}||^{2}<w(w+1)$ . Therefore $Re(\lambda)<0$ and the claim follows. ∎
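As a quick sanity check of Lemma 1, consider the scalar case $d=1$ with the assumed value $\bar{A}=-1$ (chosen only for illustration), so that $||\bar{A}||^{2}=1$ and the characteristic cubic of $G$ is $\lambda^{3}+(w+1)\lambda^{2}+w\lambda+1=0$ . The Routh-Hurwitz test then reads as follows for two hypothetical values of $w$ :

```latex
% Scalar illustration (d = 1), assuming \bar{A} = -1 so that ||\bar{A}||^2 = 1.
\begin{align*}
w=1:&\quad \lambda^{3}+2\lambda^{2}+\lambda+1=0,
  & a_{1}a_{2}=2 &> a_{0}a_{3}=1 &&\Rightarrow\ G \text{ is Hurwitz},\\
w=\tfrac{1}{2}:&\quad \lambda^{3}+\tfrac{3}{2}\lambda^{2}+\tfrac{1}{2}\lambda+1=0,
  & a_{1}a_{2}=\tfrac{3}{4} &< a_{0}a_{3}=1 &&\Rightarrow\ G \text{ is not Hurwitz}.
\end{align*}
```

In this scalar example the condition $w(w+1)>||\bar{A}||^{2}$ is therefore not only sufficient but also necessary.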

Consider the following assumptions:

$\mathbfcal{A}$ 1.

All rewards $r(s,s^{\prime})$ and features $\phi(s)$ are bounded, i.e., $|r(s,s^{\prime})|\leq 1$ and $||\phi(s)||\leq 1$ $\forall s,s^{\prime}\in\mathcal{S}$ . Also, the matrix $\Phi$ has full rank, where $\Phi$ is the $n\times d$ matrix whose $s^{th}$ row is $\phi(s)^{T}$ .

$\mathbfcal{A}$ 2.

The step-sizes satisfy $\xi_{t}=\beta_{t}=\varrho_{t}>0$ ,

\sum_{t}\xi_{t}=\infty,\quad\sum_{t}\xi_{t}^{2}<\infty,\mbox{ where }\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}},

and the momentum parameter satisfies: $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}.$

$\mathbfcal{A}$ 3.

The samples ( $\phi_{t},\phi_{t}^{\prime}$ ) are drawn i.i.d from the stationary distribution of the Markov chain induced by target policy $\pi$ .

Theorem 2.

Assume $\mathbfcal{A}$ 1, $\mathbfcal{A}$ 2 and $\mathbfcal{A}$ 3 hold and let $w\geq 1$ . Then, the GTD-M iterates given by (11) and (12) satisfy $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ a.s. as $n\rightarrow\infty$ .

Proof.

Assumption $\mathbfcal{A}$ 1 ensures that $||\bar{A}||^{2}<w(w+1)$ and $\mathbfcal{A}$ 3 ensures that the function $h(\cdot)$ is well defined. Now, using Lemma 1 and (Borkar and Meyn 2000) we can show that the iterates in (20) remain stable. Then, using the third extension from (Chapter 2, p. 17, Borkar (2008b)), we can show that $\psi_{n}\rightarrow-G^{-1}g$ as $n\rightarrow\infty$ . Thereafter, using the formula for the inverse of block matrices, it can be shown that $\theta_{n}\rightarrow-\bar{A}^{-1}\bar{b}$ as $n\rightarrow\infty$ . See Appendix A1 for a detailed proof. ∎
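The $\theta$ -component of the limit $-G^{-1}g$ can also be read off without inverting $G$ , by solving $G\psi+g=0$ block by block. The short derivation below is a sketch that uses only the definitions of $G$ and $g$ given above and the non-singularity of $\bar{A}$ :

```latex
% Solve G\psi + g = 0 for \psi = (v, u, \theta), using the block rows of G.
\begin{align*}
\text{row 3:}\quad & v = 0,\\
\text{row 1:}\quad & -wv-\bar{A}^{T}u = 0 \ \Rightarrow\ \bar{A}^{T}u = 0 \ \Rightarrow\ u = 0
  \quad(\bar{A}\text{ is non-singular}),\\
\text{row 2:}\quad & -u+\bar{A}\theta+\bar{b} = 0 \ \Rightarrow\ \theta = -\bar{A}^{-1}\bar{b}.
\end{align*}
```

Hence $-G^{-1}g=(0,0,-\bar{A}^{-1}\bar{b})$ , which is the TD solution in the $\theta$ -coordinate.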

Similar results can be proved for the GTD2-M and TDC-M iterates.

Remark 1.

If $w$ is large, the initial values of the momentum parameter are small. The condition on $w$ in Lemma 1 is restrictive compared to the condition in (Avrachenkov, Patil, and Thoppe 2020), which is $w>0$ . Motivated by this, we look at the three-TS case of the iterates.

4.2 Three Timescale Setting

We consider the three iterates for GTD-M in (17), (18) and (19) under the following criteria for step-sizes: $\frac{\beta_{t}}{\xi_{t}}\rightarrow 0$ and $\frac{\varrho_{t}}{\beta_{t}}\rightarrow 0$ as $t\rightarrow\infty$ . Although stability and convergence results exist for the one-TS and two-TS cases, to our knowledge this is the first time conditions for stability and a.s. convergence have been provided for generic three-TS SA recursions. We emphasize that the setting we look at in Theorem 3 is more general than the setting at hand of the GTD-M iterates. We next provide the general iterates for a three-TS recursion along with the assumptions used while analyzing them. Consider the following three iterates:

x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n},z_{n})+M_{n+1}^{(1)}+\varepsilon^{(1)}_{n}\right),

(22)

y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n},z_{n})+M_{n+1}^{(2)}+\varepsilon^{(2)}_{n}\right),

(23)

z_{n+1}=z_{n}+c(n)\left(f(x_{n},y_{n},z_{n})+M_{n+1}^{(3)}+\varepsilon^{(3)}_{n}\right),

(24)

and the following assumptions:

(B1)

$h:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}},g:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{2}},f:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{3}}$ are Lipschitz continuous, with Lipschitz constants $L_{1},L_{2}$ and $L_{3}$ respectively.
(B2)

$\{a(n)\}$ , $\{b(n)\}$ , $\{c(n)\}$ are step-size sequences that satisfy $a(n)>0,b(n)>0,c(n)>0,\forall n>0,$

$\sum_{n}a(n)=\sum_{n}b(n)=\sum_{n}c(n)=\infty,$

$\sum_{n}(a(n)^{2}+b(n)^{2}+c(n)^{2})<\infty,$

$\frac{b(n)}{a(n)}\rightarrow 0,\frac{c(n)}{b(n)}\rightarrow 0\mbox{ as }n\rightarrow\infty.$

(B3)

$\{M_{n}^{(1)}\},\{M_{n}^{(2)}\},\{M_{n}^{(3)}\}$ are Martingale difference sequences w.r.t. the filtration $\{\mathcal{F}_{n}\}$ where,

\mathcal{F}_{n}=\sigma\left(x_{m},y_{m},z_{m},M_{m}^{(1)},M_{m}^{(2)},M_{m}^{(3)},m\leq n\right)

\mathbb{E}\left[||M_{n+1}^{(i)}||^{2}|\mathcal{F}_{n}\right]\leq K_{i}\left(1+||x_{n}||^{2}+||y_{n}||^{2}+||z_{n}||^{2}\right);

$\forall n\geq 0$ , $i=1,2,3$ and constants $0<K_{i}<\infty$ . The terms $\varepsilon^{(i)}_{n}$ satisfy $||\varepsilon^{(1)}_{n}||+||\varepsilon^{(2)}_{n}||+||\varepsilon^{(3)}_{n}||\rightarrow 0$ as $n\rightarrow\infty$ .

(B4)
(i) The ODE $\dot{x}(t)=h(x(t),y,z)$ , $y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$ , has a globally asymptotically stable equilibrium (g.a.s.e) $\lambda(y,z)$ , and $\lambda:\mathbb{R}^{d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$ is Lipschitz continuous.

(ii) The ODE $\dot{y}(t)=g(\lambda(y(t),z),y(t),z)$ , $z\in\mathbb{R}^{d_{3}}$ , has a globally asymptotically stable equilibrium $\Gamma(z)$ , where $\Gamma:\mathbb{R}^{d_{3}}\rightarrow\mathbb{R}^{d_{2}}$ is Lipschitz continuous.

(iii) The ODE $\dot{z}(t)=f(\lambda(\Gamma(z(t)),z(t)),\Gamma(z(t)),z(t))$ has a globally asymptotically stable equilibrium $z^{*}\in\mathbb{R}^{d_{3}}$ .
(B5)

The functions $h_{c}(x,y,z)=\frac{h(cx,cy,cz)}{c},c\geq 1$ satisfy $h_{c}\rightarrow h_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. The ODE: $\dot{x}(t)=h_{\infty}(x(t),y,z),$ has a unique globally asymptotically stable equilibrium $\lambda_{\infty}(y,z)$ , where $\lambda_{\infty}:\mathbb{R}^{d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$ is Lipschitz continuous. Further, $\lambda_{\infty}(0,0)=0$ .
(B6)

The functions $g_{c}(y,z)=\frac{g(c\lambda_{\infty}(y,z),cy,cz)}{c},c\geq 1$ satisfy $g_{c}\rightarrow g_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. The ODE: $\dot{y}(t)=g_{\infty}(y(t),z),$ has a unique globally asymptotically stable equilibrium $\Gamma_{\infty}(z)$ , where $\Gamma_{\infty}:\mathbb{R}^{d_{3}}\rightarrow\mathbb{R}^{d_{2}}$ is Lipschitz continuous. Further, $\Gamma_{\infty}(0)=0$ .
(B7)

The functions $f_{c}(z)=\frac{f(c\lambda_{\infty}(\Gamma_{\infty}(z),z),c\Gamma_{\infty}(z),cz)}{c},c\geq 1$ satisfy $f_{c}\rightarrow f_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. The ODE: $\dot{z}(t)=f_{\infty}(z(t)),$ has the origin in $\mathbb{R}^{d_{3}}$ as its unique globally asymptotically stable equilibrium.

Remark 2.

Conditions $\bf{(B5)}-\bf{(B7)}$ are sufficient conditions for the iterates to remain stable. Specifically, they ensure that $\sup_{n}(||x_{n}||+||y_{n}||+||z_{n}||)<\infty$ $a.s.$ Conditions $\bf{(B1)}-\bf{(B4)}$ , along with the stability of the iterates, ensure a.s. convergence of the iterates.

Theorem 3.

Under assumptions $\bf{(B1)}$ - $\bf{(B7)}$ , the iterates given by (22), (23) and (24) satisfy

(x_{n},y_{n},z_{n})\rightarrow(\lambda(\Gamma(z^{*}),z^{*}),\Gamma(z^{*}),z^{*})\mbox{ as }n\rightarrow\infty

Proof.

See Appendix A2. ∎
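To make the three-TS scheme (22)-(24) concrete, the following sketch simulates a toy scalar instance. Everything in it is an illustrative assumption rather than part of the paper: the drift functions $h(x,y,z)=y-x$ , $g(x,y,z)=z-y$ and $f(x,y,z)=1-x$ , the Gaussian noise, the step-size exponents and the function name. For this choice $\lambda(y,z)=y$ , $\Gamma(z)=z$ and $z^{*}=1$ , so Theorem 3 predicts convergence to $(1,1,1)$ .

```python
import random


def three_timescale_sa(n_steps=20000, noise=0.1, seed=0):
    """Toy scalar three-timescale SA with assumed linear drifts."""
    rng = random.Random(seed)
    x = y = z = 0.0
    for n in range(n_steps):
        a = (n + 1) ** -0.6   # fastest timescale, drives x
        b = (n + 1) ** -0.8   # intermediate timescale, drives y
        c = (n + 1) ** -1.0   # slowest timescale, drives z
        # Independent zero-mean noise terms (martingale differences).
        m1 = rng.gauss(0.0, noise)
        m2 = rng.gauss(0.0, noise)
        m3 = rng.gauss(0.0, noise)
        # All three updates use the values from step n.
        x_new = x + a * ((y - x) + m1)      # h(x, y, z) = y - x
        y_new = y + b * ((z - y) + m2)      # g(x, y, z) = z - y
        z_new = z + c * ((1.0 - x) + m3)    # f(x, y, z) = 1 - x
        x, y, z = x_new, y_new, z_new
    return x, y, z
```

The step-sizes satisfy $\frac{b(n)}{a(n)}\rightarrow 0$ and $\frac{c(n)}{b(n)}\rightarrow 0$ , so $x_{n}$ tracks $y_{n}$ , $y_{n}$ tracks $z_{n}$ , and $z_{n}$ slowly drifts to $z^{*}=1$ .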

Next, we use Theorem 3 to show that the GTD-M iterates converge a.s. to the TD solution $-\bar{A}^{-1}\bar{b}$ . Consider the following assumption on step-size sequences instead of $\mathbfcal{A}$ 2.

$\mathbfcal{A}$ 4.

The step-sizes satisfy $\xi_{t}>0,\beta_{t}>0,\varrho_{t}>0\mbox{ }\forall t$ ,

\sum_{t}\xi_{t}=\sum_{t}\beta_{t}=\sum_{t}\varrho_{t}=\infty,

\sum_{t}(\xi_{t}^{2}+\beta_{t}^{2}+\varrho_{t}^{2})<\infty,

\frac{\beta_{t}}{\xi_{t}}\rightarrow 0,\frac{\varrho_{t}}{\beta_{t}}\rightarrow 0\mbox{ as }t\rightarrow\infty

and the momentum parameter satisfies: $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}.$
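A polynomial step-size schedule satisfying the conditions above can be sketched as follows. The exponents $0.6<0.8<1.0$ , the value $w=1$ and the function name `make_schedules` are assumptions chosen for illustration; $\alpha_{t}$ is recovered from $\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}}$ .

```python
def make_schedules(xi_exp=0.6, beta_exp=0.8, rho_exp=1.0, w=1.0):
    """Return step-size functions of t for the three-timescale setting."""
    xi = lambda t: (t + 1) ** -xi_exp        # fastest timescale
    beta = lambda t: (t + 1) ** -beta_exp    # intermediate timescale
    rho = lambda t: (t + 1) ** -rho_exp      # slowest timescale (varrho_t)
    alpha = lambda t: xi(t) * rho(t)         # since xi_t = alpha_t / varrho_t
    # Momentum parameter eta_t, defined for t >= 1.
    eta = lambda t: (rho(t) - w * alpha(t)) / rho(t - 1)
    return xi, beta, rho, alpha, eta
```

With these exponents all three sequences are non-summable but square summable, the ratios $\frac{\beta_{t}}{\xi_{t}}$ and $\frac{\varrho_{t}}{\beta_{t}}$ decay to zero, and $\eta_{t}$ approaches $1$ from below as $t$ grows.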

Theorem 4.

Assume $\mathbfcal{A}$ 1, $\mathbfcal{A}$ 3 and $\mathbfcal{A}$ 4 hold and let $w>0$ . Then, the GTD-M iterates given by (11) and (12) satisfy $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ a.s. as $n\rightarrow\infty$ .

Proof.

We transform the iterates given by (17), (18) and (19) into the standard SA form given by (22), (23) and (24). Let $\mathcal{F}_{t}=\sigma(u_{0},v_{0},\theta_{0},r_{j+1},\phi_{j},\phi_{j}^{\prime}:j<t)$ . Let, $A_{t}=\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}$ and $b_{t}=r_{t+1}\phi_{t}$ . Then, (17) can be re-written as:

v_{t+1}=v_{t}+\xi_{t}\left(h(v_{t},u_{t},\theta_{t})+M_{t+1}^{(1)}\right)

where,

\begin{split}h(v_{t},u_{t},\theta_{t})&=\mathbb{E}[(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}|\mathcal{F}_{t}]\\ &=-\bar{A}^{T}u_{t}-wv_{t}.\\ M_{t+1}^{(1)}=-A_{t}^{T}u_{t}&-wv_{t}-h(v_{t},u_{t},\theta_{t})=(\bar{A}^{T}-A_{t}^{T})u_{t}.\end{split}

Next, (18) can be re-written as:

\begin{split}u_{t+1}&=u_{t}+\beta_{t}\left(g(v_{t},u_{t},\theta_{t})+M_{t+1}^{(2)}\right)\\ \end{split}

where,

\begin{split}g(v_{t},u_{t},\theta_{t})&=\mathbb{E}[\delta_{t}\phi_{t}-u_{t}|\mathcal{F}_{t}]=\bar{A}\theta_{t}+\bar{b}-u_{t}\\ M_{t+1}^{(2)}&=A_{t}\theta_{t}+b_{t}-u_{t}-g(v_{t},u_{t},\theta_{t})\\ &=(A_{t}-\bar{A})\theta_{t}+(b_{t}-\bar{b}).\end{split}

Finally, (19) can be re-written as:

\begin{split}\theta_{t+1}=\theta_{t}+\varrho_{t}\left(f(v_{t},u_{t},\theta_{t})+\varepsilon_{t}+M_{t+1}^{(3)}\right)\end{split}

where, $f(v_{t},u_{t},\theta_{t})=v_{t}\mbox{ and }M_{t+1}^{(3)}=0.$

Figure 1: RMSPBE (averaged over 100 independent runs) across episodes for Boyan Chain. The features used are the standard spiked features of size 4 used in Boyan chain (see (Dann, Neumann, and Peters 2014)).

The functions $h,g,f$ are linear in $v,u,\theta$ and hence Lipschitz continuous, therefore satisfying $\bf{(B1)}$ . We choose the step-size sequences such that they satisfy $\bf{(B2)}$ . One popular choice is $\xi_{t}=\frac{1}{(t+1)^{\xi}},\beta_{t}=\frac{1}{(t+1)^{\beta}},\varrho_{t}=\frac{1}{(t+1)^{\varrho}},$ $\frac{1}{2}<\xi<\beta<\varrho\leq 1.$ Next, $M_{t+1}^{(1)},M_{t+1}^{(2)}$ and $M_{t+1}^{(3)}$ , $t\geq 0$ , are martingale difference sequences w.r.t. $\mathcal{F}_{t}$ by construction. $\mathbb{E}[||M_{t+1}^{(1)}||^{2}|\mathcal{F}_{t}]\leq||(\bar{A}^{T}-A_{t}^{T})||^{2}||u_{t}||^{2}$ , $\mathbb{E}[||M_{t+1}^{(2)}||^{2}|\mathcal{F}_{t}]\leq 2(||(A_{t}-\bar{A})||^{2}||\theta_{t}||^{2}+||(b_{t}-\bar{b})||^{2})$ . The first part of $\bf{(B3)}$ is satisfied with $K_{1}=||(\bar{A}^{T}-A_{t}^{T})||^{2}$ , $K_{2}=2\max(||A_{t}-\bar{A}||^{2},||b_{t}-\bar{b}||^{2})$ and any $K_{3}>0$ . The fact that $K_{1},K_{2}<\infty$ follows from the bounded features and bounded rewards assumption in $\mathbfcal{A}$ 1. Next, observe that $||\varepsilon_{t}^{(3)}||=\xi_{t}||\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\right)||\rightarrow 0$ since $\xi_{t}\rightarrow 0\mbox{ as }t\rightarrow\infty$ . For fixed $u,\theta\in\mathbb{R}^{d}$ , consider the ODE $\dot{v}(t)=-\bar{A}^{T}u-wv(t).$ For $w>0$ , $\lambda(u,\theta)=-\frac{\bar{A}^{T}u}{w}$ is the unique g.a.s.e; it is linear and therefore Lipschitz continuous. This satisfies $\bf{(B4)}$ (i). Next, for a fixed $\theta\in\mathbb{R}^{d}$ , the ODE $\dot{u}(t)=\bar{A}\theta+\bar{b}-u(t)$ has $\Gamma(\theta)=\bar{A}\theta+\bar{b}$ as its unique g.a.s.e, and $\Gamma$ is Lipschitz. This satisfies $\bf{(B4)}(ii)$ . Finally, to satisfy $\bf{(B4)}(iii)$ , consider

\begin{split}\dot{\theta}(t)&=f(\lambda(\Gamma(\theta(t)),\theta(t)),\Gamma(\theta(t)),\theta(t))\\ &=\frac{-\bar{A}^{T}\bar{A}\theta(t)-\bar{A}^{T}\bar{b}}{w}.\end{split}

Since $\bar{A}$ is negative definite, it is non-singular, and so $-\bar{A}^{T}\bar{A}$ is negative definite. Therefore, $\theta^{*}=-\bar{A}^{-1}\bar{b}$ is the unique g.a.s.e. Next, we show that the sufficient conditions for stability of the three iterates are satisfied. The function $h_{c}(v,u,\theta)=\frac{-c\bar{A}^{T}u-wcv}{c}=-\bar{A}^{T}u-wv\rightarrow h_{\infty}(v,u,\theta)=-\bar{A}^{T}u-wv$ uniformly on compacts as $c\rightarrow\infty$ . The limiting ODE $\dot{v}(t)=-\bar{A}^{T}u-wv(t)$ has $\lambda_{\infty}(u,\theta)=-\frac{\bar{A}^{T}u}{w}$ as its unique g.a.s.e. $\lambda_{\infty}$ is Lipschitz with $\lambda_{\infty}(0,0)=0$ , thus satisfying assumption $\bf{(B5)}$ .

The function $g_{c}(u,\theta)=\frac{c\bar{A}\theta+\bar{b}-cu}{c}=\bar{A}\theta-u+\frac{\bar{b}}{c}\rightarrow g_{\infty}(u,\theta)=\bar{A}\theta-u$ uniformly on compacts as $c\rightarrow\infty$ . The limiting ODE $\dot{u}(t)=\bar{A}\theta-u(t)$ has $\Gamma_{\infty}(\theta)=\bar{A}\theta$ as its unique g.a.s.e. $\Gamma_{\infty}$ is Lipschitz with $\Gamma_{\infty}(0)=0$ . Thus assumption $\bf{(B6)}$ is satisfied.

Finally, $f_{c}(\theta)=\frac{-c\bar{A}^{T}\bar{A}\theta}{cw}\rightarrow f_{\infty}(\theta)=\frac{-\bar{A}^{T}\bar{A}\theta}{w}$ uniformly on compacts as $c\rightarrow\infty$ , and the ODE $\dot{\theta}(t)=-\frac{\bar{A}^{T}\bar{A}\theta(t)}{w}$ has the origin as its unique g.a.s.e. This ensures the final condition $\bf{(B7)}$ . By Theorem 3,

\begin{pmatrix}v_{t}\\ u_{t}\\ \theta_{t}\end{pmatrix}\rightarrow\begin{pmatrix}\lambda(\Gamma(-\bar{A}^{-1}\bar{b}),-\bar{A}^{-1}\bar{b})\\ \Gamma(-\bar{A}^{-1}\bar{b})\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}=\begin{pmatrix}0\\ 0\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}.

Specifically, $\theta_{t}\rightarrow-\bar{A}^{-1}\bar{b}$ . ∎
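The nested equilibria used in the proof above can be sanity-checked numerically. The following is a minimal sketch, assuming a hypothetical two-dimensional instance (the matrix `A_bar`, the vector `b_bar` and the value of `w` are illustrative and not taken from the paper). It evaluates $\Gamma$ and $\lambda$ at $\theta^{*}=-\bar{A}^{-1}\bar{b}$ and checks that the drift of the slowest ODE vanishes there.

```python
import numpy as np

# Hypothetical 2-d instance (not from the paper): A_bar has a negative
# definite symmetric part, b_bar is arbitrary, and w > 0.
A_bar = np.array([[-0.6, 0.2],
                  [-0.1, -0.5]])
b_bar = np.array([0.3, -0.2])
w = 0.1

def lam(u, theta):
    """Equilibrium of the fastest ODE v' = -A^T u - w v (theta is unused here)."""
    return -A_bar.T @ u / w

def Gamma(theta):
    """Equilibrium of the middle ODE u' = A theta + b - u."""
    return A_bar @ theta + b_bar

theta_star = -np.linalg.solve(A_bar, b_bar)
u_star = Gamma(theta_star)
v_star = lam(u_star, theta_star)

# Drift of the slowest ODE theta' = (-A^T A theta - A^T b) / w at theta_star.
drift = (-A_bar.T @ A_bar @ theta_star - A_bar.T @ b_bar) / w

# Eigenvalues of the (symmetric) driving matrix of the slowest ODE.
slow_eigs = np.linalg.eigvalsh(-A_bar.T @ A_bar / w)
```

In this instance `u_star`, `v_star` and `drift` are zero up to round-off and all entries of `slow_eigs` are negative, which matches the limit point $(0,0,-\bar{A}^{-1}\bar{b})$ stated above.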

A similar analysis can be provided for the GTD2-M and TDC-M iterates. See Appendix A3 for details.

5 Experiments

We evaluate the momentum based GTD algorithms defined in Section 3 on four standard policy evaluation problems in reinforcement learning, namely Boyan Chain (Boyan 1999), 5-State Random Walk (Sutton et al. 2009), 19-State Random Walk (Sutton and Barto 2018) and Random MDP (Sutton et al. 2009). See Appendix A4 for a detailed description of the MDP settings and (Dann, Neumann, and Peters 2014) for details on implementation. We run the three algorithms, GTD, GTD2 and TDC, along with their heavy ball momentum variants in the One-TS and Three-TS settings and compare the RMSPBE (root of the MSPBE) across episodes. Figures 1 to 4 plot these results. We consider decreasing step-sizes of the form $\xi_{t}=\frac{1}{(t+1)^{\xi}},\beta_{t}=\frac{1}{(t+1)^{\beta}},\varrho_{t}=\frac{1}{(t+1)^{\varrho}},\alpha_{t}=\frac{1}{(t+1)^{\alpha}}$ in all the examples. Table 1 summarizes the different step-size sequences used in our experiments.
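For readers who want to reproduce the error metric, the following is a minimal sketch of the RMSPBE computation. It assumes the standard form of the MSPBE from (Sutton et al. 2009), $(\bar{A}\theta+\bar{b})^{T}C^{-1}(\bar{A}\theta+\bar{b})$, where $C=\mathbb{E}[\phi\phi^{T}]$ is notation introduced here for illustration. The function name `rmspbe` and the numerical values are hypothetical and are not the experimental settings of the paper.

```python
import numpy as np

def rmspbe(theta, A_bar, b_bar, C):
    """Root of MSPBE(theta) = (A theta + b)^T C^{-1} (A theta + b)."""
    g = A_bar @ theta + b_bar                 # expected TD update E[delta * phi]
    mspbe = float(g @ np.linalg.solve(C, g))
    return np.sqrt(max(mspbe, 0.0))           # guard against tiny negative round-off

# Hypothetical 2-d problem, for illustration only.
A_bar = np.array([[-0.6, 0.2],
                  [-0.1, -0.5]])
b_bar = np.array([0.3, -0.2])
C = np.array([[0.5, 0.1],
              [0.1, 0.4]])                    # stands in for E[phi phi^T]

theta_star = -np.linalg.solve(A_bar, b_bar)   # TD solution
err_at_zero = rmspbe(np.zeros(2), A_bar, b_bar, C)
err_at_solution = rmspbe(theta_star, A_bar, b_bar, C)
```

The error is strictly positive at $\theta=0$ and vanishes (up to round-off) at the TD solution $-\bar{A}^{-1}\bar{b}$, which is the point the iterates are shown to converge to.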

In the One-TS setting, we require $\xi=\beta=\varrho$ . Since $\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}}$ , we must have $\alpha=2\varrho$ . In the Three-TS setting, $\xi<\beta<\varrho$ , which implies $\alpha<\varrho+\beta$ and $\beta<\varrho$ . Although our analysis requires square summability, i.e., $\xi,\beta,\varrho>0.5$ , such a choice of step-sizes makes the algorithms converge very slowly. Recently, (Dalal et al. 2018a) showed convergence rate results for Gradient TD schemes with non-square summable step-sizes as well (see Remark 2 of (Dalal et al. 2018a)). Therefore, we look at non-square summable step-sizes here, and observe that in all the examples the iterates do converge. The momentum parameter is chosen as in $\mathbfcal{A}$ 2.

Table 1: Choice of step-size parameters

Boyan Chain	$\alpha$	$\beta$	$\varrho$	$w$
Vanilla	0.25	0.125	-	-
One-TS	0.25	0.125	0.125	1
Three-TS	0.25	0.125	0.2	0.1
5-state RW	$\alpha$	$\beta$	$\varrho$	$w$
Vanilla	0.25	0.125	-	-
One-TS	0.25	0.125	0.125	1
Three-TS	0.25	0.125	0.2	0.1
19-State RW	$\alpha$	$\beta$	$\varrho$	$w$
Vanilla	0.125	0.0625	-	-
One-TS	0.125	0.0625	0.0625	1
Three-TS	0.125	0.0625	0.1	0.1
Random Chain	$\alpha$	$\beta$	$\varrho$	$w$
Vanilla	0.5	0.25	-	-
One-TS	0.5	0.25	0.25	1
Three-TS	0.5	0.25	0.3	0.1
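The exponent constraints for the two regimes can be checked mechanically against Table 1. The sketch below is illustrative only: the helper names are hypothetical, and it relies on the relation $\xi=\alpha-\varrho$, which follows from $\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}}$ for step-sizes of the polynomial form used here.

```python
import math

def xi_exponent(alpha, rho):
    """xi_t = alpha_t / rho_t, so the exponents satisfy xi = alpha - rho."""
    return alpha - rho

def is_one_ts(alpha, beta, rho):
    """One-TS regime: xi = beta = rho (equivalently alpha = 2 * rho)."""
    xi = xi_exponent(alpha, rho)
    return math.isclose(xi, beta) and math.isclose(beta, rho)

def is_three_ts(alpha, beta, rho):
    """Three-TS regime: xi < beta < rho."""
    xi = xi_exponent(alpha, rho)
    return xi < beta < rho

# (alpha, beta, rho) exponents copied from Table 1.
table_1 = {
    "Boyan Chain":  {"One-TS": (0.25, 0.125, 0.125),   "Three-TS": (0.25, 0.125, 0.2)},
    "5-state RW":   {"One-TS": (0.25, 0.125, 0.125),   "Three-TS": (0.25, 0.125, 0.2)},
    "19-State RW":  {"One-TS": (0.125, 0.0625, 0.0625), "Three-TS": (0.125, 0.0625, 0.1)},
    "Random Chain": {"One-TS": (0.5, 0.25, 0.25),      "Three-TS": (0.5, 0.25, 0.3)},
}

one_ts_ok = all(is_one_ts(*row["One-TS"]) for row in table_1.values())
three_ts_ok = all(is_three_ts(*row["Three-TS"]) for row in table_1.values())
```

Both flags evaluate to `True` for the values in Table 1, so every row respects the ordering required by its regime even though the exponents are below the square-summability threshold of $0.5$.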

In all the examples considered, the momentum methods outperform their vanilla counterparts. In the Three-TS setting a lower value of $w$ can be chosen, which ensures that the momentum parameter is not as small in the initial phase of the algorithm as it is in the One-TS setting. This in turn helps to reduce the RMSPBE faster in the initial phase of the algorithm, as is evident from the experiments.

6 Related Work and Conclusion

To the best of our knowledge, no previous work has specifically looked at Gradient TD methods with an added heavy ball term. The use of momentum specifically in the SA setting is very limited. Section 4.1 of (Mou et al. 2020) does discuss momentum; however, the problem considered there is SGD with momentum, and the driving matrix is assumed to be symmetric (see Appendix H of their paper). We do not make any such assumption here. The work of (Devraj, Bušíć, and Meyn 2019) does look at momentum in the SA setting. However, they introduce a matrix momentum term which is not equivalent to heavy ball momentum. Acceleration in Gradient TD methods has been studied in (Pan, White, and White 2017). The authors provide a new algorithm called ATD, and the acceleration is in the form of better data efficiency. However, they do not make use of momentum methods.

In this work we have introduced heavy ball momentum in Gradient TD algorithms for the first time. We decompose the two iterates of these algorithms into three separate iterates and provide asymptotic convergence guarantees for these new schemes under the same assumptions made by their vanilla counterparts. Specifically, we show convergence in the One-TS regime as well as the Three-TS regime. In both cases, the momentum parameter gradually goes to 1. The Three-TS formulation gives us more flexibility in choosing the momentum parameter. Specifically, compared to the One-TS setting, a larger momentum parameter can be chosen during the initial phase in the Three-TS case. We observe improved performance with these new schemes when compared with the original algorithms.

As a step forward from this work, the natural direction would be to look at more sophisticated momentum methods such as Nesterov’s accelerated method (Nesterov 1983). Also, here we only provide the convergence guarantees of these new momentum methods. A particularly interesting step would be to quantify the benefits of using momentum in SA settings. Specifically, it would be interesting to extend the weak convergence rate analysis of (Konda and Tsitsiklis 2004; Mokkadem and Pelletier 2006) to the Three-TS regime. Also, extending the recent convergence rate results in expectation and high probability for GTD methods (Dalal et al. 2018b; Gupta, Srikant, and Ying 2019; Kaledin et al. 2019; Dalal, Szorenyi, and Thoppe 2020) to these momentum settings would be interesting future work.

References

Assran and Rabbat (2020) Assran, M.; and Rabbat, M. 2020. On the Convergence of Nesterov’s Accelerated Gradient Method in Stochastic Settings. Proceedings of the 37th International Conference on Machine Learning, PMLR, 119: 410–420.
Avrachenkov, Patil, and Thoppe (2020) Avrachenkov, K.; Patil, K.; and Thoppe, G. 2020. Online Algorithms for Estimating Change Rates of Web Pages. arXiv, 2009.08142.
Baird (1995) Baird, L. 1995. Residual Algorithms: Reinforcement Learning with Function Approximation. In Proceedings of the Twelfth International Conference on Machine Learning, 30–37. Morgan Kaufmann.
Borkar (2008a) Borkar, V. 2008a. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press. ISBN 9780521515924.
Borkar (2008b) Borkar, V. S. 2008b. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press. ISBN 9780521515924.
Borkar and Meyn (2000) Borkar, V. S.; and Meyn, S. P. 2000. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning. SIAM Journal on Control and Optimization, 38(2): 447–469.
Boyan (1999) Boyan, J. 1999. Least-Squares Temporal Difference Learning. In ICML.
Dalal, Szorenyi, and Thoppe (2020) Dalal, G.; Szorenyi, B.; and Thoppe, G. 2020. A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04): 3701–3708.
Dalal et al. (2018a) Dalal, G.; Szorenyi, B.; Thoppe, G.; and Mannor, S. 2018a. Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning. arXiv:1703.05376.
Dalal et al. (2018b) Dalal, G.; Thoppe, G.; Szörényi, B.; and Mannor, S. 2018b. Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning. In Bubeck, S.; Perchet, V.; and Rigollet, P., eds., Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, 1199–1233. PMLR.
Dann, Neumann, and Peters (2014) Dann, C.; Neumann, G.; and Peters, J. 2014. Policy Evaluation with Temporal Differences: A Survey and Comparison. Journal of Machine Learning Research, 15(24): 809–883.
Devraj, Bušíć, and Meyn (2019) Devraj, A. M.; Bušíć, A.; and Meyn, S. 2019. On Matrix Momentum Stochastic Approximation and Applications to Q-learning. 57th Annual Allerton Conference on Communication, Control, and Computing, 749–756.
Gadat, Panloup, and Saadane (2016) Gadat, S.; Panloup, F.; and Saadane, S. 2016. Stochastic Heavy ball. Electronic Journal of Statistics, 12: 461–529.
Ghadimi, Feyzmahdavian, and Johansson (2014) Ghadimi, E.; Feyzmahdavian, H. R.; and Johansson, M. 2014. Global convergence of the Heavy-ball method for convex optimization. arXiv:1412.7457.
Gitman et al. (2019) Gitman, I.; Lang, H.; Zhang, P.; and Xiao, L. 2019. Understanding the role of momentum in stochastic gradient methods. Advances in Neural Information Processing Systems, 9630–9640.
Gupta, Srikant, and Ying (2019) Gupta, H.; Srikant, R.; and Ying, L. 2019. Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning. arXiv:1907.06290.
Kaledin et al. (2019) Kaledin, M.; Moulines, E.; Naumov, A.; Tadic, V.; and Wai, H. 2019. Finite Time Analysis of Linear Two-timescale Stochastic Approximation with Markovian Noise. Conference on Learning Theory, 125: 2144–2203.
Konda and Tsitsiklis (2004) Konda, V.; and Tsitsiklis, J. 2004. Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability, 14.
Kushner and Clark (1978) Kushner, H.; and Clark, D. 1978. Stochastic Approximation Methods for constrained and unconstrained systems. Springer.
Lakshminarayanan and Bhatnagar (2017) Lakshminarayanan, C.; and Bhatnagar, S. 2017. A Stability Criterion for Two-Timescale Stochastic Approximation Schemes. Automatica, 79: 108–114.
Ljung (1977) Ljung, L. 1977. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4): 551–575.
Loizou and Richtárik (2020) Loizou, N.; and Richtárik, P. 2020. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77: 653–710.
Ma and Yarats (2019) Ma, J.; and Yarats, D. 2019. Quasi-hyperbolic momentum and adam for deep learning. International Conference on Learning Representations.
Maei (2011) Maei, H. R. 2011. Gradient Temporal-Difference Learning Algorithms. Ph.D. thesis, University of Alberta, CAN. AAINR89455.
Mokkadem and Pelletier (2006) Mokkadem, A.; and Pelletier, M. 2006. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. The Annals of Applied Probability, 16(3): 1671 – 1702.
Mou et al. (2020) Mou, W.; Li, C. J.; Wainwright, M. J.; Bartlett, P. L.; and Jordan, M. I. 2020. On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration. Proceedings of Thirty Third Conference on Learning Theory, PMLR, 125: 2947–2997.
Nesterov (1983) Nesterov, Y. 1983. A method of solving a convex programming problem with convergence rate $O\bigl{(}\frac{1}{k^{2}}\bigr{)}$ . Soviet Mathematics Doklady, 269: 543–547.
Pan, White, and White (2017) Pan, Y.; White, A.; and White, M. 2017. Accelerated Gradient Temporal Difference Learning. arXiv:1611.09328.
Polyak (1964) Polyak, B. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4: 1–17.
Polyak (1990) Polyak, B. 1990. New stochastic approximation type procedures. Avtomatica i Telemekhanika, 7: 98–107.
Robbins and Monro (1951) Robbins, H.; and Monro, S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3): 400 – 407.
Sutton and Barto (2018) Sutton, R.; and Barto, A. 2018. Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book. ISBN 0262039249.
Sutton et al. (2009) Sutton, R.; Maei, H.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009. Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 993–1000. New York, NY, USA: Association for Computing Machinery. ISBN 9781605585161.
Sutton (1988) Sutton, R. S. 1988. Learning to Predict By the Methods of Temporal Differences. Machine Learning, 3(1): 9–44.
Sutton, Maei, and Szepesvári (2009) Sutton, R. S.; Maei, H.; and Szepesvári, C. 2009. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation. In Koller, D.; Schuurmans, D.; Bengio, Y.; and Bottou, L., eds., Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc.
Tsitsiklis and Van Roy (1997) Tsitsiklis, J.; and Van Roy, B. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5): 674–690.

Appendix

A1 Proof of Theorem 2

Consider the one-timescale recursion for the GTD-M iterates given by (21):

\psi_{t+1}=\psi_{t}+\xi_{t}(h(\psi_{t})+M_{t+1}+\bar{\varepsilon}_{t}).

Here $h(\psi)=g+G\psi$ with

G=\begin{pmatrix}-wI&-\bar{A}^{T}&0\\ 0&-I&\bar{A}\\ I&0&0\end{pmatrix},g=\begin{pmatrix}0\\ \bar{b}\\ 0\end{pmatrix},

where recall that $\bar{A}=\mathbb{E}[\phi(\gamma\phi^{\prime}-\phi)^{T}]$ and $\bar{b}=\mathbb{E}[r\phi]$ . We show that the conditions ${\bf(A1)-(A4)}$ in Chapter 2 of (Borkar 2008b) hold and thereafter use Theorem 2 of (Borkar 2008b) to show convergence to the TD solution.

(A1)

The map $h(\psi)$ is linear in $\psi$ and therefore Lipschitz continuous with Lipschitz constant $||G||$ .
(A2)

The step-size sequence $\xi_{t}$ satisfies the required conditions (cf. assumption $\mathbfcal{A}$ 2 of the current paper).
(A3)

By construction $M_{t+1}$ is a martingale difference sequence w.r.t. the filtration $\mathcal{F}_{t}=\sigma(\psi_{0},M_{k},k\leq t)$ . Also, $\mathbb{E}[||(G_{t+1}-G)\psi_{t}+(g_{t+1}-g)||^{2}|\mathcal{F}_{t}]\leq 2(||(G_{t+1}-G)||^{2}||\psi_{t}||^{2}+||g_{t+1}-g||^{2})$ . (A3) is satisfied with $K=2\max(||(G_{t+1}-G)||^{2},||g_{t+1}-g||^{2})$ . $K<\infty$ follows from the fact that the rewards are uniformly bounded and the features are normalized (see assumption $\mathbfcal{A}$ 1).
(A4)

To ensure (A4), we show that (A5) of (Chapter 3, pp. 22, Borkar (2008b)) holds and then use Theorem 7 of (Borkar 2008b). Define the functions $h_{c}(\psi)=\frac{h(c\psi)}{c}=\frac{g}{c}+G\psi,c\geq 1$ . On any compact set $H$ , $h_{c}\rightarrow h_{\infty}$ uniformly as $c\rightarrow\infty$ , where $h_{\infty}(\psi)=G\psi$ . Consider the ODE

$\dot{\psi}(t)=h_{\infty}(\psi(t))=G\psi(t).$

Observe that since $||\phi_{t}||\leq 1$ and $r_{t}\leq 1$ $\forall t$ , we have $||A||^{2}<2$ . Since we have assumed that $w\geq 1$ , it follows that $w(w+1)\geq||A||^{2}$ , and hence from Lemma 1, $G$ is Hurwitz. Hence, the origin is the unique globally asymptotically stable equilibrium (g.a.s.e) for the above ODE. This in turn implies that the iterates remain bounded, i.e., $\sup_{t}||\psi_{t}||<\infty$ a.s. By (Theorem 2, Chapter 2 of Borkar (2008b)), $\psi_{t}$ converges to an internally chain transitive invariant set of the ODE $\dot{\psi}(t)=h(\psi(t))=g+G\psi(t)$ . The only such set is the equilibrium point $-G^{-1}g$ . By (Corollary 4, Chapter 2 of Borkar (2008b)),

$\psi_{t}\rightarrow-G^{-1}g.$

A straightforward calculation for the inverse of the $3\times 3$ block matrix $G$ gives us that

$\theta_{t}\rightarrow-\bar{A}^{-1}\bar{b}$
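The block-matrix calculation can be cross-checked numerically. The following is a minimal sketch under stated assumptions: the two-dimensional `A_bar` and `b_bar` are hypothetical values chosen so that $||\bar{A}||^{2}<2$, and $w=1$. It builds $G$ and $g$ as defined above, solves for $-G^{-1}g$, and inspects the spectrum of $G$.

```python
import numpy as np

# Hypothetical instance (not from the paper) with spectral norm of A_bar below 1.
A_bar = np.array([[-0.6, 0.2],
                  [-0.1, -0.5]])
b_bar = np.array([0.3, -0.2])
w = 1.0
d = A_bar.shape[0]

I, Z = np.eye(d), np.zeros((d, d))
# psi = (v, u, theta); rows follow the definition of G in this appendix.
G = np.block([[-w * I, -A_bar.T, Z],
              [Z,      -I,       A_bar],
              [I,      Z,        Z]])
g = np.concatenate([np.zeros(d), b_bar, np.zeros(d)])

psi_star = -np.linalg.solve(G, g)             # equilibrium of psi' = g + G psi
v_star, u_star, theta_star = np.split(psi_star, 3)

# G is Hurwitz when every eigenvalue has a strictly negative real part.
max_real_part = np.linalg.eigvals(G).real.max()
```

For this instance the $\theta$ block of `psi_star` agrees with $-\bar{A}^{-1}\bar{b}$, the $v$ and $u$ blocks are zero, and `max_real_part` is negative, consistent with the Hurwitz property used in (A4).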

A2 Proof of Theorem 3

We first start by assuming that the iterates remain stable (cf. assumption (B5)) and show that the three timescale recursions converge. Subsequently we provide conditions which ensure that the iterates remain stable. We consider general three timescale recursions as given below:

x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n},z_{n})+M_{n+1}^{(1)}\right),

(25)

y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n},z_{n})+M_{n+1}^{(2)}\right),

(26)

z_{n+1}=z_{n}+c(n)\left(f(x_{n},y_{n},z_{n})+M_{n+1}^{(3)}\right),

(27)

where $x_{n}\in\mathbb{R}^{d_{1}}$ , $y_{n}\in\mathbb{R}^{d_{2}}$ and $z_{n}\in\mathbb{R}^{d_{3}}$ $\forall n\geq 0$ . Next we consider the following assumptions:

(B1)

$h:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}},g:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{2}},f:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{3}}$ are Lipschitz continuous.

(B2)

$\{M_{n}^{(1)}\},\{M_{n}^{(2)}\},\{M_{n}^{(3)}\}$ are martingale difference sequences w.r.t. $\{\mathcal{F}_{n}\}$ , where

\mathcal{F}_{n}=\sigma\left(x_{m},y_{m},z_{m},M_{m}^{(1)},M_{m}^{(2)},M_{m}^{(3)};m\leq n\right),n\geq 0

\mathbb{E}\left[||M_{n+1}^{(i)}||^{2}|\mathcal{F}_{n}\right]\leq K_{i}\left(1+||x_{n}||^{2}+||y_{n}||^{2}+||z_{n}||^{2}\right)a.s.;i=1,2,3,

for some constants $K_{i}>0$ , $i=1,2,3$ .

(B3)

$\{a(n)\}$ , $\{b(n)\}$ , $\{c(n)\}$ are step-size sequences that satisfy $a(n)>0,b(n)>0,c(n)>0,\forall n\geq 0$

$\sum_{n}a(n)=\sum_{n}b(n)=\sum_{n}c(n)=\infty,$

$\sum_{n}(a(n)^{2}+b(n)^{2}+c(n)^{2})<\infty,$

$\frac{b(n)}{a(n)}\rightarrow 0\mbox{,\quad}\frac{c(n)}{b(n)}\rightarrow 0\mbox{ as }n\rightarrow\infty.$
(B4)

(i)

The ODE $\dot{x}(t)=h(x(t),y,z),y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$ has a globally asymptotically stable equilibrium $\lambda(y,z)$ , where $\lambda:\mathbb{R}^{d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$ is Lipschitz continuous.
(ii)

The ODE $\dot{y}(t)=g(\lambda(y(t),z),y(t),z),z\in\mathbb{R}^{d_{3}}$ has a globally asymptotically stable equilibrium $\Gamma(z)$ , where $\Gamma:\mathbb{R}^{d_{3}}\rightarrow\mathbb{R}^{d_{2}}$ is Lipschitz continuous.
(iii)

The ODE $\dot{z}(t)=f(\lambda(\Gamma(z(t)),z(t)),\Gamma(z(t)),z(t))$ has a globally asymptotically stable equilibrium $z^{*}\in\mathbb{R}^{d_{3}}$ .
(B5)

$\sup_{n}\left(||x_{n}||+||y_{n}||+||z_{n}||\right)<\infty$ $a.s.$
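As an illustration of the step-size condition $\bf{(B3)}$, the sketch below evaluates polynomial step-sizes $a(n)=(n+1)^{-\xi}$, $b(n)=(n+1)^{-\beta}$, $c(n)=(n+1)^{-\varrho}$ and the two ratios that must vanish. The exponents are hypothetical values satisfying $\frac{1}{2}<\xi<\beta<\varrho\leq 1$; exponents above $\frac{1}{2}$ give square summability and exponents at most $1$ give non-summability.

```python
import numpy as np

def poly_step(n, exponent):
    """Polynomial step-size 1 / (n + 1)^exponent."""
    return 1.0 / (n + 1.0) ** exponent

# Illustrative exponents with 1/2 < xi < beta < rho <= 1 (not from the paper).
xi, beta, rho = 0.6, 0.8, 1.0

n = np.array([1e1, 1e3, 1e5, 1e7])
a = poly_step(n, xi)     # fastest timescale
b = poly_step(n, beta)   # middle timescale
c = poly_step(n, rho)    # slowest timescale

ratio_ba = b / a         # equals (n + 1)^-(beta - xi)
ratio_cb = c / b         # equals (n + 1)^-(rho - beta)
```

Both ratios decrease monotonically along `n`, but only polynomially: with a gap of $0.2$ between consecutive exponents they are still about $0.04$ at $n=10^{7}$, so the timescale separation is an asymptotic property.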

Theorem 5.

Under $(\textbf{B1})-(\textbf{B5})$ , the iterates given by (25), (26) and (27) satisfy

(x_{n},y_{n},z_{n})\rightarrow(\lambda(\Gamma(z^{*}),z^{*}),\Gamma(z^{*}),z^{*}).

Proof.

We start with the following Lemma that characterizes the set to which the iterates converge.

Lemma 6.

$(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(\Gamma(z),z),\Gamma(z),z):z\in\mathbb{R}^{d_{3}}\}$

Proof.

We first consider the fastest timescale of $\{a(n)\}$ and show that:

(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(y,z),y,z):y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}\}.

We rewrite the iterates (25), (26) and (27) as:

x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n},z_{n})+M_{n+1}^{(1)}\right),

(28)

y_{n+1}=y_{n}+a(n)\left(\epsilon^{(2),a}_{n}+M_{n+1}^{(2),a}\right),

(29)

z_{n+1}=z_{n}+a(n)\left(\epsilon^{(3),a}_{n}+M_{n+1}^{(3),a}\right),

(30)

where,

\epsilon^{(2),a}_{n}=\frac{b(n)}{a(n)}g(x_{n},y_{n},z_{n}),M_{n+1}^{(2),a}=\frac{b(n)}{a(n)}M_{n+1}^{(2)},

\epsilon^{(3),a}_{n}=\frac{c(n)}{a(n)}f(x_{n},y_{n},z_{n}),M_{n+1}^{(3),a}=\frac{c(n)}{a(n)}M_{n+1}^{(3)}.

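Since $\frac{b(n)}{a(n)}\rightarrow 0$ and $\frac{c(n)}{a(n)}=\frac{c(n)}{b(n)}\cdot\frac{b(n)}{a(n)}\rightarrow 0$ by $\bf{(B3)}$ , and $g,f$ are bounded along the iterates by $\bf{(B1)}$ and $\bf{(B5)}$ , we have $\epsilon^{(2),a}_{n}\rightarrow 0$ and $\epsilon^{(3),a}_{n}\rightarrow 0$ a.s. Using the third extension from Chapter-2 of (Borkar 2008a), $(x_{n},y_{n},z_{n})$ converges to an internally chain transitive invariant set of the ODE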

\dot{x}(t)=h(x(t),y(t),z(t)),

\dot{y}(t)=0,

\dot{z}(t)=0.

For initial conditions $x\in\mathbb{R}^{d_{1}},y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$ , the internally chain transitive invariant set of the above ODE is $\{(\lambda(y,z),y,z)\}$ . Therefore,

(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(y,z),y,z):y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}\}

(31)

Next we consider the middle timescale $\{b(n)\}$ . (26) and (27) can be re-written as:

y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n},z_{n})+M_{n+1}^{(2)}\right),

z_{n+1}=z_{n}+b(n)\left(\epsilon_{n}^{(3),b}+M_{n+1}^{(3),b}\right),

(32)

where,

\epsilon_{n}^{(3),b}=\frac{c(n)}{b(n)}f(x_{n},y_{n},z_{n}),\mbox{\quad}M_{n+1}^{(3),b}=\frac{c(n)}{b(n)}M_{n+1}^{(3)}.

The iteration for $\{y_{n}\}$ can be re-written as:

y_{n+1}=y_{n}+b(n)\left(g(\lambda(y_{n},z_{n}),y_{n},z_{n})+\epsilon_{n}^{(2),b}+M_{n+1}^{(2)}\right)

(33)

where,

\epsilon_{n}^{(2),b}=g(x_{n},y_{n},z_{n})-g(\lambda(y_{n},z_{n}),y_{n},z_{n}).

Since $(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(y,z),y,z):y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}\}$ and $g$ is Lipschitz continuous, $||\epsilon_{n}^{(2),b}||\rightarrow 0$ as $n\rightarrow\infty$ . Again using the third extension from Chapter-2 of (Borkar 2008a), it can be seen that the iterates (32) and (33) converge to an internally chain transitive invariant set of the ODE

\dot{y}(t)=g\left(\lambda(y(t),z(t)),y(t),z(t)\right)

\dot{z}(t)=0

For initial conditions $y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$ , the internally chain transitive invariant set of the above ODE is $\{(\Gamma(z),z)\}$ . Therefore,

(y_{n},z_{n})\rightarrow\{(\Gamma(z),z):z\in\mathbb{R}^{d_{3}}\}.

(34)

Combining (31) and (34) we get:

(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(\Gamma(z),z),\Gamma(z),z):z\in\mathbb{R}^{d_{3}}\}.

∎

Finally, we consider the slowest timescale of $\{c(n)\}$ . We define the piecewise linear continuous interpolation of the iterates $z_{n}$ as:

\bar{z}(t(n))=z_{n}

\bar{z}(t)=z_{n}+(z_{n+1}-z_{n})\frac{t-t(n)}{t(n+1)-t(n)},t\in[t(n),t(n+1)],

where $t(0)=0$ and $t(n)=\sum_{m=0}^{n-1}c(m),n\geq 1.$ Also, let $z^{s}(t),t\geq s$ , denote the unique solution to the below ODE starting at $s\in\mathbb{R}$ :

\dot{z}^{s}(t)=f(\lambda(\Gamma(z^{s}(t)),z^{s}(t)),\Gamma(z^{s}(t)),z^{s}(t)),t\geq s,

with $z^{s}(s)=\bar{z}(s)$ . Using the arguments as in Theorem-2, Chapter-6 of (Borkar 2008a), it can be shown that for any $T>0$

\lim_{s\rightarrow\infty}\sup_{t\in[s,s+T]}||\bar{z}(t)-z^{s}(t)||=0\mbox{ a.s. }

Subsequently arguing as in proof of Theorem-2, Chapter-2 of (Borkar 2008a), we get:

z_{n}\rightarrow z^{*}\mbox{ a.s. }

Using Lemma 6, we get:

(x_{n},y_{n},z_{n})\rightarrow\left(\lambda(\Gamma(z^{*}),z^{*}),\Gamma(z^{*}),z^{*}\right)\mbox{ a.s.}

∎

Next we provide sufficient conditions for $\bf(B5)$ to hold. Consider the following additional assumptions:

(B6)

The functions $h_{c}(x,y,z)\triangleq\frac{h(cx,cy,cz)}{c},c\geq 1$ satisfy $h_{c}\rightarrow h_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. For fixed $y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$ , the ODE

$\dot{x}(t)=h_{\infty}(x(t),y,z)$

has its unique globally asymptotically stable equilibrium $\lambda_{\infty}(y,z)$ , where $\lambda_{\infty}:\mathbb{R}^{d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$ is Lipschitz continuous. Further, $\lambda_{\infty}(0,0)=0$ , i.e.,

$\dot{x}(t)=h_{\infty}(x(t),0,0)$

has origin in $\mathbb{R}^{d_{1}}$ as unique globally asymptotically stable equilibrium.
(B7)

The functions $g_{c}(y,z)\triangleq\frac{g(c\lambda_{\infty}(y,z),cy,cz)}{c},c\geq 1$ satisfy $g_{c}\rightarrow g_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. For fixed $z\in\mathbb{R}^{d_{3}}$ , the ODE

$\dot{y}(t)=g_{\infty}(y(t),z)$

has its unique globally asymptotically stable equilibrium $\Gamma_{\infty}(z)$ , where $\Gamma_{\infty}:\mathbb{R}^{d_{3}}\rightarrow\mathbb{R}^{d_{2}}$ is Lipschitz continuous. Further, $\Gamma_{\infty}(0)=0$ , i.e.,

$\dot{y}(t)=g_{\infty}(y(t),0)$

has origin in $\mathbb{R}^{d_{2}}$ as its unique globally asymptotically stable equilibrium.
(B8)

The functions $f_{c}(z)\triangleq\frac{f(c\lambda_{\infty}(\Gamma_{\infty}(z),z),c\Gamma_{\infty}(z),cz)}{c},c\geq 1$ satisfy $f_{c}\rightarrow f_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. The ODE

$\dot{z}(t)=f_{\infty}(z(t))$

has the origin in $\mathbb{R}^{d_{3}}$ as its unique globally asymptotically stable equilibrium.

Theorem 7.

Under assumptions $\bf{(B1)}$ - $\bf{(B4)}$ and $\bf{(B6)}$ - $\bf{(B8)}$ ,

\sup_{n}(||x_{n}||+||y_{n}||+||z_{n}||)<\infty.

Proof.

We begin with the fastest time scale determined by the step size $a(n)$ . Consider the following definitions:

(F1)

Define

t(n)=\sum_{i=0}^{n-1}a(i),n\geq 1,\mbox{ with }t(0)=0.

Let $\psi_{k}=(x_{k},y_{k},z_{k}),k\geq 0$ , and

\bar{\psi}(t)=\psi_{n}+\left(\psi_{n+1}-\psi_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},\mbox{ \quad}t\in[t(n),t(n+1)].

(F2)

Given $t(n),n\geq 0$ and a constant $T>0$ define

$T_{0}=0,$

$T_{n}=\min(t(m):t(m)\geq T_{n-1}+T),n\geq 1.$

One can find a subsequence $\{m(n)\}$ such that $T_{n}=t(m(n))$ $\forall n$ and $m(n)\rightarrow\infty$ as $n\rightarrow\infty$ .
(F3)

The scaling sequence is defined as:

$r(n)=\max(r(n-1),||\bar{\psi}(T_{n})||,1),n\geq 1.$

(F4)

The scaled iterates for $m(n)\leq k\leq m(n+1)-1$ are:

\hat{x}_{m(n)}=\frac{x_{m(n)}}{r(n)},\mbox{\quad}\hat{y}_{m(n)}=\frac{y_{m(n)}}{r(n)},\mbox{\quad}\hat{z}_{m(n)}=\frac{z_{m(n)}}{r(n)},

\hat{x}_{k+1}=\hat{x}_{k}+a(k)\left(\frac{h(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(1)}\right)

\hat{y}_{k+1}=\hat{y}_{k}+a(k)\left(\epsilon_{k}^{(2),a}+\hat{M}_{k+1}^{(2)}\right)

\hat{z}_{k+1}=\hat{z}_{k}+a(k)\left(\epsilon_{k}^{(3),a}+\hat{M}_{k+1}^{(3)}\right)

where, $c=r(n)$ ,

\epsilon_{k}^{(2),a}=\frac{b(k)}{a(k)}\frac{g(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}

\epsilon_{k}^{(3),a}=\frac{c(k)}{a(k)}\frac{f(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}

\hat{M}_{k+1}^{(1)}=\frac{M_{k+1}^{(1)}}{r(n)},\hat{M}_{k+1}^{(2)}=\frac{b(k)}{a(k)}\frac{M_{k+1}^{(2)}}{r(n)},\hat{M}_{k+1}^{(3)}=\frac{c(k)}{a(k)}\frac{M_{k+1}^{(3)}}{r(n)}.

(F5)

Next we define the linearly interpolated trajectory for the scaled iterates as follows:

\hat{\psi}(t)=\hat{\psi}_{n}+(\hat{\psi}_{n+1}-\hat{\psi}_{n})\frac{t-t(n)}{t(n+1)-t(n)},\mbox{\quad}t\in[t(n),t(n+1)].

(F6)

Let $\psi_{n}(t)=(x_{n}(t),y_{n}(t),z_{n}(t)),\mbox{\quad}t\in[T_{n},T_{n+1}]$ denote the trajectory of the ODE:

$\dot{x}(t)=h_{r(n)}(x(t),y(t),z(t)),$

$\dot{y}(t)=0,$

$\dot{z}(t)=0,$

with $x_{n}(T_{n})=\hat{x}(T_{n})$ , $y_{n}(T_{n})=\hat{y}(T_{n})$ and $z_{n}(T_{n})=\hat{z}(T_{n})$ .
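The bookkeeping in $\bf{(F1)}$ to $\bf{(F3)}$ can be made concrete with a short sketch. The function name `rescaling_schedule`, the synthetic iterates `psi` and the initialization of $r(0)$ are assumptions made for illustration; the definitions above only specify $r(n)$ for $n\geq 1$. Since $T_{n}=t(m(n))$, the interpolated trajectory satisfies $\bar{\psi}(T_{n})=\psi_{m(n)}$, which the code uses directly.

```python
import numpy as np

def rescaling_schedule(psi, a, T):
    """Compute t(n), the indices m(n) with T_n = t(m(n)), and the scaling r(n).

    psi : array of shape (N + 1, d) holding psi_0, ..., psi_N
    a   : array of shape (N,) holding the step-sizes a(0), ..., a(N - 1)
    T   : sampling window length (T > 0)
    """
    t = np.concatenate(([0.0], np.cumsum(a)))        # t(0) = 0, t(n) = sum_{i<n} a(i)
    m = [0]                                          # T_0 = t(0) = 0
    r = [max(float(np.linalg.norm(psi[0])), 1.0)]    # r(0): an assumed initialization
    while True:
        target = t[m[-1]] + T                        # T_{n-1} + T
        idx = int(np.searchsorted(t, target, side="left"))  # first m with t(m) >= target
        if idx >= len(t):                            # ran out of iterates
            break
        m.append(idx)
        r.append(max(r[-1], float(np.linalg.norm(psi[idx])), 1.0))
    return t, np.array(m), np.array(r)

# Synthetic data, for illustration only.
N = 1000
a = 1.0 / (np.arange(N) + 1.0) ** 0.6
psi = np.outer(np.sin(np.arange(N + 1)), np.array([1.0, 2.0, 3.0]))
t, m, r = rescaling_schedule(psi, a, T=1.0)
T_n = t[m]
```

Within each window $[T_{n},T_{n+1})$ the scaled iterates of $\bf{(F4)}$ are then obtained by dividing $\psi_{k}$, $m(n)\leq k\leq m(n+1)-1$, by `r[n]`.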

First we state four lemmas for ODEs with two external inputs. The proofs of these lemmas follow exactly as Lemmas 2, 3, 4 and 5 of (Lakshminarayanan and Bhatnagar 2017). Subsequently when we analyze the middle timescale (timescale of $\{b(n)\}$ ) and slow timescale (timescale of $\{c(n)\}$ ) recursions, we restate the corresponding lemmas for ODEs with one and no external inputs respectively. Let $x_{c}^{y(t),z(t)}(t,x)$ and $x_{\infty}^{y(t),z(t)}(t,x)$ denote the solution to the ODEs

\dot{x}(t)=h_{c}(x(t),y(t),z(t)),\mbox{\quad}t\geq 0,

\dot{x}(t)=h_{\infty}(x(t),y(t),z(t)),\mbox{\quad}t\geq 0,

respectively, with initial condition $x\in\mathbb{R}^{d_{1}}$ and the external inputs $y(t)\in\mathbb{R}^{d_{2}}$ and $z(t)\in\mathbb{R}^{d_{3}}$ . Throughout the paper, $B(x,r)\triangleq\{q\in\mathbb{R}^{d_{1}}\Big{|}||q-x||<r\},B(y,r)\triangleq\{q\in\mathbb{R}^{d_{2}}\Big{|}||q-y||<r\}$ and $B(z,r)\triangleq\{q\in\mathbb{R}^{d_{3}}\Big{|}||q-z||<r\}$ denote the ball of radius $r$ around $x,y$ and $z$ respectively.

Lemma 8.

Let $K\subset\mathbb{R}^{d_{1}}$ be a compact set, $y\in\mathbb{R}^{d_{2}}$ and $z\in\mathbb{R}^{d_{3}}$ be fixed external inputs. Then under $\bf{(B6)}$ , given $\delta>0$ , $\exists T_{\delta}>0$ such that $\forall x\in K$

x^{y,z}_{\infty}(t,x)\in B(\lambda_{\infty}(y,z),\delta),\forall t\geq T_{\delta}.
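As an illustration of Lemma 8, consider the linear case $h_{\infty}(x,y,z)=-\bar{A}^{T}y-wx$ that arises for GTD-M, where the solution is available in closed form: $x_{\infty}^{y,z}(t,x)=\lambda_{\infty}(y,z)+e^{-wt}(x-\lambda_{\infty}(y,z))$. The sketch below is a hedged example, not part of the proof; the values of `A_bar`, `w`, `y` and the radius of the compact set are hypothetical, and the factor $2$ in `T_delta` is a convenience margin that makes the inclusion in the open ball strict.

```python
import numpy as np

# Hypothetical instance of the linear case h_inf(x, y, z) = -A^T y - w x.
A_bar = np.array([[-0.6, 0.2],
                  [-0.1, -0.5]])
w = 0.5
y = np.array([1.0, -2.0])                 # fixed external input
lam_inf = -A_bar.T @ y / w                # equilibrium lambda_inf(y, z)

def x_inf(t, x0):
    """Closed-form solution of x' = -A^T y - w x started at x0."""
    return lam_inf + np.exp(-w * t) * (x0 - lam_inf)

def T_delta(delta, radius):
    """A time after which every start in K = {||x|| <= radius} is delta-close."""
    D = radius + np.linalg.norm(lam_inf)  # bounds ||x0 - lam_inf|| over K
    return max(np.log(2.0 * D / delta) / w, 0.0)

delta, radius = 1e-3, 10.0
T = T_delta(delta, radius)
starts = [radius * np.array([np.cos(a), np.sin(a)])
          for a in np.linspace(0.0, 2.0 * np.pi, 16)]
worst_gap = max(np.linalg.norm(x_inf(T + s, x0) - lam_inf)
                for x0 in starts for s in (0.0, 1.0, 10.0))
```

Here `worst_gap` stays below `delta` for every sampled starting point on the boundary of $K$, and the same $T_{\delta}$ works uniformly over $K$, as the lemma requires.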

Lemma 9.

Let $x\in\mathbb{R}^{d_{1}}$ , $y\in\mathbb{R}^{d_{2}}$ , $z\in\mathbb{R}^{d_{3}}$ , $[0,T]$ be a given time interval and $r>0$ . Let $y^{\prime}(t)\in B(y,r),$ $z^{\prime}(t)\in B(z,r)$ $\forall t\in[0,T]$ , then

||x_{c}^{y^{\prime}(t),z^{\prime}(t)}(t,x)-x_{\infty}^{y,z}(t,x)||\leq(\epsilon(c)+2Lr)Te^{LT},\mbox{\quad}\forall t\in[0,T],

where $\epsilon(c)\rightarrow 0$ as $c\rightarrow\infty$ and $L$ denotes the Lipschitz constant of $h$ .

Lemma 10.

Let $y\in\mathbb{R}^{d_{2}}$ and $z\in\mathbb{R}^{d_{3}}$ . Then given $\epsilon>0$ and $T>0$ , $\exists c_{\epsilon,T}>0$ , $\delta_{\epsilon,T}>0$ and $r_{\epsilon,T}>0$ such that $\forall x\in B(\lambda_{\infty}(y,z),\delta_{\epsilon,T})$ , $\forall c>c_{\epsilon,T}$ and all external inputs satisfying $y^{\prime}(s)\in B(y,r_{\epsilon,T})$ and $z^{\prime}(s)\in B(z,r_{\epsilon,T})$ , $\forall s\in[0,T]$ ,

x_{c}^{y^{\prime}(t),z^{\prime}(t)}(t,x)\in B(\lambda_{\infty}(y,z),2\epsilon)\mbox{\quad}\forall t\in[0,T].

Lemma 11.

Let $x\in B(0,1)\subset\mathbb{R}^{d_{1}},y\in K^{\prime}\subset\mathbb{R}^{d_{2}},z\in K^{\prime\prime}\subset\mathbb{R}^{d_{3}}$ and let $\bf{(B6)}$ hold. Then given $\epsilon>0,\exists c_{\epsilon}\geq 1,r_{\epsilon}>0$ and $T_{\epsilon}>0$ such that for any external input satisfying $y^{\prime}(s)\in B(y,r_{\epsilon})$ , $z^{\prime}(s)\in B(z,r_{\epsilon})$ , $\forall s\in[0,T],$

||x_{c}^{y^{\prime}(t),z^{\prime}(t)}(t,x)-\lambda_{\infty}(y,z)||\leq 2\epsilon,\mbox{\quad}\forall c>c_{\epsilon},t\geq T_{\epsilon}.

The next lemma uses the convergence result of three time scale iterates under the stability assumption of $\bf{(B5)}$ (Theorem 5) and shows that the scaled iterates defined in $\bf{(F4)}$ converge.

Lemma 12.

Under $\bf{(B1)}-\bf{(B3)}$ ,

(i)

For $0\leq k\leq m(n+1)-m(n)$ , $||\hat{\psi}(t(m(n)+k))||\leq K^{(1)}$ a.s. for some constant $K^{(1)}>0.$
(ii)

$\lim_{n\rightarrow\infty}||\hat{\psi}(t)-\psi_{n}(t)||=0\mbox{ a.s. }\forall t\in[T_{n},T_{n+1}]$

Proof.

(i)

Follows as in (Lemma 4, Chapter-3, pp. 24, Borkar (2008a)).
(ii)

By construction, the iterates $\hat{x}_{k}$ , $\hat{y}_{k}$ , $\hat{z}_{k}$ remain bounded, i.e., $\sup_{k}(||\hat{x}_{k}||+||\hat{y}_{k}||+||\hat{z}_{k}||)<\infty$ a.s. Therefore, $\bf{(B1)}$ - $\bf{(B4)}$ are satisfied. Using Theorem 5, the iterates $(\hat{x}_{n},\hat{y}_{n},\hat{z}_{n})$ converge. Using the third extension from Chapter-2 of (Borkar 2008a), the iterates $(\hat{x}_{n},\hat{y}_{n},\hat{z}_{n})$ track the ODE system

$\dot{x}(t)=h_{r(n)}(x(t),y(t),z(t)),$

$\dot{y}(t)=0,$

$\dot{z}(t)=0.$

Therefore, $\lim_{n\rightarrow\infty}||\hat{\psi}(t)-\psi_{n}(t)||=0\mbox{ a.s. }\forall t\in[T_{n},T_{n+1}]$

∎

In particular, Lemma 12(i) shows that along the fastest timescale between instants $T_{n}$ and $T_{n+1}$ , the norm of the scaled iterate can grow at most by a factor $K^{(1)}$ starting from $B(0,1)$ . Next, Lemma 12(ii) shows that the scaled iterate asymptotically tracks the ODE defined in $\bf{(F6)}$ . The next theorem bounds $||x_{n}||$ in terms of $||y_{n}||$ and $||z_{n}||$ . We define the linearly interpolated trajectory of the three iterates as: $\forall t\in[t(n),t(n+1)]$ ,

\bar{x}(t)=x_{n}+\left(x_{n+1}-x_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},

\bar{y}(t)=y_{n}+\left(y_{n+1}-y_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},

\bar{z}(t)=z_{n}+\left(z_{n+1}-z_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)}.

Theorem 13.

Under assumptions $\bf{(B1)}$ - $\bf{(B4)}$ and $\bf{(B6)}$ ,

(i)

For $n$ large, and $T=T_{\frac{1}{4}}$ (here $T$ is the sampling frequency as in (F2) and $T_{\frac{1}{4}}$ is $T_{\epsilon}$ as in Lemma 11 with $\epsilon=\frac{1}{4}$ ), if $||\bar{x}(T_{n})||>C_{a}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)$ , for some $C_{a}>0$ then $||\bar{x}(T_{n+1})||\leq\frac{3}{4}||\bar{x}(T_{n})||$
(ii)

$||\bar{x}(T_{n})||\leq C_{a}^{*}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)$ a.s. for some $C_{a}^{*}>0$ .
(iii)

$||x_{n}||\leq K_{a}^{*}(1+||y_{n}||+||z_{n}||),\mbox{ for some }K_{a}^{*}>0$

Proof.

(i)

We have $||\bar{x}(T_{n})||>C_{a}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||).$ Since, $r(n)=\max(r(n-1),||\bar{\psi}(T_{n})||,1)$ , this implies $r(n)\geq||\bar{\psi}(T_{n})||$ . Therefore, $r(n)\geq C_{a}$ . Next we show that

||\hat{y}(T_{n})||<\frac{1}{C_{a}}\mbox{ and }||\hat{z}(T_{n})||<\frac{1}{C_{a}}.

For $p\geq 1$ ,

\begin{split}||\hat{y}(T_{n})||_{p}&=\frac{||\bar{y}(T_{n})||_{p}}{r(n)}\leq\frac{||\bar{y}(T_{n})||_{p}}{||\bar{\psi}(T_{n})||_{p}}\\ &=\frac{||\bar{y}(T_{n})||_{p}}{\Big{(}||\bar{x}(T_{n})||_{p}^{p}+||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}\Big{)}^{\frac{1}{p}}}\end{split}

Since, $||\bar{x}(T_{n})||_{p}\geq C_{a}(1+||\bar{y}(T_{n})||_{p}+||\bar{z}(T_{n})||_{p})$ ,

\begin{split}||\bar{x}(T_{n})||_{p}^{p}&\geq C_{a}^{p}\Big{(}||\bar{y}(T_{n})||_{p}+||\bar{z}(T_{n})||_{p}\Big{)}^{p}\\ &\geq C_{a}^{p}\Big{(}||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}\Big{)}\\ \end{split}

(35)

Therefore,

\begin{split}||\hat{y}(T_{n})||&\leq\frac{||\bar{y}(T_{n})||_{p}}{\Big{(}C_{a}^{p}+1\Big{)}^{\frac{1}{p}}\Big{(}||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}\Big{)}^{\frac{1}{p}}}\\ &\leq\frac{1}{\Big{(}1+C_{a}^{p}\Big{)}^{\frac{1}{p}}}\\ &<\frac{1}{C_{a}}.\end{split}

The second inequality follows from the fact that $||\bar{y}(T_{n})||_{p}^{p}\leq||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}$ . A similar analysis proves $||\hat{z}(T_{n})||<\frac{1}{C_{a}}$ . Next we show that

||\hat{x}(T_{n})||_{p}>\frac{1}{1+\frac{1}{C_{a}}}.

Here we are considering the case when the iterates are blowing up. Therefore let $r(n)=||\bar{\psi}(T_{n})||$ . Then,

\begin{split}||\hat{x}(T_{n})||&=\frac{||\bar{x}(T_{n})||}{||\bar{\psi}(T_{n})||}\\ &=\frac{||\bar{x}(T_{n})||}{\Big{(}||\bar{x}(T_{n})||_{p}^{p}+||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}\Big{)}^{\frac{1}{p}}}\\ &=\frac{1}{\Big{(}1+\frac{||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}}{||\bar{x}(T_{n})||_{p}^{p}}\Big{)}^{\frac{1}{p}}}\\ &>\frac{1}{\Big{(}1+\frac{||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}}{C_{a}^{p}(||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p})}\Big{)}^{\frac{1}{p}}}\\ &>\frac{1}{1+\frac{1}{C_{a}}}.\end{split}

Let $y^{\prime}(t-T_{n})=y_{n}(t)$ and $z^{\prime}(t-T_{n})=z_{n}(t)$ $\forall t\in[T_{n},T_{n+1}]$ . From Lemma 11, $\exists r_{\frac{1}{4}},c_{\frac{1}{4}},T_{\frac{1}{4}}$ such that

||x_{c}^{y^{\prime}(t),z^{\prime}(t)}(t,\hat{x}(T_{n}))||\leq\frac{1}{4},\forall t\geq T_{\frac{1}{4}},\forall c\geq c_{\frac{1}{4}},

whenever $y^{\prime}(t)\in B(0,r_{\frac{1}{4}})$ and $z^{\prime}(t)\in B(0,r_{\frac{1}{4}})$ . Choose $C_{a}>\max(c_{\frac{1}{4}},\frac{2}{r_{\frac{1}{4}}})$ and $T=T_{\frac{1}{4}}$ . Since $\dot{y}(t)=0,$ and $\dot{z}(t)=0$ for the ODE defined in $\textbf{(F6)},y^{\prime}(t-T_{n})=y_{n}(t)=\hat{y}(T_{n})$ and $z^{\prime}(t-T_{n})=z_{n}(t)=\hat{z}(T_{n})$ $\forall t\in[T_{n},T_{n+1}].$ From $||\hat{y}(T_{n})||<\frac{1}{C_{a}}$ and $||\hat{z}(T_{n})||<\frac{1}{C_{a}}$ , it follows that $y^{\prime}(s)\in B(0,r_{\frac{1}{4}})$ and $z^{\prime}(s)\in B(0,r_{\frac{1}{4}})$ $\forall s\in[0,T].$ Using Lemma 12(ii), $||\hat{x}(T_{n+1}^{-})-x_{n}(T_{n+1})||<\frac{1}{4}$ for large enough $n$ . Also observe that $||x_{n}(T_{n+1})||=||x^{y^{\prime}(t),z^{\prime}(t)}_{r(n)}(T_{n+1}-T_{n},\hat{x}(T_{n}))||\leq\frac{1}{4}$ . Using these, we have $||\hat{x}(T_{n+1}^{-})||\leq||\hat{x}(T_{n+1}^{-})-x_{n}(T_{n+1})||+||x_{n}(T_{n+1})||\leq\frac{1}{2}$ . Finally since

\frac{||\bar{x}(T_{n+1})||}{||\bar{x}(T_{n})||}=\frac{||\hat{x}(T_{n+1}^{-})||}{||\hat{x}(T_{n})||},

we have

\begin{split}||\bar{x}(T_{n+1})||&=\frac{||\hat{x}(T_{n+1}^{-})||}{||\hat{x}(T_{n})||}||\bar{x}(T_{n})||\\ &<\frac{\frac{1}{2}}{\frac{1}{1+1/C_{a}}}||\bar{x}(T_{n})||\end{split}

Choosing $C_{a}>\max\left(c_{\frac{1}{4}},\frac{2}{r_{\frac{1}{4}}}\right)>2$ , proves the claim.

(ii) and (iii) follow along the lines of arguments in (Lakshminarayanan and Bhatnagar 2017) Lemma 6 (ii) and (iii) respectively. ∎

Next we consider the middle timescale of $\{b(n)\}$ and re-define the following terms:

(M1)

Define

t(n)=\sum_{i=0}^{n-1}b(i),n\geq 1\mbox{ with }t(0)=0.

Let $\psi_{n}=(x_{n},y_{n},z_{n})$ and

\bar{\psi}(t)=\psi_{n}+\left(\psi_{n+1}-\psi_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},\mbox{ \quad}t\in[t(n),t(n+1)].

(M2)

Given $t(n),n\geq 0$ and a constant $T>0$ define

$T_{0}=0,$

$T_{n}=\min(t(m):t(m)\geq T_{n-1}+T),n\geq 1$

One can find a subsequence $\{m(n)\}$ such that $T_{n}=t(m(n))$ $\forall n$ , and $m(n)\rightarrow\infty$ as $n\rightarrow\infty$ .
(M3)

The scaling sequence is defined as:

$r(n)=\max(r(n-1),||\bar{\psi}(T_{n})||,1),n\geq 1$

(M4)

The scaled iterates for $m(n)\leq k\leq m(n+1)-1$ are:

\hat{x}_{m(n)}=\frac{x_{m(n)}}{r(n)},\mbox{\quad}\hat{y}_{m(n)}=\frac{y_{m(n)}}{r(n)},\mbox{\quad}\hat{z}_{m(n)}=\frac{z_{m(n)}}{r(n)},

\hat{x}_{k+1}=\hat{x}_{k}+a(k)\left(\frac{h(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(1)}\right),

\hat{y}_{k+1}=\hat{y}_{k}+b(k)\left(\frac{g(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(2)}\right),

\hat{z}_{k+1}=\hat{z}_{k}+b(k)\left(\epsilon_{k}^{(3),b}+\hat{M}_{k+1}^{(3)}\right),

where, $c=r(n)$ ,

\epsilon_{k}^{(3),b}=\frac{c(k)}{b(k)}\frac{f(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c},

\hat{M}_{k+1}^{(1)}=\frac{M_{k+1}^{(1)}}{r(n)},

\hat{M}_{k+1}^{(2)}=\frac{M_{k+1}^{(2)}}{r(n)},

\hat{M}_{k+1}^{(3)}=\frac{c(k)}{b(k)}\frac{M_{k+1}^{(3)}}{r(n)}.

(M5)

Next, we define the linearly interpolated trajectory for the scaled iterates as follows:

\hat{\psi}(t)=\hat{\psi}_{n}+(\hat{\psi}_{n+1}-\hat{\psi}_{n})\frac{t-t(n)}{t(n+1)-t(n)},\mbox{\quad}t\in[t(n),t(n+1)].

(M6)

Let $\psi_{n}(t)=(x_{n}(t),y_{n}(t),z_{n}(t)),\mbox{\quad}t\in[T_{n},T_{n+1}]$ denote the trajectory of the ODE:

$\dot{x}(t)=h_{r(n)}(x(t),y(t),z(t)),$

$\dot{y}(t)=g_{r(n)}(y(t),z(t)),$

$\dot{z}(t)=0,$

with $x_{n}(T_{n})=\hat{x}(T_{n})$ , $y_{n}(T_{n})=\hat{y}(T_{n})$ and $z_{n}(T_{n})=\hat{z}(T_{n})$ .
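
The sampling construction in (M1) and (M2) can be illustrated numerically. The following is a minimal sketch, not part of the analysis: the function name `sampling_instants` and the constant step-size sequence in the usage note are hypothetical choices made only for illustration. It computes $t(n)$ as cumulative sums of the step-sizes and then picks the indices $m(n)$ with $T_{n}=t(m(n))$.

```python
import numpy as np

def sampling_instants(step_sizes, T):
    """Return t(0..N) and the indices m(n) such that T_n = t(m(n)).

    step_sizes: the sequence b(0), ..., b(N-1) (illustrative input).
    T: the constant T > 0 used in (M2).
    """
    # t(0) = 0 and t(n) = sum_{i=0}^{n-1} b(i) for n >= 1.
    t = np.concatenate(([0.0], np.cumsum(step_sizes)))
    m = [0]  # T_0 = 0 = t(0)
    while True:
        target = t[m[-1]] + T  # T_{n-1} + T
        # First index whose t value is >= target, i.e. the min in (M2).
        idx = int(np.searchsorted(t, target, side="left"))
        if idx >= len(t):
            break  # ran out of iterates in this finite illustration
        m.append(idx)
    return t, m
```

For example, with ten steps of size $0.5$ and $T=1$, `sampling_instants([0.5] * 10, 1.0)` yields the indices `[0, 2, 4, 6, 8, 10]`, so consecutive sampling instants are at least $T$ apart.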

As before we state a few lemmas for ODEs with one external input. These follow along the lines of Lemmas 2-5 of (Lakshminarayanan and Bhatnagar 2017). Let $y_{c}^{z(t)}(t,y)$ and $y_{\infty}^{z(t)}(t,y)$ denote the solution to the ODEs

\dot{y}(t)=g_{c}(y(t),z(t)),\mbox{\quad}t\geq 0,

\dot{y}(t)=g_{\infty}(y(t),z(t)),\mbox{\quad}t\geq 0,

respectively, with initial condition $y\in\mathbb{R}^{d_{2}}$ and the external input $z(t)\in\mathbb{R}^{d_{3}}$ .

Lemma 14.

Let $K\subset\mathbb{R}^{d_{2}}$ be a compact set and $z\in\mathbb{R}^{d_{3}}$ . Then under $\bf{(B6)}$ , given $\delta>0$ , $\exists T_{\delta}>0$ such that $\forall y\in K$

y^{z}_{\infty}(t,y)\in B(\Gamma_{\infty}(z),\delta),\forall t\geq T_{\delta}.

Lemma 15.

Let $y\in\mathbb{R}^{d_{2}}$ , $z\in\mathbb{R}^{d_{3}}$ , $[0,T]$ be a given time interval and $r>0$ . Let $z^{\prime}(t)\in B(z,r),\forall t\in[0,T]$ , then

||y_{c}^{z^{\prime}(t)}(t,y)-y_{\infty}^{z}(t,y)||\leq(\epsilon(c)+Lr)Te^{LT},\mbox{\quad}\forall t\in[0,T],

where $\epsilon(c)\rightarrow 0$ as $c\rightarrow\infty$ .

Lemma 16.

Let $z\in\mathbb{R}^{d_{3}}$ then given $\epsilon>0$ and $T>0$ , $\exists c_{\epsilon,T}>0$ , $\delta_{\epsilon,T}>0$ and $r_{\epsilon,T}>0$ such that $\forall t\in[0,T)$ , $\forall y\in B(\Gamma_{\infty}(z),\delta_{\epsilon,T})$ $\forall c>c_{\epsilon,T}$ and external input $z^{\prime}(s)\in B(z,r_{\epsilon,T})$ ,

y_{c}^{z^{\prime}(t)}(t,y)\in B(\Gamma_{\infty}(z),2\epsilon)\mbox{\quad}\forall t\in[0,T].

Lemma 17.

Let $y\in B(0,1)\subset\mathbb{R}^{d_{2}},z\in K^{\prime}\subset\mathbb{R}^{d_{3}},$ and $\bf{(B7)}$ holds. Then given $\epsilon>0,\exists c_{\epsilon}\geq 1,r_{\epsilon}>0$ and $T_{\epsilon}>0$ such that for any external input satisfying $z^{\prime}(s)\in B(z,r_{\epsilon})$ , $\forall s\in[0,T]$ ,

||y_{c}^{z^{\prime}(t)}(t,y)-\Gamma_{\infty}(z)||\leq 2\epsilon,\mbox{\quad}\forall c>c_{\epsilon},t\geq T_{\epsilon}.

Lemma 18.

Under $\bf{(B1)}-\bf{(B3)}$ ,

(i)

For $0\leq k\leq m(n+1)-m(n)$ , $||\hat{\psi}(t(m(n)+k))||\leq K^{(2)}$ a.s. for some constant $K^{(2)}>0.$
(ii)

For sufficiently large $n$ , we have $\sup_{[T_{n},T_{n+1})}||\hat{y}(t)-y_{n}(t)||=\epsilon(c)LTe^{L(L+1)T}\mbox{ a.s. where }\epsilon(c)\rightarrow 0\mbox{ as }c\rightarrow\infty$

Proof.

See Lemma 9 of (Lakshminarayanan and Bhatnagar 2017) ∎

Theorem 19.

Assume $\bf{(B1)}$ - ${\bf{(B4)}}$ and $\bf{(B6)-(B8)}$ hold. Then, with $C^{*}_{a}$ as defined in Theorem 13,

(i)

For large $n$ and $T=T_{1/8(C^{*}_{a}+1)}$ (here $T$ is the sampling frequency as in (M2) and $T_{1/8(C^{*}_{a}+1)}$ is $T_{\epsilon}$ as in Lemma 17 with $\epsilon=1/8(C^{*}_{a}+1)$ ), if $||\bar{y}(T_{n})||>C_{b}(1+||\bar{z}(T_{n})||)$ , for some $C_{b}>0$ , then $||\bar{y}(T_{n+1})||<\frac{5}{8}||\bar{y}(T_{n})||$ .
(ii)

$||\bar{y}(T_{n})||\leq C_{b}^{*}\left(1+||\bar{z}(T_{n})||\right)$ , for some $C_{b}^{*}>0$
(iii)

$||y_{n}||\leq K_{b}^{*}(1+||z_{n}||)$ , for some $K_{b}^{*}>0$

Proof.

(i)

Since $||\bar{y}(T_{n})||>C_{b}(1+||\bar{z}(T_{n})||)$ , $r(n)>C_{b}$ . We first show that $||\hat{z}(T_{n})||<\frac{1}{C_{b}}$ .

||\hat{z}(T_{n})||=\frac{||\bar{z}(T_{n})||_{p}}{r(n)}\leq\frac{||\bar{z}(T_{n})||_{p}}{(||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p})^{\frac{1}{p}}}

Since $||\bar{y}(T_{n})||_{p}>C_{b}(1+||\bar{z}(T_{n})||)$ , $||\bar{y}(T_{n})||_{p}^{p}>C_{b}^{p}||\bar{z}(T_{n})||_{p}^{p}$ . Therefore,

\begin{split}||\hat{z}(T_{n})||&<\frac{||\bar{z}(T_{n})||_{p}}{\left((1+C_{b}^{p})||\bar{z}(T_{n})||_{p}^{p}\right)^{\frac{1}{p}}}\\ &=\frac{1}{(1+C_{b}^{p})^{\frac{1}{p}}}\\ &<\frac{1}{C_{b}}\end{split}

Next we show that $||\hat{y}(T_{n})||>\frac{1}{(C^{*}_{a}+1)(2+\frac{1}{C_{b}})}$ , where $C^{*}_{a}$ is as defined in Theorem 13. Here again we are considering the case when the iterates are blowing up. Therefore let $r(n)=||\bar{\psi}(T_{n})||$ . Now, from Theorem 13 (ii), we know $||\bar{x}(T_{n})||\leq C_{a}^{*}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)$ and therefore, $r(n)\leq C_{a}^{*}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||$ . With this we have,

\begin{split}||\hat{y}(T_{n})||_{p}&\geq\frac{||\bar{y}(T_{n})||}{C^{*}_{a}+(C^{*}_{a}+1)(||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p})^{\frac{1}{p}}}\\ &>\frac{1}{C^{*}_{a}+(C^{*}_{a}+1)(1+\frac{1}{C_{b}})}\\ &>\frac{1}{(C^{*}_{a}+1)(2+\frac{1}{C_{b}})}.\end{split}

Now we proceed as in Theorem 13 (i). Let $z^{\prime}(t-T_{n})=z_{n}(t)$ $\forall t\in[T_{n},T_{n+1}]$ . From Lemma 17, $\exists r_{1/8(C^{*}_{a}+1)},c_{1/8(C^{*}_{a}+1)},T_{1/8(C^{*}_{a}+1)}>0$ such that

||y_{c}^{z^{\prime}(t)}(t,\hat{y}(T_{n}))||\leq\frac{1}{8(C^{*}_{a}+1)},\forall t\geq T_{1/8(C^{*}_{a}+1)},\forall c\geq c_{1/8(C^{*}_{a}+1)},

whenever $z^{\prime}(t)\in B(0,r_{1/8(C^{*}_{a}+1)})$ . Choose $T=T_{1/8(C^{*}_{a}+1)}$ and $C_{b}>\max\left(c_{1/8(C^{*}_{a}+1)},\frac{2}{r_{1/8(C^{*}_{a}+1)}}\right)$ . Since $\dot{z}(t)=0$ for the ODE defined in (M6), $z^{\prime}(t-T_{n})=z_{n}(t)=\hat{z}(T_{n})$ $\forall t\in[T_{n},T_{n+1}]$ . From $||\hat{z}(T_{n})||<\frac{1}{C_{b}}$ , it then follows that $z^{\prime}(s)\in B(0,r_{1/8(C^{*}_{a}+1)})$ $\forall s\in[0,T].$ Using Lemma 18(ii), $\exists C_{1}>0$ s.t. $||\hat{y}(T_{n+1}^{-})-y_{n}(T_{n+1})||<\frac{1}{8(C^{*}_{a}+1)}$ for large enough $n$ and $r(n)>C_{1}$ . We therefore further require $C_{b}>\max(c_{1/8(C^{*}_{a}+1)},\frac{2}{r_{1/8(C^{*}_{a}+1)}},C_{1})$ . Also observe that $||y_{n}(T_{n+1})||=||y^{z^{\prime}(t)}_{r(n)}(T_{n+1}-T_{n},\hat{y}(T_{n}))||\leq\frac{1}{8(C^{*}_{a}+1)}$ . Using these, we have $||\hat{y}(T_{n+1}^{-})||\leq||\hat{y}(T_{n+1}^{-})-y_{n}(T_{n+1})||+||y_{n}(T_{n+1})||\leq\frac{1}{4(C^{*}_{a}+1)}$ . Finally since

\frac{||\bar{y}(T_{n+1})||}{||\bar{y}(T_{n})||}=\frac{||\hat{y}(T_{n+1}^{-})||}{||\hat{y}(T_{n})||},

we have

\begin{split}||\bar{y}(T_{n+1})||&=\frac{||\hat{y}(T_{n+1}^{-})||}{||\hat{y}(T_{n})||}||\bar{y}(T_{n})||\\ &<\frac{\frac{1}{4(C^{*}_{a}+1)}}{\frac{1}{(C^{*}_{a}+1)(2+1/C_{b})}}||\bar{y}(T_{n})||\\ &=\frac{2+\frac{1}{C_{b}}}{4}||\bar{y}(T_{n})||.\end{split}

Choosing $C_{b}>\max\left(c_{1/8(C^{*}_{a}+1)},\frac{2}{r_{1/8(C^{*}_{a}+1)}},C_{1}\right)>2$ , proves the claim.

(ii) and (iii) follow along the lines of arguments in (Lakshminarayanan and Bhatnagar 2017), Lemma 6 (ii) and (iii), respectively.

∎
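
For completeness, the arithmetic behind the factor $\frac{5}{8}$ in Theorem 19 (i) can be written out. Since the choice $C_{b}>2$ gives $\frac{1}{C_{b}}<\frac{1}{2}$ , the ratio obtained at the end of part (i) of the proof satisfies

\frac{2+\frac{1}{C_{b}}}{4}<\frac{2+\frac{1}{2}}{4}=\frac{5}{8}.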

Finally we consider the slowest timescale corresponding to $\{c(n)\}$ . As before we redefine the terms as follows:

(S1)

Define

t(n)=\sum_{i=0}^{n-1}c(i),n\geq 1\mbox{ with }t(0)=0.

Let $\psi_{n}=(x_{n},y_{n},z_{n})$ and

\bar{\psi}(t)=\psi_{n}+\left(\psi_{n+1}-\psi_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},\mbox{ \quad}t\in[t(n),t(n+1)].

(S2)

Given $t(n),n\geq 0$ and a constant $T>0$ define

$T_{0}=0,$

$T_{n}=\min(t(m):t(m)\geq T_{n-1}+T),n\geq 1$

There exists some subsequence $\{m(n)\}$ such that $T_{n}=t(m(n))$ and $m(n)\rightarrow\infty$ as $n\rightarrow\infty$ .
(S3)

The scaling sequence is defined as:

$r(n)=\max(r(n-1),||\bar{\psi}(T_{n})||,1),n\geq 1$

(S4)

The scaled iterates for $m(n)\leq k\leq m(n+1)-1$ are:

\hat{x}_{m(n)}=\frac{x_{m(n)}}{r(n)},\mbox{\quad}\hat{y}_{m(n)}=\frac{y_{m(n)}}{r(n)},\mbox{\quad}\hat{z}_{m(n)}=\frac{z_{m(n)}}{r(n)},

\hat{x}_{k+1}=\hat{x}_{k}+a(k)\left(\frac{h(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(1)}\right),

\hat{y}_{k+1}=\hat{y}_{k}+b(k)\left(\frac{g(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(2)}\right),

\hat{z}_{k+1}=\hat{z}_{k}+c(k)\left(\frac{f(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(3)}\right),

where, $c=r(n)$ ,

\hat{M}_{k+1}^{(1)}=\frac{M_{k+1}^{(1)}}{r(n)},\hat{M}_{k+1}^{(2)}=\frac{M_{k+1}^{(2)}}{r(n)},\hat{M}_{k+1}^{(3)}=\frac{M_{k+1}^{(3)}}{r(n)}.

(S5)

Next we define the linearly interpolated trajectory for the scaled iterates as follows:

\hat{\psi}(t)=\hat{\psi}_{n}+(\hat{\psi}_{n+1}-\hat{\psi}_{n})\frac{t-t(n)}{t(n+1)-t(n)},\mbox{\quad}t\in[t(n),t(n+1)].

(S6)

Let $\psi_{n}(t)=(x_{n}(t),y_{n}(t),z_{n}(t)),\mbox{\quad}t\in[T_{n},T_{n+1}]$ denote the trajectory of the ODE:

$\dot{x}(t)=h_{r(n)}(x(t),y(t),z(t)),$

$\dot{y}(t)=g_{r(n)}(y(t),z(t)),$

$\dot{z}(t)=f_{r(n)}(z(t)),$

with $x_{n}(T_{n})=\hat{x}(T_{n})$ , $y_{n}(T_{n})=\hat{y}(T_{n})$ and $z_{n}(T_{n})=\hat{z}(T_{n})$ .

We again state some results on ODEs, this time with no external input. These again follow along the lines of Lemma 2-5 in (Lakshminarayanan and Bhatnagar 2017). Let $z_{c}(t,z)$ and $z_{\infty}(t,z)$ denote the solution to the ODEs

\dot{z}(t)=f_{c}(z(t)),\mbox{\quad}t\geq 0,

\dot{z}(t)=f_{\infty}(z(t)),\mbox{\quad}t\geq 0,

respectively with initial condition $z\in\mathbb{R}^{d_{3}}$ .

Lemma 20.

Let $K\subset\mathbb{R}^{d_{3}}$ be a compact set. Then under $\bf{(B8)}$ , given $\delta>0$ , $\exists T_{\delta}>0$ such that $\forall z\in K$

z_{\infty}(t,z)\in B(0,\delta),\forall t\geq T_{\delta}.

Lemma 21.

Let $z\in\mathbb{R}^{d_{3}}$ , $[0,T]$ be a given time interval and $r>0$ . Then

||z_{c}(t,z)-z_{\infty}(t,z)||\leq(\epsilon(c))Te^{LT},\mbox{\quad}\forall t\in[0,T],

where $\epsilon(c)\rightarrow 0$ as $c\rightarrow\infty$ .

Lemma 22.

Given $\epsilon>0$ and $T>0$ $\exists c_{\epsilon,T}>0$ , $\delta_{\epsilon,T}>0$ and $r_{\epsilon,T}>0$ such that $\forall t\in[0,T)$ , $\forall z\in B(0,\delta_{\epsilon,T})$ , $\forall c>c_{\epsilon,T}$ ,

z_{c}(t,z)\in B(0,2\epsilon)\mbox{\quad},\forall t\in[0,T].

Lemma 23.

Let $z\in B(0,1)\subset\mathbb{R}^{d_{3}}$ and let $\bf{(B8)}$ hold. Then given $\epsilon>0,\exists c_{\epsilon}\geq 1,r_{\epsilon}>0$ and $T_{\epsilon}>0$ such that

||z_{c}(t,z)||\leq 2\epsilon,\mbox{\quad}\forall c>c_{\epsilon},t\geq T_{\epsilon}.

Lemma 24.

Under $\bf{(B1)}-\bf{(B3)}$ ,

(i)

For $0\leq k\leq m(n+1)-m(n)$ , $||\hat{\psi}(t(m(n)+k))||\leq K^{(3)}$ a.s. for some constant $K^{(3)}>0.$
(ii)

For sufficiently large $n$ , we have $\sup_{[T_{n},T_{n+1})}||\hat{z}(t)-z_{n}(t)||=(\epsilon_{1}(c)+\epsilon_{2}(c))LTe^{L(L+1)T}\mbox{ a.s. where }\epsilon_{1}(c),\epsilon_{2}(c)\rightarrow 0\mbox{ as }c\rightarrow\infty.$

Proof.

See Lemma 9 (ii) and (iii) of (Lakshminarayanan and Bhatnagar 2017). ∎

Theorem 25.

Under assumptions $\bf{(B1)}$ - $\bf{(B4)}$ and $\bf{(B6)}-\bf{(B8)}$ , we have:

(i)

Let $C^{*}_{a}$ and $C^{*}_{b}$ be as in Theorems 13 and 19 respectively. Then, $||\hat{z}(T_{n})||\geq\frac{1}{4+C^{*}_{a}C^{*}_{b}+C^{*}_{b}}$ for sufficiently large $||\bar{z}(T_{n})||$ .
(ii)

For $n$ large, $T=T_{\frac{1}{4}}$ (here $T$ is the sampling frequency as in (S2) and $T_{\frac{1}{4}}$ is $T_{\epsilon}$ as in Lemma 23 with $\epsilon=\frac{1}{4}$ ), if $||\bar{z}(T_{n})||>C$ , for some $C>0$ , then $||\bar{z}(T_{n+1})||<\frac{1}{2}||\bar{z}(T_{n})||$ .
(iii)

$||\bar{z}(T_{n})||\leq K_{c}^{*}$ for some $K_{c}^{*}>0$ .
(iv)

$\sup_{n}||z_{n}||<\infty$ a.s.

Proof.

(i)

From Theorems 13 and 19 we know that $r(n)<C^{*}_{a}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)+C^{*}_{b}(1+||\bar{z}(T_{n})||)+||\bar{z}(T_{n})||$ . Therefore,

\begin{split}||\hat{z}(T_{n})||&=\frac{||\bar{z}(T_{n})||}{r(n)}>\frac{||\bar{z}(T_{n})||}{C^{*}_{a}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)+C^{*}_{b}(1+||\bar{z}(T_{n})||)+||\bar{z}(T_{n})||}\\ &>\frac{||\bar{z}(T_{n})||}{C^{*}_{a}(1+C^{*}_{b}(1+||\bar{z}(T_{n})||)+||\bar{z}(T_{n})||)+C^{*}_{b}(1+||\bar{z}(T_{n})||)+||\bar{z}(T_{n})||}\\ &>\frac{1}{\frac{C^{*}_{a}}{||\bar{z}(T_{n})||}+\frac{C^{*}_{a}C^{*}_{b}}{||\bar{z}(T_{n})||}+C^{*}_{a}C^{*}_{b}+C^{*}_{a}+\frac{C^{*}_{b}}{||\bar{z}(T_{n})||}+C^{*}_{b}+1}\\ &>\frac{1}{4+C^{*}_{a}C^{*}_{b}+C^{*}_{b}},\mbox{\qquad for \quad}||\bar{z}(T_{n})||>\max{(C^{*}_{a},C^{*}_{b},C^{*}_{a}C^{*}_{b})}\end{split}

(ii)

Since $0\in\mathbb{R}^{d_{3}}$ is the unique globally asymptotically stable equilibrium, using Lemma 23, $\exists c_{\frac{1}{4}},T_{\frac{1}{4}}>0$ such that $||z_{c}(t,z)||<\frac{1}{4(4+C^{*}_{a}C^{*}_{b}+C^{*}_{b})},$ $\forall c\geq c_{\frac{1}{4}},t\geq T_{\frac{1}{4}}$ . Also, for $||\bar{z}(T_{n})||>\max{(C^{*}_{a},C^{*}_{b},C^{*}_{a}C^{*}_{b})}$ we have $||\hat{z}(T_{n})||>\frac{1}{4+C^{*}_{a}C^{*}_{b}+C^{*}_{b}}$ , and for sufficiently large $n$ , from Lemma 24(ii), $\exists C_{2}>0$ such that $||\hat{z}(T_{n+1}^{-})-z_{n}(T_{n+1})||<\frac{1}{4(4+C^{*}_{a}C^{*}_{b}+C^{*}_{b})}$ for $r(n)>C_{2}$ . We pick $C=\max(c_{1/4},C_{2},\max{(C^{*}_{a},C^{*}_{b},C^{*}_{a}C^{*}_{b})})$ and $T=T_{1/4}$ . For $n$ large it then follows that $||\hat{z}(T_{n+1}^{-})||\leq||\hat{z}(T_{n+1}^{-})-z_{n}(T_{n+1})||+||z_{n}(T_{n+1})||\leq\frac{1}{2(4+C_{a}^{*}C^{*}_{b}+C^{*}_{b})}$ . Finally, since

$\frac{||\bar{z}(T_{n+1})||}{||\bar{z}(T_{n})||}=\frac{||\hat{z}(T_{n+1}^{-})||}{||\hat{z}(T_{n})||},$

it follows that

$||\bar{z}(T_{n+1})||<\frac{1}{2}||\bar{z}(T_{n})||.$

(iii) and (iv) follow along the lines of arguments as in Lemma 10 (iii) and (iv) of (Lakshminarayanan and Bhatnagar 2017). ∎

Now from Theorem 25 (iv), it follows that the slow timescale iterates $z_{n}$ are bounded a.s. (i.e., $||z_{n}||<\infty$ a.s.), which in turn implies, using Theorem 19, that the middle timescale iterates $y_{n}$ are bounded (i.e., $||y_{n}||<\infty$ a.s.). Finally, the fast timescale iterates $x_{n}$ are bounded because of Theorem 13 and the fact that both the middle and slow timescale iterates are bounded, showing $||x_{n}||<\infty$ a.s. Combining these, we have $\sup_{n}(||x_{n}||+||y_{n}||+||z_{n}||)<\infty$ a.s., thereby proving Theorem 7. ∎

The slightly more general version where each iterate could have small perturbation terms as given below:

x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n},z_{n})+M_{n+1}^{(1)}+\varepsilon^{(1)}_{n}\right),

(36)

y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n},z_{n})+M_{n+1}^{(2)}+\varepsilon^{(2)}_{n}\right),

(37)

z_{n+1}=z_{n}+c(n)\left(f(x_{n},y_{n},z_{n})+M_{n+1}^{(3)}+\varepsilon^{(3)}_{n}\right),

(38)

with $\varepsilon_{n}^{(k)}=o(1),k=1,2,3$ can be shown to converge to the same solution. Since the additional error terms are $o(1)$ , their contribution is asymptotically negligible. See the arguments in the third extension in Chapter 2 (p. 17) of Borkar (2008b), which handles this case for one-timescale iterates.
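
The effect of an $o(1)$ perturbation can be seen on a toy example. The sketch below is only an illustration under strong simplifying assumptions: it is a one-dimensional, noise-free, one-timescale recursion with drift $h(x)=1-x$ , step-size $a(n)=\frac{1}{n+1}$ and perturbation $\varepsilon_{n}=\frac{1}{n+1}$ , all of which are hypothetical choices and not the setting of the theorems above.

```python
def run_toy_sa(num_steps, perturbed):
    """Iterate x_{n+1} = x_n + a(n) * (h(x_n) + eps_n) with h(x) = 1 - x.

    a(n) = 1/(n+1); eps_n = 1/(n+1) if perturbed else 0 (toy choices).
    The unperturbed recursion has x* = 1 as its limit point.
    """
    x = 5.0  # arbitrary starting point
    for n in range(num_steps):
        a_n = 1.0 / (n + 1)
        eps_n = 1.0 / (n + 1) if perturbed else 0.0
        x = x + a_n * ((1.0 - x) + eps_n)
    return x
```

In this toy case both `run_toy_sa(100000, True)` and `run_toy_sa(100000, False)` end up close to $1$ , with a gap that shrinks as the number of steps grows.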

A3 Convergence of GTD2-M and TDC-M

Here we provide the asymptotic convergence guarantees of the momentum variants of the remaining two Gradient TD methods namely GTD2-M and TDC-M. The analysis is similar to that of GTD-M in Theorem 4 and is provided here for completeness. We show that the assumptions (B1) - (B7) of the main paper are satisfied and thereby invoke Theorem 3 to show convergence.

A3.1 Asymptotic convergence of GTD2-M

We re-write the iterates for GTD2-M below:

\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}+\eta_{t}(\theta_{t}-\theta_{t-1}),

(39)

u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}.

(40)

As before, choosing $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}$ , where $\{\varrho_{t}\}$ is a positive sequence and $w\in\mathbb{R}$ is a constant, we can decompose the two iterates into three recursions as below:

	$\displaystyle v_{t+1}=v_{t}+\xi_{t}\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\right)$		(41)
	$\displaystyle u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-\phi_{t}\phi_{t}^{T}u_{t})$		(42)
	$\displaystyle\theta_{t+1}=\theta_{t}+\varrho_{t}(v_{t}+\varepsilon_{t})$		(43)
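
The equivalence between the momentum form (39)-(40) and the three recursions (41)-(43) can be checked numerically. The following is a minimal sketch under stated assumptions: it takes $\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}}$ (equivalently $\alpha_{t}=\xi_{t}\varrho_{t}$ ), $\varepsilon_{t}=\xi_{t}\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\right)$ , $v_{0}=0$ and $\theta_{-1}=\theta_{0}$ , and it uses synthetic bounded features and rewards. The function name, exponents and data distribution are hypothetical and chosen only for illustration.

```python
import numpy as np

def compare_gtd2m_forms(num_steps=200, d=3, gamma=0.9, w=1.0, seed=0):
    """Run GTD2-M in momentum form and in decomposed form on the same data."""
    rng = np.random.default_rng(seed)

    def xi(t):      # fast step-size (assumed exponent 0.6)
        return (t + 1.0) ** -0.6

    def beta(t):    # middle step-size (assumed exponent 0.8)
        return (t + 1.0) ** -0.8

    def rho(t):     # slow step-size varrho_t (assumed exponent 1.0)
        return (t + 1.0) ** -1.0

    def alpha(t):   # assumption: alpha_t = xi_t * varrho_t
        return xi(t) * rho(t)

    theta_prev, theta, u = np.zeros(d), np.zeros(d), np.zeros(d)
    theta_dec, u_dec, v = np.zeros(d), np.zeros(d), np.zeros(d)

    for t in range(num_steps):
        # Synthetic bounded transition sample (phi_t, r_{t+1}, phi_t').
        phi = rng.uniform(-1, 1, d) / np.sqrt(d)
        phi_next = rng.uniform(-1, 1, d) / np.sqrt(d)
        reward = rng.uniform(-1, 1)

        # Momentum form, equations (39) and (40).
        delta = reward + gamma * phi_next @ theta - phi @ theta
        grad = (phi - gamma * phi_next) * (phi @ u)
        eta = 0.0 if t == 0 else (rho(t) - w * alpha(t)) / rho(t - 1)
        theta_new = theta + alpha(t) * grad + eta * (theta - theta_prev)
        u = u + beta(t) * (delta - phi @ u) * phi
        theta_prev, theta = theta, theta_new

        # Decomposed form, equations (41), (42) and (43).
        delta_dec = reward + gamma * phi_next @ theta_dec - phi @ theta_dec
        grad_dec = (phi - gamma * phi_next) * (phi @ u_dec)
        eps = xi(t) * (grad_dec - w * v)
        theta_dec_new = theta_dec + rho(t) * (v + eps)
        u_dec = u_dec + beta(t) * (delta_dec * phi - phi * (phi @ u_dec))
        v = v + xi(t) * (grad_dec - w * v)
        theta_dec = theta_dec_new

    return theta, theta_dec
```

Under these assumptions the two returned parameter vectors agree up to floating-point error, which is what the decomposition predicts.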

Theorem 26.

Assume $\mathbfcal{A}$ 1, $\mathbfcal{A}$ 3 and $\mathbfcal{A}$ 4 hold and let $w>0$ . Then, the GTD2-M iterates given by (39) and (40) satisfy $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ a.s. as $n\rightarrow\infty$ .

Proof.

We transform the iterates given by (41), (42) and (43) into the standard SA form given by (22), (23) and (24). Let $\mathcal{F}_{t}=\sigma(u_{0},v_{0},\theta_{0},r_{j+1},\phi_{j},\phi_{j}^{\prime}:j<t)$ . Let, $A_{t}=\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}$ and $b_{t}=r_{t+1}\phi_{t}$ . Then, (41) can be re-written as:

v_{t+1}=v_{t}+\xi_{t}\left(h(v_{t},u_{t},\theta_{t})+M_{t+1}^{(1)}\right)

where,

\begin{split}h(v_{t},u_{t},\theta_{t})&=\mathbb{E}[(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}|\mathcal{F}_{t}]\\ &=-\bar{A}^{T}u_{t}-wv_{t},\\ M_{t+1}^{(1)}&=-A_{t}^{T}u_{t}-wv_{t}-h(v_{t},u_{t},\theta_{t})\\ &=(\bar{A}^{T}-A_{t}^{T})u_{t}.\end{split}

Next, (42) can be re-written as:

\begin{split}u_{t+1}&=u_{t}+\beta_{t}\left(g(v_{t},u_{t},\theta_{t})+M_{t+1}^{(2)}\right)\\ \end{split}

where,

\begin{split}g(v_{t},u_{t},\theta_{t})&=\mathbb{E}[\delta_{t}\phi_{t}-\phi_{t}\phi_{t}^{T}u_{t}|\mathcal{F}_{t}]=\bar{A}\theta_{t}+\bar{b}-\bar{C}u_{t}\\ M_{t+1}^{(2)}&=A_{t}\theta_{t}+b_{t}-C_{t}u_{t}-g(v_{t},u_{t},\theta_{t})\\ &=(A_{t}-\bar{A})\theta_{t}+(b_{t}-\bar{b})+(\bar{C}-C_{t})u_{t}.\end{split}

Here, $C_{t}=\phi_{t}\phi_{t}^{T}$ and $\bar{C}=\mathbb{E}[\phi_{t}\phi_{t}^{T}]$ . Finally, (43) can be re-written as:

\begin{split}\theta_{t+1}=\theta_{t}+\varrho_{t}\left(f(v_{t},u_{t},\theta_{t})+\varepsilon_{t}+M_{t+1}^{(3)}\right)\end{split}

where,

f(v_{t},u_{t},\theta_{t})=v_{t}\mbox{ and }M_{t+1}^{(3)}=0.

Here the step-size sequences are

\xi_{t}=\frac{1}{(t+1)^{\xi}},\beta_{t}=\frac{1}{(t+1)^{\beta}},\varrho_{t}=\frac{1}{(t+1)^{\varrho}},\frac{1}{2}<\xi<\beta<\varrho\leq 1.
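
As a short worked check, these polynomial step-sizes give the usual separation of timescales, assuming (as is standard for multi-timescale SA) that the step-size assumption asks for vanishing ratios between consecutive timescales:

\frac{\beta_{t}}{\xi_{t}}=(t+1)^{-(\beta-\xi)}\rightarrow 0\mbox{ and }\frac{\varrho_{t}}{\beta_{t}}=(t+1)^{-(\varrho-\beta)}\rightarrow 0\mbox{ as }t\rightarrow\infty.

Moreover, since all three exponents lie in $(\frac{1}{2},1]$ , each sequence is non-summable but square-summable, e.g. $\sum_{t}\xi_{t}=\infty$ and $\sum_{t}\xi_{t}^{2}<\infty$ .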

Next, $M_{t+1}^{(1)},M_{t+1}^{(2)}$ and $M_{t+1}^{(3)}$ $t\geq 0$ , are martingale difference sequences w.r.t $\mathcal{F}_{t}$ by construction. Next,

\mathbb{E}[||M_{t+1}^{(1)}||^{2}|\mathcal{F}_{t}]\leq||(\bar{A}^{T}-A_{t}^{T})||^{2}||u_{t}||^{2},

\mathbb{E}[||M_{t+1}^{(2)}||^{2}|\mathcal{F}_{t}]\leq 3(||(A_{t}-\bar{A})||^{2}||\theta_{t}||^{2}+||(b_{t}-\bar{b})||^{2}+||(\bar{C}-C_{t})||^{2}||u_{t}||^{2}).

The first part of $\bf{(B3)}$ is satisfied with $K_{1}=||(\bar{A}^{T}-A_{t}^{T})||^{2}$ , $K_{2}=3\max(||A_{t}-\bar{A}||^{2},||b_{t}-\bar{b}||^{2},||(\bar{C}-C_{t})||^{2})$ and any $K_{3}>0$ . The fact that $K_{1},K_{2}<\infty$ follows from the bounded features and bounded rewards assumption in $\mathbfcal{A}$ 1. Next, observe that $||\varepsilon_{t}^{(3)}||=\xi_{t}||\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\right)||\rightarrow 0$ since $\xi_{t}\rightarrow 0\mbox{ as }t\rightarrow\infty$ . For a fixed $u,\theta\in\mathbb{R}^{d}$ , consider the ODE

\dot{v}(t)=-\bar{A}^{T}u-wv(t).

For $w>0$ , $\lambda(u,\theta)=-\frac{\bar{A}^{T}u}{w}$ is the unique g.a.s.e., and it is linear and therefore Lipschitz continuous. This satisfies $\bf{(B4)}$ (i). Next, for a fixed $\theta\in\mathbb{R}^{d}$ ,

\dot{u}(t)=\bar{A}\theta+\bar{b}-\bar{C}u(t),

has $\Gamma(\theta)=\bar{C}^{-1}(\bar{A}\theta+\bar{b})$ as its unique g.a.s.e. because $-\bar{C}$ is negative definite. Also $\Gamma(\theta)$ is linear in $\theta$ and therefore Lipschitz. This satisfies $\bf{(B4)}(ii)$ . Finally, to satisfy $\bf{(B4)}(iii)$ , consider,

\begin{split}\dot{\theta}(t)&=\frac{-\bar{A}^{T}\bar{C}^{-1}\bar{A}\theta(t)-\bar{A}^{T}\bar{C}^{-1}\bar{b}}{w}.\end{split}

Since $\bar{A}$ is negative definite and $\bar{C}$ is positive definite, $-\bar{A}^{T}\bar{C}^{-1}\bar{A}$ is negative definite. Therefore, $\theta^{*}=-\bar{A}^{-1}\bar{b}$ is the unique g.a.s.e.

Next, we show that the sufficient conditions for stability of the three iterates are satisfied. The function, $h_{c}(v,u,\theta)=\frac{-c\bar{A}^{T}u-wcv}{c}=-\bar{A}^{T}u-wv\rightarrow h_{\infty}(v,u,\theta)=-\bar{A}^{T}u-wv$ uniformly on compacts as $c\rightarrow\infty$ . The limiting ODE:

\dot{v}(t)=-\bar{A}^{T}u-wv(t)

has $\lambda_{\infty}(u,\theta)=-\frac{\bar{A}^{T}u}{w}$ as its unique g.a.s.e. $\lambda_{\infty}$ is Lipschitz with $\lambda_{\infty}(0,0)=0$ , thus satisfying assumption $\bf{(B5)}$ .

The function, $g_{c}(u,\theta)=\frac{c\bar{A}\theta+\bar{b}-c\bar{C}u}{c}=\bar{A}\theta-\bar{C}u+\frac{\bar{b}}{c}\rightarrow g_{\infty}(u,\theta)=\bar{A}\theta-\bar{C}u$ uniformly on compacts as $c\rightarrow\infty$ . The limiting ODE

\dot{u}(t)=\bar{A}\theta-\bar{C}u(t)

has $\Gamma_{\infty}(\theta)=\bar{C}^{-1}\bar{A}\theta$ as its unique g.a.s.e. since $-\bar{C}$ is negative definite. $\Gamma_{\infty}$ is Lipschitz with $\Gamma_{\infty}(0)=0$ . Thus assumption $\bf{(B6)}$ is satisfied.

Finally, $f_{c}(\theta)=\frac{-c\bar{A}^{T}\bar{C}^{-1}\bar{A}\theta}{cw}\rightarrow f_{\infty}(\theta)=\frac{-\bar{A}^{T}\bar{C}^{-1}\bar{A}\theta}{w}$ uniformly on compacts as $c\rightarrow\infty$ and the ODE:

\dot{\theta}(t)=-\frac{\bar{A}^{T}\bar{C}^{-1}\bar{A}\theta(t)}{w}

has the origin in $\mathbb{R}^{d}$ as its unique g.a.s.e. This ensures the final condition $\bf{(B7)}$ . By Theorem 3,

\begin{pmatrix}v_{t}\\ u_{t}\\ \theta_{t}\end{pmatrix}\rightarrow\begin{pmatrix}\lambda(\Gamma(-\bar{A}^{-1}\bar{b}),-\bar{A}^{-1}\bar{b})\\ \Gamma(-\bar{A}^{-1}\bar{b})\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}=\begin{pmatrix}0\\ 0\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}.

Specifically, $\theta_{t}\rightarrow-\bar{A}^{-1}\bar{b}$ . ∎
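
The limit point claimed in Theorem 26 can be sanity-checked numerically. The sketch below is only an illustration: it draws a random negative definite $\bar{A}$ , a random positive definite $\bar{C}$ and a random $\bar{b}$ (hypothetical test matrices, not derived from any MDP) and verifies that $\Gamma(\theta^{*})=0$ , $\lambda(\Gamma(\theta^{*}),\theta^{*})=0$ and that the matrix $-\frac{1}{w}\bar{A}^{T}\bar{C}^{-1}\bar{A}$ driving the slowest ODE has eigenvalues with negative real parts.

```python
import numpy as np

def gtd2m_equilibrium_check(d=4, w=1.0, seed=1):
    """Check the GTD2-M equilibria on randomly generated test matrices."""
    rng = np.random.default_rng(seed)
    B = rng.normal(size=(d, d))
    S = rng.normal(size=(d, d))
    # Negative definite (not necessarily symmetric): x^T A x < 0 for x != 0,
    # because the skew-symmetric part S - S^T contributes nothing to x^T A x.
    A_bar = -(B @ B.T + np.eye(d)) + (S - S.T)
    D = rng.normal(size=(d, d))
    C_bar = D @ D.T + np.eye(d)  # symmetric positive definite
    b_bar = rng.normal(size=d)

    theta_star = -np.linalg.solve(A_bar, b_bar)                  # -A^{-1} b
    Gamma_star = np.linalg.solve(C_bar, A_bar @ theta_star + b_bar)
    lambda_star = -A_bar.T @ Gamma_star / w
    drift = -A_bar.T @ np.linalg.solve(C_bar, A_bar) / w         # slow ODE matrix
    max_real_eig = float(np.max(np.linalg.eigvals(drift).real))
    return Gamma_star, lambda_star, max_real_eig
```

On such test matrices the first two outputs are zero up to rounding and the last output is strictly negative, in line with the limit stated above.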

A3.2 Asymptotic Convergence of TDC-M

We re-write the iterates for TDC-M below:

\begin{split}\theta_{t+1}=\theta_{t}+\alpha_{t}(\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}(\phi_{t}^{T}u_{t}))+\eta_{t}(\theta_{t}-\theta_{t-1}),\end{split}

(44)

u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}.

(45)

As in the case of GTD2-M, choosing $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}$ , we can decompose the two iterates into three recursions as below:

	$\displaystyle v_{t+1}=v_{t}+\xi_{t}\left(\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}\phi_{t}^{T}u_{t}-wv_{t}\right)$		(46)
	$\displaystyle u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-\phi_{t}\phi_{t}^{T}u_{t})$		(47)
	$\displaystyle\theta_{t+1}=\theta_{t}+\varrho_{t}(v_{t}+\varepsilon_{t})$		(48)

Theorem 27.

Assume $\mathbfcal{A}$ 1, $\mathbfcal{A}$ 3 and $\mathbfcal{A}$ 4 hold and let $w>0$ . Then, the TDC-M iterates given by (44) and (45) satisfy $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ a.s. as $n\rightarrow\infty$ .

Proof.

We transform the iterates given by (46), (47) and (48) into the standard SA form given by (22), (23) and (24). Let $\mathcal{F}_{t}=\sigma(u_{0},v_{0},\theta_{0},r_{j+1},\phi_{j},\phi_{j}^{\prime}:j<t)$ . Let, $A_{t}=\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}$ and $b_{t}=r_{t+1}\phi_{t}$ . Then, (46) can be re-written as:

v_{t+1}=v_{t}+\xi_{t}\left(h(v_{t},u_{t},\theta_{t})+M_{t+1}^{(1)}\right)

where,

\begin{split}h(v_{t},u_{t},\theta_{t})&=\mathbb{E}[\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}\phi_{t}^{T}u_{t}-wv_{t}|\mathcal{F}_{t}]\\ &=\bar{A}\theta_{t}+\bar{b}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u_{t}-wv_{t}.\\ M_{t+1}^{(1)}&=\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}\phi_{t}^{T}u_{t}-wv_{t}-h(v_{t},u_{t},\theta_{t})\\ &=(A_{t}-\bar{A})\theta_{t}+(b_{t}-\bar{b})+\gamma(\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]-\phi_{t}^{\prime}\phi_{t}^{T})u_{t}.\end{split}

Next, (47) can be re-written as:

\begin{split}u_{t+1}&=u_{t}+\beta_{t}\left(g(v_{t},u_{t},\theta_{t})+M_{t+1}^{(2)}\right)\\ \end{split}

where,

\begin{split}g(v_{t},u_{t},\theta_{t})&=\mathbb{E}[\delta_{t}\phi_{t}-\phi_{t}\phi_{t}^{T}u_{t}|\mathcal{F}_{t}]=\bar{A}\theta_{t}+\bar{b}-\bar{C}u_{t}\\ M_{t+1}^{(2)}&=A_{t}\theta_{t}+b_{t}-C_{t}u_{t}-g(v_{t},u_{t},\theta_{t})\\ &=(A_{t}-\bar{A})\theta_{t}+(b_{t}-\bar{b})+(\bar{C}-C_{t})u_{t}.\end{split}

Here, $C_{t}=\phi_{t}\phi_{t}^{T}$ and $\bar{C}=\mathbb{E}[\phi_{t}\phi_{t}^{T}]$ . Finally, (48) can be re-written as:

\begin{split}\theta_{t+1}=\theta_{t}+\varrho_{t}\left(f(v_{t},u_{t},\theta_{t})+\varepsilon_{t}+M_{t+1}^{(3)}\right)\end{split}

where,

f(v_{t},u_{t},\theta_{t})=v_{t}\mbox{ and }M_{t+1}^{(3)}=0.

Here the step-size sequences are

\xi_{t}=\frac{1}{(t+1)^{\xi}},\beta_{t}=\frac{1}{(t+1)^{\beta}},\varrho_{t}=\frac{1}{(t+1)^{\varrho}},\frac{1}{2}<\xi<\beta<\varrho\leq 1.

Observe that, $M_{t+1}^{(1)},M_{t+1}^{(2)}$ and $M_{t+1}^{(3)}$ $t\geq 0$ , are martingale difference sequences w.r.t $\mathcal{F}_{t}$ by construction. Next,

\mathbb{E}[||M_{t+1}^{(1)}||^{2}|\mathcal{F}_{t}]\leq 3(||(A_{t}-\bar{A})||^{2}||\theta_{t}||^{2}+||(b_{t}-\bar{b})||^{2}+\gamma(||\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]-\phi_{t}^{\prime}\phi_{t}^{T}||^{2})||u_{t}||^{2}),

\mathbb{E}[||M_{t+1}^{(2)}||^{2}|\mathcal{F}_{t}]\leq 3(||(A_{t}-\bar{A})||^{2}||\theta_{t}||^{2}+||(b_{t}-\bar{b})||^{2}+||(\bar{C}-C_{t})||^{2}||u_{t}||^{2}).

The first part of $\bf{(B3)}$ is satisfied with $K_{1}=3\max(||(A_{t}-\bar{A})||^{2},||(b_{t}-\bar{b})||^{2},\gamma(||\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]-\phi_{t}^{\prime}\phi_{t}^{T}||^{2}))$ , $K_{2}=3\max(||A_{t}-\bar{A}||^{2},||b_{t}-\bar{b}||^{2},||(\bar{C}-C_{t})||^{2})$ and any $K_{3}>0$ . The fact that $K_{1},K_{2}<\infty$ follows from the bounded features and bounded rewards assumption in $\mathbfcal{A}$ 1. Next, observe that $||\varepsilon_{t}^{(3)}||=\xi_{t}||\left(\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}\phi_{t}^{T}u_{t}-wv_{t}\right)||\rightarrow 0$ since $\xi_{t}\rightarrow 0\mbox{ as }t\rightarrow\infty$ . For a fixed $u,\theta\in\mathbb{R}^{d}$ , consider the ODE

\dot{v}(t)=\bar{A}\theta+\bar{b}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-wv(t).

For $w>0$ , $\lambda(u,\theta)=\frac{\bar{A}\theta+\bar{b}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u}{w}$ is the unique g.a.s.e., and it is affine in $(u,\theta)$ and therefore Lipschitz continuous. This satisfies $\bf{(B4)}$ (i). Next, for a fixed $\theta\in\mathbb{R}^{d}$ ,

\dot{u}(t)=\bar{A}\theta+\bar{b}-\bar{C}u(t),

has $\Gamma(\theta)=\bar{C}^{-1}(\bar{A}\theta+\bar{b})$ as its unique g.a.s.e. because $-\bar{C}$ is negative definite. Also $\Gamma(\theta)$ is linear in $\theta$ and therefore Lipschitz. This satisfies $\bf{(B4)}(ii)$ . Finally, to satisfy $\bf{(B4)}(iii)$ , consider,

\begin{split}\dot{\theta}(t)&=\frac{(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})(\bar{A}\theta(t)+\bar{b})}{w}.\end{split}

Now, $(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})\bar{A}=(\mathbb{E}[\phi_{t}\phi_{t}^{T}]-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}])\bar{C}^{-1}\bar{A}=\mathbb{E}[(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}]\bar{C}^{-1}\bar{A}=-\bar{A}^{T}\bar{C}^{-1}\bar{A}$ . Since $\bar{A}$ is negative definite and $\bar{C}$ is positive definite, $-\bar{A}^{T}\bar{C}^{-1}\bar{A}$ is negative definite and hence the above ODE has $\theta^{*}=-\bar{A}^{-1}\bar{b}$ as its unique g.a.s.e.
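
The matrix identity used above holds for any matrices built from the same samples, so it can be checked with empirical averages in place of expectations. The following sketch is only illustrative: the function name and the uniformly sampled features are hypothetical, and sample means stand in for $\bar{C}$ , $\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]$ and $\bar{A}=\mathbb{E}[\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}]$ .

```python
import numpy as np

def tdc_identity_gap(num_samples=500, d=3, gamma=0.95, seed=2):
    """Max abs gap between (I - gamma P C^{-1}) A and -A^T C^{-1} A."""
    rng = np.random.default_rng(seed)
    phi = rng.uniform(-1, 1, size=(num_samples, d))       # rows are phi_t
    phi_next = rng.uniform(-1, 1, size=(num_samples, d))  # rows are phi_t'

    C_bar = phi.T @ phi / num_samples                     # mean of phi phi^T
    P_bar = phi_next.T @ phi / num_samples                # mean of phi' phi^T
    # mean of phi (gamma phi' - phi)^T, so that A_bar^T = gamma P_bar - C_bar
    A_bar = phi.T @ (gamma * phi_next - phi) / num_samples

    C_inv = np.linalg.inv(C_bar)
    lhs = (np.eye(d) - gamma * P_bar @ C_inv) @ A_bar
    rhs = -A_bar.T @ C_inv @ A_bar
    return float(np.max(np.abs(lhs - rhs)))
```

The returned gap is at the level of floating-point rounding, reflecting that $\bar{C}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]=-\bar{A}^{T}$ by the definition of $\bar{A}$ .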

Next, we show that the sufficient conditions for stability of the three iterates are satisfied. The function $h_{c}(v,u,\theta)=\frac{c\bar{A}\theta+\bar{b}-c\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-cwv}{c}=\bar{A}\theta+\frac{\bar{b}}{c}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-wv\rightarrow h_{\infty}(v,u,\theta)=\bar{A}\theta-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-wv$ uniformly on compacts as $c\rightarrow\infty$ . The limiting ODE:

\dot{v}(t)=\bar{A}\theta-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-wv(t)

has $\lambda_{\infty}(u,\theta)=\frac{\bar{A}\theta-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u}{w}$ as its unique g.a.s.e. $\lambda_{\infty}$ is Lipschitz with $\lambda_{\infty}(0,0)=0$ , thus satisfying assumption $\bf{(B5)}$ .

The function $g_{c}(u,\theta)=\frac{c\bar{A}\theta+\bar{b}-c\bar{C}u}{c}=\bar{A}\theta-\bar{C}u+\frac{\bar{b}}{c}\rightarrow g_{\infty}(u,\theta)=\bar{A}\theta-\bar{C}u$ uniformly on compacts as $c\rightarrow\infty$ . The limiting ODE

\dot{u}(t)=\bar{A}\theta-\bar{C}u(t)

has $\Gamma_{\infty}(\theta)=\bar{C}^{-1}\bar{A}\theta$ as its unique g.a.s.e. since $-\bar{C}$ is negative definite. $\Gamma_{\infty}$ is Lipschitz with $\Gamma_{\infty}(0)=0$ . Thus assumption $\bf{(B6)}$ is satisfied.

Finally, $f_{c}(\theta)=\frac{c\bar{A}\theta-c\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1}\bar{A}\theta}{cw}\rightarrow f_{\infty}=\frac{(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})\bar{A}\theta}{w}$ uniformly on compacts as $c\rightarrow\infty$ . Consider the ODE:

\dot{\theta}(t)=\frac{(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})\bar{A}\theta(t)}{w}.

Now, $(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})\bar{A}=(\mathbb{E}[\phi_{t}\phi_{t}^{T}]-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}])\bar{C}^{-1}\bar{A}=\mathbb{E}[(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}]\bar{C}^{-1}\bar{A}=-\bar{A}^{T}\bar{C}^{-1}\bar{A}$. Since $\bar{A}$ is negative definite and $\bar{C}$ is positive definite, $-\bar{A}^{T}\bar{C}^{-1}\bar{A}$ is negative definite, and hence the above ODE has the origin as its unique g.a.s.e. This ensures the final condition $\bf{(B7)}$. By Theorem 3,

\begin{pmatrix}v_{t}\\ u_{t}\\ \theta_{t}\end{pmatrix}\rightarrow\begin{pmatrix}\lambda(\Gamma(-\bar{A}^{-1}\bar{b}),-\bar{A}^{-1}\bar{b})\\ \Gamma(-\bar{A}^{-1}\bar{b})\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}=\begin{pmatrix}0\\ 0\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}.

Specifically, $\theta_{t}\rightarrow-\bar{A}^{-1}\bar{b}$ . ∎
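
The limit point and the negative definiteness used in the proof can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: the matrices are synthetic stand-ins (not estimated from any MDP), the dimension `d` and the values of `gamma` and `w` are arbitrary, and the forms of `Gamma` and `lam` are inferred from the numerators of $g_{c}$ and $h_{c}$ above rather than quoted from the paper. The matrix `E` plays the role of $\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]$ and is chosen so that the identity $\bar{C}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]=-\bar{A}^{T}$ holds.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                    # feature dimension (arbitrary, for illustration only)
gamma, w = 0.95, 0.5     # discount factor and the constant w (illustrative values)

# Synthetic stand-ins for the expectations in the proof (assumptions).
M = rng.standard_normal((d, d))
C = M @ M.T + np.eye(d)                        # symmetric positive definite
S = rng.standard_normal((d, d))
K = rng.standard_normal((d, d))
A = -(S @ S.T + np.eye(d)) + (K - K.T)         # negative definite, non-symmetric
b = rng.standard_normal(d)
E = (C + A.T) / gamma                          # enforces C - gamma * E = -A^T


def Gamma(theta):
    """Equilibrium of the u-ODE for fixed theta (inferred form)."""
    return np.linalg.solve(C, A @ theta + b)


def lam(u, theta):
    """Equilibrium of the v-ODE for fixed (u, theta) (inferred form)."""
    return (A @ theta + b - gamma * E @ u) / w


# Drift matrix of the slowest ODE, up to the positive factor 1/w.
drift = (np.eye(d) - gamma * E @ np.linalg.inv(C)) @ A
max_eig = np.linalg.eigvalsh((drift + drift.T) / 2).max()

theta_star = -np.linalg.solve(A, b)
u_star = Gamma(theta_star)
v_star = lam(u_star, theta_star)

print("largest eigenvalue of the drift matrix:", max_eig)
print("|u*|, |v*|:", np.linalg.norm(u_star), np.linalg.norm(v_star))
```

A negative largest eigenvalue together with numerically zero `u_star` and `v_star` is consistent with the limit stated in the proof; it does not replace the stochastic approximation argument.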

A4 Experiment Details

Here we briefly describe the MDP settings considered in Section 5.

1. Example-1 (Boyan Chain): It consists of a linear arrangement of 14 states. From each of the first 13 states, one moves to the next state or to the state after it with equal probability. The last state is absorbing. The reward for each transition is -3, except for the transition from state 6 to state 7, where it is -2. The discount factor $\gamma$ is set to $0.95$. The following figure shows the corresponding MDP for a 7-state Boyan chain.

Figure 5: 7 state Boyan Chain from (Boyan 1999)

2. Example-2 (5-State Random Walk): It consists of a linear arrangement of 5 states with two terminal states. There is a single action at each state. From each state, one moves left or right with equal probability. Moving left from state 1 terminates the episode with a reward of 0. Similarly, moving right from state 5 terminates the episode, but with a reward of +1. The reward for all other transitions is 0 and the discount factor is $\gamma=1$. The following figure shows the corresponding MDP.

Figure 6: 5-State Random Walk from (Sutton et al. 2009)

3. Example-3 (19-State Random Walk): It consists of a linear arrangement of 19 states. From each state, one moves left or right with equal probability. Moving left from state 1 terminates the episode with a reward of -1. Similarly, moving right from state 19 terminates the episode, but with a reward of +1. The reward for all other transitions is 0 and the discount factor is $\gamma=1$. The following figure shows the corresponding MDP.

Figure 7: 19 State Random Walk from (Sutton and Barto 2018)

4. Example-4 (Random MDP): This is a randomly generated discrete MDP with 20 states and 5 actions in each state. The transition probabilities are generated uniformly from $[0,1]$ with a small additive constant. The rewards are also generated uniformly from $[0,1]$. The policy and the start state distribution are generated in a similar way, and the discount factor is $\gamma=0.95$. See (Dann, Neumann, and Peters 2014) for a more detailed description.
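
The random-walk and random-MDP settings described above can be sketched in a few lines. This is a minimal illustration, not the code used for the experiments: the function names, the additive constant `eps`, the seed, the reward shape (one reward per state-action pair), and the row normalisation of the transition and policy tables are all assumptions made here.

```python
import numpy as np


def random_walk_values(n, left_reward, right_reward):
    """True state values of an n-state random walk with gamma = 1."""
    P = np.zeros((n, n))   # transitions among non-terminal states only
    r = np.zeros(n)        # expected one-step reward from each state
    for s in range(n):
        if s > 0:
            P[s, s - 1] = 0.5
        else:
            r[s] += 0.5 * left_reward    # leaving on the left terminates
        if s < n - 1:
            P[s, s + 1] = 0.5
        else:
            r[s] += 0.5 * right_reward   # leaving on the right terminates
    # Bellman equation V = r + P V; I - P is invertible because every
    # episode terminates with probability one.
    return np.linalg.solve(np.eye(n) - P, r)


def random_mdp(n_states=20, n_actions=5, eps=1e-5, seed=0):
    """Random MDP and policy; eps and the normalisation are assumptions."""
    rng = np.random.default_rng(seed)
    P = rng.uniform(0.0, 1.0, (n_states, n_actions, n_states)) + eps
    P /= P.sum(axis=2, keepdims=True)      # each P[s, a] is a distribution
    R = rng.uniform(0.0, 1.0, (n_states, n_actions))
    pi = rng.uniform(0.0, 1.0, (n_states, n_actions)) + eps
    pi /= pi.sum(axis=1, keepdims=True)    # each pi[s] is a distribution
    return P, R, pi


v5 = random_walk_values(5, left_reward=0.0, right_reward=1.0)
v19 = random_walk_values(19, left_reward=-1.0, right_reward=1.0)
P, R, pi = random_mdp()
print(v5)
```

For the 5-state walk the solved values are $i/6$ for state $i$, and for the 19-state walk they are $(i-10)/10$, which gives a reference against which learned value estimates can be compared.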
