
Gradient Temporal Difference with Momentum: Stability and Convergence

Rohan Deb, Shalabh Bhatnagar
Corresponding Author
Abstract

Gradient temporal difference (Gradient TD) algorithms are a popular class of stochastic approximation (SA) algorithms used for policy evaluation in reinforcement learning. Here, we consider Gradient TD algorithms with an additional heavy ball momentum term and provide a choice of step size and momentum parameter that ensures almost sure asymptotic convergence of these algorithms. In doing so, we decompose the heavy ball Gradient TD iterates into three separate iterates with different step sizes. We first analyze these iterates under the one-timescale SA setting using results from the existing literature. However, the one-timescale case is restrictive, and a more general analysis can be provided by looking at a three-timescale decomposition of the iterates. In the process, we provide the first conditions for stability and convergence of general three-timescale SA. We then prove that the heavy ball Gradient TD algorithm is convergent using our three-timescale SA analysis. Finally, we evaluate these algorithms on standard RL problems and report improvement in performance over the vanilla algorithms.

1 Introduction

In reinforcement learning (RL), the goal of the learner or the agent is to maximize its long-term accumulated reward by interacting with the environment. One important task in most RL algorithms is policy evaluation, which predicts the expected accumulated reward an agent would receive from a state (called the value function) if it follows a given policy. In model-free learning, the agent does not have access to the underlying dynamics of the environment and has to learn the value function from samples of the form (state, action, reward, next-state). Two very popular algorithms in the model-free setting are Monte-Carlo (MC) and temporal difference (TD) learning (see Sutton and Barto (2018), Sutton (1988)). It is well known that TD learning with function approximation can diverge in the off-policy setting (see Baird (1995)). A class of algorithms called gradient temporal difference (Gradient TD) was introduced in (Sutton, Maei, and Szepesvári 2009) and (Sutton et al. 2009); these algorithms are convergent even in the off-policy setting. They fall under a larger class of algorithms called linear stochastic approximation (SA) algorithms.

A lot of literature is dedicated to studying the asymptotic behaviour of SA algorithms starting from the work of (Robbins and Monro 1951). In recent times, the ODE method to analyze asymptotic behaviour of SA (Ljung 1977; Kushner and Clark 1978; Borkar 2008b; Borkar and Meyn 2000) has become quite popular in the RL community. The Gradient TD methods were shown to be convergent using the ODE approach. A generic one-timescale (One-TS) SA iterate has the following form:

$$x_{n+1}=x_{n}+a(n)\left(h(x_{n})+M_{n+1}\right), \tag{1}$$

where $x_{n}\in\mathbb{R}^{d_{1}}$ are the iterates. The function $h:\mathbb{R}^{d_{1}}\rightarrow\mathbb{R}^{d_{1}}$ is assumed to be Lipschitz continuous, $M_{n+1}$ is a martingale difference noise sequence and $a(n)$ is the step-size at time-step $n$. Under some mild assumptions, the iterate given by (1) converges (see Borkar 2008b; Borkar and Meyn 2000). When $h$ is a linear map of the form $h(x_{n})=b-Ax_{n}$, the matrix $A$ is often called the driving matrix. The three Gradient TD algorithms, GTD (Sutton, Maei, and Szepesvári 2009), GTD2 and TDC (Sutton et al. 2009), consist of two iterates of the following form:

$$x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n})+M_{n+1}^{(1)}\right), \tag{2}$$
$$y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n})+M_{n+1}^{(2)}\right), \tag{3}$$

where $x_{n}\in\mathbb{R}^{d_{1}}$ and $y_{n}\in\mathbb{R}^{d_{2}}$. See Section 2 for the exact form of the iterates. The two iterates still form a One-TS SA scheme if $\lim_{n\rightarrow\infty}\frac{b(n)}{a(n)}=c$, where $c>0$ is a constant, and a two-timescale (two-TS) scheme if $\lim_{n\rightarrow\infty}\frac{b(n)}{a(n)}=0$.

Separately, adding a momentum term to accelerate the convergence of iterates is a popular technique in stochastic gradient descent (SGD). The two most popular schemes are Polyak's heavy ball method (Polyak 1964) and Nesterov's accelerated gradient method (Nesterov 1983). A lot of literature is dedicated to studying momentum with SGD. Some recent works include (Ghadimi, Feyzmahdavian, and Johansson 2014; Loizou and Richtárik 2020; Gitman et al. 2019; Ma and Yarats 2019; Assran and Rabbat 2020). In contrast, very few works study the effect of momentum in the SA setting, which is the focus of the current work. A recent work by (Mou et al. 2020) briefly studies SA with momentum and shows an improvement in the mixing rate. However, the setting considered is restricted to linear SA and the driving matrix is assumed to be symmetric. Further, the iterates involve an additional Polyak-Ruppert averaging (Polyak 1990). Here, in contrast, we analyze the asymptotic behaviour of the algorithm and make none of the above assumptions. A somewhat distant paper is (Devraj, Bušíć, and Meyn 2019), which introduces matrix momentum in SA; that scheme is not equivalent to heavy ball momentum.

A very recent work by (Avrachenkov, Patil, and Thoppe 2020) studied One-TS SA with heavy ball momentum in the univariate case (i.e., $d_{1}=1$ in iterate (1)) in the context of web-page crawling. The iterates took the following form:

$$x_{n+1}=x_{n}+a(n)\left(h(x_{n})+M_{n+1}\right)+\eta_{n}(x_{n}-x_{n-1}). \tag{4}$$

The momentum parameter $\eta_{n}$ was chosen to decompose the iterate into two recursions of the form given by (2) and (3). We use such a decomposition for Gradient TD methods with momentum. This leads to three separate iterates with three step-sizes. We analyze these three iterates and provide stability (iterates remain bounded throughout) and almost sure (a.s.) convergence guarantees.

1.1 Our Contribution

  • We first consider the One-TS decomposition of the Gradient TD with momentum iterates and show that the driving matrix in this case is Hurwitz (all eigenvalues have negative real parts). Thereafter, we use the theory of One-TS SA to show that the iterates are stable and converge to the same TD solution.

  • Next, we consider the Three-TS decomposition. We provide the first stability and convergence conditions for general Three-TS recursions. We then show that the iterates under consideration satisfy these conditions.

  • Finally, we evaluate these algorithms for different choices of step-size and momentum parameters on standard RL problems and report an improvement in performance over their vanilla counterparts.

2 Preliminaries

In the standard RL setup, an agent interacts with the environment, which is a Markov Decision Process (MDP). At each discrete time step $t$, the agent is in state $s_{t}\in\mathcal{S}$, takes an action $a_{t}\in\mathcal{A}$, receives a reward $r_{t+1}\equiv r(s_{t},a_{t},s_{t+1})\in\mathbb{R}$ and moves to another state $s_{t+1}\in\mathcal{S}$. Here $\mathcal{S}$ and $\mathcal{A}$ are finite sets of possible states and actions respectively. The transitions are governed by a kernel $\mathbb{P}$. A policy $\pi:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ is a mapping that defines the probability of picking an action in a state. We let $P^{\pi}(s^{\prime}|s)$ be the transition probability matrix induced by $\pi$. Also, $\{d^{\pi}(s)\}_{s\in\mathcal{S}}$ represents the steady-state distribution for the Markov chain induced by $\pi$, and $D$ is a diagonal matrix of dimension $n\times n$, with $n=|\mathcal{S}|$, whose diagonal entries are $d^{\pi}(s)$. The state-value function associated with a policy $\pi$ for state $s$ is

$$V^{\pi}(s)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}R_{t+1}\,\Big|\,s_{0}=s\right],$$

where $\gamma\in[0,1)$ is the discount factor.

In the linear architecture setting, policy evaluation deals with estimating $V^{\pi}(s)$ through a linear model $V_{\theta}(s)=\theta^{T}\phi(s)$, where $\phi(s)\equiv\phi_{s}$ is a feature vector associated with the state $s$ and $\theta$ is the parameter vector. We define the TD-error as $\delta_{t}=r_{t+1}+\gamma\theta_{t}^{T}\phi_{t+1}-\theta_{t}^{T}\phi_{t}$ and $\Phi$ as an $n\times d$ matrix whose $s^{th}$ row is $\phi(s)^{T}$. In the i.i.d. setting, it is assumed that the tuple $(\phi_{t},\phi_{t}^{\prime})$ (where $\phi_{t}^{\prime}\equiv\phi_{t+1}$) is drawn independently from the stationary distribution of the Markov chain induced by $\pi$. Let $\bar{A}=\mathbb{E}[\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}]$ and $\bar{b}=\mathbb{E}[r_{t+1}\phi_{t}]$, where the expectations are w.r.t. the stationary distribution of the induced chain. The matrix $\bar{A}$ is negative definite (see Maei (2011); Tsitsiklis and Van Roy (1997)). In the off-policy case, the importance weight is given by $\rho_{t}=\frac{\pi(a_{t}|s_{t})}{\mu(a_{t}|s_{t})}$, where $\pi$ and $\mu$ are the target and behaviour policies respectively. Introduced in (Sutton, Maei, and Szepesvári 2009), Gradient TD algorithms are a class of TD algorithms that are convergent even in the off-policy setting. Next, we present the iterates associated with the algorithms GTD (Sutton, Maei, and Szepesvári 2009), GTD2 and TDC (Sutton et al. 2009).

  • GTD:

    $$\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}, \tag{5}$$
    $$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-u_{t}). \tag{6}$$
  • GTD2:

    $$\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}, \tag{7}$$
    $$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}. \tag{8}$$
  • TDC:

    $$\theta_{t+1}=\theta_{t}+\alpha_{t}\delta_{t}\phi_{t}-\alpha_{t}\gamma\phi_{t}^{\prime}(\phi_{t}^{T}u_{t}), \tag{9}$$
    $$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}. \tag{10}$$
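The three update rules above can be sketched as single-sample steps. This is a minimal illustration in NumPy, not the authors' implementation; the function names and the toy sample used in the usage note are assumptions introduced here, and the on-policy case (importance weight equal to 1) is assumed.

```python
import numpy as np

def td_error(theta, phi, phi_next, reward, gamma):
    # delta_t = r_{t+1} + gamma * theta^T phi' - theta^T phi
    return reward + gamma * (theta @ phi_next) - (theta @ phi)

def gtd_step(theta, u, phi, phi_next, reward, gamma, alpha, beta):
    # Iterates (5)-(6): u tracks E[delta * phi].
    delta = td_error(theta, phi, phi_next, reward, gamma)
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ u)
    u_new = u + beta * (delta * phi - u)
    return theta_new, u_new

def gtd2_step(theta, u, phi, phi_next, reward, gamma, alpha, beta):
    # Iterates (7)-(8): same theta update, least-squares style u update.
    delta = td_error(theta, phi, phi_next, reward, gamma)
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ u)
    u_new = u + beta * (delta - phi @ u) * phi
    return theta_new, u_new

def tdc_step(theta, u, phi, phi_next, reward, gamma, alpha, beta):
    # Iterates (9)-(10): TD update plus a gradient correction term.
    delta = td_error(theta, phi, phi_next, reward, gamma)
    theta_new = theta + alpha * delta * phi - alpha * gamma * phi_next * (phi @ u)
    u_new = u + beta * (delta - phi @ u) * phi
    return theta_new, u_new

# Hypothetical single transition, used only to exercise the functions.
theta0, u0 = np.zeros(2), np.zeros(2)
phi, phi_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
gtd_out = gtd_step(theta0, u0, phi, phi_next, 1.0, 0.9, 0.1, 0.5)
gtd2_out = gtd2_step(theta0, u0, phi, phi_next, 1.0, 0.9, 0.1, 0.5)
tdc_out = tdc_step(theta0, u0, phi, phi_next, 1.0, 0.9, 0.1, 0.5)
```

With $u_{0}=0$, the first GTD and GTD2 steps leave $\theta$ unchanged because the $\theta$ update is driven entirely by $u_{t}$, whereas TDC already moves $\theta$ through its $\delta_{t}\phi_{t}$ term.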

The objective function for GTD is the norm of the expected TD update, defined as $NEU(\theta)=\mathbb{E}[\delta\phi]^{T}\mathbb{E}[\delta\phi]$. The GTD algorithm is derived by expressing the gradient direction as $-\frac{1}{2}\nabla NEU(\theta)=\mathbb{E}\left[(\phi-\gamma\phi^{\prime})\phi^{T}\right]\mathbb{E}[\delta(\theta)\phi]$. Here $\phi^{\prime}\equiv\phi(s^{\prime})$. If both expectations were sampled together, the product would be biased by their correlation. Instead, an estimate of the second expectation is maintained as a long-term quasi-stationary estimate $u_{t}$ (see (6)) and the first expectation is sampled (see (5)). For GTD2 and TDC, a similar approach is used on the mean square projected Bellman error, defined as $MSPBE(\theta)=\|V_{\theta}-\Pi T^{\pi}V_{\theta}\|_{D}^{2}$. Here, $\Pi$ is the projection operator that projects vectors onto the subspace $\{\Phi\theta\,|\,\theta\in\mathbb{R}^{d}\}$ and $T^{\pi}$ is the Bellman operator defined as $T^{\pi}V=R^{\pi}+\gamma P^{\pi}V$. As originally presented, GTD and GTD2 are one-timescale algorithms ($\frac{\alpha_{t}}{\beta_{t}}$ is constant) while TDC is a two-timescale algorithm ($\frac{\alpha_{t}}{\beta_{t}}\rightarrow 0$). It was shown in all three cases that $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$.
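To make the TD solution $\theta^{*}=-\bar{A}^{-1}\bar{b}$ concrete, the sketch below computes it for a small chain and checks that both objectives vanish there. The 3-state transition matrix, rewards and features are hypothetical values chosen for illustration, and the code uses the matrix forms $\bar{A}=\Phi^{T}D(\gamma P^{\pi}-I)\Phi$ and $\bar{b}=\Phi^{T}DR^{\pi}$, which follow from the definitions of $\bar{A}$ and $\bar{b}$ as expectations under $d^{\pi}$ in the on-policy case.

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state chain induced by the policy (rows sum to 1).
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, -1.0])      # expected one-step reward per state
Phi = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])           # n x d feature matrix, full column rank

# Stationary distribution by power iteration (chain is irreducible, aperiodic).
d_pi = np.full(3, 1.0 / 3.0)
for _ in range(1000):
    d_pi = d_pi @ P
D = np.diag(d_pi)

A_bar = Phi.T @ D @ (gamma * P - np.eye(3)) @ Phi   # E[phi (gamma phi' - phi)^T]
b_bar = Phi.T @ D @ r_pi                            # E[r phi]
theta_star = -np.linalg.solve(A_bar, b_bar)

def neu(theta):
    # NEU(theta) = E[delta phi]^T E[delta phi], with E[delta phi] = A_bar theta + b_bar
    expected_update = A_bar @ theta + b_bar
    return float(expected_update @ expected_update)

def mspbe(theta):
    # || Pi (T V_theta - V_theta) ||_D^2, using Pi V_theta = V_theta
    V = Phi @ theta
    residual = r_pi + gamma * (P @ V) - V
    Pi = Phi @ np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)
    projected = Pi @ residual
    return float(projected @ D @ projected)

# Largest eigenvalue of the symmetric part of A_bar (negative if A_bar is negative definite).
sym_max_eig = np.linalg.eigvalsh((A_bar + A_bar.T) / 2.0).max()
```

Both `neu(theta_star)` and `mspbe(theta_star)` are zero up to rounding, which reflects that all three algorithms share the same fixed point even though they descend different objectives.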

3 Gradient TD with Momentum

Although Gradient TD starts with a gradient descent based approach, it ends up with two-TS SA recursions. Momentum methods are known to accelerate the convergence of SGD iterates. Motivated by this, we examine momentum in the SA setting and ask whether the SA recursions for Gradient TD with momentum even converge to the same TD solution. We probe the heavy ball extension of the three Gradient TD algorithms, where we keep an accumulation of the previous gradient values in $\zeta_{t}$. Then, at time step $t+1$, the new gradient value multiplied by the step size is added to the current accumulation vector $\zeta_{t}$ multiplied by the momentum parameter $\eta_{t}$, as below:

$$\zeta_{t+1}=\eta_{t}\zeta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}.$$

The parameter $\theta$ is then updated along $\zeta_{t+1}$, i.e., $\theta_{t+1}=\theta_{t}+\zeta_{t+1}$ (the gradient term above is already a sample of the descent direction $-\frac{1}{2}\nabla NEU(\theta)$). Since $u_{t+1}$ is computed as a long-term estimate of $\mathbb{E}[\delta(\theta)\phi]$, its update rule remains the same. The momentum parameter $\eta_{t}$ is usually set to a constant in the stochastic gradient setting. An exception to this can however be found in (Gitman et al. 2019; Gadat, Panloup, and Saadane 2016), where $\eta_{t}\rightarrow 1$. Here, we consider the latter case. Substituting $\zeta_{t+1}$ into the iteration for $\theta_{t+1}$ and noting that $\zeta_{t}=\theta_{t}-\theta_{t-1}$, the iterates for GTD with Momentum (GTD-M) can be written as:

$$\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}+\eta_{t}(\theta_{t}-\theta_{t-1}), \tag{11}$$
$$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-u_{t}). \tag{12}$$

Similarly, the iterates for GTD2-M are given by:

$$\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}+\eta_{t}(\theta_{t}-\theta_{t-1}), \tag{13}$$
$$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}. \tag{14}$$

Finally, the iterates for TDC-M are given by:

$$\theta_{t+1}=\theta_{t}+\alpha_{t}\left(\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}(\phi_{t}^{T}u_{t})\right)+\eta_{t}(\theta_{t}-\theta_{t-1}), \tag{15}$$
$$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}. \tag{16}$$

We choose the momentum parameter $\eta_{t}$ as in (Avrachenkov, Patil, and Thoppe 2020): $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}$, where $\{\varrho_{t}\}$ is a positive sequence s.t. $\varrho_{t}\rightarrow 0$ as $t\rightarrow\infty$ and $w\in\mathbb{R}$ is a constant. Note that $\eta_{t}\rightarrow 1$ as $t\rightarrow\infty$. We later provide conditions on $\varrho_{t}$ and $w$ to ensure a.s. convergence. As we will see in Section 4, the condition on $w$ in the One-TS setting is restrictive. Specifically, it depends on the norm of the driving matrix $\bar{A}$. This motivates us to look at the Three-TS setting, where the corresponding condition on $w$ is less restrictive. Using the momentum parameter as above,

$$\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}+\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}(\theta_{t}-\theta_{t-1}).$$

Rearranging the terms and dividing by $\varrho_{t}$, we get:

$$\frac{\theta_{t+1}-\theta_{t}}{\varrho_{t}}=\frac{\theta_{t}-\theta_{t-1}}{\varrho_{t-1}}+\frac{\alpha_{t}}{\varrho_{t}}\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-w\left(\frac{\theta_{t}-\theta_{t-1}}{\varrho_{t-1}}\right)\right).$$

We let

$$\frac{\theta_{t+1}-\theta_{t}}{\varrho_{t}}=v_{t+1},\quad\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}}\quad\mbox{and}\quad\varepsilon_{t}=v_{t+1}-v_{t}.$$

Then, the GTD-M iterates in (11) and (12) can be re-written with the following three iterates:

$$v_{t+1}=v_{t}+\xi_{t}\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\right), \tag{17}$$
$$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-u_{t}), \tag{18}$$
$$\theta_{t+1}=\theta_{t}+\varrho_{t}(v_{t}+\varepsilon_{t}). \tag{19}$$

A similar decomposition can be done for the GTD2-M and TDC-M iterates.
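The equivalence between the heavy ball form (11)-(12) and the three-iterate form (17)-(19) can be checked numerically. The sketch below runs both recursions on the same stream of samples and compares the resulting $\theta$. The random features and rewards, the polynomial step-size exponents and the initialisation $\theta_{-1}=\theta_{0}$ (so that $v_{0}=0$) are assumptions made for this illustration only.

```python
import numpy as np

def step_sizes(t):
    # Hypothetical polynomial schedules, defined for t >= -1.
    xi = (t + 2.0) ** -0.6
    beta = (t + 2.0) ** -0.8
    varrho = (t + 2.0) ** -1.0
    return xi, beta, varrho

def run_both(num_steps=200, d=3, gamma=0.9, w=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Heavy ball form: needs theta_t and theta_{t-1}.
    theta_prev, theta, u_hb = np.zeros(d), np.zeros(d), np.zeros(d)
    # Decomposed form: v_0 = (theta_0 - theta_{-1}) / varrho_{-1} = 0.
    v, u_dec, theta_dec = np.zeros(d), np.zeros(d), np.zeros(d)
    for t in range(num_steps):
        phi = rng.uniform(-1.0, 1.0, d) / np.sqrt(d)       # ||phi|| <= 1
        phi_next = rng.uniform(-1.0, 1.0, d) / np.sqrt(d)
        reward = rng.uniform(-1.0, 1.0)
        xi, beta, varrho = step_sizes(t)
        varrho_prev = step_sizes(t - 1)[2]
        alpha = xi * varrho                                 # since xi_t = alpha_t / varrho_t
        eta = (varrho - w * alpha) / varrho_prev            # momentum parameter

        # Heavy ball iterates (11)-(12).
        delta = reward + gamma * (theta @ phi_next) - (theta @ phi)
        grad = (phi - gamma * phi_next) * (phi @ u_hb)
        theta_new = theta + alpha * grad + eta * (theta - theta_prev)
        u_hb = u_hb + beta * (delta * phi - u_hb)
        theta_prev, theta = theta, theta_new

        # Three-iterate form (17)-(19); theta uses v_{t+1} = v_t + eps_t.
        delta_dec = reward + gamma * (theta_dec @ phi_next) - (theta_dec @ phi)
        v = v + xi * ((phi - gamma * phi_next) * (phi @ u_dec) - w * v)
        u_dec = u_dec + beta * (delta_dec * phi - u_dec)
        theta_dec = theta_dec + varrho * v
    return theta, theta_dec

theta_hb, theta_dec = run_both()
```

The two trajectories agree up to floating-point rounding, which is exactly the algebraic identity $\theta_{t+1}-\theta_{t}=\varrho_{t}v_{t+1}$ used in the derivation.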

4 Convergence Analysis

In this section, we analyze the asymptotic behaviour of the GTD-M iterates given by (17), (18) and (19). Throughout the section, we consider $v_{t},u_{t},\theta_{t}\in\mathbb{R}^{d}$. We first consider the One-TS case, where $\beta_{t}=c_{1}\xi_{t}$ and $\varrho_{t}=c_{2}\xi_{t}$ $\forall t$, for some real constants $c_{1},c_{2}>0$. Subsequently, we consider the Three-TS setting, where $\frac{\beta_{t}}{\xi_{t}}\rightarrow 0$ and $\frac{\varrho_{t}}{\beta_{t}}\rightarrow 0$ as $t\rightarrow\infty$.

4.1 One-Timescale Setting

We begin by analyzing GTD-M using a one-timescale SA setting. We let $c_{1}=c_{2}=1$ for simplicity. The iterates of GTD-M can then be re-written as:

$$\psi_{t+1}=\psi_{t}+\xi_{t}(G_{t}\psi_{t}+g_{t}+\bar{\varepsilon}_{t}), \tag{20}$$

where,

$$\psi_{t}=\begin{pmatrix}v_{t}\\ u_{t}\\ \theta_{t}\end{pmatrix},\quad g_{t}=\begin{pmatrix}0\\ r_{t+1}\phi_{t}\\ 0\end{pmatrix},\quad\bar{\varepsilon}_{t}=\begin{pmatrix}0\\ 0\\ \varepsilon_{t}\end{pmatrix},$$
$$G_{t}=\begin{pmatrix}-wI&(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}&0\\ 0&-I&\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}\\ I&0&0\end{pmatrix}.$$

Equation (20) can be re-written in the general SA scheme as:

$$\psi_{t+1}=\psi_{t}+\xi_{t}(h(\psi_{t})+M_{t+1}+\bar{\varepsilon}_{t}). \tag{21}$$

Here $h(\psi)=g+G\psi$, $g=\mathbb{E}[g_{t}]$ and $G=\mathbb{E}[G_{t}]$, where the expectations are w.r.t. the stationary distribution of the Markov chain induced by the target policy $\pi$. Comparing (20) and (21), the noise term is $M_{t+1}=(G_{t}-G)\psi_{t}+(g_{t}-g)$. In particular,

$$G=\begin{pmatrix}-wI&-\bar{A}^{T}&0\\ 0&-I&\bar{A}\\ I&0&0\end{pmatrix},\quad g=\begin{pmatrix}0\\ \bar{b}\\ 0\end{pmatrix},$$

where recall that $\bar{A}=\mathbb{E}[\phi(\gamma\phi^{\prime}-\phi)^{T}]$ and $\bar{b}=\mathbb{E}[r\phi]$.

Lemma 1.

Assume $w>0$ and $w(w+1)>\|\bar{A}\|^{2}$. Then, the matrix $G$ is Hurwitz.

Proof.

Let $\lambda$ be an eigenvalue of $G$. The characteristic equation of the matrix $G$ is given by:

$$\begin{vmatrix}-wI-\lambda I&-\bar{A}^{T}&0\\ 0&-I-\lambda I&\bar{A}\\ I&0&-\lambda I\end{vmatrix}=0,$$
$$\begin{vmatrix}wI+\lambda I&\bar{A}^{T}&0\\ 0&I+\lambda I&-\bar{A}\\ -I&0&\lambda I\end{vmatrix}=0.$$

Using the following formula for the determinant of block matrices

$$\begin{vmatrix}A_{11}&A_{12}&A_{13}\\ A_{21}&A_{22}&A_{23}\\ A_{31}&A_{32}&A_{33}\end{vmatrix}=\begin{vmatrix}A_{11}\end{vmatrix}\begin{vmatrix}\begin{pmatrix}A_{22}&A_{23}\\ A_{32}&A_{33}\end{pmatrix}-\begin{pmatrix}A_{21}\\ A_{31}\end{pmatrix}A_{11}^{-1}\begin{pmatrix}A_{12}&A_{13}\end{pmatrix}\end{vmatrix}$$

we have,

$$\begin{aligned}\begin{vmatrix}wI+\lambda I&\bar{A}^{T}&0\\ 0&I+\lambda I&-\bar{A}\\ -I&0&\lambda I\end{vmatrix}&=\begin{vmatrix}(w+\lambda)I\end{vmatrix}\begin{vmatrix}\begin{pmatrix}I+\lambda I&-\bar{A}\\ 0&\lambda I\end{pmatrix}-\frac{1}{w+\lambda}\begin{pmatrix}0\\ -I\end{pmatrix}\begin{pmatrix}\bar{A}^{T}&0\end{pmatrix}\end{vmatrix}\\ &=(w+\lambda)^{d}\begin{vmatrix}I+\lambda I&-\bar{A}\\ \frac{\bar{A}^{T}}{w+\lambda}&\lambda I\end{vmatrix}\\ &=(w+\lambda)^{d}\begin{vmatrix}(1+\lambda)I\end{vmatrix}\begin{vmatrix}\lambda I+\frac{1}{(1+\lambda)(w+\lambda)}\bar{A}^{T}\bar{A}\end{vmatrix}\\ &=\frac{(w+\lambda)^{d}(1+\lambda)^{d}}{(w+\lambda)^{d}(1+\lambda)^{d}}\begin{vmatrix}\lambda(1+\lambda)(w+\lambda)I+\bar{A}^{T}\bar{A}\end{vmatrix}\\ &=\begin{vmatrix}\lambda(1+\lambda)(w+\lambda)I+\bar{A}^{T}\bar{A}\end{vmatrix}.\end{aligned}$$

Therefore, from the characteristic equation of $G$, we have that

$$\begin{vmatrix}\lambda(1+\lambda)(w+\lambda)I+\bar{A}^{T}\bar{A}\end{vmatrix}=0.$$

There must exist a non-zero vector $x\in\mathbb{C}^{d}$ such that

$$x^{*}\left(\lambda(1+\lambda)(w+\lambda)I+\bar{A}^{T}\bar{A}\right)x=0,$$

where $x^{*}$ is the conjugate transpose of the vector $x$ and $x^{*}x=\|x\|^{2}>0$. The above equation reduces to the following cubic polynomial equation:

$$\lambda^{3}\|x\|^{2}+(w+1)\lambda^{2}\|x\|^{2}+w\lambda\|x\|^{2}+\|\bar{A}x\|^{2}=0,$$

where $\|\bar{A}x\|^{2}=x^{*}\bar{A}^{T}\bar{A}x$. By the Routh-Hurwitz criterion, a cubic polynomial $a_{3}\lambda^{3}+a_{2}\lambda^{2}+a_{1}\lambda+a_{0}$ has all roots with negative real parts iff $a_{3},a_{2},a_{1},a_{0}>0$ and $a_{1}a_{2}>a_{0}a_{3}$. In our case, $a_{3}=\|x\|^{2}>0$, $a_{2}=(w+1)\|x\|^{2}>0$, $a_{1}=w\|x\|^{2}>0$ and $a_{0}=\|\bar{A}x\|^{2}>0$. The last inequality follows from the fact that $\bar{A}$ is negative definite, hence nonsingular, and therefore $x^{*}\bar{A}^{T}\bar{A}x>0$. Finally, $a_{1}a_{2}=w(w+1)\|x\|^{4}$ and $a_{0}a_{3}=\|x\|^{2}\|\bar{A}x\|^{2}$, so $a_{1}a_{2}>a_{0}a_{3}$ follows from $\frac{\|\bar{A}x\|^{2}}{\|x\|^{2}}\leq\|\bar{A}\|^{2}<w(w+1)$. Therefore $Re(\lambda)<0$ and the claim follows. ∎
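Lemma 1 can be illustrated numerically. The sketch below builds $G$ from a randomly generated negative definite matrix standing in for $\bar{A}$ (a hypothetical matrix, rescaled so that $\|\bar{A}\|=1$), and compares a value of $w$ that satisfies $w(w+1)>\|\bar{A}\|^{2}$ with one that violates it. The helper name `build_G` is introduced here for illustration.

```python
import numpy as np

def build_G(A_bar, w):
    # Block matrix G = E[G_t] from the one-timescale analysis.
    d = A_bar.shape[0]
    I, Z = np.eye(d), np.zeros((d, d))
    return np.block([[-w * I, -A_bar.T, Z],
                     [Z,      -I,       A_bar],
                     [I,       Z,       Z]])

rng = np.random.default_rng(0)
d = 4
B = rng.standard_normal((d, d))
C = rng.standard_normal((d, d))
# Negative definite (x^T A x < 0) but non-symmetric stand-in for A_bar.
A_bar = -(B @ B.T + 0.1 * np.eye(d)) + (C - C.T)
A_bar = A_bar / np.linalg.norm(A_bar, 2)   # spectral norm becomes 1

w_ok, w_bad = 1.0, 0.1                     # w(w+1) = 2 > 1 versus 0.11 < 1
max_real_ok = np.linalg.eigvals(build_G(A_bar, w_ok)).real.max()
max_real_bad = np.linalg.eigvals(build_G(A_bar, w_bad)).real.max()
```

With `w_ok` every eigenvalue of $G$ has negative real part. With `w_bad` the cubic associated with the largest singular value of $\bar{A}$ fails the Routh-Hurwitz test, so $G$ has an eigenvalue in the right half-plane; this shows the condition on $w$ is not merely an artifact of the proof for this example.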

Consider the following assumptions:

$\boldsymbol{\mathcal{A}}$ 1.

All rewards $r(s,s^{\prime})$ and features $\phi(s)$ are bounded, i.e., $|r(s,s^{\prime})|\leq 1$ and $\|\phi(s)\|\leq 1$ $\forall s,s^{\prime}\in\mathcal{S}$. Also, the matrix $\Phi$ has full rank, where $\Phi$ is an $n\times d$ matrix whose $s^{th}$ row is $\phi(s)^{T}$.

$\boldsymbol{\mathcal{A}}$ 2.

The step-sizes satisfy $\xi_{t}=\beta_{t}=\varrho_{t}>0$,

$$\sum_{t}\xi_{t}=\infty,\quad\sum_{t}\xi_{t}^{2}<\infty,\quad\mbox{where }\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}},$$

and the momentum parameter satisfies: $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}.$

$\boldsymbol{\mathcal{A}}$ 3.

The samples $(\phi_{t},\phi_{t}^{\prime})$ are drawn i.i.d. from the stationary distribution of the Markov chain induced by the target policy $\pi$.

Theorem 2.

Assume $\boldsymbol{\mathcal{A}}$1, $\boldsymbol{\mathcal{A}}$2 and $\boldsymbol{\mathcal{A}}$3 hold and let $w\geq 1$. Then, the GTD-M iterates given by (11) and (12) satisfy $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ a.s. as $n\rightarrow\infty$.

Proof.

Assumption $\boldsymbol{\mathcal{A}}$1, together with $w\geq 1$, ensures that $\|\bar{A}\|^{2}<w(w+1)$, and $\boldsymbol{\mathcal{A}}$3 ensures that the function $h(\cdot)$ is well defined. Now, using Lemma 1 and (Borkar and Meyn 2000), we can show that the iterates in (20) remain stable. Then, using the third extension from (Chapter-2 pp. 17, Borkar (2008b)), we can show that $\psi_{n}\rightarrow-G^{-1}g$ as $n\rightarrow\infty$. Thereafter, using the formula for the inverse of block matrices, it can be shown that $\theta_{n}\rightarrow-\bar{A}^{-1}\bar{b}$ as $n\rightarrow\infty$. See Appendix A1 for a detailed proof. ∎
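As a quick sanity check on the limit claimed in Theorem 2, the equilibrium of $h(\psi)=g+G\psi$ can also be read off by solving the three block rows directly, without inverting $G$. The following worked derivation is a sketch that uses only the definitions of $G$ and $g$ given earlier and the invertibility of $\bar{A}$; the notation $\psi^{*}=(v^{*},u^{*},\theta^{*})$ for the equilibrium is introduced here.

```latex
\begin{aligned}
G\psi^{*}+g=0
\;\Longleftrightarrow\;&
\begin{cases}
-w\,v^{*}-\bar{A}^{T}u^{*}=0,\\
-u^{*}+\bar{A}\theta^{*}+\bar{b}=0,\\
v^{*}=0,
\end{cases}\\
\Longrightarrow\;& \bar{A}^{T}u^{*}=0
\;\Longrightarrow\; u^{*}=0
\quad(\bar{A}\text{ is nonsingular}),\\
\Longrightarrow\;& \bar{A}\theta^{*}=-\bar{b}
\;\Longrightarrow\; \theta^{*}=-\bar{A}^{-1}\bar{b}.
\end{aligned}
```

Hence $-G^{-1}g=(0,\,0,\,-\bar{A}^{-1}\bar{b})$: at the limit point the velocity iterate $v$ and the auxiliary iterate $u$ both vanish, and only the $\theta$ block carries the TD solution.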

Similar results can be proved for the GTD2-M and TDC-M iterates.

Remark 1.

If $w$ is large, the initial values of the momentum parameter are small. The condition on $w$ in Lemma 1 is restrictive compared to the condition on $w$ in (Avrachenkov, Patil, and Thoppe 2020), which is $w>0$. Motivated by this, we look at the three-TS case of the iterates.

4.2 Three Timescale Setting

We consider the three iterates for GTD-M in (17), (18) and (19) under the following criteria for step-sizes: $\frac{\beta_{t}}{\xi_{t}}\rightarrow 0$ and $\frac{\varrho_{t}}{\beta_{t}}\rightarrow 0$ as $t\rightarrow\infty$. We provide the first conditions for stability and a.s. convergence of generic three-TS SA recursions. We emphasize that the setting we look at in Theorem 3 is more general than the setting at hand of GTD-M iterates. Although stability and convergence results exist for one-TS and two-TS cases, this is the first time such results have been provided for the case of three-TS recursions. We next provide the general iterates for a three-TS recursion along with the assumptions used while analyzing them. Consider the following three iterates:

$$x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n},z_{n})+M_{n+1}^{(1)}+\varepsilon^{(1)}_{n}\right), \tag{22}$$
$$y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n},z_{n})+M_{n+1}^{(2)}+\varepsilon^{(2)}_{n}\right), \tag{23}$$
$$z_{n+1}=z_{n}+c(n)\left(f(x_{n},y_{n},z_{n})+M_{n+1}^{(3)}+\varepsilon^{(3)}_{n}\right), \tag{24}$$

and the following assumptions:

  • (B1)

    $h:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$, $g:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{2}}$, $f:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{3}}$ are Lipschitz continuous, with Lipschitz constants $L_{1},L_{2}$ and $L_{3}$ respectively.

  • (B2)

    $\{a(n)\}$, $\{b(n)\}$, $\{c(n)\}$ are step-size sequences that satisfy $a(n)>0,b(n)>0,c(n)>0,\forall n>0,$

    $$\sum_{n}a(n)=\sum_{n}b(n)=\sum_{n}c(n)=\infty,$$
    $$\sum_{n}\left(a(n)^{2}+b(n)^{2}+c(n)^{2}\right)<\infty,$$
    $$\frac{b(n)}{a(n)}\rightarrow 0,\quad\frac{c(n)}{b(n)}\rightarrow 0\mbox{ as }n\rightarrow\infty.$$
  • (B3)

    $\{M_{n}^{(1)}\},\{M_{n}^{(2)}\},\{M_{n}^{(3)}\}$ are martingale difference sequences w.r.t. the filtration $\{\mathcal{F}_{n}\}$, where

    $$\mathcal{F}_{n}=\sigma\left(x_{m},y_{m},z_{m},M_{m}^{(1)},M_{m}^{(2)},M_{m}^{(3)},m\leq n\right),$$
    $$\mathbb{E}\left[\|M_{n+1}^{(i)}\|^{2}\,\big|\,\mathcal{F}_{n}\right]\leq K_{i}\left(1+\|x_{n}\|^{2}+\|y_{n}\|^{2}+\|z_{n}\|^{2}\right),$$

    $\forall n\geq 0$, $i=1,2,3$ and constants $0<K_{i}<\infty$. The terms $\varepsilon^{(i)}_{n}$ satisfy $\|\varepsilon^{(1)}_{n}\|+\|\varepsilon^{(2)}_{n}\|+\|\varepsilon^{(3)}_{n}\|\rightarrow 0$ as $n\rightarrow\infty$.

  • (B4)
    1. (i)

      The ODE $\dot{x}(t)=h(x(t),y,z)$, $y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$, has a globally asymptotically stable equilibrium (g.a.s.e.) $\lambda(y,z)$, and $\lambda:\mathbb{R}^{d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$ is Lipschitz continuous.

    2. (ii)

      The ODE $\dot{y}(t)=g(\lambda(y(t),z),y(t),z)$, $z\in\mathbb{R}^{d_{3}}$, has a globally asymptotically stable equilibrium $\Gamma(z)$, where $\Gamma:\mathbb{R}^{d_{3}}\rightarrow\mathbb{R}^{d_{2}}$ is Lipschitz continuous.

    3. (iii)

      The ODE $\dot{z}(t)=f(\lambda(\Gamma(z(t)),z(t)),\Gamma(z(t)),z(t))$ has a globally asymptotically stable equilibrium $z^{*}\in\mathbb{R}^{d_{3}}$.

  • (B5)

    The functions $h_{c}(x,y,z)=\frac{h(cx,cy,cz)}{c},c\geq 1$, satisfy $h_{c}\rightarrow h_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. The ODE $\dot{x}(t)=h_{\infty}(x(t),y,z)$ has a unique globally asymptotically stable equilibrium $\lambda_{\infty}(y,z)$, where $\lambda_{\infty}:\mathbb{R}^{d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$ is Lipschitz continuous. Further, $\lambda_{\infty}(0,0)=0$.

  • (B6)

    The functions $g_{c}(y,z)=\frac{g(c\lambda_{\infty}(y,z),cy,cz)}{c},c\geq 1$, satisfy $g_{c}\rightarrow g_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. The ODE $\dot{y}(t)=g_{\infty}(y(t),z)$ has a unique globally asymptotically stable equilibrium $\Gamma_{\infty}(z)$, where $\Gamma_{\infty}:\mathbb{R}^{d_{3}}\rightarrow\mathbb{R}^{d_{2}}$ is Lipschitz continuous. Further, $\Gamma_{\infty}(0)=0$.

  • (B7)

    The functions $f_{c}(z)=\frac{f(c\lambda_{\infty}(\Gamma_{\infty}(z),z),c\Gamma_{\infty}(z),cz)}{c},c\geq 1$, satisfy $f_{c}\rightarrow f_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. The ODE $\dot{z}(t)=f_{\infty}(z(t))$ has the origin in $\mathbb{R}^{d_{3}}$ as its unique globally asymptotically stable equilibrium.

Remark 2.

Conditions (B5)-(B7) give sufficient conditions that ensure that the iterates remain stable. Specifically, they ensure that $\sup_{n}(\|x_{n}\|+\|y_{n}\|+\|z_{n}\|)<\infty$ a.s. Conditions (B1)-(B4), along with the stability of the iterates, ensure a.s. convergence of the iterates.

Theorem 3.

Under assumptions (B1)-(B7), the iterates given by (22), (23) and (24) satisfy

$$(x_{n},y_{n},z_{n})\rightarrow(\lambda(\Gamma(z^{*}),z^{*}),\Gamma(z^{*}),z^{*})\mbox{ as }n\rightarrow\infty.$$
Proof.

See Appendix A2. ∎

Next, we use Theorem 3 to show that the iterates of GTD-M converge a.s. to the TD solution $-\bar{A}^{-1}\bar{b}$. Consider the following assumption on step-size sequences instead of $\boldsymbol{\mathcal{A}}$2.

$\boldsymbol{\mathcal{A}}$ 4.

The step-sizes satisfy $\xi_{t}>0,\beta_{t}>0,\varrho_{t}>0$ $\forall t$,

$$\sum_{t}\xi_{t}=\sum_{t}\beta_{t}=\sum_{t}\varrho_{t}=\infty,$$
$$\sum_{t}\left(\xi_{t}^{2}+\beta_{t}^{2}+\varrho_{t}^{2}\right)<\infty,$$
$$\frac{\beta_{t}}{\xi_{t}}\rightarrow 0,\quad\frac{\varrho_{t}}{\beta_{t}}\rightarrow 0\mbox{ as }t\rightarrow\infty,$$

and the momentum parameter satisfies: $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}.$
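One way to obtain step-sizes with the timescale separation required above is to use polynomially decaying schedules with increasing exponents. The sketch below is an illustration under that assumption; the exponents 0.6, 0.8 and 1.0, the value $w=0.5$ and the function name `schedules` are hypothetical choices. It derives $\alpha_{t}=\xi_{t}\varrho_{t}$ from the definition $\xi_{t}=\alpha_{t}/\varrho_{t}$ and then evaluates the momentum parameter $\eta_{t}$.

```python
import numpy as np

def schedules(t, w=0.5, p_xi=0.6, p_beta=0.8, p_varrho=1.0):
    # Polynomial schedules 1/(t+1)^p with 1/2 < p_xi < p_beta < p_varrho <= 1.
    # Valid for t >= 1 so that varrho_{t-1} = 1/t^p is well defined here.
    t = np.asarray(t, dtype=float)
    xi = (t + 1.0) ** -p_xi          # fastest timescale (v iterate)
    beta = (t + 1.0) ** -p_beta      # intermediate timescale (u iterate)
    varrho = (t + 1.0) ** -p_varrho  # slowest timescale (theta iterate)
    alpha = xi * varrho              # because xi_t = alpha_t / varrho_t
    varrho_prev = t ** -p_varrho
    eta = (varrho - w * alpha) / varrho_prev
    return xi, beta, varrho, alpha, eta

t_grid = np.array([1, 10, 100, 10**4, 10**6])
xi, beta, varrho, alpha, eta = schedules(t_grid)
ratio_beta_xi = beta / xi            # should decrease towards 0
ratio_varrho_beta = varrho / beta    # should decrease towards 0
```

On this grid both ratios shrink as $t$ grows, and $\eta_{t}$ rises from roughly 0.34 at $t=1$ towards 1, consistent with the remark that the momentum parameter is small initially and tends to 1 asymptotically.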

Theorem 4.

Assume $\boldsymbol{\mathcal{A}}$1, $\boldsymbol{\mathcal{A}}$3 and $\boldsymbol{\mathcal{A}}$4 hold and let $w>0$. Then, the GTD-M iterates given by (11) and (12) satisfy $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ a.s. as $n\rightarrow\infty$.

Proof.

We transform the iterates given by (17), (18) and (19) into the standard SA form given by (22), (23) and (24). Let $\mathcal{F}_{t}=\sigma(u_{0},v_{0},\theta_{0},r_{j+1},\phi_{j},\phi_{j}^{\prime}:j<t)$. Let $A_{t}=\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}$ and $b_{t}=r_{t+1}\phi_{t}$. Then, (17) can be re-written as:

$$v_{t+1}=v_{t}+\xi_{t}\left(h(v_{t},u_{t},\theta_{t})+M_{t+1}^{(1)}\right)$$

where,

$$\begin{aligned}h(v_{t},u_{t},\theta_{t})&=\mathbb{E}[(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\,|\,\mathcal{F}_{t}]=-\bar{A}^{T}u_{t}-wv_{t},\\ M_{t+1}^{(1)}&=-A_{t}^{T}u_{t}-wv_{t}-h(v_{t},u_{t},\theta_{t})=(\bar{A}^{T}-A_{t}^{T})u_{t}.\end{aligned}$$

Next, (18) can be re-written as:

$$u_{t+1}=u_{t}+\beta_{t}\left(g(v_{t},u_{t},\theta_{t})+M_{t+1}^{(2)}\right)$$

where,

$$\begin{aligned}g(v_{t},u_{t},\theta_{t})&=\mathbb{E}[\delta_{t}\phi_{t}-u_{t}\,|\,\mathcal{F}_{t}]=\bar{A}\theta_{t}+\bar{b}-u_{t},\\ M_{t+1}^{(2)}&=A_{t}\theta_{t}+b_{t}-u_{t}-g(v_{t},u_{t},\theta_{t})\\ &=(A_{t}-\bar{A})\theta_{t}+(b_{t}-\bar{b}).\end{aligned}$$

Finally, (19) can be re-written as:

$$\theta_{t+1}=\theta_{t}+\varrho_{t}\left(f(v_{t},u_{t},\theta_{t})+\varepsilon_{t}+M_{t+1}^{(3)}\right)$$

where $f(v_{t},u_{t},\theta_{t})=v_{t}$ and $M_{t+1}^{(3)}=0$.

Figure 1: RMSPBE (averaged over 100 independent runs) across episodes for the Boyan chain. The features used are the standard spiked features of size 4 used in the Boyan chain (see (Dann, Neumann, and Peters 2014)).
Figure 2: RMSPBE (averaged over 100 independent runs) across episodes for the 5-State Random Chain problem. The features used are the Dependent features used in (Sutton et al. 2009).

The functions $h,g,f$ are linear in $v,u,\theta$ and hence Lipschitz continuous, therefore satisfying $\mathbf{(B1)}$. We choose the step-size sequences such that they satisfy $\mathbf{(B2)}$. One popular choice is $\xi_{t}=\frac{1}{(t+1)^{\xi}}$, $\beta_{t}=\frac{1}{(t+1)^{\beta}}$, $\varrho_{t}=\frac{1}{(t+1)^{\varrho}}$ with $\frac{1}{2}<\xi<\beta<\varrho\leq 1$. Next, $M_{t+1}^{(1)},M_{t+1}^{(2)}$ and $M_{t+1}^{(3)}$, $t\geq 0$, are martingale difference sequences w.r.t. $\mathcal{F}_{t}$ by construction. Moreover, $\mathbb{E}[||M_{t+1}^{(1)}||^{2}|\mathcal{F}_{t}]\leq||(\bar{A}^{T}-A_{t}^{T})||^{2}||u_{t}||^{2}$ and $\mathbb{E}[||M_{t+1}^{(2)}||^{2}|\mathcal{F}_{t}]\leq 2(||(A_{t}-\bar{A})||^{2}||\theta_{t}||^{2}+||(b_{t}-\bar{b})||^{2})$. The first part of $\mathbf{(B3)}$ is satisfied with $K_{1}=||(\bar{A}^{T}-A_{t}^{T})||^{2}$, $K_{2}=2\max(||A_{t}-\bar{A}||^{2},||b_{t}-\bar{b}||^{2})$ and any $K_{3}>0$. The fact that $K_{1},K_{2}<\infty$ follows from the bounded features and bounded rewards assumption in $\boldsymbol{\mathcal{A}}$1. Next, observe that $||\varepsilon_{t}^{(3)}||=\xi_{t}||\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\right)||\rightarrow 0$ since $\xi_{t}\rightarrow 0$ as $t\rightarrow\infty$. For fixed $u,\theta\in\mathbb{R}^{d}$, consider the ODE $\dot{v}(t)=-\bar{A}^{T}u-wv(t)$. For $w>0$, $\lambda(u,\theta)=-\frac{\bar{A}^{T}u}{w}$ is its unique g.a.s.e.; it is linear and therefore Lipschitz continuous. This satisfies $\mathbf{(B4)}$(i). Next, for a fixed $\theta\in\mathbb{R}^{d}$, the ODE $\dot{u}(t)=\bar{A}\theta+\bar{b}-u(t)$ has $\Gamma(\theta)=\bar{A}\theta+\bar{b}$ as its unique g.a.s.e., and $\Gamma$ is Lipschitz. This satisfies $\mathbf{(B4)}$(ii). Finally, to satisfy $\mathbf{(B4)}$(iii), consider,

\[\begin{aligned}\dot{\theta}(t)&=f(\lambda(\Gamma(\theta(t)),\theta(t)),\Gamma(\theta(t)),\theta(t))\\ &=\frac{-\bar{A}^{T}\bar{A}\theta(t)-\bar{A}^{T}\bar{b}}{w}.\end{aligned}\]

Since $\bar{A}$ is negative definite, it is nonsingular, and therefore $-\bar{A}^{T}\bar{A}$ is negative definite. Hence $\theta^{*}=-\bar{A}^{-1}\bar{b}$ is the unique g.a.s.e. Next, we show that the sufficient conditions for stability of the three iterates are satisfied. The function $h_{c}(v,u,\theta)=\frac{-c\bar{A}^{T}u-wcv}{c}=-\bar{A}^{T}u-wv\rightarrow h_{\infty}(v,u,\theta)=-\bar{A}^{T}u-wv$ uniformly on compacts as $c\rightarrow\infty$. The limiting ODE $\dot{v}(t)=-\bar{A}^{T}u-wv(t)$ has $\lambda_{\infty}(u,\theta)=-\frac{\bar{A}^{T}u}{w}$ as its unique g.a.s.e. $\lambda_{\infty}$ is Lipschitz with $\lambda_{\infty}(0,0)=0$, thus satisfying assumption $\mathbf{(B5)}$.

The function $g_{c}(u,\theta)=\frac{c\bar{A}\theta+\bar{b}-cu}{c}=\bar{A}\theta-u+\frac{\bar{b}}{c}\rightarrow g_{\infty}(u,\theta)=\bar{A}\theta-u$ uniformly on compacts as $c\rightarrow\infty$. The limiting ODE $\dot{u}(t)=\bar{A}\theta-u(t)$ has $\Gamma_{\infty}(\theta)=\bar{A}\theta$ as its unique g.a.s.e. $\Gamma_{\infty}$ is Lipschitz with $\Gamma_{\infty}(0)=0$. Thus assumption $\mathbf{(B6)}$ is satisfied.

Finally, $f_{c}(\theta)=\frac{-c\bar{A}^{T}\bar{A}\theta}{cw}\rightarrow f_{\infty}(\theta)=\frac{-\bar{A}^{T}\bar{A}\theta}{w}$ uniformly on compacts as $c\rightarrow\infty$, and the ODE $\dot{\theta}(t)=-\frac{\bar{A}^{T}\bar{A}\theta(t)}{w}$ has the origin as its unique g.a.s.e. This ensures the final condition $\mathbf{(B7)}$. By Theorem 3,

\[\begin{pmatrix}v_{t}\\ u_{t}\\ \theta_{t}\end{pmatrix}\rightarrow\begin{pmatrix}\lambda(\Gamma(-\bar{A}^{-1}\bar{b}),-\bar{A}^{-1}\bar{b})\\ \Gamma(-\bar{A}^{-1}\bar{b})\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}=\begin{pmatrix}0\\ 0\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}.\]

Specifically, $\theta_{t}\rightarrow-\bar{A}^{-1}\bar{b}$. ∎
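The limit point in the proof can be checked numerically on a small instance. The sketch below evaluates the equilibrium maps $\lambda$, $\Gamma$ and $\theta^{*}$ from the proof; the matrix `A_bar`, the vector `b_bar`, the value of `w` and the helper name `equilibria` are hypothetical choices made only for illustration.

```python
import numpy as np

def equilibria(A_bar, b_bar, w):
    """Return the maps lambda, Gamma and the point theta* used in the proof (sketch)."""
    lam = lambda u, theta: -A_bar.T @ u / w        # g.a.s.e. of the v-ODE for fixed (u, theta)
    Gam = lambda theta: A_bar @ theta + b_bar      # g.a.s.e. of the u-ODE for fixed theta
    theta_star = -np.linalg.solve(A_bar, b_bar)    # g.a.s.e. of the theta-ODE
    return lam, Gam, theta_star

# Hypothetical 2x2 instance; A_bar is negative definite (x^T A_bar x < 0 for x != 0).
A_bar = np.array([[-1.0, 0.3], [-0.2, -0.8]])
b_bar = np.array([0.5, -1.0])
lam, Gam, theta_star = equilibria(A_bar, b_bar, w=0.1)
u_lim = Gam(theta_star)            # numerically zero
v_lim = lam(u_lim, theta_star)     # numerically zero
```

In this instance both `u_lim` and `v_lim` vanish up to floating-point error, in agreement with the limit $(0,0,-\bar{A}^{-1}\bar{b})$ stated above.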

Similar analysis can be provided for GTD2-M and TDC-M iterates. See Appendix A3 for details.

5 Experiments

Figure 3: RMSPBE (averaged over 100 independent runs) across episodes for the 19-State Random Walk problem. The features used are an extension of the Dependent features used in (Sutton et al. 2009).
Figure 4: RMSPBE (averaged over 100 independent runs) across episodes for the 20-state Random MDP with 5 random actions. The features used are linear random features of size 10 (see (Dann, Neumann, and Peters 2014)). For each state, the value of the feature vector at the $10^{th}$ position is 1, and the values in the other 9 positions are chosen randomly from 0 to 10 and then normalized.

We evaluate the momentum-based GTD algorithms defined in Section 3 on four standard policy evaluation problems in reinforcement learning, namely, Boyan Chain (Boyan 1999), 5-State random walk (Sutton et al. 2009), 19-State Random Walk (Sutton and Barto 2018) and Random MDP (Sutton et al. 2009). See Appendix A4 for a detailed description of the MDP settings and (Dann, Neumann, and Peters 2014) for details on implementation. We run the three algorithms, GTD, GTD2 and TDC, along with their heavy ball momentum variants in the One-TS and Three-TS settings, and compare the RMSPBE (root of the MSPBE) across episodes. Figures 1 to 4 plot these results. We consider decreasing step-sizes of the form $\xi_{t}=\frac{1}{(t+1)^{\xi}}$, $\beta_{t}=\frac{1}{(t+1)^{\beta}}$, $\varrho_{t}=\frac{1}{(t+1)^{\varrho}}$, $\alpha_{t}=\frac{1}{(t+1)^{\alpha}}$ in all the examples. Table 1 summarizes the different step-size sequences used in our experiments.
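When the model quantities are known, as in these small benchmark problems, the RMSPBE can be evaluated in closed form. The sketch below uses the identity $\mathrm{MSPBE}(\theta)=(\bar{A}\theta+\bar{b})^{T}C^{-1}(\bar{A}\theta+\bar{b})$ from (Sutton et al. 2009); the feature covariance $C=\mathbb{E}[\phi\phi^{T}]$ and the function name `rmspbe` are not defined in this section and are introduced here only for illustration.

```python
import numpy as np

def rmspbe(theta, A_bar, b_bar, C):
    """Root of the MSPBE at theta (sketch; C = E[phi phi^T] is assumed invertible)."""
    e = A_bar @ theta + b_bar                 # E[delta * phi] evaluated at theta
    q = float(e @ np.linalg.solve(C, e))      # quadratic form e^T C^{-1} e
    return np.sqrt(max(q, 0.0))               # clip tiny negative round-off
```

Under this identity the RMSPBE is zero exactly at the TD solution $\theta^{*}=-\bar{A}^{-1}\bar{b}$, which is the limit point established in Section 4.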

In the One-TS setting, we require $\xi=\beta=\varrho$. Since $\xi_{t}=\frac{\alpha_{t}}{\varrho_{t}}$, i.e., $\xi=\alpha-\varrho$, we must have $\alpha=2\varrho$. In the Three-TS setting, $\xi<\beta<\varrho$, thus implying $\alpha<\varrho+\beta$ and $\beta<\varrho$. Although our analysis requires square summability, i.e., $\xi,\beta,\varrho>0.5$, such a choice of step-sizes makes the algorithms converge very slowly. Recently, (Dalal et al. 2018a) showed convergence rate results for Gradient TD schemes with non-square-summable step-sizes as well (see Remark 2 of (Dalal et al. 2018a)). Therefore, we look at non-square-summable step-sizes here, and observe that in all the examples the iterates do converge. The momentum parameter is chosen as in $\boldsymbol{\mathcal{A}}$2.

Table 1: Choice of step-size parameters

Problem        Setting     $\alpha$   $\beta$   $\varrho$   $w$
Boyan Chain    Vanilla     0.25       0.125     -           -
Boyan Chain    One-TS      0.25       0.125     0.125       1
Boyan Chain    Three-TS    0.25       0.125     0.2         0.1
5-state RW     Vanilla     0.25       0.125     -           -
5-state RW     One-TS      0.25       0.125     0.125       1
5-state RW     Three-TS    0.25       0.125     0.2         0.1
19-State RW    Vanilla     0.125      0.0625    -           -
19-State RW    One-TS      0.125      0.0625    0.0625      1
19-State RW    Three-TS    0.125      0.0625    0.1         0.1
Random Chain   Vanilla     0.5        0.25      -           -
Random Chain   One-TS      0.5        0.25      0.25        1
Random Chain   Three-TS    0.5        0.25      0.3         0.1
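The relations between the exponents can be checked mechanically. The following sketch classifies a triple $(\alpha,\beta,\varrho)$ using $\xi=\alpha-\varrho$, which follows from $\xi_{t}=\alpha_{t}/\varrho_{t}$ for step-sizes of the form $(t+1)^{-\text{exponent}}$; the function name `timescale_regime` and the returned labels are hypothetical.

```python
import math

def timescale_regime(alpha, beta, rho):
    """Classify step-size exponents as one-TS, three-TS, or neither (sketch)."""
    xi = alpha - rho                      # exponent of xi_t = alpha_t / rho_t
    if math.isclose(xi, beta) and math.isclose(beta, rho):
        return "one-TS"                   # xi = beta = rho, i.e. alpha = 2 * rho
    if xi < beta < rho:
        return "three-TS"                 # alpha < rho + beta and beta < rho
    return "neither"
```

For example, the Boyan Chain rows of Table 1 give `timescale_regime(0.25, 0.125, 0.125) == "one-TS"` and `timescale_regime(0.25, 0.125, 0.2) == "three-TS"`.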

In all the examples considered, the momentum methods outperform their vanilla counterparts. Since a lower value of $w$ can be chosen in the Three-TS setting, the momentum parameter is not small in the initial phase of the algorithm, unlike in the One-TS setting. This in turn helps to reduce the RMSPBE faster in the initial phase of the algorithm, as is evident from the experiments.

6 Related Work and Conclusion

To the best of our knowledge, no previous work has specifically looked at Gradient TD methods with an added heavy ball term. The use of momentum specifically in the SA setting is very limited. Section 4.1 of (Mou et al. 2020) does talk about momentum; however, the problem looked at is that of SGD with momentum and the driving matrix is assumed to be symmetric (see Appendix H of their paper). We do not make any such assumption here. The work of (Devraj, Bušíć, and Meyn 2019) does look at momentum in the SA setting. However, they introduce a matrix momentum term which is not equivalent to heavy ball momentum. Acceleration in Gradient TD methods has been looked at in (Pan, White, and White 2017). The authors provide a new algorithm called ATD, and the acceleration is in the form of better data efficiency. However, they do not make use of momentum methods.

In this work we have introduced heavy ball momentum in Gradient Temporal Difference algorithms for the first time. We decompose the two iterates of these algorithms into three separate iterates and provide asymptotic convergence guarantees for these new schemes under the same assumptions made by their vanilla counterparts. Specifically, we show convergence in the One-TS regime as well as the Three-TS regime. In both cases, the momentum parameter gradually goes to 1. The Three-TS formulation gives us more flexibility in choosing the momentum parameter. Specifically, compared to the One-TS setting, a larger momentum parameter can be chosen during the initial phase in the Three-TS case. We observe improved performance with these new schemes when compared with the original algorithms.

As a step forward from this work, the natural direction would be to look at more sophisticated momentum methods such as Nesterov’s accelerated method (Nesterov 1983). Also, here we only provide the convergence guarantees of these new momentum methods. A particularly interesting step would be to quantify the benefits of using momentum in SA settings. Specifically, it would be interesting to extend weak convergence rate analysis of (Konda and Tsitsiklis 2004; Mokkadem and Pelletier 2006) to Three-TS regime. Also, extending the recent convergence rate results in expectation and high probability of GTD methods (Dalal et al. 2018b; Gupta, Srikant, and Ying 2019; Kaledin et al. 2019; Dalal, Szorenyi, and Thoppe 2020) to these momentum settings would be interesting works for the future.

References

  • Assran and Rabbat (2020) Assran, M.; and Rabbat, M. 2020. On the Convergence of Nesterov’s Accelerated Gradient Method in Stochastic Settings. Proceedings of the 37th International Conference on Machine Learning, PMLR, 119: 410–420.
  • Avrachenkov, Patil, and Thoppe (2020) Avrachenkov, K.; Patil, K.; and Thoppe, G. 2020. Online Algorithms for Estimating Change Rates of Web Pages. arXiv, 2009.08142.
  • Baird (1995) Baird, L. 1995. Residual Algorithms: Reinforcement Learning with Function Approximation. In Proceedings of the Twelfth International Conference on Machine Learning, 30–37. Morgan Kaufmann.
  • Borkar (2008a) Borkar, V. 2008a. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press. ISBN 9780521515924.
  • Borkar (2008b) Borkar, V. S. 2008b. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press. ISBN 9780521515924.
  • Borkar and Meyn (2000) Borkar, V. S.; and Meyn, S. P. 2000. The O.D.E. Method for Convergence of Stochastic Approximation and Reinforcement Learning. SIAM Journal on Control and Optimization, 38(2): 447–469.
  • Boyan (1999) Boyan, J. 1999. Least-Squares Temporal Difference Learning. In ICML.
  • Dalal, Szorenyi, and Thoppe (2020) Dalal, G.; Szorenyi, B.; and Thoppe, G. 2020. A Tale of Two-Timescale Reinforcement Learning with the Tightest Finite-Time Bound. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04): 3701–3708.
  • Dalal et al. (2018a) Dalal, G.; Szorenyi, B.; Thoppe, G.; and Mannor, S. 2018a. Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning. arXiv:1703.05376.
  • Dalal et al. (2018b) Dalal, G.; Thoppe, G.; Szörényi, B.; and Mannor, S. 2018b. Finite Sample Analysis of Two-Timescale Stochastic Approximation with Applications to Reinforcement Learning. In Bubeck, S.; Perchet, V.; and Rigollet, P., eds., Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, 1199–1233. PMLR.
  • Dann, Neumann, and Peters (2014) Dann, C.; Neumann, G.; and Peters, J. 2014. Policy Evaluation with Temporal Differences: A Survey and Comparison. Journal of Machine Learning Research, 15(24): 809–883.
  • Devraj, Bušíć, and Meyn (2019) Devraj, A. M.; Bušíć, A.; and Meyn, S. 2019. On Matrix Momentum Stochastic Approximation and Applications to Q-learning. 57th Annual Allerton Conference on Communication, Control, and Computing, 749–756.
  • Gadat, Panloup, and Saadane (2016) Gadat, S.; Panloup, F.; and Saadane, S. 2016. Stochastic Heavy ball. Electronic Journal of Statistics, 12: 461–529.
  • Ghadimi, Feyzmahdavian, and Johansson (2014) Ghadimi, E.; Feyzmahdavian, H. R.; and Johansson, M. 2014. Global convergence of the Heavy-ball method for convex optimization. arXiv:1412.7457.
  • Gitman et al. (2019) Gitman, I.; Lang, H.; Zhang, P.; and Xiao, L. 2019. Understanding the role of momentum in stochastic gradient methods. Advances in Neural Information Processing Systems, 9630–9640.
  • Gupta, Srikant, and Ying (2019) Gupta, H.; Srikant, R.; and Ying, L. 2019. Finite-Time Performance Bounds and Adaptive Learning Rate Selection for Two Time-Scale Reinforcement Learning. arXiv:1907.06290.
  • Kaledin et al. (2019) Kaledin, M.; Moulines, E.; Naumov, A.; Tadic, V.; and Wai, H. 2019. Finite Time Analysis of Linear Two-timescale Stochastic Approximation with Markovian Noise. Conference on Learning Theory, 125: 2144–2203.
  • Konda and Tsitsiklis (2004) Konda, V.; and Tsitsiklis, J. 2004. Convergence rate of linear two-time-scale stochastic approximation. Annals of Applied Probability, 14.
  • Kushner and Clark (1978) Kushner, H.; and Clark, D. 1978. Stochastic Approximation Methods for constrained and unconstrained systems. Springer.
  • Lakshminarayanan and Bhatnagar (2017) Lakshminarayanan, C.; and Bhatnagar, S. 2017. A Stability Criterion for Two-Timescale Stochastic Approximation Schemes. Automatica, 79: 108–114.
  • Ljung (1977) Ljung, L. 1977. Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22(4): 551–575.
  • Loizou and Richtárik (2020) Loizou, N.; and Richtárik, P. 2020. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. Computational Optimization and Applications, 77: 653–710.
  • Ma and Yarats (2019) Ma, J.; and Yarats, D. 2019. Quasi-hyperbolic momentum and adam for deep learning. International Conference on Learning Representations.
  • Maei (2011) Maei, H. R. 2011. Gradient Temporal-Difference Learning Algorithms. Ph.D. thesis, University of Alberta, CAN. AAINR89455.
  • Mokkadem and Pelletier (2006) Mokkadem, A.; and Pelletier, M. 2006. Convergence rate and averaging of nonlinear two-time-scale stochastic approximation algorithms. The Annals of Applied Probability, 16(3): 1671 – 1702.
  • Mou et al. (2020) Mou, W.; Li, C. J.; Wainwright, M. J.; Bartlett, P. L.; and Jordan, M. I. 2020. On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration. Proceedings of Thirty Third Conference on Learning Theory, PMLR, 125: 2947–2997.
  • Nesterov (1983) Nesterov, Y. 1983. A method of solving a convex programming problem with convergence rate $O\bigl(\frac{1}{k^{2}}\bigr)$. Soviet Mathematics Doklady, 269: 543–547.
  • Pan, White, and White (2017) Pan, Y.; White, A.; and White, M. 2017. Accelerated Gradient Temporal Difference Learning. arXiv:1611.09328.
  • Polyak (1964) Polyak, B. 1964. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4: 1–17.
  • Polyak (1990) Polyak, B. 1990. New stochastic approximation type procedures. Avtomatica i Telemekhanika, 7: 98–107.
  • Robbins and Monro (1951) Robbins, H.; and Monro, S. 1951. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3): 400 – 407.
  • Sutton and Barto (2018) Sutton, R.; and Barto, A. 2018. Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book. ISBN 0262039249.
  • Sutton et al. (2009) Sutton, R.; Maei, H.; Precup, D.; Bhatnagar, S.; Silver, D.; Szepesvári, C.; and Wiewiora, E. 2009. Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 993–1000. New York, NY, USA: Association for Computing Machinery. ISBN 9781605585161.
  • Sutton (1988) Sutton, R. S. 1988. Learning to Predict By the Methods of Temporal Differences. Machine Learning, 3(1): 9–44.
  • Sutton, Maei, and Szepesvári (2009) Sutton, R. S.; Maei, H.; and Szepesvári, C. 2009. A Convergent O(n) Temporal-difference Algorithm for Off-policy Learning with Linear Function Approximation. In Koller, D.; Schuurmans, D.; Bengio, Y.; and Bottou, L., eds., Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc.
  • Tsitsiklis and Van Roy (1997) Tsitsiklis, J.; and Van Roy, B. 1997. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5): 674–690.

Appendix

 

A1  Proof of Theorem 2

Consider the one-timescale recursion for the GTD-M iterates given by (21):

\[\psi_{t+1}=\psi_{t}+\xi_{t}(h(\psi_{t})+M_{t+1}+\bar{\varepsilon}_{t}).\]

Here $h(\psi)=g+G\psi$, $g=\mathbb{E}[g_{t}]$, $G=\mathbb{E}[G_{t}]$, where the expectations are w.r.t. the stationary distribution of the Markov chain induced by the target policy $\pi$, and $M_{t+1}=(G_{t+1}-G)\psi_{t}+(g_{t+1}-g)$. In particular,

\[G=\begin{pmatrix}-wI&-\bar{A}^{T}&0\\ 0&-I&\bar{A}\\ I&0&0\end{pmatrix},\quad g=\begin{pmatrix}0\\ \bar{b}\\ 0\end{pmatrix},\]

where recall that $\bar{A}=\mathbb{E}[\phi(\gamma\phi^{\prime}-\phi)^{T}]$ and $\bar{b}=\mathbb{E}[r\phi]$. We show that the conditions $\mathbf{(A1)}$-$\mathbf{(A4)}$ in Chapter 2 of (Borkar 2008b) hold and thereafter use Theorem 2 of (Borkar 2008b) to show convergence to the TD solution.

  • (A1)

    The map $h(\psi)$ is linear in $\psi$ and therefore Lipschitz continuous with Lipschitz constant $||G||$.

  • (A2)

    The step-size sequence $\xi_{t}$ satisfies the required conditions (cf. assumption $\boldsymbol{\mathcal{A}}$2 of the current paper).

  • (A3)

    By construction, $M_{t+1}$ is a martingale difference sequence w.r.t. the filtration $\mathcal{F}_{t}=\sigma(\psi_{0},M_{k},k\leq t)$. Also, $\mathbb{E}[||(G_{t+1}-G)\psi_{t}+(g_{t+1}-g)||^{2}|\mathcal{F}_{t}]\leq 2(||(G_{t+1}-G)||^{2}||\psi_{t}||^{2}+||g_{t+1}-g||^{2})$. (A3) is satisfied with $K=2\max(||(G_{t+1}-G)||^{2}+||g_{t+1}-g||^{2})$. $K<\infty$ follows from the fact that the rewards are uniformly bounded and the features are normalized (see assumption $\boldsymbol{\mathcal{A}}$1).

  • (A4)

    To ensure (A4), we show that (A5) of (Chapter 3, pp. 22, Borkar (2008b)) holds and then use Theorem 7 of (Borkar 2008b). Consider the functions $h_{c}(\psi)=\frac{h(c\psi)}{c}=\frac{g}{c}+G\psi$, $c\geq 1$. On any compact set $H$, $h_{c}\rightarrow h_{\infty}=G\psi$ uniformly as $c\rightarrow\infty$. Consider the ODE

    \[\dot{\psi}(t)=h_{\infty}(\psi(t))=G\psi(t).\]

    Observe that since $||\phi_{t}||\leq 1$ and $r_{t}\leq 1$ $\forall t$, we have $||A||^{2}<2$. Since we have assumed that $w\geq 1$, it follows that $w(w+1)\geq||A||^{2}$, and hence from Lemma 1, $G$ is Hurwitz. Hence, the origin is the unique globally asymptotically stable equilibrium (g.a.s.e) of the above ODE. This in turn implies that the iterates remain bounded, i.e., $\sup_{t}||\psi_{t}||<\infty$ a.s. By (Theorem 2, Chapter 2 of Borkar (2008b)), $\psi_{t}$ converges to an internally chain transitive invariant set of the ODE $\dot{\psi}(t)=h(\psi(t))=g+G\psi(t)$. The only such set is the equilibrium point $-G^{-1}g$. By (Corollary 4, Chapter 2 of Borkar (2008b)),

    \[\psi_{t}\rightarrow-G^{-1}g.\]

    A straightforward calculation for the inverse of the $3\times 3$ block matrix $G$ gives us that

    \[\theta_{t}\rightarrow-\bar{A}^{-1}\bar{b}.\]
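The two facts used in this proof, that $G$ is Hurwitz and that the $\theta$-block of $-G^{-1}g$ equals $-\bar{A}^{-1}\bar{b}$, can be checked numerically on a small instance. The sketch below is only illustrative: the helper `build_G` and the diagonal matrix `A_bar` are hypothetical, and the diagonal (hence symmetric) choice is a special case picked so that the condition $w(w+1)\geq||\bar{A}||^{2}$ is easy to verify by hand.

```python
import numpy as np

def build_G(A_bar, w):
    """Assemble the 3x3 block matrix G of the one-timescale recursion (sketch)."""
    d = A_bar.shape[0]
    I, Z = np.eye(d), np.zeros((d, d))
    return np.block([[-w * I, -A_bar.T, Z],
                     [Z,      -I,       A_bar],
                     [I,       Z,       Z]])

# Hypothetical instance with w = 1, so w(w+1) = 2 >= ||A_bar||^2 = 0.64.
A_bar = -np.diag([0.5, 0.8])
b_bar = np.array([1.0, -2.0])
G = build_G(A_bar, w=1.0)
g = np.concatenate([np.zeros(2), b_bar, np.zeros(2)])

is_hurwitz = bool(np.all(np.linalg.eigvals(G).real < 0))  # all eigenvalues in the open left half-plane
psi_star = -np.linalg.solve(G, g)                         # equilibrium -G^{-1} g
theta_limit = psi_star[4:]                                # last block is the theta component
```

In this instance the first two blocks of `psi_star` vanish and `theta_limit` coincides with $-\bar{A}^{-1}\bar{b}$.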

A2  Proof of Theorem 3

We first start by assuming that the iterates remain stable (cf. assumption (B5)) and show that the three timescale recursions converge. Subsequently we provide conditions which ensure that the iterates remain stable. We consider general three timescale recursions as given below:

\[x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n},z_{n})+M_{n+1}^{(1)}\right),\tag{25}\]
\[y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n},z_{n})+M_{n+1}^{(2)}\right),\tag{26}\]
\[z_{n+1}=z_{n}+c(n)\left(f(x_{n},y_{n},z_{n})+M_{n+1}^{(3)}\right),\tag{27}\]

where $x_{n}\in\mathbb{R}^{d_{1}}$, $y_{n}\in\mathbb{R}^{d_{2}}$ and $z_{n}\in\mathbb{R}^{d_{3}}$ $\forall n\geq 0$. Next we consider the following assumptions:

  • (B1)

    $h:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$, $g:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{2}}$, $f:\mathbb{R}^{d_{1}+d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{3}}$ are Lipschitz continuous.

  • (B2)

    $\{M_{n}^{(1)}\},\{M_{n}^{(2)}\},\{M_{n}^{(3)}\}$ are martingale difference sequences w.r.t. $\{\mathcal{F}_{n}\}$, where

    \[\mathcal{F}_{n}=\sigma\left(x_{m},y_{m},z_{m},M_{m}^{(1)},M_{m}^{(2)},M_{m}^{(3)};m\leq n\right),\quad n\geq 0,\]

    and

    \[\mathbb{E}\left[||M_{n+1}^{(i)}||^{2}|\mathcal{F}_{n}\right]\leq K_{i}\left(1+||x_{n}||^{2}+||y_{n}||^{2}+||z_{n}||^{2}\right)\ \text{a.s.},\quad i=1,2,3,\]

    for some constants $K_{i}>0$, $i=1,2,3$.

  • (B3)

    $\{a(n)\}$, $\{b(n)\}$, $\{c(n)\}$ are step-size sequences that satisfy $a(n)>0$, $b(n)>0$, $c(n)>0$ $\forall n\geq 0$,

    \[\sum_{n}a(n)=\sum_{n}b(n)=\sum_{n}c(n)=\infty,\]
    \[\sum_{n}(a(n)^{2}+b(n)^{2}+c(n)^{2})<\infty,\]
    \[\frac{b(n)}{a(n)}\rightarrow 0,\quad\frac{c(n)}{b(n)}\rightarrow 0\ \text{ as }n\rightarrow\infty.\]
  • (B4)
    1. (i)

      The ODE $\dot{x}(t)=h(x(t),y,z)$, $y\in\mathbb{R}^{d_{2}}$, $z\in\mathbb{R}^{d_{3}}$, has a globally asymptotically stable equilibrium $\lambda(y,z)$, where $\lambda:\mathbb{R}^{d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$ is Lipschitz continuous.

    2. (ii)

      The ODE $\dot{y}(t)=g(\lambda(y(t),z),y(t),z)$, $z\in\mathbb{R}^{d_{3}}$, has a globally asymptotically stable equilibrium $\Gamma(z)$, where $\Gamma:\mathbb{R}^{d_{3}}\rightarrow\mathbb{R}^{d_{2}}$ is Lipschitz continuous.

    3. (iii)

      The ODE $\dot{z}(t)=f(\lambda(\Gamma(z(t)),z(t)),\Gamma(z(t)),z(t))$ has a globally asymptotically stable equilibrium $z^{*}\in\mathbb{R}^{d_{3}}$.

  • (B5)

    $\sup_{n}\left(||x_{n}||+||y_{n}||+||z_{n}||\right)<\infty$ a.s.

Theorem 5.

Under $\mathbf{(B1)}$-$\mathbf{(B5)}$, the iterates given by (25), (26) and (27) satisfy

\[(x_{n},y_{n},z_{n})\rightarrow(\lambda(\Gamma(z^{*}),z^{*}),\Gamma(z^{*}),z^{*}).\]
Proof.

We start with the following Lemma that characterizes the set to which the iterates converge.

Lemma 6.

\[(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(\Gamma(z),z),\Gamma(z),z):z\in\mathbb{R}^{d_{3}}\}.\]

Proof.

We first consider the fastest timescale of $\{a(n)\}$ and show that:

\[(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(y,z),y,z):y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}\}.\]

We rewrite the iterates (25), (26) and (27) as:

\[x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n},z_{n})+M_{n+1}^{(1)}\right),\tag{28}\]
\[y_{n+1}=y_{n}+a(n)\left(\epsilon^{(2),a}_{n}+M_{n+1}^{(2),a}\right),\tag{29}\]
\[z_{n+1}=z_{n}+a(n)\left(\epsilon^{(3),a}_{n}+M_{n+1}^{(3),a}\right),\tag{30}\]

where,

\[\epsilon^{(2),a}_{n}=\frac{b(n)}{a(n)}g(x_{n},y_{n},z_{n}),\quad M_{n+1}^{(2),a}=\frac{b(n)}{a(n)}M_{n+1}^{(2)},\]
\[\epsilon^{(3),a}_{n}=\frac{c(n)}{a(n)}f(x_{n},y_{n},z_{n}),\quad M_{n+1}^{(3),a}=\frac{c(n)}{a(n)}M_{n+1}^{(3)}.\]

Using the third extension from Chapter 2 of (Borkar 2008a), $(x_{n},y_{n},z_{n})$ converges to an internally chain transitive invariant set of the ODE

\[\dot{x}(t)=h(x(t),y(t),z(t)),\]
\[\dot{y}(t)=0,\]
\[\dot{z}(t)=0.\]

For initial conditions $x\in\mathbb{R}^{d_{1}},y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$, the internally chain transitive invariant set of the above ODE is $\{(\lambda(y,z),y,z)\}$. Therefore,

\[(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(y,z),y,z):y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}\}.\tag{31}\]

Next we consider the middle timescale $\{b(n)\}$. (26) and (27) can be rewritten as:

\[y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n},z_{n})+M_{n+1}^{(2)}\right),\]
\[z_{n+1}=z_{n}+b(n)\left(\epsilon_{n}^{(3),b}+M_{n+1}^{(3),b}\right),\tag{32}\]

where,

\[\epsilon_{n}^{(3),b}=\frac{c(n)}{b(n)}f(x_{n},y_{n},z_{n}),\quad M_{n+1}^{(3),b}=\frac{c(n)}{b(n)}M_{n+1}^{(3)}.\]

The iteration for $\{y_{n}\}$ can be rewritten as:

\[y_{n+1}=y_{n}+b(n)\left(g(\lambda(y_{n},z_{n}),y_{n},z_{n})+\epsilon_{n}^{(2),b}+M_{n+1}^{(2)}\right)\tag{33}\]

where,

\[\epsilon_{n}^{(2),b}=g(x_{n},y_{n},z_{n})-g(\lambda(y_{n},z_{n}),y_{n},z_{n}).\]

Since $(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(y,z),y,z):y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}\}$ and $g$ is Lipschitz, $||\epsilon_{n}^{(2),b}||\rightarrow 0$ as $n\rightarrow\infty$. Again using the third extension from Chapter 2 of (Borkar 2008a), it can be seen that the iterates in (32) and (33) converge to an internally chain transitive invariant set of the ODE

\[\dot{y}(t)=g\left(\lambda(y(t),z(t)),y(t),z(t)\right),\]
\[\dot{z}(t)=0.\]

For initial conditions $y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$, the internally chain transitive invariant set of the above ODE is $\{(\Gamma(z),z)\}$. Therefore,

\[(y_{n},z_{n})\rightarrow\{(\Gamma(z),z):z\in\mathbb{R}^{d_{3}}\}.\tag{34}\]

Combining (31) and (34) we get:

\[(x_{n},y_{n},z_{n})\rightarrow\{(\lambda(\Gamma(z),z),\Gamma(z),z):z\in\mathbb{R}^{d_{3}}\}.\]

Finally, we consider the slowest timescale of $\{c(n)\}$. We define the piecewise linear continuous interpolation of the iterates $z_{n}$ as:

\[\bar{z}(t(n))=z_{n},\]
\[\bar{z}(t)=z_{n}+(z_{n+1}-z_{n})\frac{t-t(n)}{t(n+1)-t(n)},\quad t\in[t(n),t(n+1)],\]

where $t(n)=\sum_{m=0}^{n-1}c(m)$, $n\geq 1$. Also, let $z^{s}(t)$, $t\geq s$, denote the unique solution to the ODE below starting at $s\in\mathbb{R}$:

\[\dot{z}^{s}(t)=f(\lambda(\Gamma(z^{s}(t)),z^{s}(t)),\Gamma(z^{s}(t)),z^{s}(t)),\quad t\geq s,\]

with $z^{s}(s)=\bar{z}(s)$. Using the arguments as in Theorem 2, Chapter 6 of (Borkar 2008a), it can be shown that for any $T>0$

\[\lim_{s\rightarrow\infty}\sup_{t\in[s,s+T]}||\bar{z}(t)-z^{s}(t)||=0\ \text{ a.s.}\]

Subsequently, arguing as in the proof of Theorem 2, Chapter 2 of (Borkar 2008a), we get:

\[z_{n}\rightarrow z^{*}\ \text{ a.s.}\]

Using Lemma 6, we get:

\[(x_{n},y_{n},z_{n})\rightarrow\left(\lambda(\Gamma(z^{*}),z^{*}),\Gamma(z^{*}),z^{*}\right)\ \text{ a.s.}\]
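To make the statement of Theorem 5 concrete, the sketch below runs a noise-free toy instance of (25)-(27). The drift functions $h=y-x$, $g=z-y$, $f=1-x$ and the step-size exponents are hypothetical choices, not taken from the paper. For them $\lambda(y,z)=y$, $\Gamma(z)=z$ and $z^{*}=1$, so the predicted limit is $(1,1,1)$.

```python
def three_timescale_toy(n_steps=50000):
    """Noise-free toy instance of (25)-(27) with h = y - x, g = z - y, f = 1 - x."""
    x = y = z = 0.0
    for n in range(n_steps):
        a = (n + 1) ** -0.6   # fastest step size
        b = (n + 1) ** -0.8   # middle step size, b(n)/a(n) -> 0
        c = (n + 1) ** -1.0   # slowest step size, c(n)/b(n) -> 0
        # Simultaneous update: the right-hand side uses the old (x, y, z).
        x, y, z = x + a * (y - x), y + b * (z - y), z + c * (1.0 - x)
    return x, y, z

x_final, y_final, z_final = three_timescale_toy()
```

The step-size exponents are chosen so that the summability and ratio conditions in (B3) hold; all three returned values should lie close to 1 after the default number of steps.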

Next we provide sufficient conditions for $\mathbf{(B5)}$ to hold. Consider the following additional assumptions:

  • (B6)

    The functions $h_{c}(x,y,z)\triangleq\frac{h(cx,cy,cz)}{c}$, $c\geq 1$, satisfy $h_{c}\rightarrow h_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. For fixed $y\in\mathbb{R}^{d_{2}},z\in\mathbb{R}^{d_{3}}$, the ODE

    \[\dot{x}(t)=h_{\infty}(x(t),y,z)\]

    has its unique globally asymptotically stable equilibrium $\lambda_{\infty}(y,z)$, where $\lambda_{\infty}:\mathbb{R}^{d_{2}+d_{3}}\rightarrow\mathbb{R}^{d_{1}}$ is Lipschitz continuous. Further, $\lambda_{\infty}(0,0)=0$, i.e.,

    \[\dot{x}(t)=h_{\infty}(x(t),0,0)\]

    has the origin in $\mathbb{R}^{d_{1}}$ as its unique globally asymptotically stable equilibrium.

  • (B7)

    The functions $g_{c}(y,z)\triangleq\frac{g(c\lambda_{\infty}(y,z),cy,cz)}{c}$, $c\geq 1$, satisfy $g_{c}\rightarrow g_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. For fixed $z\in\mathbb{R}^{d_{3}}$, the ODE

    \[\dot{y}(t)=g_{\infty}(y(t),z)\]

    has its unique globally asymptotically stable equilibrium $\Gamma_{\infty}(z)$, where $\Gamma_{\infty}:\mathbb{R}^{d_{3}}\rightarrow\mathbb{R}^{d_{2}}$ is Lipschitz continuous. Further, $\Gamma_{\infty}(0)=0$, i.e.,

    \[\dot{y}(t)=g_{\infty}(y(t),0)\]

    has the origin in $\mathbb{R}^{d_{2}}$ as its unique globally asymptotically stable equilibrium.

  • (B8)

    The functions $f_{c}(z)\triangleq\frac{f(c\lambda_{\infty}(\Gamma_{\infty}(z),z),c\Gamma_{\infty}(z),cz)}{c}$, $c\geq 1$, satisfy $f_{c}\rightarrow f_{\infty}$ as $c\rightarrow\infty$ uniformly on compacts. The ODE

    \[\dot{z}(t)=f_{\infty}(z(t))\]

    has the origin in $\mathbb{R}^{d_{3}}$ as its unique globally asymptotically stable equilibrium.
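The scaled functions in (B6)-(B8) are easiest to visualize for an affine map. The sketch below uses a single stacked argument in place of $(x,y,z)$ and a hypothetical affine map $h(x)=Gx+g$, for which $h_{c}(x)=Gx+g/c$ and the scaling limit is $h_{\infty}(x)=Gx$; the names `scaled`, `G` and `g` are illustrative only.

```python
import numpy as np

def scaled(h, c):
    """Return the scaled function h_c(x) = h(c x) / c for c >= 1."""
    return lambda x: h(c * x) / c

# Hypothetical affine map h(x) = G x + g; its scaling limit is h_inf(x) = G x.
G = np.array([[-1.0, 0.5], [0.0, -2.0]])
g = np.array([3.0, -4.0])
h = lambda x: G @ x + g

x = np.array([0.7, -1.2])
# The gap ||h_c(x) - h_inf(x)|| equals ||g|| / c, independent of x.
gaps = [float(np.linalg.norm(scaled(h, c)(x) - G @ x)) for c in (1.0, 10.0, 100.0)]
```

Because the gap does not depend on $x$, the convergence $h_{c}\rightarrow h_{\infty}$ is uniform on compacts for affine maps, which is the situation in the GTD-M analysis.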

Theorem 7.

Under assumptions $\mathbf{(B1)}$-$\mathbf{(B4)}$ and $\mathbf{(B6)}$-$\mathbf{(B8)}$,

\[\sup_{n}(||x_{n}||+||y_{n}||+||z_{n}||)<\infty.\]
Proof.

We begin with the fastest timescale, determined by the step size $a(n)$. Consider the following definitions:

  1. (F1)

    Define

    \[t(n)=\sum_{i=0}^{n-1}a(i),\quad n\geq 1,\ \text{ with }t(0)=0.\]

    Let $\psi_{k}=(x_{k},y_{k},z_{k})$, $k\geq 0$, and

    \[\bar{\psi}(t)=\psi_{n}+\left(\psi_{n+1}-\psi_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},\quad t\in[t(n),t(n+1)].\]
  2. (F2)

    Given $t(n)$, $n\geq 0$, and a constant $T>0$, define

    \[T_{0}=0,\]
    \[T_{n}=\min(t(m):t(m)\geq T_{n-1}+T),\quad n\geq 1.\]

    One can find a subsequence $\{m(n)\}$ such that $T_{n}=t(m(n))$ $\forall n$ and $m(n)\rightarrow\infty$ as $n\rightarrow\infty$.

  3. (F3)

    The scaling sequence is defined as:

    \[r(n)=\max(r(n-1),||\bar{\psi}(T_{n})||,1),\quad n\geq 1.\]
  4. (F4)

    The scaled iterates for $m(n)\leq k\leq m(n+1)-1$ are:

    \[\hat{x}_{m(n)}=\frac{x_{m(n)}}{r(n)},\quad\hat{y}_{m(n)}=\frac{y_{m(n)}}{r(n)},\quad\hat{z}_{m(n)}=\frac{z_{m(n)}}{r(n)},\]
    \[\hat{x}_{k+1}=\hat{x}_{k}+a(k)\left(\frac{h(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(1)}\right),\]
    \[\hat{y}_{k+1}=\hat{y}_{k}+a(k)\left(\epsilon_{k}^{(2),a}+\hat{M}_{k+1}^{(2)}\right),\]
    \[\hat{z}_{k+1}=\hat{z}_{k}+a(k)\left(\epsilon_{k}^{(3),a}+\hat{M}_{k+1}^{(3)}\right),\]

    where $c=r(n)$,

    \[\epsilon_{k}^{(2),a}=\frac{b(k)}{a(k)}\frac{g(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c},\]
    \[\epsilon_{k}^{(3),a}=\frac{c(k)}{a(k)}\frac{f(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c},\]
    \[\hat{M}_{k+1}^{(1)}=\frac{M_{k+1}^{(1)}}{r(n)},\quad\hat{M}_{k+1}^{(2)}=\frac{b(k)}{a(k)}\frac{M_{k+1}^{(2)}}{r(n)},\quad\hat{M}_{k+1}^{(3)}=\frac{c(k)}{a(k)}\frac{M_{k+1}^{(3)}}{r(n)}.\]
  5. (F5)

    Next we define the linearly interpolated trajectory for the scaled iterates as follows:

    $$\hat{\psi}(t)=\hat{\psi}_{n}+(\hat{\psi}_{n+1}-\hat{\psi}_{n})\frac{t-t(n)}{t(n+1)-t(n)},\quad t\in[t(n),t(n+1)].$$
  6. (F6)

    Let $\psi_{n}(t)=(x_{n}(t),y_{n}(t),z_{n}(t)),\ t\in[T_{n},T_{n+1}]$, denote the trajectory of the ODE:

    $$\dot{x}(t)=h_{r(n)}(x(t),y(t),z(t)),$$
    $$\dot{y}(t)=0,$$
    $$\dot{z}(t)=0,$$

    with $x_{n}(T_{n})=\hat{x}(T_{n})$, $y_{n}(T_{n})=\hat{y}(T_{n})$ and $z_{n}(T_{n})=\hat{z}(T_{n})$.
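The bookkeeping in (F1)-(F3) can be sketched in a few lines of Python. This is only an illustration: the function names, the step-size choice $a(i)=1/(i+1)$, the sample norms, and the initial value $r(0)=1$ are assumptions made for the example and are not fixed by the text.

```python
def sampling_indices(step_sizes, T):
    """Return (t, m): t[n] is the sum of the first n step sizes (F1), and
    m[n] is the smallest index with t[m[n]] >= t[m[n-1]] + T, so that
    T_n = t[m[n]] as in (F2)."""
    t = [0.0]
    for a in step_sizes:
        t.append(t[-1] + a)
    m = [0]  # T_0 = t(0) = 0
    for idx in range(1, len(t)):
        if t[idx] >= t[m[-1]] + T:
            m.append(idx)
    return t, m


def scaling_sequence(psi_norms, r0=1.0):
    """r(n) = max(r(n-1), ||psi_bar(T_n)||, 1) as in (F3).
    psi_norms[n] holds ||psi_bar(T_n)||; r0 = r(0) is an assumed start value."""
    r = [r0]
    for norm in psi_norms[1:]:
        r.append(max(r[-1], norm, 1.0))
    return r


steps = [1.0 / (i + 1) for i in range(100)]   # hypothetical a(i)
t, m = sampling_indices(steps, T=1.0)
r = scaling_sequence([0.5, 3.0, 2.0, 5.0])    # hypothetical norms at T_0..T_3
```

By construction $r(n)$ is non-decreasing and at least 1, which is what keeps the rescaled iterates at the instants $T_n$ inside the unit ball.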

First we state four lemmas for ODEs with two external inputs. The proofs of these lemmas follow exactly as Lemmas 2, 3, 4 and 5 of (Lakshminarayanan and Bhatnagar 2017). Subsequently, when we analyze the middle timescale (timescale of $\{b(n)\}$) and slow timescale (timescale of $\{c(n)\}$) recursions, we restate the corresponding lemmas for ODEs with one and no external inputs respectively. Let $x_{c}^{y(t),z(t)}(t,x)$ and $x_{\infty}^{y(t),z(t)}(t,x)$ denote the solutions to the ODEs

$$\dot{x}(t)=h_{c}(x(t),y(t),z(t)),\quad t\geq 0,$$
$$\dot{x}(t)=h_{\infty}(x(t),y(t),z(t)),\quad t\geq 0,$$

respectively, with initial condition $x\in\mathbb{R}^{d_{1}}$ and the external inputs $y(t)\in\mathbb{R}^{d_{2}}$ and $z(t)\in\mathbb{R}^{d_{3}}$. Throughout the paper, $B(x,r)\triangleq\{q\in\mathbb{R}^{d_{1}}\,|\,||q-x||<r\}$, $B(y,r)\triangleq\{q\in\mathbb{R}^{d_{2}}\,|\,||q-y||<r\}$ and $B(z,r)\triangleq\{q\in\mathbb{R}^{d_{3}}\,|\,||q-z||<r\}$ denote the balls of radius $r$ around $x$, $y$ and $z$ respectively.

Lemma 8.

Let $K\subset\mathbb{R}^{d_{1}}$ be a compact set, and let $y\in\mathbb{R}^{d_{2}}$ and $z\in\mathbb{R}^{d_{3}}$ be fixed external inputs. Then under $\mathbf{(B6)}$, given $\delta>0$, $\exists T_{\delta}>0$ such that $\forall x\in K$

$$x^{y,z}_{\infty}(t,x)\in B(\lambda_{\infty}(y,z),\delta),\quad\forall t\geq T_{\delta}.$$
Lemma 9.

Let $x\in\mathbb{R}^{d_{1}}$, $y\in\mathbb{R}^{d_{2}}$, $z\in\mathbb{R}^{d_{3}}$, let $[0,T]$ be a given time interval and $r>0$. Let $y^{\prime}(t)\in B(y,r)$, $z^{\prime}(t)\in B(z,r)$ $\forall t\in[0,T]$. Then

$$||x_{c}^{y^{\prime}(t),z^{\prime}(t)}(t,x)-x_{\infty}^{y,z}(t,x)||\leq(\epsilon(c)+2Lr)Te^{LT},\quad\forall t\in[0,T],$$

where $\epsilon(c)\rightarrow 0$ as $c\rightarrow\infty$.

Lemma 10.

Let $y\in\mathbb{R}^{d_{2}}$ and $z\in\mathbb{R}^{d_{3}}$. Then given $\epsilon>0$ and $T>0$, $\exists c_{\epsilon,T}>0$, $\delta_{\epsilon,T}>0$ and $r_{\epsilon,T}>0$ such that $\forall x\in B(\lambda_{\infty}(y,z),\delta_{\epsilon,T})$, $\forall c>c_{\epsilon,T}$ and all external inputs with $y^{\prime}(s)\in B(y,r_{\epsilon,T})$ and $z^{\prime}(s)\in B(z,r_{\epsilon,T})$,

$$x_{c}^{y^{\prime}(t),z^{\prime}(t)}(t,x)\in B(\lambda_{\infty}(y,z),2\epsilon),\quad\forall t\in[0,T].$$
Lemma 11.

Let $x\in B(0,1)\subset\mathbb{R}^{d_{1}}$, $y\in K^{\prime}\subset\mathbb{R}^{d_{2}}$, $z\in K^{\prime\prime}\subset\mathbb{R}^{d_{3}}$ and let $\mathbf{(B6)}$ hold. Then given $\epsilon>0$, $\exists c_{\epsilon}\geq 1$, $r_{\epsilon}>0$ and $T_{\epsilon}>0$ such that for any external input satisfying $y^{\prime}(s)\in B(y,r_{\epsilon})$, $z^{\prime}(s)\in B(z,r_{\epsilon})$, $\forall s\in[0,T]$,

$$||x_{c}^{y^{\prime}(t),z^{\prime}(t)}(t,x)-\lambda_{\infty}(y,z)||\leq 2\epsilon,\quad\forall c>c_{\epsilon},\ t\geq T_{\epsilon}.$$

The next lemma uses the convergence result for three-timescale iterates under the stability assumption $\mathbf{(B5)}$ (Theorem 5) and shows that the scaled iterates defined in $\mathbf{(F4)}$ converge.

Lemma 12.

Under $\mathbf{(B1)}$-$\mathbf{(B3)}$,

  1. (i)

    For $0\leq k\leq m(n+1)-m(n)$, $||\hat{\psi}(t(m(n)+k))||\leq K^{(1)}$ a.s. for some constant $K^{(1)}>0$.

  2. (ii)

    $\lim_{n\rightarrow\infty}||\hat{\psi}(t)-\psi_{n}(t)||=0$ a.s. $\forall t\in[T_{n},T_{n+1}]$.

Proof.
  1. (i)

    Follows as in (Lemma 4, Chapter-3, pp. 24, Borkar (2008a)).

  2. (ii)

    By construction, the iterates $\hat{x}_{k}$, $\hat{y}_{k}$, $\hat{z}_{k}$ remain bounded, i.e., $\sup_{k}(||\hat{x}_{k}||+||\hat{y}_{k}||+||\hat{z}_{k}||)<\infty$ a.s. Therefore, $\mathbf{(B1)}$-$\mathbf{(B4)}$ are satisfied and, by Theorem 5, the iterates $(\hat{x}_{n},\hat{y}_{n},\hat{z}_{n})$ converge. Using the third extension from Chapter 2 of (Borkar 2008a), the iterates $(\hat{x}_{n},\hat{y}_{n},\hat{z}_{n})$ track the ODE system

    $$\dot{x}(t)=h_{r(n)}(x(t),y(t),z(t)),$$
    $$\dot{y}(t)=0,$$
    $$\dot{z}(t)=0.$$

    Therefore, $\lim_{n\rightarrow\infty}||\hat{\psi}(t)-\psi_{n}(t)||=0$ a.s. $\forall t\in[T_{n},T_{n+1}]$. ∎

In particular, Lemma 12(i) shows that along the fastest timescale between instants $T_{n}$ and $T_{n+1}$, the norm of the scaled iterate can grow at most by a factor $K^{(1)}$ starting from $B(0,1)$. Next, Lemma 12(ii) shows that the scaled iterate asymptotically tracks the ODE defined in $\mathbf{(F6)}$. The next theorem bounds $||x_{n}||$ in terms of $||y_{n}||$ and $||z_{n}||$. We define the linearly interpolated trajectory of the three iterates as: $\forall t\in[t(n),t(n+1)]$,

$$\bar{x}(t)=x_{n}+\left(x_{n+1}-x_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},$$
$$\bar{y}(t)=y_{n}+\left(y_{n+1}-y_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},$$
$$\bar{z}(t)=z_{n}+\left(z_{n+1}-z_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)}.$$
Theorem 13.

Under assumptions $\mathbf{(B1)}$-$\mathbf{(B4)}$ and $\mathbf{(B6)}$,

  1. (i)

    For $n$ large, and $T=T_{\frac{1}{4}}$ (here $T$ is the sampling frequency as in (F2) and $T_{\frac{1}{4}}$ is $T_{\epsilon}$ as in Lemma 11 with $\epsilon=\frac{1}{4}$), if $||\bar{x}(T_{n})||>C_{a}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)$ for some $C_{a}>0$, then $||\bar{x}(T_{n+1})||\leq\frac{3}{4}||\bar{x}(T_{n})||$.

  2. (ii)

    $||\bar{x}(T_{n})||\leq C_{a}^{*}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)$ a.s. for some $C_{a}^{*}>0$.

  3. (iii)

    $||x_{n}||\leq K_{a}^{*}(1+||y_{n}||+||z_{n}||)$ for some $K_{a}^{*}>0$.

Proof.
  1. (i)

    We have $||\bar{x}(T_{n})||>C_{a}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)$. Since $r(n)=\max(r(n-1),||\bar{\psi}(T_{n})||,1)$, this implies $r(n)\geq||\bar{\psi}(T_{n})||$. Therefore, $r(n)\geq C_{a}$. Next we show that

    $$||\hat{y}(T_{n})||<\frac{1}{C_{a}}\mbox{ and }||\hat{z}(T_{n})||<\frac{1}{C_{a}}.$$

    For $p\geq 1$,

    $$\begin{split}||\hat{y}(T_{n})||_{p}&=\frac{||\bar{y}(T_{n})||_{p}}{r(n)}\leq\frac{||\bar{y}(T_{n})||_{p}}{||\bar{\psi}(T_{n})||_{p}}\\ &=\frac{||\bar{y}(T_{n})||_{p}}{\Big{(}||\bar{x}(T_{n})||_{p}^{p}+||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}\Big{)}^{\frac{1}{p}}}\end{split}$$

    Since $||\bar{x}(T_{n})||_{p}\geq C_{a}(1+||\bar{y}(T_{n})||_{p}+||\bar{z}(T_{n})||_{p})$,

    $$\begin{split}||\bar{x}(T_{n})||_{p}^{p}&\geq C_{a}^{p}\Big{(}||\bar{y}(T_{n})||_{p}+||\bar{z}(T_{n})||_{p}\Big{)}^{p}\\ &\geq C_{a}^{p}\Big{(}||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}\Big{)}\end{split}\tag{35}$$

    Therefore,

    $$\begin{split}||\hat{y}(T_{n})||&\leq\frac{||\bar{y}(T_{n})||_{p}}{\Big{(}C_{a}^{p}+1\Big{)}^{\frac{1}{p}}\Big{(}||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}\Big{)}^{\frac{1}{p}}}\\ &\leq\frac{1}{\Big{(}1+C_{a}^{p}\Big{)}^{\frac{1}{p}}}\\ &<\frac{1}{C_{a}}.\end{split}$$

    The second inequality follows from the fact that $||\bar{y}(T_{n})||_{p}^{p}\leq||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}$. A similar analysis proves $||\hat{z}(T_{n})||<\frac{1}{C_{a}}$. Next we show that

    $$||\hat{x}(T_{n})||_{p}>\frac{1}{1+\frac{1}{C_{a}}}.$$

    Here we are considering the case when the iterates are blowing up. Therefore let $r(n)=||\bar{\psi}(T_{n})||$. Then,

    $$\begin{split}||\hat{x}(T_{n})||&=\frac{||\bar{x}(T_{n})||}{||\bar{\psi}(T_{n})||}\\ &=\frac{||\bar{x}(T_{n})||}{\Big{(}||\bar{x}(T_{n})||_{p}^{p}+||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}\Big{)}^{\frac{1}{p}}}\\ &=\frac{1}{\Big{(}1+\frac{||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}}{||\bar{x}(T_{n})||_{p}^{p}}\Big{)}^{\frac{1}{p}}}\\ &>\frac{1}{\Big{(}1+\frac{||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p}}{C_{a}^{p}(||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p})}\Big{)}^{\frac{1}{p}}}\\ &>\frac{1}{1+\frac{1}{C_{a}}}.\end{split}$$

    Let $y^{\prime}(t-T_{n})=y_{n}(t)$ and $z^{\prime}(t-T_{n})=z_{n}(t)$ $\forall t\in[T_{n},T_{n+1}]$. From Lemma 11, $\exists r_{\frac{1}{4}},c_{\frac{1}{4}},T_{\frac{1}{4}}$ such that

    $$||x_{c}^{y^{\prime}(t),z^{\prime}(t)}(t,\hat{x}(T_{n}))||\leq\frac{1}{4},\quad\forall t\geq T_{\frac{1}{4}},\ \forall c\geq c_{\frac{1}{4}},$$

    whenever $y^{\prime}(t)\in B(0,r_{\frac{1}{4}})$ and $z^{\prime}(t)\in B(0,r_{\frac{1}{4}})$. Choose $C_{a}>\max(c_{\frac{1}{4}},\frac{2}{r_{\frac{1}{4}}})$ and $T=T_{\frac{1}{4}}$. Since $\dot{y}(t)=0$ and $\dot{z}(t)=0$ for the ODE defined in $\mathbf{(F6)}$, $y^{\prime}(t-T_{n})=y_{n}(t)=\hat{y}(T_{n})$ and $z^{\prime}(t-T_{n})=z_{n}(t)=\hat{z}(T_{n})$ $\forall t\in[T_{n},T_{n+1}]$. From $||\hat{y}(T_{n})||<\frac{1}{C_{a}}$ and $||\hat{z}(T_{n})||<\frac{1}{C_{a}}$, it follows that $y^{\prime}(s)\in B(0,r_{\frac{1}{4}})$ and $z^{\prime}(s)\in B(0,r_{\frac{1}{4}})$ $\forall s\in[0,T]$. Using Lemma 12(ii), $||\hat{x}(T_{n+1}^{-})-x_{n}(T_{n+1})||<\frac{1}{4}$ for large enough $n$. Also observe that $||x_{n}(T_{n+1})||=||x^{y^{\prime}(t),z^{\prime}(t)}_{r(n)}(T_{n+1}-T_{n},\hat{x}(T_{n}))||\leq\frac{1}{4}$. Using these, we have $||\hat{x}(T_{n+1}^{-})||\leq||\hat{x}(T_{n+1}^{-})-x_{n}(T_{n+1})||+||x_{n}(T_{n+1})||\leq\frac{1}{2}$. Finally, since

    $$\frac{||\bar{x}(T_{n+1})||}{||\bar{x}(T_{n})||}=\frac{||\hat{x}(T_{n+1}^{-})||}{||\hat{x}(T_{n})||},$$

    we have

    $$\begin{split}||\bar{x}(T_{n+1})||&=\frac{||\hat{x}(T_{n+1}^{-})||}{||\hat{x}(T_{n})||}||\bar{x}(T_{n})||\\ &<\frac{\frac{1}{2}}{\frac{1}{1+1/C_{a}}}||\bar{x}(T_{n})||.\end{split}$$

    Choosing $C_{a}>\max\left(c_{\frac{1}{4}},\frac{2}{r_{\frac{1}{4}}}\right)>2$ proves the claim.

(ii) and (iii) follow along the lines of arguments in (Lakshminarayanan and Bhatnagar 2017) Lemma 6 (ii) and (iii) respectively. ∎
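For completeness, the arithmetic behind the contraction factor $\frac{3}{4}$ in Theorem 13(i) can be spelled out. This is a worked check of the last step only, using the bounds $||\hat{x}(T_{n+1}^{-})||\leq\frac{1}{2}$, $||\hat{x}(T_{n})||>\frac{1}{1+1/C_{a}}$ and $C_{a}>2$ stated in the proof:

```latex
\[
\frac{\|\bar{x}(T_{n+1})\|}{\|\bar{x}(T_{n})\|}
  < \frac{1/2}{\;1/(1+1/C_{a})\;}
  = \frac{1}{2}\Big(1+\frac{1}{C_{a}}\Big)
  < \frac{1}{2}\Big(1+\frac{1}{2}\Big)
  = \frac{3}{4}
  \qquad\text{whenever } C_{a}>2 .
\]
```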

Next we consider the middle timescale of $\{b(n)\}$ and re-define the following terms:

  1. (M1)

    Define

    $$t(n)=\sum_{i=0}^{n-1}b(i),\ n\geq 1,\mbox{ with }t(0)=0.$$

    Let $\psi_{n}=(x_{n},y_{n},z_{n})$ and

    $$\bar{\psi}(t)=\psi_{n}+\left(\psi_{n+1}-\psi_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},\quad t\in[t(n),t(n+1)].$$
  2. (M2)

    Given $t(n),\ n\geq 0$, and a constant $T>0$, define

    $$T_{0}=0,$$
    $$T_{n}=\min(t(m):t(m)\geq T_{n-1}+T),\ n\geq 1.$$

    One can find a subsequence $\{m(n)\}$ such that $T_{n}=t(m(n))$ $\forall n$, and $m(n)\rightarrow\infty$ as $n\rightarrow\infty$.

  3. (M3)

    The scaling sequence is defined as:

    $$r(n)=\max(r(n-1),||\bar{\psi}(T_{n})||,1),\ n\geq 1.$$
  4. (M4)

    The scaled iterates for $m(n)\leq k\leq m(n+1)-1$ are:

    $$\hat{x}_{m(n)}=\frac{x_{m(n)}}{r(n)},\quad\hat{y}_{m(n)}=\frac{y_{m(n)}}{r(n)},\quad\hat{z}_{m(n)}=\frac{z_{m(n)}}{r(n)},$$
    $$\hat{x}_{k+1}=\hat{x}_{k}+a(k)\left(\frac{h(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(1)}\right),$$
    $$\hat{y}_{k+1}=\hat{y}_{k}+b(k)\left(\frac{g(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(2)}\right),$$
    $$\hat{z}_{k+1}=\hat{z}_{k}+b(k)\left(\epsilon_{k}^{(3),b}+\hat{M}_{k+1}^{(3)}\right),$$

    where $c=r(n)$,

    $$\epsilon_{k}^{(3),b}=\frac{c(k)}{b(k)}\frac{f(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c},$$
    $$\hat{M}_{k+1}^{(1)}=\frac{M_{k+1}^{(1)}}{r(n)},\quad\hat{M}_{k+1}^{(2)}=\frac{M_{k+1}^{(2)}}{r(n)},\quad\hat{M}_{k+1}^{(3)}=\frac{c(k)}{b(k)}\frac{M_{k+1}^{(3)}}{r(n)}.$$
  5. (M5)

    Next, we define the linearly interpolated trajectory for the scaled iterates as follows:

    $$\hat{\psi}(t)=\hat{\psi}_{n}+(\hat{\psi}_{n+1}-\hat{\psi}_{n})\frac{t-t(n)}{t(n+1)-t(n)},\quad t\in[t(n),t(n+1)].$$
  6. (M6)

    Let $\psi_{n}(t)=(x_{n}(t),y_{n}(t),z_{n}(t)),\ t\in[T_{n},T_{n+1}]$, denote the trajectory of the ODE:

    $$\dot{x}(t)=h_{r(n)}(x(t),y(t),z(t)),$$
    $$\dot{y}(t)=g_{r(n)}(y(t),z(t)),$$
    $$\dot{z}(t)=0,$$

    with $x_{n}(T_{n})=\hat{x}(T_{n})$, $y_{n}(T_{n})=\hat{y}(T_{n})$ and $z_{n}(T_{n})=\hat{z}(T_{n})$.

As before, we state a few lemmas for ODEs with one external input. These follow along the lines of Lemmas 2-5 of (Lakshminarayanan and Bhatnagar 2017). Let $y_{c}^{z(t)}(t,y)$ and $y_{\infty}^{z(t)}(t,y)$ denote the solutions to the ODEs

$$\dot{y}(t)=g_{c}(y(t),z(t)),\quad t\geq 0,$$
$$\dot{y}(t)=g_{\infty}(y(t),z(t)),\quad t\geq 0,$$

respectively, with initial condition $y\in\mathbb{R}^{d_{2}}$ and the external input $z(t)\in\mathbb{R}^{d_{3}}$.

Lemma 14.

Let $K\subset\mathbb{R}^{d_{2}}$ be a compact set and $z\in\mathbb{R}^{d_{3}}$. Then under $\mathbf{(B6)}$, given $\delta>0$, $\exists T_{\delta}>0$ such that $\forall y\in K$

$$y^{z}_{\infty}(t,y)\in B(\Gamma_{\infty}(z),\delta),\quad\forall t\geq T_{\delta}.$$
Lemma 15.

Let $y\in\mathbb{R}^{d_{2}}$, $z\in\mathbb{R}^{d_{3}}$, let $[0,T]$ be a given time interval and $r>0$. Let $z^{\prime}(t)\in B(z,r)$, $\forall t\in[0,T]$. Then

$$||y_{c}^{z^{\prime}(t)}(t,y)-y_{\infty}^{z}(t,y)||\leq(\epsilon(c)+Lr)Te^{LT},\quad\forall t\in[0,T],$$

where $\epsilon(c)\rightarrow 0$ as $c\rightarrow\infty$.

Lemma 16.

Let $z\in\mathbb{R}^{d_{3}}$. Then given $\epsilon>0$ and $T>0$, $\exists c_{\epsilon,T}>0$, $\delta_{\epsilon,T}>0$ and $r_{\epsilon,T}>0$ such that $\forall y\in B(\Gamma_{\infty}(z),\delta_{\epsilon,T})$, $\forall c>c_{\epsilon,T}$ and every external input with $z^{\prime}(s)\in B(z,r_{\epsilon,T})$,

$$y_{c}^{z^{\prime}(t)}(t,y)\in B(\Gamma_{\infty}(z),2\epsilon),\quad\forall t\in[0,T].$$
Lemma 17.

Let $y\in B(0,1)\subset\mathbb{R}^{d_{2}}$, $z\in K^{\prime}\subset\mathbb{R}^{d_{3}}$, and let $\mathbf{(B7)}$ hold. Then given $\epsilon>0$, $\exists c_{\epsilon}\geq 1$, $r_{\epsilon}>0$ and $T_{\epsilon}>0$ such that for any external input satisfying $z^{\prime}(s)\in B(z,r_{\epsilon})$, $\forall s\in[0,T]$,

$$||y_{c}^{z^{\prime}(t)}(t,y)-\Gamma_{\infty}(z)||\leq 2\epsilon,\quad\forall c>c_{\epsilon},\ t\geq T_{\epsilon}.$$
Lemma 18.

Under $\mathbf{(B1)}$-$\mathbf{(B3)}$,

  1. (i)

    For $0\leq k\leq m(n+1)-m(n)$, $||\hat{\psi}(t(m(n)+k))||\leq K^{(2)}$ a.s. for some constant $K^{(2)}>0$.

  2. (ii)

    For sufficiently large $n$, we have $\sup_{t\in[T_{n},T_{n+1})}||\hat{y}(t)-y_{n}(t)||=\epsilon(c)LTe^{L(L+1)T}$ a.s., where $\epsilon(c)\rightarrow 0$ as $c\rightarrow\infty$.

Proof.

See Lemma 9 of (Lakshminarayanan and Bhatnagar 2017). ∎

Theorem 19.

Assume $\mathbf{(B1)}$-$\mathbf{(B4)}$ and $\mathbf{(B6)}$-$\mathbf{(B8)}$ hold. Then, with $C^{*}_{a}$ as defined in Theorem 13,

  1. (i)

    For large $n$ and $T=T_{1/8(C^{*}_{a}+1)}$ (here $T$ is the sampling frequency as in (M2) and $T_{1/8(C^{*}_{a}+1)}$ is $T_{\epsilon}$ as in Lemma 17 with $\epsilon=\frac{1}{8(C^{*}_{a}+1)}$), if $||\bar{y}(T_{n})||>C_{b}(1+||\bar{z}(T_{n})||)$ for some $C_{b}>0$, then $||\bar{y}(T_{n+1})||<\frac{5}{8}||\bar{y}(T_{n})||$.

  2. (ii)

    $||\bar{y}(T_{n})||\leq C_{b}^{*}\left(1+||\bar{z}(T_{n})||\right)$ for some $C_{b}^{*}>0$.

  3. (iii)

    $||y_{n}||\leq K_{b}^{*}(1+||z_{n}||)$ for some $K_{b}^{*}>0$.

Proof.
  1. (i)

    Since $||\bar{y}(T_{n})||>C_{b}(1+||\bar{z}(T_{n})||)$, $r(n)>C_{b}$. We first show that $||\hat{z}(T_{n})||<\frac{1}{C_{b}}$.

    $$||\hat{z}(T_{n})||=\frac{||\bar{z}(T_{n})||_{p}}{r(n)}\leq\frac{||\bar{z}(T_{n})||_{p}}{(||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p})^{\frac{1}{p}}}$$

    Since $||\bar{y}(T_{n})||_{p}>C_{b}(1+||\bar{z}(T_{n})||)$, we have $||\bar{y}(T_{n})||_{p}^{p}>C_{b}^{p}||\bar{z}(T_{n})||_{p}^{p}$. Therefore,

    $$\begin{split}||\hat{z}(T_{n})||&<\frac{||\bar{z}(T_{n})||_{p}}{\left((1+C_{b}^{p})||\bar{z}(T_{n})||_{p}^{p}\right)^{\frac{1}{p}}}\\ &=\frac{1}{(1+C_{b}^{p})^{\frac{1}{p}}}\\ &<\frac{1}{C_{b}}\end{split}$$

    Next we show that $||\hat{y}(T_{n})||>\frac{1}{(C^{*}_{a}+1)(2+\frac{1}{C_{b}})}$, where $C^{*}_{a}$ is as defined in Theorem 13. Here again we are considering the case when the iterates are blowing up. Therefore let $r(n)=||\bar{\psi}(T_{n})||$. Now, from Theorem 13(ii), we know $||\bar{x}(T_{n})||\leq C_{a}^{*}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)$ and therefore $r(n)\leq C_{a}^{*}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||$. With this we have,

    $$\begin{split}||\hat{y}(T_{n})||_{p}&\geq\frac{||\bar{y}(T_{n})||}{C^{*}_{a}+(C^{*}_{a}+1)(||\bar{y}(T_{n})||_{p}^{p}+||\bar{z}(T_{n})||_{p}^{p})^{\frac{1}{p}}}\\ &>\frac{1}{C^{*}_{a}+(C^{*}_{a}+1)(1+\frac{1}{C_{b}})}\\ &>\frac{1}{(C^{*}_{a}+1)(2+\frac{1}{C_{b}})}.\end{split}$$

    Now we proceed as in Theorem 13(i). Let $z^{\prime}(t-T_{n})=z_{n}(t)$ $\forall t\in[T_{n},T_{n+1}]$. From Lemma 17, $\exists r_{1/8(C^{*}_{a}+1)},c_{1/8(C^{*}_{a}+1)},T_{1/8(C^{*}_{a}+1)}>0$ such that

    $$||y_{c}^{z^{\prime}(t)}(t,\hat{y}(T_{n}))||\leq\frac{1}{8(C^{*}_{a}+1)},\quad\forall t\geq T_{1/8(C^{*}_{a}+1)},\ \forall c\geq c_{1/8(C^{*}_{a}+1)},$$

    whenever $z^{\prime}(t)\in B(0,r_{1/8(C^{*}_{a}+1)})$. Choose $T=T_{1/8(C^{*}_{a}+1)}$ and $C_{b}>\max\left(c_{1/8(C^{*}_{a}+1)},\frac{2}{r_{1/8(C^{*}_{a}+1)}}\right)$. Since $\dot{z}(t)=0$ for the ODE defined in (M6), $z^{\prime}(t-T_{n})=z_{n}(t)=\hat{z}(T_{n})$ $\forall t\in[T_{n},T_{n+1}]$. From $||\hat{z}(T_{n})||<\frac{1}{C_{b}}$, it follows that $z^{\prime}(s)\in B(0,r_{1/8(C^{*}_{a}+1)})$ $\forall s\in[0,T]$. Using Lemma 18(ii), $\exists C_{1}>0$ such that $||\hat{y}(T_{n+1}^{-})-y_{n}(T_{n+1})||<\frac{1}{8(C^{*}_{a}+1)}$ for large enough $n$ and $r(n)>C_{1}$. We therefore strengthen the choice to $C_{b}>\max(c_{1/8(C^{*}_{a}+1)},\frac{2}{r_{1/8(C^{*}_{a}+1)}},C_{1})$. Also observe that $||y_{n}(T_{n+1})||=||y^{z^{\prime}(t)}_{r(n)}(T_{n+1}-T_{n},\hat{y}(T_{n}))||\leq\frac{1}{8(C^{*}_{a}+1)}$. Using these, we have $||\hat{y}(T_{n+1}^{-})||\leq||\hat{y}(T_{n+1}^{-})-y_{n}(T_{n+1})||+||y_{n}(T_{n+1})||\leq\frac{1}{4(C^{*}_{a}+1)}$. Finally, since

    $$\frac{||\bar{y}(T_{n+1})||}{||\bar{y}(T_{n})||}=\frac{||\hat{y}(T_{n+1}^{-})||}{||\hat{y}(T_{n})||},$$

    we have

    $$\begin{split}||\bar{y}(T_{n+1})||&=\frac{||\hat{y}(T_{n+1}^{-})||}{||\hat{y}(T_{n})||}||\bar{y}(T_{n})||\\ &<\frac{\frac{1}{4(C^{*}_{a}+1)}}{\frac{1}{(C^{*}_{a}+1)(2+1/C_{b})}}||\bar{y}(T_{n})||\\ &=\frac{2+\frac{1}{C_{b}}}{4}||\bar{y}(T_{n})||.\end{split}$$

    Choosing $C_{b}>\max\left(c_{1/8(C^{*}_{a}+1)},\frac{2}{r_{1/8(C^{*}_{a}+1)}},C_{1}\right)>2$ proves the claim.

    (ii) and (iii) follow along the lines of arguments in (Lakshminarayanan and Bhatnagar 2017), Lemma 6 (ii) and (iii), respectively.
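The contraction factor $\frac{5}{8}$ in Theorem 19(i) comes from the same kind of arithmetic as in Theorem 13(i). As a worked check of the final step only, using the bounds $||\hat{y}(T_{n+1}^{-})||\leq\frac{1}{4(C^{*}_{a}+1)}$, $||\hat{y}(T_{n})||>\frac{1}{(C^{*}_{a}+1)(2+1/C_{b})}$ and $C_{b}>2$ from the proof:

```latex
\[
\frac{\|\bar{y}(T_{n+1})\|}{\|\bar{y}(T_{n})\|}
  < \frac{\dfrac{1}{4(C_{a}^{*}+1)}}{\dfrac{1}{(C_{a}^{*}+1)(2+1/C_{b})}}
  = \frac{2+\frac{1}{C_{b}}}{4}
  < \frac{2+\frac{1}{2}}{4}
  = \frac{5}{8}
  \qquad\text{whenever } C_{b}>2 .
\]
```

Note that the factor $(C^{*}_{a}+1)$ cancels, which is why $\epsilon$ in Lemma 17 is chosen proportional to $\frac{1}{C^{*}_{a}+1}$.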

Finally, we consider the slowest timescale corresponding to $\{c(n)\}$. As before, we redefine the terms as follows:

  1. (S1)

    Define

    $$t(n)=\sum_{i=0}^{n-1}c(i),\ n\geq 1,\mbox{ with }t(0)=0.$$

    Let $\psi_{n}=(x_{n},y_{n},z_{n})$ and

    $$\bar{\psi}(t)=\psi_{n}+\left(\psi_{n+1}-\psi_{n}\right)\frac{t-t(n)}{t(n+1)-t(n)},\quad t\in[t(n),t(n+1)].$$
  2. (S2)

    Given $t(n),\ n\geq 0$, and a constant $T>0$, define

    $$T_{0}=0,$$
    $$T_{n}=\min(t(m):t(m)\geq T_{n-1}+T),\ n\geq 1.$$

    There exists a subsequence $\{m(n)\}$ such that $T_{n}=t(m(n))$ $\forall n$ and $m(n)\rightarrow\infty$ as $n\rightarrow\infty$.

  3. (S3)

    The scaling sequence is defined as:

    $$r(n)=\max(r(n-1),||\bar{\psi}(T_{n})||,1),\ n\geq 1.$$
  4. (S4)

    The scaled iterates for $m(n)\leq k\leq m(n+1)-1$ are:

    $$\hat{x}_{m(n)}=\frac{x_{m(n)}}{r(n)},\quad\hat{y}_{m(n)}=\frac{y_{m(n)}}{r(n)},\quad\hat{z}_{m(n)}=\frac{z_{m(n)}}{r(n)},$$
    $$\hat{x}_{k+1}=\hat{x}_{k}+a(k)\left(\frac{h(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(1)}\right),$$
    $$\hat{y}_{k+1}=\hat{y}_{k}+b(k)\left(\frac{g(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(2)}\right),$$
    $$\hat{z}_{k+1}=\hat{z}_{k}+c(k)\left(\frac{f(c\hat{x}_{k},c\hat{y}_{k},c\hat{z}_{k})}{c}+\hat{M}_{k+1}^{(3)}\right),$$

    where $c=r(n)$,

    $$\hat{M}_{k+1}^{(1)}=\frac{M_{k+1}^{(1)}}{r(n)},\quad\hat{M}_{k+1}^{(2)}=\frac{M_{k+1}^{(2)}}{r(n)},\quad\hat{M}_{k+1}^{(3)}=\frac{M_{k+1}^{(3)}}{r(n)}.$$
  5. (S5)

    Next we define the linearly interpolated trajectory for the scaled iterates as follows:

    $$\hat{\psi}(t)=\hat{\psi}_{n}+(\hat{\psi}_{n+1}-\hat{\psi}_{n})\frac{t-t(n)}{t(n+1)-t(n)},\quad t\in[t(n),t(n+1)].$$
  6. (S6)

    Let $\psi_{n}(t)=(x_{n}(t),y_{n}(t),z_{n}(t)),\ t\in[T_{n},T_{n+1}]$, denote the trajectory of the ODE:

    $$\dot{x}(t)=h_{r(n)}(x(t),y(t),z(t)),$$
    $$\dot{y}(t)=g_{r(n)}(y(t),z(t)),$$
    $$\dot{z}(t)=f_{r(n)}(z(t)),$$

    with $x_{n}(T_{n})=\hat{x}(T_{n})$, $y_{n}(T_{n})=\hat{y}(T_{n})$ and $z_{n}(T_{n})=\hat{z}(T_{n})$.

We again state some results on ODEs, this time with no external input. These again follow along the lines of Lemmas 2-5 in (Lakshminarayanan and Bhatnagar 2017). Let $z_{c}(t,z)$ and $z_{\infty}(t,z)$ denote the solutions to the ODEs

$$\dot{z}(t)=f_{c}(z(t)),\quad t\geq 0,$$
$$\dot{z}(t)=f_{\infty}(z(t)),\quad t\geq 0,$$

respectively, with initial condition $z\in\mathbb{R}^{d_{3}}$.

Lemma 20.

Let $K\subset\mathbb{R}^{d_{3}}$ be a compact set. Then under $\mathbf{(B8)}$, given $\delta>0$, $\exists T_{\delta}>0$ such that $\forall z\in K$

$$z_{\infty}(t,z)\in B(0,\delta),\quad\forall t\geq T_{\delta}.$$
Lemma 21.

Let $z\in\mathbb{R}^{d_{3}}$, let $[0,T]$ be a given time interval and $r>0$. Then

$$||z_{c}(t,z)-z_{\infty}(t,z)||\leq\epsilon(c)Te^{LT},\quad\forall t\in[0,T],$$

where $\epsilon(c)\rightarrow 0$ as $c\rightarrow\infty$.

Lemma 22.

Given $\epsilon>0$ and $T>0$, $\exists c_{\epsilon,T}>0$, $\delta_{\epsilon,T}>0$ and $r_{\epsilon,T}>0$ such that $\forall z\in B(0,\delta_{\epsilon,T})$ and $\forall c>c_{\epsilon,T}$,

$$z_{c}(t,z)\in B(0,2\epsilon),\quad\forall t\in[0,T].$$
Lemma 23.

Let $z\in B(0,1)\subset\mathbb{R}^{d_{3}}$ and let $\mathbf{(B8)}$ hold. Then given $\epsilon>0$, $\exists c_{\epsilon}\geq 1$, $r_{\epsilon}>0$ and $T_{\epsilon}>0$ such that

$$||z_{c}(t,z)||\leq 2\epsilon,\quad\forall c>c_{\epsilon},\ t\geq T_{\epsilon}.$$
Lemma 24.

Under $\mathbf{(B1)}$-$\mathbf{(B3)}$,

  1. (i)

    For $0\leq k\leq m(n+1)-m(n)$, $||\hat{\psi}(t(m(n)+k))||\leq K^{(3)}$ a.s. for some constant $K^{(3)}>0$.

  2. (ii)

    For sufficiently large $n$, we have $\sup_{t\in[T_{n},T_{n+1})}||\hat{z}(t)-z_{n}(t)||=(\epsilon_{1}(c)+\epsilon_{2}(c))LTe^{L(L+1)T}$ a.s., where $\epsilon_{1}(c),\epsilon_{2}(c)\rightarrow 0$ as $c\rightarrow\infty$.

Proof.

See Lemma 9 (ii) and (iii) of (Lakshminarayanan and Bhatnagar 2017). ∎

Theorem 25.

Under assumptions $\mathbf{(B1)}$-$\mathbf{(B4)}$ and $\mathbf{(B6)}$-$\mathbf{(B8)}$, we have:

  • (i)

    Let $C^{*}_{a}$ and $C^{*}_{b}$ be as in Theorems 13 and 19 respectively. Then, $||\hat{z}(T_{n})||\geq\frac{1}{4+C^{*}_{a}C^{*}_{b}+C^{*}_{b}}$ for sufficiently large $||\bar{z}(T_{n})||$.

  • (ii)

    For $n$ large and $T=T_{\frac{1}{4}}$ (here $T$ is the sampling frequency as in (S2) and $T_{\frac{1}{4}}$ is $T_{\epsilon}$ as in Lemma 23 with $\epsilon=\frac{1}{4}$), if $||\bar{z}(T_{n})||>C$ for some $C>0$, then $||\bar{z}(T_{n+1})||<\frac{1}{2}||\bar{z}(T_{n})||$.

  • (iii)

    $||\bar{z}(T_{n})||\leq K_{c}^{*}$ for some $K_{c}^{*}>0$.

  • (iv)

    $\sup_{n}||z_{n}||<\infty$ a.s.

Proof.
  1. (i)

    From Theorems 13 and 19 we know that $r(n)<C^{*}_{a}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)+C^{*}_{b}(1+||\bar{z}(T_{n})||)+||\bar{z}(T_{n})||$. Therefore,

    $$\begin{split}||\hat{z}(T_{n})||&=\frac{||\bar{z}(T_{n})||}{r(n)}>\frac{||\bar{z}(T_{n})||}{C^{*}_{a}(1+||\bar{y}(T_{n})||+||\bar{z}(T_{n})||)+C^{*}_{b}(1+||\bar{z}(T_{n})||)+||\bar{z}(T_{n})||}\\ &>\frac{||\bar{z}(T_{n})||}{C^{*}_{a}(1+C^{*}_{b}(1+||\bar{z}(T_{n})||)+||\bar{z}(T_{n})||)+C^{*}_{b}(1+||\bar{z}(T_{n})||)+||\bar{z}(T_{n})||}\\ &>\frac{1}{\frac{C^{*}_{a}}{||\bar{z}(T_{n})||}+\frac{C^{*}_{a}C^{*}_{b}}{||\bar{z}(T_{n})||}+C^{*}_{a}C^{*}_{b}+C^{*}_{a}+\frac{C^{*}_{b}}{||\bar{z}(T_{n})||}+C^{*}_{b}+1}\\ &>\frac{1}{4+C^{*}_{a}C^{*}_{b}+C^{*}_{b}},\qquad\mbox{for }||\bar{z}(T_{n})||>\max{(C^{*}_{a},C^{*}_{b},C^{*}_{a}C^{*}_{b})}.\end{split}$$
  2. (ii)

    Since $0\in\mathbb{R}^{d_{3}}$ is the unique globally asymptotically stable equilibrium, Lemma 23 gives $c_{\frac{1}{4}},T_{\frac{1}{4}}>0$ such that $||z_{c}(t,z)||<\frac{1}{4(4+C^{*}_{a}C^{*}_{b}+C^{*}_{b})}$, $\forall c\geq c_{\frac{1}{4}},t\geq T_{\frac{1}{4}}$. Also, for $||\bar{z}(T_{n})||>\max{(C^{*}_{a},C^{*}_{b},C^{*}_{a}C^{*}_{b})}$ we have $||\hat{z}(T_{n})||>\frac{1}{4+C^{*}_{a}C^{*}_{b}+C^{*}_{b}}$, and for sufficiently large $n$, from Lemma 24(ii), $\exists C_{2}>0$ such that $||\hat{z}(T_{n+1}^{-})-z_{n}(T_{n+1})||<\frac{1}{4(4+C^{*}_{a}C^{*}_{b}+C^{*}_{b})}$ for $r(n)>C_{2}$. We pick $C=\max(c_{1/4},C_{2},\max{(C^{*}_{a},C^{*}_{b},C^{*}_{a}C^{*}_{b})})$ and $T=T_{1/4}$. For $n$ large it then follows that $||\hat{z}(T_{n+1}^{-})||\leq||\hat{z}(T_{n+1}^{-})-z_{n}(T_{n+1})||+||z_{n}(T_{n+1})||\leq\frac{1}{2(4+C^{*}_{a}C^{*}_{b}+C^{*}_{b})}$. Finally, since

    $$\frac{||\bar{z}(T_{n+1})||}{||\bar{z}(T_{n})||}=\frac{||\hat{z}(T_{n+1}^{-})||}{||\hat{z}(T_{n})||},$$

    it follows that

    $$||\bar{z}(T_{n+1})||<\frac{1}{2}||\bar{z}(T_{n})||.$$

(iii) and (iv) follow along the lines of arguments as in Lemma 10 (iii) and (iv) of (Lakshminarayanan and Bhatnagar 2017). ∎

Now from Theorem 25(iv), it follows that the slow timescale iterates $z_{n}$ are bounded a.s. (i.e., $\sup_{n}||z_{n}||<\infty$ a.s.), which in turn implies via Theorem 19(iii) that the middle timescale iterates $y_{n}$ are bounded (i.e., $\sup_{n}||y_{n}||<\infty$ a.s.). Finally, the fast timescale iterates $x_{n}$ are bounded by Theorem 13(iii), since both the middle and slow timescale iterates are bounded, so that $\sup_{n}||x_{n}||<\infty$ a.s. Combining these, we have $\sup_{n}(||x_{n}||+||y_{n}||+||z_{n}||)<\infty$ a.s., thereby proving Theorem 7. ∎
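To make the stability statement concrete, here is a small numerical sketch of a three-timescale recursion. Everything in it is a hypothetical toy instance chosen for illustration, not an example from the paper: the scalar linear drifts $h$, $g$, $f$, the step sizes $a(n)=(n+10)^{-0.6}$, $b(n)=(n+10)^{-0.8}$, $c(n)=(n+10)^{-1}$, and the Gaussian noise standing in for the martingale differences. For this toy system $\lambda(y,z)=\frac{1}{2}(y+z)$, $\Gamma(z)=z$ and the slow ODE is $\dot{z}=1-z$, so the iterates should remain bounded and $z_n$ should approach 1.

```python
import random


def run_three_timescale(n_steps=20000, seed=0):
    """Simulate a toy three-timescale linear SA and track sup_n of the norm."""
    rng = random.Random(seed)
    x, y, z = 5.0, -5.0, 5.0
    sup_norm = abs(x) + abs(y) + abs(z)
    for n in range(n_steps):
        a = (n + 10) ** -0.6   # fastest step size (assumed)
        b = (n + 10) ** -0.8   # middle step size (assumed)
        c = (n + 10) ** -1.0   # slowest step size (assumed)
        h = -x + 0.5 * y + 0.5 * z   # fast drift, equilibrium x = (y + z) / 2
        g = x - y                    # middle drift, equilibrium y = z given x
        f = 1.0 - z                  # slow drift, equilibrium z = 1
        # Simultaneous update of all three iterates with independent noise.
        x, y, z = (
            x + a * (h + rng.gauss(0.0, 0.1)),
            y + b * (g + rng.gauss(0.0, 0.1)),
            z + c * (f + rng.gauss(0.0, 0.1)),
        )
        sup_norm = max(sup_norm, abs(x) + abs(y) + abs(z))
    return (x, y, z), sup_norm


final_state, sup_norm = run_three_timescale()
```

The slow drift is deliberately independent of $x$ and $y$ to keep the toy system easy to reason about; in the general setting of Theorem 7, $f$ may depend on all three iterates.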

A slightly more general version, in which each iterate carries a small perturbation term,

$$x_{n+1}=x_{n}+a(n)\left(h(x_{n},y_{n},z_{n})+M_{n+1}^{(1)}+\varepsilon^{(1)}_{n}\right),\tag{36}$$
$$y_{n+1}=y_{n}+b(n)\left(g(x_{n},y_{n},z_{n})+M_{n+1}^{(2)}+\varepsilon^{(2)}_{n}\right),\tag{37}$$
$$z_{n+1}=z_{n}+c(n)\left(f(x_{n},y_{n},z_{n})+M_{n+1}^{(3)}+\varepsilon^{(3)}_{n}\right),\tag{38}$$

with $\varepsilon_{n}^{(k)}=o(1),\ k=1,2,3$, can be shown to converge to the same solution. Since the additional error terms are $o(1)$, their contribution is asymptotically negligible. See the arguments in the third extension of (Chapter 2, pp. 17 of Borkar (2008b)), which handles this case for one-timescale iterates.

A3  Convergence of GTD2-M and TDC-M

Here we provide the asymptotic convergence guarantees of the momentum variants of the remaining two Gradient TD methods namely GTD2-M and TDC-M. The analysis is similar to that of GTD-M in Theorem 4 and is provided here for completeness. We show that the assumptions (B1) - (B7) of the main paper are satisfied and thereby invoke Theorem 3 to show convergence.

A3.1  Asymptotic convergence of GTD2-M

We re-write the iterates for GTD2-M below:

$$\theta_{t+1}=\theta_{t}+\alpha_{t}(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}+\eta_{t}(\theta_{t}-\theta_{t-1}), \tag{39}$$
$$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}. \tag{40}$$

As before, choosing $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}$, where $\{\varrho_{t}\}$ is a positive sequence and $w\in\mathbb{R}$ is a constant, we can decompose the two iterates into three recursions as below:

$$v_{t+1}=v_{t}+\xi_{t}\left((\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\right) \tag{41}$$
$$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-\phi_{t}\phi_{t}^{T}u_{t}) \tag{42}$$
$$\theta_{t+1}=\theta_{t}+\varrho_{t}(v_{t}+\varepsilon_{t}) \tag{43}$$
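The decomposition can be sanity-checked numerically. The sketch below is illustrative only: it assumes $v_{t}=(\theta_{t}-\theta_{t-1})/\varrho_{t-1}$, $\xi_{t}=\alpha_{t}/\varrho_{t}$ and $\varepsilon_{t}=v_{t+1}-v_{t}$ (the identifications under which the heavy ball update and the $(v,\theta)$ recursions agree algebraically), and it replaces the sampled update direction by a hypothetical deterministic map `G`. The function names and exponents are assumptions made for this example.

```python
import numpy as np

def step_sizes(t, xi_exp=0.6, rho_exp=0.9):
    """Illustrative polynomial step sizes xi_t and rho_t (exponents are assumptions)."""
    return (t + 1.0) ** (-xi_exp), (t + 1.0) ** (-rho_exp)

def run_both_forms(G, theta0, theta1, w=1.0, steps=50):
    """Run the heavy ball form and the decomposed (v, theta) form side by side.

    G(t, theta) is a hypothetical stand-in for the update direction.
    """
    # Heavy ball state: previous and current theta.
    prev, cur = theta0.copy(), theta1.copy()
    hb = [cur.copy()]
    # Decomposed state: theta and v_1 = (theta_1 - theta_0) / rho_0.
    th = theta1.copy()
    _, rho0 = step_sizes(0)
    v = (theta1 - theta0) / rho0
    dec = [th.copy()]
    for t in range(1, steps + 1):
        xi_t, rho_t = step_sizes(t)
        _, rho_prev = step_sizes(t - 1)
        alpha_t = xi_t * rho_t                      # since xi_t = alpha_t / rho_t
        eta_t = (rho_t - w * alpha_t) / rho_prev    # momentum parameter
        # Heavy ball update.
        nxt = cur + alpha_t * G(t, cur) + eta_t * (cur - prev)
        prev, cur = cur, nxt
        hb.append(cur.copy())
        # Decomposed update: v-recursion, then theta-recursion with eps_t = v_{t+1} - v_t.
        v_next = v + xi_t * (G(t, th) - w * v)
        eps = v_next - v
        th = th + rho_t * (v + eps)
        v = v_next
        dec.append(th.copy())
    return np.array(hb), np.array(dec)

# Example with a stable linear map as the stand-in direction.
B = np.array([[-1.0, 0.3], [-0.2, -0.5]])
c = np.array([0.5, -1.0])
hb, dec = run_both_forms(lambda t, th: B @ th + c, np.zeros(2), np.array([0.1, -0.1]))
print(np.max(np.abs(hb - dec)))  # difference is at floating-point level
```

Both trajectories coincide up to rounding, which is the sense in which the momentum iterate can be analysed through the additional recursion for $v_{t}$.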
Theorem 26.

Assume $\mathcal{A}1$, $\mathcal{A}3$ and $\mathcal{A}4$ hold and let $w>0$. Then, the GTD2-M iterates given by (39) and (40) satisfy $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ a.s. as $n\rightarrow\infty$.

Proof.

We transform the iterates given by (41), (42) and (43) into the standard SA form given by (22), (23) and (24). Let $\mathcal{F}_{t}=\sigma(u_{0},v_{0},\theta_{0},r_{j+1},\phi_{j},\phi_{j}^{\prime}:j<t)$. Let $A_{t}=\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}$ and $b_{t}=r_{t+1}\phi_{t}$. Then, (41) can be re-written as:

$$v_{t+1}=v_{t}+\xi_{t}\left(h(v_{t},u_{t},\theta_{t})+M_{t+1}^{(1)}\right)$$

where,

$$\begin{split}h(v_{t},u_{t},\theta_{t})&=\mathbb{E}[(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}|\mathcal{F}_{t}]\\ &=-\bar{A}^{T}u_{t}-wv_{t},\\ M_{t+1}^{(1)}&=-A_{t}^{T}u_{t}-wv_{t}-h(v_{t},u_{t},\theta_{t})\\ &=(\bar{A}^{T}-A_{t}^{T})u_{t}.\end{split}$$

Next, (42) can be re-written as:

$$u_{t+1}=u_{t}+\beta_{t}\left(g(v_{t},u_{t},\theta_{t})+M_{t+1}^{(2)}\right)$$

where,

$$\begin{split}g(v_{t},u_{t},\theta_{t})&=\mathbb{E}[\delta_{t}\phi_{t}-\phi_{t}\phi_{t}^{T}u_{t}|\mathcal{F}_{t}]=\bar{A}\theta_{t}+\bar{b}-\bar{C}u_{t},\\ M_{t+1}^{(2)}&=A_{t}\theta_{t}+b_{t}-C_{t}u_{t}-g(v_{t},u_{t},\theta_{t})\\ &=(A_{t}-\bar{A})\theta_{t}+(b_{t}-\bar{b})+(\bar{C}-C_{t})u_{t}.\end{split}$$

Here, $C_{t}=\phi_{t}\phi_{t}^{T}$ and $\bar{C}=\mathbb{E}[\phi_{t}\phi_{t}^{T}]$. Finally, (43) can be re-written as:

$$\theta_{t+1}=\theta_{t}+\varrho_{t}\left(f(v_{t},u_{t},\theta_{t})+\varepsilon_{t}+M_{t+1}^{(3)}\right)$$

where,

$$f(v_{t},u_{t},\theta_{t})=v_{t}\text{ and }M_{t+1}^{(3)}=0.$$

The functions $h,g,f$ are linear in $v,u,\theta$ and hence Lipschitz continuous, therefore satisfying (B1). We choose the step-size sequences such that they satisfy (B2). One popular choice is

$$\xi_{t}=\frac{1}{(t+1)^{\xi}},\quad\beta_{t}=\frac{1}{(t+1)^{\beta}},\quad\varrho_{t}=\frac{1}{(t+1)^{\varrho}},\quad\frac{1}{2}<\xi<\beta<\varrho\leq 1.$$
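As a rough illustration of this choice, the sketch below evaluates the three polynomial schedules together with the induced $\alpha_{t}$ and $\eta_{t}$. The exponents $0.6, 0.75, 0.9$, the value $w=1$, and the relation $\alpha_{t}=\xi_{t}\varrho_{t}$ (which follows if $\xi_{t}=\alpha_{t}/\varrho_{t}$) are assumptions made for this example, not values prescribed by the analysis.

```python
import numpy as np

def schedules(t, xi_exp=0.6, beta_exp=0.75, rho_exp=0.9, w=1.0):
    """Polynomial step sizes for t >= 1, plus the induced alpha_t and eta_t.

    Exponents satisfy 1/2 < xi_exp < beta_exp < rho_exp <= 1 (illustrative values).
    """
    t = np.asarray(t, dtype=float)
    xi = (t + 1.0) ** (-xi_exp)       # fastest timescale (v iterate)
    beta = (t + 1.0) ** (-beta_exp)   # middle timescale (u iterate)
    rho = (t + 1.0) ** (-rho_exp)     # slowest timescale (theta iterate)
    rho_prev = t ** (-rho_exp)        # rho_{t-1} = 1 / t^rho_exp
    alpha = xi * rho                  # assumes xi_t = alpha_t / rho_t
    eta = (rho - w * alpha) / rho_prev
    return xi, beta, rho, alpha, eta

t = np.arange(1, 10001)
xi, beta, rho, alpha, eta = schedules(t)
# Timescale separation: beta_t / xi_t and rho_t / beta_t both shrink with t.
print(beta[-1] / xi[-1], rho[-1] / beta[-1])
# Under these exponents the momentum parameter stays in (0, 1) and drifts towards 1.
print(eta[0], eta[-1])
```

The ratios $\beta_{t}/\xi_{t}$ and $\varrho_{t}/\beta_{t}$ decay only polynomially, so the separation between timescales is gradual for exponents this close together.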

Next, $M_{t+1}^{(1)},M_{t+1}^{(2)}$ and $M_{t+1}^{(3)}$, $t\geq 0$, are martingale difference sequences w.r.t. $\mathcal{F}_{t}$ by construction. Moreover,

$$\mathbb{E}[\|M_{t+1}^{(1)}\|^{2}|\mathcal{F}_{t}]\leq\|\bar{A}^{T}-A_{t}^{T}\|^{2}\|u_{t}\|^{2},$$
$$\mathbb{E}[\|M_{t+1}^{(2)}\|^{2}|\mathcal{F}_{t}]\leq 3\left(\|A_{t}-\bar{A}\|^{2}\|\theta_{t}\|^{2}+\|b_{t}-\bar{b}\|^{2}+\|\bar{C}-C_{t}\|^{2}\|u_{t}\|^{2}\right).$$

The first part of (B3) is satisfied with $K_{1}=\|\bar{A}^{T}-A_{t}^{T}\|^{2}$, $K_{2}=3\max(\|A_{t}-\bar{A}\|^{2},\|b_{t}-\bar{b}\|^{2},\|\bar{C}-C_{t}\|^{2})$ and any $K_{3}>0$. The fact that $K_{1},K_{2}<\infty$ follows from the bounded features and bounded rewards assumption in $\mathcal{A}1$. Next, observe that $\|\varepsilon_{t}^{(3)}\|=\xi_{t}\|(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}u_{t}-wv_{t}\|\rightarrow 0$ since $\xi_{t}\rightarrow 0$ as $t\rightarrow\infty$. For fixed $u,\theta\in\mathbb{R}^{d}$, consider the ODE

$$\dot{v}(t)=-\bar{A}^{T}u-wv(t).$$

For $w>0$, $\lambda(u,\theta)=-\frac{\bar{A}^{T}u}{w}$ is the unique g.a.s.e., is linear and therefore Lipschitz continuous. This satisfies (B4)(i). Next, for a fixed $\theta\in\mathbb{R}^{d}$,

$$\dot{u}(t)=\bar{A}\theta+\bar{b}-\bar{C}u(t),$$

has $\Gamma(\theta)=\bar{C}^{-1}(\bar{A}\theta+\bar{b})$ as its unique g.a.s.e. because $-\bar{C}$ is negative definite. Also, $\Gamma(\theta)$ is linear in $\theta$ and therefore Lipschitz. This satisfies (B4)(ii). Finally, to satisfy (B4)(iii), consider

$$\dot{\theta}(t)=\frac{-\bar{A}^{T}\bar{C}^{-1}\bar{A}\theta(t)-\bar{A}^{T}\bar{C}^{-1}\bar{b}}{w}.$$

Since $\bar{A}$ is negative definite and $\bar{C}$ is positive definite, $-\bar{A}^{T}\bar{C}^{-1}\bar{A}$ is negative definite. The right-hand side vanishes exactly when $\bar{A}\theta+\bar{b}=0$, so $\theta^{*}=-\bar{A}^{-1}\bar{b}$ is the unique g.a.s.e.
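The definiteness argument can be checked numerically on synthetic matrices. In the sketch below, `A_bar`, `C_bar` and `b_bar` are randomly generated stand-ins (assumptions for illustration, not quantities from the experiments): `A_bar` is built so that $x^{T}\bar{A}x<0$ for all $x\neq 0$, and `C_bar` is symmetric positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
d, w = 4, 1.0

# Stand-in for A_bar: negative definite symmetric part plus a skew-symmetric part.
S = rng.standard_normal((d, d))
K = rng.standard_normal((d, d))
A_bar = -(S @ S.T + np.eye(d)) + (K - K.T)

# Stand-in for C_bar: symmetric positive definite.
Cm = rng.standard_normal((d, d))
C_bar = Cm @ Cm.T + np.eye(d)
b_bar = rng.standard_normal(d)

# Slow-timescale ODE: theta_dot = drift @ theta + offset.
drift = -A_bar.T @ np.linalg.solve(C_bar, A_bar) / w
offset = -A_bar.T @ np.linalg.solve(C_bar, b_bar) / w

# drift is symmetric up to rounding, so symmetrise before calling eigvalsh.
eigs = np.linalg.eigvalsh((drift + drift.T) / 2)
theta_star = np.linalg.solve(drift, -offset)

print(eigs.max())                                               # negative
print(np.allclose(theta_star, -np.linalg.solve(A_bar, b_bar)))  # equilibrium matches -A^{-1} b
```

All eigenvalues of the drift matrix are strictly negative, and the equilibrium of the linear ODE coincides with $-\bar{A}^{-1}\bar{b}$ for these stand-in matrices.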

Next, we show that the sufficient conditions for stability of the three iterates are satisfied. The function $h_{c}(v,u,\theta)=\frac{-c\bar{A}^{T}u-wcv}{c}=-\bar{A}^{T}u-wv\rightarrow h_{\infty}(v,u,\theta)=-\bar{A}^{T}u-wv$ uniformly on compacts as $c\rightarrow\infty$. The limiting ODE:

$$\dot{v}(t)=-\bar{A}^{T}u-wv(t)$$

has $\lambda_{\infty}(u,\theta)=-\frac{\bar{A}^{T}u}{w}$ as its unique g.a.s.e. $\lambda_{\infty}$ is Lipschitz with $\lambda_{\infty}(0,0)=0$, thus satisfying assumption (B5).

The function $g_{c}(u,\theta)=\frac{c\bar{A}\theta+\bar{b}-c\bar{C}u}{c}=\bar{A}\theta-\bar{C}u+\frac{\bar{b}}{c}\rightarrow g_{\infty}(u,\theta)=\bar{A}\theta-\bar{C}u$ uniformly on compacts as $c\rightarrow\infty$. The limiting ODE

$$\dot{u}(t)=\bar{A}\theta-\bar{C}u(t)$$

has $\Gamma_{\infty}(\theta)=\bar{C}^{-1}\bar{A}\theta$ as its unique g.a.s.e. since $-\bar{C}$ is negative definite. $\Gamma_{\infty}$ is Lipschitz with $\Gamma_{\infty}(0)=0$. Thus assumption (B6) is satisfied.

Finally, $f_{c}(\theta)=\frac{-c\bar{A}^{T}\bar{C}^{-1}\bar{A}\theta}{cw}\rightarrow f_{\infty}(\theta)=\frac{-\bar{A}^{T}\bar{C}^{-1}\bar{A}\theta}{w}$ uniformly on compacts as $c\rightarrow\infty$ and the ODE:

$$\dot{\theta}(t)=-\frac{\bar{A}^{T}\bar{C}^{-1}\bar{A}\theta(t)}{w}$$

has the origin in $\mathbb{R}^{d}$ as its unique g.a.s.e. This ensures the final condition (B7). By Theorem 3,

$$\begin{pmatrix}v_{t}\\ u_{t}\\ \theta_{t}\end{pmatrix}\rightarrow\begin{pmatrix}\lambda(\Gamma(-\bar{A}^{-1}\bar{b}),-\bar{A}^{-1}\bar{b})\\ \Gamma(-\bar{A}^{-1}\bar{b})\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}=\begin{pmatrix}0\\ 0\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}.$$

Specifically, $\theta_{t}\rightarrow-\bar{A}^{-1}\bar{b}$. ∎

A3.2  Asymptotic Convergence of TDC-M

We re-write the iterates for TDC-M below:

$$\theta_{t+1}=\theta_{t}+\alpha_{t}(\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}(\phi_{t}^{T}u_{t}))+\eta_{t}(\theta_{t}-\theta_{t-1}), \tag{44}$$
$$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}-\phi_{t}^{T}u_{t})\phi_{t}. \tag{45}$$

As before, choosing $\eta_{t}=\frac{\varrho_{t}-w\alpha_{t}}{\varrho_{t-1}}$, where $\{\varrho_{t}\}$ is a positive sequence and $w\in\mathbb{R}$ is a constant, we can decompose the two iterates into three recursions as below:

$$v_{t+1}=v_{t}+\xi_{t}\left(\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}\phi_{t}^{T}u_{t}-wv_{t}\right) \tag{46}$$
$$u_{t+1}=u_{t}+\beta_{t}(\delta_{t}\phi_{t}-\phi_{t}\phi_{t}^{T}u_{t}) \tag{47}$$
$$\theta_{t+1}=\theta_{t}+\varrho_{t}(v_{t}+\varepsilon_{t}) \tag{48}$$
Theorem 27.

Assume $\mathcal{A}1$, $\mathcal{A}3$ and $\mathcal{A}4$ hold and let $w>0$. Then, the TDC-M iterates given by (44) and (45) satisfy $\theta_{n}\rightarrow\theta^{*}=-\bar{A}^{-1}\bar{b}$ a.s. as $n\rightarrow\infty$.

Proof.

We transform the iterates given by (46), (47) and (48) into the standard SA form given by (22), (23) and (24). Let $\mathcal{F}_{t}=\sigma(u_{0},v_{0},\theta_{0},r_{j+1},\phi_{j},\phi_{j}^{\prime}:j<t)$. Let $A_{t}=\phi_{t}(\gamma\phi_{t}^{\prime}-\phi_{t})^{T}$ and $b_{t}=r_{t+1}\phi_{t}$. Then, (46) can be re-written as:

$$v_{t+1}=v_{t}+\xi_{t}\left(h(v_{t},u_{t},\theta_{t})+M_{t+1}^{(1)}\right)$$

where,

$$\begin{split}h(v_{t},u_{t},\theta_{t})&=\mathbb{E}[\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}\phi_{t}^{T}u_{t}-wv_{t}|\mathcal{F}_{t}]\\ &=\bar{A}\theta_{t}+\bar{b}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u_{t}-wv_{t},\\ M_{t+1}^{(1)}&=\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}\phi_{t}^{T}u_{t}-wv_{t}-h(v_{t},u_{t},\theta_{t})\\ &=(A_{t}-\bar{A})\theta_{t}+(b_{t}-\bar{b})+\gamma(\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]-\phi_{t}^{\prime}\phi_{t}^{T})u_{t}.\end{split}$$

Next, (47) can be re-written as:

$$u_{t+1}=u_{t}+\beta_{t}\left(g(v_{t},u_{t},\theta_{t})+M_{t+1}^{(2)}\right)$$

where,

$$\begin{split}g(v_{t},u_{t},\theta_{t})&=\mathbb{E}[\delta_{t}\phi_{t}-\phi_{t}\phi_{t}^{T}u_{t}|\mathcal{F}_{t}]=\bar{A}\theta_{t}+\bar{b}-\bar{C}u_{t},\\ M_{t+1}^{(2)}&=A_{t}\theta_{t}+b_{t}-C_{t}u_{t}-g(v_{t},u_{t},\theta_{t})\\ &=(A_{t}-\bar{A})\theta_{t}+(b_{t}-\bar{b})+(\bar{C}-C_{t})u_{t}.\end{split}$$

Here, $C_{t}=\phi_{t}\phi_{t}^{T}$ and $\bar{C}=\mathbb{E}[\phi_{t}\phi_{t}^{T}]$. Finally, (48) can be re-written as:

$$\theta_{t+1}=\theta_{t}+\varrho_{t}\left(f(v_{t},u_{t},\theta_{t})+\varepsilon_{t}+M_{t+1}^{(3)}\right)$$

where,

$$f(v_{t},u_{t},\theta_{t})=v_{t}\text{ and }M_{t+1}^{(3)}=0.$$

The functions $h,g,f$ are linear in $v,u,\theta$ and hence Lipschitz continuous, therefore satisfying (B1). We choose the step-size sequences such that they satisfy (B2). One popular choice is

$$\xi_{t}=\frac{1}{(t+1)^{\xi}},\quad\beta_{t}=\frac{1}{(t+1)^{\beta}},\quad\varrho_{t}=\frac{1}{(t+1)^{\varrho}},\quad\frac{1}{2}<\xi<\beta<\varrho\leq 1.$$

Observe that $M_{t+1}^{(1)},M_{t+1}^{(2)}$ and $M_{t+1}^{(3)}$, $t\geq 0$, are martingale difference sequences w.r.t. $\mathcal{F}_{t}$ by construction. Next,

$$\mathbb{E}[\|M_{t+1}^{(1)}\|^{2}|\mathcal{F}_{t}]\leq 3\left(\|A_{t}-\bar{A}\|^{2}\|\theta_{t}\|^{2}+\|b_{t}-\bar{b}\|^{2}+\gamma\|\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]-\phi_{t}^{\prime}\phi_{t}^{T}\|^{2}\|u_{t}\|^{2}\right),$$
$$\mathbb{E}[\|M_{t+1}^{(2)}\|^{2}|\mathcal{F}_{t}]\leq 3\left(\|A_{t}-\bar{A}\|^{2}\|\theta_{t}\|^{2}+\|b_{t}-\bar{b}\|^{2}+\|\bar{C}-C_{t}\|^{2}\|u_{t}\|^{2}\right).$$

The first part of (B3) is satisfied with $K_{1}=3\max(\|A_{t}-\bar{A}\|^{2},\|b_{t}-\bar{b}\|^{2},\gamma\|\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]-\phi_{t}^{\prime}\phi_{t}^{T}\|^{2})$, $K_{2}=3\max(\|A_{t}-\bar{A}\|^{2},\|b_{t}-\bar{b}\|^{2},\|\bar{C}-C_{t}\|^{2})$ and any $K_{3}>0$. The fact that $K_{1},K_{2}<\infty$ follows from the bounded features and bounded rewards assumption in $\mathcal{A}1$. Next, observe that $\|\varepsilon_{t}^{(3)}\|=\xi_{t}\|\delta_{t}\phi_{t}-\gamma\phi_{t}^{\prime}\phi_{t}^{T}u_{t}-wv_{t}\|\rightarrow 0$ since $\xi_{t}\rightarrow 0$ as $t\rightarrow\infty$. For fixed $u,\theta\in\mathbb{R}^{d}$, consider the ODE

$$\dot{v}(t)=\bar{A}\theta+\bar{b}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-wv(t).$$

For $w>0$, $\lambda(u,\theta)=\frac{\bar{A}\theta+\bar{b}-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u}{w}$ is the unique g.a.s.e., is linear and therefore Lipschitz continuous. This satisfies (B4)(i). Next, for a fixed $\theta\in\mathbb{R}^{d}$,

$$\dot{u}(t)=\bar{A}\theta+\bar{b}-\bar{C}u(t),$$

has $\Gamma(\theta)=\bar{C}^{-1}(\bar{A}\theta+\bar{b})$ as its unique g.a.s.e. because $-\bar{C}$ is negative definite. Also, $\Gamma(\theta)$ is linear in $\theta$ and therefore Lipschitz. This satisfies (B4)(ii). Finally, to satisfy (B4)(iii), consider

$$\dot{\theta}(t)=\frac{(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})(\bar{A}\theta(t)+\bar{b})}{w}.$$

Now, $(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})\bar{A}=(\mathbb{E}[\phi_{t}\phi_{t}^{T}]-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}])\bar{C}^{-1}\bar{A}=\mathbb{E}[(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}]\bar{C}^{-1}\bar{A}=-\bar{A}^{T}\bar{C}^{-1}\bar{A}$. Since $\bar{A}$ is negative definite and $\bar{C}$ is positive definite, $-\bar{A}^{T}\bar{C}^{-1}\bar{A}$ is negative definite and hence the above ODE has $\theta^{*}=-\bar{A}^{-1}\bar{b}$ as its unique g.a.s.e.
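The matrix identity used here is purely algebraic, so it also holds when the expectations are replaced by empirical averages over any sample. The sketch below checks this on synthetic Gaussian features; the arrays `phi` and `phi_next` and the name `D_bar` (standing for the average of $\phi_{t}^{\prime}\phi_{t}^{T}$) are hypothetical and introduced only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, gamma = 500, 3, 0.95

# Synthetic feature pairs (phi_t, phi'_t), one per row; purely illustrative data.
phi = rng.standard_normal((n, d))
phi_next = rng.standard_normal((n, d))

C_bar = phi.T @ phi / n                         # average of phi phi^T
D_bar = phi_next.T @ phi / n                    # average of phi' phi^T
A_bar = phi.T @ (gamma * phi_next - phi) / n    # average of phi (gamma phi' - phi)^T

C_inv = np.linalg.inv(C_bar)
lhs = (np.eye(d) - gamma * D_bar @ C_inv) @ A_bar
rhs = -A_bar.T @ C_inv @ A_bar

print(np.max(np.abs(lhs - rhs)))  # floating-point level
```

The key step is $\bar{C}-\gamma\,\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]=-\bar{A}^{T}$, which the averages above satisfy exactly by construction.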

Next, we show that the sufficient conditions for stability of the three iterates are satisfied. The function $h_{c}(v,u,\theta)=\frac{c\bar{A}\theta+\bar{b}-c\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-cwv}{c}=\bar{A}\theta-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-wv+\frac{\bar{b}}{c}\rightarrow h_{\infty}(v,u,\theta)=\bar{A}\theta-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-wv$ uniformly on compacts as $c\rightarrow\infty$. The limiting ODE:

$$\dot{v}(t)=\bar{A}\theta-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u-wv(t)$$

has $\lambda_{\infty}(u,\theta)=\frac{\bar{A}\theta-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]u}{w}$ as its unique g.a.s.e. $\lambda_{\infty}$ is Lipschitz with $\lambda_{\infty}(0,0)=0$, thus satisfying assumption (B5).

The function $g_{c}(u,\theta)=\frac{c\bar{A}\theta+\bar{b}-c\bar{C}u}{c}=\bar{A}\theta-\bar{C}u+\frac{\bar{b}}{c}\rightarrow g_{\infty}(u,\theta)=\bar{A}\theta-\bar{C}u$ uniformly on compacts as $c\rightarrow\infty$. The limiting ODE

$$\dot{u}(t)=\bar{A}\theta-\bar{C}u(t)$$

has $\Gamma_{\infty}(\theta)=\bar{C}^{-1}\bar{A}\theta$ as its unique g.a.s.e. since $-\bar{C}$ is negative definite. $\Gamma_{\infty}$ is Lipschitz with $\Gamma_{\infty}(0)=0$. Thus assumption (B6) is satisfied.

Finally, $f_{c}(\theta)=\frac{c\bar{A}\theta-c\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1}\bar{A}\theta}{cw}\rightarrow f_{\infty}(\theta)=\frac{(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})\bar{A}\theta}{w}$ uniformly on compacts as $c\rightarrow\infty$. Consider the ODE:

$$\dot{\theta}(t)=\frac{(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})\bar{A}\theta(t)}{w}.$$

Now, $(I-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}]\bar{C}^{-1})\bar{A}=(\mathbb{E}[\phi_{t}\phi_{t}^{T}]-\gamma\mathbb{E}[\phi_{t}^{\prime}\phi_{t}^{T}])\bar{C}^{-1}\bar{A}=\mathbb{E}[(\phi_{t}-\gamma\phi_{t}^{\prime})\phi_{t}^{T}]\bar{C}^{-1}\bar{A}=-\bar{A}^{T}\bar{C}^{-1}\bar{A}$. Since $\bar{A}$ is negative definite and $\bar{C}$ is positive definite, $-\bar{A}^{T}\bar{C}^{-1}\bar{A}$ is negative definite and hence the above ODE has the origin as its unique g.a.s.e. This ensures the final condition (B7). By Theorem 3,

$$\begin{pmatrix}v_{t}\\ u_{t}\\ \theta_{t}\end{pmatrix}\rightarrow\begin{pmatrix}\lambda(\Gamma(-\bar{A}^{-1}\bar{b}),-\bar{A}^{-1}\bar{b})\\ \Gamma(-\bar{A}^{-1}\bar{b})\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}=\begin{pmatrix}0\\ 0\\ -\bar{A}^{-1}\bar{b}\end{pmatrix}.$$

Specifically, $\theta_{t}\rightarrow-\bar{A}^{-1}\bar{b}$. ∎

A4  Experiment Details

Here we briefly describe the MDP settings considered in Section 5.

  1. Example-1 (Boyan Chain): It consists of a linear arrangement of 14 states. From each of the first 13 states, one can move to the next state or the next to next state with equal probability. The last state is an absorbing state. The reward at each transition is -3 except the transition from state-6 to state-7 where it is -2. The discount factor $\gamma$ is set to $0.95$. Figure 5 shows the corresponding MDP for a 7-state Boyan Chain.

    Figure 5: 7-state Boyan Chain from (Boyan 1999)
  2. Example-2 (5-State Random Walk): It consists of a linear arrangement of 5 states with two terminal states. There is a single action at each state. From each state one either moves left or right with equal probability. Moving left from state 1 results in episode termination yielding a reward of 0. Similarly, moving right from state 5 also results in episode termination, however, yielding a reward of +1. The reward associated with all other transitions is 0 and the discount factor is $\gamma=1$. Figure 6 shows the corresponding MDP.

    Figure 6: 5-State Random Walk from (Sutton et al. 2009)
  3. Example-3 (19-State Random Walk): It consists of a linear arrangement of 19 states. From each state one either moves left or right with equal probability. Moving left from state 1 results in episode termination yielding a reward of -1. Similarly, moving right from state 19 also results in episode termination, however, yielding a reward of +1. The reward associated with all other transitions is 0 and the discount factor is $\gamma=1$. Figure 7 shows the corresponding MDP.

    Figure 7: 19-State Random Walk from (Sutton and Barto 2018)
  4. Example-4 (Random MDP): This is a randomly generated discrete MDP with 20 states and 5 actions in each state. The transition probabilities are uniformly generated from $[0,1]$ with a small additive constant. The rewards are also uniformly generated from $[0,1]$. The policy and the start state distribution are also generated in a similar way and the discount factor is $\gamma=0.95$. See (Dann, Neumann, and Peters 2014) for a more detailed description.
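For the two random-walk examples, the true value function can be obtained by solving the Bellman equation $(I-\gamma P)V=r$ directly, which is useful as a reference when measuring prediction error. The sketch below follows the descriptions in Examples 2 and 3 with a tabular representation; the function name `random_walk_values` and its signature are assumptions made for this illustration and are not taken from the experimental code.

```python
import numpy as np

def random_walk_values(n_states, left_reward, right_reward, gamma=1.0):
    """True state values of a symmetric random walk with terminal exits at both ends.

    P holds transitions among non-terminal states only; r holds expected one-step rewards.
    """
    P = np.zeros((n_states, n_states))
    r = np.zeros(n_states)
    for s in range(n_states):
        if s > 0:
            P[s, s - 1] = 0.5
        else:
            r[s] += 0.5 * left_reward    # stepping left from the first state terminates
        if s < n_states - 1:
            P[s, s + 1] = 0.5
        else:
            r[s] += 0.5 * right_reward   # stepping right from the last state terminates
    # Solve (I - gamma P) V = r; the system is nonsingular because the walk terminates a.s.
    return np.linalg.solve(np.eye(n_states) - gamma * P, r)

v5 = random_walk_values(5, left_reward=0.0, right_reward=1.0)
v19 = random_walk_values(19, left_reward=-1.0, right_reward=1.0)
print(v5)   # linear in the state index: i/6 for i = 1..5
print(v19)  # linear in the state index: (i - 10)/10 for i = 1..19
```

With $\gamma=1$ the values are linear in the state index because each interior value is the average of its two neighbours, with the terminal rewards acting as boundary conditions.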