
Asymptotic and Finite Sample Analysis of
Nonexpansive Stochastic Approximations with Markovian Noise

Ethan Blaser    Shangtong Zhang
Abstract

Stochastic approximation is an important class of algorithms, and a large body of previous analysis focuses on stochastic approximations driven by contractive operators, which is not applicable in some important reinforcement learning settings. This work instead investigates stochastic approximations with merely nonexpansive operators. In particular, we study nonexpansive stochastic approximations with Markovian noise, providing both asymptotic and finite sample analysis. Key to our analysis are a few novel bounds of noise terms resulting from the Poisson equation. As an application, we prove, for the first time, that the classical tabular average reward temporal difference learning converges to a sample path dependent fixed point.


1 Introduction

Stochastic approximation (SA) algorithms (Robbins & Monro, 1951; Kushner & Yin, 2003; Borkar, 2009) form the foundation of many iterative optimization and learning methods by updating a vector incrementally and stochastically. Prominent examples include stochastic gradient descent (SGD) (Kiefer & Wolfowitz, 1952) and temporal difference (TD) learning (Sutton, 1988). These algorithms generate a sequence of iterates {xn}\quantity{x_{n}} starting from an initial point x0dx_{0}\in\mathbb{R}^{d} through the recursive update:

xn+1xn+αn+1(H(xn,Yn+1)xn)x_{n+1}\doteq x_{n}+\alpha_{n+1}\quantity(H\quantity(x_{n},Y_{n+1})-x_{n}) (SA)

where {αn}\left\{\alpha_{n}\right\} is a sequence of deterministic learning rates, {Yn}\{Y_{n}\} is a sequence of random noise in a space 𝒴\mathcal{Y}, and a function H:d×𝒴dH:\mathbb{R}^{d}\times\mathcal{Y}\rightarrow\mathbb{R}^{d} maps the current iterate xnx_{n} and noise Yn+1Y_{n+1} to the actual incremental update. We use hh to denote the expected update, i.e., h(x)𝔼[H(x,y)]h(x)\doteq\mathbb{E}\quantity[H(x,y)], where the expectation will be formally defined shortly.
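To make the recursion concrete, the following minimal sketch (our own toy example, not from the paper; the choice H(x, y) = y, the Gaussian noise, and the step size 1/n are illustrative assumptions) runs (SA) with i.i.d. noise, in which case h(x) = E[Y] is a constant map and the iterates track the mean of the noise distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def H(x, y):
    # Toy choice of H: H(x, y) = y, so h(x) = E[Y] and the unique
    # fixed point of h is the mean of the noise distribution.
    return y

x = np.zeros(2)                      # x_0
target = np.array([1.0, -2.0])       # E[Y], the fixed point in this toy example
for n in range(1, 100_001):
    alpha = 1.0 / n                  # diminishing learning rate
    y = target + rng.normal(size=2)  # i.i.d. noise Y_{n+1}
    x = x + alpha * (H(x, y) - x)    # the (SA) recursion

print(x)  # close to [1.0, -2.0]
```

With this particular H the recursion is just a running average; the interesting settings studied below replace the i.i.d. noise with a Markov chain and only assume h is nonexpansive.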

Despite the foundational role of SA in analyzing reinforcement learning (RL) (Sutton & Barto, 2018) algorithms, most of the existing literature assumes that the expected mapping hh is a contraction, ensuring the stability and convergence of the iterates {xn}\quantity{x_{n}} under mild conditions. Table 1 highlights the relative scarcity of results concerning nonexpansive mappings. However, in many problems in RL, particularly those involving average reward formulations (Tsitsiklis & Roy, 1999; Puterman, 2014; Wan et al., 2021b, a; He et al., 2022), hh is only guaranteed to be non-expansive, not contractive.

Table 1: Overview of stochastic approximation methods, with a focus on those that consider non-expansive mappings. “Non-expansive hh” refers to works where the expected mapping is non-expansive, as opposed to strictly a contraction. “Markovian {Yn}\quantity{Y_{n}}” indicates cases where the noise term {Yn}\quantity{Y_{n}} is Markovian. “Asymptotic” refers to works that prove almost sure convergence, which is not necessarily weaker than non-asymptotic convergence results. Note that we present only a representative subset of results for SA with contractive mappings due to an abundance of literature in the area. For a more comprehensive treatment, see (Benveniste et al., 1990; Kushner & Yin, 2003; Borkar, 2009).
Nonexpansive hh Markovian {Yn}\{Y_{n}\} Asymptotic Non-Asymptotic
Krasnosel’skii (1955)
Ishikawa (1976)
Reich (1979)
Benveniste et al. (1990)
Liu (1995)
Szepesvári (1997)
Abounadi et al. (2002)
Tadić (2002)
Kushner & Yin (2003)
Koval & Schwabe (2003)
Tadic (2004)
Kim & Xu (2007)
Borkar (2009)
Cominetti et al. (2014)
Bravo et al. (2019)
Chen et al. (2021)
Borkar et al. (2021)
Karandikar & Vidyasagar (2024)
Bravo & Cominetti (2024)
Qian et al. (2024)
Liu et al. (2025)
Ours

One tool for analyzing (SA) with nonexpansive h that has recently gained renewed attention is Krasnoselskii-Mann iterations. In their simplest deterministic form, these iterations are given by:

xt+1=xt+αt+1(Txtxt),\displaystyle x_{t+1}=x_{t}+\alpha_{t+1}(Tx_{t}-x_{t}), (KM)

where T:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} is some nonexpansive mapping. Under some additional restrictive conditions, Krasnosel’skii (1955) first proves the convergence of (KM) to a fixed point of T, and this result is further generalized by Edelstein (1966); Ishikawa (1976); Reich (1979); Liu (1995). More recently, Cominetti et al. (2014) use a novel fox-and-hare model to connect KM iterations with sums of Bernoulli random variables, providing a sharper convergence rate for \norm{x_{t}-Tx_{t}}\to 0.
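As a concrete illustration (our own toy example, not from the cited works), the sketch below runs (KM) with T a planar rotation, which is an isometry and hence nonexpansive, with the origin as its unique fixed point. The angle and the exponent 0.9 are hypothetical choices; what matters is that the sum of alpha_t (1 - alpha_t) diverges.

```python
import numpy as np

theta = 1.0
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # a rotation: nonexpansive, fixed point 0

x = np.array([1.0, 1.0])
for t in range(1, 20_001):
    alpha = 1.0 / (t + 1) ** 0.9      # sum of alpha_t * (1 - alpha_t) diverges
    x = x + alpha * (T @ x - x)       # the (KM) recursion

print(np.linalg.norm(x))  # close to 0: x_t approaches the fixed point
```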

In practice, algorithms often deviate from (KM) due to noise, leading to the study of inexact KM iterations (IKM) with deterministic noise (Kim & Xu, 2007; Bravo et al., 2019):

xt+1=xt+αt+1(Txtxt+et+1),\displaystyle x_{t+1}=x_{t}+\alpha_{t+1}(Tx_{t}-x_{t}+e_{t+1}), (IKM)

where {et}\quantity{e_{t}} is a sequence of deterministic noise. Bravo et al. (2019) extend Cominetti et al. (2014) and establish the convergence of (IKM), under some mild conditions on {et}\quantity{e_{t}}.

However, deterministic noise is still not desirable in many problems. To this end, a stochastic version of (IKM) is studied, which considers the iterates

xt+1=xt+αt+1(Txtxt+Mt+1),\displaystyle x_{t+1}=x_{t}+\alpha_{t+1}(Tx_{t}-x_{t}+M_{t+1}), (SKM)

where {Mt}\quantity{M_{t}} is a Martingale difference sequence. Under mild conditions, Bravo & Cominetti (2024) proves the almost sure convergence of (SKM) to a fixed point of TT. If we write (SA) as

xn+1=xn+αn+1(h(xn)xn+H(xn,Yn+1)h(xn)),\displaystyle x_{n+1}=x_{n}+\alpha_{n+1}\quantity(h(x_{n})-x_{n}+H\quantity(x_{n},Y_{n+1})-h(x_{n})), (1)

we observe that the convergence result from Bravo & Cominetti (2024) implies the almost sure convergence of (SA) when {Yn}\quantity{Y_{n}} is i.i.d., since this makes {H(xn,Yn+1)h(xn)}\quantity{H(x_{n},Y_{n+1})-h(x_{n})} a Martingale difference sequence.

Bravo & Cominetti (2024) is the first to introduce this SKM-based method in RL, using it to prove the almost sure convergence and a non-asymptotic convergence rate of a synchronous version of RVI Q-learning (Abounadi et al., 2001). However, the assumption that \quantity{Y_{n}} is i.i.d. only holds for some synchronous RL algorithms. In most practical settings where the RL algorithm is asynchronous, the noise \quantity{Y_{n}} is Markovian, meaning \quantity{H(x_{n},Y_{n+1})-h(x_{n})} is not a Martingale difference sequence and the results of Bravo & Cominetti (2024) do not apply.

Contribution

Our primary contribution is to close the aforementioned gap by extending the results of Bravo & Cominetti (2024) to the Markovian noise setting. Namely, this work allows \quantity{Y_{n}} to be a Markov chain and H to be a 1-Lipschitz continuous noisy estimate of a non-expansive operator h, providing both the first proof of almost sure convergence and the first non-asymptotic convergence rate in this setting (Table 1).

  • Theorem 2.6 proves that the sequence \quantity{x_{n}} generated by (SA) with Markovian \quantity{Y_{n}} and nonexpansive h converges almost surely to some random point x_{*}\in\mathcal{X}_{*}, where \mathcal{X}_{*} is the set of fixed points of h. Importantly, x_{*} may depend on the entire sample path.

  • Theorem 3.1 provides the convergence rate of the expected residuals 𝔼[xnh(xn)]\mathbb{E}\quantity[\norm{x_{n}-h(x_{n})}].

  • Theorem 4.2 utilizes our SKM results to provide the first proof of almost sure convergence of tabular average reward temporal difference learning (TD) to a (possibly sample path dependent) fixed point.

By extending Bravo & Cominetti (2024) to Markovian noise, we are the first to use the SKM method to analyze asynchronous RL algorithms.

The key idea of our approach is to use Poisson’s equation to decompose the error {H(xn,Yn+1)h(xn)}\quantity{H(x_{n},Y_{n+1})-h(x_{n})} into boundable error terms (Benveniste et al., 1990). While the use of Poisson’s equation for handling Markovian noise is well-established, our method departs from prior techniques for bounding these error terms in almost sure convergence analyses. Specifically, Benveniste et al. (1990) and Konda & Tsitsiklis (1999) use stopping times, while Borkar et al. (2021) employ a Lyapunov function and use the scaled iterates technique. In contrast, we leverage a 1-Lipschitz continuity assumption on HH to directly control the growth of error terms.

Notations

In this paper, all vectors are column vectors. We use \norm{\cdot} to denote a generic operator norm and e to denote the all-one vector. We use \norm{\cdot}_{2} and \norm{\cdot}_{\infty} to denote the \ell_{2} norm and the infinity norm, respectively. We use \mathcal{O}(\cdot) to hide deterministic constants to simplify presentation, while the letter \zeta is reserved for sample-path dependent constants.

2 Almost Sure Convergence of Stochastic Krasnoselskii-Mann Iterations with Markovian and Additive Noise

To extend the analysis of (SKM) in Bravo et al. (2019); Bravo & Cominetti (2024) to SKM with Markovian and additive noise, we consider the following iterates

xn+1=xn+αn+1(H(xn,Yn+1)xn+ϵn+1(1)).x_{n+1}=x_{n}+\alpha_{n+1}\left(H(x_{n},Y_{n+1})-x_{n}+{\epsilon}^{{\left(1\right)}}_{n+1}\right). (SKM with Markovian and Additive Noise)

Here, \quantity{x_{n}} are stochastic vectors evolving in \mathbb{R}^{d}, \quantity{Y_{n}} is a Markov chain evolving in a finite state space \mathcal{Y}, H:\mathbb{R}^{d}\times\mathcal{Y}\to\mathbb{R}^{d} defines the update, \quantity{{\epsilon}^{(1)}_{n+1}} is a sequence of stochastic noise evolving in \mathbb{R}^{d}, and \quantity{\alpha_{n}} is a sequence of deterministic learning rates. Although the primary contribution of this work is to allow \quantity{Y_{n}} to be Markovian, we also include the additive noise term {\epsilon}^{(1)}_{n} in (SKM with Markovian and Additive Noise), as it will later be instrumental in proving the almost sure convergence of average reward TD in Section 4.
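As a minimal sketch of this setting (the 3-state chain, the offsets, and the rotation-based H below are all hypothetical choices, not from the paper), the snippet drives the iteration with Markovian noise and an H that is an isometry in x, so that h is nonexpansive with a unique fixed point; the additive noise eps^(1) is set to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# A 3-state Markov chain for the noise Y_n.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
c = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])   # per-state offsets

theta = 1.0
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])        # isometry => H is 1-Lipschitz in x

def H(x, y):
    return c[y] + R @ x          # h(x) = E_{y~d_mu}[c[y]] + R x is nonexpansive

# Stationary distribution and the fixed point of h, for reference.
evals, evecs = np.linalg.eig(P.T)
d_mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); d_mu /= d_mu.sum()
x_star = np.linalg.solve(np.eye(2) - R, d_mu @ c)

x, y = np.zeros(2), 0
for n in range(1, 200_001):
    y = rng.choice(3, p=P[y])                # Markovian noise Y_{n+1}
    alpha = 1.0 / (n + 1) ** 0.9             # b = 0.9 in (4/5, 1]
    x = x + alpha * (H(x, y) - x)            # no additive noise: eps^(1) = 0

print(x, x_star)   # the iterates approach x_star
```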

We make the following assumptions.

Assumption 2.1 (Ergodicity).

The Markov chain {Yn}\quantity{Y_{n}} is irreducible and aperiodic.

The Markov chain \quantity{Y_{n}} thus admits a unique invariant distribution, denoted d_{\mu}. We use P to denote the transition matrix of \quantity{Y_{n}}.

Assumption 2.2 (1-Lipschitz).

The function HH is 1-Lipschitz continuous in its first argument w.r.t. some operator norm \norm{\cdot} and uniformly in its second argument, i.e., for any x,x,yx,x^{\prime},y, it holds that

H(x,y)H(x,y)xx.\displaystyle\norm{H(x,y)-H(x^{\prime},y)}\leq\norm{x-x^{\prime}}. (2)

This assumption has two important implications. First, it implies that H(x,y) can grow at most linearly in x. Indeed, letting x^{\prime}=0, we get \norm{H(x,y)}\leq\norm{H(0,y)}+\norm{x}. Defining C_{H}\doteq\max_{y}\norm{H(0,y)}, we obtain

H(x,y)CH+x.\displaystyle\norm{H(x,y)}\leq C_{H}+\norm{x}. (3)

Second, define the function h:ddh:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d} as the expectation of HH over the stationary distribution dμd_{\mu}:

h(x)𝔼ydμ[H(x,y)].\displaystyle h(x)\doteq\mathbb{E}_{y\sim d_{\mu}}[H(x,y)]. (4)

We then have that hh is non-expansive. Namely,

h(x)h(x)\displaystyle\textstyle\norm{h(x)-h(x^{\prime})} ydμ(y)H(x,y)H(x,y)\displaystyle\leq\sum_{y}d_{\mu}(y)\norm{H(x,y)-H(x^{\prime},y)} (5)
xx.\displaystyle\leq\norm{x-x^{\prime}}. (6)

This hh is exactly the non-expansive operator in the SKM literature. We, of course, need to assume that the problem is solvable.

Assumption 2.3 (Fixed Points).

The non-expansive operator h admits at least one fixed point.

We use 𝒳\mathcal{X}_{*}\neq\emptyset to denote the set of fixed points of hh.

Assumption 2.4 (Learning Rate).

The learning rate {αn}\quantity{\alpha_{n}} has the form

αn=1(n+1)b,α0=0,\displaystyle\textstyle\alpha_{n}=\frac{1}{(n+1)^{b}},\alpha_{0}=0, (7)

where b(45,1]b\in(\frac{4}{5},1].

The primary motivation for requiring b(45,1]b\in(\frac{4}{5},1] is that our learning rates αn\alpha_{n} need to decrease quickly enough for certain key terms in the proof to be finite. The specific need for b>45b>\frac{4}{5} can be seen in the proof of (79) in Lemma B.1.

Next, using this definition of the learning rates, we will define two useful shorthands,

αk,n\displaystyle\alpha_{k,n} αkj=k+1n(1αj),αn,nαn,\displaystyle\doteq\alpha_{k}\prod_{j=k+1}^{n}{\left(1-\alpha_{j}\right)},\,\alpha_{n,n}\doteq\alpha_{n}, (8)
τn\displaystyle\tau_{n} k=1nαk(1αk).\displaystyle\doteq\sum_{k=1}^{n}\alpha_{k}{\left(1-\alpha_{k}\right)}. (9)
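For reference, a short sketch of how these shorthands can be computed numerically (our own helper code; the values of n and b are hypothetical):

```python
import numpy as np

def alphas(n, b):
    # alpha_0 = 0 and alpha_k = 1 / (k+1)^b for k >= 1, as in Assumption 2.4.
    a = 1.0 / (np.arange(n + 1) + 1.0) ** b
    a[0] = 0.0
    return a

def alpha_kn(a, k, n):
    # alpha_{k,n} = alpha_k * prod_{j=k+1}^{n} (1 - alpha_j); in particular alpha_{n,n} = alpha_n.
    return a[k] * np.prod(1.0 - a[k + 1:n + 1])

def tau(a, n):
    # tau_n = sum_{k=1}^{n} alpha_k (1 - alpha_k).
    return np.sum(a[1:n + 1] * (1.0 - a[1:n + 1]))

a = alphas(1000, b=0.9)
print(alpha_kn(a, 10, 1000), alpha_kn(a, 1000, 1000), tau(a, 1000))
# alpha_{k,n} is increasing in k (Lemma A.2) and tau_n grows like n^{1-b} (Lemma B.1).
```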

We now impose assumptions on the additive noise.

Assumption 2.5 (Additive Noise).
k=1αkϵk(1)<\displaystyle\textstyle\sum_{k=1}^{\infty}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}<  a.s.,\displaystyle\infty\mbox{\quad a.s.,\quad} (10)
𝔼[ϵn(1)2]=\displaystyle\textstyle\mathbb{E}\left[\norm{{\epsilon}^{{\left(1\right)}}_{n}}^{2}\right]= 𝒪(1n).\displaystyle\textstyle\mathcal{O}\quantity(\frac{1}{n}). (11)

The first part of Assumption 2.5 can be interpreted as a requirement that the total amount of additive noise remains finite, akin to the assumption on e_{t} in (IKM) in Bravo et al. (2019). Additionally, we impose a condition on the second moment of this noise, requiring it to decay at the rate \mathcal{O}\quantity(\frac{1}{n}). While these assumptions on {\epsilon}^{(1)}_{n} may seem restrictive, we note that even if {\epsilon}^{(1)}_{n} were absent, our work would still extend the results of Bravo & Cominetti (2024) to cases involving Markovian noise, since the Markovian noise component is already incorporated in Y_{n}; this alone is a significant result. For most RL applications involving algorithms with only one set of weights, the additional noise {\epsilon}^{(1)}_{k} is simply 0. We are now ready to present the main convergence result.

Theorem 2.6.

Let Assumptions 2.1 - 2.5 hold. Then the iterates {xn}\quantity{x_{n}} generated by 
(SKM with Markovian and Additive Noise) satisfy

limnxn=x a.s.,\displaystyle\lim_{n\to\infty}x_{n}=x_{*}\mbox{\quad a.s.,\quad} (12)

where x_{*}\in\mathcal{X}_{*} is a possibly sample-path dependent fixed point. More precisely, let \omega denote a sample path (w_{0},Y_{0},Y_{1},\dots) and write x_{n}(\omega) to emphasize the dependence of x_{n} on \omega. Then there exists a set \Omega of sample paths with \Pr(\Omega)=1 such that for any \omega\in\Omega, the limit \lim_{n\to\infty}x_{n}(\omega) exists, denoted as x_{*}(\omega), and satisfies x_{*}(\omega)\in\mathcal{X}_{*}.

Proof.

We start with a decomposition of the error H(x,Yn+1)h(x)H(x,Y_{n+1})-h(x) using Poisson’s equation akin to Métivier & Priouret (1987); Benveniste et al. (1990). Namely, thanks to the finiteness of 𝒴\mathcal{Y}, it is well known (see, e.g., Theorem 17.4.2 of Meyn & Tweedie (2012) or Theorem 8.2.6 of Puterman (2014)) that there exists a function ν(x,y):d×𝒴d\nu(x,y):\mathbb{R}^{d}\times\mathcal{Y}\to\mathbb{R}^{d} such that

H(x,y)h(x)=ν(x,y)(Pν)(x,y).\displaystyle H(x,y)-h(x)=\nu(x,y)-(P\nu)(x,y). (13)

Here, we use PνP\nu to denote the function (x,y)yP(y,y)ν(x,y)(x,y)\mapsto\sum_{y^{\prime}}P(y,y^{\prime})\nu(x,y^{\prime}). The error can then be decomposed as

H(x,Yn+1)h(x)=Mn+1+ϵn+1(2)+ϵn+1(3),\displaystyle H(x,Y_{n+1})-h(x)=M_{n+1}+{\epsilon}^{{\left(2\right)}}_{n+1}+{\epsilon}^{{\left(3\right)}}_{n+1}, (14)

where

Mn+1\displaystyle M_{n+1} ν(xn,Yn+2)(Pν)(xn,Yn+1),\displaystyle\doteq\nu(x_{n},Y_{n+2})-(P\nu)(x_{n},Y_{n+1}), (15)
ϵn+1(2)\displaystyle{\epsilon}^{{\left(2\right)}}_{n+1} ν(xn,Yn+1)ν(xn+1,Yn+2),\displaystyle\doteq\nu{\left(x_{n},Y_{n+1}\right)}-\nu{\left(x_{n+1},Y_{n+2}\right)}, (16)
ϵn+1(3)\displaystyle{\epsilon}^{{\left(3\right)}}_{n+1} ν(xn+1,Yn+2)ν(xn,Yn+2).\displaystyle\doteq\nu{\left(x_{n+1},Y_{n+2}\right)}-\nu{\left(x_{n},Y_{n+2}\right)}. (17)

Here {Mn+1}\quantity{M_{n+1}} is a Martingale difference sequence. We then use

ξn+1\displaystyle\xi_{n+1} ϵn+1(1)+ϵn+1(2)+ϵn+1(3),\displaystyle\doteq{\epsilon}^{{\left(1\right)}}_{n+1}+{\epsilon}^{{\left(2\right)}}_{n+1}+{\epsilon}^{{\left(3\right)}}_{n+1}, (18)

to denote all the non-Martingale noise, yielding

xn+1\displaystyle x_{n+1} =(1αn+1)xn+αn+1(h(xn)+Mn+1+ξn+1).\displaystyle={\left(1-\alpha_{n+1}\right)}x_{n}+\alpha_{n+1}{\left(h{\left(x_{n}\right)}+M_{n+1}+\xi_{n+1}\right)}. (19)
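For a finite chain, one standard way to obtain a solution \nu of (13) is through the fundamental matrix Z \doteq (I - P + e d_{\mu}^{\top})^{-1}, applied componentwise to H(x,\cdot)-h(x). The sketch below (reusing the hypothetical 3-state chain from the earlier sketch; the paper itself only relies on the cited existence results, so this is one illustrative construction) checks (13) numerically at a fixed x.

```python
import numpy as np

P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
evals, evecs = np.linalg.eig(P.T)
d_mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); d_mu /= d_mu.sum()

H_x = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])  # rows: H(x, y) for y = 0, 1, 2 at a fixed x
h_x = d_mu @ H_x                                       # h(x) = E_{y~d_mu}[H(x, y)]

# Fundamental-matrix solution of the Poisson equation (13), componentwise:
# (I - P) nu(x, .) = H(x, .) - h(x), which has mean zero under d_mu.
Z = np.linalg.inv(np.eye(3) - P + np.outer(np.ones(3), d_mu))
nu_x = Z @ (H_x - h_x)                                  # rows: nu(x, y)

# Check (13): H(x, y) - h(x) = nu(x, y) - (P nu)(x, y).
print(np.allclose(H_x - h_x, nu_x - P @ nu_x))          # True
```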

We now define an auxiliary sequence {Un}\quantity{U_{n}} to capture how the noise evolves

U0\displaystyle U_{0}\doteq  0,\displaystyle\,0, (20)
Un+1\displaystyle U_{n+1}\doteq (1αn+1)Un+αn+1(Mn+1+ξn+1).\displaystyle\,{\left(1-\alpha_{n+1}\right)}U_{n}+\alpha_{n+1}{\left(M_{n+1}+\xi_{n+1}\right)}. (21)

If we are able to prove that the total noise is well controlled in the following sense

k=1αkUk1\displaystyle\sum_{k=1}^{\infty}\alpha_{k}\norm{U_{k-1}} < a.s.,\displaystyle<\infty\mbox{\quad a.s.,\quad} (22)
limnUn\displaystyle\lim_{n\rightarrow\infty}\norm{U_{n}} =0 a.s.,\displaystyle=0\mbox{\quad a.s.,\quad} (23)

then a result from Bravo & Cominetti (2024) concerning the convergence of (IKM) can be applied on each sample path to complete the almost sure convergence proof. The rest of the proof is dedicated to the verification of those two conditions.

Telescoping (21) yields

Un=\displaystyle U_{n}= k=1nαk,nMkM¯n+k=1nαk,nϵk(1)ϵ¯n(1)+\displaystyle\underbrace{\sum_{k=1}^{n}\alpha_{k,n}M_{k}}_{{\overline{M}}_{n}}+\underbrace{\sum_{k=1}^{n}\alpha_{k,n}{\epsilon}^{{\left(1\right)}}_{k}}_{{\overline{\epsilon}}^{{\left(1\right)}}_{n}}+ (24)
k=1nαk,nϵk(2)ϵ¯n(2)+k=1nαk,nϵk(3)ϵ¯n(3).\displaystyle\quad\underbrace{\sum_{k=1}^{n}\alpha_{k,n}{\epsilon}^{{\left(2\right)}}_{k}}_{{\overline{\epsilon}}^{{\left(2\right)}}_{n}}+\underbrace{\sum_{k=1}^{n}\alpha_{k,n}{\epsilon}^{{\left(3\right)}}_{k}}_{{\overline{\epsilon}}^{{\left(3\right)}}_{n}}. (25)

Then, we can upper-bound (22) as

k=1nαkUk1\displaystyle\sum_{k=1}^{n}\alpha_{k}\norm{U_{k-1}} k=1nαkM¯k1M¯¯n+k=1nαkϵ¯k1(1)ϵ¯¯n(1)\displaystyle\leq\underbrace{\sum_{k=1}^{n}\alpha_{k}\norm{{\overline{M}}_{k-1}}}_{{\overline{\overline{M}}}_{n}}+\underbrace{\sum_{k=1}^{n}\alpha_{k}\norm{{\overline{\epsilon}}^{{\left(1\right)}}_{k-1}}}_{{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}} (26)
+k=1nαkϵ¯k1(2)ϵ¯¯n(2)+k=1nαkϵ¯k1(3)ϵ¯¯n(3).\displaystyle+\underbrace{\sum_{k=1}^{n}\alpha_{k}\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{k-1}}}_{{\overline{\overline{\epsilon}}}^{{\left(2\right)}}_{n}}+\underbrace{\sum_{k=1}^{n}\alpha_{k}\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{k-1}}}_{{\overline{\overline{\epsilon}}}^{{\left(3\right)}}_{n}}. (27)

Lemmas B.8, B.9, and B.10 respectively prove that {\overline{\overline{M}}}_{n}, {\overline{\overline{\epsilon}}}^{(1)}_{n}, and {\overline{\overline{\epsilon}}}^{(3)}_{n} in (27) are bounded almost surely. Here, we bound the remaining term {\overline{\overline{\epsilon}}}^{(2)}_{n} needed to verify (22), as an example of the novelty in bounding these terms. Starting with the definition of {\overline{\epsilon}}^{(2)}_{n} from (25), we have,

ϵ¯n(2)\displaystyle{\overline{\epsilon}}^{{\left(2\right)}}_{n} =k=1nαk,nϵk(2)\displaystyle=\sum_{k=1}^{n}\alpha_{k,n}{\epsilon}^{{\left(2\right)}}_{k} (28)
=k=1nαk,n(ν(xk,Yk+1)ν(xk1,Yk)),\displaystyle=-\sum_{k=1}^{n}\alpha_{k,n}\quantity(\nu\quantity(x_{k},Y_{k+1})-\nu\quantity(x_{k-1},Y_{k})), (29)
=k=1nαk,nν(xk,Yk+1)αk1,nν(xk1,Yk)\displaystyle=-\sum_{k=1}^{n}\alpha_{k,n}\nu\quantity(x_{k},Y_{k+1})-\alpha_{k-1,n}\nu{\left(x_{k-1},Y_{k}\right)} (30)
+αk1,nν(xk1,Yk)αk,nν(xk1,Yk),\displaystyle\quad\quad+\alpha_{k-1,n}\nu{\left(x_{k-1},Y_{k}\right)}-\alpha_{k,n}\nu{\left(x_{k-1},Y_{k}\right)}, (31)
=αn,nν(xn,Yn+1)\displaystyle=-\alpha_{n,n}\nu{\left(x_{n},Y_{n+1}\right)} (32)
k=1n(αk1,nαk,n)ν(xk1,Yk).\displaystyle\quad\quad-\sum_{k=1}^{n}\quantity(\alpha_{k-1,n}-\alpha_{k,n})\nu\quantity(x_{k-1},Y_{k}). (33)

where the last equality holds because \alpha_{0}\doteq 0. Additionally, since \alpha_{n,n}=\alpha_{n}, taking the norm gives

ϵ¯n(2)\displaystyle\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{n}} (34)
αnν(xn,Yn+1)+k=1n|αk1,nαk,n|ν(xk1,Yk),\displaystyle\leq\alpha_{n}\norm{\nu{\left(x_{n},Y_{n+1}\right)}}+\sum_{k=1}^{n}\absolutevalue{\alpha_{k-1,n}-\alpha_{k,n}}\norm{\nu{\left(x_{k-1},Y_{k}\right)}}, (35)
ζB.5(αnτn+k=1n|αk1,nαk,n|τk1),\displaystyle\leq\zeta_{\ref{lem:v_norm}}{\left(\alpha_{n}\tau_{n}+\sum_{k=1}^{n}\left|\alpha_{k-1,n}-\alpha_{k,n}\right|\tau_{k-1}\right)}, (36)
2ζB.5αnτn,\displaystyle\leq 2\zeta_{\ref{lem:v_norm}}\alpha_{n}\tau_{n}, (37)

where the second inequality holds by Lemma B.5, and the last inequality holds because \alpha_{0}\doteq 0 and because \alpha_{i,n} and \tau_{i} are monotonically increasing in i (Lemma A.2).
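The summation by parts in (28)-(33) is purely algebraic and only uses \alpha_{0}=0, so it can be checked numerically with arbitrary placeholder vectors in place of \nu(x_{k},Y_{k+1}); a small sanity-check sketch (our own, with hypothetical n and b):

```python
import numpy as np

rng = np.random.default_rng(2)
n, b = 50, 0.9
a = 1.0 / (np.arange(n + 1) + 1.0) ** b; a[0] = 0.0          # alpha_k with alpha_0 = 0

def alpha_kn(k):
    return a[k] * np.prod(1.0 - a[k + 1:n + 1])               # alpha_{k,n}

v = rng.normal(size=(n + 1, 2))                               # stand-ins for nu(x_k, Y_{k+1}), k = 0..n

# Left-hand side, cf. (28)-(29): sum_k alpha_{k,n} (v[k-1] - v[k]).
lhs = sum(alpha_kn(k) * (v[k - 1] - v[k]) for k in range(1, n + 1))

# Right-hand side after summation by parts, cf. (32)-(33).
rhs = -a[n] * v[n] - sum((alpha_kn(k - 1) - alpha_kn(k)) * v[k - 1] for k in range(1, n + 1))

print(np.allclose(lhs, rhs))  # True: the rearrangement only relies on alpha_0 = 0
```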

Then, from the definition of {\overline{\overline{\epsilon}}}^{(2)}_{n} in (27), we have

ϵ¯¯n(2)=k=1nαkϵ¯k1(2)2ζB.5k=1nαk2τk.\displaystyle{\overline{\overline{\epsilon}}}^{{\left(2\right)}}_{n}=\sum_{k=1}^{n}\alpha_{k}\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{k-1}}\leq 2\zeta_{\ref{lem:v_norm}}\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}. (38)

where the inequality holds because α00\alpha_{0}\doteq 0 and αk\alpha_{k} is decreasing. Then, by Lemma B.1, we have supnk=1nαk2τk<\sup_{n}\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}<\infty, which when combined with the monotone convergence theorem, proves that limnϵ¯¯n(2)<\lim_{n\rightarrow\infty}{\overline{\overline{\epsilon}}}^{{\left(2\right)}}_{n}<\infty, verifying (22).

We now verify (23). This time, rewrite UnU_{n} as

Un\displaystyle U_{n} =k=1nαkUk1+αk(Mk+ϵk(1)+ϵk(2)+ϵk(3)).\displaystyle=-\sum_{k=1}^{n}\alpha_{k}U_{k-1}+\alpha_{k}{\left(M_{k}+{\epsilon}^{{\left(1\right)}}_{k}+{\epsilon}^{{\left(2\right)}}_{k}+{\epsilon}^{{\left(3\right)}}_{k}\right)}. (39)

Lemma B.11, Assumption 2.5, and Lemmas B.12 and B.13 prove that \sup_{n}\norm{\sum_{k=1}^{n}\alpha_{k}M_{k}}<\infty and \sup_{n}\norm{\sum_{k=1}^{n}\alpha_{k}{\epsilon}^{(j)}_{k}}<\infty for j\in\quantity{1,2,3}, respectively.

Together with (39) and (22), this means that \sup_{n}\norm{U_{n}}<\infty. In other words, we have established the stability of (21). Then, it can be shown (Lemma B.14), using an extension of Theorem 2.1 of Borkar (2009) (Lemma D.7), that \quantity{U_{n}} converges to the globally asymptotically stable equilibrium of the ODE \derivative{U(t)}{t}=-U(t), which is 0. This verifies (23). Lemma B.15 then invokes a result from Bravo & Cominetti (2024) and completes the proof. ∎

Remark 2.7.

We want to highlight that the technical novelty of our work comes from two sources. First, while the use of Poisson’s equation for handling Markovian noise is well-established, including the noise representation in (14), previous works with such error decompositions (e.g., Benveniste et al. (1990); Konda & Tsitsiklis (1999); Borkar et al. (2021)) usually only need to bound terms like \sum_{k}\alpha_{k}{\epsilon}^{(1)}_{k}. In contrast, our setup requires bounding additional terms such as {\overline{\epsilon}}^{(1)}_{n}=\sum_{k}\alpha_{k,n}{\epsilon}^{(1)}_{k} and {\overline{\overline{\epsilon}}}^{(1)}_{n}=\sum_{k}\alpha_{k}\norm{{\overline{\epsilon}}^{(1)}_{k-1}}, which appear novel and more challenging. Second, our work extends Theorem 2.1 of Borkar (2009) by relaxing an assumption on the convergence of the deterministic noise term. Instead of requiring the noise to converge to 0, we only require a milder condition on the asymptotic rate of change of this noise term. We believe this extension, detailed in Appendix D, has independent utility beyond this work.

3 Convergence Rate

The previous analysis not only guarantees the almost sure convergence of the iterates, but can also be used to obtain estimates of the expected fixed-point residuals.

Theorem 3.1.

Consider the iteration (SKM with Markovian and Additive Noise) and let Assumptions 2.1 - 2.5 hold. Then there exists a constant C_{3.1} such that

𝔼[xnh(xn)]C3.1τn={𝒪(1/n1b)if45<b<1,𝒪(1/logn)ifb=1.\displaystyle\mathbb{E}\left[\norm{x_{n}-h{\left(x_{n}\right)}}\right]\leq\frac{C_{\ref{thm:conv_rate}}}{\sqrt{\tau_{n}}}=\begin{cases}\mathcal{O}{\left(1/\sqrt{n^{1-b}}\right)}&\text{if}\ \frac{4}{5}<b<1,\\ \mathcal{O}{\left(1/\sqrt{\log n}\right)}&\text{if}\ b=1.\end{cases} (40)
Proof.

Considering the sequence z_{n}\doteq x_{n}-U_{n}, we have

xnh(xn)\displaystyle\norm{x_{n}-h\quantity(x_{n})} znh(zn)+2znxn,\displaystyle\leq\norm{z_{n}-h\quantity(z_{n})}+2\norm{z_{n}-x_{n}}, (41)
=znh(zn)+2Un.\displaystyle=\norm{z_{n}-h\quantity(z_{n})}+2\norm{U_{n}}. (42)

where the inequality holds due to the non-expansivity of h as proven in (6). Then, our proof of Theorem 2.6 establishes the conditions needed to bound the residuals of z_{n}. Specifically, we proved in Lemma B.15 that if \sum_{k=1}^{\infty}\alpha_{k}\norm{U_{k-1}}<\infty (22) and \norm{U_{n}}\rightarrow 0 (23) almost surely, then with e_{k}=U_{k-1}, Lemma A.1 can be invoked to bound \norm{z_{n}-h(z_{n})}. This yields,

xnh(xn)\displaystyle\norm{x_{n}-h{\left(x_{n}\right)}} (43)
ζA.1σ(τn)+k=2n2αkσ(τnτk)Uk1+4Un.\displaystyle\leq\zeta_{\ref{lem:bravo_2.1}}\sigma\quantity(\tau_{n})+\sum_{k=2}^{n}2\alpha_{k}\sigma{\left(\tau_{n}-\tau_{k}\right)}\norm{U_{k-1}}+4\norm{U_{n}}. (44)

for ζA.1=2dist(x0,𝒳)+k=2αkUk1\zeta_{\ref{lem:bravo_2.1}}=2dist(x_{0},\mathcal{X}_{*})+\sum_{k=2}^{\infty}\alpha_{k}\norm{U_{k-1}}. However, ζA.1\zeta_{\ref{lem:bravo_2.1}} is a sample-path dependent constant whose order is unknown, and the random sequence Un\norm{U_{n}} may occasionally become very large. Therefore, we compute the non-asymptotic error bound of the expected residuals 𝔼[xnh(xn)]\mathbb{E}\left[\norm{x_{n}-h(x_{n})}\right], which gives,

𝔼[xnh(xn)]𝔼[ζA.1]σ(τn)R1\displaystyle\mathbb{E}\quantity[\norm{x_{n}-h{\left(x_{n}\right)}}]\leq\underbrace{\mathbb{E}\quantity[\zeta_{\ref{lem:bravo_2.1}}]\sigma\quantity(\tau_{n})}_{R_{1}} (45)
+k=2n2αkσ(τnτk)𝔼[Uk1]R2+4𝔼[Un]R3.\displaystyle\quad+\underbrace{\sum_{k=2}^{n}2\alpha_{k}\sigma\quantity(\tau_{n}-\tau_{k})\mathbb{E}\quantity[\norm{U_{k-1}}]}_{R_{2}}+\underbrace{4\mathbb{E}\quantity[\norm{U_{n}}]}_{R_{3}}. (46)

Recalling that \sigma(y)\doteq\min\quantity{1,1/\sqrt{\pi y}}, we see that if there exists a deterministic constant C_{3.1} such that \mathbb{E}\quantity[\zeta_{A.1}]\leq C_{3.1}, then R_{1}=\mathcal{O}\quantity(1/\sqrt{\tau_{n}}). Therefore, to prove the theorem, it suffices to find such a constant C_{3.1} and to prove that R_{2} and R_{3} are also \mathcal{O}\quantity(1/\sqrt{\tau_{n}}).

We proceed by first upper-bounding 𝔼[Un]\mathbb{E}\quantity[\norm{U_{n}}]. Taking the expectation of (25), we have,

𝔼[Un]\displaystyle\mathbb{E}\quantity[\norm{U_{n}}] (47)
\displaystyle\leq 𝔼[M¯n]+𝔼[ϵ¯n(1)]+𝔼[ϵ¯n(2)]+𝔼[ϵ¯n(3)]\displaystyle\mathbb{E}\quantity[\norm{{\overline{M}}_{n}}]+\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(1\right)}}_{n}}]+\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{n}}]+\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{n}}] (48)
\displaystyle\leq CC.1τnαn+1+i=1nαi,n𝔼[ϵi(1)]+CC.2αnτn\displaystyle C_{\ref{lem: M rate}}\tau_{n}\sqrt{\alpha_{n+1}}+\sum_{i=1}^{n}\alpha_{i,n}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{i}}]+C_{\ref{lem: e2 rate}}\alpha_{n}\tau_{n} (49)
+CC.3αni=1nαiτi(Corollaries C.1C.2C.3)\displaystyle\quad+C_{\ref{lem: e3 rate}}\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i}\quad\text{(Corollaries \ref{lem: M rate}, \ref{lem: e2 rate}, \ref{lem: e3 rate})} (50)
\displaystyle\doteq ωn\displaystyle\omega_{n} (51)

It can be shown (Lemma C.4) that \omega_{n}=\mathcal{O}(\tau_{n}\sqrt{\alpha_{n+1}}). To see that \mathbb{E}\quantity[\zeta_{A.1}] is bounded, note that

k=2αk𝔼[Uk1]k=2αkωk1=𝒪(k=2αk3/2τk1),\displaystyle\sum_{k=2}^{\infty}\alpha_{k}\mathbb{E}\quantity[\norm{U_{k-1}}]\leq\sum_{k=2}^{\infty}\alpha_{k}\omega_{k-1}=\mathcal{O}\quantity(\sum_{k=2}^{\infty}\alpha_{k}^{3/2}\tau_{k-1}), (52)

which is finite by Lemma B.1. Hence there exists a constant C_{3.1} such that \mathbb{E}\quantity[\zeta_{A.1}]=2\,\mathrm{dist}(x_{0},\mathcal{X}_{*})+\sum_{k=2}^{\infty}\alpha_{k}\mathbb{E}\quantity[\norm{U_{k-1}}]\leq C_{3.1}.

Additionally, our \omega_{n} is of the same order as the analogous \nu_{n} in Theorem 2.10 of Bravo & Cominetti (2024). Therefore, we can invoke Lemma C.5, a combination of Theorems 2.11 and 3.1 from Bravo & Cominetti (2024), to conclude that R_{2}=\mathcal{O}\quantity(1/\sqrt{\tau_{n}}). Finally, by (51), we directly have that R_{3}=\mathcal{O}(\tau_{n}\sqrt{\alpha_{n+1}}), which is dominated by R_{1} and R_{2}. ∎

4 Application in Average Reward Temporal Difference Learning

In this section, we provide the first proof of almost sure convergence to a fixed point for average reward TD in its simplest tabular form. Remarkably, this convergence result has remained unproven for over 25 years despite the algorithm’s fundamental importance and simplicity.

4.1 Reinforcement Learning Background

In reinforcement learning (RL), we consider a Markov Decision Process (MDP; Bellman (1957); Puterman (2014)) with a finite state space \mathcal{S}, a finite action space \mathcal{A}, a reward function r:\mathcal{S}\times\mathcal{A}\to\mathbb{R}, a transition function p:\mathcal{S}\times\mathcal{S}\times\mathcal{A}\to[0,1], and an initial distribution p_{0}:\mathcal{S}\to[0,1]. At time step 0, an initial state S_{0} is sampled from p_{0}. At time t, given the state S_{t}, the agent samples an action A_{t}\sim\pi(\cdot|S_{t}), where \pi:\mathcal{A}\times\mathcal{S}\to[0,1] is the policy being followed by the agent. A reward R_{t+1}\doteq r(S_{t},A_{t}) is then emitted and the agent proceeds to a successor state S_{t+1}\sim p(\cdot|S_{t},A_{t}). In the rest of the paper, we assume the Markov chain \quantity{S_{t}} induced by the policy \pi is irreducible and thus admits a unique stationary distribution d_{\mu}. The average reward (a.k.a. gain, Puterman (2014)) is defined as

J¯πlimT1Tt=1T𝔼[Rt].\textstyle\bar{J}_{\pi}\doteq\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[R_{t}\right]. (53)

Correspondingly, the differential value function (a.k.a. bias, Puterman (2014)) is defined as

vπ(s)limT1Tτ=1T𝔼[i=1τ(Rt+iJ¯π)St=s].\displaystyle\textstyle v_{\pi}(s)\doteq\lim_{T\to\infty}\frac{1}{T}\sum_{\tau=1}^{T}\mathbb{E}\left[\sum_{i=1}^{\tau}(R_{t+i}-\bar{J}_{\pi})\mid S_{t}=s\right]. (54)

The corresponding Bellman equation (a.k.a. Poisson’s equation) is then

v=rπJ¯πe+Pπv,\displaystyle v=r_{\pi}-\bar{J}_{\pi}e+P_{\pi}v, (55)

where v\in\mathbb{R}^{|\mathcal{S}|} is the free variable, r_{\pi}\in\mathbb{R}^{|\mathcal{S}|} is the reward vector induced by the policy \pi, i.e., r_{\pi}(s)\doteq\sum_{a}\pi(a|s)r(s,a), and P_{\pi}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|} is the transition matrix induced by the policy \pi, i.e., P_{\pi}(s,s^{\prime})\doteq\sum_{a}\pi(a|s)p(s^{\prime}|s,a). It is known (Puterman, 2014) that all solutions to (55) form a set

𝒱{vπ+cec}.\displaystyle\mathcal{V}_{*}\doteq\quantity{v_{\pi}+ce\mid c\in\mathbb{R}}. (56)

The policy evaluation problem in average reward MDPs is to estimate vπv_{\pi}, perhaps up to a constant offset cece.
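For a small Markov reward process with hypothetical numbers (not from the paper), the sketch below computes the average reward (53) and one solution of the Bellman equation (55) via the fundamental matrix, and checks that adding a constant vector ce stays in V_* as in (56).

```python
import numpy as np

# A small Markov reward process (P_pi, r_pi) induced by a fixed policy (hypothetical numbers).
P_pi = np.array([[0.2, 0.8, 0.0],
                 [0.0, 0.3, 0.7],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, 2.0])

# Stationary distribution d_mu of P_pi.
evals, evecs = np.linalg.eig(P_pi.T)
d_mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); d_mu /= d_mu.sum()

J_bar = d_mu @ r_pi                                             # average reward, cf. (53)
Z = np.linalg.inv(np.eye(3) - P_pi + np.outer(np.ones(3), d_mu))
v = Z @ (r_pi - J_bar)                                          # one solution of (55), with d_mu @ v = 0

print(np.allclose(v, r_pi - J_bar + P_pi @ v))                  # True: v solves the Bellman equation
print(np.allclose(v + 3.0, r_pi - J_bar + P_pi @ (v + 3.0)))    # True: so does v + c e, cf. (56)
```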

4.2 Average Reward Temporal Difference Learning

Temporal Difference learning (TD; Sutton (1988)) is a foundational algorithm in RL (Sutton & Barto, 2018). Inspired by its success in the discounted setting, Tsitsiklis & Roy (1999) proposed using the update rule (Average Reward TD) to estimate vπv_{\pi} (up to a constant offset) for average reward MDPs. The updates are given by:

Jt+1\displaystyle J_{t+1} =Jt+βt+1(Rt+1Jt),\displaystyle=J_{t}+\beta_{t+1}(R_{t+1}-J_{t}), (Average Reward TD)
vt+1(St)\displaystyle v_{t+1}(S_{t}) =vt(St)+αt+1(Rt+1Jt+vt(St+1)vt(St)),\displaystyle=v_{t}(S_{t})+\alpha_{t+1}\big{(}R_{t+1}-J_{t}+v_{t}(S_{t+1})-v_{t}(S_{t})\big{)}, (57)

where {S0,R1,S1,}\{S_{0},R_{1},S_{1},\dots\} is a trajectory of states and rewards from an MDP under a fixed policy in a finite state space 𝒮\mathcal{S}, JtJ_{t}\in\mathbb{R} is the scalar estimate of the average reward J¯π\bar{J}_{\pi}, vt|𝒮|v_{t}\in\mathbb{R}^{|\mathcal{S}|} is the tabular value estimate, and {αt,βt}\{\alpha_{t},\beta_{t}\} are learning rates.
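A minimal simulation sketch of these updates on the hypothetical chain used in the previous sketch (the transition matrix, rewards, run length, and b = 0.9 are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
P_pi = np.array([[0.2, 0.8, 0.0],
                 [0.0, 0.3, 0.7],
                 [0.5, 0.0, 0.5]])
r_pi = np.array([1.0, 0.0, 2.0])

b = 0.9
J, v = 0.0, np.zeros(3)
s = 0
for t in range(1, 500_001):
    s_next = rng.choice(3, p=P_pi[s])
    R = r_pi[s]                                   # R_{t+1} = r(S_t, A_t), action marginalized out
    beta, alpha = 1.0 / t, 1.0 / (t + 1) ** b
    td = R - J + v[s_next] - v[s]                 # uses J_t, matching (57)
    J = J + beta * (R - J)                        # average reward estimate
    v[s] = v[s] + alpha * td                      # tabular TD update, only the S_t-th entry
    s = s_next

print(J, v)   # J approaches J_bar; v approaches some element of V_* (v_pi up to a constant shift)
```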

To utilize Theorem 2.6 to prove the almost sure convergence of  (Average Reward TD), we first rewrite it in a compact form to match that of (SKM with Markovian and Additive Noise). Define the augmented Markov chain Yt+1(St,At,St+1)Y_{t+1}\doteq(S_{t},A_{t},S_{t+1}). It is easy to see that {Yt}\quantity{Y_{t}} evolves in the finite space 𝒴{(s,a,s)π(a|s)>0,p(s|s,a)>0}\mathcal{Y}\doteq\quantity{(s,a,s^{\prime})\mid\pi(a|s)>0,p(s^{\prime}|s,a)>0}. We then define a function H:|𝒮|×𝒴|𝒮|H:\mathbb{R}^{|\mathcal{S}|}\times\mathcal{Y}\to\mathbb{R}^{|\mathcal{S}|} by defining the ss-th element of H(v,(s0,a0,s1))H(v,(s_{0},a_{0},s_{1})) as

H(v,(s0,a0,s1))[s]\displaystyle H(v,(s_{0},a_{0},s_{1}))[s]\doteq (58)
𝕀{s=s0}(r(s0,a0)J¯π+v(s1)v(s0))+v(s).\displaystyle\quad\mathbb{I}\quantity{s=s_{0}}(r(s_{0},a_{0})-\bar{J}_{\pi}+v(s_{1})-v(s_{0}))+v(s). (59)

Then, the update to {vt}\quantity{v_{t}} in (Average Reward TD) can then be expressed as

vt+1=vt+αt+1(H(vt,Yt+1)vt+ϵt+1).\displaystyle v_{t+1}=v_{t}+\alpha_{t+1}\quantity(H(v_{t},Y_{t+1})-v_{t}+\epsilon_{t+1}). (60)

Here, \epsilon_{t+1}\in\mathbb{R}^{|\mathcal{S}|} is the random noise vector defined as \epsilon_{t+1}(s)\doteq\mathbb{I}\quantity{s=S_{t}}(\bar{J}_{\pi}-J_{t}), i.e., the current estimation error of the average reward estimator J_{t}, so that (60) coincides with (Average Reward TD) elementwise. Intuitively, the indicator \mathbb{I}\quantity{s=S_{t}} reflects the asynchronous nature of (Average Reward TD): for each t, only the S_{t}-indexed element of v_{t} is updated.
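A small numerical check (with placeholder values for the true average reward, the estimates, and the transition; none are from the paper) that the compact form (60), with this choice of \epsilon_{t+1}, reproduces the elementwise update (57):

```python
import numpy as np

J_bar = 1.1          # placeholder for the true average reward (hypothetical value)

def H(v, s0, r, s1):
    # H(v, (s0, a0, s1)) from (58)-(59); the action only enters through r = r(s0, a0).
    out = v.copy()
    out[s0] += r - J_bar + v[s1] - v[s0]
    return out

# One transition (S_t = 0, S_{t+1} = 2) with reward R_{t+1} = 1.5 and current estimates.
v_t, J_t, alpha = np.array([0.3, -0.2, 1.0]), 0.9, 0.05

# Elementwise update (57): only the S_t-th entry changes.
v_elem = v_t.copy()
v_elem[0] += alpha * (1.5 - J_t + v_t[2] - v_t[0])

# Compact update (60), with eps_{t+1}(s) = I{s = S_t} (J_bar - J_t).
eps = np.zeros(3); eps[0] = J_bar - J_t
v_compact = v_t + alpha * (H(v_t, 0, 1.5, 2) - v_t + eps)

print(np.allclose(v_elem, v_compact))   # True
```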

We are now ready to prove the convergence of (Average Reward TD). Throughout the rest of the section, we utilize the following assumption.

Assumption 4.1 (Ergodicity).

Both 𝒮\mathcal{S} and 𝒜\mathcal{A} are finite. The Markov chain {St}\quantity{S_{t}} induced by the policy π\pi is aperiodic and irreducible.

Theorem 4.2.

Let Assumption 4.1 hold. Consider the learning rates in the form of αt=1(t+1)b,βt=1t\alpha_{t}=\frac{1}{(t+1)^{b}},\beta_{t}=\frac{1}{t} with b(45,1]b\in(\frac{4}{5},1]. Then the iterates {vt}\quantity{v_{t}} generated by (Average Reward TD) satisfy

limtvt=v a.s.,\displaystyle\lim_{t\to\infty}{v_{t}}=v_{*}\mbox{\quad a.s.,\quad} (61)

where v𝒱v_{*}\in\mathcal{V}_{*} is a possibly sample-path dependent fixed point.

Proof.

We proceed via verifying assumptions of Theorem 2.6. In particular, we consider the compact form (60).

Under Assumption 4.1, it is clear that \quantity{Y_{t}} is irreducible and aperiodic and admits a unique stationary distribution.

To verify Assumption 2.2, we demonstrate that H is 1-Lipschitz in v w.r.t. \norm{\cdot}_{\infty}. For notational simplicity, let y=(s_{0},a_{0},s_{1}). We have,

H(v,y)[s]H(v,y)[s]=\displaystyle H(v,y)[s]-H(v^{\prime},y)[s]= (62)
𝕀{s=s0}(v(s1)v(s1)v(s0)+v(s0))+v(s)v(s).\displaystyle\,\mathbb{I}\quantity{s=s_{0}}(v(s_{1})-v^{\prime}(s_{1})-v(s_{0})+v^{\prime}(s_{0}))+v(s)-v^{\prime}(s). (63)

Separating cases based on ss, if ss0s\neq s_{0}, we have

|H(v,y)[s]H(v,y)[s]|=|v(s)v(s)|vv.\absolutevalue{H(v,y)[s]-H(v^{\prime},y)[s]}=\absolutevalue{v(s)-v^{\prime}(s)}\leq\norm{v-v^{\prime}}_{\infty}. (64)

For the case when s=s0s=s_{0}, we have

|H(v,y)[s]H(v,y)[s]|=|v(s1)v(s1)|vv.\absolutevalue{H(v,y)[s]-H(v^{\prime},y)[s]}=\absolutevalue{v(s_{1})-v^{\prime}(s_{1})}\leq\norm{v-v^{\prime}}_{\infty}. (65)

Therefore

\norm{H(v,y)-H(v^{\prime},y)}_{\infty} =\max_{s\in\mathcal{S}}\absolutevalue{H(v,y)[s]-H(v^{\prime},y)[s]} (66)
vv.\displaystyle\leq\norm{v-v^{\prime}}_{\infty}. (67)

It is well known that the set of solutions to Poisson’s equation 𝒱\mathcal{V}_{*} defined in (56) is non-empty (Puterman, 2014), verifying Assumption 2.3. Assumption 2.4 is directly met by the definition of αt\alpha_{t}.

To verify Assumption 2.5, we first notice that for (Average Reward TD), we have \norm{{\epsilon}^{(1)}_{t}}_{\infty}=\absolutevalue{\bar{J}_{\pi}-J_{t}}. It is well known from the ergodic theorem that J_{t} converges to \bar{J}_{\pi} almost surely. Verifying Assumption 2.5, however, requires both an almost sure convergence rate and an L^{2} convergence rate. To this end, we rewrite the update of \quantity{J_{t}} as

Jt+1=Jt+βt+1(Rt+1+γJtϕ(St+1)Jtϕ(St))ϕ(St),\displaystyle J_{t+1}=J_{t}+\beta_{t+1}\left(R_{t+1}+\gamma J_{t}\phi(S_{t+1})-J_{t}\phi(S_{t})\right)\phi(S_{t}), (68)

where we define γ0\gamma\doteq 0 and ϕ(s)1s\phi(s)\doteq 1\,\forall s. It is now clear that the update of {Jt}\quantity{J_{t}} is a special case of linear TD in the discounted setting (Sutton, 1988). Given our choice of βt=1t\beta_{t}=\frac{1}{t}, the general result about the almost sure convergence rate of linear TD (Theorem 1 of Tadić (2002)) ensures that

|JtJ¯π|ζ4.2lnlntt a.s.,\displaystyle\absolutevalue{J_{t}-\bar{J}_{\pi}}\leq\frac{\zeta_{\ref{thm:avg_rew_td}}\sqrt{\ln\ln t}}{\sqrt{t}}\mbox{\quad a.s.,\quad} (69)

where ζ4.2\zeta_{\ref{thm:avg_rew_td}} is a sample-path dependent constant. This immediately verifies (10). We do note that this almost sure convergence rate can also be obtained via a law of the iterated logarithm for Markov chains (Theorem 17.0.1 of Meyn & Tweedie (2012)). The general result about the L2L^{2} convergence rate of linear TD (Theorem 11 of Srikant & Ying (2019)) ensures that

𝔼[|JtJ¯π|2]=𝒪(1t).\displaystyle\mathbb{E}\quantity[\absolutevalue{J_{t}-\bar{J}_{\pi}}^{2}]=\mathcal{O}\quantity(\frac{1}{t}). (70)

This immediately verifies (11) and completes the proof. ∎
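As a quick sanity check of the step identifying \quantity{J_{t}} with a sample average: with \gamma=0, \phi\equiv 1, and \beta_{t}=1/t, the update (68) is exactly the running mean of the observed rewards. The sketch below verifies this on an arbitrary reward sequence (i.i.d. here purely for illustration; in the algorithm the rewards are Markovian, which is where the cited law-of-the-iterated-logarithm and L^{2} results come in).

```python
import numpy as np

rng = np.random.default_rng(4)
rewards = rng.normal(size=10_000)          # stand-in reward sequence R_1, R_2, ...

J = 0.0
for t, R in enumerate(rewards, start=1):
    J += (1.0 / t) * (R - J)               # the update (68) with gamma = 0, phi = 1, beta_t = 1/t

print(np.isclose(J, rewards.mean()))       # True: J_t is exactly the running sample mean
```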

5 Related Work

ODE and Lyapunov Methods for Asymptotic Convergence

A large body of research has employed ODE-based methods to establish almost sure convergence of SA algorithms (Benveniste et al., 1990; Kushner & Yin, 2003; Borkar, 2009). These methods typically begin by proving stability of the iterates \quantity{x_{n}} (i.e., \sup_{n}\norm{x_{n}}<\infty). Abounadi et al. (2002) uses this ODE method to study the convergence of (SKM), but they require the noise sequence \quantity{M_{n}} to be uniformly bounded and the set of fixed points of the nonexpansive map T to be a singleton in order to prove the stability of the iterates.

The ODE@\infty technique (Borkar & Meyn, 2000; Borkar et al., 2021; Meyn, 2024; Liu et al., 2025) is a powerful tool for establishing stability in RL. If the so-called “ODE@\infty” is globally asymptotically stable, existing results such as Meyn (2022); Borkar et al. (2021); Liu et al. (2025) can be used to establish the desired stability of \quantity{x_{t}}. However, if we consider a generic non-expansive operator h, which may admit multiple fixed points or induce oscillatory behavior, we cannot guarantee the global asymptotic stability of the ODE@\infty without additional assumptions. This limits the ODE method’s utility in analyzing (SKM with Markovian and Additive Noise).

In addition to the ODE method, other works use Lyapunov methods (e.g., Bertsekas & Tsitsiklis, 1996; Konda & Tsitsiklis, 1999; Srikant & Ying, 2019; Borkar et al., 2021; Chen et al., 2021; Zhang et al., 2022, 2023) to provide asymptotic and non-asymptotic results for various RL algorithms. Both the ODE- and Lyapunov-based methods are distinct from the fox-and-hare approach for (IKM) introduced by Cominetti et al. (2014) that our work builds upon.

Average Reward TD

The (Average Reward TD) algorithm introduced by Tsitsiklis & Roy (1999) is the most fundamental policy evaluation algorithm in average reward settings.

In addition to the tabular setting we study here, (Average Reward TD) has also been extended to linear function approximation (Tsitsiklis & Roy, 1999; Konda & Tsitsiklis, 1999; Wu et al., 2020; Zhang et al., 2021). Instead of using a look-up table v\in\mathbb{R}^{|\mathcal{S}|} to store the value estimate, linear function approximation approximates v(s) with \phi(s)^{\top}w. Let \Phi\in\mathbb{R}^{|\mathcal{S}|\times K} be the feature matrix, whose s-th row is \phi(s)^{\top}, and let w\in\mathbb{R}^{K} denote the learnable weights. Linear function approximation reduces to the tabular method when \Phi=I. While Tsitsiklis & Roy (1999) proves almost sure convergence under assumptions such as linear independence of the columns of \Phi and \Phi w\neq ce for any c\in\mathbb{R}, these conditions fail to hold in the most straightforward tabular case (where \Phi=I and Ie=e). However, under a non-trivial construction of \Phi, it can be shown that the results from Tsitsiklis & Roy (1999) can be used to prove the almost sure convergence of (Average Reward TD) to a set in the tabular case.

Zhang et al. (2021) establishes the L^{2} convergence of (Average Reward TD) and also provides a convergence rate. However, it is well known that neither L^{2} convergence nor almost sure convergence implies the other. Our work improves upon both of these works by proving that the iterates converge to a fixed point almost surely.

Finally, the (Average Reward TD) algorithm has inspired the design of many other TD algorithms for average reward MDPs, for both policy evaluation and control, including Konda & Tsitsiklis (1999); Yang et al. (2016); Wan et al. (2021a); Zhang & Ross (2021); Wan et al. (2021b); He et al. (2022); Saxena et al. (2023). We envision that our work will shed light on the almost sure convergence of those follow-up algorithms.

6 Conclusion

In this work, we provide the first proof of almost sure convergence as well as non-asymptotic finite sample analysis of stochastic approximations under nonexpansive maps with Markovian noise. As an application, we provide the first proof of almost sure convergence of (Average Reward TD) to a potentially sample-path dependent fixed point. This result highlights the underappreciated strength of SKM iterations, a tool whose potential is often overlooked in the RL community. Addressing several follow-up questions could open the door to proving the convergence of many other RL algorithms. Do SKM iterations converge in LpL^{p}? Do they follow a central limit theorem or a law of the iterated logarithm? Can they be extended to two-timescale settings? And can we develop a finite sample analysis for them? Resolving these questions could pave the way for significant advancements across RL theory. We leave them for future investigation.

Acknowledgements

This work is supported in part by the US National Science Foundation (NSF) under grants III-2128019 and SLES-2331904. EB acknowledges support from the NSF Graduate Research Fellowship (NSF-GRFP) under award 1842490. This work was also supported in part by the Coastal Virginia Center for Cyber Innovation (COVA CCI) and the Commonwealth Cyber Initiative (CCI), an investment in the advancement of cyber research and development, innovation, and workforce development. For more information about CCI, visit www.covacci.org and www.cyberinitiative.org.

Impact Statement

This paper presents work whose goal is to advance the field of reinforcement learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Abounadi et al. (2001) Abounadi, J., Bertsekas, D., and Borkar, V. S. Learning algorithms for markov decision processes with average cost. SIAM Journal on Control and Optimization, 2001.
  • Abounadi et al. (2002) Abounadi, J., Bertsekas, D. P., and Borkar, V. Stochastic approximation for nonexpansive maps: Application to q-learning algorithms. SIAM Journal on Control and Optimization, 41(1):1–22, 2002.
  • Bellman (1957) Bellman, R. A markovian decision process. Journal of mathematics and mechanics, pp.  679–684, 1957.
  • Benveniste et al. (1990) Benveniste, A., Métivier, M., and Priouret, P. Adaptive Algorithms and Stochastic Approximations. Springer, 1990.
  • Bertsekas & Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-Dynamic Programming. Athena Scientific Belmont, MA, 1996.
  • Borkar et al. (2021) Borkar, V., Chen, S., Devraj, A., Kontoyiannis, I., and Meyn, S. The ode method for asymptotic statistics in stochastic approximation and reinforcement learning. arXiv preprint arXiv:2110.14427, 2021.
  • Borkar (2009) Borkar, V. S. Stochastic approximation: a dynamical systems viewpoint. Springer, 2009.
  • Borkar & Meyn (2000) Borkar, V. S. and Meyn, S. P. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 2000.
  • Bravo & Cominetti (2024) Bravo, M. and Cominetti, R. Stochastic fixed-point iterations for nonexpansive maps: Convergence and error bounds. SIAM Journal on Control and Optimization, 62(1):191–219, 2024.
  • Bravo et al. (2019) Bravo, M., Cominetti, R., and Pavez-Signé, M. Rates of convergence for inexact krasnosel’skii–mann iterations in banach spaces. Mathematical Programming, 175:241–262, 2019.
  • Chen et al. (2021) Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. A lyapunov theory for finite-sample guarantees of asynchronous q-learning and td-learning variants. arXiv preprint arXiv:2102.01567, 2021.
  • Cominetti et al. (2014) Cominetti, R., Soto, J. A., and Vaisman, J. On the rate of convergence of krasnosel’skii-mann iterations and their connection with sums of bernoullis. Israel Journal of Mathematics, 199:757–772, 2014.
  • Edelstein (1966) Edelstein, M. A remark on a theorem of m. a. krasnoselski. American Mathematical Monthly, 1966.
  • Folland (1999) Folland, G. B. Real analysis: modern techniques and their applications, volume 40. John Wiley & Sons, 1999.
  • He et al. (2022) He, J., Wan, Y., and Mahmood, A. R. The emphatic approach to average-reward policy evaluation. In Deep Reinforcement Learning Workshop NeurIPS 2022, 2022.
  • Ishikawa (1976) Ishikawa, S. Fixed points and iteration of a nonexpansive mapping in a banach space. Proceedings of the American Mathematical Society, 59(1):65–71, 1976.
  • Karandikar & Vidyasagar (2024) Karandikar, R. L. and Vidyasagar, M. Convergence rates for stochastic approximation: Biased noise with unbounded variance, and applications. Journal of Optimization Theory and Applications, pp.  1–39, 2024.
  • Kiefer & Wolfowitz (1952) Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function. Annals of Mathematical Statistics, 1952.
  • Kim & Xu (2007) Kim, T.-H. and Xu, H.-K. Robustness of mann’s algorithm for nonexpansive mappings. Journal of Mathematical Analysis and Applications, 327(2):1105–1115, 2007.
  • Konda & Tsitsiklis (1999) Konda, V. R. and Tsitsiklis, J. N. Actor-critic algorithms. In Advances in Neural Information Processing Systems, 1999.
  • Koval & Schwabe (2003) Koval, V. and Schwabe, R. A law of the iterated logarithm for stochastic approximation procedures in d-dimensional euclidean space. Stochastic processes and their applications, 105(2):299–313, 2003.
  • Krasnosel’skii (1955) Krasnosel’skii, M. A. Two remarks on the method of successive approximations. Uspekhi matematicheskikh nauk, 10(1):123–127, 1955.
  • Kushner & Yin (2003) Kushner, H. and Yin, G. G. Stochastic approximation and recursive algorithms and applications. Springer Science & Business Media, 2003.
  • Liu (1995) Liu, L.-S. Ishikawa and mann iterative process with errors for nonlinear strongly accretive mappings in banach spaces. Journal of Mathematical Analysis and Applications, 194(1):114–125, 1995.
  • Liu et al. (2025) Liu, S., Chen, S., and Zhang, S. The ODE method for stochastic approximation and reinforcement learning with markovian noise. Journal of Machine Learning Research, 2025.
  • Métivier & Priouret (1987) Métivier, M. and Priouret, P. Théorèmes de convergence presque sure pour une classe d’algorithmes stochastiques à pas décroissant. Probability Theory and related fields, 74:403–428, 1987.
  • Meyn (2022) Meyn, S. Control systems and reinforcement learning. Cambridge University Press, 2022.
  • Meyn (2024) Meyn, S. The projected bellman equation in reinforcement learning. IEEE Transactions on Automatic Control, 2024.
  • Meyn & Tweedie (2012) Meyn, S. P. and Tweedie, R. L. Markov chains and stochastic stability. Springer Science & Business Media, 2012.
  • Puterman (2014) Puterman, M. L. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Qian et al. (2024) Qian, X., Xie, Z., Liu, X., and Zhang, S. Almost sure convergence rates and concentration of stochastic approximation and reinforcement learning with markovian noise. arXiv preprint arXiv:2411.13711, 2024.
  • Reich (1979) Reich, S. Weak convergence theorems for nonexpansive mappings in banach spaces. J. Math. Anal. Appl, 67(2):274–276, 1979.
  • Robbins & Monro (1951) Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, 1951.
  • Saxena et al. (2023) Saxena, N., Khastagir, S., Kolathaya, S., and Bhatnagar, S. Off-policy average reward actor-critic with deterministic policy search. In International Conference on Machine Learning, pp.  30130–30203. PMLR, 2023.
  • Srikant & Ying (2019) Srikant, R. and Ying, L. Finite-time error bounds for linear stochastic approximation and TD learning. In Proceedings of the Conference on Learning Theory, 2019.
  • Sutton (1988) Sutton, R. S. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
  • Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction (2nd Edition). MIT press, 2018.
  • Szepesvári (1997) Szepesvári, C. The asymptotic convergence-rate of q-learning. Advances in neural information processing systems, 10, 1997.
  • Tadić (2002) Tadić, V. B. On the almost sure rate of convergence of temporal-difference learning algorithms. IFAC Proceedings Volumes, 35(1):455–460, 2002.
  • Tadic (2004) Tadic, V. B. On the almost sure rate of convergence of linear stochastic approximation algorithms. IEEE Transactions on Information Theory, 50(2):401–409, 2004.
  • Tsitsiklis & Roy (1999) Tsitsiklis, J. N. and Roy, B. V. Average cost temporal-difference learning. Automatica, 1999.
  • Wan et al. (2021a) Wan, Y., Naik, A., and Sutton, R. Average-reward learning and planning with options. Advances in Neural Information Processing Systems, 34:22758–22769, 2021a.
  • Wan et al. (2021b) Wan, Y., Naik, A., and Sutton, R. S. Learning and planning in average-reward markov decision processes. In Proceedings of the International Conference on Machine Learning, 2021b.
  • Wu et al. (2020) Wu, Y., Zhang, W., Xu, P., and Gu, Q. A finite-time analysis of two time-scale actor-critic methods. In Advances in Neural Information Processing Systems, 2020.
  • Yang et al. (2016) Yang, S., Gao, Y., An, B., Wang, H., and Chen, X. Efficient average reward reinforcement learning using constant shifting values. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • Zhang et al. (2021) Zhang, S., Zhang, Z., and Maguluri, S. T. Finite sample analysis of average-reward td learning and qq-learning. Advances in Neural Information Processing Systems, 2021.
  • Zhang et al. (2022) Zhang, S., des Combes, R. T., and Laroche, R. Global optimality and finite sample analysis of softmax off-policy actor critic under state distribution mismatch. Journal of Machine Learning Research, 2022.
  • Zhang et al. (2023) Zhang, S., Des Combes, R. T., and Laroche, R. On the convergence of sarsa with linear function approximation. In International Conference on Machine Learning, 2023.
  • Zhang & Ross (2021) Zhang, Y. and Ross, K. W. On-policy deep reinforcement learning for the average-reward criterion. In International Conference on Machine Learning, pp.  12535–12545. PMLR, 2021.

Appendix A Mathematical Background

Lemma A.1 (Theorem 2.1 from Bravo & Cominetti (2024)).

Let \quantity{z_{n}} be a sequence generated by (IKM). Let \text{Fix}(T) denote the set of fixed points of T (assumed to be nonempty). Additionally, let \tau_{n} be defined according to (9) and define the real function \sigma:(0,\infty)\rightarrow(0,\infty) as

σ(y)=min{1,1/πy}.\displaystyle\sigma(y)=\min\quantity{{1,1/\sqrt{\pi y}}}. (71)

If ζA.10\zeta_{\ref{lem:bravo_2.1}}\geq 0 is such that Tznx0ζA.1\norm{Tz_{n}-x_{0}}\leq\zeta_{\ref{lem:bravo_2.1}} for all n1n\geq 1, then

znTznζA.1σ(τn)+k=1n2αkekσ(τnτk)+2en+1.\norm{z_{n}-Tz_{n}}\leq\zeta_{\ref{lem:bravo_2.1}}\sigma{\left(\tau_{n}\right)}+\sum_{k=1}^{n}2\alpha_{k}\norm{e_{k}}\sigma{\left(\tau_{n}-\tau_{k}\right)}+2\norm{e_{n+1}}. (72)

Moreover, if \tau_{n}\rightarrow\infty and \norm{e_{n}}\rightarrow 0 with S\doteq\sum_{n=1}^{\infty}\alpha_{n}\norm{e_{n}}<\infty, then (72) holds with \zeta_{A.1}=2\inf_{x\in\text{Fix}(T)}\norm{x_{0}-x}+S, and we have \norm{z_{n}-Tz_{n}}\rightarrow 0 as well as z_{n}\rightarrow x_{*} for some fixed point x_{*}\in\text{Fix}(T).

Lemma A.2 (Monotonicity of αk,n\alpha_{k,n} from Lemma B.1 in Bravo & Cominetti (2024)).

For αn=1(n+1)b\alpha_{n}=\frac{1}{{\left(n+1\right)}^{b}} with 0<b10<b\leq 1 and αi,n\alpha_{i,n} in (8), we have αk,nαk+1,n\alpha_{k,n}\leq\alpha_{k+1,n} for k1k\geq 1 so that αk+1,nαn,n=αn\alpha_{k+1,n}\leq\alpha_{n,n}=\alpha_{n}.

Lemma A.3 (Lemma B.2 from (Bravo & Cominetti, 2024)).

For αn=1(n+1)b\alpha_{n}=\frac{1}{{\left(n+1\right)}^{b}} with 0<b10<b\leq 1 and αi,n\alpha_{i,n} in (8), we have k=1nαk,n2αn+1\sum_{k=1}^{n}\alpha_{k,n}^{2}\leq\alpha_{n+1} for all n1n\geq 1.

Lemma A.4 (Monotone Convergence Theorem from Folland (1999)).

Given a measure space (X,M,μ)\quantity(X,M,\mu), define L+L^{+} as the space of all measurable functions from XX to [0,][0,\infty]. Then, if {fn}\quantity{f_{n}} is a sequence in L+L^{+} such that fjfj+1f_{j}\leq f_{j+1} for all j, and f=limnfnf=\lim_{n\rightarrow\infty}f_{n}, then f𝑑μ=limnfn𝑑μ\int fd\mu=\lim_{n\rightarrow\infty}\int f_{n}d\mu.

Appendix B Additional Lemmas from Section 2

In this section, we present and prove the lemmas referenced in Section 2 as part of the proof of Theorem 2.6. Additionally, we establish several auxiliary lemmas necessary for these proofs.

We begin by proving several convergence results related to the learning rates.

Lemma B.1 (Learning Rates).

With τn\tau_{n} defined in (9) we have,

τn={𝒪(n1b)if45<b<1,𝒪(logn)ifb=1.\displaystyle\tau_{n}=\begin{cases}\mathcal{O}\quantity(n^{1-b})&\text{if}\quad\frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\log n)&\text{if}\quad b=1.\end{cases} (73)

This further implies,

supnk=1nαk2τk\displaystyle\sup_{n}\,\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k} <,\displaystyle<\infty, (74)
supnk=1nαk2τk2\displaystyle\sup_{n}\,\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}^{2} <,\displaystyle<\infty, (75)
supnk=1nαk3/2τk1\displaystyle\sup_{n}\,\sum_{k=1}^{n}\alpha_{k}^{3/2}\tau_{k-1} <,\displaystyle<\infty, (76)
supnk=0n1|αkαk+1|τk\displaystyle\sup_{n}\sum_{k=0}^{n-1}\absolutevalue{\alpha_{k}-\alpha_{k+1}}\tau_{k} <,\displaystyle<\infty, (77)
supnk=1nαk2j=1i1αjτj\displaystyle\sup_{n}\sum_{k=1}^{n}\alpha_{k}^{2}\sum_{j=1}^{i-1}\alpha_{j}\tau_{j} <,\displaystyle<\infty, (78)
supnk=1nαkj=1k1αj,k12τj12\displaystyle\sup_{n}\sum_{k=1}^{n}\alpha_{k}\sqrt{\sum_{j=1}^{k-1}\alpha_{j,k-1}^{2}\tau_{j-1}^{2}} <,\displaystyle<\infty, (79)

Since this lemma comprises several short proofs regarding the deterministic learning rates defined in Assumption 2.4, we prove each result in its own subsection. Recall that \alpha_{n}\doteq\frac{1}{(n+1)^{b}} where \frac{4}{5}<b\leq 1.

(73):

Proof.

From the definition of τn\tau_{n} in (9), we have

τn\displaystyle\tau_{n} k=1nαk(1αk)k=1nαk=k=1n1(k+1)b.\displaystyle\doteq\sum_{k=1}^{n}\alpha_{k}\quantity(1-\alpha_{k})\leq\sum_{k=1}^{n}\alpha_{k}=\sum_{k=1}^{n}\frac{1}{(k+1)^{b}}. (81)

Case 1: b=1b=1. It is easy to see τn=𝒪(logn)\tau_{n}=\mathcal{O}\quantity(\log n).

Case 2: When b<1b<1, we can approximate the sum with an integral, with

\displaystyle\sum_{k=1}^{n}\frac{1}{\quantity(k+1)^{b}}\leq\int_{1}^{n+1}\frac{1}{x^{b}}\,dx=\frac{(n+1)^{1-b}-1}{1-b}. (82)

Therefore we have τn=𝒪(n1b)\tau_{n}=\mathcal{O}\quantity(n^{1-b}) when b<1b<1. ∎

In analyzing the subsequent equations, we will use the fact that $\tau_{n}=\mathcal{O}\quantity(\log n)$ when $b=1$ and $\tau_{n}=\mathcal{O}\quantity(n^{1-b})$ when $\frac{4}{5}<b<1$. Additionally, we have $\alpha_{n}=\mathcal{O}\quantity(\frac{1}{n^{b}})$.
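
As a purely illustrative sanity check of (73) (not part of the proof), the following Python sketch computes $\tau_{n}$ directly from (9) for the step sizes of Assumption 2.4 and compares it against the claimed growth rates; the particular values of $b$ and $n$ are arbitrary choices.

```python
# Illustrative numerical check of (73): tau_n grows like n^(1-b) for 4/5 < b < 1
# and like log(n) for b = 1, where tau_n = sum_{k=1}^n alpha_k (1 - alpha_k)
# and alpha_k = 1 / (k + 1)^b as in Assumption 2.4.
import numpy as np

def tau(n, b):
    k = np.arange(1, n + 1)
    alpha = 1.0 / (k + 1) ** b
    return np.sum(alpha * (1.0 - alpha))

for b in (0.9, 1.0):
    for n in (10**3, 10**4, 10**5):
        rate = n ** (1 - b) if b < 1 else np.log(n)
        print(f"b={b}, n={n}: tau_n / rate = {tau(n, b) / rate:.3f}")
# The ratio tau_n / rate settles to a constant as n grows, consistent with (73).
```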

(74):

Proof.

We have an order-wise approximation of the sum

k=1nαk2τk={𝒪(k=1n1k3b1)if45<b<1,𝒪(k=1nlog(k)k2)ifb=1..\displaystyle\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}=\begin{dcases}\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{3b-1}})&\text{if}\ \frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\sum_{k=1}^{n}\frac{\log(k)}{k^{2}})&\text{if}\ b=1.\end{dcases}. (83)

In both cases of b=1b=1 and 45<b<1\frac{4}{5}<b<1, the series clearly converge as nn\rightarrow\infty. ∎

(76):

Proof.

We have an order-wise approximation of the sum

\displaystyle\sum_{k=1}^{n}\alpha_{k}^{3/2}\tau_{k-1}\leq\sum_{k=1}^{n}\alpha_{k}^{3/2}\tau_{k}=\begin{dcases}\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{\frac{5}{2}b-1}})&\text{if}\ \frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\sum_{k=1}^{n}\frac{\log(k)}{k^{3/2}})&\text{if}\ b=1.\end{dcases} (84)

In both cases of b=1b=1 and 45<b<1\frac{4}{5}<b<1, the series clearly converge as nn\rightarrow\infty. ∎

(75):

Proof.

We can give an order-wise approximation of the sum

k=1nαk2τk2={𝒪(k=1n1k4b2)if45<b<1,𝒪(k=1nlog2(k)k2)ifb=1..\displaystyle\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}^{2}=\begin{dcases}\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{4b-2}})&\text{if}\ \frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\sum_{k=1}^{n}\frac{\log^{2}(k)}{k^{2}})&\text{if}\ b=1.\end{dcases}. (85)

In both cases of b=1b=1 and 45<b<1\frac{4}{5}<b<1, the series clearly converge as nn\rightarrow\infty. ∎

(77):

Proof.

Since αn\alpha_{n} is strictly decreasing, we have |αkαk+1|=αkαk+1\absolutevalue{\alpha_{k}-\alpha_{k+1}}=\alpha_{k}-\alpha_{k+1}.

Case 1: For the case where b=1b=1, it is trivial to see that,

k=1n|αkαk+1|τk=𝒪(k=1nlog(k)k2+k).\displaystyle\sum_{k=1}^{n}\absolutevalue{\alpha_{k}-\alpha_{k+1}}\tau_{k}=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{\log(k)}{k^{2}+k}). (86)

This series clearly converges.

Case 2: For the case where 45<b<1\frac{4}{5}<b<1, we have

αnαn+1\displaystyle\alpha_{n}-\alpha_{n+1} =𝒪(1nb1(n+1)b),\displaystyle=\mathcal{O}\quantity(\frac{1}{n^{b}}-\frac{1}{(n+1)^{b}}), (87)
=𝒪((n+1)bnbnb(n+1)b).\displaystyle=\mathcal{O}\quantity(\frac{(n+1)^{b}-n^{b}}{n^{b}(n+1)^{b}}). (88)

To analyze the behavior of this term for large nn we first consider the binomial expansion of (n+1)b(n+1)^{b},

(n+1)b\displaystyle(n+1)^{b} =nb(1+1n)b=nb(1+b1n+b(b1)21n2+)\displaystyle=n^{b}\quantity(1+\frac{1}{n})^{b}=n^{b}(1+b\frac{1}{n}+\frac{b(b-1)}{2}\frac{1}{n^{2}}+\dots) (89)

Subtracting nbn^{b} from (n+1)b(n+1)^{b}:

(n+1)bnb=nb(1+b1n+b(b1)21n2+)nb=𝒪(bnb1).\displaystyle(n+1)^{b}-n^{b}=n^{b}(1+b\frac{1}{n}+\frac{b(b-1)}{2}\frac{1}{n^{2}}+\dots)-n^{b}=\mathcal{O}\quantity(bn^{b-1}). (90)

The leading order of the denominator of (88) is clearly n2bn^{2b}, which gives

αnαn+1=𝒪(bnb1n2b)=𝒪(bnb+1).\displaystyle\alpha_{n}-\alpha_{n+1}=\mathcal{O}\quantity(\frac{bn^{b-1}}{n^{2b}})=\mathcal{O}\quantity(\frac{b}{n^{b+1}}). (91)

Therefore with τn=𝒪(n1b)\tau_{n}=\mathcal{O}\quantity(n^{1-b}),

k=1n|αkαk+1|τk=𝒪(bk=1n1k2b)\displaystyle\sum_{k=1}^{n}\absolutevalue{\alpha_{k}-\alpha_{k+1}}\tau_{k}=\mathcal{O}\quantity(b\sum_{k=1}^{n}\frac{1}{k^{2b}}) (92)

which clearly converges as nn\rightarrow\infty for 45<b<1\frac{4}{5}<b<1. ∎

(78):

Proof.

Case 1: In the proof for (73) we prove that k=1nαk=𝒪(logn)\sum_{k=1}^{n}\alpha_{k}=\mathcal{O}\quantity(\log n) when b=1b=1. Then since τk\tau_{k} is increasing, we have

k=1nαk2j=1k1αjτjk=1nαk2τkj=1k1αj=𝒪(k=1nlog2kk2),\sum_{k=1}^{n}\alpha_{k}^{2}\sum_{j=1}^{k-1}\alpha_{j}\tau_{j}\leq\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}\sum_{j=1}^{k-1}\alpha_{j}=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{\log^{2}k}{k^{2}}), (93)

which clearly converges as nn\rightarrow\infty.

Case 2: For the case when b(45,1)b\in(\frac{4}{5},1), we first consider the inner sum of (78),

j=1k1αjτj=𝒪(j=1k11j2b1),\displaystyle\sum_{j=1}^{k-1}\alpha_{j}\tau_{j}=\mathcal{O}\quantity(\sum_{j=1}^{k-1}\frac{1}{j^{2b-1}}), (94)

which we can approximate by an integral,

1k1x2b1𝑑x=𝒪(k22b).\displaystyle\int_{1}^{k}\frac{1}{x^{2b-1}}\ dx=\mathcal{O}\quantity(k^{2-2b}). (95)

Therefore,

k=1nαk2j=1k1αjτj=𝒪(k=1nk22bk2b)=𝒪(k=1n1k4b2),\displaystyle\sum_{k=1}^{n}\alpha_{k}^{2}\sum_{j=1}^{k-1}\alpha_{j}\tau_{j}=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{k^{2-2b}}{k^{2b}})=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{4b-2}}), (96)

which converges for 45<b1\frac{4}{5}<b\leq 1 as nn\rightarrow\infty. ∎

(79):

Proof.

Case 1: For b=1b=1, because we have αj,i<αj+1,i\alpha_{j,i}<\alpha_{j+1,i} and αi,i=αi\alpha_{i,i}=\alpha_{i} from Lemma A.2, we have the order-wise approximation,

i=1nαij=1i1αj,i12τj12\displaystyle\sum_{i=1}^{n}\alpha_{i}\sqrt{\sum_{j=1}^{i-1}\alpha_{j,i-1}^{2}\tau_{j-1}^{2}} i=1nαiαi12τi12j=1i11,\displaystyle\leq\sum_{i=1}^{n}\alpha_{i}\sqrt{\alpha_{i-1}^{2}\tau_{i-1}^{2}\sum_{j=1}^{i-1}1}, (τi\tau_{i} is increasing) (97)
=i=1nαiαi1τi1i1.\displaystyle=\sum_{i=1}^{n}\alpha_{i}\alpha_{i-1}\tau_{i-1}\sqrt{i-1}. (98)
=𝒪(i=1nlog(i1)i(i1))\displaystyle=\mathcal{O}\quantity(\sum_{i=1}^{n}\frac{\log(i-1)}{i\sqrt{(i-1)}}) (99)
=𝒪(i=1nlog(i1)i3/2),\displaystyle=\mathcal{O}\quantity(\sum_{i=1}^{n}\frac{\log(i-1)}{i^{3/2}}), (100)

which clearly converges.

Case 2: For the case when b(45,1)b\in(\frac{4}{5},1), we have,

i=1nαij=1i1αj,i12τj12\displaystyle\sum_{i=1}^{n}\alpha_{i}\sqrt{\sum_{j=1}^{i-1}\alpha_{j,i-1}^{2}\tau_{j-1}^{2}} i=1nαiτi1j=1i1αj,i12,\displaystyle\leq\sum_{i=1}^{n}\alpha_{i}\tau_{i-1}\sqrt{\sum_{j=1}^{i-1}\alpha_{j,i-1}^{2}}, (τi\tau_{i} is increasing) (101)
\displaystyle\leq\sum_{i=1}^{n}\alpha_{i}\tau_{i-1}\sqrt{\alpha_{i}}, (Lemma A.3) (102)
=𝒪(i=1ni1bibib)\displaystyle=\mathcal{O}\quantity(\sum_{i=1}^{n}\frac{i^{1-b}}{i^{b}\sqrt{i^{b}}}) (103)
=𝒪(i=1n1i5b/21),\displaystyle=\mathcal{O}\quantity(\sum_{i=1}^{n}\frac{1}{i^{5b/2-1}}), (104)

which converges for 45<b<1\frac{4}{5}<b<1. ∎
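
The boundedness claims (74)–(77) involve only the deterministic quantities $\alpha_{k}$ and $\tau_{k}$, so they can also be checked numerically. The sketch below is an illustration only; it does not cover (78) and (79), which additionally involve $\alpha_{i,n}$ from (8), and the convention $\tau_{0}=0$ used in it is only for convenience.

```python
# Illustrative check that the partial sums in (74)-(77) stay bounded in n for
# alpha_k = 1/(k+1)^b with 4/5 < b <= 1. A sanity check only, not a proof.
import numpy as np

def partial_sums(n, b):
    k = np.arange(1, n + 1)
    alpha = 1.0 / (k + 1) ** b
    tau = np.cumsum(alpha * (1.0 - alpha))                      # tau_k as in (9)
    tau_prev = np.concatenate(([0.0], tau[:-1]))                # tau_{k-1}, with tau_0 = 0 here
    return (np.sum(alpha**2 * tau),                             # (74)
            np.sum(alpha**2 * tau**2),                          # (75)
            np.sum(alpha**1.5 * tau_prev),                      # (76)
            np.sum(np.abs(alpha[:-1] - alpha[1:]) * tau[:-1]))  # (77)

for b in (0.85, 1.0):
    for n in (10**3, 10**5):
        print(b, n, [round(s, 4) for s in partial_sums(n, b)])
# The increments between n = 1e3 and n = 1e5 are small relative to the totals and
# shrink as n grows, consistent with boundedness (convergence is slow for b near 4/5).
```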

Then, under Assumption 2.5, we prove additional results about the convergence of the first and second moments of the additive noise {ϵn(1)}\quantity{{\epsilon}^{{\left(1\right)}}_{n}}.

Lemma B.2.

Let Assumptions 2.4 and 2.5 hold. Then, we have

𝔼[ϵn(1)]\displaystyle\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{n}}] =𝒪(1n),\displaystyle=\mathcal{O}\quantity(\frac{1}{\sqrt{n}}), (105)
supnk=1nαk𝔼[ϵk(1)]\displaystyle\sup_{n}\sum_{k=1}^{n}\alpha_{k}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}] <,\displaystyle<\infty, (106)
supnk=1nαk𝔼[ϵk(1)2]\displaystyle\sup_{n}\sum_{k=1}^{n}\alpha_{k}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}^{2}] <,\displaystyle<\infty, (107)
supnk=1nαk2𝔼[ϵk(1)2]\displaystyle\sup_{n}\sum_{k=1}^{n}\alpha_{k}^{2}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}^{2}] <,\displaystyle<\infty, (108)
supnk=1nαkj=1k1αj,k1𝔼[ϵj(1)]\displaystyle\sup_{n}\sum_{k=1}^{n}\alpha_{k}\sum_{j=1}^{k-1}\alpha_{j,k-1}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{j}}] <.\displaystyle<\infty. (109)
Proof.

Recall that by Assumption 2.5 we have $\mathbb{E}\quantity[\norm{\epsilon^{(1)}_{n}}^{2}]=\mathcal{O}\quantity(\frac{1}{n})$. Also recall that $\alpha_{k}=\mathcal{O}\quantity(\frac{1}{k^{b}})$ with $\frac{4}{5}<b\leq 1$. Then, we can prove the following bounds:

(105):

By Jensen’s inequality, we have

𝔼[ϵn(1)]𝔼[ϵn(1)2]=𝒪(1n).\displaystyle\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{n}}]\leq\sqrt{\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{n}}^{2}]}=\mathcal{O}\quantity(\frac{1}{\sqrt{n}}). (110)

(106):

k=1nαk𝔼[ϵk(1)]=𝒪(k=1n1kb+12)\sum_{k=1}^{n}\alpha_{k}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}]=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{b+\frac{1}{2}}}) (111)

which clearly converges as nn\rightarrow\infty for 45<b1\frac{4}{5}<b\leq 1.

(107):

k=1nαk𝔼[ϵk(1)2]=𝒪(k=1n1kb+1)\sum_{k=1}^{n}\alpha_{k}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}^{2}]=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{b+1}}) (112)

which clearly converges as nn\rightarrow\infty for 45<b1\frac{4}{5}<b\leq 1.

(108):

k=1nαk2𝔼[ϵk(1)2]=𝒪(k=1n1k2b+1)\sum_{k=1}^{n}\alpha_{k}^{2}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}^{2}]=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{2b+1}}) (113)

which clearly converges as nn\rightarrow\infty for 45<b1\frac{4}{5}<b\leq 1.

(109):

\displaystyle\sum_{k=1}^{n}\alpha_{k}\sum_{j=1}^{k-1}\alpha_{j,k-1}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{j}}]\leq\sum_{k=1}^{n}\alpha_{k}\alpha_{k-1}\sum_{j=1}^{k-1}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{j}}], (Lemma A.2) (114)
\displaystyle=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{2b}}\sum_{j=1}^{k-1}\frac{1}{\sqrt{j}}). (By (105)) (115)

It can be easily verified with an integral approximation that j=1k11j=𝒪(k)\sum_{j=1}^{k-1}\frac{1}{\sqrt{j}}=\mathcal{O}(\sqrt{k}). This further implies

k=1nαkj=1k1αj,k1𝔼[ϵj(1)]\displaystyle\sum_{k=1}^{n}\alpha_{k}\sum_{j=1}^{k-1}\alpha_{j,k-1}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{j}}] =𝒪(k=1n1k2b12),\displaystyle=\mathcal{O}\quantity(\sum_{k=1}^{n}\frac{1}{k^{2b-\frac{1}{2}}}), (116)

which converges as nn\rightarrow\infty for 45<b1\frac{4}{5}<b\leq 1. ∎

Next, in Lemma B.3, we upper-bound the iterates {xn}\quantity{x_{n}}.

Lemma B.3.

For the iterates $\quantity{x_{n}}$, we have

xnx0+CHk=1nαk+k=1nαkϵk(1)CB.3τn+k=1nαkϵk(1),\|x_{n}\|\leq\norm{x_{0}}+C_{H}\sum_{k=1}^{n}\alpha_{k}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}\leq C_{\ref{lem:xn_norm}}\tau_{n}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}, (117)

where CB.3C_{\ref{lem:xn_norm}} is a deterministic constant.

Proof.

Applying \|\cdot\| to both sides of (SKM with Markovian and Additive Noise) gives,

xn+1\displaystyle\|x_{n+1}\| =(1αn+1)xn+αn+1(H(xn,Yn+1)+ϵn+1(1)),\displaystyle=\norm{(1-\alpha_{n+1})x_{n}+\alpha_{n+1}\quantity(H{\left(x_{n},Y_{n+1}\right)}+{\epsilon}^{{\left(1\right)}}_{n+1})}, (118)
(1αn+1)xn+αn+1H(xn,Yn+1)+αn+1ϵn+1(1),\displaystyle\leq(1-\alpha_{n+1})\|x_{n}\|+\alpha_{n+1}\norm{H(x_{n},Y_{n+1})}+\alpha_{n+1}\norm{{\epsilon}^{{\left(1\right)}}_{n+1}}, (119)
(1αn+1)xn+αn+1(CH+xn)+αn+1ϵn+1(1),\displaystyle\leq(1-\alpha_{n+1})\norm{x_{n}}+\alpha_{n+1}{\left(C_{H}+\norm{x_{n}}\right)}+\alpha_{n+1}\norm{{\epsilon}^{{\left(1\right)}}_{n+1}}, (By (3)) (120)
=xn+αn+1CH+αn+1ϵn+1(1).\displaystyle=\|x_{n}\|+\alpha_{n+1}C_{H}+\alpha_{n+1}\norm{{\epsilon}^{{\left(1\right)}}_{n+1}}. (121)

A simple induction shows that almost surely,

xnx0+CHk=1nαk+k=1nαkϵk(1).\displaystyle\norm{x_{n}}\leq\norm{x_{0}}+C_{H}\sum_{k=1}^{n}\alpha_{k}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}. (122)

Since {αn}\quantity{\alpha_{n}} is monotonically decreasing, we have

xn\displaystyle\norm{x_{n}} x0+CH(1α1)k=1nαk(1αk)+k=1nαkϵk(1),\displaystyle\leq\norm{x_{0}}+\frac{C_{H}}{{\left(1-\alpha_{1}\right)}}\sum_{k=1}^{n}\alpha_{k}(1-\alpha_{k})+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}, (123)
=x0+CH(1α1)τn+k=1nαkϵk(1),\displaystyle=\norm{x_{0}}+\frac{C_{H}}{{\left(1-\alpha_{1}\right)}}\tau_{n}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}, (124)
max{x0,CH(1α1)}(1+τn)+k=1nαkϵk(1).\displaystyle\leq\max\left\{\norm{x_{0}},\frac{C_{H}}{{\left(1-\alpha_{1}\right)}}\right\}{\left(1+\tau_{n}\right)}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}. (125)

Therefore, since $\tau_{n}$ is monotonically increasing and bounded below by $\tau_{1}>0$, there exists some constant we denote as $C_{B.3}$ such that

xn\displaystyle\norm{x_{n}} CB.3τn+k=1nαkϵk(1).\displaystyle\leq C_{\ref{lem:xn_norm}}\tau_{n}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}. (126)
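
To illustrate the bound (117), the following toy simulation runs the recursion from the display above and checks the first inequality of (117) along the trajectory. Everything in it (the two-state chain, the map $H$, and the additive noise) is a hypothetical example chosen only so that $\norm{H(x,y)}\leq C_{H}+\norm{x}$; it is not the $H$ studied in the paper.

```python
# Toy check of the first inequality in (117). All ingredients below (chain, H, noise)
# are hypothetical; H is chosen only so that ||H(x, y)|| <= C_H + ||x|| with C_H = 1.
import numpy as np

rng = np.random.default_rng(0)
b, C_H, d = 0.9, 1.0, 3
P = np.array([[0.9, 0.1], [0.2, 0.8]])                  # toy two-state Markov chain
c = [np.array([1.0, 0.0, 0.0]), np.array([0.0, -1.0, 0.0])]

def H(x, y):
    return -x[::-1] + c[y]                              # ||H(x, y)|| <= ||x|| + 1

x = np.array([5.0, -2.0, 1.0])
x0_norm = np.linalg.norm(x)
step_sum, eps_sum, y = 0.0, 0.0, 0
for n in range(1, 20001):
    alpha = 1.0 / (n + 1) ** b
    y = int(rng.choice(2, p=P[y]))
    eps = rng.standard_normal(d) / np.sqrt(n)           # E||eps_n||^2 = O(1/n), cf. Assumption 2.5
    x = (1 - alpha) * x + alpha * (H(x, y) + eps)
    step_sum += alpha * C_H
    eps_sum += alpha * np.linalg.norm(eps)
    assert np.linalg.norm(x) <= x0_norm + step_sum + eps_sum + 1e-9
print("||x_n|| =", round(np.linalg.norm(x), 3), " bound =", round(x0_norm + step_sum + eps_sum, 3))
```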

Lemma B.4.

With ν(x,y)\nu(x,y) as defined in (13), we have

ν(x,y)ν(x,y)CB.4xx,\norm{\nu{\left(x,y\right)}-\nu{\left(x^{\prime},y\right)}}\leq C_{\ref{lem:v Lipschitz}}\norm{x-x^{\prime}}, (127)

which further implies

ν(x,y)CB.4(CB.4+x),\norm{\nu{\left(x,y\right)}}\leq C_{\ref{lem:v Lipschitz}}\quantity(C_{\ref{lem:v Lipschitz}}^{\prime}+\norm{x}), (128)

where CB.4,CB.4C_{\ref{lem:v Lipschitz}},C_{\ref{lem:v Lipschitz}}^{\prime} are deterministic constants.

Proof.

Since we work with a finite 𝒴\mathcal{Y}, we will use functions and matrices interchangeably. For example, given a function f:𝒴df:\mathcal{Y}\to\mathbb{R}^{d}, we also use ff to denote a matrix in (|𝒴|×d)\mathbb{R}^{\quantity({|\mathcal{Y}|}\times d)} whose yy-th row is f(y)f(y)^{\top}. Similarly, a matrix in (|𝒴|×d)\mathbb{R}^{\quantity({|\mathcal{Y}|}\times d)} also corresponds to a function 𝒴d\mathcal{Y}\to\mathbb{R}^{d}.

Let νx|𝒴|×d\nu_{x}\in\mathbb{R}^{{|\mathcal{Y}|}\times d} denote the function yν(x,y)y\mapsto\nu(x,y) and let Hx|𝒴|×dH_{x}\in\mathbb{R}^{{|\mathcal{Y}|}\times d} denote the function yH(x,y)y\mapsto H(x,y). Theorem 8.2.6 of Puterman (2014) then ensures that

νx=H𝒴Hx,\displaystyle\nu_{x}=H_{\mathcal{Y}}H_{x}, (129)

where H𝒴|𝒴|×|𝒴|H_{\mathcal{Y}}\in\mathbb{R}^{{|\mathcal{Y}|}\times{|\mathcal{Y}|}} is the fundamental matrix of the Markov chain depending only on the chain’s transition matrix PP. The exact expression of H𝒴H_{\mathcal{Y}} is inconsequential and we refer the reader to Puterman (2014) for details. Then we have for any i=1,,di=1,\dots,d,

νx[y,i]=yH𝒴[y,y]Hx[y,i].\displaystyle\nu_{x}[y,i]=\sum_{y^{\prime}}H_{\mathcal{Y}}[y,y^{\prime}]H_{x}[y^{\prime},i]. (130)

This implies that

\displaystyle\absolutevalue{\nu_{x}[y,i]-\nu_{x^{\prime}}[y,i]}\leq\sum_{y^{\prime}}\absolutevalue{H_{\mathcal{Y}}[y,y^{\prime}]}\absolutevalue{H_{x}[y^{\prime},i]-H_{x^{\prime}}[y^{\prime},i]} (131)
\displaystyle\leq\sum_{y^{\prime}}\absolutevalue{H_{\mathcal{Y}}[y,y^{\prime}]}\norm{H(x,y^{\prime})-H(x^{\prime},y^{\prime})}_{\infty} (132)
\displaystyle\leq\sum_{y^{\prime}}\absolutevalue{H_{\mathcal{Y}}[y,y^{\prime}]}\norm{x-x^{\prime}}_{\infty} (Assumption 2.2) (133)
\displaystyle\leq\norm{H_{\mathcal{Y}}}_{\infty}\norm{x-x^{\prime}}_{\infty}, (134)

yielding

ν(x,y)ν(x,y)H𝒴xx.\displaystyle\norm{\nu(x,y)-\nu(x^{\prime},y)}_{\infty}\leq\norm{H_{\mathcal{Y}}}_{\infty}\norm{x-x^{\prime}}_{\infty}. (135)

The equivalence between norms in finite-dimensional spaces ensures that there exists some $C_{B.4}$ (taken to be at least $1$ without loss of generality) such that (127) holds. Letting $x^{\prime}=0$ then yields

ν(x,y)CB.4(ν(0,y)+x).\displaystyle\norm{\nu(x,y)}\leq C_{\ref{lem:v Lipschitz}}\quantity(\norm{\nu(0,y)}+\norm{x}). (136)

Defining $C_{B.4}^{\prime}\doteq\max_{y}\norm{\nu(0,y)}$, we get

ν(x,y)CB.4(CB.4+x).\displaystyle\norm{\nu(x,y)}\leq C_{\ref{lem:v Lipschitz}}\quantity(C_{\ref{lem:v Lipschitz}}^{\prime}+\norm{x}). (137)
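
For concreteness, the sketch below instantiates $H_{\mathcal{Y}}$ as the deviation matrix $(I-P+\mathbf{1}\pi^{\top})^{-1}-\mathbf{1}\pi^{\top}$ of a small ergodic chain. This is a hypothetical choice: the argument above only needs that some such matrix exists via Theorem 8.2.6 of Puterman (2014). The script checks the Poisson-type identity $\nu_{x}-P\nu_{x}=H_{x}-\mathbf{1}h(x)^{\top}$ for this choice and then an instance of the Lipschitz bound (127) with constant $\norm{H_{\mathcal{Y}}}_{\infty}$, using a toy $H$ that is $1$-Lipschitz in $x$.

```python
# Hypothetical instantiation for Lemma B.4: H_Y is taken to be the deviation matrix
# D = (I - P + 1 pi^T)^{-1} - 1 pi^T of a small ergodic chain, and H(x, y) = A_y x + c_y
# with ||A_y||_inf <= 1. We check nu_x - P nu_x = H_x - 1 h(x)^T and the bound (127).
import numpy as np

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.7, 0.2],
              [0.3, 0.3, 0.4]])
S = 3

# Stationary distribution pi (left Perron eigenvector of P).
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()

D = np.linalg.inv(np.eye(S) - P + np.outer(np.ones(S), pi)) - np.outer(np.ones(S), pi)

A = [np.array([[0.0, -1.0], [1.0, 0.0]]), np.eye(2), -np.eye(2)]   # ||A_y||_inf = 1
c = np.array([[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]])

def H_matrix(x):
    return np.stack([A[y] @ x + c[y] for y in range(S)])           # |Y| x d matrix H_x

rng = np.random.default_rng(1)
x, x2 = rng.standard_normal(2), rng.standard_normal(2)
nu, nu2 = D @ H_matrix(x), D @ H_matrix(x2)

h = pi @ H_matrix(x)                                               # h(x) = E_pi[H(x, Y)]
assert np.allclose(nu - P @ nu, H_matrix(x) - np.outer(np.ones(S), h))

C = np.abs(D).sum(axis=1).max()                                    # ||D||_inf
print(np.abs(nu - nu2).max(), "<=", C * np.abs(x - x2).max())      # instance of (127)
```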

Lemma B.5.

We have for any y𝒴y\in\mathcal{Y},

ν(xn,y)ζB.5τn,\norm{\nu{\left(x_{n},y\right)}}\leq\zeta_{\ref{lem:v_norm}}\tau_{n}, (138)

where $\zeta_{B.5}$ is a possibly sample-path dependent constant. Additionally, we have

𝔼[ν(xn,y)]\displaystyle\mathbb{E}\quantity[\norm{\nu{\left(x_{n},y\right)}}] CB.5τn,\displaystyle\leq C_{\ref{lem:v_norm}}\tau_{n}, (139)

where CB.5C_{\ref{lem:v_norm}} is a deterministic constant.

Proof.

Having proven that ν(x,y)\nu{\left(x,y\right)} is Lipschitz continuous in xx in Lemma B.4, we have

ν(xn,y)\displaystyle\norm{\nu{\left(x_{n},y\right)}} CB.4(CB.4+xn),\displaystyle\leq C_{\ref{lem:v Lipschitz}}(C_{\ref{lem:v Lipschitz}}^{\prime}+\norm{x_{n}}), (Lemma B.4) (140)
CB.4(CB.4+CB.3τn+k=1nαkϵk(1)).\displaystyle\leq C_{\ref{lem:v Lipschitz}}\quantity(C_{\ref{lem:v Lipschitz}}^{\prime}+C_{\ref{lem:xn_norm}}\tau_{n}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}). (Lemma B.3) (141)
=𝒪(τn+k=1nαkϵk(1)).\displaystyle=\mathcal{O}\quantity(\tau_{n}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}). (142)

Since (10) in Assumption 2.5 assures us that k=1αkϵk(1)\sum_{k=1}^{\infty}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}} is finite almost surely while τn\tau_{n} is monotonically increasing, then there exists some possibly sample-path dependent constant ζB.5\zeta_{\ref{lem:v_norm}} such that

ν(xn,y)ζB.5τn.\norm{\nu{\left(x_{n},y\right)}}\leq\zeta_{\ref{lem:v_norm}}\tau_{n}. (143)

We can also prove a deterministic bound on the expectation of $\norm{\nu\quantity(x_{n},y)}$,

𝔼[ν(xn,y)]\displaystyle\mathbb{E}\quantity[\norm{\nu{\left(x_{n},y\right)}}] =𝒪(𝔼[τn+k=1nαkϵk(1)]),\displaystyle=\mathcal{O}\quantity(\mathbb{E}\quantity[\tau_{n}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}]), (144)
=𝒪(τn+k=1nαk𝔼[ϵk(1)]).\displaystyle=\mathcal{O}\quantity(\tau_{n}+\sum_{k=1}^{n}\alpha_{k}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}]). (145)

By (106) in Lemma B.2, it is easy to see that $\sup_{n}\sum_{k=1}^{n}\alpha_{k}\mathbb{E}\quantity[\norm{\epsilon^{(1)}_{k}}]<\infty$. Therefore, there exists some deterministic constant $C_{B.5}$ such that

𝔼[ν(xn,y)]CB.5τn.\mathbb{E}\quantity[\norm{\nu{\left(x_{n},y\right)}}]\leq C_{\ref{lem:v_norm}}\tau_{n}. (146)

Although the two statements in Lemma B.5 appear similar, their difference is crucial. Assumption 2.5 and (10) only ensure the existence of a sample-path dependent constant ζB.5\zeta_{\ref{lem:v_norm}} but its form is unknown, preventing its use for expectations or explicit bounds. In contrast, using (11) from Assumption 2.5, we derive a universal constant CB.5C_{\ref{lem:v_norm}}.

Lemma B.6.

For the sequence $\quantity{M_{n}}$ defined in (15), we have

Mn+1ζB.6τn,\norm{M_{n+1}}\leq\zeta_{\ref{lem:M_norm_bound}}\tau_{n}, (147)

where $\zeta_{B.6}$ is a sample-path dependent constant.

Proof.

Applying \norm{\cdot} to (15) gives

Mn+1\displaystyle\norm{M_{n+1}} =ν(xn,Yn+2)Pν(xn,Yn+1),\displaystyle=\norm{\nu{\left(x_{n},Y_{n+2}\right)}-P\nu{\left(x_{n},Y_{n+1}\right)}}, (148)
Pν(xn,Yn+1)+ν(xn,Yn+2),\displaystyle\leq\norm{P\nu{\left(x_{n},Y_{n+1}\right)}}+\norm{\nu{\left(x_{n},Y_{n+2}\right)}}, (149)
=y𝒴P(Yn+1,y)ν(xn,y)+ν(xn,Yn+2),\displaystyle=\norm{\sum_{y^{\prime}\in\mathcal{Y}}P(Y_{n+1},y^{\prime})\nu(x_{n},y^{\prime})}+\norm{\nu{\left(x_{n},Y_{n+2}\right)}}, (150)
y𝒴P(Yn+1,y)ν(xn,y)+ν(xn,Yn+2),\displaystyle\leq\sum_{y^{\prime}\in\mathcal{Y}}\norm{P(Y_{n+1},y^{\prime})\nu(x_{n},y^{\prime})}+\norm{\nu{\left(x_{n},Y_{n+2}\right)}}, (151)
\displaystyle\leq\quantity(\max_{y\in\mathcal{Y}}\norm{\nu(x_{n},y)})\sum_{y^{\prime}\in\mathcal{Y}}P(Y_{n+1},y^{\prime})+\norm{\nu(x_{n},Y_{n+2})}, (152)
\displaystyle\leq 2\max_{y\in\mathcal{Y}}\norm{\nu(x_{n},y)}. (153)

Under Assumption 2.5, we can apply the sample-path dependent bound from Lemma B.5,

Mn+1\displaystyle\norm{M_{n+1}} 2ζB.5τn,\displaystyle\leq 2\zeta_{\ref{lem:v_norm}}\tau_{n}, (Lemma B.5) (154)
=ζB.6τn,\displaystyle=\zeta_{\ref{lem:M_norm_bound}}\tau_{n}, (155)

with ζB.62ζB.5\zeta_{\ref{lem:M_norm_bound}}\doteq 2\zeta_{\ref{lem:v_norm}}. ∎

Lemma B.7.

For the sequence $\quantity{M_{n}}$ defined in (15), we have

𝔼[Mn+12n+1]CB.7(1+xn2),\mathbb{E}\quantity[\norm{M_{n+1}}^{2}\mid\mathcal{F}_{n+1}]\leq C_{\ref{lem: M second moment}}^{\prime}(1+\norm{x_{n}}^{2}), (156)

and

𝔼[Mn+122]CB.72τn2,\mathbb{E}\quantity[\norm{M_{n+1}}_{2}^{2}]\leq C_{\ref{lem: M second moment}}^{2}\tau_{n}^{2}, (157)

where CB.7C_{\ref{lem: M second moment}}^{\prime} and CB.7C_{\ref{lem: M second moment}} are deterministic constants and

n+1σ(x0,Y1,,Yn+1)\displaystyle\mathcal{F}_{n+1}\doteq\sigma(x_{0},Y_{1},\dots,Y_{n+1}) (158)

is the σ\sigma-algebra until time n+1n+1.

Proof.

First, to prove (156), we have

𝔼[Mn+12n+1]\displaystyle\mathbb{E}\quantity[\norm{M_{n+1}}^{2}\mid\mathcal{F}_{n+1}] 4maxy𝒴ν(xn,y)2=𝒪(1+xn2),\displaystyle\leq 4\max_{y\in\mathcal{Y}}\norm{\nu(x_{n},y)}^{2}=\mathcal{O}\quantity(1+\norm{x_{n}}^{2}), (159)

where the first inequality results from (153) in Lemma B.6 and the second bound follows from Lemma B.4.

Then, to prove (157), from Lemmas B.3 and B.4 we have,

\displaystyle\mathbb{E}\quantity[\norm{\nu\quantity(x_{n},y)}^{2}]=\mathcal{O}\quantity(\mathbb{E}\quantity[\quantity(1+C_{\ref{lem:xn_norm}}\tau_{n}+\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}})^{2}])=\mathcal{O}\quantity(\tau_{n}^{2}+\mathbb{E}\quantity[\quantity(\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}})^{2}]). (160)

Recall that by Assumption 2.5, 𝔼[ϵk(1)2]=𝒪(1k)\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}^{2}]=\mathcal{O}\quantity(\frac{1}{k}). Examining the right-most term we then have,

𝔼[(k=1nαkϵk(1))2]\displaystyle\mathbb{E}\quantity[\quantity(\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}})^{2}] 𝔼[(k=1nαk)(k=1nαkϵk(1)2)],\displaystyle\leq\mathbb{E}\quantity[\quantity(\sum_{k=1}^{n}\alpha_{k})\quantity(\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}^{2})], (Cauchy-Schwarz) (161)
=𝒪(k=1nαk),\displaystyle=\mathcal{O}\quantity(\sum_{k=1}^{n}\alpha_{k}), (By (107) in Lemma B.2) (162)
=𝒪(11α1k=1nαk(1α1)),\displaystyle=\mathcal{O}\quantity(\frac{1}{1-\alpha_{1}}\sum_{k=1}^{n}\alpha_{k}(1-\alpha_{1})), (163)
=𝒪(k=1nαk(1αk));\displaystyle=\mathcal{O}\quantity(\sum_{k=1}^{n}\alpha_{k}(1-\alpha_{k})); (164)
=𝒪(τn).\displaystyle=\mathcal{O}\quantity(\tau_{n}). (165)

We then have

𝔼[ν(xn,y)2]=𝒪(τn2).\displaystyle\mathbb{E}\quantity[\norm{\nu\quantity(x_{n},y)}^{2}]=\mathcal{O}(\tau_{n}^{2}). (166)

Because our bound on 𝔼[ν(xn,y)2]\mathbb{E}\quantity[\norm{\nu\quantity(x_{n},y)}^{2}] is independent of yy, we have

𝔼[Mn+12]\displaystyle\mathbb{E}\quantity[\norm{M_{n+1}}^{2}] =𝒪(𝔼[ν(xn,y)2])=𝒪(τn2).\displaystyle=\mathcal{O}\quantity(\mathbb{E}\quantity[\norm{\nu(x_{n},y)}^{2}])=\mathcal{O}(\tau_{n}^{2}). (By (166)) (167)

Due to the equivalence of norms in finite-dimensional spaces, there exists a deterministic constant CB.7C_{\ref{lem: M second moment}} such that (157) holds. ∎

Now, we are ready to present four additional lemmas which we will use to bound the four noise terms in (27).

Lemma B.8.

With {M¯¯n}\quantity{{\overline{\overline{M}}}_{n}} defined in (27),

limnM¯¯n<, a.s. \lim_{n\rightarrow\infty}{\overline{\overline{M}}}_{n}<\infty,\mbox{\quad a.s.\quad} (168)
Proof.

We first observe that the sequence {M¯¯n}\quantity{{\overline{\overline{M}}}_{n}} defined in (27) is positive and monotonically increasing. Therefore by the monotone convergence theorem, it converges almost surely to a (possibly infinite) limit which we denote as,

M¯¯limnM¯¯n a.s. {\overline{\overline{M}}}_{\infty}\doteq\lim_{n\rightarrow\infty}{\overline{\overline{M}}}_{n}\mbox{\quad a.s.\quad} (169)

Then, we will utilize a generalization of Lebesgue’s monotone convergence theorem (Lemma A.4) to prove that the limit M¯¯{\overline{\overline{M}}}_{\infty} is finite almost surely. From Lemma A.4, we see that

𝔼[M¯¯]=limn𝔼[M¯¯n].\displaystyle\mathbb{E}\quantity[{\overline{\overline{M}}}_{\infty}]=\lim_{n\rightarrow\infty}\mathbb{E}\quantity[{\overline{\overline{M}}}_{n}]. (170)

Therefore, to prove that M¯¯{\overline{\overline{M}}}_{\infty} is almost surely finite, it is sufficient to prove that limn𝔼[M¯¯n]<\lim_{n\rightarrow\infty}\mathbb{E}\quantity[{\overline{\overline{M}}}_{n}]<\infty. To this end, we proceed by bounding the expectation of {M¯¯n}\quantity{{\overline{\overline{M}}}_{n}}, by first starting with {M¯n}\quantity{{\overline{M}}_{n}} from (25). We have,

𝔼[M¯n]\displaystyle\mathbb{E}\left[\norm{{\overline{M}}_{n}}\right] =𝔼[i=1nαi,nMi],\displaystyle=\mathbb{E}\left[\norm{\sum_{i=1}^{n}\alpha_{i,n}M_{i}}\right], (171)
=𝒪(𝔼[i=1nαi,nMi22]),\displaystyle=\mathcal{O}\quantity(\sqrt{\mathbb{E}\left[\norm{\sum_{i=1}^{n}\alpha_{i,n}M_{i}}_{2}^{2}\right]}), (Jensen’s Ineq.) (172)
=𝒪(i=1nαi,n2𝔼[Mi22]),\displaystyle=\mathcal{O}\quantity(\sqrt{\sum_{i=1}^{n}\alpha_{i,n}^{2}\mathbb{E}\left[\norm{M_{i}}_{2}^{2}\right]}), (MiM_{i} is a Martingale Difference Series) (173)
\displaystyle=\mathcal{O}\quantity(\sqrt{\sum_{i=1}^{n}\alpha_{i,n}^{2}\tau_{i-1}^{2}}), (Lemma B.7) (174)

Then using the definition of {M¯¯n}\quantity{{\overline{\overline{M}}}_{n}} from (27), we have

𝔼[M¯¯n]\displaystyle\mathbb{E}\quantity[{\overline{\overline{M}}}_{n}] =i=1nαi𝔼[M¯i1]=𝒪(i=1nαij=1i1αj,i12τj12).\displaystyle=\sum_{i=1}^{n}\alpha_{i}\mathbb{E}\left[\norm{{\overline{M}}_{i-1}}\right]=\mathcal{O}\quantity(\sum_{i=1}^{n}\alpha_{i}\sqrt{\sum_{j=1}^{i-1}\alpha_{j,i-1}^{2}\tau_{j-1}^{2}}). (175)

Then, by (79) in Lemma B.1, we have

supn𝔼[M¯¯n]<,\sup_{n}\mathbb{E}\quantity[{\overline{\overline{M}}}_{n}]<\infty, (176)

and since {𝔼[M¯¯n]}\quantity{\mathbb{E}\quantity[{\overline{\overline{M}}}_{n}]} is also monotonically increasing, we have

limn𝔼[M¯¯n]<,\lim_{n\rightarrow\infty}\mathbb{E}\quantity[{\overline{\overline{M}}}_{n}]<\infty, (177)

which implies that M¯¯<{\overline{\overline{M}}}_{\infty}<\infty almost surely. ∎
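
The key step (173) uses only the orthogonality of martingale-difference increments: for deterministic weights $a_{i}$, $\mathbb{E}\norm{\sum_{i}a_{i}M_{i}}_{2}^{2}=\sum_{i}a_{i}^{2}\mathbb{E}\norm{M_{i}}_{2}^{2}$. The short Monte Carlo sketch below illustrates this identity with toy i.i.d. zero-mean increments (a special case of a martingale difference sequence); it is not the $\quantity{M_{n}}$ of the paper.

```python
# Monte Carlo illustration of the orthogonality identity behind (173):
# E||sum_i a_i M_i||_2^2 = sum_i a_i^2 E||M_i||_2^2 for a martingale difference
# sequence {M_i} and deterministic weights a_i. Toy i.i.d. zero-mean increments are used.
import numpy as np

rng = np.random.default_rng(3)
n, d, trials = 50, 4, 20000
a = 1.0 / (np.arange(1, n + 1) + 1) ** 0.9               # e.g. a_i = alpha_i
M = rng.standard_normal((trials, n, d))                  # E[M_i | past] = 0

weighted_sum = (a[None, :, None] * M).sum(axis=1)        # sum_i a_i M_i, per trial
lhs = np.mean(np.sum(weighted_sum**2, axis=1))
rhs = np.sum(a**2 * np.mean(np.sum(M**2, axis=2), axis=0))
print(lhs, rhs)                                          # agree up to Monte Carlo error
```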

Lemma B.9.

With {ϵ¯¯n(1)}\quantity{{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}} defined in (27),

limnϵ¯¯n(1)<, a.s. \lim_{n\rightarrow\infty}{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}<\infty,\ \mbox{\quad a.s.\quad} (178)
Proof.

We first observe that the sequence {ϵ¯¯n(1)}\quantity{{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}} defined in (27) is positive and monotonically increasing. Therefore by the monotone convergence theorem, it converges almost surely to a (possibly infinite) limit which we denote as,

ϵ¯¯(1)limnϵ¯¯n(1) a.s. {\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{\infty}\doteq\lim_{n\rightarrow\infty}{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}\mbox{\quad a.s.\quad} (179)

Then, we utilize a generalization of Lebesgue’s monotone convergence theorem (Lemma A.4) to prove that the limit ϵ¯¯(1){\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{\infty} is finite almost surely. By Lemma A.4, we have

𝔼[ϵ¯¯(1)]=limn𝔼[ϵ¯¯n(1)].\displaystyle\mathbb{E}\quantity[{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{\infty}]=\lim_{n\rightarrow\infty}\mathbb{E}\quantity[{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}]. (180)

Therefore, to prove that ϵ¯¯(1){\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{\infty} is almost surely finite, it is sufficient to prove that limn𝔼[ϵ¯¯n(1)]<\lim_{n\rightarrow\infty}\mathbb{E}\quantity[{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}]<\infty. To this end, we proceed by bounding the expectation of {ϵ¯¯n(1)}\quantity{{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}},

𝔼[ϵ¯¯n(1)]=i=1nαi𝔼[ϵ¯i1(1)]i=1nαij=1i1αj,i1𝔼[ϵj(1)].\displaystyle\mathbb{E}\left[{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}\right]=\sum_{i=1}^{n}\alpha_{i}\mathbb{E}\left[\norm{{\overline{\epsilon}}^{{\left(1\right)}}_{i-1}}\right]\leq\sum_{i=1}^{n}\alpha_{i}\sum_{j=1}^{i-1}\alpha_{j,i-1}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{j}}]. (181)

Then, by (109) in Lemma B.2, we have,

supn𝔼[ϵ¯¯n(1)]<,\sup_{n}\mathbb{E}\quantity[{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}]<\infty, (182)

and since {𝔼[ϵ¯¯n(1)]}\quantity{\mathbb{E}\quantity[{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}]} is also monotonically increasing, we have

limn𝔼[ϵ¯¯n(1)]<.\lim_{n\rightarrow\infty}\mathbb{E}\quantity[{\overline{\overline{\epsilon}}}^{{\left(1\right)}}_{n}]<\infty. (183)

which implies that ${\overline{\overline{\epsilon}}}^{(1)}_{\infty}<\infty$ almost surely. ∎

Lemma B.10.

With {ϵ¯¯n(3)}\quantity{{\overline{\overline{\epsilon}}}^{{\left(3\right)}}_{n}} defined in (27), we have

limnϵ¯¯n(3)<, a.s. \lim_{n\rightarrow\infty}\ {\overline{\overline{\epsilon}}}^{{\left(3\right)}}_{n}<\infty,\mbox{\quad a.s.\quad} (184)
Proof.

Beginning with the definition of ϵ¯n(3){\overline{\epsilon}}^{{\left(3\right)}}_{n} in (25), we have

ϵ¯n(3)\displaystyle\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{n}} =i=1nαi,n(ν(xi,Yi+1)ν(xi1,Yi+1)),\displaystyle=\norm{\sum_{i=1}^{n}\alpha_{i,n}{\left(\nu{\left(x_{i},Y_{i+1}\right)}-\nu{\left(x_{i-1},Y_{i+1}\right)}\right)}}, (185)
i=1nαi,nν(xi,Yi+1)ν(xi1,Yi+1),\displaystyle\leq\sum_{i=1}^{n}\alpha_{i,n}\norm{\nu{\left(x_{i},Y_{i+1}\right)}-\nu{\left(x_{i-1},Y_{i+1}\right)}}, (186)
CB.4i=1nαi,nxixi1,\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{i=1}^{n}\alpha_{i,n}\norm{x_{i}-x_{i-1}}, (Lemma B.4) (187)
CB.4i=1nαi,nαi(H(xi1,Yi)+xi1+ϵi(1)),\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{i=1}^{n}\alpha_{i,n}\alpha_{i}\quantity(\norm{H{\left(x_{i-1},Y_{i}\right)}}+\norm{x_{i-1}}+\norm{{\epsilon}^{{\left(1\right)}}_{i}}), (By (SKM with Markovian and Additive Noise)) (188)
CB.4i=1nαi,nαi(2xi1+CH+ϵi(1)),\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{i=1}^{n}\alpha_{i,n}\alpha_{i}\quantity(2\norm{x_{i-1}}+C_{H}+\norm{{\epsilon}^{{\left(1\right)}}_{i}}), (By (3)) (189)
CB.4i=1nαi,nαi(2CB.3τi1+2k=1i1αkϵk(1)+CH+ϵi(1)),\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{i=1}^{n}\alpha_{i,n}\alpha_{i}\quantity(2C_{\ref{lem:xn_norm}}\tau_{i-1}+2\sum_{k=1}^{i-1}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}+C_{H}+\norm{{\epsilon}^{{\left(1\right)}}_{i}}), (Lemma B.3) (190)

Because Assumption 2.5 assures us that $\sum_{k=1}^{\infty}\alpha_{k}\norm{\epsilon^{(1)}_{k}}$ is almost surely finite, there exists some sample-path dependent constant, which we denote as $\zeta_{B.10}$, such that

ϵ¯n(3)\displaystyle\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{n}} ζB.10i=1nαi,nαi(τi1+ϵi(1)),\displaystyle\leq\zeta_{\ref{lem:sup_e3}}\sum_{i=1}^{n}\alpha_{i,n}\alpha_{i}\quantity(\tau_{i-1}+\norm{{\epsilon}^{{\left(1\right)}}_{i}}), (Assumption 2.5) (191)
ζB.10(i=1nαi,nαiτi+i=1nαi,nαiϵi(1)),\displaystyle\leq\zeta_{\ref{lem:sup_e3}}\quantity(\sum_{i=1}^{n}\alpha_{i,n}\alpha_{i}\tau_{i}+\sum_{i=1}^{n}\alpha_{i,n}\alpha_{i}\norm{{\epsilon}^{{\left(1\right)}}_{i}}), (τi\tau_{i} is increasing) (192)
ζB.10αn(i=1nαiτi+i=1nαiϵi(1)).\displaystyle\leq\zeta_{\ref{lem:sup_e3}}\alpha_{n}\quantity(\sum_{i=1}^{n}\alpha_{i}\tau_{i}+\sum_{i=1}^{n}\alpha_{i}\norm{{\epsilon}^{{\left(1\right)}}_{i}}). (Lemma A.2).\displaystyle\text{(Lemma \ref{lem:bravo b1})}. (193)

Again, from Assumption 2.5 we can conclude that there exists some other sample-path dependent constant, which we denote as $\zeta_{B.10}^{\prime}$, such that

ϵ¯n(3)ζB.10αni=1nαiτi.\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{n}}\leq\zeta_{\ref{lem:sup_e3}}^{\prime}\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i}. (194)

Therefore, from the definition of ${\overline{\overline{\epsilon}}}^{(3)}_{n}$ in (27),

ϵ¯¯n(3)ζB.10i=1nαi2j=1i1αjτj.{\overline{\overline{\epsilon}}}^{{\left(3\right)}}_{n}\leq\zeta_{\ref{lem:sup_e3}}^{\prime}\sum_{i=1}^{n}\alpha_{i}^{2}\sum_{j=1}^{i-1}\alpha_{j}\tau_{j}. (195)

So, by (78) in Lemma B.1

supnϵ¯¯n(3)supnζB.10i=1nαi2j=1i1αjτj< a.s. \sup_{n}\ {\overline{\overline{\epsilon}}}^{{\left(3\right)}}_{n}\leq\sup_{n}\ \zeta_{\ref{lem:sup_e3}}^{\prime}\sum_{i=1}^{n}\alpha_{i}^{2}\sum_{j=1}^{i-1}\alpha_{j}\tau_{j}<\infty\mbox{\quad a.s.\quad} (196)

Then, the monotone convergence theorem proves the lemma. ∎

To prove that (23) holds almost surely, we introduce four lemmas, which we will subsequently use alongside an extension of Theorem 2.1 from (Borkar, 2009) presented in Appendix D.

Lemma B.11.

We have

supnk=1nαkMk< a.s. \sup_{n}\norm{\sum_{k=1}^{n}\alpha_{k}M_{k}}<\infty\mbox{\quad a.s.\quad} (197)
Proof.

Recall that $\quantity{M_{k}}$ is a martingale difference sequence. Then, the martingale sequence

{k=1nαkMk}\quantity{\sum_{k=1}^{n}\alpha_{k}M_{k}} (198)

is bounded in L2L^{2} with,

𝔼[k=1nαkMk2]\displaystyle\mathbb{E}\left[\norm{\sum_{k=1}^{n}\alpha_{k}M_{k}}_{2}\right] 𝔼[k=1nαkMk22],\displaystyle\leq\sqrt{\mathbb{E}\left[\norm{\sum_{k=1}^{n}\alpha_{k}M_{k}}_{2}^{2}\right]}, (Jensen’s Ineq.) (199)
=k=1nαk2𝔼[Mk22],\displaystyle=\sqrt{\sum_{k=1}^{n}\alpha_{k}^{2}\mathbb{E}\left[\norm{M_{k}}_{2}^{2}\right]}, (MiM_{i} is a Martingale Difference Series) (200)
CB.7k=1nαk2τk2.\displaystyle\leq C_{\ref{lem: M second moment}}\sqrt{\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}^{2}}. (Lemma B.7) (201)

Lemma B.1 then gives

supnCB.7k=1nαk2τk2\displaystyle\sup_{n}\ C_{\ref{lem: M second moment}}\sqrt{\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}^{2}} <\displaystyle<\infty (202)

Doob’s martingale convergence theorem implies that {k=1nαkMk}\quantity{\sum_{k=1}^{n}\alpha_{k}M_{k}} converges to an almost surely finite random variable, which proves the lemma. ∎

Lemma B.12.

We have,

supnk=1nαkϵk(2)< a.s.\displaystyle\sup_{n}\norm{\sum_{k=1}^{n}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k}}<\infty\mbox{\quad a.s.\quad} (203)
Proof.

Utilizing the definition of ϵk(2){\epsilon}^{{\left(2\right)}}_{k} in (16), we have

k=1nαkϵk(2)\displaystyle\sum_{k=1}^{n}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k} =k=1nαk(ν(xk,Yk+1)ν(xk1,Yk)),\displaystyle=-\sum_{k=1}^{n}\alpha_{k}{\left(\nu{\left(x_{k},Y_{k+1}\right)}-\nu{\left(x_{k-1},Y_{k}\right)}\right)}, (204)
=k=1nαkν(xk,Yk+1)αk1ν(xk1,Yk)+αk1ν(xk1,Yk)αkν(xk1,Yk),\displaystyle=-\sum_{k=1}^{n}\alpha_{k}\nu{\left(x_{k},Y_{k+1}\right)}-\alpha_{k-1}\nu{\left(x_{k-1},Y_{k}\right)}+\alpha_{k-1}\nu{\left(x_{k-1},Y_{k}\right)}-\alpha_{k}\nu{\left(x_{k-1},Y_{k}\right)}, (205)
=αnν(xn,Yn+1)k=1n(αk1αk)ν(xk1,Yk).\displaystyle=-\alpha_{n}\nu{\left(x_{n},Y_{n+1}\right)}-\sum_{k=1}^{n}{\left(\alpha_{k-1}-\alpha_{k}\right)}\nu{\left(x_{k-1},Y_{k}\right)}. (α0\alpha_{0} = 0) (206)

The triangle inequality gives

k=1nαkϵk(2)\displaystyle\norm{\sum_{k=1}^{n}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k}} αnν(xn,Yn+1)+k=1n|αk1αk|ν(xk1,Yk),\displaystyle\leq\alpha_{n}\norm{\nu{\left(x_{n},Y_{n+1}\right)}}+\sum_{k=1}^{n}\left|\alpha_{k-1}-\alpha_{k}\right|\norm{\nu{\left(x_{k-1},Y_{k}\right)}}, (207)
ζB.5(αnτn+k=1n|αk1αk|τk1),\displaystyle\leq\zeta_{\ref{lem:v_norm}}{\left(\alpha_{n}\tau_{n}+\sum_{k=1}^{n}\left|\alpha_{k-1}-\alpha_{k}\right|\tau_{k-1}\right)}, (Lemma B.5) (208)
=ζB.5(αnτn+α1τ1+k=1n1|αkαk+1|τk)\displaystyle=\zeta_{\ref{lem:v_norm}}\quantity(\alpha_{n}\tau_{n}+\alpha_{1}\tau_{1}+\sum_{k=1}^{n-1}\absolutevalue{\alpha_{k}-\alpha_{k+1}}\tau_{k}) (α00).\displaystyle\text{($\alpha_{0}\doteq 0$)}. (209)

It is easy to see that $\lim_{n\rightarrow\infty}\alpha_{n}\tau_{n}=0$, and $\alpha_{1}\tau_{1}$ is simply a deterministic and finite constant. Therefore, by (77) in Lemma B.1 we have

supnk=1n|αkαk+1|τk\displaystyle\sup_{n}\sum_{k=1}^{n}\absolutevalue{\alpha_{k}-\alpha_{k+1}}\tau_{k} < a.s.\displaystyle<\infty\mbox{\quad a.s.\quad} (210)

which proves the lemma. ∎

Lemma B.13.

We have,

supnk=1nαkϵk(3)< a.s. \sup_{n}\norm{\sum_{k=1}^{n}\alpha_{k}{\epsilon}^{{\left(3\right)}}_{k}}<\infty\mbox{\quad a.s.\quad} (211)
Proof.

Utilizing the definition of ϵk(3){\epsilon}^{{\left(3\right)}}_{k} in (17), we have

k=1nαkϵk(3)\displaystyle\norm{\sum_{k=1}^{n}\alpha_{k}{\epsilon}^{{\left(3\right)}}_{k}} =k=1nαk(ν(xk,Yk+1)ν(xk1,Yk+1)),\displaystyle=\norm{\sum_{k=1}^{n}\alpha_{k}\quantity(\nu\quantity(x_{k},Y_{k+1})-\nu\quantity(x_{k-1},Y_{k+1}))}, (212)
k=1nαkν(xk,Yk+1)ν(xk1,Yk+1),\displaystyle\leq\sum_{k=1}^{n}\alpha_{k}\norm{\nu\quantity(x_{k},Y_{k+1})-\nu\quantity(x_{k-1},Y_{k+1})}, (213)
CB.4k=1nαkxkxk1,\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{k=1}^{n}\alpha_{k}\norm{x_{k}-x_{k-1}}, (Lemma B.4) (214)
CB.4k=1nαk2(H(xk1,Yk)+xk1+ϵk(1)),\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{k=1}^{n}\alpha_{k}^{2}\quantity(\norm{H{\left(x_{k-1},Y_{k}\right)}}+\norm{x_{k-1}}+\norm{{\epsilon}^{{\left(1\right)}}_{k}}), (215)
(By (SKM with Markovian and Additive Noise)) (216)
CB.4k=1nαk2(2xk1+CH+ϵk(1)),\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{k=1}^{n}\alpha_{k}^{2}\quantity(2\norm{x_{k-1}}+C_{H}+\norm{{\epsilon}^{{\left(1\right)}}_{k}}), (By (3)) (217)
CB.4k=1nαk2(2CB.3τk1+2i=1k1αiϵi(1)+CH+ϵk(1)).\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{k=1}^{n}\alpha_{k}^{2}\quantity(2C_{\ref{lem:xn_norm}}\tau_{k-1}+2\sum_{i=1}^{k-1}\alpha_{i}\norm{{\epsilon}^{{\left(1\right)}}_{i}}+C_{H}+\norm{{\epsilon}^{{\left(1\right)}}_{k}}). (Lemma B.3) (218)

Because Assumption 2.5 assures us that $\sum_{k=1}^{\infty}\alpha_{k}\norm{\epsilon^{(1)}_{k}}$ is almost surely finite, there exists some sample-path dependent constant, which we denote as $\zeta_{B.13}$, such that

k=1nαkϵk(3)\displaystyle\norm{\sum_{k=1}^{n}\alpha_{k}{\epsilon}^{{\left(3\right)}}_{k}} ζB.13k=1nαk2(τk1+ϵk(1)),\displaystyle\leq\zeta_{\ref{lem:ak_e3}}\sum_{k=1}^{n}\alpha_{k}^{2}\quantity(\tau_{k-1}+\norm{{\epsilon}^{{\left(1\right)}}_{k}}), (Assumption 2.5) (219)
ζB.13(k=1nαk2τk+k=1nαk2ϵk(1)),\displaystyle\leq\zeta_{\ref{lem:ak_e3}}\quantity(\sum_{k=1}^{n}\alpha_{k}^{2}\tau_{k}+\sum_{k=1}^{n}\alpha_{k}^{2}\norm{{\epsilon}^{{\left(1\right)}}_{k}}), (τk\tau_{k} is increasing) (220)

Lemma B.1 and Assumption 2.5 then prove the lemma. ∎

Lemma B.14.

Let UnU_{n} be the iterates defined in (21). Then if supnUn<\sup_{n}\norm{U_{n}}<\infty, we have Un0U_{n}\rightarrow 0 almost surely.

Proof.

We use a stochastic approximation argument to show that $U_{n}\rightarrow 0$ almost surely. This convergence is given by a generalization of Theorem 2.1 of (Borkar, 2009), which we present as Theorem D.6 in Appendix D for completeness.

We now verify the assumptions of Theorem D.6. Beginning with the definition of ξk\xi_{k} in (18), we have

limnsupjnk=njαkξk\displaystyle\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}\xi_{k}} =limnsupjnk=njαk(ϵk(1)+ϵk(2)+ϵk(3)),\displaystyle=\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\left({\epsilon}^{{\left(1\right)}}_{k}+{\epsilon}^{{\left(2\right)}}_{k}+{\epsilon}^{{\left(3\right)}}_{k}\right)}}, (221)
limnsupjnk=njαkϵk(1)S1+limnsupjnk=njαkϵk(2)S2+limnsupjnk=njαkϵk(3)S3.\displaystyle\leq\underbrace{\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(1\right)}}_{k}}}_{S_{1}}+\underbrace{\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k}}}_{S_{2}}+\underbrace{\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(3\right)}}_{k}}}_{S_{3}}. (222)

We now bound the three terms in the RHS.

For S1S_{1}, we have

limnsupjnk=njαkϵk(1)limnsupjnk=njαkϵk(1)limnk=nαkϵk(1)=0,\displaystyle\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(1\right)}}_{k}}\leq\lim_{n\to\infty}\sup_{j\geq n}\sum_{k=n}^{j}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}\leq\lim_{n\to\infty}\sum_{k=n}^{\infty}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}}=0, (223)

where we have used the fact that the series k=1nαkϵk(1)\sum_{k=1}^{n}\alpha_{k}\norm{{\epsilon}^{{\left(1\right)}}_{k}} converges by Assumption 2.5 almost surely.

For S2S_{2}, from (206) in Lemma B.12, we have

k=njαkϵk(2)\displaystyle\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k} =k=1jαkϵk(2)k=1n1αkϵk(2),\displaystyle=\sum_{k=1}^{j}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k}-\sum_{k=1}^{n-1}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k}, (224)
\displaystyle=\alpha_{n-1}\nu(x_{n-1},Y_{n})-\alpha_{j}\nu(x_{j},Y_{j+1})-\sum_{k=n}^{j}(\alpha_{k-1}-\alpha_{k})\nu(x_{k-1},Y_{k}). (225)

Taking the norm and applying the triangle inequality, we have

\displaystyle\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k}}\leq\lim_{n\to\infty}\sup_{j\geq n}\bigg{(}\alpha_{n-1}\norm{\nu(x_{n-1},Y_{n})}+\alpha_{j}\norm{\nu(x_{j},Y_{j+1})} (226)
+k=nj(αk1αk)ν(xk1,Yk)),\displaystyle\quad+\sum_{k=n}^{j}\norm{(\alpha_{k-1}-\alpha_{k})\nu(x_{k-1},Y_{k})}\bigg{)}, (227)
limnsupjnζB.5(αn1τn1+αjτj+k=n|αk1αk|τk1),\displaystyle\leq\lim_{n\to\infty}\sup_{j\geq n}\zeta_{\ref{lem:v_norm}}\quantity(\alpha_{n-1}\tau_{n-1}+\alpha_{j}\tau_{j}+\sum_{k=n}^{\infty}\absolutevalue{\alpha_{k-1}-\alpha_{k}}\tau_{k-1}), (Lemma B.5) (228)

where the last inequality holds because k=nj|αk1αk|τk1\sum_{k=n}^{j}\absolutevalue{\alpha_{k-1}-\alpha_{k}}\tau_{k-1} is monotonically increasing. Note that

αnτn={𝒪(n12b)if45<b<1,𝒪(lognn)ifb=1.\displaystyle\alpha_{n}\tau_{n}=\begin{cases}\mathcal{O}\quantity(n^{1-2b})&\text{if}\quad\frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\frac{\log n}{n})&\text{if}\quad b=1.\end{cases} (229)

Since $j\geq n$ and $\alpha_{k}\tau_{k}$ is eventually decreasing by (229), we then have

limnsupjnk=njαkϵk(2)\displaystyle\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(2\right)}}_{k}} limnζB.5(2αn1τn1+k=n|αk1αk|τk1)=0\displaystyle\leq\lim_{n\to\infty}\zeta_{\ref{lem:v_norm}}\quantity(2\alpha_{n-1}\tau_{n-1}+\sum_{k=n}^{\infty}\absolutevalue{\alpha_{k-1}-\alpha_{k}}\tau_{k-1})=0 (230)

where we used the fact that (77) in Lemma B.1 and the monotone convergence theorem prove that the series k=1n|αkαk+1|τk\sum_{k=1}^{n}\absolutevalue{\alpha_{k}-\alpha_{k+1}}\tau_{k} converges almost surely.

For S3S_{3}, following the steps in Lemma B.13 (which we omit to avoid repetition), we have,

limnsupjnk=njαkϵk(3)\displaystyle\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(3\right)}}_{k}} limnsupjnζB.13(k=njαk2τk+k=njαk2ϵk(1)).\displaystyle\leq\lim_{n\to\infty}\sup_{j\geq n}\zeta_{\ref{lem:ak_e3}}\quantity(\sum_{k=n}^{j}\alpha_{k}^{2}\tau_{k}+\sum_{k=n}^{j}\alpha_{k}^{2}\norm{{\epsilon}^{{\left(1\right)}}_{k}}). (231)

which further implies that

limnsupjnk=njαkϵk(3)\displaystyle\lim_{n\to\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}{\epsilon}^{{\left(3\right)}}_{k}} limnζB.13(k=nαk2τk+k=nαk2ϵk(1))=0,\displaystyle\leq\lim_{n\to\infty}\zeta_{\ref{lem:ak_e3}}\quantity(\sum_{k=n}^{\infty}\alpha_{k}^{2}\tau_{k}+\sum_{k=n}^{\infty}\alpha_{k}^{2}\norm{{\epsilon}^{{\left(1\right)}}_{k}})=0, (232)

where we use the fact that, by (74) in Lemma B.1, Assumption 2.5, and the monotone convergence theorem, both series on the RHS converge almost surely. Therefore we have proven that,

limnsupjnk=njαkξk=0 a.s. \lim_{n\rightarrow\infty}\sup_{j\geq n}\norm{\sum_{k=n}^{j}\alpha_{k}\xi_{k}}=0\mbox{\quad a.s.\quad} (233)

thereby verifying Assumption D.1.

Assumption D.2 is satisfied by (6), which follows from Assumption 2.2. Assumption D.3 is clearly met by the definition of the deterministic learning rates in Assumption 2.4. Finally, Assumption D.4 holds because Lemma B.7 demonstrates that $\quantity{M_{n}}$ is a square-integrable martingale difference sequence.

Therefore, by Theorem D.6, the iterates {Un}\quantity{U_{n}} converge almost surely to a possibly sample-path dependent compact connected internally chain transitive set of the following ODE:

dU(t)dt=U(t).\displaystyle\derivative{U(t)}{t}=-U(t). (234)

Since the origin is the unique globally asymptotically stable equilibrium point of (234), we have that Un0U_{n}\rightarrow 0 almost surely. ∎

Lemma B.15.

With $\quantity{x_{n}}$ defined in (18) and $\quantity{U_{n}}$ defined in (21), if $\sum_{k=1}^{\infty}\alpha_{k}\norm{U_{k-1}}<\infty$ and $\lim_{n\rightarrow\infty}U_{n}=0$, then $\lim_{n\rightarrow\infty}x_{n}=x_{*}$ where $x_{*}\in\mathcal{X}_{*}$ is a possibly sample-path dependent fixed point.

Proof.

Following the approach of Bravo & Cominetti (2024), we utilize the estimate for inexact Krasnoselskii-Mann iterations of the form (IKM) presented in Lemma A.1 to prove the convergence of (SKM with Markovian and Additive Noise). Using the definition of {Un}\left\{U_{n}\right\} in (21), we then let z0=x0z_{0}=x_{0} and define znxnUnz_{n}\doteq x_{n}-U_{n}, which gives

zn+1\displaystyle z_{n+1} =(1αn+1)xn+αn+1(h(xn)+Mn+1+ξn+1)\displaystyle={\left(1-\alpha_{n+1}\right)}x_{n}+\alpha_{n+1}{\left(h(x_{n})+M_{n+1}+\xi_{n+1}\right)} (235)
((1αn+1)Un+αn+1(Mn+1+ξn+1))\displaystyle\quad\quad-{\left({\left(1-\alpha_{n+1}\right)}U_{n}+\alpha_{n+1}{\left(M_{n+1}+\xi_{n+1}\right)}\right)} (236)
=(1αn+1)zn+αn+1h(xn)\displaystyle=\quantity(1-\alpha_{n+1})z_{n}+\alpha_{n+1}h(x_{n}) (237)
=zn+αn+1(h(zn)zn+en+1)\displaystyle=z_{n}+\alpha_{n+1}{\left(h(z_{n})-z_{n}+e_{n+1}\right)} (238)

which matches the form of (IKM) with en=h(xn1)h(zn1)e_{n}=h{\left(x_{n-1}\right)}-h{\left(z_{n-1}\right)}. Due to the non-expansivity of hh from (6), we have

en+1=h(xn)h(zn)xnzn=Un\norm{e_{n+1}}=\norm{h\quantity(x_{n})-h\quantity(z_{n})}\leq\norm{x_{n}-z_{n}}=\norm{U_{n}} (239)

The convergence of xnx_{n} then follows directly from Lemma A.1 which gives limnzn=x\lim_{n\rightarrow\infty}z_{n}=x_{*} for some x𝒳x_{*}\in\mathcal{X}_{*}, and therefore limnxn=limnzn+Un=x\lim_{n\rightarrow\infty}x_{n}=\lim_{n\rightarrow\infty}z_{n}+U_{n}=x_{*}. We note that here ene_{n} is stochastic while the (IKM) result in Lemma A.1 considers deterministic noise. This means we apply Lemma A.1 for each sample path. ∎
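
The argument above matches (IKM) in the form $z_{n+1}=z_{n}+\alpha_{n+1}(h(z_{n})-z_{n}+e_{n+1})$. The following sketch runs this recursion for a toy nonexpansive map (a planar rotation, whose only fixed point is the origin; an invented example, not the operator $h$ of the paper) with errors satisfying $\sum_{n}\alpha_{n}\norm{e_{n}}<\infty$, and illustrates the conclusion of Lemma A.1 that $\norm{z_{n}-Tz_{n}}\rightarrow 0$ and $z_{n}$ approaches a fixed point.

```python
# Minimal sketch of the inexact Krasnoselskii-Mann recursion matched above,
# z_{n+1} = z_n + alpha_{n+1} (T z_n - z_n + e_{n+1}), for a toy nonexpansive T
# (a planar rotation, Fix(T) = {0}) and errors with sum_n alpha_n ||e_n|| < infinity.
import numpy as np

theta, b = 1.0, 0.85
T = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])      # an isometry, hence nonexpansive

rng = np.random.default_rng(2)
z = np.array([2.0, -1.0])
for n in range(1, 200001):
    alpha = 1.0 / (n + 1) ** b
    e = rng.standard_normal(2) / n                   # ||e_n|| -> 0 and sum_n alpha_n ||e_n|| < inf
    z = z + alpha * (T @ z - z + e)
    if n % 50000 == 0:
        print(n, np.linalg.norm(z - T @ z), np.linalg.norm(z))
# Both ||z_n - T z_n|| and the distance to the fixed point shrink toward zero,
# as Lemma A.1 predicts.
```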

Appendix C Additional Lemmas from Section 3

Corollary C.1.

We have

𝔼[M¯n]CC.1τnαn+1\mathbb{E}\quantity[\norm{{\overline{M}}_{n}}]\leq C_{\ref{lem: M rate}}\tau_{n}\sqrt{\alpha_{n+1}} (240)

where CC.1C_{\ref{lem: M rate}} is a deterministic constant.

Proof.

Starting from (174) from Lemma B.8 to avoid redundancy, we directly have

\displaystyle\mathbb{E}\left[\norm{{\overline{M}}_{n}}\right]=\mathcal{O}\quantity(\sqrt{\sum_{i=1}^{n}\alpha_{i,n}^{2}\tau_{i-1}^{2}}). (241)

Additionally, since $\tau_{i-1}\leq\tau_{n}$ and, by Lemma A.3, $\sum_{i=1}^{n}\alpha_{i,n}^{2}\leq\alpha_{n+1}$, we have $\sqrt{\sum_{i=1}^{n}\alpha_{i,n}^{2}\tau_{i-1}^{2}}\leq\tau_{n}\sqrt{\alpha_{n+1}}$. Therefore, there exists a deterministic constant such that the corollary holds. ∎

Corollary C.2.

We have

𝔼[ϵ¯n(2)]CC.2αnτn\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{n}}]\leq C_{\ref{lem: e2 rate}}\alpha_{n}\tau_{n} (242)

where CC.2C_{\ref{lem: e2 rate}} is a deterministic constant.

Proof.

Starting from (35) to avoid repetition, we have,

ϵ¯n(2)\displaystyle\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{n}} αnν(xn,Yn+1)+i=1n|αi1,nαi,n|ν(xi1,Yi).\displaystyle\leq\alpha_{n}\norm{\nu{\left(x_{n},Y_{n+1}\right)}}+\sum_{i=1}^{n}\left|\alpha_{i-1,n}-\alpha_{i,n}\right|\norm{\nu{\left(x_{i-1},Y_{i}\right)}}. (243)

Now we can take the expectation and apply the sample-path independent bound from Lemma B.5 with,

𝔼[ϵ¯n(2)]\displaystyle\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{n}}] CB.5(αnτn+i=1n|αi1,nαi,n|τi1)\displaystyle\leq C_{\ref{lem:v_norm}}{\left(\alpha_{n}\tau_{n}+\sum_{i=1}^{n}\left|\alpha_{i-1,n}-\alpha_{i,n}\right|\tau_{i-1}\right)} (Lemma B.5) (244)
=CB.5(αnτn+k=0n1|αk,nαk+1,n|τk)\displaystyle=C_{\ref{lem:v_norm}}{\left(\alpha_{n}\tau_{n}+\sum_{k=0}^{n-1}\left|\alpha_{k,n}-\alpha_{k+1,n}\right|\tau_{k}\right)} (245)

Lemma A.2 and the fact that $\tau_{k}$ is monotonically increasing for $k\geq 1$ yield,

𝔼[ϵ¯n(2)]\displaystyle\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{n}}] CB.5(αnτn+α1,nτ0+τnk=1n1(αk+1,nαk,n)),\displaystyle\leq C_{\ref{lem:v_norm}}{\left(\alpha_{n}\tau_{n}+\alpha_{1,n}\tau_{0}+\tau_{n}\sum_{k=1}^{n-1}{\left(\alpha_{k+1,n}-\alpha_{k,n}\right)}\right)}, (246)
=CB.5(αnτn+α1,n+τn(αn,nα1,n)),\displaystyle=C_{\ref{lem:v_norm}}{\left(\alpha_{n}\tau_{n}+\alpha_{1,n}+\tau_{n}{\left(\alpha_{n,n}-\alpha_{1,n}\right)}\right)}, (τ01\tau_{0}\doteq 1) (247)
=𝒪(αnτn).\displaystyle=\mathcal{O}\quantity(\alpha_{n}\tau_{n}). (Lemma A.2) (248)

Therefore, there exists a deterministic constant we denote as CC.2C_{\ref{lem: e2 rate}} such that

𝔼[ϵ¯n(2)]\displaystyle\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(2\right)}}_{n}}] CC.2αnτn.\displaystyle\leq C_{\ref{lem: e2 rate}}\alpha_{n}\tau_{n}. (249)

Corollary C.3.

We have

𝔼[ϵ¯n(3)]CC.3αni=1nαiτi.\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{n}}]\leq C_{\ref{lem: e3 rate}}\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i}. (250)
Proof.

Starting with (190) from Lemma B.10 to avoid redundancy, we have

ϵ¯n(3)\displaystyle\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{n}} CB.4k=1nαk,nαk(2CB.3τk1+2i=1k1αiϵi(1)+CH+ϵk(1)).\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{k=1}^{n}\alpha_{k,n}\alpha_{k}\quantity(2C_{\ref{lem:xn_norm}}\tau_{k-1}+2\sum_{i=1}^{k-1}\alpha_{i}\norm{{\epsilon}^{{\left(1\right)}}_{i}}+C_{H}+\norm{{\epsilon}^{{\left(1\right)}}_{k}}). (251)

Taking the expectation gives,

𝔼[ϵ¯n(3)]\displaystyle\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{n}}] CB.4k=1nαk,nαk(2CB.3τk1+2i=1k1αi𝔼[ϵi(1)]+CH+𝔼[ϵk(1)]).\displaystyle\leq C_{\ref{lem:v Lipschitz}}\sum_{k=1}^{n}\alpha_{k,n}\alpha_{k}\quantity(2C_{\ref{lem:xn_norm}}\tau_{k-1}+2\sum_{i=1}^{k-1}\alpha_{i}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{i}}]+C_{H}+\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}]). (252)

Recall that τk\tau_{k} is monotonically increasing. Additionally, by Lemma B.2, i=1k1αi𝔼[ϵi(1)]\sum_{i=1}^{k-1}\alpha_{i}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{i}}] converges and limk𝔼[ϵk(1)]=0\lim_{k\rightarrow\infty}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{k}}]=0. Therefore, there exists a deterministic constant CC.3C_{\ref{lem: e3 rate}} such that

𝔼[ϵ¯n(3)]\displaystyle\mathbb{E}\quantity[\norm{{\overline{\epsilon}}^{{\left(3\right)}}_{n}}] CC.3k=1nαk,nαkτk1,\displaystyle\leq C_{\ref{lem: e3 rate}}\sum_{k=1}^{n}\alpha_{k,n}\alpha_{k}\tau_{k-1}, (253)
CC.3αni=1nαiτi\displaystyle\leq C_{\ref{lem: e3 rate}}\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i} (Lemma A.2).\displaystyle\text{(Lemma \ref{lem:bravo b1})}. (254)

Lemma C.4.

For ωn\omega_{n} defined in (51), we have

ωn=𝒪(τnαn+1)\displaystyle\omega_{n}=\mathcal{O}(\tau_{n}\sqrt{\alpha_{n+1}}) (255)
Proof.

From (51), we have

ωnCB.7τnαn+1K1+i=1nαi,n𝔼[ϵi(1)]K2+CC.2αnτnK3+CC.3αni=1nαiτiK4\displaystyle\omega_{n}\doteq\underbrace{C_{\ref{lem: M second moment}}\tau_{n}\sqrt{\alpha_{n+1}}}_{K_{1}}+\underbrace{\sum_{i=1}^{n}\alpha_{i,n}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{i}}]}_{K_{2}}+\underbrace{C_{\ref{lem: e2 rate}}\alpha_{n}\tau_{n}}_{K_{3}}+\underbrace{C_{\ref{lem: e3 rate}}\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i}}_{K_{4}} (256)

To prove the Lemma, we will examine each of the four terms and prove they are 𝒪(τnαn+1)\mathcal{O}(\tau_{n}\sqrt{\alpha_{n+1}}). For K1K_{1}, this is trivial. For K2K_{2}, we first recall from Lemma B.1 that αn=𝒪(1nb)\alpha_{n}=\mathcal{O}(\frac{1}{n^{b}}) and

τn={𝒪(n1b)if45<b<1,𝒪(logn)ifb=1.\displaystyle\tau_{n}=\begin{cases}\mathcal{O}\quantity(n^{1-b})&\text{if}\quad\frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\log n)&\text{if}\quad b=1.\end{cases} (257)

Then we have,

τnαn+1={𝒪(1n32b1)if45<b<1,𝒪(lognn)ifb=1.\displaystyle\tau_{n}\sqrt{\alpha_{n+1}}=\begin{cases}\mathcal{O}\quantity(\frac{1}{n^{\frac{3}{2}b-1}})&\text{if}\quad\frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\frac{\log n}{\sqrt{n}})&\text{if}\quad b=1.\end{cases} (258)

Then by Lemma B.2 we have

i=1nαi,n𝔼[ϵi(1)]\displaystyle\sum_{i=1}^{n}\alpha_{i,n}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{i}}] αni=1n𝔼[ϵi(1)],\displaystyle\leq\alpha_{n}\sum_{i=1}^{n}\mathbb{E}\quantity[\norm{{\epsilon}^{{\left(1\right)}}_{i}}], (Lemma A.2) (259)
=𝒪(αni=1n1i),\displaystyle=\mathcal{O}\quantity(\alpha_{n}\sum_{i=1}^{n}\frac{1}{\sqrt{i}}), (260)
=𝒪(αnn)\displaystyle=\mathcal{O}\quantity(\alpha_{n}\sqrt{n}) (261)
=𝒪(1nbn),\displaystyle=\mathcal{O}\quantity(\frac{1}{n^{b}}\sqrt{n}), (262)
=𝒪(1nb1/2)\displaystyle=\mathcal{O}\quantity(\frac{1}{n^{b-1/2}}) (263)

Because we have $\frac{3}{2}b-1\leq b-\frac{1}{2}$ for $b\in(\frac{4}{5},1]$, we can see from (258) that $K_{2}$ is dominated by $K_{1}$.

For K3K_{3}, by Lemma B.1 we have,

αnτn={𝒪(1n2b1)if45<b<1,𝒪(lognn)ifb=1.\displaystyle\alpha_{n}\tau_{n}=\begin{cases}\mathcal{O}\quantity(\frac{1}{n^{2b-1}})&\text{if}\quad\frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\frac{\log n}{n})&\text{if}\quad b=1.\end{cases} (264)

It is clear from (258) that $K_{3}$ is dominated by $K_{1}$.

For K4K_{4}, for the case when b=1b=1, we have

αni=1nαiτi\displaystyle\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i} αnτni=1nαi\displaystyle\leq\alpha_{n}\tau_{n}\sum_{i=1}^{n}\alpha_{i} (τn\tau_{n} increasing) (265)
=𝒪(lognni=1n1i),\displaystyle=\mathcal{O}\quantity(\frac{\log n}{n}\sum_{i=1}^{n}\frac{1}{i}), (266)
=𝒪(log2nn).\displaystyle=\mathcal{O}\quantity(\frac{\log^{2}n}{n}). (267)

For the case when 45<b<1\frac{4}{5}<b<1, we have

αni=1nαiτi\displaystyle\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i} =𝒪(1nbi=1n1i2b1)\displaystyle=\mathcal{O}\quantity(\frac{1}{n^{b}}\sum_{i=1}^{n}\frac{1}{i^{2b-1}}) (268)

which we can approximate by an integral,

1n1x2b1𝑑x=𝒪(n22b).\displaystyle\int_{1}^{n}\frac{1}{x^{2b-1}}\ dx=\mathcal{O}\quantity(n^{2-2b}). (269)

Therefore,

αni=1nαiτi\displaystyle\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i} =𝒪(n23b)\displaystyle=\mathcal{O}\quantity(n^{2-3b}) (270)

Combining our results from the two cases, we have for K4K_{4}

αni=1nαiτi={𝒪(1n3b2)if45<b<1,𝒪(log2nn)ifb=1.\displaystyle\alpha_{n}\sum_{i=1}^{n}\alpha_{i}\tau_{i}=\begin{cases}\mathcal{O}\quantity(\frac{1}{n^{3b-2}})&\text{if}\quad\frac{4}{5}<b<1,\\ \mathcal{O}\quantity(\frac{\log^{2}n}{n})&\text{if}\quad b=1.\end{cases} (271)

Comparing with $K_{1}$ in (258), since we have $\frac{3}{2}b-1<3b-2$ for $b\in(\frac{4}{5},1)$, we can see that $K_{4}$ is dominated by $K_{1}$, thereby proving the lemma. ∎
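
The dominance comparisons above reduce to comparing polynomial decay exponents on $(\frac{4}{5},1]$; the quick check below (ignoring the logarithmic factors present at $b=1$) confirms the arithmetic.

```python
# Arithmetic check (ignoring log factors at b = 1) that K2, K3, K4 decay at least as
# fast as K1 on 4/5 < b <= 1: K1 ~ n^{-(3b/2 - 1)}, K2 ~ n^{-(b - 1/2)},
# K3 ~ n^{-(2b - 1)}, K4 ~ n^{-(3b - 2)}.
import numpy as np

for b in np.linspace(0.8, 1.0, 6)[1:]:
    e1, e2, e3, e4 = 1.5 * b - 1, b - 0.5, 2 * b - 1, 3 * b - 2
    assert e1 <= min(e2, e3, e4)
    print(f"b={b:.2f}: decay exponents of K1..K4 = {e1:.2f}, {e2:.2f}, {e3:.2f}, {e4:.2f}")
```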

Lemma C.5.

We have,

k=2n2αkσ(τnτk)𝔼[Uk1]=𝒪(1/τn).\displaystyle\sum_{k=2}^{n}2\alpha_{k}\sigma\quantity(\tau_{n}-\tau_{k})\mathbb{E}\quantity[\norm{U_{k-1}}]=\mathcal{O}(1/\sqrt{\tau_{n}}). (272)
Proof.

The proof of this Lemma is a straightforward combination of the existing results of Theorems 2.11 and 3.1 from (Bravo & Cominetti, 2024). First, from (51), we have

k=2n2αkσ(τnτk)𝔼[Uk1]k=2n2αkσ(τnτk)ωk1.\displaystyle\sum_{k=2}^{n}2\alpha_{k}\sigma\quantity(\tau_{n}-\tau_{k})\mathbb{E}\quantity[\norm{U_{k-1}}]\leq\sum_{k=2}^{n}2\alpha_{k}\sigma\quantity(\tau_{n}-\tau_{k})\omega_{k-1}. (273)

In the proof of Theorem 2.11 of (Bravo & Cominetti, 2024), they prove that if there exists a decreasing convex function f:(0,)(0,)f:(0,\infty)\rightarrow(0,\infty) of class C2C^{2}, and a constant γ1\gamma\geq 1, such that for k2k\geq 2,

\displaystyle\begin{cases}\omega_{k-1}\leq(1-\alpha_{k})f(\tau_{k}),\\ \alpha_{k}(1-\alpha_{k})\leq\gamma\alpha_{k+1}(1-\alpha_{k+1}),\end{cases} (274)

then,

\displaystyle\sum_{k=2}^{n}2\alpha_{k}\sigma\quantity(\tau_{n}-\tau_{k})\omega_{k-1}\leq\frac{2\gamma}{\sqrt{\pi}}\int_{\tau_{1}}^{\tau_{n}}\frac{f(x)}{\sqrt{\tau_{n}-x}}dx+2\alpha_{n}\omega_{n-1}. (275)

Using the fact that ωn=𝒪(τnαn+1)\omega_{n}=\mathcal{O}(\tau_{n}\sqrt{\alpha_{n+1}}), which aligns with the analogous νn\nu_{n} from Bravo & Cominetti (2024), and adopting their definition of τn\tau_{n}, we avoid redundant derivations here.

Theorem 3.1 in Bravo & Cominetti (2024) establishes that for the step size schedule specified in Assumption 2.4, there exist constants γ1\gamma\geq 1 and a function f(x)f(x) satisfying (274). Specifically, they show with

f(x)={κx(1+x)b/2(1b)if b<1,κxex/2if b=1,\displaystyle f(x)=\begin{cases}\kappa x(1+x)^{-b/2(1-b)}&\text{if }b<1,\\ \kappa xe^{-x/2}&\text{if }b=1,\end{cases} (276)

for some constant κ\kappa and γ=3227\gamma=\frac{32}{27}, (274) is satisfied. Moreover, they demonstrate that the resulting convolution integral in (275) evaluates to 𝒪(1/τn)\mathcal{O}(1/\sqrt{\tau_{n}}).

Combining these results, the right-hand side of (275) simplifies to 𝒪(1/τn)\mathcal{O}(1/\sqrt{\tau_{n}}), which completes the proof. For detailed steps, we refer the reader to Bravo & Cominetti (2024) to avoid repetition. ∎

Appendix D Extension of Theorem 2.1 of Borkar (2009)

In this section, we present a simple extension of Theorem 2.1 from (Borkar, 2009) for completeness. Readers familiar with stochastic approximation theory should find this extension fairly straightforward. Originally, Chapter 2 of (Borkar, 2009) considers stochastic approximations of the form,

yn+1=yn+αn(h(yn)+Mn+1+ξn+1)\displaystyle y_{n+1}=y_{n}+\alpha_{n}{\left(h(y_{n})+M_{n+1}+\xi_{n+1}\right)} (277)

where it is assumed that $\xi_{n}\rightarrow 0$ almost surely. However, our work requires that we remove the assumption that $\xi_{n}\rightarrow 0$ and replace it with a milder condition on the asymptotic rate of change of $\xi_{n}$, akin to Kushner & Yin (2003).

Assumption D.1.

For any T>0T>0,

limnsupnjm(n,T)i=njαiξi=0 a.s.\displaystyle\lim_{n\to\infty}\sup_{n\leq j\leq m(n,T)}\norm{\sum_{i=n}^{j}\alpha_{i}\xi_{i}}=0\mbox{\quad a.s.\quad} (278)

where m(n,T)\doteq\min\quantity{k\,\middle|\,\sum_{i=n}^{k}\alpha_{i}\geq T}.
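For example, if \alpha_{i}=1/i, then \sum_{i=n}^{k}\alpha_{i}\approx\log(k/n), so m(n,T) is roughly ne^{T}. In words, m(n,T) is the first index by which the accumulated step sizes starting from n have advanced by at least T, and Assumption D.1 does not require \xi_{n}\rightarrow 0 pointwise; it only requires that the \alpha-weighted partial sums of \xi_{n} over such length-T windows vanish asymptotically.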

The next four assumptions are the same as the remaining assumptions in Chapter 2 of (Borkar, 2009).

Assumption D.2.

The map hh is Lipschitz: h(x)h(y)Lxy\norm{h{\left(x\right)}-h{\left(y\right)}}\leq L\norm{x-y} for some 0<L<0<L<\infty.

Assumption D.3.

The step sizes {αn}\quantity{\alpha_{n}} are positive scalars satisfying

nαn=,nαn2<\displaystyle\sum_{n}\alpha_{n}=\infty,\sum_{n}\alpha_{n}^{2}<\infty (279)
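For example, \alpha_{n}=n^{-b} with b\in(\frac{1}{2},1] satisfies (279), since \sum_{n}n^{-b}=\infty for b\leq 1 while \sum_{n}n^{-2b}<\infty for b>\frac{1}{2}.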
Assumption D.4.

{Mn}\quantity{M_{n}} is a martingale difference sequence w.r.t the increasing family of σ\sigma-algebras

nσ(ym,Mm,mn)=σ(y0,M1,,Mn),n0.\displaystyle\mathcal{F}_{n}\doteq\sigma\quantity(y_{m},M_{m},m\leq n)=\sigma\quantity(y_{0},M_{1},\ldots,M_{n}),\,n\geq 0. (280)

That is,

𝔼[Mn+1|n]=0 a.s. ,n0.\displaystyle\mathbb{E}\left[M_{n+1}|\mathcal{F}_{n}\right]=0\mbox{\quad a.s.\quad},n\geq 0. (281)

Furthermore, {Mn}\quantity{M_{n}} are square-integrable with

\displaystyle\mathbb{E}\left[\norm{M_{n+1}}^{2}|\mathcal{F}_{n}\right]\leq K{\left(1+\norm{y_{n}}^{2}\right)}\ \mbox{\quad a.s.\quad},\ n\geq 0, (282)

for some constant K>0K>0

Assumption D.5.

The iterates of (277) remain bounded almost surely, i.e.,

supnyn<\displaystyle\sup_{n}\norm{y_{n}}<\infty (283)
Theorem D.6 (Extension of Theorem 2.1 from (Borkar, 2009)).

Let Assumptions D.1, D.2, D.3, D.4, D.5 hold. Almost surely, the sequence {yn}\quantity{y_{n}} generated by (277) converges to a (possibly sample-path dependent) compact connected internally chain transitive set of the ODE

dy(t)dt=h(y(t)).\displaystyle\derivative{y(t)}{t}=h(y(t)). (284)
Proof.

We now demonstrate that, even with the relaxed assumption on \xi_{n}, we obtain the same almost sure convergence of the iterates as in (Borkar, 2009). Following Chapter 2 of (Borkar, 2009), we construct a continuous interpolated trajectory \bar{y}{\left(t\right)},\ t\geq 0, and show that it asymptotically approaches the solution set of (284) almost surely. Define time instants t{\left(0\right)}=0,\ t{\left(n\right)}=\sum_{m=0}^{n-1}\alpha_{m},\ n\geq 1. By Assumption D.3, t{\left(n\right)}\uparrow\infty. Let I_{n}\doteq\left[t{\left(n\right)},t{\left(n+1\right)}\right],\ n\geq 0. Define a continuous, piecewise linear \bar{y}{\left(t\right)},\ t\geq 0, by \bar{y}{{\left(t{\left(n\right)}\right)}}=y_{n},\ n\geq 0, with linear interpolation on each interval I_{n}:

y¯(t)=yn+(yn+1yn)tt(n)t(n+1)t(n),tIn\bar{y}{\left(t\right)}=y_{n}+{\left(y_{n+1}-y_{n}\right)}\frac{t-t{\left(n\right)}}{t{\left(n+1\right)}-t{\left(n\right)}},t\in I_{n} (285)
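On each interval I_{n}, the slope of \bar{y} is (y_{n+1}-y_{n})/\alpha_{n}=h(y_{n})+M_{n+1}+\xi_{n+1} by (277) (note that t(n+1)-t(n)=\alpha_{n}), so \bar{y} follows the vector field h up to the martingale noise M_{n+1} and the perturbation \xi_{n+1}; the remainder of the proof quantifies how these perturbations vanish over time windows of length T.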

It is worth noting that \sup_{t\geq 0}\norm{\bar{y}{\left(t\right)}}=\sup_{n}\norm{y_{n}}<\infty almost surely by Assumption D.5. Let y^{s}{\left(t\right)},\ t\geq s, denote the unique solution to (284) ‘starting at s’:

dys(t)dt\displaystyle\derivative{y^{s}(t)}{t} =h(ys(t)),ts,\displaystyle=h{\left(y^{s}{\left(t\right)}\right)},t\geq s, (286)

with y^{s}{\left(s\right)}=\bar{y}{\left(s\right)},\ s\in\mathbb{R}. Similarly, let y_{s}{\left(t\right)},\ t\leq s, denote the unique solution to (284) ‘ending at s’:

dys(t)dt\displaystyle\derivative{y_{s}(t)}{t} =h(ys(t)),ts,\displaystyle=h{\left(y_{s}{\left(t\right)}\right)},t\leq s, (287)

with ys(s)=y¯(s),sy_{s}{\left(s\right)}=\bar{y}{\left(s\right)},s\in\mathbb{R}. Define also

ζn\displaystyle\zeta_{n} =m=0n1αm(Mm+1+ξm+1),n1\displaystyle=\sum_{m=0}^{n-1}\alpha_{m}{\left(M_{m+1}+\xi_{m+1}\right)},\ n\geq 1 (288)

After invoking Lemma D.7, the analysis presented for Theorem 2.1 in (Borkar, 2009) applies directly, yielding our desired extended result. ∎

Lemma D.7 (Extension of Theorem 1 from (Borkar, 2009)).

Let Assumptions D.1–D.5 hold. We have, for any T>0,

lims\displaystyle\lim_{s\rightarrow\infty} supt[s,s+T]y¯(t)ys(t)=0, a.s.\displaystyle\sup_{t\in\left[s,s+T\right]}\norm{\bar{y}{\left(t\right)}-y^{s}{\left(t\right)}}=0,\mbox{\quad a.s.\quad} (289)
lims\displaystyle\lim_{s\rightarrow\infty} supt[s,s+T]y¯(t)ys(t)=0, a.s.\displaystyle\sup_{t\in\left[s,s+T\right]}\norm{\bar{y}{\left(t\right)}-y_{s}{\left(t\right)}}=0,\mbox{\quad a.s.\quad} (290)
Proof.

Let t(n+m)t{\left(n+m\right)} be in [t(n),t(n)+T]\left[t(n),t(n)+T\right]. Let [t]max{t(k):t(k)t}\left[t\right]\doteq\max\quantity{t(k):t(k)\leq t}. Then,

y¯(t(n+m))=y¯(t(n))+k=0m1αn+kh(y¯(t(n+k)))+δn,n+m\displaystyle\bar{y}{\left(t{\left(n+m\right)}\right)}=\bar{y}{\left(t{\left(n\right)}\right)}+\sum_{k=0}^{m-1}\alpha_{n+k}h{\left(\bar{y}{\left(t{\left(n+k\right)}\right)}\right)}+\delta_{n,n+m} (2.1.6 in (Borkar, 2009)) (291)

where δn,n+mζn+mζn\delta_{n,n+m}\doteq\zeta_{n+m}-\zeta_{n}. Borkar (2009) then compares this with

yt(n)(t(n+m))=y¯(t(n))+k=0m1αn+kh(yt(n)(t(n+k)))\displaystyle y^{t(n)}{\left(t{\left(n+m\right)}\right)}=\bar{y}{\left(t{\left(n\right)}\right)}+\sum_{k=0}^{m-1}\alpha_{n+k}h{\left(y^{t(n)}{\left(t{\left(n+k\right)}\right)}\right)} (292)
+t(n)t(n+m)(h(yt(n)(z))h(yt(n)([z])))𝑑z.\displaystyle\ +\int_{t(n)}^{t(n+m)}{\left(h{\left(y^{t(n)}{\left(z\right)}\right)}-h{\left(y^{t(n)}{\left(\left[z\right]\right)}\right)}\right)}dz. (2.1.7 in (Borkar, 2009)) (293)

Next, Borkar (2009) bounds the integral on the right-hand side by proving

t(n)t(n+m)(h(yt(n)(t))h(yt(n)([t])))𝑑tCTLk=0αn+k2n0, a.s.\displaystyle\norm{\int_{t(n)}^{t(n+m)}{\left(h{\left(y^{t(n)}{\left(t\right)}\right)}-h{\left(y^{t(n)}{\left(\left[t\right]\right)}\right)}\right)}dt}\leq C_{T}L\sum_{k=0}^{\infty}\alpha_{n+k}^{2}\xrightarrow{n\uparrow\infty}0,\ \mbox{\quad a.s.\quad} (2.1.8 in (Borkar, 2009)) (294)

where CTh(0)+L(C0+h(0)T)eLT<C_{T}\doteq\norm{h{\left(0\right)}}+L{\left(C_{0}+\norm{h{\left(0\right)}}T\right)}e^{LT}<\infty almost surely and C0supnyn<C_{0}\doteq\sup_{n}\norm{y_{n}}<\infty a.s. by Assumption D.5.
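Here C_{T} arises from the a priori bound \sup_{t\in[t(n),t(n)+T]}\norm{y^{t(n)}(t)}\leq(C_{0}+\norm{h(0)}T)e^{LT}, obtained from \norm{h(y)}\leq\norm{h(0)}+L\norm{y} (a consequence of Assumption D.2) and Gronwall’s inequality applied to (286), so that \norm{h(y^{t(n)}(t))}\leq C_{T} on this interval.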

Then, we can subtract (2.1.7) from (2.1.6) and take norms, yielding

y¯(t(n+m))yt(n)(t(n+m))\displaystyle\norm{\bar{y}{\left(t{\left(n+m\right)}\right)}-y^{t(n)}{\left(t{\left(n+m\right)}\right)}} Li=0m1αn+iy¯(t(n+i))yt(n)(t(n+i))\displaystyle\leq L\sum_{i=0}^{m-1}\alpha_{n+i}\norm{\bar{y}{\left(t{\left(n+i\right)}\right)}-y^{t(n)}{\left(t{\left(n+i\right)}\right)}} (295)
+CTLk0αn+k2+sup0km(n,T)δn,n+k.\displaystyle\quad+C_{T}L\sum_{k\geq 0}\alpha_{n+k}^{2}+\sup_{0\leq k\leq m(n,T)}\norm{\delta_{n,n+k}}. (296)

The key difference between (296) and the analogous inequality in Chapter 2 of Borkar (2009) is that we replace \sup_{k\geq 0} with \sup_{0\leq k\leq m(n,T)}. This replacement is valid because t(n+m) is taken to lie in [t(n),t(n)+T]; recalling the definition m(n,T)\doteq\min\quantity{k\,\middle|\,\sum_{i=n}^{k}\alpha_{i}\geq T} from Assumption D.1, it follows that m\leq m(n,T) in (291). Borkar (2009) uses the looser \sup_{k\geq 0} for notational simplicity; a similar argument can be found in (Kushner & Yin, 2003).

Also, we have,

\displaystyle\norm{\delta_{n,n+k}} =\norm{\zeta_{n+k}-\zeta_{n}}, (297)
=\norm{\sum_{i=n}^{n+k-1}\alpha_{i}{\left(M_{i+1}+\xi_{i+1}\right)}}, (by (288)) (298)
\leq\norm{\sum_{i=n}^{n+k-1}\alpha_{i}M_{i+1}}+\norm{\sum_{i=n}^{n+k-1}\alpha_{i}\xi_{i+1}}. (299)

Borkar (2009) proves that (i=0n1αiMi+1,n),n1{\left(\sum_{i=0}^{n-1}\alpha_{i}M_{i+1},\mathcal{F}_{n}\right)},\ n\geq 1 is a zero mean, square-integrable martingale. By D.3, D.4, D.5,

\displaystyle\sum_{n\geq 0}\mathbb{E}\left[\norm{\sum_{i=0}^{n}\alpha_{i}M_{i+1}-\sum_{i=0}^{n-1}\alpha_{i}M_{i+1}}^{2}\,\bigg{|}\,\mathcal{F}_{n}\right]=\sum_{n\geq 0}\alpha_{n}^{2}\,\mathbb{E}\left[\norm{M_{n+1}}^{2}\,|\,\mathcal{F}_{n}\right]<\infty\mbox{\quad a.s.\quad} (300)

Therefore, the martingale convergence theorem implies that \sum_{i=0}^{n-1}\alpha_{i}M_{i+1} converges almost surely as n\rightarrow\infty, and hence, by the Cauchy criterion, \sup_{k\geq 1}\norm{\sum_{i=n}^{n+k-1}\alpha_{i}M_{i+1}}\rightarrow 0 almost surely as n\rightarrow\infty. Combining this with Assumption D.1 yields,

limnsup0km(n,T)δn,n+k=0 a.s.\displaystyle\lim_{n\rightarrow\infty}\sup_{0\leq k\leq m(n,T)}\norm{\delta_{n,n+k}}=0\mbox{\quad a.s.\quad} (301)

Using the definition of K_{T,n}\doteq C_{T}L\sum_{k\geq 0}\alpha_{n+k}^{2}+\sup_{0\leq k\leq m(n,T)}\norm{\delta_{n,n+k}} given by (Borkar, 2009), we have shown that our slightly relaxed assumption still yields K_{T,n}\rightarrow 0 almost surely as n\rightarrow\infty. The rest of the argument in (Borkar, 2009), namely an application of the discrete Gronwall inequality to (296) followed by the passage from the grid points t(n+m) to the continuous interpolation, then holds without any additional modification, establishing (289); the backward statement (290) follows by a similar argument. ∎