
Temporal Difference Learning with Compressed Updates:
Error-Feedback meets Reinforcement Learning

Aritra Mitra, George J. Pappas, and Hamed Hassani A. Mitra is with the Department of Electrical and Computer Engineering, North Carolina State University. Email: amitra2@ncsu.edu. H. Hassani and G. Pappas are with the Department of Electrical and Systems Engineering, University of Pennsylvania. Email: {pappasg, hassani}@seas.upenn.edu. This work was supported by NSF Award 1837253, NSF CAREER award CIF 1943064, and The Institute for Learning-enabled Optimization at Scale (TILOS), under award number NSF-CCF-2112665.
Abstract

In large-scale distributed machine learning, recent works have studied the effects of compressing gradients in stochastic optimization to alleviate the communication bottleneck. These works have collectively revealed that stochastic gradient descent (SGD) is robust to structured perturbations such as quantization, sparsification, and delays. Perhaps surprisingly, despite the surge of interest in multi-agent reinforcement learning, almost nothing is known about the analogous question: Are common reinforcement learning (RL) algorithms also robust to similar perturbations? We investigate this question by studying a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction, where a general compression operator is used to model the perturbation. Our work makes three important technical contributions. First, we prove that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic theoretical guarantees as their SGD counterparts. Second, we show that our analysis framework extends seamlessly to nonlinear stochastic approximation schemes that subsume Q-learning. Third, we prove that for multi-agent TD learning, one can achieve linear convergence speedups with respect to the number of agents while communicating just $\tilde{O}(1)$ bits per iteration. Notably, these are the first finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling. Our proofs hinge on the construction of novel Lyapunov functions that capture the dynamics of a memory variable introduced by error-feedback.

1 Introduction

Stochastic gradient descent (SGD) is at the heart of large-scale distributed machine learning paradigms such as federated learning (FL) [1]. In these applications, the task of training high-dimensional weight vectors is distributed among several workers that exchange information over networks of limited bandwidth. While parallelization at such an immense scale helps to reduce the computational burden, it creates several other challenges: delays, asynchrony, and most importantly, a significant communication bottleneck. The popularity and success of SGD can be attributed in no small part to the fact that it is extremely robust to such deviations from ideal operating conditions. In fact, by now, there is a rich literature that analyzes the robustness of SGD to a host of structured perturbations that include lossy gradient-quantization [2, 3] and sparsification [4, 5]. For instance, SignSGD - a variant of SGD where each coordinate of the gradient is replaced by its sign - is extensively used to train deep neural networks [6, 7]. Inspired by these findings, in this paper, we ask a different question: Are common reinforcement learning (RL) algorithms also robust to similar structured perturbations?

Motivation. Perhaps surprisingly, despite the recent surge of interest in multi-agent/federated RL, almost nothing is known about the above question. To fill this void, we initiate the study of a robustness theory for iterative RL algorithms with compressed update directions. We primarily focus on the problem of evaluating the value function associated with a fixed policy $\mu$ in a Markov decision process (MDP). Just as SGD is the workhorse of stochastic optimization, the classical temporal difference (TD) learning algorithm [8] for policy evaluation forms the core subroutine in a variety of decision-making algorithms in RL (e.g., Watkins' Q-learning algorithm). In fact, in their book [9], Sutton and Barto mention: “If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning.” Thus, it stands to reason that we center our investigation around a variant of the TD(0) learning algorithm with linear function approximation, where the TD(0) update direction is replaced by a compressed version of it. Moreover, to account for general biased compression operators (e.g., sign and Top-$k$), we couple this scheme with an error-feedback mechanism that retains some memory of TD(0) update directions from the past.

Other than the robustness angle, a key motivation of our work is to design communication-efficient algorithms for the emerging paradigms of multi-agent RL (MARL) and federated RL (FRL). In existing works on these topics [10, 11, 12, 13], agents typically exchange high-dimensional models (parameters) or model-differentials (i.e., gradient-like update directions) over low-bandwidth channels, keeping their personal data (i.e., rewards, states, and actions) private. In the recent survey paper on FRL by [11], the authors explain how the above issues contribute to a major communication bottleneck, just as in the standard FL setting. However, no work on MARL or FRL provides any theory whatsoever when it comes to compression/quantization in RL. Our work takes a significant step towards filling this gap via a suite of comprehensive theoretical results that we discuss below.

1.1 Our Contributions

The main contributions of this work are as follows.

  1.

    Algorithmic Framework for Compressed Stochastic Approximation. We develop a general framework for analyzing iterative compressed stochastic approximation (SA) algorithms with error-feedback. The generality of this framework stems from two salient features: (i) it applies to nonlinear SA with Markovian noise; and (ii) the compression operator covers several common quantization and (biased) sparsification schemes studied in optimization. As an instance of our framework, we propose a new algorithm called EF-TD (Algorithm 1) to study extreme distortions to the TD(0) update direction. Examples of such distortion include, but are not limited to, replacing each coordinate of the TD(0) update direction with just its sign, or just retaining the coordinate with the largest magnitude. While such distortions have been extensively studied for SGD, we are unaware of any analogous algorithms or analysis in the context of RL.

  2.

    Analysis under Markovian Sampling. In Theorem 1, we show that with a constant step-size, EF-TD guarantees linear convergence to a ball centered around the optimal parameter. Up to a compression factor, our result mirrors existing rates in RL without compression [14]. Moreover, the effect of the compression factor exactly mirrors that for compressed SGD [15]. The significance of this result is twofold: (i) It is the first result in RL that accounts for general compression operators and error-feedback in tandem with Markovian sampling and linear function approximation; and (ii) It is the first theoretical result on the robustness of TD learning algorithms to structured perturbations. One interesting takeaway from Theorem 1 is that “slowly-mixing” Markov chains have more inherent robustness to distortions/compression. This appears to be a new observation that we elaborate on in Section 4.

  3.

    Analysis for General Nonlinear Stochastic Approximation. In Section 5, we study Algorithm 1 in its full generality by considering a compressed nonlinear SA scheme with error-feedback, and establishing an analog of Theorem 1 in Theorem 2. The significance of this result lies in revealing that the power and scope of the popular error-feedback mechanism [2] in large-scale ML is not limited to the optimization problems it has been used for thus far. In particular, since the nonlinear SA scheme we study captures certain instances of Q-learning [16], Theorem 2 conveys that our results extend well beyond policy evaluation to decision-making (control) problems as well.

  4.

    Communication-Efficient MARL. Since one of our main goals is to facilitate communication-efficient MARL, we consider a collaborative MARL setting that has shown up in several recent works [10, 13, 17]. Here, $M$ agents interact with the same MDP, but observe potentially different realizations of rewards and state transitions. These agents exchange compressed TD update directions via a central server to speed up the process of evaluating a specific policy. In this context, we ask the following fundamental question: How much information needs to be transmitted to achieve the optimal linear speedup (w.r.t. the number of agents) for policy evaluation? To answer this question, we propose a multi-agent version of EF-TD. In Theorem 3, we prove that by collaborating, each agent can achieve an $M$-fold speedup in the dominant term of the convergence rate. Importantly, the effect of compression only shows up in higher-order terms, leaving the dominant term unaffected. Thus, we prove that by transmitting just $\tilde{O}(1)$ bits per-agent per-round, one can preserve the same asymptotic rates as vanilla TD(0), while achieving an optimal linear convergence speedup w.r.t. the number of agents. We envision this result will have important implications for MARL. Our analysis also leads to a tighter linear dependence on the mixing time compared to the quadratic dependence in the only other paper that establishes linear speedups under Markovian sampling [13].

  5.

    Technical Challenges and Novel Proof Techniques. One might wonder whether our results above follow as simple extensions of known analysis techniques in RL. In what follows, we explain why this isn’t the case by outlining the key technical challenges unique to our setup, and our novel proof ideas to overcome them. First, a non-asymptotic analysis of even vanilla TD(0) under Markovian sampling is known to be extremely challenging due to complex temporal correlations. Our setting is further complicated by the fact that the dynamics of the parameter and the memory variable are intricately coupled with the temporally-correlated Markov data tuples. This leads to a complex stochastic dynamical system that has not been analyzed before in RL. Second, we cannot directly appeal to existing proofs of compression in optimization since the problems we study do not involve minimizing a static loss function. Moreover, the aspect of temporally correlated observations is completely absent in compressed optimization since one deals with i.i.d. data. The above discussion motivates the need for new analysis tools beyond what are known for both RL and optimization.

    \bullet New Proof Ingredients for Theorems 1 and 2. To prove Theorem 1, our first innovation is to construct a novel Lyapunov function that accounts for the joint dynamics of the parameter and the memory variable introduced by error-feedback. The next natural step is to then analyze the drift of this Lyapunov function by appealing to mixing time arguments - as is typically done in finite-time RL proofs [18, 14]. This is where we again run into difficulties since the presence of the memory variable due to error-feedback introduces non-standard delay terms in the drift bound. Notably, this difficulty does not show up when one analyzes error-feedback in optimization since there is no need for mixing time arguments of Markov chains. To overcome this challenge, we make a connection to what might at first appear unrelated: the Incremental Aggregated Gradient (IAG) method for finite-sum optimization. Our main observation here is that the elegant analysis of the IAG method in [19] shows how one can handle the effect of “shocks” (delayed terms), as long as the amplitude and duration of the shocks are not too large. This ends up being precisely what we need for our purpose: we carefully relate the amplitude of the shocks to our constructed Lyapunov function, and their duration to the mixing time of the underlying Markov chain. We spell out these details explicitly in Section 7.

    \bullet New Proof Ingredients for Theorem 3. Despite the long list of papers on MARL, the only one that establishes a linear speedup (w.r.t. the number of agents) in sample-complexity under Markovian sampling is the very recent paper by [13]; all other papers end up making a restrictive i.i.d. sampling assumption [10, 17, 20] to show a speedup. (After the first version of this paper was released, and while the current version was under review, some other recent works managed to establish linear speedup results under various settings [21, 22, 23, 24, 25].) The proof in [13] is quite involved, and relies on Generalized Moreau Envelopes. This makes it particularly challenging to extend their proof framework to our setting, where we need to additionally contend with the effects of compression and error-feedback. As such, we provide an alternate proof technique for establishing the linear speedup effect that relies crucially on a new variance reduction lemma under Markovian data, namely Lemma 2. This lemma is not limited to TD learning, and may be of independent interest to the MARL literature. Lemma 2, coupled with a finer Lyapunov function (relative to that used for proving Theorem 1), enables us to establish the desired linear speedup result under Markovian sampling for our setting. We elaborate on these points in Section 7.

Scope of our Work. We note that our work is primarily theoretical in nature, in the same spirit as several recent finite-time RL papers [26, 18, 14, 10, 13]. We do, however, report simulations on moderately sized (100 states) MDPs that reveal: (i) without error-feedback (EF), compressed TD(0) can end up making little to no progress; and (ii) with EF, the performance of compressed TD(0) complies with our theory. These simulations are in line with those known in optimization for deep learning [5, 6, 27, 28], where one empirically observes significant benefits of performing EF. Succinctly, we convey: compressed TD learning algorithms with EF are just as robust as their SGD counterparts, and exhibit similar convergence guarantees.

1.2 Related Work

In what follows, we discuss the most relevant threads of related work.

\bullet Analysis of TD Learning Algorithms. An analysis of the classical temporal difference learning algorithm [8] with value function approximation was first provided by [29]. They employed the ODE method [30] - commonly used to study stochastic approximation algorithms - to provide an asymptotic convergence analysis of TD learning algorithms. Finite-time rates for such algorithms remained elusive for several years, until the work by [31]. Soon after, [32] noted some issues with the proofs in [31]. Under an i.i.d. sampling assumption, [33] and [26] went on to provide finite-time rates for TD learning algorithms. For the more challenging Markovian setting, finite-time rates have been recently derived using various perspectives: (i) by making explicit connections to optimization [18]; (ii) by taking a control-theoretic approach and studying the drift of a suitable Lyapunov function [14]; and (iii) by arguing that the mean-path temporal difference direction acts as a “gradient-splitting” of an appropriately chosen function [34]. Each of these interpretations provides interesting new insights into the dynamics of TD algorithms.

The above works focus on the vanilla TD learning rule. Our work adds to this literature by providing an understanding of the robustness of TD learning algorithms subject to structured distortions.

\bullet Communication-Efficient Algorithms for (Distributed) Optimization and Learning. In the last decade or so, a variety of both scalar [2, 4, 7, 3], and vector [35, 36] quantization schemes have been explored for optimization/empirical risk minimization. In particular, an aggressive compression scheme employed popularly in deep learning is gradient sparsification, where one only transmits a few components of the gradient vector that have the largest magnitudes. While empirical studies [6, 27] have revealed the benefits of extreme sparsification, theoretical guarantees for such methods are much harder to establish due to their biased nature: the output of the compression operator is not an unbiased version of its argument. In fact, for biased schemes, naively compressing gradients can lead to diverging iterates [37, 15]. In [5, 38], and later in [37, 39], it was shown that the above issue can be “fixed” by using a mechanism known as error-feedback that exploits memory. This idea is also related to modulation techniques in coding theory [40]. Recently, [15] and [41] provided theoretical results on biased sparsification for a master-worker type distributed architecture, and [42] studied sparsification in a federated learning context. For more recent work on the error-feedback idea, we refer the reader to [43, 44].

Our work can be seen as the first analog of the above results in general, and the error-feedback idea in particular, in the context of iterative algorithms for RL.

Finally, we note that while several compression algorithms have been proposed and analyzed over the years, fundamental performance bounds were only recently identified in [35], [36], and [39]. Computationally efficient algorithms that match such performance bounds were developed in [45]. While all the above results pertain to static optimization, some recent works have also explored quantization schemes in the context of sequential-decision making problems, focusing on multi-armed bandits [46, 47, 48].

2 Model and Problem Setup

We consider a Markov Decision Process (MDP) denoted by $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$, where $\mathcal{S}$ is a finite state space of size $n$, $\mathcal{A}$ is a finite action space, $\mathcal{P}$ is a set of action-dependent Markov transition kernels, $\mathcal{R}$ is a reward function, and $\gamma\in(0,1)$ is the discount factor. For the majority of the paper, we will be interested in the policy evaluation problem where the goal is to evaluate the value function $V_{\mu}$ of a given policy $\mu$; here, $\mu:\mathcal{S}\rightarrow\mathcal{A}$. The policy $\mu$ induces a Markov reward process (MRP) characterized by a transition matrix $P$ and a reward function $R$. (For simplicity of notation, we have dropped the dependence of $P$ and $R$ on the policy $\mu$.) In particular, $P(s,s^{\prime})$ denotes the probability of transitioning from state $s$ to state $s^{\prime}$ under the action $\mu(s)$; $R(s)$ denotes the expected instantaneous reward from an initial state $s$ under the action of the policy $\mu$. The discounted expected cumulative reward obtained by playing policy $\mu$ starting from initial state $s$ is given by:

$V_{\mu}(s)=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t})\,\Big|\,s_{0}=s\right],$   (1)

where $s_{t}$ represents the state of the Markov chain (induced by $\mu$) at time $t$, when initiated from $s_{0}=s$. It is well-known [29] that $V_{\mu}$ is the fixed point of the policy-specific Bellman operator $\mathcal{T}_{\mu}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$, i.e., $\mathcal{T}_{\mu}V_{\mu}=V_{\mu}$, where for any $V\in\mathbb{R}^{n}$,

$(\mathcal{T}_{\mu}V)(s)=R(s)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s,s^{\prime})V(s^{\prime}),\quad\forall s\in\mathcal{S}.$

Linear Function Approximation. In practice, the size of the state space $\mathcal{S}$ can be extremely large. This renders the task of estimating $V_{\mu}$ exactly (based on observations of rewards and state transitions) intractable. One common approach to tackle this difficulty is to consider a parametric approximation $\hat{V}_{\theta}$ of $V_{\mu}$ in the linear subspace spanned by a set $\{\phi_{k}\}_{k\in[K]}$ of $K\ll n$ basis vectors, where $\phi_{k}=[\phi_{k}(1),\ldots,\phi_{k}(n)]^{\top}\in\mathbb{R}^{n}$. Specifically, we have $\hat{V}_{\theta}(s)=\sum_{k=1}^{K}\theta(k)\phi_{k}(s)$, where $\theta=[\theta(1),\ldots,\theta(K)]^{\top}\in\mathbb{R}^{K}$ is a weight vector. Let $\Phi\in\mathbb{R}^{n\times K}$ be a matrix with $\phi_{k}$ as its $k$-th column; we will denote the $s$-th row of $\Phi$ by $\phi(s)\in\mathbb{R}^{K}$, and refer to it as the feature vector corresponding to state $s$. Compactly, $\hat{V}_{\theta}=\Phi\theta$, and for each $s\in\mathcal{S}$, we have that $\hat{V}_{\theta}(s)=\langle\phi(s),\theta\rangle$. As is standard, we assume that the columns of $\Phi$ are linearly independent, and that the feature vectors are normalized, i.e., for each $s\in\mathcal{S}$, $\|\phi(s)\|^{2}_{2}\leq 1$.

The TD(0) Algorithm. The goal now is to find the best parameter vector $\theta^{*}$ that minimizes the distance (in a suitable norm) between $\hat{V}_{\theta}$ and $V_{\mu}$. To that end, we will focus on the classical TD(0) algorithm within the family of TD learning algorithms. Starting from an initial estimate $\theta_{0}$, at each time-step $t=0,1,\ldots$, this algorithm receives as observation a data tuple $X_{t}=(s_{t},s_{t+1},r_{t}=R(s_{t}))$ comprising the current state, the next state reached by playing action $\mu(s_{t})$, and the instantaneous reward $r_{t}$. Given this tuple $X_{t}$, let us define $g_{t}(\theta)=g(X_{t},\theta)$ as:

$g_{t}(\theta)\triangleq\left(r_{t}+\gamma\langle\phi(s_{t+1}),\theta\rangle-\langle\phi(s_{t}),\theta\rangle\right)\phi(s_{t}),\quad\forall\theta\in\mathbb{R}^{K}.$

The TD(0) update to the current parameter $\theta_{t}$ with step-size $\alpha_{t}\in(0,1)$ can now be described succinctly as:

$\theta_{t+1}=\theta_{t}+\alpha_{t}g_{t}(\theta_{t}).$   (2)

Under some mild technical conditions, it was shown in [29] that the iterates generated by TD(0) converge almost surely to the best linear approximator in the span of $\{\phi_{k}\}_{k\in[K]}$. In particular, $\theta_{t}\rightarrow\theta^{*}$, where $\theta^{*}$ is the unique solution of the projected Bellman equation $\Pi_{D}\mathcal{T}_{\mu}(\Phi\theta^{*})=\Phi\theta^{*}$. Here, $D$ is a diagonal matrix with entries given by the elements of the stationary distribution $\pi$ of the kernel $P$. Moreover, $\Pi_{D}(\cdot)$ is the projection operator onto the subspace spanned by $\{\phi_{k}\}_{k\in[K]}$ with respect to the inner product $\langle\cdot,\cdot\rangle_{D}$. The key question we explore in this paper is: What can be said of the convergence of TD(0) when the update direction $g_{t}(\theta_{t})$ in equation 2 is replaced by a distorted/compressed version of it? In the next section, we will make the notion of distortion precise, and outline the various technical challenges that make it non-trivial to answer the question that we posed above.
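To make the recursion in equation 2 concrete, the following is a minimal, self-contained sketch of TD(0) with linear function approximation on a small synthetic MRP. The random chain, features, step-size, and horizon are illustrative assumptions rather than anything prescribed in the paper; for reference, $\theta^{*}$ is computed directly from the relation $\Phi^{\top}D(R+\gamma P\Phi\theta^{*}-\Phi\theta^{*})=0$, which is equivalent to the projected Bellman equation stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, gamma, alpha = 20, 4, 0.5, 0.1          # small illustrative MRP

# Random transition matrix P, reward vector R, and normalized feature matrix Phi
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)
R = rng.random(n)
Phi = rng.standard_normal((n, K))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)

# Stationary distribution pi (left Perron eigenvector of P) and D = diag(pi)
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); pi /= pi.sum()
D = np.diag(pi)

# Reference solution theta* of Phi^T D (R + gamma*P*Phi*theta - Phi*theta) = 0
A = Phi.T @ D @ (gamma * P - np.eye(n)) @ Phi
b = Phi.T @ D @ R
theta_star = np.linalg.solve(A, -b)

# TD(0) along a single Markov trajectory (equation 2)
theta, s = np.zeros(K), 0
for t in range(20000):
    s_next = rng.choice(n, p=P[s])
    td_dir = (R[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta) * Phi[s]
    theta += alpha * td_dir
    s = s_next

print("final error:", np.linalg.norm(theta - theta_star))
```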

Figure 1: A geometric interpretation of the operator $\mathcal{Q}_{\delta}(\cdot)$ in Algorithm 1. A larger $\delta$ induces more distortion.

3 Error-Feedback meets TD Learning

Algorithm 1 The EF-TD Algorithm
1:Input: Initial estimate $\theta_{0}\in\mathbb{R}^{K}$, initial error $e_{-1}=0$, and step-size $\alpha\in(0,1)$.
2:for $t=0,1,\ldots$ do
3:     Observe tuple $X_{t}=(s_{t},s_{t+1},r_{t})$.
4:     Compute perturbed TD(0) direction $h_{t}$:
$h_{t}=\mathcal{Q}_{\delta}\left(e_{t-1}+g_{t}(\theta_{t})\right).$   (3)
5:     Update parameter: $\theta_{t+1}=\theta_{t}+\alpha h_{t}$.
6:     Update error: $e_{t}=e_{t-1}+g_{t}(\theta_{t})-h_{t}$.
7:end for

In this section, we propose a general framework for analyzing TD learning algorithms with distorted update directions. Extreme forms of such distortion could involve replacing each coordinate of the TD(0) direction with just its sign, or retaining just $k\in[K]$ of the largest-magnitude coordinates and zeroing out the rest. In the sequel, we will refer to these variants of TD(0) as SignTD(0) and Top$k$-TD(0), respectively. When coupled with a memory mechanism known as error-feedback, analogous variants of SGD are known to exhibit, perhaps remarkably, almost the same behavior as SGD itself [5]. Inspired by these findings, we ask: Does compressed TD(0) with error-feedback exhibit behavior similar to that of TD(0)?

Our motivation behind studying the above question is to build an understanding of: (i) communication-efficient versions of MARL algorithms where agents exchange compressed model-differentials, i.e., the gradient-like updates (see Section 6); and (ii) the robustness of TD learning algorithms to structured perturbations. It is important to reiterate here that our motivation mirrors that of studying perturbed versions of SGD in the context of optimization.

Description of EF-TD. In Algorithm 1, we propose the compressed TD(0) algorithm with error-feedback (EF-TD). Compared to equation 2, the parameter $\theta_{t}$ is updated based on $h_{t}$ - a compressed version of the actual TD(0) direction $g_{t}(\theta_{t})$, where the compression is due to the operator $\mathcal{Q}_{\delta}:\mathbb{R}^{K}\rightarrow\mathbb{R}^{K}$. The information lost up to time $t-1$ due to compression is accumulated in the memory variable $e_{t-1}$, and injected back into the current step as in equation 3. In line with compressors used for optimization, we consider a fairly general operator $\mathcal{Q}_{\delta}$ that is only required to satisfy the following contraction property for some $\delta\geq 1$:

$\|\mathcal{Q}_{\delta}(\theta)-\theta\|^{2}_{2}\leq\left(1-\frac{1}{\delta}\right)\|\theta\|^{2}_{2},\quad\forall\theta\in\mathbb{R}^{K}.$   (4)

A distortion perspective. For $\theta\neq 0$, it is easy to verify based on equation 4 that $\langle\mathcal{Q}_{\delta}(\theta),\theta\rangle\geq\frac{1}{2\delta}\|\theta\|^{2}_{2}>0$, i.e., the angle between $\mathcal{Q}_{\delta}(\theta)$ and $\theta$ is acute. This provides an alternative view to the compression perspective: one can think of $\mathcal{Q}_{\delta}(\theta)$ as a tilted version of $\theta$, with a larger $\delta$ implying more tilt, and $\delta=1$ implying no distortion at all. See Figure 1 for a visual interpretation of this point.
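For completeness, the acute-angle claim follows by expanding the square in equation 4:

$\|\mathcal{Q}_{\delta}(\theta)-\theta\|_{2}^{2}=\|\mathcal{Q}_{\delta}(\theta)\|_{2}^{2}-2\langle\mathcal{Q}_{\delta}(\theta),\theta\rangle+\|\theta\|_{2}^{2}\leq\left(1-\tfrac{1}{\delta}\right)\|\theta\|_{2}^{2}\;\Longrightarrow\;2\langle\mathcal{Q}_{\delta}(\theta),\theta\rangle\geq\|\mathcal{Q}_{\delta}(\theta)\|_{2}^{2}+\tfrac{1}{\delta}\|\theta\|_{2}^{2}\geq\tfrac{1}{\delta}\|\theta\|_{2}^{2}.$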

The contraction property in equation 4 is satisfied by several popular quantization/sparsification schemes. These include the sign operator, and the Top-$k$ operator that selects the $k$ coordinates with the largest magnitude, zeroing out the rest. Importantly, note that the operator $\mathcal{Q}_{\delta}$ does not necessarily yield an unbiased version of its argument. In optimization, error-feedback serves to counter the bias introduced by $\mathcal{Q}_{\delta}$. In fact, without error-feedback, algorithms like SignSGD can converge very slowly, or not converge at all [37]. In a similar spirit, we empirically observe in Fig. 2 (later in Section 8) that without error-feedback, SignTD(0) can end up making little to no progress towards $\theta^{*}$. However, understanding whether error-feedback guarantees convergence to $\theta^{*}$ is quite non-trivial. We now provide some intuition as to why this is the case.
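As an illustration, the following sketch implements the Top-$k$ compressor, a scaled sign compressor, and a single EF-TD step (equation 3 together with lines 5-6 of Algorithm 1). The helper interface and the $\|\cdot\|_{1}/K$ scaling of the sign operator - one standard way of making the sign scheme satisfy equation 4 - are assumptions on our part, not choices made in the paper.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v; zero out the rest (delta = len(v)/k)."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def scaled_sign(v):
    """Scaled sign compressor (||v||_1 / K) * sign(v); satisfies (4) with delta <= len(v)."""
    return (np.linalg.norm(v, 1) / v.size) * np.sign(v)

def ef_td_step(theta, e, s, s_next, r, phi, gamma, alpha, compress):
    """One EF-TD step (Algorithm 1) for a data tuple (s, s_next, r) and feature matrix phi."""
    g = (r + gamma * phi[s_next] @ theta - phi[s] @ theta) * phi[s]   # TD(0) direction
    h = compress(e + g)             # equation (3): compress memory + fresh direction
    theta = theta + alpha * h       # line 5: parameter update
    e = e + g - h                   # line 6: error (memory) update
    return theta, e

# Quick numerical check of the contraction property (4) for Top-1 on a random vector
rng = np.random.default_rng(1)
v = rng.standard_normal(10)
lhs = np.linalg.norm(top_k(v, 1) - v) ** 2
rhs = (1 - 1 / 10) * np.linalg.norm(v) ** 2     # delta = K/k = 10
print("contraction holds:", lhs <= rhs + 1e-12)
```

For the Top-$k$ operator, the discarded $K-k$ coordinates are the smallest in magnitude, which is what yields $\delta=K/k$ in equation 4.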

Need for Technical Novelty. Error-feedback ensures that past (pseudo)-gradient information is injected with some delay. Thus, for optimization, error-feedback essentially leads to a delayed SGD algorithm. As long as the objective function is smooth, the gradient does not change much, and the delay-effect can be controlled. This intuition does not carry over to our setting since the TD(0) update direction is not a stochastic gradient of any fixed objective function. (As observed by [18] and [34], this can be seen from the fact that the derivative of the TD update direction produces a matrix that is not necessarily symmetric, unlike the symmetric Hessian matrix of a fixed objective function.) Thus, controlling the effect of the compressor-induced delay requires new techniques for our setting. Moreover, unlike the SGD noise, the data tuples $\{X_{t}\}$ are part of the same Markov trajectory, introducing further complications. Finally, since the parameter updates of Algorithm 1 are intricately linked with the error variable $e_{t}$, analyzing their joint behavior requires a more refined Lyapunov drift analysis relative to the standard TD(0) analysis. Despite these challenges, in the sequel, we will establish that EF-TD retains almost the same convergence guarantees as vanilla TD(0).

Remark 1.

It is important to emphasize that even though the error vector $e_{t}$ in EF-TD might be dense, $e_{t}$ never gets transmitted. The communication-efficient aspect of EF-TD stems from the fact that it only requires communicating the compressed update direction $h_{t}$ - a vector that can be represented using just a few bits (depending on the level of compression). This fact is further clarified in our description of the multi-agent version of EF-TD in Section 6 (see line 6 of Algorithm 2).

4 Analysis of EF-TD under Markovian Sampling

The goal of this section is to provide a rigorous finite-time analysis of EF-TD under Markovian sampling. To that end, we need to introduce a few concepts and make certain standard assumptions. We start by assuming that all rewards are uniformly bounded by some $\bar{r}>0$, i.e., $|R(s)|\leq\bar{r},\forall s\in\mathcal{S}$. This ensures that the value functions exist. Next, we state a standard assumption that shows up in the finite-time analysis of iterative RL algorithms [18, 14, 16, 49, 10, 13].

Assumption 1.

The Markov chain induced by the policy $\mu$ is aperiodic and irreducible.

The above assumption implies that the Markov chain induced by $\mu$ admits a unique stationary distribution $\pi$ [50]. Let $\Sigma=\Phi^{\top}D\Phi$. Since $\Phi$ has full column rank, $\Sigma$ is full rank with a strictly positive smallest eigenvalue $\omega<1$. To appreciate the implication of the above assumption, let us define the “steady-state” version of the TD update direction as follows: for a fixed $\theta\in\mathbb{R}^{K}$, let

$\bar{g}(\theta)\triangleq\mathbb{E}_{s_{t}\sim\pi,\,s_{t+1}\sim P_{\mu}(\cdot|s_{t})}\left[g(X_{t},\theta)\right].$

We now introduce the notion of the mixing time $\tau_{\epsilon}$.

Definition 1.

Define $\tau_{\epsilon}\triangleq\min\{t\geq 1:\|\mathbb{E}\left[g(X_{k},\theta)|X_{0}\right]-\bar{g}(\theta)\|_{2}\leq\epsilon\left(\|\theta\|_{2}+1\right),\forall k\geq t,\forall\theta\in\mathbb{R}^{K},\forall X_{0}\}.$

Assumption 1 implies that the total variation distance between the conditional distribution $\mathbb{P}\left(s_{t}=\cdot\,|\,s_{0}=s\right)$ and the stationary distribution $\pi$ decays geometrically fast for all $t\geq 0$, regardless of the initial state $s\in\mathcal{S}$ [50]. As a consequence of this geometric mixing of the Markov chain, it is not hard to show that $\tau_{\epsilon}$ in Definition 1 is $O\left(\log(1/\epsilon)\right)$; see, for instance, [16]. The precision that suffices for all results in this paper is $\epsilon=\alpha^{2}$, where recall that $\alpha$ is the step-size. Henceforth, we will simply use $\tau$ as a shorthand for $\tau_{\alpha}$. Finally, let us define $\sigma\triangleq\max\{1,\bar{r},\|\theta^{*}\|_{2}\}$ as the variance of our noise model, and $d_{t}\triangleq\|\theta_{t}-\theta^{*}\|_{2}$. We can now state the first main result of this paper.
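Definition 1 is phrased in terms of the TD update direction; as a rough numerical companion, one can compute, for a small chain with a known transition matrix, the time it takes the conditional state distribution to come within $\epsilon$ of $\pi$ in total variation. A minimal sketch with an illustrative random chain (not the construction used in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 20, 1e-4
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # illustrative aperiodic, irreducible chain

# Stationary distribution pi (left Perron eigenvector of P)
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); pi /= pi.sum()

# Smallest t such that max_s TV(P^t(s, .), pi) <= eps
Pt, t = np.eye(n), 0
while 0.5 * np.abs(Pt - pi).sum(axis=1).max() > eps:
    Pt, t = Pt @ P, t + 1
print("TV mixing time at eps =", eps, ":", t)
```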

Theorem 1.

Suppose Assumption 1 holds. There exist universal constants $c,C\geq 1$ such that the iterates generated by EF-TD with step-size $\alpha\leq\frac{\omega(1-\gamma)}{c\max\{\delta,\tau\}}$ satisfy the following $\forall T\geq\tau$:

$\mathbb{E}\left[d^{2}_{T}\right]\leq C_{1}\left(1-\frac{\alpha\omega(1-\gamma)}{C\tau}\right)^{T-\tau}+O\left(\frac{\alpha(\tau+\delta)\sigma^{2}}{\omega(1-\gamma)}\right),\quad\textrm{where}\;C_{1}=O(d^{2}_{0}+\sigma^{2}).$   (5)

The proof of Theorem 1 is provided in Appendix D. We now discuss the key implications of this result.

Discussion. Theorem 1 tells us that EF-TD guarantees linear convergence of the iterates to a ball around $\theta^{*}$, where the size of the ball scales with the variance $\sigma^{2}$; this exactly matches the behavior of vanilla TD [18, 14]. Since the step-size $\alpha$ scales inversely with the distortion factor $\delta$, we observe from equation 5 that the exponent of linear convergence gets slackened by $\delta$; once again, this is consistent with analogous results for SGD with (biased) compression [15]. The variance term, namely the second term in equation 5, has exactly the same dependence on $\tau$, $\omega$, and $\gamma$ as one observes for vanilla TD [18, Theorem 3]. Observe that in this term, the effects of $\tau$ and $\delta$ in inflating the variance are additive. Moreover, even in the absence of compression, the dependence of the variance term on the mixing time $\tau$ is known to be unavoidable [51]. This immediately leads to the following interesting conclusion: when the underlying Markov chain induced by the policy mixes “slowly”, i.e., has a large mixing time $\tau$, one can afford to be more aggressive in terms of compression, i.e., use a larger $\delta$, since this would lead to a variance bound that is no worse than in the uncompressed setting. Said differently, Theorem 1 reveals that slowly-mixing Markov chains have a higher tolerance to distortions. This observation is novel since no prior work has studied the effect of distortions to RL algorithms under Markovian sampling. It is also interesting to note here that the phenomenon described above shows up in other contexts too: for instance, the authors in [36] showed that certain quantization mechanisms in optimization automatically come with privacy guarantees.

Theorem 1 is significant in that it is the first result to reveal that, when coupled with error-feedback, compressed TD learning is robust to extreme distortions. We will corroborate this phenomenon empirically as well in Section 8. The other key takeaway from Theorem 1 is that the scope of the error-feedback mechanism - now popularly used in distributed learning - extends to stochastic approximation problems well beyond just static optimization settings.

In Appendix B, we analyze a simpler steady-state version of EF-TD to help build intuition. There, we also provide an analysis of compressed TD learning algorithms that do not employ any error-feedback mechanism. Our analysis reveals that such schemes can also converge, provided the compression parameter $\delta$ satisfies a restrictive condition. Notably, as evident from the statement of Theorem 1, we do not need to impose any restrictions on $\delta$ for the convergence of EF-TD.

5 Compressed Nonlinear Stochastic Approximation with Error-Feedback

For the TD(0) algorithm, the update direction $g_{t}(\theta)$ is affine in the parameter $\theta$, i.e., $g_{t}(\theta)$ is of the form $A(X_{t})\theta-b(X_{t})$. As such, the recursion in equation 2 can be seen as an instance of linear stochastic approximation (SA), where the end goal is to use the data samples $\{X_{t}\}$ to find a $\theta$ that solves the linear equation $A\theta=b$; here, $A$ and $b$ are the steady-state versions of $A(X_{t})$ and $b(X_{t})$, respectively. The more involved TD($\lambda$) algorithms within the TD learning family also turn out to be instances of linear SA. Instead of deriving versions of our results for TD($\lambda$) algorithms in particular, we will consider a much more general nonlinear SA setting. Accordingly, we now study a variant of Algorithm 1 where $g(X_{t},\theta)$ is a general nonlinear map, and as before, $\{X_{t}\}$ comes from a finite-state Markov chain that is assumed to be aperiodic and irreducible. We assume that the nonlinear map satisfies the following regularity conditions.
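For concreteness, matching the definition of $g_{t}(\theta)$ from Section 2 against the form $A(X_{t})\theta-b(X_{t})$ gives

$A(X_{t})=\phi(s_{t})\left(\gamma\phi(s_{t+1})-\phi(s_{t})\right)^{\top},\qquad b(X_{t})=-r_{t}\,\phi(s_{t}).$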

Assumption 2.

There exist $L,\sigma\geq 1$ s.t. the following hold for any $X$ in the space of data tuples: (i) $\|g(X,\theta_{1})-g(X,\theta_{2})\|_{2}\leq L\|\theta_{1}-\theta_{2}\|_{2},\forall\theta_{1},\theta_{2}\in\mathbb{R}^{K}$; and (ii) $\|g(X,\theta)\|_{2}\leq L(\|\theta\|_{2}+\sigma),\forall\theta\in\mathbb{R}^{K}$.

Assumption 3.

Let $\bar{g}(\theta)\triangleq\mathbb{E}_{X_{t}\sim\pi}\left[g(X_{t},\theta)\right]$, $\forall\theta\in\mathbb{R}^{K}$, where $\pi$ is the stationary distribution of the Markov process $\{X_{t}\}$. The equation $\bar{g}(\theta)=0$ has a solution $\theta^{*}$, and $\exists\beta>0$ s.t.

$\langle\theta-\theta^{*},\bar{g}(\theta)-\bar{g}(\theta^{*})\rangle\leq-\beta\|\theta-\theta^{*}\|^{2}_{2},\quad\forall\theta\in\mathbb{R}^{K}.$   (6)

In words, Assumption 2 says that $g(X,\theta)$ is globally uniformly (w.r.t. $X$) Lipschitz in the parameter $\theta$. Assumption 3 is a strong monotonicity property of the map $-\bar{g}(\theta)$ that guarantees that the iterates generated by the steady-state version of equation 2 converge exponentially fast to $\theta^{*}$. To provide some context, consider the TD(0) setting. Under feature normalization and the bounded-rewards assumption, the global Lipschitz property is immediate. Moreover, Assumption 3 corresponds to negative-definiteness of the steady-state matrix $A=\mathbb{E}_{X_{t}\sim\pi}[A(X_{t})]$; this negative-definite property is also easy to verify [14]. For optimization, Assumptions 2 and 3 simply correspond to $L$-smoothness and $\beta$-strong-convexity of the loss function, respectively. For simplicity, we state the main result of this section for $L=2$; the analysis for a general $L$ follows identical arguments.
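The negative-definiteness claim is easy to check numerically in the TD(0) setting. A minimal sketch (the random MRP and features are illustrative assumptions) forms the steady-state matrix $A=\Phi^{\top}D(\gamma P-I)\Phi$ and verifies that its symmetric part has largest eigenvalue at most $-(1-\gamma)\omega$, so that Assumption 3 holds with $\beta\geq(1-\gamma)\omega$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, K, gamma = 30, 5, 0.7

P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)    # illustrative MRP
Phi = rng.standard_normal((n, K))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)            # normalized features

evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); pi /= pi.sum()
D = np.diag(pi)

A = Phi.T @ D @ (gamma * P - np.eye(n)) @ Phi                # steady-state TD matrix
omega = np.linalg.eigvalsh(Phi.T @ D @ Phi).min()            # smallest eigenvalue of Sigma

lam_max = np.linalg.eigvalsh(0.5 * (A + A.T)).max()
print("negative definite:", lam_max < 0)
print("lambda_max <= -(1 - gamma) * omega:", lam_max <= -(1 - gamma) * omega + 1e-10)
```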

Theorem 2.

Suppose Assumption 2 holds with $L=2$, and Assumption 3 holds. Let $\bar{\beta}=\min\{\beta,1/\beta\}$. There exist universal constants $c,C\geq 1$ such that Algorithm 1 with step-size $\alpha\leq\frac{\bar{\beta}}{c\max\{\delta,\tau\}}$ guarantees

$\mathbb{E}\left[d^{2}_{T}\right]\leq C_{1}\left(1-\frac{\alpha\beta}{C\tau}\right)^{T-\tau}+O\left(\frac{\alpha(\tau+\delta)\sigma^{2}}{\beta}\right),\;\forall T\geq\tau,\quad\textrm{where}\;C_{1}=O(d^{2}_{0}+\sigma^{2}).$   (7)

Discussion. We note that the guarantee in Theorem 2 mirrors that in Theorem 1, and represents our setting in its full generality, accounting for nonlinear SA, Markovian noise, compression, and error-feedback. Providing an explicit finite-time analysis for this setting is one of the main contributions of our paper. Now let us comment on the applications of this result. As noted earlier, our result applies to TD($\lambda$) algorithms and SGD under Markovian noise [52]; the effect of compression and error-feedback was previously unknown for both these settings. More importantly, certain instances of Q-learning with linear function approximation can also be captured via Assumptions 2 and 3 [16]. The key implication is that our analysis framework is not just limited to policy evaluation, but rather extends gracefully to decision-making (control) problems. This speaks to the significance of Theorem 2.

6 Communication-Efficient Multi-Agent Reinforcement Learning (MARL)

A key motivation of our work is to develop communication-efficient algorithms for MARL. This is particularly relevant for networked/federated versions of RL problems where communication imposes a major bottleneck [11]. To that end, we consider a collaborative MARL setting that has appeared in various recent works [11, 10, 17, 12, 13, 20]. The setup comprises $M$ agents, all of whom interact with the same MDP and can communicate via a central server. Every agent seeks to evaluate the same policy $\mu$. The purpose of collaboration is as in the standard FL setting: to achieve an $M$-fold reduction in sample-complexity by leveraging information from all agents. In particular, we ask: By exchanging compressed information via the server, is it possible to expedite the process of learning by achieving a linear speedup in the number of agents? While such questions have been extensively studied for supervised learning, we are unaware of any work that addresses them in the context of MARL. We further note that even without compression or error-feedback, establishing linear speedups under Markovian sampling is highly non-trivial, and the only other paper that does so is the very recent work of [13]. As we explain later in Section 7, our proof technique departs significantly from [13].

Algorithm 2 Multi-Agent EF-TD
1:Input: Initial estimate $\theta_{0}\in\mathbb{R}^{K}$, initial errors $e_{i,-1}=0,\forall i\in[M]$, and step-size $\alpha\in(0,1)$.
2:for $t=0,1,\ldots$ do
3:     Server sends $\theta_{t}$ to all agents.
4:     for $i\in[M]$ do
5:         Observe tuple $X_{i,t}=(s_{i,t},s_{i,t+1},r_{i,t})$.
6:         Compute compressed TD(0) direction $h_{i,t}=\mathcal{Q}_{\delta}\left(e_{i,t-1}+g_{i,t}(\theta_{t})\right)$, and send $h_{i,t}$ to the server.
7:         Update error: $e_{i,t}=e_{i,t-1}+g_{i,t}(\theta_{t})-h_{i,t}$.
8:     end for
9:     Server updates the model as follows: $\theta_{t+1}=\theta_{t}+\alpha\bar{h}_{t}$, where $\bar{h}_{t}=(1/M)\sum_{i\in[M]}h_{i,t}$.
10:end for

We propose and analyze a natural multi-agent version of EF-TD, outlined as Algorithm 2. In a nutshell, multi-agent EF-TD operates as follows. At each time-step, the server sends down a model $\theta_{t}$; each agent $i$ observes a local data sample, computes and uploads the compressed direction $h_{i,t}$, and updates its local error. The server then updates the model. Transmitting compressed TD(0) pseudo-gradients is consistent with both works in FL [53], and in MARL [11, 10, 13], where the agents essentially exchange model-differentials (i.e., the update directions), keeping their raw data private. It is worth noting here that while all agents play the same policy, the realizations of the data tuples $\{X_{i,t}\}$ may vary across agents. Let $\tilde{\theta}_{t}=\theta_{t}+\alpha\bar{e}_{t-1}$, where $\bar{e}_{t}=(1/M)\sum_{i\in[M]}e_{i,t}$, and define $\tilde{d}_{t}\triangleq\|\tilde{\theta}_{t}-\theta^{*}\|_{2}$. We can now state our main convergence result for Algorithm 2.
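To make the round structure concrete, here is a minimal sketch of one iteration of Algorithm 2. The compress and sampler callables form a hypothetical interface introduced only for illustration; they are not part of the paper.

```python
import numpy as np

def ef_td_round(theta, errors, samplers, compress, alpha, gamma, Phi, R):
    """One round of multi-agent EF-TD (Algorithm 2).

    theta:    current server model, shape (K,)
    errors:   list of per-agent error vectors e_{i,t-1}, each of shape (K,)
    samplers: list of callables; samplers[i]() returns agent i's local tuple (s, s_next)
    """
    directions = []
    for i, sample in enumerate(samplers):
        s, s_next = sample()                                                # line 5: local data tuple
        g = (R[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta) * Phi[s]  # local TD(0) direction
        h = compress(errors[i] + g)                                         # line 6: compress and upload
        errors[i] = errors[i] + g - h                                       # line 7: local error update
        directions.append(h)
    h_bar = np.mean(directions, axis=0)       # server averages the compressed directions
    return theta + alpha * h_bar, errors      # line 9: model update
```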

Theorem 3.

Suppose Assumption 1 holds. There exist universal constants $c,C\geq 1$ such that with step-size $\alpha\leq\frac{\omega(1-\gamma)}{c\max\{\delta,\tau\}}$, and $C_{1}=O(d^{2}_{0}+\sigma^{2})$, Algorithm 2 guarantees the following $\forall T\geq 2\tau$:

$\mathbb{E}\left[\tilde{d}^{2}_{T}\right]\leq\underbrace{C_{1}\left(1-\frac{\alpha\omega(1-\gamma)}{C\tau}\right)^{T-2\tau}}_{T_{1}}+\underbrace{O\left(\frac{\alpha\tau}{\omega(1-\gamma)}\right)\frac{\sigma^{2}}{M}}_{T_{2}}+\underbrace{O\left(\frac{\alpha^{2}\max\{\delta,\tau\}\delta}{\omega^{2}(1-\gamma)^{2}}\right)\sigma^{2}}_{T_{3}}.$   (8)

The proof of Theorem 3 is deferred to Appendix E. There are several key messages conveyed by Theorem 3. We discuss them below.

Message 1: Linear Speedup. Observe that the noise variance terms in equation 8 are $T_{2}$ and $T_{3}$, where $T_{3}$ is $O(\alpha^{2})$, i.e., a higher-order term in $\alpha$. Thus, for small $\alpha$, $T_{2}$ is the dominant noise term. Compared to the noise term for vanilla TD in [18, Theorem 3], $T_{2}$ in our bound has exactly the same dependence on $\tau$, $\omega$, and $\gamma$, and importantly, exhibits an inverse scaling w.r.t. $M$, implying an $M$-fold reduction in the variance $\sigma^{2}$ relative to the single-agent setting. This is precisely what we wanted. For exact convergence, we can set $\alpha=O(\log(MT^{2})/T)$ to make $T_{1}$ and $T_{3}$ of order $\tilde{O}(1/T^{2})$, and the dominant term $T_{2}=\tilde{O}(\sigma^{2}/(MT))$. Thus, relative to the $O(\sigma^{2}/T)$ rate of TD(0), we achieve a linear speedup w.r.t. the number of agents $M$ in our dominant term $T_{2}$.

Message 2: Communication-efficiency comes (nearly) for free. The second key thing to note is that the dominant $T_{2}$ term is completely unaffected by the compression factor $\delta$. Indeed, the compression factor $\delta$ only affects higher-order terms that decay much faster than $T_{2}$. This means that we can significantly ramp up $\delta$, thereby communicating very little, while preserving the asymptotic rate of convergence of TD(0) and achieving an optimal speedup in the number of agents, i.e., communication-efficiency comes almost for free. For instance, suppose $\mathcal{Q}_{\delta}(\cdot)$ is a Top-$1$ operator, i.e., $h_{i,t}$ has only one non-zero entry, which can be encoded using $\tilde{O}(1)$ bits. In this case, although $\delta=K$, where $K$ is the potentially large number of features, the dominant $\tilde{O}\left(\sigma^{2}/(MT)\right)$ term remains unaffected by $K$. Thus, in the context of MARL, our work is the first to show that asymptotically optimal rates can be achieved by transmitting just $\tilde{O}(1)$ bits per-agent per time-step.

Message 3: Tight Dependence on the Mixing Time. Compared to [13] - the only other paper in MARL that provides a linear speedup under Markovian sampling - our dominant term $T_{2}$ has a tighter $O(\tau)$ dependence on the mixing time $\tau$, as compared to the $O(\tau^{2})$ dependence in [13, Theorem 4.1]. It should be noted here that the $O(\tau)$ dependence is known to be information-theoretically optimal [51].

A few remarks are in order before we end this section.

Remark 2.

We note that the bound in Theorem 3 is stated for $\tilde{\theta}_{T}$, not $\theta_{T}$. Since $\tilde{\theta}_{T}=\theta_{T}+\alpha\bar{e}_{T-1}$, to output $\tilde{\theta}_{T}$, the server needs to query $e_{i,T-1}$ from each agent $i$ only once, at time $T-1$. We believe that this one extra step of communication can also be avoided via a slightly sharper analysis. We provide such an analysis in Appendix F, albeit under a common i.i.d. sampling model [26, 10, 17].

Remark 3.

While our MARL result in Theorem 3 is for TD learning, in light of the developments in Section 5, we note that one can develop analogs of Theorem 3 for multi-agent Q-learning as well.

Remark 4.

Suppose we set $M=1$ in equation 8, and compare the resulting bound with that in equation 5. While the effect of the distortion $\delta$ shows up as a higher-order $O(\alpha^{2})$ term in the former, it manifests as an $O(\alpha)$ term in the latter. The difference in these bounds can be attributed to the fact that we use a finer Lyapunov function to prove Theorem 3 relative to the one used to prove Theorem 1. We do so primarily for the sake of exposition: the relatively simpler proof of Theorem 1 (compared to Theorem 3) helps build much of the key intuition needed to understand the proof of Theorem 3.

7 Technical Challenges in Analysis and Overview of our Proof Techniques

We now discuss the novel steps in our analysis. Complete proof details are provided in the Appendix.

Proof Sketch for Theorem 1. Inspired by the perturbed iterate framework of [54], our first step is to define the perturbed iterate $\tilde{\theta}_{t}=\theta_{t}+\alpha e_{t-1}$. Simple calculations reveal: $\tilde{\theta}_{t+1}=\tilde{\theta}_{t}+\alpha g_{t}(\theta_{t})$. (Indeed, $\tilde{\theta}_{t+1}=\theta_{t+1}+\alpha e_{t}=\theta_{t}+\alpha h_{t}+\alpha\left(e_{t-1}+g_{t}(\theta_{t})-h_{t}\right)=\tilde{\theta}_{t}+\alpha g_{t}(\theta_{t})$.) Notice that this recursion is almost the same as equation 2, other than the fact that the TD(0) update direction is evaluated at $\theta_{t}$ instead of $\tilde{\theta}_{t}$. Thus, to analyze the above recursion, we need to account for the gap between $\theta_{t}$ and $\tilde{\theta}_{t}$, the cause of which is the memory variable $e_{t-1}$. Accordingly, our next key step is to construct a novel Lyapunov function $\psi_{t}$ that captures the joint dynamics of $\tilde{\theta}_{t}$ and $e_{t-1}$:

$\psi_{t}\triangleq\mathbb{E}\left[\tilde{d}^{2}_{t}+\alpha^{2}\|e_{t-1}\|_{2}^{2}\right],\quad\textrm{where}\;\tilde{d}^{2}_{t}=\|\tilde{\theta}_{t}-\theta^{*}\|^{2}_{2}.$   (9)

As far as we are aware, a potential function of the above form has not been analyzed in prior RL work. Our goal now is to exploit the geometric mixing property in Assumption 1 to establish that $\psi_{t}$ decays over time (up to noise terms). This is precisely where the intricate coupling between the parameter $\tilde{\theta}_{t}$, the memory variable $e_{t}$, and the Markovian data tuples $\{X_{t}\}$ makes the analysis quite challenging. Let us elaborate. To exploit mixing-time arguments, we need to condition sufficiently far into the past and bound drift terms of the form $\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|_{2}$. This is where the coupling between $\tilde{\theta}_{t}$ and $e_{t}$ introduces non-standard delay terms, precluding the direct use of prior approaches in RL [18, 14, 16]. This difficulty does not show up in compressed optimization since one deals with i.i.d. data, precluding the need for mixing-time arguments. A workaround here is to employ a projection step (as in [18, 10]) to simplify the analysis. This is not satisfying for two reasons: (i) as we show in the Appendix, the simplification in the analysis comes at the cost of a sub-optimal dependence on the distortion $\delta$; and (ii) to project, one needs prior knowledge of a set containing $\theta^{*}$, which turns out to be unnecessary. Our key innovation here to bound the drift $\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|_{2}$ is the following technical lemma.

Lemma 1.

(Relating Drift to Past Shocks) For EF-TD, the following is true $\forall t\geq\tau$:

$\mathbb{E}\left[\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}_{2}\right]=O(\alpha^{2}\tau^{2})\max_{t-\tau\leq\ell\leq t}\psi_{\ell}+O(\alpha^{2}\tau^{2}\sigma^{2}).$   (10)

The above lemma relates the drift to damped “shocks” (delay terms) from the past, where (i) the amplitude of the shocks is captured by our Lyapunov function; (ii) the duration of these shocks is the mixing time $\tau$; and (iii) the damping arises from the fact that the shocks are scaled by $\alpha^{2}$. In attempting to overcome one difficulty, we have created another for ourselves via the “max” term in equation 10. This is where we make a connection to the analysis of the IAG algorithm in [19], where a similar “max” term shows up. Via this connection, we are able to analyze the following final recursion:

$\psi_{t+1}\leq(1-c\alpha\omega(1-\gamma))\psi_{t}+O(\alpha^{2}\tau)\max_{t-\tau\leq\ell\leq t}\psi_{\ell}+O(\alpha^{2})(\tau+\delta)\sigma^{2},\quad\textrm{where}\;c<1.$   (11)

Proof Sketch for Theorem 3. To establish the linear speedup property, we need a finer Lyapunov function (and much finer analysis) relative to Theorem 1. Accordingly, we construct:

$\Xi_{t}\triangleq\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}_{2}\right]+C\alpha^{3}E_{t-1},\quad\textrm{where}\;E_{t}\triangleq\frac{1}{M}\mathbb{E}\left[\sum_{i=1}^{M}\|e_{i,t}\|^{2}_{2}\right],$   (12)

and $C$ is a suitable design parameter. Using the techniques in [18], [14], and [16] to bound $\|g_{i,t}(\theta)\|^{2}$ unfortunately does not yield the desired linear speedup. Moreover, it is unclear whether the Generalized Moreau Envelope framework in [13] can be extended to analyze equation 12. As such, we depart from these works by establishing the following key result, obtained by carefully exploiting the geometric mixing property.

Lemma 2.

Let $z_{t}(\theta_{t})\triangleq(1/M)\sum_{i\in[M]}g_{i,t}(\theta_{t})$. Under Assumption 1, the following bound holds for Algorithm 2: $\mathbb{E}\left[\|z_{t}(\theta_{t})\|^{2}_{2}\right]=O(1)\,\mathbb{E}[\tilde{d}^{2}_{t}]+O(\alpha^{2})E_{t-1}+O(1/M+\alpha^{4})\sigma^{2},\;\forall t\geq\tau.$

The above norm bound on the average TD direction turns out to be the main ingredient in bounding the drift terms for Algorithm 2, and eventually establishing a recursion of the form in equation 11 for $\Xi_{t}$.

Figure 2: Plots of the mean-squared error $E_{t}=\|\theta_{t}-\theta^{*}\|^{2}_{2}$ for vanilla TD(0) without compression, and SignTD(0) with (EF-SignTD) and without (SignTD) error-feedback. (Left) Discount factor $\gamma=0.5$. (Right) Discount factor $\gamma=0.9$.
Figure 3: Plot of the mean-squared error $E_{t}=\|\theta_{t}-\theta^{*}\|^{2}_{2}$ for EF-TD (Algorithm 1), with $\mathcal{Q}_{\delta}(\cdot)$ chosen to be the Top-$k$ operator. We study the effect of varying the number of components $k$ transmitted.
Figure 4: Plots of the mean-squared error $E_{t}=\|\theta_{t}-\theta^{*}\|^{2}_{2}$ for multi-agent EF-TD (Algorithm 2). (Left) $\mathcal{Q}_{\delta}(\cdot)$ is the sign operator. (Right) $\mathcal{Q}_{\delta}(\cdot)$ is the Top-$k$ operator with $k=2$.

8 Simulations

To corroborate our theory, we construct an MDP with $100$ states, and use a feature matrix $\Phi$ with $K=10$ independent basis vectors. Using this MDP, we generate the state transition matrix $P_{\mu}$ and reward vector $R_{\mu}$ associated with a fixed policy $\mu$. We compare the performance of the vanilla uncompressed TD(0) algorithm with linear function approximation to SignTD(0) with and without error-feedback. We perform $30$ independent trials, and average the errors from each of these trials to report the mean-squared error $E_{t}=\|\theta_{t}-\theta^{*}\|^{2}_{2}$ for each of the algorithms mentioned above. The results of this experiment are reported in Fig. 2. We observe that in the absence of error-feedback, SignTD(0) can make little to no progress towards $\theta^{*}$. In contrast, the behavior of SignTD(0) with error-feedback almost exactly matches that of vanilla uncompressed TD(0). Our results align with similar observations made in the context of optimization, where SignSGD with error-feedback retains almost the same behavior as SGD with no compression [5, 37]. We provide some additional experiments below.
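For readers who wish to reproduce the qualitative behavior in Fig. 2, the following is a self-contained sketch of the comparison on a randomly generated MRP. The random construction, step-size, horizon, and the $\|\cdot\|_{1}/K$ scaling of the sign compressor in the error-feedback variant are illustrative assumptions on our part, not the exact configuration behind the reported figures.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, K, gamma, alpha, T = 100, 10, 0.5, 0.05, 20000

# Random MRP induced by a fixed policy, plus normalized features
P = rng.random((n_states, n_states)); P /= P.sum(axis=1, keepdims=True)
R = rng.random(n_states)
Phi = rng.standard_normal((n_states, K))
Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)

# Stationary distribution and the projected Bellman fixed point theta*
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); pi /= pi.sum()
D = np.diag(pi)
A = Phi.T @ D @ (gamma * P - np.eye(n_states)) @ Phi
b = Phi.T @ D @ R
theta_star = np.linalg.solve(A, -b)

def run(variant):
    """variant in {'td0', 'sign', 'ef-sign'}; returns the final squared error."""
    theta, e, s = np.zeros(K), np.zeros(K), 0
    for _ in range(T):
        s_next = rng.choice(n_states, p=P[s])
        g = (R[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta) * Phi[s]
        if variant == "td0":
            h = g                                     # uncompressed TD(0)
        elif variant == "sign":
            h = np.sign(g)                            # SignTD(0), no error-feedback
        else:                                         # EF-SignTD with a scaled sign compressor
            h = (np.linalg.norm(e + g, 1) / K) * np.sign(e + g)
            e = e + g - h                             # error-feedback memory update
        theta, s = theta + alpha * h, s_next
    return np.linalg.norm(theta - theta_star) ** 2

for variant in ["td0", "sign", "ef-sign"]:
    print(variant, run(variant))
```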

\bullet Simulation of Top$k$-TD(0) with Error-Feedback. We simulate the behavior of EF-TD with the operator $\mathcal{Q}_{\delta}(\cdot)$ chosen to be a Top-$k$ operator. We consider the same MDP as above, with the rewards for this experiment chosen from the interval $[0,1]$. The size of the parameter vector $K$ is $50$, and the discount factor $\gamma$ is set to $0.5$. We vary the level of distortion introduced by $\mathcal{Q}_{\delta}(\cdot)$ by changing the number of components $k$ transmitted. Note that $\delta=K/k$. As one might expect, increasing the distortion $\delta$ by transmitting fewer components impacts the exponential decay rate; this is clearly reflected in Fig. 3.

\bullet Simulation of Multi-Agent EF-TD. To simulate the behavior of multi-agent EF-TD (Algorithm 2), we consider the same MDP on 100 states as before, with rewards in $[0,1]$. The dimension $K$ of the parameter vector is set to $10$ and the discount factor $\gamma$ to $0.3$. We consider two cases: one where $\mathcal{Q}_{\delta}(\cdot)$ is the sign operator, and another where it is a Top-$2$ operator, i.e., only two components are transmitted by each agent at every time-step. We report our findings in Fig. 4, where we vary the number of agents $M$, and observe that for both the sign- and Top-$k$ operator experiments, the residual mean-squared error goes down as we scale up $M$. Since the residual error is essentially an indicator of the noise variance, our plots display a clear effect of variance reduction by increasing the number of agents. Our observations thus align with Theorem 3.

9 Conclusion and Future Work

We contributed to the development of a robustness theory for iterative RL algorithms subject to general compression schemes. In particular, we proposed and analyzed EF-TD - a compressed version of the classical TD(0) algorithm coupled with error-compensation. We then significantly generalized our analysis to nonlinear stochastic approximation and multi-agent settings. Concretely, our work conveys the following key messages: (i) compressed TD learning algorithms with error-feedback can be just as robust as their optimization counterparts; (ii) the popular error-feedback mechanism extends gracefully beyond the static optimization problems it has been explored for thus far; and (iii) linear convergence speedups in multi-agent TD learning can be achieved with very little communication. Our work opens up several research directions. Studying alternate quantization schemes for RL and exploring other RL algorithms (beyond TD and Q-learning) are immediate next steps. A more open-ended question is the following. SignSGD with momentum [7] is known to exhibit convergence behavior very similar to that of adaptive optimization algorithms such as ADAM [55], explaining their use in fast training of deep neural networks. In a similar spirit, can we make connections between SignTD and adaptive RL algorithms? This is an interesting question to explore since its resolution can potentially lead to faster RL algorithms.

References

  • [1] Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527, 2016.
  • [2] Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech dnns. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
  • [3] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in Neural Information Processing Systems, 30:1709–1720, 2017.
  • [4] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Terngrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in neural information processing systems, pages 1509–1519, 2017.
  • [5] Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. In Advances in Neural Information Processing Systems, pages 4447–4458, 2018.
  • [6] Alham Fikri Aji and Kenneth Heafield. Sparse communication for distributed gradient descent. arXiv preprint arXiv:1704.05021, 2017.
  • [7] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signsgd: Compressed optimisation for non-convex problems. In International Conference on Machine Learning, pages 560–569. PMLR, 2018.
  • [8] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • [9] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [10] Thinh Doan, Siva Maguluri, and Justin Romberg. Finite-time analysis of distributed td (0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pages 1626–1635. PMLR, 2019.
  • [11] Jiaju Qi, Qihao Zhou, Lei Lei, and Kan Zheng. Federated reinforcement learning: techniques, applications, and open challenges. arXiv preprint arXiv:2108.11887, 2021.
  • [12] Hao Jin, Yang Peng, Wenhao Yang, Shusen Wang, and Zhihua Zhang. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, pages 18–37. PMLR, 2022.
  • [13] Sajad Khodadadian, Pranay Sharma, Gauri Joshi, and Siva Theja Maguluri. Federated reinforcement learning: Linear speedup under markovian sampling. In International Conference on Machine Learning, pages 10997–11057. PMLR, 2022.
  • [14] Rayadurgam Srikant and Lei Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830. PMLR, 2019.
  • [15] Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.
  • [16] Zaiwei Chen, Sheng Zhang, Thinh T Doan, Siva Theja Maguluri, and John-Paul Clarke. Performance of q-learning with linear function approximation: Stability and finite-time analysis. arXiv preprint arXiv:1905.11425, page 4, 2019.
  • [17] Rui Liu and Alex Olshevsky. Distributed td (0) with almost no communication. arXiv preprint arXiv:2104.07855, 2021.
  • [18] Jalaj Bhandari, Daniel Russo, and Raghav Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on learning theory, pages 1691–1692. PMLR, 2018.
  • [19] Mert Gurbuzbalaban, Asuman Ozdaglar, and Pablo A Parrilo. On the convergence rate of incremental aggregated gradient algorithms. SIAM Journal on Optimization, 27(2):1035–1048, 2017.
  • [20] Han Shen, Kaiqing Zhang, Mingyi Hong, and Tianyi Chen. Towards understanding asynchronous advantage actor-critic: Convergence and linear speedup. IEEE Transactions on Signal Processing, 2023.
  • [21] Jiin Woo, Gauri Joshi, and Yuejie Chi. The blessing of heterogeneity in federated q-learning: Linear speedup and beyond. In International Conference on Machine Learning, pages 37157–37216. PMLR, 2023.
  • [22] Han Wang, Aritra Mitra, Hamed Hassani, George J Pappas, and James Anderson. Federated temporal difference learning with linear function approximation under environmental heterogeneity. arXiv preprint arXiv:2302.02212, 2023.
  • [23] Nicolò Dal Fabbro, Aritra Mitra, and George J Pappas. Federated td learning over finite-rate erasure channels: Linear speedup under markovian sampling. IEEE Control Systems Letters, 2023.
  • [24] Chenyu Zhang, Han Wang, Aritra Mitra, and James Anderson. Finite-time analysis of on-policy heterogeneous federated reinforcement learning. arXiv preprint arXiv:2401.15273, 2024.
  • [25] Haoxing Tian, Ioannis Ch Paschalidis, and Alex Olshevsky. One-shot averaging for distributed td (λ\lambda) under markov sampling. arXiv preprint arXiv:2403.08896, 2024.
  • [26] Gal Dalal, Balázs Szörényi, Gugan Thoppe, and Shie Mannor. Finite sample analyses for td (0) with function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [27] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and William J Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. arXiv preprint arXiv:1712.01887, 2017.
  • [28] Sebastian U Stich and Sai Praneeth Karimireddy. The error-feedback framework: Better rates for sgd with delayed gradients and compressed communication. arXiv preprint arXiv:1909.05350, 2019.
  • [29] John N Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.
  • [30] Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
  • [31] Nathaniel Korda and Prashanth La. On td(0) with function approximation: Concentration bounds and a centered variant with exponential convergence. In International conference on machine learning, pages 626–634. PMLR, 2015.
  • [32] C Narayanan and Csaba Szepesvári. Finite time bounds for temporal difference learning with function approximation: Problems with some “state-of-the-art” results. Technical report, Technical report, 2017.
  • [33] Chandrashekar Lakshminarayanan and Csaba Szepesvári. Linear stochastic approximation: Constant step-size and iterate averaging. arXiv preprint arXiv:1709.04073, 2017.
  • [34] Rui Liu and Alex Olshevsky. Temporal difference learning as gradient splitting. In International Conference on Machine Learning, pages 6905–6913. PMLR, 2021.
  • [35] Prathamesh Mayekar and Himanshu Tyagi. Ratq: A universal fixed-length quantizer for stochastic optimization. In International Conference on Artificial Intelligence and Statistics, pages 1399–1409. PMLR, 2020.
  • [36] Venkata Gandikota, Daniel Kane, Raj Kumar Maity, and Arya Mazumdar. vqsgd: Vector quantized stochastic gradient descent. In International Conference on Artificial Intelligence and Statistics, pages 2197–2205. PMLR, 2021.
  • [37] Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian U Stich, and Martin Jaggi. Error feedback fixes signsgd and other gradient compression schemes. arXiv preprint arXiv:1901.09847, 2019.
  • [38] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Nikola Konstantinov, Sarit Khirirat, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, pages 5973–5983, 2018.
  • [39] Chung-Yi Lin, Victoria Kostina, and Babak Hassibi. Differentially quantized gradient methods. IEEE Transactions on Information Theory, 2022.
  • [40] Robert M Gray. Source coding theory, volume 83. Springer Science & Business Media, 2012.
  • [41] Eduard Gorbunov, Dmitry Kovalev, Dmitry Makarenko, and Peter Richtárik. Linearly converging error compensated sgd. Advances in Neural Information Processing Systems, 33, 2020.
  • [42] Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34:14606–14619, 2021.
  • [43] Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. Ef21: A new, simpler, theoretically better, and practically faster error feedback. Advances in Neural Information Processing Systems, 34:4384–4396, 2021.
  • [44] Kaja Gruntkowska, Alexander Tyurin, and Peter Richtárik. Ef21-p and friends: Improved theoretical communication complexity for distributed optimization with bidirectional compression. arXiv preprint arXiv:2209.15218, 2022.
  • [45] Rajarshi Saha, Mert Pilanci, and Andrea J Goldsmith. Efficient randomized subspace embeddings for distributed optimization under a communication budget. IEEE Journal on Selected Areas in Information Theory, 2022.
  • [46] Osama A Hanna, Lin Yang, and Christina Fragouli. Solving multi-arm bandit using a few bits of communication. In International Conference on Artificial Intelligence and Statistics, pages 11215–11236. PMLR, 2022.
  • [47] Aritra Mitra, Hamed Hassani, and George J Pappas. Linear stochastic bandits over a bit-constrained channel. arXiv preprint arXiv:2203.01198, 2022.
  • [48] Francesco Pase, Deniz Gündüz, and Michele Zorzi. Remote contextual bandits. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 1665–1670. IEEE, 2022.
  • [49] Gandharv Patil, LA Prashanth, Dheeraj Nagaraj, and Doina Precup. Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. In International Conference on Artificial Intelligence and Statistics, pages 5438–5448. PMLR, 2023.
  • [50] David A Levin and Yuval Peres. Markov chains and mixing times, volume 107. American Mathematical Soc., 2017.
  • [51] Dheeraj Nagaraj, Xian Wu, Guy Bresler, Prateek Jain, and Praneeth Netrapalli. Least squares regression with markovian data: Fundamental limits and algorithms. Advances in neural information processing systems, 33:16666–16676, 2020.
  • [52] Thinh T Doan. Finite-time analysis of markov gradient descent. IEEE Transactions on Automatic Control, 2022.
  • [53] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
  • [54] Horia Mania, Xinghao Pan, Dimitris Papailiopoulos, Benjamin Recht, Kannan Ramchandran, and Michael I Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970, 2015.
  • [55] Lukas Balles and Philipp Hennig. Dissecting adam: The sign, magnitude and variance of stochastic gradients. In International Conference on Machine Learning, pages 404–413. PMLR, 2018.
  • [56] Roger A Horn and Charles R Johnson. Matrix analysis. Cambridge university press, 2012.
  • [57] Vivek S Borkar. Stochastic approximation: a dynamical systems viewpoint, volume 48. Springer, 2009.
  • [58] Angelia Nedic, Asuman Ozdaglar, and Pablo A Parrilo. Constrained consensus and optimization in multi-agent networks. IEEE Transactions on Automatic Control, 55(4):922–938, 2010.
  • [59] Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson. A delayed proximal gradient method with linear convergence rate. In 2014 IEEE international workshop on machine learning for signal processing (MLSP), pages 1–6. IEEE, 2014.
  • [60] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U Stich. A unified theory of decentralized sgd with changing topology and local updates. arXiv preprint arXiv:2003.10422, 2020.
  • [61] Sebastian U Stich. On communication compression for distributed optimization on heterogeneous data. arXiv preprint arXiv:2009.02388, 2020.

Appendix A Preliminary Results and Facts

In this section, we will compile and derive some preliminary results that will play a key role in our subsequent analysis. In what follows, unless otherwise stated, we will use \|\cdot\| to refer to the standard Euclidean norm. The next three results are from [18].

Lemma 3.

For all θ1,θ2K\theta_{1},\theta_{2}\in\mathbb{R}^{K}, we have:

ωθ1θ2V^θ1V^θ2Dθ1θ2.\sqrt{\omega}\|\theta_{1}-\theta_{2}\|\leq\|\hat{V}_{\theta_{1}}-\hat{V}_{\theta_{2}}\|_{D}\leq\|\theta_{1}-\theta_{2}\|.

We remind the reader here that in the above result, ω\omega is the smallest eigenvalue of the matrix Σ=ΦDΦ\Sigma=\Phi^{\top}D\Phi. Before stating the next result, we recall the definition of the steady-state TD update direction: for a fixed θK\theta\in\mathbb{R}^{K}, let

g¯(θ)𝔼stπ,st+1Pμ(|st)[g(Xt,θ)].\bar{g}(\theta)\triangleq\mathbb{E}_{s_{t}\sim\pi,s_{t+1}\sim P_{\mu}(\cdot|s_{t})}\left[g(X_{t},\theta)\right].

The next lemma provides intuition as to why the expected steady-state TD(0) update direction g¯(θ)\bar{g}(\theta) acts like a “pseudo-gradient”, driving the TD(0) iterates towards the minimizer θ\theta^{*} of the projected Bellman equation.

Lemma 4.

For any θK\theta\in\mathbb{R}^{K}, the following holds:

θθ,g¯(θ)(1γ)V^θV^θD2.\langle\theta^{*}-\theta,\bar{g}(\theta)\rangle\geq(1-\gamma)\|\hat{V}_{\theta^{*}}-\hat{V}_{\theta}\|^{2}_{D}.

We will have occasion to use the following upper bound on the norm of the steady-state TD(0) update direction.

Lemma 5.

For any θK\theta\in\mathbb{R}^{K}, the following holds:

g¯(θ)2V^θV^θD.\|\bar{g}(\theta)\|\leq 2\|\hat{V}_{\theta^{*}}-\hat{V}_{\theta}\|_{D}.

The following bound on the norm of the random TD(0) update direction will also be invoked several times in our analysis [14].

Lemma 6.

For any θK\theta\in\mathbb{R}^{K}, the following holds t0\forall t\geq 0:

gt(θ)2θ+2r¯2θ+2σ,\|g_{t}(\theta)\|\leq 2\|\theta\|+2\bar{r}\leq 2\|\theta\|+2\sigma, (13)

where σ=max{1,r¯,θ}.\sigma=\max\{1,\bar{r},\|\theta^{*}\|\}.

As mentioned earlier in Section 3, compressed SGD with error-feedback essentially acts like delayed SGD. Thus, for smooth functions where the gradients do not change much, the effect of the delay can be effectively controlled. Unlike this optimization setting, we do not have a fixed objective function at our disposal. So how do we leverage any kind of smoothness property? Fortunately for us, the steady-state TD(0) update direction does satisfy a Lipschitz property; we prove this fact below.

Lemma 7.

(Lipschitz property of steady-state TD(0) update direction) For all θ1,θ2K\theta_{1},\theta_{2}\in\mathbb{R}^{K}, we have:

g¯(θ1)g¯(θ2)θ1θ2.\|\bar{g}(\theta_{1})-\bar{g}(\theta_{2})\|\leq\|\theta_{1}-\theta_{2}\|.
Proof.

We will make use of the explicit affine form of g¯(θ)\bar{g}(\theta) shown below [29]:

g¯(θ)=ΦD(𝒯μΦθΦθ)=A¯θb¯,whereA¯=ΦD(γPμI)Φ,andb¯=ΦDRμ.\bar{g}(\theta)=\Phi^{\top}D\left(\mathcal{T}_{\mu}\Phi\theta-\Phi\theta\right)=\bar{A}\theta-\bar{b},\hskip 2.84526pt\textrm{where}\hskip 2.84526pt\bar{A}=\Phi^{\top}D\left(\gamma P_{\mu}-I\right)\Phi,\hskip 2.84526pt\textrm{and}\hskip 2.84526pt\bar{b}=-\Phi^{\top}DR_{\mu}. (14)

In [18], it was shown that A¯A¯Σ\bar{A}^{\top}\bar{A}\preceq\Sigma, where Σ=ΦDΦ\Sigma=\Phi^{\top}D\Phi. Furthermore, due to feature normalization, it is easy to see that λmax(Σ)1\lambda_{\max}(\Sigma)\leq 1. Using these properties, we have:

g¯(θ1)g¯(θ2)2\displaystyle\|\bar{g}(\theta_{1})-\bar{g}(\theta_{2})\|^{2} =(θ1θ2)A¯A¯(θ1θ2)\displaystyle=\left(\theta_{1}-\theta_{2}\right)^{\top}\bar{A}^{\top}\bar{A}\left(\theta_{1}-\theta_{2}\right) (15)
λmax(A¯A¯)θ1θ22\displaystyle\leq\lambda_{\max}(\bar{A}^{\top}\bar{A})\|\theta_{1}-\theta_{2}\|^{2}
λmax(Σ)θ1θ22\displaystyle\leq\lambda_{\max}(\Sigma)\|\theta_{1}-\theta_{2}\|^{2}
θ1θ22,\displaystyle\leq\|\theta_{1}-\theta_{2}\|^{2},

which leads to the desired claim. For the first inequality above, we used the Rayleigh-Ritz theorem [56, Theorem 4.2.2]. ∎

An immediate consequence of the above result, in tandem with the fact that g¯(θ)=0\bar{g}(\theta^{*})=0, is the following upper bound on the norm of the steady-state TD(0) update direction: θK\forall\theta\in\mathbb{R}^{K}, we have

g¯(θ)=g¯(θ)g¯(θ)θθθ+σ.\|\bar{g}(\theta)\|=\|\bar{g}(\theta)-\bar{g}(\theta^{*})\|\leq\|\theta-\theta^{*}\|\leq\|\theta\|+\sigma. (16)

Essentially, this shows that the bound in Lemma 6 applies to the steady-state TD(0) update direction as well. Next, we prove an analog of the Lipschitz property in Lemma 7 for the random TD(0) update direction.

Lemma 8.

(Lipschitz property of the noisy TD(0) update direction) For all θ1,θ2K\theta_{1},\theta_{2}\in\mathbb{R}^{K}, we have:

gt(θ1)gt(θ2)2θ1θ2.\|{g}_{t}(\theta_{1})-{g}_{t}(\theta_{2})\|\leq 2\|\theta_{1}-\theta_{2}\|.
Proof.

As in the proof of Lemma 7, we will use the fact that the TD(0) update direction is an affine function of the parameter θ\theta. In particular, we have

g_{t}(\theta)=A(X_{t})\theta-b(X_{t}),\hskip 5.69054pt\textrm{where}\hskip 5.69054ptA(X_{t})=\gamma\Phi(s_{t})\Phi^{\top}(s_{t+1})-\Phi(s_{t})\Phi^{\top}(s_{t}),\hskip 5.69054pt\textrm{and}\hskip 5.69054ptb(X_{t})=-\Phi(s_{t})r_{t}.

Thus, we have

gt(θ1)gt(θ2)\displaystyle\|{g}_{t}(\theta_{1})-{g}_{t}(\theta_{2})\| =A(Xt)(θ1θ2)\displaystyle=\|A(X_{t})\left(\theta_{1}-\theta_{2}\right)\| (17)
A(Xt)θ1θ2\displaystyle\leq\|A(X_{t})\|\|\theta_{1}-\theta_{2}\|
(γΦ(st)Φ(st+1)+Φ(st)2)θ1θ2\displaystyle\leq\left(\gamma\|\Phi(s_{t})\|\|\Phi(s_{t+1})\|+\|\Phi(s_{t})\|^{2}\right)\|\theta_{1}-\theta_{2}\|
2θ1θ2,\displaystyle\leq 2\|\theta_{1}-\theta_{2}\|,

where for the last step we used that Φ(s)1,s𝒮\|\Phi(s)\|\leq 1,\forall s\in\mathcal{S}. ∎

In addition to the above results, we will make use of the following facts.

  • Given any two vectors x,yKx,y\in\mathbb{R}^{K}, the following holds for any η>0\eta>0:

    x+y2(1+η)x2+(1+1η)y2.{\|x+y\|}^{2}\leq(1+\eta){\|x\|}^{2}+\left(1+\frac{1}{\eta}\right){\|y\|}^{2}. (18)
  • Given mm vectors x1,,xmKx_{1},\ldots,x_{m}\in\mathbb{R}^{K}, the following is a simple application of Jensen’s inequality:

    i=1mxi2mi=1mxi2.\bigg{\lVert}\sum\limits_{i=1}^{m}x_{i}\bigg{\rVert}^{2}\leq m\sum\limits_{i=1}^{m}{\|x_{i}\|}^{2}. (19)

Appendix B Building Intuition: Analysis of Mean-Path EF-TD

Since the dynamics of EF-TD are quite complex, and have not been studied before, we provide an analysis in this section for the simplest setting in our paper: the noiseless steady-state version of EF-TD where gt(θt)g_{t}(\theta_{t}) in line 4 of Algorithm 1 is replaced by g¯(θt)\bar{g}(\theta_{t}). We refer to this variant as the mean-path version of EF-TD, and compile its governing equations below:

ht\displaystyle h_{t} =𝒬δ(et1+g¯(θt)),\displaystyle=\mathcal{Q}_{\delta}\left(e_{t-1}+\bar{g}(\theta_{t})\right), (20)
θt+1\displaystyle\theta_{t+1} =θt+αht,\displaystyle=\theta_{t}+\alpha h_{t},
et\displaystyle e_{t} =et1+g¯(θt)ht,\displaystyle=e_{t-1}+\bar{g}(\theta_{t})-h_{t},

where the above equations hold for t=0,1,t=0,1,\ldots, with θ0K\theta_{0}\in\mathbb{R}^{K}, and e1=0e_{-1}=0. The goal of this section is to provide an analysis of mean-path EF-TD, and, in the process, outline some of the key ideas that will aid us in the more involved settings to follow. We have the following result.

Theorem 4.

(Noiseless Setting) There exist universal constants c,C1c,C\geq 1, such that the iterates generated by the mean-path version of EF-TD with step-size α=(1γ)/(cδ)\alpha=(1-\gamma)/(c\delta) satisfy the following after TT iterations:

θTθ222(1(1γ)2ωCδ)Tθ0θ22.\|\theta_{T}-\theta^{*}\|^{2}_{2}\leq 2\left(1-\frac{(1-\gamma)^{2}\omega}{C\delta}\right)^{T}\|\theta_{0}-\theta^{*}\|^{2}_{2}. (21)

Discussion. Theorem 4 reveals linear convergence of the iterates to θ\theta^{*}. When δ=1\delta=1, i.e., when there is no distortion to the TD(0) direction, the linear rate of convergence exactly matches that in [18]. Moreover, when δ>1\delta>1, the slowdown in the linear rate by a factor of δ\delta is also exactly consistent with analogous results for SGD with (biased) compression [15]. Thus, our result captures - in a transparent way - precisely what one could have hoped for.
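As a quick numerical companion to Theorem 4, one can iterate the recursion in equation 20 directly using the affine form of $\bar{g}(\cdot)$ in equation 14. The sketch below reuses $\bar{A}$, $\bar{b}$, $\theta^{*}$, and the scaled-sign compressor from the simulation sketch in our experiments section; the step-size is an illustrative choice rather than the constant prescribed by the theorem.

```python
def g_bar(theta):
    """Steady-state TD(0) direction in its affine form (equation 14)."""
    return A_bar @ theta - b_bar

def mean_path_ef_td(T=3000, alpha=0.01, compress=scaled_sign):
    theta, e = np.zeros(K), np.zeros(K)
    errors = []
    for _ in range(T):
        g = g_bar(theta)
        h = compress(e + g)           # h_t = Q_delta(e_{t-1} + g_bar(theta_t))
        theta = theta + alpha * h     # theta_{t+1} = theta_t + alpha * h_t
        e = e + g - h                 # e_t = e_{t-1} + g_bar(theta_t) - h_t
        errors.append(np.linalg.norm(theta - theta_star) ** 2)
    return errors

errs = mean_path_ef_td()
print("squared error at t = 1, 1000, 3000:", errs[0], errs[999], errs[-1])
```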

To proceed with the analysis of mean-path EF-TD, we will make use of the perturbed iterate framework from [54]. In particular, let us define the perturbed iterate θ~tθt+αet1.\tilde{\theta}_{t}\triangleq\theta_{t}+\alpha e_{t-1}. Using equation 20, we then obtain:

θ~t+1\displaystyle\tilde{\theta}_{t+1} =θt+1+αet\displaystyle=\theta_{t+1}+\alpha e_{t} (22)
=θt+αht+α(et1+g¯(θt)ht)\displaystyle=\theta_{t}+\alpha h_{t}+\alpha\left(e_{t-1}+\bar{g}(\theta_{t})-h_{t}\right)
=θ~t+αg¯(θt).\displaystyle=\tilde{\theta}_{t}+\alpha\bar{g}(\theta_{t}).

The final recursion above looks almost like the standard steady-state TD(0) update rule, other than the fact that $\bar{g}(\theta_{t})$ is evaluated at $\theta_{t}$, and not at $\tilde{\theta}_{t}$. To account for this “mismatch” introduced by the memory-variable $e_{t-1}$, we will analyze the following composite Lyapunov function:

ψtθ~tθ2+α2et12.\psi_{t}\triangleq\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+\alpha^{2}\|e_{t-1}\|^{2}. (23)

Note that the above energy function captures the joint dynamics of the perturbed iterate and the memory variable. Our goal is to prove that this energy function decays exponentially over time. To that end, we start by establishing a bound on θ~t+1θ2\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2} in the following lemma.

Lemma 9.

(Bound on Perturbed Iterate) Suppose the step-size α\alpha is chosen such that α(1γ)/8\alpha\leq(1-\gamma)/8. Then, the iterates generated by the mean-path version of EF-TD satisfy:

θ~t+1θ2θ~tθ2α(1γ)4V^θ~tV^θD2+5α3(1γ)et12.\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}\leq\|\tilde{\theta}_{t}-\theta^{*}\|^{2}-\frac{\alpha(1-\gamma)}{4}\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+\frac{5\alpha^{3}}{(1-\gamma)}\|e_{t-1}\|^{2}. (24)
Proof.

Subtracting θ\theta^{*} from each side of equation 22 and then squaring both sides yields:

θ~t+1θ2\displaystyle\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2} =θ~tθ2+2αθ~tθ,g¯(θt)+α2g¯(θt)2\displaystyle=\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle+\alpha^{2}\|\bar{g}(\theta_{t})\|^{2} (25)
=θ~tθ2+2αθtθ,g¯(θt)+α2g¯(θt)2+2αθ~tθt,g¯(θt)\displaystyle=\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+2\alpha\langle{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle+\alpha^{2}\|\bar{g}(\theta_{t})\|^{2}+2\alpha\langle\tilde{\theta}_{t}-{\theta}_{t},\bar{g}(\theta_{t})\rangle
(a)θ~tθ2+2αθtθ,g¯(θt)+α(α+1η)g¯(θt)2+αηθ~tθt2\displaystyle\overset{(a)}{\leq}\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+2\alpha\langle{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle+\alpha\left(\alpha+\frac{1}{\eta}\right)\|\bar{g}(\theta_{t})\|^{2}+\alpha\eta\|\tilde{\theta}_{t}-\theta_{t}\|^{2}
=(b)θ~tθ2+2αθtθ,g¯(θt)+α(α+1η)g¯(θt)2+α3ηet12.\displaystyle\overset{(b)}{=}\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+2\alpha\langle{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle+\alpha\left(\alpha+\frac{1}{\eta}\right)\|\bar{g}(\theta_{t})\|^{2}+\alpha^{3}\eta\|e_{t-1}\|^{2}.

For (a), we used the fact that for any two vectors x,yKx,y\in\mathbb{R}^{K}, the following holds for all η>0\eta>0,

x,y12ηx2+η2y2.\langle x,y\rangle\leq\frac{1}{2\eta}\|x\|^{2}+\frac{\eta}{2}\|y\|^{2}.

We will pick an appropriate $\eta$ shortly. For (b), we used the fact that, from the definition of the perturbed iterate, it holds that $\tilde{\theta}_{t}-\theta_{t}=\alpha e_{t-1}$. We will now make use of Lemmas 4 and 5. We proceed as follows.

θ~t+1θ2\displaystyle\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2} θ~tθ2+2αθtθ,g¯(θt)+α(α+1η)g¯(θt)2+α3ηet12\displaystyle\leq\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+2\alpha\langle{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle+\alpha\left(\alpha+\frac{1}{\eta}\right)\|\bar{g}(\theta_{t})\|^{2}+\alpha^{3}\eta\|e_{t-1}\|^{2} (26)
(a)θ~tθ22α(1γ)V^θtV^θD2+α(α+1η)g¯(θt)2+α3ηet12\displaystyle\overset{(a)}{\leq}\|\tilde{\theta}_{t}-\theta^{*}\|^{2}-2\alpha(1-\gamma)\|\hat{V}_{\theta_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+\alpha\left(\alpha+\frac{1}{\eta}\right)\|\bar{g}(\theta_{t})\|^{2}+\alpha^{3}\eta\|e_{t-1}\|^{2}
(b)θ~tθ2α(2(1γ)4(α+1η))V^θtV^θD2+α3ηet12\displaystyle\overset{(b)}{\leq}\|\tilde{\theta}_{t}-\theta^{*}\|^{2}-\alpha\left(2(1-\gamma)-4\left(\alpha+\frac{1}{\eta}\right)\right)\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+\alpha^{3}\eta\|e_{t-1}\|^{2}
(c)θ~tθ2α(1γ)(14α(1γ))V^θtV^θD2+4α3(1γ)et12\displaystyle\overset{(c)}{\leq}\|\tilde{\theta}_{t}-\theta^{*}\|^{2}-\alpha(1-\gamma)\left(1-\frac{4\alpha}{(1-\gamma)}\right)\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+\frac{4\alpha^{3}}{(1-\gamma)}\|e_{t-1}\|^{2}
(d)θ~tθ2α(1γ)2V^θtV^θD2+4α3(1γ)et12.\displaystyle\overset{(d)}{\leq}\|\tilde{\theta}_{t}-\theta^{*}\|^{2}-\frac{\alpha(1-\gamma)}{2}\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+\frac{4\alpha^{3}}{(1-\gamma)}\|e_{t-1}\|^{2}.

In the above steps, (a) follows from Lemma 4; (b) follows from Lemma 5; (c) follows by setting η=4/(1γ)\eta=4/(1-\gamma); and (d) is a consequence of the fact that α(1γ)/8\alpha\leq(1-\gamma)/8. To complete the proof, we need to relate V^θtV^θD2\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D} to V^θ~tV^θD2\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}. We do so by using the fact that for any x,ynx,y\in\mathbb{R}^{n}, it holds that x+yD22xD2+2yD2.\|x+y\|^{2}_{D}\leq 2\|x\|^{2}_{D}+2\|y\|^{2}_{D}. This yields:

V^θ~tV^θD2\displaystyle\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D} 2V^θtV^θD2+2V^θ~tV^θtD2\displaystyle\leq 2\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+2\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta_{t}}\|^{2}_{D} (27)
(a)2V^θtV^θD2+2θ~tθt2\displaystyle\overset{(a)}{\leq}2\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+2\|{\tilde{\theta}_{t}}-{\theta_{t}}\|^{2}
2V^θtV^θD2+2α2et12,\displaystyle\leq 2\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+2\alpha^{2}\|e_{t-1}\|^{2},

where for (a), we used Lemma 3. Rearranging and simplifying, we obtain:

V^θtV^θD212V^θ~tV^θD2+α2et12.-\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\leq-\frac{1}{2}\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+\alpha^{2}\|e_{t-1}\|^{2}.

Plugging the above inequality in equation 26 leads to equation 24. This completes the proof. ∎

The last term in equation 24 is one which does not show up in the standard analysis of TD(0), and is unique to our setting. In our next result, we control this extra term (depending on the memory variable) by appealing to the contraction property of the compression operator 𝒬δ()\mathcal{Q}_{\delta}(\cdot) in equation 4.

Lemma 10.

(Bound on Memory Variable) For the mean-path version of EF-TD, the following holds:

et2(112δ+4α2δ)et12+16δV^θ~tV^θD2.\|e_{t}\|^{2}\leq\left(1-\frac{1}{2\delta}+4\alpha^{2}\delta\right)\|e_{t-1}\|^{2}+16\delta\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{{\theta}^{*}}\|^{2}_{D}. (28)
Proof.

We begin as follows:

et2\displaystyle\|e_{t}\|^{2} =et1+g¯(θt)ht2\displaystyle=\|e_{t-1}+\bar{g}(\theta_{t})-h_{t}\|^{2} (29)
=et1+g¯(θt)𝒬δ(et1+g¯(θt))2\displaystyle=\|e_{t-1}+\bar{g}(\theta_{t})-\mathcal{Q}_{\delta}\left(e_{t-1}+\bar{g}(\theta_{t})\right)\|^{2}
(a)(11δ)et1+g¯(θt)2\displaystyle\overset{(a)}{\leq}\left(1-\frac{1}{\delta}\right)\|e_{t-1}+\bar{g}(\theta_{t})\|^{2}
(b)(11δ)(1+1η)et12+(11δ)(1+η)g¯(θt)2,\displaystyle\overset{(b)}{\leq}\left(1-\frac{1}{\delta}\right)\left(1+\frac{1}{\eta}\right)\|e_{t-1}\|^{2}+\left(1-\frac{1}{\delta}\right)\left(1+\eta\right)\|\bar{g}(\theta_{t})\|^{2},

for some η>0\eta>0 to be chosen by us shortly. Here, for (a), we used the contraction property of 𝒬δ()\mathcal{Q}_{\delta}(\cdot) in equation 4; for (b), we used the relaxed triangle inequality equation 18. To ensure that et2\|e_{t}\|^{2} contracts over time, we want

(11δ)(1+1η)<1η>(δ1).\left(1-\frac{1}{\delta}\right)\left(1+\frac{1}{\eta}\right)<1\implies\eta>(\delta-1).

Accordingly, suppose η=δ1.\eta=\delta-1. Simple calculations then yield

(11δ)(1+1η)=(112δ);(11δ)(1+η)<2δ.\left(1-\frac{1}{\delta}\right)\left(1+\frac{1}{\eta}\right)=\left(1-\frac{1}{2\delta}\right);\hskip 14.22636pt\left(1-\frac{1}{\delta}\right)\left(1+\eta\right)<2\delta.

Plugging these bounds back in equation 29, we obtain

et2\displaystyle\|e_{t}\|^{2} (112δ)et12+2δg¯(θt)2\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+2\delta\|\bar{g}(\theta_{t})\|^{2} (30)
(112δ)et12+2δg¯(θt)g¯(θ~t)+g¯(θ~t)2\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+2\delta\|\bar{g}(\theta_{t})-\bar{g}(\tilde{\theta}_{t})+\bar{g}(\tilde{\theta}_{t})\|^{2}
(112δ)et12+4δg¯(θt)g¯(θ~t)2+4δg¯(θ~t)2\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+4\delta\|\bar{g}(\theta_{t})-\bar{g}(\tilde{\theta}_{t})\|^{2}+4\delta\|\bar{g}(\tilde{\theta}_{t})\|^{2}
(a)(112δ)et12+4δθtθ~t2+4δg¯(θ~t)2\displaystyle\overset{(a)}{\leq}\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+4\delta\|\theta_{t}-\tilde{\theta}_{t}\|^{2}+4\delta\|\bar{g}(\tilde{\theta}_{t})\|^{2}
=(112δ+4α2δ)et12+4δg¯(θ~t)2\displaystyle=\left(1-\frac{1}{2\delta}+4\alpha^{2}\delta\right)\|e_{t-1}\|^{2}+4\delta\|\bar{g}(\tilde{\theta}_{t})\|^{2}
(b)(112δ+4α2δ)et12+16δV^θ~tV^θD2.\displaystyle\overset{(b)}{\leq}\left(1-\frac{1}{2\delta}+4\alpha^{2}\delta\right)\|e_{t-1}\|^{2}+16\delta\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{{\theta}^{*}}\|^{2}_{D}.

In the above steps, for (a) we used the Lipschitz property of the steady-state TD(0) update direction, namely Lemma 7; and for (b), we used Lemma 5. This concludes the proof. ∎

Lemmas 9 and 10 reveal that the error-dynamics of the perturbed iterate and the memory variable are coupled with each other. As such, they cannot be studied in isolation. This precisely motivates the choice of the Lyapunov function ψt\psi_{t} in equation 23. We are now ready to complete the proof of Theorem 4.

Proof.

(Proof of Theorem 4) Using the bounds from Lemmas 9 and 10, and recalling the definition of the Lyapunov function ψt\psi_{t} from equation 23, we have

ψt+1\displaystyle\psi_{t+1} θ~tθ2α(1γ)4(164αδ(1γ))V^θ~tV^θD2+α2(112δ+4α2δ+5α(1γ))et12\displaystyle\leq\|\tilde{\theta}_{t}-\theta^{*}\|^{2}-\frac{\alpha(1-\gamma)}{4}\left(1-\frac{64\alpha\delta}{(1-\gamma)}\right)\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+\alpha^{2}\left(1-\frac{1}{2\delta}+4\alpha^{2}\delta+\frac{5\alpha}{(1-\gamma)}\right)\|e_{t-1}\|^{2} (31)
(1αω(1γ)4(164αδ(1γ)))𝒜1θ~tθ2+α2(112δ+4α2δ+5α(1γ))𝒜2et12,\displaystyle\leq\underbrace{\left(1-\frac{\alpha\omega(1-\gamma)}{4}\left(1-\frac{64\alpha\delta}{(1-\gamma)}\right)\right)}_{\mathcal{A}_{1}}\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+\alpha^{2}\underbrace{\left(1-\frac{1}{2\delta}+4\alpha^{2}\delta+\frac{5\alpha}{(1-\gamma)}\right)}_{\mathcal{A}_{2}}\|e_{t-1}\|^{2},

where we used Lemma 3 in the last step. Our goal is to establish an inequality of the form ψt+1νψt\psi_{t+1}\leq\nu\psi_{t} for some ν<1\nu<1. To that end, the next step of the proof is to pick the step-size α\alpha in a way such that max{𝒜1,𝒜2}<1.\max\{\mathcal{A}_{1},\mathcal{A}_{2}\}<1. Accordingly, with α=(1γ)/(128δ)\alpha=(1-\gamma)/(128\delta), we have that

𝒜1=(1(1γ)2ω1024δ);and𝒜2(114δ).\mathcal{A}_{1}=\left(1-\frac{(1-\gamma)^{2}\omega}{1024\delta}\right);\hskip 5.69054pt\textrm{and}\hskip 5.69054pt\mathcal{A}_{2}\leq\left(1-\frac{1}{4\delta}\right).

Furthermore, it is easy to check that 𝒜2𝒜1\mathcal{A}_{2}\leq\mathcal{A}_{1}. Combining these observations with equation 31, we obtain:

ψt+1(1(1γ)2ω1024δ)(θ~tθ2+α2et12)ψt.\psi_{t+1}\leq\left(1-\frac{(1-\gamma)^{2}\omega}{1024\delta}\right)\underbrace{\left(\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+\alpha^{2}\|e_{t-1}\|^{2}\right)}_{\psi_{t}}.

Unrolling the above recursion yields:

ψT\displaystyle\psi_{T} (1(1γ)2ω1024δ)Tψ0\displaystyle\leq\left(1-\frac{(1-\gamma)^{2}\omega}{1024\delta}\right)^{T}\psi_{0} (32)
=(1(1γ)2ω1024δ)Tθ0θ2,\displaystyle=\left(1-\frac{(1-\gamma)^{2}\omega}{1024\delta}\right)^{T}\|\theta_{0}-\theta^{*}\|^{2},

where for the last step, we used the fact that e1=0e_{-1}=0. To conclude the proof, it suffices to notice that:

θTθ2\displaystyle{\|{\theta}_{T}-\theta^{*}\|}^{2} =θTθ~T+θ~Tθ2\displaystyle={\|{\theta}_{T}-\tilde{\theta}_{T}+\tilde{\theta}_{T}-\theta^{*}\|}^{2} (33)
2θ~Tθ2+2θTθ~T2\displaystyle\leq 2{\|\tilde{\theta}_{T}-\theta^{*}\|}^{2}+2{\|{\theta}_{T}-\tilde{\theta}_{T}\|}^{2}
=2θ~Tθ2+2α2eT12\displaystyle=2{\|\tilde{\theta}_{T}-\theta^{*}\|}^{2}+2{\alpha}^{2}{\|e_{T-1}\|}^{2}
=2ψT.\displaystyle=2\psi_{T}.

B.1 Can Compressed TD Methods Without Error-Feedback Still Converge?

Earlier in this section, we provided intuition as to why EF-TD converges by studying its dynamics in the steady-state. One might ask: Can compressed TD algorithms without error-feedback still converge? If so, under what conditions? We turn to answering these questions in this subsection. In what follows, we will show that compressed TD without error-feedback can still converge, provided certain restrictive conditions on the compression parameter δ\delta are met. Notably, these conditions are no longer needed when one employs error-feedback. To convey the key ideas, we consider a mean-path version of compressed TD shown below:

θt+1=θt+α𝒬δ(g¯(θt)),\theta_{t+1}=\theta_{t}+\alpha\mathcal{Q}_{\delta}(\bar{g}(\theta_{t})), (34)

where 𝒬δ()\mathcal{Q}_{\delta}(\cdot) is the compression operator in equation 4. From the above display, we immediately have

θt+1θ2=θtθ2+2αθtθ,𝒬δ(g¯(θt))+α2𝒬δ(g¯(θt))2.\|\theta_{t+1}-\theta^{*}\|^{2}=\|\theta_{t}-\theta^{*}\|^{2}+2\alpha\langle\theta_{t}-\theta^{*},\mathcal{Q}_{\delta}(\bar{g}(\theta_{t}))\rangle+\alpha^{2}\|\mathcal{Q}_{\delta}(\bar{g}(\theta_{t}))\|^{2}. (35)

Among the three terms on the R.H.S. of the above equation, notice that the only term that can lead to a decrease in the iterate error θt+1θ2\|\theta_{t+1}-\theta^{*}\|^{2} is clearly 2αθtθ,𝒬δ(g¯(θt))2\alpha\langle\theta_{t}-\theta^{*},\mathcal{Q}_{\delta}(\bar{g}(\theta_{t}))\rangle. As such, let us fix a θK\theta\in\mathbb{R}^{K}, and investigate what we can say about θθ,𝒬δ(g¯(θ)).\langle\theta-\theta^{*},\mathcal{Q}_{\delta}(\bar{g}(\theta))\rangle. First, notice that if there is no compression, i.e., δ=1\delta=1, then 𝒬δ(g¯(θ))=g¯(θ)\mathcal{Q}_{\delta}(\bar{g}(\theta))=\bar{g}(\theta), and we know from Lemmas 3 and  4 that

θθ,g¯(θ)βθθ2,\langle\theta-\theta^{*},\bar{g}(\theta)\rangle\leq-\beta\|\theta-\theta^{*}\|^{2}, (36)

where β=ω(1γ)(0,1).\beta=\omega(1-\gamma)\in(0,1). It is precisely the above key property that causes uncompressed TD to converge to θ\theta^{*}. Now let us observe:

θθ,𝒬δ(g¯(θ))\displaystyle\langle\theta-\theta^{*},\mathcal{Q}_{\delta}(\bar{g}(\theta))\rangle =θθ,g¯(θ)+θθ,𝒬δ(g¯(θ))g¯(θ)\displaystyle=\langle\theta-\theta^{*},\bar{g}(\theta)\rangle+\langle\theta-\theta^{*},\mathcal{Q}_{\delta}(\bar{g}(\theta))-\bar{g}(\theta)\rangle (37)
θθ,g¯(θ)+θθ𝒬δ(g¯(θ))g¯(θ)\displaystyle\leq\langle\theta-\theta^{*},\bar{g}(\theta)\rangle+\|\theta-\theta^{*}\|\|\mathcal{Q}_{\delta}(\bar{g}(\theta))-\bar{g}(\theta)\|
(a)θθ,g¯(θ)+(11δ)θθg¯(θ)\displaystyle\overset{(a)}{\leq}\langle\theta-\theta^{*},\bar{g}(\theta)\rangle+\sqrt{\left(1-\frac{1}{\delta}\right)}\|\theta-\theta^{*}\|\|\bar{g}(\theta)\|
(b)θθ,g¯(θ)+(11δ)θθ2\displaystyle\overset{(b)}{\leq}\langle\theta-\theta^{*},\bar{g}(\theta)\rangle+\sqrt{\left(1-\frac{1}{\delta}\right)}\|\theta-\theta^{*}\|^{2}
(c)(β(11δ))θθ2.\displaystyle\overset{(c)}{\leq}-\left(\beta-\sqrt{\left(1-\frac{1}{\delta}\right)}\right)\|\theta-\theta^{*}\|^{2}.

In the above steps, (a) follows from equation 4, (b) follows from equation 16, and (c) from equation 36. Comparing equation 37 to equation 36, we conclude that for the distorted TD direction 𝒬δ(g¯(θ))\mathcal{Q}_{\delta}(\bar{g}(\theta)) to ensure progress towards θ\theta^{*}, we need the following condition to hold:

(11δ)<β.\boxed{\sqrt{\left(1-\frac{1}{\delta}\right)}<\beta.} (38)

Simplifying, the above condition amounts to

δ<1(1β2).\boxed{\delta<\frac{1}{(1-\beta^{2})}.} (39)

The parameter β(0,1)\beta\in(0,1) gets fixed when one fixes an MDP, the policy to be evaluated, and the feature vectors for linear function approximation. The condition for contraction/convergence in equation 39 tells us that this parameter β\beta limits the extent of compression δ\delta. Said differently, one cannot choose the compression level δ\delta to be arbitrarily large; rather it is dictated by the problem-dependent parameter β\beta. It is important to note here that no such restriction on δ\delta is necessary when one uses error-feedback, as revealed by our analysis for mean-path EF-TD. This highlights the benefit of using error-feedback in the context of compressed TD learning. With these observations in place, let us return to our analysis of the update rule in equation 34. For ease of notation, let us define

ζ(β(11δ)),\zeta\triangleq\left(\beta-\sqrt{\left(1-\frac{1}{\delta}\right)}\right),

and note that if the compression parameter δ\delta satisfies the condition in equation 39, then ζ>0\zeta>0. Plugging the bound from equation 37 in equation 35, we obtain

θt+1θ2\displaystyle\|\theta_{t+1}-\theta^{*}\|^{2} =θtθ2+2αθtθ,𝒬δ(g¯(θt))+α2𝒬δ(g¯(θt))2\displaystyle=\|\theta_{t}-\theta^{*}\|^{2}+2\alpha\langle\theta_{t}-\theta^{*},\mathcal{Q}_{\delta}(\bar{g}(\theta_{t}))\rangle+\alpha^{2}\|\mathcal{Q}_{\delta}(\bar{g}(\theta_{t}))\|^{2} (40)
(12αζ)θtθ2+α2𝒬δ(g¯(θt))g¯(θt)+g¯(θt)2\displaystyle\leq\left(1-2\alpha\zeta\right)\|\theta_{t}-\theta^{*}\|^{2}+\alpha^{2}\|\mathcal{Q}_{\delta}(\bar{g}(\theta_{t}))-\bar{g}(\theta_{t})+\bar{g}(\theta_{t})\|^{2}
(12αζ)θtθ2+2α2𝒬δ(g¯(θt))g¯(θt)2+2α2g¯(θt)2\displaystyle\leq\left(1-2\alpha\zeta\right)\|\theta_{t}-\theta^{*}\|^{2}+2\alpha^{2}\|\mathcal{Q}_{\delta}(\bar{g}(\theta_{t}))-\bar{g}(\theta_{t})\|^{2}+2\alpha^{2}\|\bar{g}(\theta_{t})\|^{2}
(a)(12αζ)θtθ2+2(21δ)α2g¯(θt)2\displaystyle\overset{(a)}{\leq}\left(1-2\alpha\zeta\right)\|\theta_{t}-\theta^{*}\|^{2}+2\left(2-\frac{1}{\delta}\right)\alpha^{2}\|\bar{g}(\theta_{t})\|^{2}
(b)(12αζ+4α2)θtθ2,\displaystyle\overset{(b)}{\leq}\left(1-2\alpha\zeta+4\alpha^{2}\right)\|\theta_{t}-\theta^{*}\|^{2},

where (a) follows from equation 4 and (b) from equation 16. Thus, with αζ/4\alpha\leq\zeta/4, we have

θt+1θ2(1αζ)θtθ2.\|\theta_{t+1}-\theta^{*}\|^{2}\leq\left(1-\alpha\zeta\right)\|\theta_{t}-\theta^{*}\|^{2}.

We conclude that when the compression parameter δ\delta satisfies the condition in equation 39, and the step-size is chosen to be suitably small, the compressed TD update rule in equation 34 does converge linearly to θ\theta^{*}.
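To get a feel for how restrictive the condition in equation 39 can be, the following snippet (reusing $\Phi$, $D$, and $\gamma$ from the simulation sketch in our experiments section) computes $\beta=\omega(1-\gamma)$ and the resulting ceiling on $\delta$; since $\omega$ is typically small, the admissible $\delta$ barely exceeds $1$ without error-feedback.

```python
# How restrictive is the condition in equation 39 for a typical instance?
Sigma = Phi.T @ D @ Phi
omega = np.linalg.eigvalsh(Sigma).min()        # smallest eigenvalue of Sigma
beta = omega * (1 - gamma)
print(f"beta = {beta:.4f}; compression without error-feedback needs "
      f"delta < {1.0 / (1.0 - beta ** 2):.4f}")
```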

Appendix C Warm-Up: Analysis of EF-TD with a Projection Step

Before attempting to prove Theorem 1, it is instructive to analyze the behavior of EF-TD with a projection step. The benefit of this projection step is that it makes it relatively easier to argue that the iterates generated by EF-TD remain uniformly bounded; nonetheless, as we shall soon see, the analysis remains quite non-trivial even in light of this simplification. Let us now jot down the governing equations of the dynamics we plan to study.

ht\displaystyle h_{t} =𝒬δ(et1+gt(θt)),\displaystyle=\mathcal{Q}_{\delta}\left(e_{t-1}+{g}_{t}(\theta_{t})\right), (41)
θt+1\displaystyle\theta_{t+1} =Π2,(θt+αht),\displaystyle=\Pi_{2,\mathcal{B}}\left(\theta_{t}+\alpha h_{t}\right),
et\displaystyle e_{t} =et1+gt(θt)ht,\displaystyle=e_{t-1}+{g}_{t}(\theta_{t})-h_{t},

where Π2,()\Pi_{2,\mathcal{B}}(\cdot) denotes the standard Euclidean projection on to a convex compact subset K\mathcal{B}\subset\mathbb{R}^{K} that is assumed to contain the fixed point θ\theta^{*}. We also note here that a projection step of the form in equation 41 is common in the literature on stochastic approximation [57] and RL [18, 10].
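For concreteness, a minimal sketch of the projected dynamics in equation 41 is given below, with $\mathcal{B}$ taken to be a Euclidean ball of radius $G$ centered at the origin; the value $G=10$ is an illustrative assumption, taken to be large enough to contain $\theta^{*}$, and the sketch reuses the helpers from our experiments-section simulation.

```python
def project_ball(x, G):
    """Euclidean projection onto the ball {theta : ||theta||_2 <= G} centered at the origin."""
    nrm = np.linalg.norm(x)
    return x if nrm <= G else (G / nrm) * x

def projected_ef_td(T=20000, alpha=0.05, G=10.0, compress=scaled_sign):
    theta, e, s = np.zeros(K), np.zeros(K), int(rng.integers(n))
    for _ in range(T):
        s_next = int(rng.choice(n, p=P[s]))
        g = td_direction(theta, s, s_next)
        h = compress(e + g)                           # h_t
        theta = project_ball(theta + alpha * h, G)    # Pi_{2,B}(theta_t + alpha * h_t)
        e = e + g - h                                 # e_t
        s = s_next
    return np.linalg.norm(theta - theta_star) ** 2

print("projected EF-TD, final squared error:", projected_ef_td())
```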

Our main result concerning the performance of the projected version of EF-TD is the following.

Theorem 5.

Suppose Assumption 1 holds. There exists a universal constant c1c\geq 1 such that the iterates generated by the projected version of EF-TD (i.e., equation 41) with step-size α(1γ)/c\alpha\leq(1-\gamma)/c satisfy the following after TτT\geq\tau iterations:

𝔼[θTθ22]C1(1αω(1γ))Tτ+O(ατδ2G2ω(1γ)),\mathbb{E}\left[\|\theta_{T}-\theta^{*}\|^{2}_{2}\right]\leq C_{1}\left(1-\alpha\omega(1-\gamma)\right)^{T-\tau}+O\left(\frac{\alpha\tau\delta^{2}G^{2}}{\omega(1-\gamma)}\right), (42)

where C1=O(α2δ2G2+G2)C_{1}=O(\alpha^{2}\delta^{2}G^{2}+G^{2}), and GG is the radius of the convex compact set \mathcal{B}.

Main Takeaway. We note that the nature of the above guarantee is similar to that of Theorem 1. That said, while the noise term in Theorem 1 scales as $O(\tau+\delta)$, it scales as $O(\tau\delta^{2})$ in Theorem 5. In words, with the somewhat cruder bounds we obtain via projection, we end up with a looser dependence on the distortion parameter $\delta$. Moreover, the mixing time $\tau$ and the distortion parameter $\delta$ show up in multiplicative form in Theorem 5. In Appendix D, we will provide a finer analysis (without the need for projection) that yields the tighter $O(\tau+\delta)$ bound.

We now proceed with the proof of Theorem 5. Let us start by defining the projection error ep,te_{p,t} at time-step tt as follows: ep,t=θt(θt1+αht1)e_{p,t}=\theta_{t}-(\theta_{t-1}+\alpha h_{t-1}). We also define an intermediate sequence {θ¯t}\{\bar{\theta}_{t}\} as follows: θ¯tθt1+αht1\bar{\theta}_{t}\triangleq\theta_{t-1}+\alpha h_{t-1}. Thus, θtθ¯t=ep,t.\theta_{t}-\bar{\theta}_{t}=e_{p,t}. Next, inspired by the perturbed iterate framework in [54], we define a modified perturbed iterate as follows:

θ~tθ¯t+αet1.\tilde{\theta}_{t}\triangleq\bar{\theta}_{t}+\alpha e_{t-1}. (43)

Based on the above definitions, observe that

θ~t+1\displaystyle\tilde{\theta}_{t+1} =θ¯t+1+αet\displaystyle=\bar{\theta}_{t+1}+\alpha e_{t} (44)
=θt+αht+α(et1+gt(θt)ht)\displaystyle=\theta_{t}+\alpha h_{t}+\alpha\left(e_{t-1}+g_{t}(\theta_{t})-h_{t}\right)
=θt+αgt(θt)+αet1\displaystyle=\theta_{t}+\alpha g_{t}(\theta_{t})+\alpha e_{t-1}
=θ¯t+αet1+αgt(θt)+ep,t\displaystyle=\bar{\theta}_{t}+\alpha e_{t-1}+\alpha g_{t}(\theta_{t})+e_{p,t}
=θ~t+αgt(θt)+ep,t.\displaystyle=\tilde{\theta}_{t}+\alpha g_{t}(\theta_{t})+e_{p,t}.

Subtracting θ\theta^{*} from each side of equation 44 and then squaring both sides, we obtain:

θ~t+1θ2=θ~tθ2+2αθ~tθ,gt(θt)𝒞1+α2gt(θt)2𝒞2+2θ~tθ+αgt(θt),ep,t𝒞3+ep,t2𝒞4.\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}=\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+\underbrace{2\alpha\langle\tilde{\theta}_{t}-\theta^{*},{g}_{t}(\theta_{t})\rangle}_{\mathcal{C}_{1}}+\underbrace{\alpha^{2}\|{g}_{t}(\theta_{t})\|^{2}}_{\mathcal{C}_{2}}+\underbrace{2\langle\tilde{\theta}_{t}-\theta^{*}+\alpha g_{t}(\theta_{t}),e_{p,t}\rangle}_{\mathcal{C}_{3}}+\underbrace{\|e_{p,t}\|^{2}}_{\mathcal{C}_{4}}. (45)

In what follows, we outline the key steps of our proof that involve bounding each of the terms 𝒞1𝒞4\mathcal{C}_{1}-\mathcal{C}_{4}.

\bullet Step 1. The dynamics of the model parameter θt\theta_{t}, the Markov variable XtX_{t}, the memory variable ete_{t}, and the projection error ep,te_{p,t} are all closely coupled, leading to a dynamical system far more complex than the standard TD(0) system. To start unravelling this complex dynamical system, our key strategy is to disentangle the memory variable and the projection error from the perturbed iterate and the Markov data tuple. To do so, we derive uniform bounds on et,hte_{t},h_{t}, and ep,te_{p,t} by exploiting the contraction property in equation 4. This is achieved in Lemma 12.

\bullet Step 2. Using the uniform bounds from the previous step in tandem with properties of the Euclidean projection operator, we control terms 𝒞2𝒞4\mathcal{C}_{2}-\mathcal{C}_{4} in Lemma 13.

\bullet Step 3. Bounding 𝒞1\mathcal{C}_{1} takes the most work. For this step, we exploit the idea of conditioning on the system state sufficiently into the past, and using the geometric mixing property of the Markov chain. As we shall see, conditioning into the past creates the need to control θtθtτ,tτ\|\theta_{t}-\theta_{t-\tau}\|,\forall t\geq\tau, where τ\tau is the mixing time. This is done in Lemma 14. Using the result from Lemma 14, we bound 𝒞1\mathcal{C}_{1} in Lemma 15.

At the end of the three steps above, what we wish to establish is a recursion of the following form tτ\forall t\geq\tau:

𝔼[θ~t+1θ2](1αω(1γ))𝔼[θ~tθ2]+O(α2τδ2G2).\mathbb{E}\left[\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}\right]\leq\left(1-\alpha\omega(1-\gamma)\right)\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]+O(\alpha^{2}\tau\delta^{2}G^{2}).

To proceed with Step 1, we recall the following result from [58].

Lemma 11.

Let \mathcal{B} be a nonempty, closed, convex set in K\mathbb{R}^{K}. Then, for any xKx\in\mathbb{R}^{K}, we have:

  1. (a)

    Π2,(x)x,xyΠ2,(x)x2,y\langle\Pi_{2,\mathcal{B}}(x)-x,x-y\rangle\leq-\|\Pi_{2,\mathcal{B}}(x)-x\|^{2},\forall y\in\mathcal{B}.

  2. (b)

    \|\Pi_{2,\mathcal{B}}(x)-y\|^{2}\leq\|x-y\|^{2}-\|\Pi_{2,\mathcal{B}}(x)-x\|^{2},\forall y\in\mathcal{B}.

To lighten notation, let us assume without loss of generality that all rewards are uniformly bounded by 11. Our results can be trivially extended to the case where the uniform bound is some finite number RmaxR_{max}. To make the calculations cleaner, we also assume that the projection radius GG is greater than 11. We have the following key result that provides uniform bounds on the memory variable and the projection error.

Lemma 12.

(Uniform bounds on memory variable and projection error) For the dynamics in equation 41, the following hold t0\forall t\geq 0:

  1. (a)

    et6δG.\|e_{t}\|\leq 6\delta G.

  2. (b)

    ht15δG.\|h_{t}\|\leq 15\delta G.

  3. (c)

    ep,t15αδG.\|e_{p,t}\|\leq 15\alpha\delta G.

Proof.

We start by noting that for all t0t\geq 0,

gt(θt)=A(Xt)θtb(Xt)A(Xt)θt+b(Xt)2G+13G,\|g_{t}(\theta_{t})\|=\|A(X_{t})\theta_{t}-b(X_{t})\|\leq\|A(X_{t})\|\|\theta_{t}\|+\|b(X_{t})\|\leq 2G+1\leq 3G, (46)

where we used (i) the feature normalization property; (ii) the fact that the rewards are uniformly bounded by $1$; and (iii) the fact that, due to projection, $\|\theta_{t}\|\leq G,\forall t\geq 0$. Next, observe that

et2\displaystyle\|e_{t}\|^{2} =et1+gt(θt)ht2\displaystyle=\|e_{t-1}+g_{t}(\theta_{t})-h_{t}\|^{2} (47)
=et1+gt(θt)𝒬δ(et1+gt(θt))2\displaystyle=\|e_{t-1}+g_{t}(\theta_{t})-\mathcal{Q}_{\delta}\left(e_{t-1}+g_{t}(\theta_{t})\right)\|^{2}
(a)(11δ)et1+gt(θt)2\displaystyle\overset{(a)}{\leq}\left(1-\frac{1}{\delta}\right)\|e_{t-1}+g_{t}(\theta_{t})\|^{2}
(b)(11δ)(1+1η)et12+(11δ)(1+η)gt(θt)2,\displaystyle\overset{(b)}{\leq}\left(1-\frac{1}{\delta}\right)\left(1+\frac{1}{\eta}\right)\|e_{t-1}\|^{2}+\left(1-\frac{1}{\delta}\right)\left(1+\eta\right)\|g_{t}(\theta_{t})\|^{2},

for some η>0\eta>0 to be chosen by us shortly. Here, for (a), we used the contraction property of 𝒬δ()\mathcal{Q}_{\delta}(\cdot) in equation 4; for (b), we used the relaxed triangle inequality in equation 18. To ensure that et2\|e_{t}\|^{2} contracts over time, we want

(11δ)(1+1η)<1η>(δ1).\left(1-\frac{1}{\delta}\right)\left(1+\frac{1}{\eta}\right)<1\implies\eta>(\delta-1).

Accordingly, suppose η=δ1.\eta=\delta-1. Simple calculations then yield

(11δ)(1+1η)=(112δ);(11δ)(1+η)<2δ.\left(1-\frac{1}{\delta}\right)\left(1+\frac{1}{\eta}\right)=\left(1-\frac{1}{2\delta}\right);\hskip 14.22636pt\left(1-\frac{1}{\delta}\right)\left(1+\eta\right)<2\delta.

Plugging these bounds back in equation 47 and using equation 46, we obtain

et2\displaystyle\|e_{t}\|^{2} (112δ)et12+2δgt(θt)2\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+2\delta\|{g}_{t}(\theta_{t})\|^{2}
(112δ)et12+18δG2.\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+18\delta G^{2}.

Unrolling the dynamics of the memory variable thus yields:

et2\displaystyle\|e_{t}\|^{2} (112δ)t+1e12+18δG2k=0t(112δ)k\displaystyle\leq\left(1-\frac{1}{2\delta}\right)^{t+1}\|e_{-1}\|^{2}+18\delta G^{2}\sum_{k=0}^{t}\left(1-\frac{1}{2\delta}\right)^{k} (48)
18δG2k=0(112δ)k\displaystyle\leq 18\delta G^{2}\sum_{k=0}^{\infty}\left(1-\frac{1}{2\delta}\right)^{k}
=36δ2G2,\displaystyle=36\delta^{2}G^{2},

where we used the fact that e1=0e_{-1}=0. Thus, et6δG\|e_{t}\|\leq 6\delta G, which establishes part (a). For part (b), we notice that ht=et1et+gt(θt).h_{t}=e_{t-1}-e_{t}+g_{t}(\theta_{t}). This immediately yields:

htet+et1+gt(θt)12δG+3G15δG,\|h_{t}\|\leq\|e_{t}\|+\|e_{t-1}\|+\|g_{t}(\theta_{t})\|\leq 12\delta G+3G\leq 15\delta G,

where we used the fact that δ1\delta\geq 1, the bound from part (a), and the uniform bound on gt(θt)g_{t}(\theta_{t}) established earlier.

Next, for part (c), we use part (b) of Lemma 11 to observe that

ep,t2=Π2,(θ¯t)θ¯t2θ¯tθ2,θ.\|e_{p,t}\|^{2}=\|\Pi_{2,\mathcal{B}}(\bar{\theta}_{t})-\bar{\theta}_{t}\|^{2}\leq\|\bar{\theta}_{t}-\theta\|^{2},\forall\theta\in\mathcal{B}.

Since the above bound holds for all θ\theta\in\mathcal{B}, and θt1\theta_{t-1}\in\mathcal{B}, we have

\|e_{p,t}\|^{2}\leq\|\bar{\theta}_{t}-\theta_{t-1}\|^{2}=\alpha^{2}\|h_{t-1}\|^{2}\leq 225\alpha^{2}\delta^{2}G^{2},

where we used the fact that $\bar{\theta}_{t}=\theta_{t-1}+\alpha h_{t-1}$ by definition, and also the bound on $\|h_{t-1}\|$ from part (b). This concludes the proof. ∎

From the proof of the above lemma, bounds on terms 𝒞2\mathcal{C}_{2} and 𝒞4\mathcal{C}_{4} in equation 45 follow immediately. In our next result, we bound the term 𝒞3\mathcal{C}_{3}.

Lemma 13.

For the dynamics in equation 41, the following holds for all t0t\geq 0:

2θ~tθ+αgt(θt),ep,t45α2δ2G2.2\langle\tilde{\theta}_{t}-\theta^{*}+\alpha g_{t}(\theta_{t}),e_{p,t}\rangle\leq 45\alpha^{2}\delta^{2}G^{2}.
Proof.

We start by decomposing the term we wish to bound into three parts:

2θ~tθ+αgt(θt),ep,t\displaystyle 2\langle\tilde{\theta}_{t}-\theta^{*}+\alpha g_{t}(\theta_{t}),e_{p,t}\rangle =2θ¯tθ+αet1+αgt(θt),ep,t\displaystyle=2\langle\bar{\theta}_{t}-\theta^{*}+\alpha e_{t-1}+\alpha g_{t}(\theta_{t}),e_{p,t}\rangle (49)
=2θ¯tθ,ep,t𝒞31+2αet1,ep,t𝒞32+2αgt(θt),ep,t𝒞33.\displaystyle=\underbrace{2\langle\bar{\theta}_{t}-\theta^{*},e_{p,t}\rangle}_{\mathcal{C}_{31}}+\underbrace{2\alpha\langle e_{t-1},e_{p,t}\rangle}_{\mathcal{C}_{32}}+\underbrace{2\alpha\langle g_{t}(\theta_{t}),e_{p,t}\rangle}_{\mathcal{C}_{33}}.

We now bound each of the three terms above separately. For 𝒞31\mathcal{C}_{31}, we have

2θ¯tθ,ep,t=2θ¯tθ,θtθ¯t2θtθ¯t2=2ep,t2,2\langle\bar{\theta}_{t}-\theta^{*},e_{p,t}\rangle=2\langle\bar{\theta}_{t}-\theta^{*},\theta_{t}-\bar{\theta}_{t}\rangle\leq-2\|\theta_{t}-\bar{\theta}_{t}\|^{2}=-2\|e_{p,t}\|^{2}, (50)

where we used part (a) of the projection lemma, namely Lemma 11, with x=θ¯tx=\bar{\theta}_{t} and y=θy=\theta^{*}; here, note that we used the fact that θ\theta^{*}\in\mathcal{B}. Next, for 𝒞32\mathcal{C}_{32}, observe that

2αet1,ep,t\displaystyle 2\alpha\langle e_{t-1},e_{p,t}\rangle α2et12+ep,t2\displaystyle\leq\alpha^{2}\|e_{t-1}\|^{2}+\|e_{p,t}\|^{2} (51)
36α2δ2G2+ep,t2,\displaystyle\leq 36\alpha^{2}\delta^{2}G^{2}+\|e_{p,t}\|^{2},

where we used the bound on et1\|e_{t-1}\| from part (a) of Lemma 12. Notice that we have kept ep,t2\|e_{p,t}\|^{2} as is in the above bound since we will cancel off its effect with one of the negative terms from the upper bound on 𝒞31.\mathcal{C}_{31}. Finally, we bound the term 𝒞33\mathcal{C}_{33} as follows:

2αgt(θt),ep,t\displaystyle 2\alpha\langle g_{t}(\theta_{t}),e_{p,t}\rangle α2gt(θt)2+ep,t2\displaystyle\leq\alpha^{2}\|g_{t}(\theta_{t})\|^{2}+\|e_{p,t}\|^{2} (52)
9α2G2+ep,t2,\displaystyle\leq 9\alpha^{2}G^{2}+\|e_{p,t}\|^{2},

where we used the uniform bound on gt(θt)g_{t}(\theta_{t}) from equation 46. Combining the bounds in equations 50, 51, and 52, and using the fact that δ1\delta\geq 1 yields the desired result. ∎

Notice that up until now, we have not made any use of the geometric mixing property of the underlying Markov chain. We will call upon this property while bounding 𝒞1\mathcal{C}_{1}. But first, we need the following intermediate result.

Lemma 14.

For the dynamics in equation 41, the following holds for all tτt\geq\tau:

θtθtτ60ατδG.\|\theta_{t}-\theta_{t-\tau}\|\leq 60\alpha\tau\delta G. (53)
Proof.

Based on equation 44, observe that

θ~t+1θ~t\displaystyle\|\tilde{\theta}_{t+1}-\tilde{\theta}_{t}\| αgt(θt)+ep,t\displaystyle\leq\alpha\|g_{t}(\theta_{t})\|+\|e_{p,t}\| (54)
3αG+15αδG\displaystyle\leq 3\alpha G+15\alpha\delta G
18αδG.\displaystyle\leq 18\alpha\delta G.

We also note that

θ~tθ~tτ=k=tτt1(θ~k+1θ~k).\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}=\sum_{k=t-\tau}^{t-1}\left(\tilde{\theta}_{k+1}-\tilde{\theta}_{k}\right).

Based on equation 54, we then immediately have

θ~tθ~tτ\displaystyle\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\| k=tτt1θ~k+1θ~k\displaystyle\leq\sum_{k=t-\tau}^{t-1}\|\tilde{\theta}_{k+1}-\tilde{\theta}_{k}\| (55)
k=tτt1(18αδG)\displaystyle\leq\sum_{k=t-\tau}^{t-1}(18\alpha\delta G)
18ατδG.\displaystyle\leq 18\alpha\tau\delta G.

Our goal is to now relate the above bound on θ~tθ~tτ\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\| to one on θtθtτ\|{\theta}_{t}-{\theta}_{t-\tau}\|. To that end, observe that

θ~t\displaystyle\tilde{\theta}_{t} =θt+αet1ep,t,\displaystyle=\theta_{t}+\alpha e_{t-1}-e_{p,t},
θ~tτ\displaystyle\tilde{\theta}_{t-\tau} =θtτ+αetτ1ep,tτ.\displaystyle=\theta_{t-\tau}+\alpha e_{t-\tau-1}-e_{p,t-\tau}.

This gives us exactly what we need:

θtθtτ\displaystyle\|{\theta}_{t}-{\theta}_{t-\tau}\| θ~tθ~tτ+αet1etτ1+ep,tτep,t\displaystyle\leq\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|+\alpha\|e_{t-1}-e_{t-\tau-1}\|+\|e_{p,t-\tau}-e_{p,t}\| (56)
θ~tθ~tτ+αet1+αetτ1+ep,tτ+ep,t\displaystyle\leq\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|+\alpha\|e_{t-1}\|+\alpha\|e_{t-\tau-1}\|+\|e_{p,t-\tau}\|+\|e_{p,t}\|
18ατδG+12αδG+30αδG60ατδG,\displaystyle\leq 18\alpha\tau\delta G+12\alpha\delta G+30\alpha\delta G\leq 60\alpha\tau\delta G,

where we used equation 55, parts (a) and (c) of Lemma 12, and assumed that the mixing time τ1\tau\geq 1 to state the bounds more cleanly. This completes the proof. ∎

We now turn towards establishing an upper bound on the term 𝒞1\mathcal{C}_{1} in equation 45.

Lemma 15.

Suppose Assumption 1 holds. For the dynamics in equation 41, the following then holds tτ:\forall t\geq\tau:

\mathbb{E}\left[2\alpha\langle\tilde{\theta}_{t}-\theta^{*},{g}_{t}(\theta_{t})\rangle\right]\leq-2\alpha(1-\gamma)\mathbb{E}\left[\|\hat{V}_{\theta_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+1454\alpha^{2}\delta\tau G^{2}.
Proof.

Let us first focus on bounding Tθtθ,gt(θt)g¯(θt)T\triangleq\langle\theta_{t}-\theta^{*},g_{t}(\theta_{t})-\bar{g}(\theta_{t})\rangle. Trivially, observe that

T=θtθtτ,gt(θt)g¯(θt)T1+θtτθ,gt(θt)g¯(θt)T2.T=\underbrace{\langle\theta_{t}-\theta_{t-\tau},g_{t}(\theta_{t})-\bar{g}(\theta_{t})\rangle}_{T_{1}}+\underbrace{\langle\theta_{t-\tau}-\theta^{*},g_{t}(\theta_{t})-\bar{g}(\theta_{t})\rangle}_{T_{2}}.

To bound T1T_{1}, we recall from Lemma 7 that for all θ1,θ2K\theta_{1},\theta_{2}\in\mathbb{R}^{K}, it holds that

g¯(θ1)g¯(θ2)θ1θ2.\|\bar{g}(\theta_{1})-\bar{g}(\theta_{2})\|\leq\|\theta_{1}-\theta_{2}\|.

Since g¯(θ)=0\bar{g}(\theta^{*})=0, the above inequality immediately implies that g¯(θ)θ+θ,θK\|\bar{g}(\theta)\|\leq\|\theta\|+\|\theta^{*}\|,\forall\theta\in\mathbb{R}^{K}. In particular, for any θ\theta\in\mathcal{B}, we then have that g¯(θ)2G\|\bar{g}(\theta)\|\leq 2G (since θ\theta^{*}\in\mathcal{B}). Using the bound on θtθtτ\|\theta_{t}-\theta_{t-\tau}\| from Lemma 14, and the uniform bound on the noisy TD(0) update direction from equation 46, we then obtain

T1\displaystyle T_{1} (θtθtτ)(gt(θt)g¯(θt))\displaystyle\leq\left(\|\theta_{t}-\theta_{t-\tau}\|\right)\left(\|g_{t}(\theta_{t})-\bar{g}(\theta_{t})\|\right) (57)
(θtθtτ)(gt(θt)+g¯(θt))\displaystyle\leq\left(\|\theta_{t}-\theta_{t-\tau}\|\right)\left(\|g_{t}(\theta_{t})\|+\|\bar{g}(\theta_{t})\|\right)
60ατδG(3G+2G)=300ατδG2.\displaystyle\leq 60\alpha\tau\delta G\left(3G+2G\right)=300\alpha\tau\delta G^{2}.

To bound T2T_{2}, we further split it into two parts as follows:

T2=θtτθ,gt(θtτ)g¯(θtτ)T21+θtτθ,gt(θt)gt(θtτ)+g¯(θtτ)g¯(θt)T22.T_{2}=\underbrace{\langle\theta_{t-\tau}-\theta^{*},g_{t}(\theta_{t-\tau})-\bar{g}(\theta_{t-\tau})\rangle}_{T_{21}}+\underbrace{\langle\theta_{t-\tau}-\theta^{*},g_{t}(\theta_{t})-{g}_{t}(\theta_{t-\tau})+\bar{g}(\theta_{t-\tau})-\bar{g}(\theta_{t})\rangle}_{T_{22}}.

To bound T22T_{22}, we will exploit the Lipschitz property of the TD(0) update directions in tandem with Lemma 14. Specifically, observe that:

T22\displaystyle T_{22} θtτθ(gt(θt)gt(θtτ)+g¯(θtτ)g¯(θt))\displaystyle\leq\|\theta_{t-\tau}-\theta^{*}\|\left(\|g_{t}(\theta_{t})-{g}_{t}(\theta_{t-\tau})\|+\|\bar{g}(\theta_{t-\tau})-\bar{g}(\theta_{t})\|\right) (58)
(a)2G(gt(θt)gt(θtτ)+g¯(θtτ)g¯(θt))\displaystyle\overset{(a)}{\leq}2G\left(\|g_{t}(\theta_{t})-{g}_{t}(\theta_{t-\tau})\|+\|\bar{g}(\theta_{t-\tau})-\bar{g}(\theta_{t})\|\right)
(b)6Gθtθtτ\displaystyle\overset{(b)}{\leq}6G\|\theta_{t}-\theta_{t-\tau}\|
(c)360ατδG2.\displaystyle\overset{(c)}{\leq}360\alpha\tau\delta G^{2}.

In the above steps, (a) follows from projection; (b) follows from Lemmas 7 and 8; and (c) follows from Lemma 14. It remains to bound T21T_{21}. This is precisely the only place in the entire proof that we will use the geometric mixing property of the Markov chain in Definition 1. We proceed as follows.

𝔼[T21]\displaystyle\mathbb{E}\left[T_{21}\right] =𝔼[θtτθ,gt(θtτ)g¯(θtτ)]\displaystyle=\mathbb{E}\left[\langle\theta_{t-\tau}-\theta^{*},g_{t}(\theta_{t-\tau})-\bar{g}(\theta_{t-\tau})\rangle\right] (59)
=𝔼[𝔼[θtτθ,gt(θtτ)g¯(θtτ)|θtτ,Xtτ]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\langle\theta_{t-\tau}-\theta^{*},g_{t}(\theta_{t-\tau})-\bar{g}(\theta_{t-\tau})\rangle|\theta_{t-\tau},X_{t-\tau}\right]\right]
=𝔼[θtτθ,𝔼[gt(θtτ)g¯(θtτ)|θtτ,Xtτ]]\displaystyle=\mathbb{E}\left[\langle\theta_{t-\tau}-\theta^{*},\mathbb{E}\left[g_{t}(\theta_{t-\tau})-\bar{g}(\theta_{t-\tau})|\theta_{t-\tau},X_{t-\tau}\right]\rangle\right]
𝔼[θtτθ𝔼[gt(θtτ)g¯(θtτ)|θtτ,Xtτ]]\displaystyle\leq\mathbb{E}\left[\|\theta_{t-\tau}-\theta^{*}\|\|\mathbb{E}\left[g_{t}(\theta_{t-\tau})-\bar{g}(\theta_{t-\tau})|\theta_{t-\tau},X_{t-\tau}\right]\|\right]
(a)2αG(𝔼[θtτθ])\displaystyle\overset{(a)}{\leq}2\alpha G\left(\mathbb{E}\left[\|\theta_{t-\tau}-\theta^{*}\|\right]\right)
4αG2,\displaystyle\leq 4\alpha G^{2},

where (a) follows from the definition of the mixing time τ\tau. Combining the above bound with those in equations 57 and 58, we obtain

𝔼[T]664ατδG2,\mathbb{E}\left[T\right]\leq 664\alpha\tau\delta G^{2}, (60)

where we used τ1\tau\geq 1 and δ1\delta\geq 1.

We can now go back to bounding 𝒞1\mathcal{C}_{1} as follows:

𝒞1\displaystyle\mathcal{C}_{1} =2αθtθ+αet1ep,t,gt(θt)\displaystyle=2\alpha\langle\theta_{t}-\theta^{*}+\alpha e_{t-1}-e_{p,t},g_{t}(\theta_{t})\rangle (61)
=2αθtθ,gt(θt)+2α2et1,gt(θt)2αep,t,gt(θt)\displaystyle=2\alpha\langle\theta_{t}-\theta^{*},g_{t}(\theta_{t})\rangle+2\alpha^{2}\langle e_{t-1},g_{t}(\theta_{t})\rangle-2\alpha\langle e_{p,t},g_{t}(\theta_{t})\rangle
2αθtθ,gt(θt)+2α2et1gt(θt)+2αep,tgt(θt)\displaystyle\leq 2\alpha\langle\theta_{t}-\theta^{*},g_{t}(\theta_{t})\rangle+2\alpha^{2}\|e_{t-1}\|\|g_{t}(\theta_{t})\|+2\alpha\|e_{p,t}\|\|g_{t}(\theta_{t})\|
2αθtθ,gt(θt)+126α2δG2,\displaystyle\leq 2\alpha\langle\theta_{t}-\theta^{*},g_{t}(\theta_{t})\rangle+126\alpha^{2}\delta G^{2},

where we used equation 46, and parts (a) and (c) of Lemma 12. We continue as follows:

𝒞12αθtθ,g¯(θt)+2αT+126α2δG2.\mathcal{C}_{1}\leq 2\alpha\langle\theta_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle+2\alpha T+126\alpha^{2}\delta G^{2}.

Using the bound we derived on TT in equation 60, we finally obtain

𝔼[𝒞1]2α(1γ)𝔼[V^θtV^θD2]+1454α2δτG2,\mathbb{E}\left[\mathcal{C}_{1}\right]\leq-2\alpha(1-\gamma)\mathbb{E}\left[\|\hat{V}_{\theta_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+1454\alpha^{2}\delta\tau G^{2},

where in the last step, we used Lemma 4. ∎

We can now complete the proof of Theorem 5.

Proof.

(Proof of Theorem 5) We combine the bounds derived previously on the terms 𝒞1\mathcal{C}_{1}-𝒞4\mathcal{C}_{4} in Lemmas 12, 13, and 15 to obtain that tτ\forall t\geq\tau,

𝔼[θ~t+1θ22]\displaystyle\mathbb{E}\left[\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}_{2}\right] 𝔼[θ~tθ22]2α(1γ)𝔼[V^θtV^θD2]+1454α2δτG2\displaystyle\leq\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}_{2}\right]-2\alpha(1-\gamma)\mathbb{E}\left[\|\hat{V}_{\theta_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+1454\alpha^{2}\delta\tau G^{2} (62)
+9α2G2+45α2δ2G2+225α2δ2G2\displaystyle\hskip 5.69054pt+9\alpha^{2}G^{2}+45\alpha^{2}\delta^{2}G^{2}+225\alpha^{2}\delta^{2}G^{2}
𝔼[θ~tθ22]2α(1γ)𝔼[V^θtV^θD2]+1733α2δ2τG2.\displaystyle\leq\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}_{2}\right]-2\alpha(1-\gamma)\mathbb{E}\left[\|\hat{V}_{\theta_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+1733\alpha^{2}\delta^{2}\tau G^{2}.

To proceed, we need to relate V^θtV^θD2\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D} to V^θ~tV^θD2\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}. We do so by using the fact that for any x,ynx,y\in\mathbb{R}^{n}, it holds that x+yD22xD2+2yD2.\|x+y\|^{2}_{D}\leq 2\|x\|^{2}_{D}+2\|y\|^{2}_{D}. This yields:

V^θ~tV^θD2\displaystyle\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D} 2V^θtV^θD2+2V^θ~tV^θtD2\displaystyle\leq 2\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+2\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta_{t}}\|^{2}_{D} (63)
(a)2V^θtV^θD2+2θ~tθt2,\displaystyle\overset{(a)}{\leq}2\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+2\|{\tilde{\theta}_{t}}-{\theta_{t}}\|^{2},

where for (a), we used Lemma 3. We thus have

2α(1γ)V^θtV^θD2\displaystyle-2\alpha(1-\gamma)\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D} α(1γ)V^θ~tV^θD2+2α(1γ)θ~tθt2\displaystyle\leq-\alpha(1-\gamma)\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+2\alpha(1-\gamma)\|\tilde{\theta}_{t}-\theta_{t}\|^{2} (64)
α(1γ)V^θ~tV^θD2+2α(1γ)αet1ep,t2\displaystyle\leq-\alpha(1-\gamma)\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+2\alpha(1-\gamma)\|\alpha e_{t-1}-e_{p,t}\|^{2}
α(1γ)V^θ~tV^θD2+4α(1γ)(α2et12+ep,t2)\displaystyle\leq-\alpha(1-\gamma)\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+4\alpha(1-\gamma)\left(\alpha^{2}\|e_{t-1}\|^{2}+\|e_{p,t}\|^{2}\right)
α(1γ)V^θ~tV^θD2+1044α3(1γ)δ2G2,\displaystyle\leq-\alpha(1-\gamma)\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}+1044\alpha^{3}(1-\gamma)\delta^{2}G^{2},

where we used Lemma 12. Plugging the above bound back in equation 62 yields:

𝔼[θ~t+1θ22]\displaystyle\mathbb{E}\left[\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}_{2}\right] 𝔼[θ~tθ22]α(1γ)𝔼[V^θ~tV^θD2]+2777α2δ2τG2\displaystyle\leq\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}_{2}\right]-\alpha(1-\gamma)\mathbb{E}\left[\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+2777\alpha^{2}\delta^{2}\tau G^{2} (65)
(1αω(1γ))𝔼[θ~tθ22]+2777α2δ2τG2,tτ,\displaystyle\leq\left(1-\alpha\omega(1-\gamma)\right)\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}_{2}\right]+2777\alpha^{2}\delta^{2}\tau G^{2},\forall t\geq\tau,

where we used Lemma 3 in the last step. Unrolling the above recursion starting from t=τt=\tau, we obtain

𝔼[θ~Tθ22]\displaystyle\mathbb{E}\left[\|\tilde{\theta}_{T}-\theta^{*}\|^{2}_{2}\right] (1αω(1γ))Tτ𝔼[θ~τθ22]+2777α2δ2τG2k=0(1αω(1γ))k\displaystyle\leq\left(1-\alpha\omega(1-\gamma)\right)^{T-\tau}\mathbb{E}\left[\|\tilde{\theta}_{\tau}-\theta^{*}\|^{2}_{2}\right]+2777\alpha^{2}\delta^{2}\tau G^{2}\sum_{k=0}^{\infty}\left(1-\alpha\omega(1-\gamma)\right)^{k} (66)
=(1αω(1γ))Tτ𝔼[θ~τθ22]+2777ατδ2G2ω(1γ).\displaystyle=\left(1-\alpha\omega(1-\gamma)\right)^{T-\tau}\mathbb{E}\left[\|\tilde{\theta}_{\tau}-\theta^{*}\|^{2}_{2}\right]+2777\frac{\alpha\tau\delta^{2}G^{2}}{\omega(1-\gamma)}.

We now make use of the following equation twice.

θ~t=θt+αet1ep,t.\tilde{\theta}_{t}=\theta_{t}+\alpha e_{t-1}-e_{p,t}.

First, setting t=τt=\tau in the above equation, subtracting θ\theta^{*} from both sides, and then simplifying, we observe that

θ~τθθτθ+αeτ1+ep,τ=O(αδG+G),\|\tilde{\theta}_{\tau}-\theta^{*}\|\leq\|\theta_{\tau}-\theta^{*}\|+\alpha\|e_{\tau-1}\|+\|e_{p,\tau}\|=O\left(\alpha\delta G+G\right),

where we invoked Lemma 12. Thus,

𝔼[θ~τθ22]=O(α2δ2G2+G2).\mathbb{E}\left[\|\tilde{\theta}_{\tau}-\theta^{*}\|^{2}_{2}\right]=O\left(\alpha^{2}\delta^{2}G^{2}+G^{2}\right).

Using similar arguments as above, one can also show that

θTθ23θ~Tθ2+3α2eT12+3ep,T23θ~Tθ2+O(α2δ2G2).\|\theta_{T}-\theta^{*}\|^{2}\leq 3\|\tilde{\theta}_{T}-\theta^{*}\|^{2}+3\alpha^{2}\|e_{T-1}\|^{2}+3\|e_{p,T}\|^{2}\leq 3\|\tilde{\theta}_{T}-\theta^{*}\|^{2}+O(\alpha^{2}\delta^{2}G^{2}).

Plugging the two bounds we derived above in equation 66 completes the proof. ∎

Appendix D Analysis of EF-TD without Projection: Proof of Theorem 1 and Theorem 2

In this section, we will prove Theorem 1. In particular, via a finer analysis relative to that in Appendix C, we will (i) show that the iterates generated by EF-TD remain bounded without the need for an explicit projection step to make this happen; and (ii) obtain a tighter bound w.r.t. the distortion parameter δ\delta. At this stage, we remind the reader about the dynamics we are interested in analyzing:

ht\displaystyle h_{t} =𝒬δ(et1+gt(θt)),\displaystyle=\mathcal{Q}_{\delta}\left(e_{t-1}+{g}_{t}(\theta_{t})\right), (67)
θt+1\displaystyle\theta_{t+1} =θt+αht,\displaystyle=\theta_{t}+\alpha h_{t},
et\displaystyle e_{t} =et1+gt(θt)ht.\displaystyle=e_{t-1}+{g}_{t}(\theta_{t})-h_{t}.
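
For concreteness, the following is a minimal single-agent sketch of these dynamics in Python. It is illustrative only: the compressor is instantiated as top-k sparsification (for which the distortion parameter is δ=K/k\delta=K/k), and the toy Markov reward process, feature map, and step-size below are placeholders rather than part of the algorithm's specification.

import numpy as np

def topk(x, k):
    # Illustrative delta-contractive compressor: top-k sparsification (delta = K/k).
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def ef_td(P, R, phi, alpha=0.05, gamma=0.9, k=1, T=20000, seed=0):
    # EF-TD dynamics of equation 67 on a toy Markov reward process:
    #   h_t = Q_delta(e_{t-1} + g_t(theta_t)),  theta_{t+1} = theta_t + alpha * h_t,
    #   e_t = e_{t-1} + g_t(theta_t) - h_t.
    rng = np.random.default_rng(seed)
    K = phi.shape[1]
    theta, e = np.zeros(K), np.zeros(K)       # memory variable e_{-1} = 0
    s = 0
    for _ in range(T):
        s_next = rng.choice(len(R), p=P[s])
        # noisy TD(0) direction g_t(theta_t) from the current transition
        g = (R[s] + gamma * phi[s_next] @ theta - phi[s] @ theta) * phi[s]
        h = topk(e + g, k)                    # only h_t is communicated
        theta = theta + alpha * h
        e = e + g - h                         # residual kept in local memory
        s = s_next
    return theta

# Toy 3-state chain with 2-dimensional features (placeholders, for illustration only).
P = np.array([[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.5, 0.3, 0.2]])
R = np.array([1.0, 0.0, -0.5])
phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
print(ef_td(P, R, phi))

Only the compressed signal hth_{t} is ever transmitted; the residual et1+gt(θt)hte_{t-1}+g_{t}(\theta_{t})-h_{t} is retained locally in the memory variable, which is precisely what the analysis below has to account for.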

To proceed with the analysis of the above dynamics, let us define the perturbed iterate θ~tθt+αet1.\tilde{\theta}_{t}\triangleq\theta_{t}+\alpha e_{t-1}. Using equation 67, we then obtain:

θ~t+1\displaystyle\tilde{\theta}_{t+1} =θt+1+αet\displaystyle=\theta_{t+1}+\alpha e_{t} (68)
=θt+αht+α(et1+gt(θt)ht)\displaystyle=\theta_{t}+\alpha h_{t}+\alpha\left(e_{t-1}+{g}_{t}(\theta_{t})-h_{t}\right)
=θ~t+αgt(θt).\displaystyle=\tilde{\theta}_{t}+\alpha{g}_{t}(\theta_{t}).

The final recursion above looks almost like the TD(0) update, other than the fact that the TD(0) direction gt()g_{t}(\cdot) is evaluated at θt\theta_{t}, and not at θ~t\tilde{\theta}_{t}. To account for this “mismatch” introduced by the memory variable et1e_{t-1}, we will analyze the following composite Lyapunov function:

ψt𝔼[d~t2+α2et12],whered~t2=θ~tθ2.\psi_{t}\triangleq\mathbb{E}[\tilde{d}^{2}_{t}+\alpha^{2}\|e_{t-1}\|^{2}],\hskip 2.84526pt\textrm{where}\hskip 2.84526pt\tilde{d}^{2}_{t}=\|\tilde{\theta}_{t}-\theta^{*}\|^{2}. (69)

Note that the above energy function captures the joint dynamics of the perturbed iterate and the memory variable. Our goal is to prove that this energy function decays exponentially over time (up to noise terms). To that end, we start by establishing a bound on d~t+12\tilde{d}^{2}_{t+1} in the following lemma.

Lemma 16.

(Bound on Perturbed Iterate) Suppose the step-size α\alpha satisfies α1/12\alpha\leq 1/12. For the EF-TD algorithm, the following bound then holds for t0\forall t\geq 0 (the requirement that α1/12\alpha\leq 1/12 is not necessary to obtain the type of bound in equation 70; it only serves to simplify some of the leading constants in the bound):

d~t+12(1αω(1γ)+24α2)d~t2+6α3ω(1γ)et12+2αθ~tθ,gt(θ~t)g¯(θ~t)+32α2σ2.\tilde{d}^{2}_{t+1}\leq\left(1-\alpha\omega(1-\gamma)+24\alpha^{2}\right)\tilde{d}^{2}_{t}+\frac{6\alpha^{3}}{\omega(1-\gamma)}\|e_{t-1}\|^{2}+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle+32\alpha^{2}\sigma^{2}. (70)
Proof.

Subtracting θ\theta^{*} from each side of equation 68 and then squaring both sides yields:

d~t+12\displaystyle\tilde{d}^{2}_{t+1} =d~t2+2αθ~tθ,gt(θt)+α2gt(θt)2\displaystyle=\tilde{d}^{2}_{t}+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\theta_{t})\rangle+\alpha^{2}\|g_{t}(\theta_{t})\|^{2} (71)
=d~t2+2αθ~tθ,gt(θ~t)()+2αθ~tθ,gt(θt)gt(θ~t)()+α2gt(θt)2().\displaystyle=\tilde{d}^{2}_{t}+\underbrace{2\alpha\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})\rangle}_{(*)}+\underbrace{2\alpha\langle\tilde{\theta}_{t}-\theta^{*},g_{t}({\theta}_{t})-g_{t}(\tilde{\theta}_{t})\rangle}_{(**)}+\underbrace{\alpha^{2}\|g_{t}(\theta_{t})\|^{2}}_{(***)}.

We now have:

()\displaystyle(*) =2αθ~tθ,g¯(θ~t)+2αθ~tθ,gt(θ~t)g¯(θ~t)\displaystyle=2\alpha\langle\tilde{\theta}_{t}-\theta^{*},\bar{g}(\tilde{\theta}_{t})\rangle+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle (72)
2αω(1γ)d~t2+2αθ~tθ,gt(θ~t)g¯(θ~t),\displaystyle\leq-2\alpha\omega(1-\gamma)\tilde{d}^{2}_{t}+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle,

where in the second step, we invoked Lemmas 3 and 4. To bound ()(**), we proceed as follows:

()\displaystyle(**) (a)4αd~tθtθ~t\displaystyle\overset{(a)}{\leq}4\alpha\tilde{d}_{t}\|\theta_{t}-\tilde{\theta}_{t}\| (73)
(b)2αηd~t2+2αηθtθ~t2\displaystyle\overset{(b)}{\leq}\frac{2\alpha}{\eta}\tilde{d}^{2}_{t}+2\alpha\eta\|\theta_{t}-\tilde{\theta}_{t}\|^{2}
=(c)2αηd~t2+2α3ηet12,\displaystyle\overset{(c)}{=}\frac{2\alpha}{\eta}\tilde{d}^{2}_{t}+2\alpha^{3}\eta\|e_{t-1}\|^{2},

where η>0\eta>0 is a constant to be decided shortly. In the above steps, (a) follows from the Cauchy-Schwarz inequality and the Lipschitz property of the TD(0) update direction in Lemma 8. For (b), we used the fact that for any two scalars x,yx,y\in\mathbb{R}, the following holds for all η>0\eta>0,

xy12ηx2+η2y2.xy\leq\frac{1}{2\eta}x^{2}+\frac{\eta}{2}y^{2}.

Finally, for (c), we simply used the fact that θ~tθt=αet1.\tilde{\theta}_{t}-\theta_{t}=\alpha e_{t-1}. To bound ()(***), observe that

gt(θt)2\displaystyle\|g_{t}(\theta_{t})\|^{2} (a)4(θt+σ)2\displaystyle\overset{(a)}{\leq}4(\|\theta_{t}\|+\sigma)^{2} (74)
8(θt2+σ2)\displaystyle\leq 8(\|\theta_{t}\|^{2}+\sigma^{2})
(b)8(3θtθ~t2+3θ~tθ2+3θ2+σ2)\displaystyle\overset{(b)}{\leq}8\left(3\|\theta_{t}-\tilde{\theta}_{t}\|^{2}+3\|\tilde{\theta}_{t}-\theta^{*}\|^{2}+3\|\theta^{*}\|^{2}+\sigma^{2}\right)
(c)24α2et12+24d~t2+32σ2,\displaystyle\overset{(c)}{\leq}24\alpha^{2}\|e_{t-1}\|^{2}+24\tilde{d}^{2}_{t}+32\sigma^{2},

where (a) follows from Lemma 6, (b) follows from equation 19, and (c) follows from noting that θσ.\|\theta^{*}\|\leq\sigma. Plugging the bounds in equations 72, 73, and 74 in equation 71, we obtain:

d~t+12\displaystyle\tilde{d}^{2}_{t+1} (12αω(1γ)+2αη+24α2)d~t2+2α3(η+12α)et12+2αθ~tθ,gt(θ~t)g¯(θ~t)\displaystyle\leq\left(1-2\alpha\omega(1-\gamma)+\frac{2\alpha}{\eta}+24\alpha^{2}\right)\tilde{d}^{2}_{t}+2\alpha^{3}(\eta+12\alpha)\|e_{t-1}\|^{2}+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle (75)
+32α2σ2.\displaystyle\hskip 5.69054pt+32\alpha^{2}\sigma^{2}.

The result follows from setting η=2ω(1γ)\eta=\frac{2}{\omega(1-\gamma)}, and simplifying using α1/12\alpha\leq 1/12. ∎

Unlike the standard TD(0) analysis, we note from Lemma 16 that the distance to optimality of the iterates is intimately coupled with the magnitude of the memory variable ete_{t}. As such, to proceed, we need to bound the growth of this memory variable. We do so in the following lemma.

Lemma 17.

(Bound on Memory Variable) For the EF-TD algorithm, the following bound holds for t0\forall t\geq 0:

et2(112δ+16α2δ)et12+64δd~t2+96δσ2.\|e_{t}\|^{2}\leq\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta\right)\|e_{t-1}\|^{2}+64\delta\tilde{d}^{2}_{t}+96\delta\sigma^{2}. (76)
Proof.

We begin as follows:

et2\displaystyle\|e_{t}\|^{2} =et1+gt(θt)ht2\displaystyle=\|e_{t-1}+{g}_{t}(\theta_{t})-h_{t}\|^{2} (77)
=et1+gt(θt)𝒬δ(et1+gt(θt))2\displaystyle=\|e_{t-1}+{g}_{t}(\theta_{t})-\mathcal{Q}_{\delta}\left(e_{t-1}+{g}_{t}(\theta_{t})\right)\|^{2}
(a)(11δ)et1+gt(θt)2\displaystyle\overset{(a)}{\leq}\left(1-\frac{1}{\delta}\right)\|e_{t-1}+{g}_{t}(\theta_{t})\|^{2}
(b)(11δ)(1+1η)et12+(11δ)(1+η)gt(θt)2,\displaystyle\overset{(b)}{\leq}\left(1-\frac{1}{\delta}\right)\left(1+\frac{1}{\eta}\right)\|e_{t-1}\|^{2}+\left(1-\frac{1}{\delta}\right)\left(1+\eta\right)\|{g}_{t}(\theta_{t})\|^{2},

where (a) follows from the contraction property of 𝒬δ()\mathcal{Q}_{\delta}(\cdot) in equation 4, (b) makes use of the relaxed triangle inequality in equation 18, and η>0\eta>0 is a constant to be chosen by us shortly. To ensure that et\|e_{t}\| contracts over time, we set η=2δ1\eta=2\delta-1 to obtain:

et2\displaystyle\|e_{t}\|^{2} (112δ)et12+2δgt(θt)2\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+2\delta\|{g}_{t}(\theta_{t})\|^{2} (78)
(112δ)et12+2δgt(θt)gt(θ~t)+gt(θ~t)2\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+2\delta\|{g}_{t}(\theta_{t})-{g}_{t}(\tilde{\theta}_{t})+{g}_{t}(\tilde{\theta}_{t})\|^{2}
(112δ)et12+4δgt(θt)gt(θ~t)2+4δgt(θ~t)2\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+4\delta\|{g}_{t}(\theta_{t})-{g}_{t}(\tilde{\theta}_{t})\|^{2}+4\delta\|{g}_{t}(\tilde{\theta}_{t})\|^{2}
(a)(112δ)et12+16δθtθ~t2+4δgt(θ~t)2\displaystyle\overset{(a)}{\leq}\left(1-\frac{1}{2\delta}\right)\|e_{t-1}\|^{2}+16\delta\|\theta_{t}-\tilde{\theta}_{t}\|^{2}+4\delta\|{g}_{t}(\tilde{\theta}_{t})\|^{2}
=(112δ+16α2δ)et12+4δgt(θ~t)2\displaystyle=\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta\right)\|e_{t-1}\|^{2}+4\delta\|{g}_{t}(\tilde{\theta}_{t})\|^{2}
(b)(112δ+16α2δ)et12+32δ(θ~t2+σ2)\displaystyle\overset{(b)}{\leq}\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta\right)\|e_{t-1}\|^{2}+32\delta\left(\|\tilde{\theta}_{t}\|^{2}+\sigma^{2}\right)
(112δ+16α2δ)et12+32δ(2d~t2+3σ2).\displaystyle\leq\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta\right)\|e_{t-1}\|^{2}+32\delta\left(2\tilde{d}^{2}_{t}+3\sigma^{2}\right).

In the above steps, for (a) we used the Lipschitz property of the noisy TD(0) update direction, and for (b), we appealed to Lemma 6. This completes the proof. ∎
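
As a quick numerical sanity check of the contraction property invoked above, the snippet below (a sketch, assuming equation 4 takes the standard form ∥x−𝒬δ(x)∥² ≤ (1−1/δ)∥x∥²) verifies the inequality on random vectors for two common compressors: top-k sparsification, for which δ = K/k, and the scaled-sign operator, for which δ = K.

import numpy as np

def topk(x, k):
    # Top-k sparsification: keep the k largest-magnitude coordinates.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def scaled_sign(x):
    # Scaled-sign operator (||x||_1 / K) * sign(x).
    return (np.abs(x).sum() / x.size) * np.sign(x)

rng = np.random.default_rng(0)
K, k = 50, 5
worst_topk, worst_sign = 0.0, 0.0
for _ in range(10000):
    x = rng.standard_normal(K)
    nx2 = float(x @ x)
    worst_topk = max(worst_topk, float(np.sum((x - topk(x, k)) ** 2)) / nx2)
    worst_sign = max(worst_sign, float(np.sum((x - scaled_sign(x)) ** 2)) / nx2)
print(worst_topk, 1 - k / K)   # observed worst-case ratio vs. 1 - 1/delta with delta = K/k
print(worst_sign, 1 - 1 / K)   # observed worst-case ratio vs. 1 - 1/delta with delta = K

In both cases the observed worst-case ratio stays below the corresponding 1 − 1/δ value, which is exactly the contraction that drives the decay of the memory variable in Lemma 17.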

D.1 Bounding the Drift and Bias Terms

Inspecting Lemma 16, it is apparent that we need to bound the “bias” term θ~tθ,gt(θ~t)g¯(θ~t)\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle. This requires some work for the following reasons.

  • In the (compressed) optimization setting, one does not encounter this term since gt(θ~t)g_{t}(\tilde{\theta}_{t}) is an unbiased version of g¯(θ~t)\bar{g}(\tilde{\theta}_{t}). Thus, taking expectations causes this term to vanish.

  • In the standard analysis of TD(0), while one does encounter such a bias term (under Markovian sampling), such a term features the true iterate θt\theta_{t}, and not its perturbed version θ~t\tilde{\theta}_{t}. This is where we again need to carefully account for the error between θt\theta_{t} and θ~t\tilde{\theta}_{t}.

  • In Appendix C, we derived a bound on the bias term by leveraging the uniform bounds on the memory variable in Lemma 12. Such uniform bounds were made possible via the projection step. Since we no longer have such a projection step at our disposal, we need an alternate proof technique.

In order to bound θ~tθ,gt(θ~t)g¯(θ~t)\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle, we will require a mixing time argument where we condition sufficiently into the past. This, in turn, will create the need to bound the drift θ~tθ~tτ\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\| of the perturbed iterate θ~t\tilde{\theta}_{t}; here, recall that τ\tau is the mixing time. In the analysis of vanilla TD(0), the authors in [14] show how such a drift term can be related to the distance to optimality of the (true) iterate at time tt. The presence of the memory variable ete_{t} (that accounts for past errors) makes it hard to establish such a result for our setting. As such, we will now establish a different bound on the drift θ~tθ~tτ\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\| as a function of the maximum amplitude of our constructed Lyapunov function (in equation 69) over the interval [tτ,t][t-\tau,t]. In this context, we have the following key result (Lemma 1 in the main body of the paper).

Lemma 18.

(Relating Drift to Past Shocks) For EF-TD, the following is true tτ\forall t\geq\tau:

𝔼[θ~tθ~tτ2]12α2τ2maxtτt1ψ+48α2τ2σ2.\mathbb{E}[\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}]\leq 12\alpha^{2}\tau^{2}\max_{t-\tau\leq\ell\leq t-1}\psi_{\ell}+48\alpha^{2}\tau^{2}\sigma^{2}. (79)
Proof.

Starting from equation 68, observe that

θ~t+1θ~t\displaystyle\|\tilde{\theta}_{t+1}-\tilde{\theta}_{t}\| αgt(θt)\displaystyle\leq\alpha\|g_{t}(\theta_{t})\| (80)
α(gt(θt)gt(θ~t)+gt(θ~t))\displaystyle\leq\alpha\left(\|g_{t}(\theta_{t})-g_{t}(\tilde{\theta}_{t})\|+\|g_{t}(\tilde{\theta}_{t})\|\right)
(a)2α(θtθ~t+θ~t+σ)\displaystyle\overset{(a)}{\leq}2\alpha\left(\|\theta_{t}-\tilde{\theta}_{t}\|+\|\tilde{\theta}_{t}\|+\sigma\right)
2α(αet1+θ~t+σ)\displaystyle\leq 2\alpha\left(\alpha\|e_{t-1}\|+\|\tilde{\theta}_{t}\|+\sigma\right)
2α(αet1+d~t+2σ),\displaystyle\leq 2\alpha\left(\alpha\|e_{t-1}\|+\tilde{d}_{t}+2\sigma\right),

where (a) follows from Lemmas 6 and 8. We thus have

𝔼[θ~t+1θ~t2]\displaystyle\mathbb{E}\left[\|\tilde{\theta}_{t+1}-\tilde{\theta}_{t}\|^{2}\right] 12α2𝔼[α2et12+d~t2+4σ2]\displaystyle\leq 12\alpha^{2}\mathbb{E}\left[\alpha^{2}\|e_{t-1}\|^{2}+\tilde{d}^{2}_{t}+4\sigma^{2}\right] (81)
=12α2ψt+48α2σ2,\displaystyle=12\alpha^{2}\psi_{t}+48\alpha^{2}\sigma^{2},

where for the first step, we used equation 19, and for the second, the definition of ψt\psi_{t} in equation 69. Appealing to equation 19 again, observe that

𝔼[θ~tθ~tτ2]\displaystyle\mathbb{E}\left[\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}\right] τ=tτt1𝔼[θ~+1θ~2]\displaystyle\leq\tau\sum_{\ell=t-\tau}^{t-1}\mathbb{E}\left[\|\tilde{\theta}_{\ell+1}-\tilde{\theta}_{\ell}\|^{2}\right] (82)
equation8112α2τ=tτt1(ψ+4σ2)\displaystyle\overset{equation~{}\ref{eqn:dt_interim}}{\leq}12\alpha^{2}\tau\sum_{\ell=t-\tau}^{t-1}\left(\psi_{\ell}+4\sigma^{2}\right)
12α2τ2maxtτt1ψ+48α2τ2σ2,\displaystyle\leq 12\alpha^{2}\tau^{2}\max_{t-\tau\leq\ell\leq t-1}\psi_{\ell}+48\alpha^{2}\tau^{2}\sigma^{2},

which is the desired claim. ∎

Interpreting ψ\psi_{\ell} as a “shock” from time-step \ell, Lemma 18 tells us that the drift of the perturbed iterate over the interval [tτ,t][t-\tau,t] can be bounded above by the maximum shock over this interval (up to noise terms). Fortunately, the effect of this shock is dampened by the presence of the O(α2)O(\alpha^{2}) term multiplying it. Equipped with Lemma 18, we now proceed to bound the bias term.

Lemma 19.

(Bounding the Bias) Suppose Assumption 1 holds. Let the step-size α\alpha be such that ατ1/6\alpha\tau\leq 1/6. For EF-TD, the following is then true tτ\forall t\geq\tau:

𝔼[θ~tθ,gt(θ~t)g¯(θ~t)]31ατ𝔼[d~t2]+103ατVt+454ατσ2,\mathbb{E}\left[\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle\right]\leq 31\alpha\tau\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+103\alpha\tau V_{t}+454\alpha\tau\sigma^{2}, (83)

where

Vtmaxtτt1ψ.V_{t}\triangleq\max_{t-\tau\leq\ell\leq t-1}\psi_{\ell}.
Proof.

We start by decomposing the bias term T=θ~tθ,gt(θ~t)g¯(θ~t)T=\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle as follows:

T=θ~tθ~tτ,gt(θ~t)g¯(θ~t)T1+θ~tτθ,gt(θ~t)g¯(θ~t)T2.T=\underbrace{\langle\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle}_{T_{1}}+\underbrace{\langle\tilde{\theta}_{t-\tau}-\theta^{*},g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle}_{T_{2}}.

To bound T1{T_{1}}, we note that

T1\displaystyle T_{1} θ~tθ~tτgt(θ~t)g¯(θ~t)\displaystyle\leq\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|\|g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\| (84)
12ατθ~tθ~tτ2+ατ2gt(θ~t)g¯(θ~t)2\displaystyle\leq\frac{1}{2\alpha\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}+\frac{\alpha\tau}{2}\|g_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\|^{2}
12ατθ~tθ~tτ2+ατ(gt(θ~t)2+g¯(θ~t)2)\displaystyle\leq\frac{1}{2\alpha\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}+{\alpha\tau}\left(\|g_{t}(\tilde{\theta}_{t})\|^{2}+\|\bar{g}(\tilde{\theta}_{t})\|^{2}\right)
(a)12ατθ~tθ~tτ2+10ατ(θ~t2+σ2)\displaystyle\overset{(a)}{\leq}\frac{1}{2\alpha\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}+10\alpha\tau(\|\tilde{\theta}_{t}\|^{2}+\sigma^{2})
12ατθ~tθ~tτ2+10ατ(2d~t2+3σ2),\displaystyle\leq\frac{1}{2\alpha\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}+10\alpha\tau(2\tilde{d}^{2}_{t}+3\sigma^{2}),

where for (a), we used Lemma 6 and equation 16. Taking expectations on both sides of the above inequality, and using Lemma 18, we obtain:

𝔼[T1]20ατ𝔼[d~t2]+6ατVt+54ατσ2.\mathbb{E}\left[T_{1}\right]\leq 20\alpha\tau\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+6\alpha\tau V_{t}+54\alpha\tau\sigma^{2}. (85)

Next, to bound T2T_{2}, we decompose it as follows:

T2=θ~tτθ,gt(θ~tτ)g¯(θ~tτ)()+θ~tτθ,gt(θ~t)gt(θ~tτ)()+θ~tτθ,g¯(θ~tτ)g¯(θ~t)().T_{2}=\underbrace{\langle\tilde{\theta}_{t-\tau}-\theta^{*},g_{t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})\rangle}_{(*)}+\underbrace{\langle\tilde{\theta}_{t-\tau}-\theta^{*},g_{t}(\tilde{\theta}_{t})-{g}_{t}(\tilde{\theta}_{t-\tau})\rangle}_{(**)}+\underbrace{\langle\tilde{\theta}_{t-\tau}-\theta^{*},\bar{g}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t})\rangle}_{(***)}.

We now proceed to bound each of the three terms above. Observe:

()\displaystyle(**) d~tτgt(θ~t)gt(θ~tτ)\displaystyle\leq\tilde{d}_{t-\tau}\|g_{t}(\tilde{\theta}_{t})-{g}_{t}(\tilde{\theta}_{t-\tau})\| (86)
(a)2d~tτθ~tθ~tτ\displaystyle\overset{(a)}{\leq}2\tilde{d}_{t-\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|
2(d~t+θ~tθ~tτ)θ~tθ~tτ\displaystyle\leq 2(\tilde{d}_{t}+\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|)\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|
(b)2(ατd~t+θ~tθ~tτατ)2\displaystyle\overset{(b)}{\leq}2\left(\sqrt{\alpha\tau}\tilde{d}_{t}+\frac{\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|}{\sqrt{\alpha\tau}}\right)^{2}
4(ατd~t2+θ~tθ~tτ2ατ),\displaystyle\leq 4\left({\alpha\tau}\tilde{d}^{2}_{t}+\frac{\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}}{{\alpha\tau}}\right),

where for (a), we used the Lipschitz property in Lemma 8, and for (b), we used the fact that ατ1.\alpha\tau\leq 1. Taking expectations on each side of the above inequality and appealing to Lemma 18, we obtain

𝔼[()]4ατ𝔼[d~t2]+48ατVt+192ατσ2.\mathbb{E}\left[(**)\right]\leq 4\alpha\tau\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+48\alpha\tau V_{t}+192\alpha\tau\sigma^{2}. (87)

Using Lemma 7 and the same arguments as above, one can establish the exact same bound on 𝔼[()]\mathbb{E}\left[(***)\right] as in equation 87. Before proceeding to bound ()(*), we make the observation that θ~t\tilde{\theta}_{t} inherits its randomness from all the Markov data tuples up to time t1t-1, i.e., from {Xk}k=0t1\{X_{k}\}_{k=0}^{t-1}. We now have:

𝔼[()]\displaystyle\mathbb{E}\left[(*)\right] =𝔼[θ~tτθ,gt(θ~tτ)g¯(θ~tτ)]\displaystyle=\mathbb{E}\left[\langle\tilde{\theta}_{t-\tau}-\theta^{*},g_{t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})\rangle\right] (88)
=𝔼[𝔼[θ~tτθ,gt(θ~tτ)g¯(θ~tτ)|θ~tτ,Xtτ]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\langle\tilde{\theta}_{t-\tau}-\theta^{*},g_{t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})\rangle|\tilde{\theta}_{t-\tau},X_{t-\tau}\right]\right]
=𝔼[θ~tτθ,𝔼[gt(θ~tτ)g¯(θ~tτ)|θ~tτ,Xtτ]]\displaystyle=\mathbb{E}\left[\langle\tilde{\theta}_{t-\tau}-\theta^{*},\mathbb{E}\left[g_{t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})|\tilde{\theta}_{t-\tau},X_{t-\tau}\right]\rangle\right]
𝔼[d~tτ𝔼[gt(θ~tτ)g¯(θ~tτ)|θ~tτ,Xtτ]]\displaystyle\leq\mathbb{E}\left[\tilde{d}_{t-\tau}\|\mathbb{E}\left[g_{t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})|\tilde{\theta}_{t-\tau},X_{t-\tau}\right]\|\right]
(a)α𝔼[d~tτ(θ~tτ+1)]\displaystyle\overset{(a)}{\leq}\alpha\mathbb{E}\left[\tilde{d}_{t-\tau}\left(\|\tilde{\theta}_{t-\tau}\|+1\right)\right]
α𝔼[d~tτ(θ~tτθ~t+d~t+θ+1)]\displaystyle\leq\alpha\mathbb{E}\left[\tilde{d}_{t-\tau}\left(\|\tilde{\theta}_{t-\tau}-\tilde{\theta}_{t}\|+\tilde{d}_{t}+\|\theta^{*}\|+1\right)\right]
α𝔼[(d~t+θ~tθ~tτ)(θ~tτθ~t+d~t+θ+1)]\displaystyle\leq\alpha\mathbb{E}\left[(\tilde{d}_{t}+\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|)\left(\|\tilde{\theta}_{t-\tau}-\tilde{\theta}_{t}\|+\tilde{d}_{t}+\|\theta^{*}\|+1\right)\right]
α𝔼[(d~t+θ~tθ~tτ)(θ~tτθ~t+d~t+2σ)]\displaystyle\leq\alpha\mathbb{E}\left[(\tilde{d}_{t}+\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|)\left(\|\tilde{\theta}_{t-\tau}-\tilde{\theta}_{t}\|+\tilde{d}_{t}+2\sigma\right)\right]
(b)ατ𝔼[(d~t+θ~tθ~tτ+2σ)2]\displaystyle\overset{(b)}{\leq}\alpha\tau\mathbb{E}\left[(\tilde{d}_{t}+\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|+2\sigma)^{2}\right]
3ατ𝔼[d~t2]+3ατ𝔼[θ~tθ~tτ2]+12ατσ2\displaystyle\leq 3\alpha\tau\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+3\alpha\tau\mathbb{E}\left[\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}\right]+12\alpha\tau\sigma^{2}
(c)3ατ𝔼[d~t2]+ατVt+16ατσ2.\displaystyle\overset{(c)}{\leq}3\alpha\tau\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+\alpha\tau V_{t}+16\alpha\tau\sigma^{2}.

In the above steps, (a) follows from the mixing property in Definition 1, (b) follows from the fact that τ1\tau\geq 1, and (c) follows by invoking Lemma 18 and simplifying using ατ1/6.\alpha\tau\leq 1/6. Combining the above bound with that in equation 87, we conclude that

𝔼[T2]11ατ𝔼[d~t2]+97ατVt+400ατσ2.\mathbb{E}\left[T_{2}\right]\leq 11\alpha\tau\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+97\alpha\tau V_{t}+400\alpha\tau\sigma^{2}.

Combining the above bound with that in equation 85 completes the proof. ∎

We now have all the pieces needed to prove Theorem 1.

Proof.

(Proof of Theorem 1) We break up the proof into two parts. In the first step, we establish a recursion for our potential function ψt\psi_{t}. In the second step, we analyze this recursion by making a connection to the analysis of the Incremental Aggregated Gradient (IAG) algorithm in [19].

Step 1: Establishing a Recursion for ψt\psi_{t}. Combining the bound on the bias term in Lemma 19 with Lemma 16, and simplifying using τ1\tau\geq 1, we obtain the following inequality tτ\forall t\geq\tau:

𝔼[d~t+12](1αω(1γ)+86α2τ)𝔼[d~t2]+206α2τVt+6α3ω(1γ)𝔼[et12]+940α2τσ2.\displaystyle\mathbb{E}\left[\tilde{d}^{2}_{t+1}\right]\leq\left(1-\alpha\omega(1-\gamma)+86\alpha^{2}\tau\right)\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+206\alpha^{2}\tau V_{t}+\frac{6\alpha^{3}}{\omega(1-\gamma)}\mathbb{E}\left[\|e_{t-1}\|^{2}\right]+940\alpha^{2}\tau\sigma^{2}.

Combining the above display with the bound on the memory variable in Lemma 17, and using the definition of the potential function ψt\psi_{t}, we then obtain tτ\forall t\geq\tau:

ψt+1\displaystyle\psi_{t+1} (1αω(1γ)+86α2τ+64α2δ)A1𝔼[d~t2]+206α2τVt\displaystyle\leq\underbrace{\left(1-\alpha\omega(1-\gamma)+86\alpha^{2}\tau+64\alpha^{2}\delta\right)}_{{A}_{1}}\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+206\alpha^{2}\tau V_{t} (89)
+α2(112δ+16α2δ+6αω(1γ))A2𝔼[et12]+α2(940τ+96δ)σ2.\displaystyle\hskip 5.69054pt+\alpha^{2}\underbrace{\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta+\frac{6\alpha}{\omega(1-\gamma)}\right)}_{A_{2}}\mathbb{E}\left[\|e_{t-1}\|^{2}\right]+\alpha^{2}(940\tau+96\delta)\sigma^{2}.

Our immediate goal is to pick α\alpha such that max{A1,A2}<1\max\{A_{1},A_{2}\}<1. Accordingly, it is easy to check that if

αω(1γ)344τandαω(1γ)256δ,\alpha\leq\frac{\omega(1-\gamma)}{344\tau}\hskip 5.69054pt\textrm{and}\hskip 5.69054pt\alpha\leq\frac{\omega(1-\gamma)}{256\delta}, (90)

then

A11αω(1γ)2.A_{1}\leq 1-\frac{\alpha\omega(1-\gamma)}{2}.

It is also easily verified that with the choice of step-size in equation 90, the following hold:

16α2δ18δand6αω(1γ)18δ,16\alpha^{2}\delta\leq\frac{1}{8\delta}\hskip 5.69054pt\textrm{and}\hskip 5.69054pt\frac{6\alpha}{\omega(1-\gamma)}\leq\frac{1}{8\delta},

implying that

A2114δ.A_{2}\leq 1-\frac{1}{4\delta}.

Finally, note that based on the choice of α\alpha in equation 90, we have

114δ1αω(1γ)2.1-\frac{1}{4\delta}\leq 1-\frac{\alpha\omega(1-\gamma)}{2}.

Combining all the above observations, we obtain that for all tτt\geq\tau:

ψt+1(1αω(1γ)2)ψt+206α2τ(maxtτtψ)+O(α2(τ+δ)σ2).\boxed{\psi_{t+1}\leq\left(1-\frac{\alpha\omega(1-\gamma)}{2}\right)\psi_{t}+206\alpha^{2}\tau\left(\max_{t-\tau\leq\ell\leq t}\psi_{\ell}\right)+O(\alpha^{2}(\tau+\delta)\sigma^{2}).} (91)

If the second term (i.e., the “max” term) in the above bound were absent, one could easily unroll the resulting recursion and argue linear convergence to a noise ball. In what follows, we will show that one can still establish such linear convergence guarantees for equation 91.

Step 2: Analyzing equation 91 via a connection to the IAG algorithm. To see how we can analyze equation 91, we take a quick detour and recap the basic idea behind the IAG method for finite-sum optimization. Say we want to minimize

f(x)=1Mi[M]fi(x),f(x)=\frac{1}{M}\sum_{i\in[M]}f_{i}(x),

where each component function is smooth. The IAG method does so in a computationally-efficient manner by processing each of the component functions one at a time in a deterministic order, and crucially, by maintaining a memory of the most recent gradient values of each of the component functions. This memory introduces certain delayed gradient terms in the update rule. In [19], it was shown that the presence of these delayed terms leads to a recursion of the form in equation 91. This turns out to be the key observation needed to complete our analysis. In particular, we recall an important lemma from [59] (used in [19]) that will help us reason about equation 91.

Lemma 20.

Let {Gt}\{G_{t}\} be a sequence of non-negative real numbers satisfying

Gt+1pGt+qmax(tτt)+tG+r,t,G_{t+1}\leq pG_{t}+q\max_{(t-\tau_{t})_{+}\leq\ell\leq t}G_{\ell}+r,\hskip 5.69054ptt\in\mathbb{N},

for some non-negative constants p,q,p,q, and rr. Here, for any real scalar xx, we use the notation (x)+=max{x,0}(x)_{+}=\max\{x,0\}. If p+q<1p+q<1 and 0τtτmax,t00\leq\tau_{t}\leq\tau_{max},\forall t\geq 0 for some positive constant τmax\tau_{max}, then

GtρtG0+ε,t0,G_{t}\leq\rho^{t}G_{0}+\varepsilon,\forall t\geq 0,

where

ρ=(p+q)11+τmax,andε=r(1pq).\rho=(p+q)^{\frac{1}{1+\tau_{max}}},\hskip 5.69054pt\textrm{and}\hskip 5.69054pt\varepsilon=\frac{r}{(1-p-q)}.
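
Before applying the lemma, here is a quick numerical illustration (with arbitrary, illustrative constants, not values from the paper) of the guarantee it provides: we simulate the tightest sequence allowed by the hypothesis and compare it against the envelope ρ^t G_0 + ε.

# Illustrative constants (not from the paper), chosen so that p + q < 1.
p, q, r, tau_max, G0, T = 0.9, 0.05, 0.01, 5, 10.0, 300
rho = (p + q) ** (1.0 / (1 + tau_max))
eps = r / (1.0 - p - q)

G = [G0]
for t in range(T):
    window = G[max(0, t - tau_max):t + 1]       # G_{(t - tau_t)_+}, ..., G_t
    G.append(p * G[t] + q * max(window) + r)    # tightest sequence allowed by the hypothesis

envelope = [rho ** t * G0 + eps for t in range(T + 1)]
print(max(g - env for g, env in zip(G, envelope)))   # non-positive: the bound of the lemma holds
print(G[-1], eps)                                    # the sequence settles near eps

The printed maximum gap is non-positive, and the sequence settles at the noise floor ε = r/(1−p−q); this is exactly the kind of linear convergence to a noise ball that we will extract from equation 91.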

Comparing equation 91 to Lemma 20, we note that for us:

p=1αω(1γ)2,q=206α2τ,r=O(α2(τ+δ)σ2),andτt=τ,p=1-\frac{\alpha\omega(1-\gamma)}{2},q=206\alpha^{2}\tau,r=O(\alpha^{2}(\tau+\delta)\sigma^{2}),\hskip 5.69054pt\textrm{and}\hskip 5.69054pt\tau_{t}=\tau,

where τ\tau is the mixing time. Now suppose α\alpha is chosen such that

αω(1γ)824τ.\alpha\leq\frac{\omega(1-\gamma)}{824\tau}. (92)

Then, we immediately obtain that

p+q=1αω(1γ)2+206α2τ<1αω(1γ)4.p+q=1-\frac{\alpha\omega(1-\gamma)}{2}+206\alpha^{2}\tau<1-\frac{\alpha\omega(1-\gamma)}{4}.

Setting C1=max0kτψkC_{1}=\max_{0\leq k\leq\tau}{\psi_{k}}, and appealing to Lemma 20 then yields the following Tτ\forall T\geq\tau:

ψTC1(1αω(1γ)8τ)Tτ+O(α(τ+δ)σ2ω(1γ)),\psi_{T}\leq C_{1}\left(1-\frac{\alpha\omega(1-\gamma)}{8\tau}\right)^{T-\tau}+O\left(\frac{\alpha(\tau+\delta)\sigma^{2}}{\omega(1-\gamma)}\right),

where we used the facts that (1x)a1ax(1-x)^{a}\leq 1-ax for x,a[0,1]x,a\in[0,1], and τ1\tau\geq 1, to simplify the final expression. Next, note that

𝔼[θTθ2]\displaystyle\mathbb{E}\left[\|{\theta}_{T}-\theta^{*}\|^{2}\right] =𝔼[θTθ~T+θ~Tθ2]\displaystyle=\mathbb{E}\left[{\|{\theta}_{T}-\tilde{\theta}_{T}+\tilde{\theta}_{T}-\theta^{*}\|}^{2}\right] (93)
2𝔼[θ~Tθ2]+2𝔼[θTθ~T2]\displaystyle\leq 2\mathbb{E}\left[{\|\tilde{\theta}_{T}-\theta^{*}\|}^{2}\right]+2\mathbb{E}\left[{\|{\theta}_{T}-\tilde{\theta}_{T}\|}^{2}\right]
=2𝔼[d~T2]+2α2𝔼[eT12]\displaystyle=2\mathbb{E}\left[\tilde{d}^{2}_{T}\right]+2{\alpha}^{2}\mathbb{E}\left[{\|e_{T-1}\|}^{2}\right]
=2ψT.\displaystyle=2\psi_{T}.

We conclude that Tτ,\forall T\geq\tau,

𝔼[rT2]2C1(1αω(1γ)8τ)Tτ+O(α(τ+δ)σ2ω(1γ)).\mathbb{E}\left[r^{2}_{T}\right]\leq 2C_{1}\left(1-\frac{\alpha\omega(1-\gamma)}{8\tau}\right)^{T-\tau}+O\left(\frac{\alpha(\tau+\delta)\sigma^{2}}{\omega(1-\gamma)}\right).

Furthermore, from the requirements on the step-size α\alpha in equations 90 and 92, we note that for the above inequality to hold, it suffices for α\alpha to satisfy:

αω(1γ)824max{τ,δ}.\boxed{\alpha\leq\frac{\omega(1-\gamma)}{824\max\{\tau,\delta\}}.}

The only thing that remains to be shown is that C1=max0kτψk=O(d02+σ2)C_{1}=\max_{0\leq k\leq\tau}{\psi_{k}}=O(d^{2}_{0}+\sigma^{2}) based on our choice of step-size above. This follows from straightforward calculations that we provide below for completeness.

From equation 68, we have

d~t+12\displaystyle\tilde{d}^{2}_{t+1} =d~t2+2αθ~tθ,gt(θt)+α2gt(θt)2\displaystyle=\tilde{d}^{2}_{t}+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},g_{t}(\theta_{t})\rangle+\alpha^{2}\|g_{t}(\theta_{t})\|^{2} (94)
(1+α)d~t2+2αgt(θt)2\displaystyle\leq(1+\alpha)\tilde{d}_{t}^{2}+2\alpha\|g_{t}(\theta_{t})\|^{2}
(1+49α)d~t2+48α3et12+64ασ2,\displaystyle\leq(1+49\alpha)\tilde{d}_{t}^{2}+48\alpha^{3}\|e_{t-1}\|^{2}+64\alpha\sigma^{2},

where in the last step, we used equation 74. Combining the above bound with Lemma 17, we obtain the following inequality for all t0t\geq 0:

ψt+1(1+49α+64α2δ)𝔼[d~t2]+α2(112δ+16α2δ+48α)𝔼[et12]+(64α+96α2δ)σ2.\psi_{t+1}\leq(1+49\alpha+64\alpha^{2}\delta)\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+\alpha^{2}\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta+48\alpha\right)\mathbb{E}\left[\|e_{t-1}\|^{2}\right]+(64\alpha+96\alpha^{2}\delta)\sigma^{2}. (95)

Using the fact that αδ1/824,\alpha\delta\leq 1/824, we can simplify the above display to obtain

ψt+1(1+50α)ψt+65ασ2,t0.\psi_{t+1}\leq(1+50\alpha)\psi_{t}+65\alpha\sigma^{2},\forall t\geq 0. (96)

Unrolling the above inequality and using e1=0e_{-1}=0, we have that for 0kτ0\leq k\leq\tau:

ψk(1+50α)kd02+65ασ2j=0k1(1+50α)j.\psi_{k}\leq(1+50\alpha)^{k}d^{2}_{0}+65\alpha\sigma^{2}\sum_{j=0}^{k-1}(1+50\alpha)^{j}.\\ (97)

Now since (1+x)ex,x(1+x)\leq e^{x},\forall x\in\mathbb{R}, and α1/(200τ)\alpha\leq 1/(200\tau), note that (1+50α)k(1+50α)τe0.252.(1+50\alpha)^{k}\leq(1+50\alpha)^{\tau}\leq e^{0.25}\leq 2. Thus, for 0kτ0\leq k\leq\tau, we have

ψk2d02+130ατσ22d02+σ2,\psi_{k}\leq 2d^{2}_{0}+130\alpha\tau\sigma^{2}\leq 2d^{2}_{0}+\sigma^{2},

where in the last step, we used 130ατ1.130\alpha\tau\leq 1. Thus, C1=max0kτψk=O(d02+σ2)C_{1}=\max_{0\leq k\leq\tau}\psi_{k}=O(d^{2}_{0}+\sigma^{2}). This concludes the proof. ∎

We conclude this section with a note on the proof of Theorem 2.

Proof of Theorem 2: A careful inspection of the proof of Theorem 1 reveals that we never explicitly used the fact that the TD update direction is an affine function of the parameter θ\theta. This was done on purpose to provide a unified analysis framework to not only reason about linear stochastic approximation schemes with error-feedback, but also their nonlinear counterparts. As such, under Assumptions 2 and 3, the analysis for the nonlinear setting in Section 5 follows exactly the same steps as the proof of Theorem 1. All one needs to do is replace ω(1γ)\omega(1-\gamma) in the proof of Theorem 1 with β\beta, where β\beta is as in Assumption 3. Everything else essentially remains the same. We thus omit routine details here.

Appendix E Analysis of Multi-Agent EF-TD: Proof of Theorem 3

In this section, we will analyze the multi-agent version of EF-TD outlined in Algorithm 2. The main technical challenge relative to the single-agent analysis conducted in Appendices C and D is in establishing the linear speedup property w.r.t. the number of agents MM under the Markovian sampling assumption. As we mentioned earlier in the paper, this turns out to be highly non-trivial even in the absence of compression and error-feedback. The only work that establishes such a speedup (under Markovian sampling) is  [13], where the authors use the framework of Generalized Moreau Envelopes to perform their analysis. Although the analysis in [13] is elegant, it is quite involved, and it is unclear whether their framework can accommodate the error-feedback mechanism. Moreover, the analysis in [13] leads to a sub-optimal O(τ2)O(\tau^{2}) dependence on the mixing time τ\tau in the main noise/variance term. In light of the above discussion, we will provide a different analysis in this section that:

  • Departs from the Moreau Envelope approach in [13],

  • Achieves the optimal O(τ)O(\tau) dependence on the mixing time τ\tau in the dominant noise term,

  • Establishes the desired linear speedup property, and

  • Shows that the effect of the distortion parameter δ\delta can be relegated to a higher-order term.

Crucial to achieving all of the above desiderata are a few novel ingredients in the proof that we now outline. First, we will require a more refined Lyapunov function than we used earlier. Before we introduce this Lyapunov function, let us define a couple of objects:

e¯t1Mi[M]ei,t,andh¯t1Mi[M]hi,t.\bar{e}_{t}\triangleq\frac{1}{M}\sum_{i\in[M]}e_{i,t},\hskip 4.2679pt\textrm{and}\hskip 4.2679pt\bar{h}_{t}\triangleq\frac{1}{M}\sum_{i\in[M]}h_{i,t}.

Next, let us define a perturbed iterate for this setting as θ~tθt+αe¯t1\tilde{\theta}_{t}\triangleq\theta_{t}+\alpha\bar{e}_{t-1}. The potential function we employ is as follows:

Ξt𝔼[θ~tθ2]+Cα3Et1,whereEt1M𝔼[i=1Mei,t2],\Xi_{t}\triangleq\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]+C\alpha^{3}E_{t-1},\hskip 5.69054pt\textrm{where}\hskip 5.69054ptE_{t}\triangleq\frac{1}{M}\mathbb{E}\left[\sum_{i=1}^{M}\|e_{i,t}\|^{2}\right], (98)

and C>1C>1 is a constant that will be chosen by us later. Compared to the Lyapunov function ψt\psi_{t} in equation 69 that we used for the single-agent setting, Ξt\Xi_{t} differs in two ways. First, it incorporates the memory dynamics of all the agents. A more subtle difference, however, stems from the fact that the second term of Ξt\Xi_{t} is scaled by α3\alpha^{3}, and not α2\alpha^{2} (as in equation 69). This higher-order dependence on α\alpha will serve a twofold purpose: (i) help to shift the effect of δ\delta to a higher-order term, and (ii) partially help in achieving the linear-speedup property. Unfortunately, however, the new Lyapunov function on its own will not suffice in terms of achieving the linear speedup effect completely. For this, we need a careful way to bound the norm of the average TD direction defined below:

zt(θ)1Mi[M]gi,t(θ),θK.z_{t}(\theta)\triangleq\frac{1}{M}\sum_{i\in[M]}g_{i,t}(\theta),\forall\theta\in\mathbb{R}^{K}. (99)

An immediate way to bound zt(θ)z_{t}(\theta) is to appeal to the bound on the norm of the TD update direction in Lemma 6. Indeed, this is what we did while proving Theorem 1, and this is also the typical approach for bounding norms of TD update directions in the centralized setting [14, 16]. The issue with adopting this approach in the multi-agent analysis is that it completely ignores the fact that the observations of the agents are statistically independent. As such, following this route will not lead to any “variance-reduction” effect (key to the linear speedup property). At the same time, it is important to realize here that while the Markov data tuples are independent across agents, for any fixed agent ii, {Xi,t}\{X_{i,t}\} comes from a single Markov chain. This makes it trickier to analyze the variance of zt(θt)z_{t}(\theta_{t}). Via a careful mixing time argument, our next result (Lemma 2 in the main body of the paper) shows how this can be done. Before stating this result, we remind the reader that we have used τ\tau as a shorthand for τϵ\tau_{\epsilon}, with ϵ=α2\epsilon=\alpha^{2}. While a precision of ϵ=α\epsilon=\alpha sufficed for all our prior single-agent analyses, we will need ϵ=α2\epsilon=\alpha^{2} for the MARL case (to create higher-order noise terms in α\alpha).
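
For concreteness, a minimal multi-agent sketch is given below. It assumes, consistent with the single-agent dynamics in equation 67 and the averaged quantities defined above, that each agent ii forms hi,t=𝒬δ(ei,t1+gi,t(θt))h_{i,t}=\mathcal{Q}_{\delta}(e_{i,t-1}+g_{i,t}(\theta_{t})), the server updates θt+1=θt+αh¯t\theta_{t+1}=\theta_{t}+\alpha\bar{h}_{t} using the average of the compressed signals, and each agent locally updates ei,t=ei,t1+gi,t(θt)hi,te_{i,t}=e_{i,t-1}+g_{i,t}(\theta_{t})-h_{i,t}; the compressor, chain, and features are again illustrative placeholders.

import numpy as np

def topk(x, k):
    # Illustrative delta-contractive compressor: top-k sparsification.
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def multi_agent_ef_td(P, R, phi, M=10, alpha=0.05, gamma=0.9, k=1, T=5000, seed=0):
    # Each agent i runs its own copy of the Markov chain, compresses its
    # error-corrected TD direction, and the server averages the M compressed
    # signals; the per-agent error update mirrors equation 67.
    rng = np.random.default_rng(seed)
    K = phi.shape[1]
    theta = np.zeros(K)
    e = np.zeros((M, K))                  # e_{i,-1} = 0 for every agent
    s = np.zeros(M, dtype=int)            # independent chains, one per agent
    for _ in range(T):
        h_bar = np.zeros(K)
        for i in range(M):
            s_next = rng.choice(len(R), p=P[s[i]])
            g = (R[s[i]] + gamma * phi[s_next] @ theta - phi[s[i]] @ theta) * phi[s[i]]
            h = topk(e[i] + g, k)         # agent i transmits only h_{i,t}
            e[i] = e[i] + g - h
            h_bar += h / M
            s[i] = s_next
        theta = theta + alpha * h_bar     # server update with the averaged signal
    return theta

# Toy placeholders (same illustrative chain and features as in the earlier sketch).
P = np.array([[0.1, 0.6, 0.3], [0.4, 0.2, 0.4], [0.5, 0.3, 0.2]])
R = np.array([1.0, 0.0, -0.5])
phi = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
print(multi_agent_ef_td(P, R, phi))

With this structure, the per-iteration uplink communication of each agent is only the compressed vector hi,th_{i,t}, and the averaged direction entering the server update is exactly the object whose variance Lemma 21 controls.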

Lemma 21.

(Controlling the norm of the Average TD Direction) Suppose Assumption 1 holds. For Algorithm 2, the following are then true tτ\forall t\geq\tau:

𝔼[zt(θt)2]8𝔼[dt2]+(32M+8α4)σ2,wheredt=θtθ,and\mathbb{E}\left[\|z_{t}(\theta_{t})\|^{2}\right]\leq 8\mathbb{E}\left[d^{2}_{t}\right]+\left(\frac{32}{M}+8\alpha^{4}\right)\sigma^{2},\hskip 5.69054pt\textrm{where}\hskip 5.69054ptd_{t}=\|\theta_{t}-\theta^{*}\|,\hskip 5.69054pt\textrm{and} (100)
𝔼[zt(θt)2]16𝔼[d~t2]+16α2Et1+(32M+8α4)σ2,whered~t=θ~tθ.\mathbb{E}\left[\|z_{t}(\theta_{t})\|^{2}\right]\leq 16\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+16\alpha^{2}E_{t-1}+\left(\frac{32}{M}+8\alpha^{4}\right)\sigma^{2},\hskip 5.69054pt\textrm{where}\hskip 5.69054pt\tilde{d}_{t}=\|\tilde{\theta}_{t}-\theta^{*}\|. (101)
Proof.

Let us start with the following set of observations:

zt(θt)2\displaystyle\|z_{t}(\theta_{t})\|^{2} =1M2i=1Mgi,t(θt)2\displaystyle=\frac{1}{M^{2}}\left\|\sum_{i=1}^{M}g_{i,t}(\theta_{t})\right\|^{2} (102)
=1M2i=1M(gi,t(θt)gi,t(θ))+i=1Mgi,t(θ)2\displaystyle=\frac{1}{M^{2}}\left\|\sum_{i=1}^{M}\left(g_{i,t}(\theta_{t})-g_{i,t}(\theta^{*})\right)+\sum_{i=1}^{M}g_{i,t}(\theta^{*})\right\|^{2}
2M2i=1M(gi,t(θt)gi,t(θ))2+2M2i=1Mgi,t(θ)2\displaystyle\leq\frac{2}{M^{2}}\left\|\sum_{i=1}^{M}\left(g_{i,t}(\theta_{t})-g_{i,t}(\theta^{*})\right)\right\|^{2}+\frac{2}{M^{2}}\left\|\sum_{i=1}^{M}g_{i,t}(\theta^{*})\right\|^{2}
2Mi=1Mgi,t(θt)gi,t(θ)2+2M2i=1Mgi,t(θ)2\displaystyle\leq\frac{2}{M}\sum_{i=1}^{M}\left\|g_{i,t}(\theta_{t})-g_{i,t}(\theta^{*})\right\|^{2}+\frac{2}{M^{2}}\left\|\sum_{i=1}^{M}g_{i,t}(\theta^{*})\right\|^{2}
8dt2+2M2i=1Mgi,t(θ)2,\displaystyle\leq 8d^{2}_{t}+\frac{2}{M^{2}}\left\|\sum_{i=1}^{M}g_{i,t}(\theta^{*})\right\|^{2},

where in the last step, we used the Lipschitz property in Lemma 8. Next, to bound the second term in the above display, we split it into two parts as follows.

i=1Mgi,t(θ)2=i=1Mgi,t(θ)2()+i,j=1ijMgi,t(θ),gj,t(θ)().\left\|\sum_{i=1}^{M}g_{i,t}(\theta^{*})\right\|^{2}=\underbrace{\sum_{i=1}^{M}\|g_{i,t}(\theta^{*})\|^{2}}_{(*)}+\underbrace{\sum_{\begin{subarray}{c}i,j=1\\ i\neq j\end{subarray}}^{M}\langle g_{i,t}(\theta^{*}),g_{j,t}(\theta^{*})\rangle}_{(**)}.

To bound ()(*), we simply use Lemma 6 and the fact that θσ\|\theta^{*}\|\leq\sigma to conclude that

i=1Mgi,t(θ)28M(θ2+σ2)16Mσ2.\sum_{i=1}^{M}\|g_{i,t}(\theta^{*})\|^{2}\leq 8M(\|\theta^{*}\|^{2}+\sigma^{2})\leq 16M\sigma^{2}.

Now to bound ()(**), let us zoom in on a particular cross-term, and write it out in a way that highlights the sources of randomness. Accordingly, consider the term

𝒯=gi,t(θ),gj,t(θ)=g(Xi,t,θ),g(Xj,t,θ).\mathcal{T}=\langle g_{i,t}(\theta^{*}),g_{j,t}(\theta^{*})\rangle=\langle g(X_{i,t},\theta^{*}),g(X_{j,t},\theta^{*})\rangle.

Since θ\theta^{*} is deterministic, we note that the randomness in 𝒯\mathcal{T} originates from the Markov data samples Xi,tX_{i,t} and Xj,tX_{j,t}. Moreover, since Xi,tX_{i,t} and Xj,tX_{j,t} are independent for iji\neq j, we have

𝔼[𝒯]=𝔼[g(Xi,t,θ)],𝔼[g(Xj,t,θ)].\mathbb{E}\left[\mathcal{T}\right]=\langle\mathbb{E}\left[g(X_{i,t},\theta^{*})\right],\mathbb{E}\left[g(X_{j,t},\theta^{*})\right]\rangle.

Now if Xi,tX_{i,t} and Xj,tX_{j,t} were sampled i.i.d. from the stationary distribution π\pi, each of the two expectations within the above inner-product would have amounted to g¯(θ)=0\bar{g}(\theta^{*})=0. Thus, the cross-terms would have vanished. Since for each agent ii, Xi,tX_{i,t} comes from a Markov chain, these expectations do not, unfortunately, vanish any longer. Nonetheless, we now show that for tτt\geq\tau, one can still make the cross-terms suitably “small” by exploiting the mixing property in Definition 1. Observe:

𝔼[𝒯]\displaystyle\mathbb{E}\left[\mathcal{T}\right] =𝔼[g(Xi,t,θ)],𝔼[g(Xj,t,θ)]\displaystyle=\langle\mathbb{E}\left[g(X_{i,t},\theta^{*})\right],\mathbb{E}\left[g(X_{j,t},\theta^{*})\right]\rangle (103)
=(a)𝔼[𝔼[g(Xi,t,θ)|Xi,tτ]g¯(θ)],𝔼[𝔼[g(Xj,t,θ)|Xj,tτ]g¯(θ)]\displaystyle\overset{(a)}{=}\langle\mathbb{E}\left[\mathbb{E}\left[g(X_{i,t},\theta^{*})|X_{i,t-\tau}\right]-\bar{g}(\theta^{*})\right],\mathbb{E}\left[\mathbb{E}\left[g(X_{j,t},\theta^{*})|X_{j,t-\tau}\right]-\bar{g}(\theta^{*})\right]\rangle
(b)𝔼[𝔼[g(Xi,t,θ)|Xi,tτ]g¯(θ)]×𝔼[𝔼[g(Xj,t,θ)|Xj,tτ]g¯(θ)]\displaystyle\overset{(b)}{\leq}\left\|\mathbb{E}\left[\mathbb{E}\left[g(X_{i,t},\theta^{*})|X_{i,t-\tau}\right]-\bar{g}(\theta^{*})\right]\right\|\times\left\|\mathbb{E}\left[\mathbb{E}\left[g(X_{j,t},\theta^{*})|X_{j,t-\tau}\right]-\bar{g}(\theta^{*})\right]\right\|
(c)𝔼[𝔼[g(Xi,t,θ)|Xi,tτ]g¯(θ)]×𝔼[𝔼[g(Xj,t,θ)|Xj,tτ]g¯(θ)]\displaystyle\overset{(c)}{\leq}\mathbb{E}\left[\left\|\mathbb{E}\left[g(X_{i,t},\theta^{*})|X_{i,t-\tau}\right]-\bar{g}(\theta^{*})\right\|\right]\times\mathbb{E}\left[\left\|\mathbb{E}\left[g(X_{j,t},\theta^{*})|X_{j,t-\tau}\right]-\bar{g}(\theta^{*})\right\|\right]
(d)α4(θ+1)2\displaystyle\overset{(d)}{\leq}\alpha^{4}(\|\theta^{*}\|+1)^{2}
4σ2α4.\displaystyle\leq 4\sigma^{2}\alpha^{4}.

In the above steps, (a) follows from the tower property of expectation in conjunction with g¯(θ)=0\bar{g}(\theta^{*})=0. For (b), we used the Cauchy-Schwarz inequality, and (c) follows from Jensen’s inequality. Finally, (d) is a consequence of the mixing property in Definition 1. We immediately conclude:

𝔼[()]4M2σ2α4.\mathbb{E}\left[(**)\right]\leq 4M^{2}\sigma^{2}\alpha^{4}.

Putting the above pieces together and simplifying leads to equation 100. To go from equation 100 to equation 101, we simply note that

𝔼[dt2]2𝔼[d~t2]+2𝔼[θ~tθt2]=2𝔼[d~t2]+2α2𝔼[e¯t12]2𝔼[d~t2]+2α2Et1.\mathbb{E}\left[d^{2}_{t}\right]\leq 2\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+2\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta_{t}\|^{2}\right]=2\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+2\alpha^{2}\mathbb{E}\left[\|\bar{e}_{t-1}\|^{2}\right]\leq 2\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+2\alpha^{2}E_{t-1}. (104)

This concludes the proof. ∎

Essentially, Lemma 21 shows how the variance of the average TD update direction can be scaled down by a factor of MM, up to an O(α4)O(\alpha^{4}) term. As we shall soon see, this result will play a key role in our subsequent analysis of Algorithm 2. We now proceed to derive a multi-agent version of Lemma 16.
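
The short Monte-Carlo experiment below illustrates this effect on a toy two-state chain (an illustrative placeholder, not the paper's setting): it computes the TD(0) fixed point θ* in closed form, and then estimates E∥(1/M)Σ_i g(X_{i,t}, θ*)∥² for M independently evolving chains after each chain has had ample time to mix; the estimate shrinks roughly like 1/M.

import numpy as np

# Illustrative two-state chain (placeholder, not the paper's setting).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
R = np.array([1.0, -1.0])
phi = np.array([[1.0], [0.5]])
gamma = 0.9

# Stationary distribution and the TD(0) fixed point theta*, solving A theta* = b.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmax(np.real(evals))])
pi = pi / pi.sum()
A = phi.T @ np.diag(pi) @ (np.eye(2) - gamma * P) @ phi
b = phi.T @ np.diag(pi) @ R
theta_star = np.linalg.solve(A, b)

def avg_td_norm_sq(M, t=50, trials=500, seed=0):
    # Monte-Carlo estimate of E|| (1/M) sum_i g(X_{i,t}, theta*) ||^2 for M
    # independently evolving chains, after t steps (well beyond the mixing time).
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        z = np.zeros(1)
        for _ in range(M):
            s = 0
            for _ in range(t):
                s = rng.choice(2, p=P[s])
            s_next = rng.choice(2, p=P[s])
            g = (R[s] + gamma * phi[s_next] @ theta_star - phi[s] @ theta_star) * phi[s]
            z = z + g / M
        total += float(z @ z)
    return total / trials

for M in (1, 4, 16):
    print(M, avg_td_norm_sq(M))   # shrinks roughly like 1/M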

Lemma 22.

Suppose Assumption 1 holds and the step-size α\alpha satisfies α1/16\alpha\leq 1/16. For Algorithm 2, the following bound then holds for tτ\forall t\geq\tau:

𝔼[d~t+12](1αω(1γ)+16α2)𝔼[d~t2]+5α3ω(1γ)Et1+2α𝔼[A]+α2(32M+8α4)σ2,\mathbb{E}\left[\tilde{d}^{2}_{t+1}\right]\leq\left(1-\alpha\omega(1-\gamma)+16\alpha^{2}\right)\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+\frac{5\alpha^{3}}{\omega(1-\gamma)}E_{t-1}+2\alpha\mathbb{E}\left[A\right]+\alpha^{2}\left(\frac{32}{M}+8\alpha^{4}\right)\sigma^{2}, (105)

where A=θ~tθ,zt(θ~t)g¯(θ~t).A=\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle.

Proof.

Simple calculations reveal that

θ~t+1=θ~t+α(1Mi[M]gi,t(θt))=θ~t+αzt(θt).\tilde{\theta}_{t+1}=\tilde{\theta}_{t}+\alpha\left(\frac{1}{M}\sum_{i\in[M]}g_{i,t}(\theta_{t})\right)=\tilde{\theta}_{t}+\alpha z_{t}(\theta_{t}). (106)

Subtracting θ\theta^{*} from each side of the above equation and then squaring both sides yields:

d~t+12\displaystyle\tilde{d}^{2}_{t+1} =d~t2+2αθ~tθ,zt(θt)+α2zt(θt)2\displaystyle=\tilde{d}^{2}_{t}+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\theta_{t})\rangle+\alpha^{2}\|z_{t}(\theta_{t})\|^{2} (107)
=d~t2+2αθ~tθ,zt(θ~t)()+2αθ~tθ,zt(θt)zt(θ~t)()+α2zt(θt)2().\displaystyle=\tilde{d}^{2}_{t}+\underbrace{2\alpha\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\tilde{\theta}_{t})\rangle}_{(*)}+\underbrace{2\alpha\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\theta_{t})-z_{t}(\tilde{\theta}_{t})\rangle}_{(**)}+\underbrace{\alpha^{2}\|z_{t}(\theta_{t})\|^{2}}_{(***)}.

We now have:

()\displaystyle(*) =2αθ~tθ,g¯(θ~t)+2αθ~tθ,zt(θ~t)g¯(θ~t)\displaystyle=2\alpha\langle\tilde{\theta}_{t}-\theta^{*},\bar{g}(\tilde{\theta}_{t})\rangle+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle (108)
2αω(1γ)d~t2+2αθ~tθ,zt(θ~t)g¯(θ~t),\displaystyle\leq-2\alpha\omega(1-\gamma)\tilde{d}^{2}_{t}+2\alpha\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle,

where in the second step, we invoked Lemmas 3 and 4. To bound ()(**), we proceed as follows:

()\displaystyle(**) 4αd~tθtθ~t\displaystyle\leq 4\alpha\tilde{d}_{t}\|\theta_{t}-\tilde{\theta}_{t}\| (109)
2αηd~t2+2αηθtθ~t2\displaystyle\leq\frac{2\alpha}{\eta}\tilde{d}^{2}_{t}+2\alpha\eta\|\theta_{t}-\tilde{\theta}_{t}\|^{2}
=2αηd~t2+2α3ηe¯t12\displaystyle=\frac{2\alpha}{\eta}\tilde{d}^{2}_{t}+2\alpha^{3}\eta\|\bar{e}_{t-1}\|^{2}
2αηd~t2+2α3ηEt1,\displaystyle\leq\frac{2\alpha}{\eta}\tilde{d}^{2}_{t}+2\alpha^{3}\eta E_{t-1},

where η>0\eta>0 is a constant to be decided shortly. In the first step above, we used the fact that since each gi,t(θ)g_{i,t}(\theta) is 22-Lipschitz (see Lemma 8), the definition of zt(θ)z_{t}(\theta) in equation 99 implies that zt(θ)z_{t}(\theta) is also 22-Lipschitz. To bound ()(***), we directly use Lemma 21. Combining the above bounds, and simplifying by setting η=2ω(1γ)\eta=\frac{2}{\omega(1-\gamma)} and using α1/16\alpha\leq 1/16 leads to the desired claim. ∎

In what follows, we will focus on bounding the bias term A=θ~tθ,zt(θ~t)g¯(θ~t)A=\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle by following the same high-level steps as in the proof of Theorem 1. The key difference, however, will come from invoking Lemma 21 instead of Lemma 6. We start with a bound on the drift θ~tθ~tτ.\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|.

Lemma 23.

Suppose Assumption 1 holds. Then for Algorithm 2, we have the following bound t2τ\forall t\geq 2\tau:

𝔼[θ~tθ~tτ2]α2τ2maxtτt1G,whereG16𝔼[d~2]+16α2E1+(32M+8α4)σ2.\mathbb{E}[\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}]\leq\alpha^{2}\tau^{2}\max_{t-\tau\leq\ell\leq t-1}G_{\ell},\hskip 5.69054pt\textrm{where}\hskip 5.69054ptG_{\ell}\triangleq 16\mathbb{E}\left[\tilde{d}^{2}_{\ell}\right]+16\alpha^{2}E_{\ell-1}+\left(\frac{32}{M}+8\alpha^{4}\right)\sigma^{2}. (110)
Proof.

The proof is a direct application of Lemma 21. Indeed, notice that

𝔼[θ~tθ~tτ2]\displaystyle\mathbb{E}\left[\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}\right] τ=tτt1𝔼[θ~+1θ~2]\displaystyle\leq\tau\sum_{\ell=t-\tau}^{t-1}\mathbb{E}\left[\|\tilde{\theta}_{\ell+1}-\tilde{\theta}_{\ell}\|^{2}\right] (111)
equation106α2τ=tτt1𝔼[z(θ)2]\displaystyle\overset{equation~{}\ref{eqn:perturbMA}}{\leq}\alpha^{2}\tau\sum_{\ell=t-\tau}^{t-1}\mathbb{E}\left[\|z_{\ell}(\theta_{\ell})\|^{2}\right]
equation101α2τ2maxtτt1G,\displaystyle\overset{equation~{}\ref{eqn:Varbnd2}}{\leq}\alpha^{2}\tau^{2}\max_{t-\tau\leq\ell\leq t-1}G_{\ell},

which is the desired claim. In the last step, we invoked Lemma 21 by noting that since t2τt\geq 2\tau, we have τ\ell\geq\tau in the above steps, as required for that lemma to apply. ∎

Equipped with Lemmas 21 and 23, we now proceed to bound the bias term following steps similar in spirit to the proof of Lemma 19.

Lemma 24.

Suppose Assumption 1 holds. Let the step-size α\alpha be such that ατ1/3\alpha\tau\leq 1/3. For Algorithm 2, the following is then true t2τ\forall t\geq 2\tau:

𝔼[θ~tθ,zt(θ~t)g¯(θ~t)]13ατ𝔼[d~t2]+11ατmaxtτtG+12α2σ2.\mathbb{E}\left[\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle\right]\leq 13\alpha\tau\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+11\alpha\tau\max_{t-\tau\leq\ell\leq t}G_{\ell}+12\alpha^{2}\sigma^{2}. (112)
Proof.

We start by decomposing the bias term θ~tθ,zt(θ~t)g¯(θ~t)\langle\tilde{\theta}_{t}-\theta^{*},z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle as follows:

A=θ~tθ~tτ,zt(θ~t)g¯(θ~t)A1+θ~tτθ,zt(θ~t)g¯(θ~t)A2.A=\underbrace{\langle\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau},z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle}_{A_{1}}+\underbrace{\langle\tilde{\theta}_{t-\tau}-\theta^{*},z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\rangle}_{A_{2}}.

To bound A1{A_{1}}, we note that

A1\displaystyle A_{1} θ~tθ~tτzt(θ~t)g¯(θ~t)\displaystyle\leq\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|\|z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\| (113)
12ατθ~tθ~tτ2+ατ2zt(θ~t)g¯(θ~t)2\displaystyle\leq\frac{1}{2\alpha\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}+\frac{\alpha\tau}{2}\|z_{t}(\tilde{\theta}_{t})-\bar{g}(\tilde{\theta}_{t})\|^{2}
12ατθ~tθ~tτ2+ατ(zt(θ~t)2+g¯(θ~t)2)\displaystyle\leq\frac{1}{2\alpha\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}+{\alpha\tau}\left(\|z_{t}(\tilde{\theta}_{t})\|^{2}+\|\bar{g}(\tilde{\theta}_{t})\|^{2}\right)
12ατθ~tθ~tτ2+ατ(zt(θ~t)2+4d~t2),\displaystyle\leq\frac{1}{2\alpha\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}+{\alpha\tau}\left(\|z_{t}(\tilde{\theta}_{t})\|^{2}+4\tilde{d}_{t}^{2}\right),

where the last step follows from Lemmas 3 and 5. Taking expectations on both sides of the above inequality, and using Lemmas 21 and 23 yields:

𝔼[A1]\displaystyle\mathbb{E}\left[A_{1}\right] 4ατ𝔼[d~t2]+ατ2maxtτt1G+ατGt\displaystyle\leq 4\alpha\tau\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+\frac{\alpha\tau}{2}\max_{t-\tau\leq\ell\leq t-1}G_{\ell}+\alpha\tau G_{t} (114)
4ατ𝔼[d~t2]+2ατmaxtτtG.\displaystyle\leq 4\alpha\tau\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+2\alpha\tau\max_{t-\tau\leq\ell\leq t}G_{\ell}.

Next, to bound A2A_{2}, we decompose it as follows:

A2=θ~tτθ,zt(θ~tτ)g¯(θ~tτ)()+θ~tτθ,zt(θ~t)zt(θ~tτ)()+θ~tτθ,g¯(θ~tτ)g¯(θ~t)().A_{2}=\underbrace{\langle\tilde{\theta}_{t-\tau}-\theta^{*},z_{t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})\rangle}_{(*)}+\underbrace{\langle\tilde{\theta}_{t-\tau}-\theta^{*},z_{t}(\tilde{\theta}_{t})-{z}_{t}(\tilde{\theta}_{t-\tau})\rangle}_{(**)}+\underbrace{\langle\tilde{\theta}_{t-\tau}-\theta^{*},\bar{g}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t})\rangle}_{(***)}.

We now proceed to bound each of the three terms above. Let us start by observing that

()\displaystyle(**) d~tτzt(θ~t)zt(θ~tτ)\displaystyle\leq\tilde{d}_{t-\tau}\|z_{t}(\tilde{\theta}_{t})-{z}_{t}(\tilde{\theta}_{t-\tau})\| (115)
(a)2d~tτθ~tθ~tτ\displaystyle\overset{(a)}{\leq}2\tilde{d}_{t-\tau}\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|
2(d~t+θ~tθ~tτ)θ~tθ~tτ\displaystyle\leq 2(\tilde{d}_{t}+\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|)\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|
(b)2(ατd~t+θ~tθ~tτατ)2\displaystyle\overset{(b)}{\leq}2\left(\sqrt{\alpha\tau}\tilde{d}_{t}+\frac{\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|}{\sqrt{\alpha\tau}}\right)^{2}
4(ατd~t2+θ~tθ~tτ2ατ),\displaystyle\leq 4\left({\alpha\tau}\tilde{d}^{2}_{t}+\frac{\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}}{{\alpha\tau}}\right),

where for (a), we used the fact that zt(θ)z_{t}(\theta) is 2-Lipschitz, and for (b), we used the fact that ατ1.\alpha\tau\leq 1. Taking expectations on each side of the above inequality and invoking Lemma 23, we obtain

𝔼[()]4ατ𝔼[d~t2]+4ατmaxtτtG.\mathbb{E}\left[(**)\right]\leq 4\alpha\tau\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+4\alpha\tau\max_{t-\tau\leq\ell\leq t}G_{\ell}. (116)

As before, the exact same bound as in the above display applies to 𝔼[()]\mathbb{E}\left[(***)\right]. We now turn to the main step in this proof.

𝔼[()]\displaystyle\mathbb{E}\left[(*)\right] =𝔼[θ~tτθ,zt(θ~tτ)g¯(θ~tτ)]\displaystyle=\mathbb{E}\left[\langle\tilde{\theta}_{t-\tau}-\theta^{*},z_{t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})\rangle\right] (117)
=𝔼[θ~tτθ,1Mi=1M(gi,t(θ~tτ)g¯(θ~tτ))]\displaystyle=\mathbb{E}\left[\langle\tilde{\theta}_{t-\tau}-\theta^{*},\frac{1}{M}\sum_{i=1}^{M}\left(g_{i,t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})\right)\rangle\right]
=𝔼[𝔼[θ~tτθ,1Mi=1M(gi,t(θ~tτ)g¯(θ~tτ))|θ~tτ,{Xj,tτ}j[M]]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\langle\tilde{\theta}_{t-\tau}-\theta^{*},\frac{1}{M}\sum_{i=1}^{M}\left(g_{i,t}(\tilde{\theta}_{t-\tau})-\bar{g}(\tilde{\theta}_{t-\tau})\right)\rangle|\tilde{\theta}_{t-\tau},\{X_{j,t-\tau}\}_{j\in[M]}\right]\right]
=𝔼[θ~tτθ,1Mi=1M(𝔼[gi,t(θ~tτ)|θ~tτ,{Xj,tτ}j[M]]g¯(θ~tτ))]\displaystyle=\mathbb{E}\left[\langle\tilde{\theta}_{t-\tau}-\theta^{*},\frac{1}{M}\sum_{i=1}^{M}\left(\mathbb{E}\left[g_{i,t}(\tilde{\theta}_{t-\tau})|\tilde{\theta}_{t-\tau},\{X_{j,t-\tau}\}_{j\in[M]}\right]-\bar{g}(\tilde{\theta}_{t-\tau})\right)\rangle\right]
=(a)𝔼[θ~tτθ,1Mi=1M(𝔼[gi,t(θ~tτ)|θ~tτ,Xi,tτ]g¯(θ~tτ))]\displaystyle\overset{(a)}{=}\mathbb{E}\left[\langle\tilde{\theta}_{t-\tau}-\theta^{*},\frac{1}{M}\sum_{i=1}^{M}\left(\mathbb{E}\left[g_{i,t}(\tilde{\theta}_{t-\tau})|\tilde{\theta}_{t-\tau},X_{i,t-\tau}\right]-\bar{g}(\tilde{\theta}_{t-\tau})\right)\rangle\right]
(b)𝔼[d~tτ1Mi=1M𝔼[gi,t(θ~tτ)|θ~tτ,Xi,tτ]g¯(θ~tτ)]\displaystyle\overset{(b)}{\leq}\mathbb{E}\left[\tilde{d}_{t-\tau}\frac{1}{M}\sum_{i=1}^{M}\left\|\mathbb{E}\left[g_{i,t}(\tilde{\theta}_{t-\tau})|\tilde{\theta}_{t-\tau},X_{i,t-\tau}\right]-\bar{g}(\tilde{\theta}_{t-\tau})\right\|\right]
(c)α2𝔼[d~tτ(θ~tτ+1)]\displaystyle\overset{(c)}{\leq}\alpha^{2}\mathbb{E}\left[\tilde{d}_{t-\tau}\left(\|\tilde{\theta}_{t-\tau}\|+1\right)\right]
(d)3α2𝔼[d~t2]+3α2𝔼[θ~tθ~tτ2]+12α2σ2\displaystyle\overset{(d)}{\leq}3\alpha^{2}\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+3\alpha^{2}\mathbb{E}\left[\|\tilde{\theta}_{t}-\tilde{\theta}_{t-\tau}\|^{2}\right]+12\alpha^{2}\sigma^{2}
(e)3α2𝔼[d~t2]+3α4τ2maxtτt1G+12α2σ2\displaystyle\overset{(e)}{\leq}3\alpha^{2}\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+3\alpha^{4}\tau^{2}\max_{t-\tau\leq\ell\leq t-1}G_{\ell}+12\alpha^{2}\sigma^{2}
(f)ατ𝔼[d~t2]+ατmaxtτtG+12α2σ2.\displaystyle\overset{(f)}{\leq}\alpha\tau\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+\alpha\tau\max_{t-\tau\leq\ell\leq t}G_{\ell}+12\alpha^{2}\sigma^{2}.

In the above steps, (a) follows from the fact that the Markov data tuples are independent across agents; (b) follows from the Cauchy–Schwarz and triangle inequalities; (c) is a consequence of the mixing property in Definition 1; for (d), we used steps similar to those leading to equation 88; for (e), we used the bound on the drift from Lemma 23; and finally, for (f), we simplified terms using ατ1/3\alpha\tau\leq 1/3 and τ1\tau\geq 1. Combining the above bounds, we obtain

𝔼[A2]9ατ𝔼[d~t2]+9ατmaxtτtG+12α2σ2.\mathbb{E}\left[A_{2}\right]\leq 9\alpha\tau\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+9\alpha\tau\max_{t-\tau\leq\ell\leq t}G_{\ell}+12\alpha^{2}\sigma^{2}.

Since 𝔼[A]=𝔼[A1]+𝔼[A2]\mathbb{E}\left[A\right]=\mathbb{E}\left[A_{1}\right]+\mathbb{E}\left[A_{2}\right], the above display, in tandem with equation 114, leads to the claim of the lemma. ∎

We are now ready to prove Theorem 3.

Proof.

(Proof of Theorem 3) As in the proof of Theorem 1, our first step is to establish a recursion for the Lyapunov function Ξt\Xi_{t} in equation 98. To that end, appealing to Lemmas 22 and 24, using the definition of GG_{\ell}, and some algebra leads to the following bound for all t2τ:t\geq 2\tau:

𝔼[d~t+12]\displaystyle\mathbb{E}\left[\tilde{d}^{2}_{t+1}\right] (1αω(1γ)+42α2τ)𝔼[d~t2]+5α3ω(1γ)Et1\displaystyle\leq\left(1-\alpha\omega(1-\gamma)+42\alpha^{2}\tau\right)\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+\frac{5\alpha^{3}}{\omega(1-\gamma)}E_{t-1} (118)
+352α2τmaxtτt𝔼[d~2]+352α4τmaxtτtE1+24α2(τ(32M+8α4)+α)σ2.\displaystyle\hskip 5.69054pt+352\alpha^{2}\tau\max_{t-\tau\leq\ell\leq t}\mathbb{E}\left[\tilde{d}^{2}_{\ell}\right]+352\alpha^{4}\tau\max_{t-\tau\leq\ell\leq t}E_{\ell-1}+24\alpha^{2}\left(\tau\left(\frac{32}{M}+8\alpha^{4}\right)+\alpha\right)\sigma^{2}.

Now similar to the proof of Lemma 17, we have the following for each i[M]i\in[M]:

𝔼[ei,t2]\displaystyle\mathbb{E}\left[\|e_{i,t}\|^{2}\right] (112δ)𝔼[ei,t12]+2δ𝔼[gi,t(θt)2]\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\mathbb{E}\left[\|e_{i,t-1}\|^{2}\right]+2\delta\mathbb{E}\left[\|{g}_{i,t}(\theta_{t})\|^{2}\right] (119)
(112δ)𝔼[ei,t12]+16δ𝔼[θtθ~t2]+4δ𝔼[gi,t(θ~t)2]\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\mathbb{E}\left[\|e_{i,t-1}\|^{2}\right]+16\delta\mathbb{E}\left[\|\theta_{t}-\tilde{\theta}_{t}\|^{2}\right]+4\delta\mathbb{E}\left[\|{g}_{i,t}(\tilde{\theta}_{t})\|^{2}\right]
(112δ)𝔼[ei,t12]+16α2δEt1+64δ𝔼[d~t2]+96δσ2.\displaystyle\leq\left(1-\frac{1}{2\delta}\right)\mathbb{E}\left[\|e_{i,t-1}\|^{2}\right]+16\alpha^{2}\delta E_{t-1}+64\delta\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+96\delta\sigma^{2}.

Averaging the above bound across all agents then yields:

Et\displaystyle E_{t} (112δ+16α2δ)Et1+64δ𝔼[d~t2]+96δσ2.\displaystyle\leq\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta\right)E_{t-1}+64\delta\mathbb{E}\left[\tilde{d}_{t}^{2}\right]+96\delta\sigma^{2}. (120)

Combining the above display with equation 118, and using the definition of Ξt\Xi_{t}, we obtain t2τ:\forall t\geq 2\tau:

Ξt+1\displaystyle\Xi_{t+1} (1αω(1γ)+42α2τ+64Cδα3)𝔼[d~t2]+Cα3(112δ+16α2δ+5Cω(1γ))Et1\displaystyle\leq\left(1-\alpha\omega(1-\gamma)+42\alpha^{2}\tau+64C\delta\alpha^{3}\right)\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+C\alpha^{3}\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta+\frac{5}{C\omega(1-\gamma)}\right)E_{t-1} (121)
+352α2τmaxtτt𝔼[d~2]+352α4τmaxtτtE1+R\displaystyle\hskip 5.69054pt+352\alpha^{2}\tau\max_{t-\tau\leq\ell\leq t}\mathbb{E}\left[\tilde{d}^{2}_{\ell}\right]+352\alpha^{4}\tau\max_{t-\tau\leq\ell\leq t}E_{\ell-1}+R
(1αω(1γ)+42α2τ+64Cδα3)𝔼[d~t2]+Cα3(112δ+16α2δ+5Cω(1γ))Et1\displaystyle\leq\left(1-\alpha\omega(1-\gamma)+42\alpha^{2}\tau+64C\delta\alpha^{3}\right)\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+C\alpha^{3}\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta+\frac{5}{C\omega(1-\gamma)}\right)E_{t-1}
+352α2τmaxtτtΞ+352ατCmaxtτtCα3E1+R\displaystyle\hskip 5.69054pt+352\alpha^{2}\tau\max_{t-\tau\leq\ell\leq t}\Xi_{\ell}+\frac{352\alpha\tau}{C}\max_{t-\tau\leq\ell\leq t}C\alpha^{3}E_{\ell-1}+R
(1αω(1γ)+42α2τ+64Cδα3)B1𝔼[d~t2]+Cα3(112δ+16α2δ+5Cω(1γ))B2Et1\displaystyle\leq\underbrace{\left(1-\alpha\omega(1-\gamma)+42\alpha^{2}\tau+64C\delta\alpha^{3}\right)}_{B_{1}}\mathbb{E}\left[\tilde{d}^{2}_{t}\right]+C\alpha^{3}\underbrace{\left(1-\frac{1}{2\delta}+16\alpha^{2}\delta+\frac{5}{C\omega(1-\gamma)}\right)}_{B_{2}}E_{t-1}
+352ατ(α+1C)maxtτtΞ+R,\displaystyle\hskip 5.69054pt+352\alpha\tau\left(\alpha+\frac{1}{C}\right)\max_{t-\tau\leq\ell\leq t}\Xi_{\ell}+R,

where

R=24α2(τ(32M+8α4)+α)σ2+96Cα3δσ2.R=24\alpha^{2}\left(\tau\left(\frac{32}{M}+8\alpha^{4}\right)+\alpha\right)\sigma^{2}+96C\alpha^{3}\delta\sigma^{2}.

Our goal now is to carefully pick α\alpha and CC so as to ensure that max{B1,B2}<1.\max\{B_{1},B_{2}\}<1. Let us start with B2B_{2}. Suppose

α<112δandC=2816max{δ,τ}ω(1γ).\alpha<\frac{1}{12\delta}\hskip 5.69054pt\textrm{and}\hskip 5.69054ptC=\frac{2816\max\{\delta,\tau\}}{\omega(1-\gamma)}. (122)

It is then easy to verify that

B2<114δ.B_{2}<1-\frac{1}{4\delta}.

As for B1B_{1}, we note that if

αω(1γ)850max{δ,τ},\alpha\leq\frac{\omega(1-\gamma)}{850\max\{\delta,\tau\}}, (123)

then

B11αω(1γ)2.B_{1}\leq 1-\frac{\alpha\omega(1-\gamma)}{2}.

Also, under the requirement on the step-size α\alpha in equation 123, it is easy to check that

114δ<1αω(1γ)2.1-\frac{1}{4\delta}<1-\frac{\alpha\omega(1-\gamma)}{2}.

In light of the above discussion, we conclude that t2τ\forall t\geq 2\tau:

Ξt+1(1αω(1γ)2)Ξt+352ατ(α+1C)maxtτtΞ+R.\boxed{\Xi_{t+1}\leq\left(1-\frac{\alpha\omega(1-\gamma)}{2}\right)\Xi_{t}+352\alpha\tau\left(\alpha+\frac{1}{C}\right)\max_{t-\tau\leq\ell\leq t}\Xi_{\ell}+R.} (124)

We are almost in a position to invoke Lemma 20. All that remains to be verified is whether

1αω(1γ)2p+352α2τ+352ατCq<1.\underbrace{1-\frac{\alpha\omega(1-\gamma)}{2}}_{p}+\underbrace{352\alpha^{2}\tau+352\frac{\alpha\tau}{C}}_{q}<1.

With the choice of CC in equation 122, if we set

αω(1γ)2816τ,\alpha\leq\frac{\omega(1-\gamma)}{2816\tau},

then one can verify that

p+q<1αω(1γ)4.p+q<1-\frac{\alpha\omega(1-\gamma)}{4}.
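The contraction claims above can be sanity-checked numerically. The following brief check (ours) plugs in illustrative values of δ, τ, ω, and γ (these particular numbers are hypothetical), sets α to half of ω(1−γ)/(2816 max{δ,τ}) so that all of the step-size requirements imposed so far hold, and verifies the stated inequalities for B1, B2, and p+q:

# Numerical sanity check (ours) of B1, B2, and p + q for one illustrative setting;
# delta, tau, omega, gamma below are hypothetical values.
delta, tau, omega, gamma = 2.0, 5.0, 0.5, 0.9
C = 2816 * max(delta, tau) / (omega * (1 - gamma))           # choice of C in equation 122
alpha = omega * (1 - gamma) / (2 * 2816 * max(delta, tau))   # half the combined step-size bound

B1 = 1 - alpha * omega * (1 - gamma) + 42 * alpha**2 * tau + 64 * C * delta * alpha**3
B2 = 1 - 1 / (2 * delta) + 16 * alpha**2 * delta + 5 / (C * omega * (1 - gamma))
p = 1 - alpha * omega * (1 - gamma) / 2
q = 352 * alpha**2 * tau + 352 * alpha * tau / C

assert B1 <= 1 - alpha * omega * (1 - gamma) / 2             # contraction of B1
assert B2 < 1 - 1 / (4 * delta)                              # contraction of B2
assert 1 - 1 / (4 * delta) < 1 - alpha * omega * (1 - gamma) / 2
assert p + q < 1 - alpha * omega * (1 - gamma) / 4
print(B1, B2, p + q)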

Combining all our prior requirements on α\alpha, we conclude that if

αω(1γ)2816max{δ,τ},\boxed{\alpha\leq\frac{\omega(1-\gamma)}{2816\max\{\delta,\tau\}}}, (125)

then the following holds for all T2τT\geq 2\tau:

ΞTC1(1αω(1γ)8τ)T2τ+4Rαω(1γ),\Xi_{T}\leq C_{1}\left(1-\frac{\alpha\omega(1-\gamma)}{8\tau}\right)^{T-2\tau}+\frac{4R}{\alpha\omega(1-\gamma)}, (126)

where C1=max0k2τΞk.C_{1}=\max_{0\leq k\leq 2\tau}\Xi_{k}. Simple calculations reveal that

4Rαω(1γ)=O(ατω(1γ))σ2M+O(α2max{δ,τ}δω2(1γ)2)σ2.\frac{4R}{\alpha\omega(1-\gamma)}={O\left(\frac{\alpha\tau}{\omega(1-\gamma)}\right)\frac{\sigma^{2}}{{M}}}+{O\left(\frac{\alpha^{2}\max\{\delta,\tau\}\delta}{\omega^{2}(1-\gamma)^{2}}\right)\sigma^{2}}.
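For completeness, here is one way to carry out this calculation (our expansion; the constants are not optimized). Substituting the expression for RR given earlier,

4R/(αω(1−γ)) = (3072ατ/(ω(1−γ)))·σ²/M + (768α⁵τ + 96α²)σ²/(ω(1−γ)) + 384Cα²δσ²/(ω(1−γ)).

Substituting the choice of CC from equation 122 into the last term yields the O(α²max{δ,τ}δ/(ω²(1−γ)²))σ² contribution; the middle term is dominated by this contribution since α ≤ 1, τ, δ ≥ 1, and ω(1−γ) ≤ 1; and the first term gives the O(ατ/(ω(1−γ)))σ²/M contribution.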

Using the above bound, and noting that 𝔼[d~T2]ΞT,\mathbb{E}\left[\tilde{d}^{2}_{T}\right]\leq\Xi_{T}, we have that T2τ\forall T\geq 2\tau:

𝔼[d~T2]C1(1αω(1γ)8τ)T2τ+O(ατω(1γ))σ2M+O(α2max{δ,τ}δω2(1γ)2)σ2.\mathbb{E}\left[\tilde{d}^{2}_{T}\right]\leq C_{1}\left(1-\frac{\alpha\omega(1-\gamma)}{8\tau}\right)^{T-2\tau}+{O\left(\frac{\alpha\tau}{\omega(1-\gamma)}\right)\frac{\sigma^{2}}{{M}}}+{O\left(\frac{\alpha^{2}\max\{\delta,\tau\}\delta}{\omega^{2}(1-\gamma)^{2}}\right)\sigma^{2}}. (127)

The fact that C1=O(d02+σ2)C_{1}=O(d^{2}_{0}+\sigma^{2}) follows from straightforward algebra similar to that in the proof of Theorem 1. This completes the proof. ∎

Appendix F Analysis of Multi-Agent EF-TD under an I.I.D. Sampling Assumption

In this section, we provide a simpler analysis (relative to that in Appendix E) of the MARL setting under a common i.i.d. sampling assumption. Essentially, we consider a setting where, for each agent ii and each time-step tt, si,ts_{i,t} is sampled from the stationary distribution π\pi, independently of the past and independently across agents, and then si,t+1s_{i,t+1} is sampled from Pμ(|si,t).P_{\mu}(\cdot|s_{i,t}). As it turns out, this particular “i.i.d. model” has been widely studied in the RL literature [33, 26, 18, 10, 17]; thus, we believe that providing an analysis for this setting will be useful to the reader. Our main result for this setting is as follows.

Theorem 6.

There exist universal constants c,C1c,C\geq 1, a step-size α(1γ)/(cδ)\alpha\leq(1-\gamma)/(c\delta), and a set of convex weights {w¯t}\{\bar{w}_{t}\}, such that the iterates generated by Algorithm 2 satisfy the following after TT iterations:

𝔼[V^θ¯TV^θD2]=O(exp(ω(1γ)2TCδ))+O~(σ2ω(1γ)2MT)+T3,\mathbb{E}\left[\|\hat{V}_{\bar{\theta}_{T}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]=O\left(\exp{\left(\frac{-\omega(1-\gamma)^{2}T}{C\delta}\right)}\right)+\tilde{O}\left(\frac{\sigma^{2}}{\omega(1-\gamma)^{2}MT}\right)+T_{3}, (128)

where T3=O~(δ2σ2/(ω2(1γ)4T2))T_{3}=\tilde{O}\left(\delta^{2}\sigma^{2}/(\omega^{2}(1-\gamma)^{4}T^{2})\right), θ¯T=t=0Tw¯tθt\bar{\theta}_{T}=\sum_{t=0}^{T}\bar{w}_{t}\theta_{t}, and σ2𝔼[gt(θ)22]\sigma^{2}\triangleq\mathbb{E}\left[\|g_{t}(\theta^{*})\|^{2}_{2}\right]. We remind the reader here that ω\omega is the smallest eigenvalue of the matrix Σ=ΦDΦ.\Sigma=\Phi^{\top}D\Phi.

As in Theorem 3 where we analyzed the multi-agent setting under Markovian sampling, the above result also establishes a linear speedup in sample-complexity w.r.t. the number of agents MM. The above result is somewhat cleaner in the sense that it provides a bound on the performance of the true iterate sequence {θt}\{\theta_{t}\}, as opposed to the perturbed iterate sequence {θ~t}\{\tilde{\theta}_{t}\}. We believe that it should be possible to provide a bound on the true iterates of the form in Theorem 6 under Markovian sampling as well; we leave this as future work.
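To make the i.i.d. observation model concrete, we include a minimal simulation sketch below. This is our own illustrative reconstruction of the multi-agent error-feedback TD update under this sampling model, not the paper's implementation: the MDP, feature matrix, and step-size are synthetic, and top-k sparsification is used as one example of a compression operator. Each agent compresses its TD direction plus its local memory, transmits only the compressed vector, stores the untransmitted residual in its memory, and the server averages the compressed directions.

import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic tabular MDP and features (all values illustrative) ---
n_states, K, M = 20, 5, 10                    # states, feature dimension, agents
gamma, alpha, k_keep = 0.5, 0.02, 2           # discount, step-size, coordinates kept by top-k
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)             # transition kernel P_mu under the behavior policy
r = rng.random(n_states)                      # per-state rewards
Phi = rng.random((n_states, K))               # feature matrix

# Stationary distribution pi of P (power iteration)
pi = np.ones(n_states) / n_states
for _ in range(1000):
    pi = pi @ P
pi /= pi.sum()

def td_direction(theta, s, s_next):
    # TD(0) direction with linear function approximation:
    # g(theta) = (r(s) + gamma * phi(s')^T theta - phi(s)^T theta) * phi(s)
    delta = r[s] + gamma * Phi[s_next] @ theta - Phi[s] @ theta
    return delta * Phi[s]

def top_k(x, k):
    # Top-k sparsification: a delta-contractive compressor with delta = K / k
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

theta = np.zeros(K)
err = np.zeros((M, K))                        # per-agent error-feedback memories

for t in range(2000):
    avg_compressed = np.zeros(K)
    for i in range(M):
        s = rng.choice(n_states, p=pi)        # s_{i,t} ~ pi, independently across agents and time
        s_next = rng.choice(n_states, p=P[s]) # s_{i,t+1} ~ P_mu(. | s_{i,t})
        h = td_direction(theta, s, s_next) + err[i]
        c = top_k(h, k_keep)                  # only the compressed vector is communicated
        err[i] = h - c                        # the memory keeps the part that was not transmitted
        avg_compressed += c / M
    theta += alpha * avg_compressed           # server-side update with the averaged compressed directions

print("final parameter estimate:", theta)

In our reading of the analysis, err[i] plays the role of the memory variable of agent i, and at the start of iteration t the quantity theta + alpha * err.mean(axis=0) corresponds to the perturbed iterate θ̃t tracked in the proofs.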

To proceed with the analysis, we will require two auxiliary results. The first is taken from [18].

Lemma 25.

Fix any θK\theta\in\mathbb{R}^{K}. The following holds under the i.i.d. sampling model:

𝔼[gt(θ)2]2σ2+8V^θV^θD2.\mathbb{E}\left[\|g_{t}(\theta)\|^{2}\right]\leq 2\sigma^{2}+8\|\hat{V}_{\theta}-\hat{V}_{\theta^{*}}\|^{2}_{D}.

The next result is an “averaging lemma” that has been adapted from [60] and [61].

Lemma 26.

Let {pt}t0\{p_{t}\}_{t\geq 0} and {st}t0\{s_{t}\}_{t\geq 0} be sequences of positive numbers satisfying

pt+1(1αA)ptBαst+C¯α2+Dα3,p_{t+1}\leq(1-\alpha A)p_{t}-B\alpha s_{t}+\bar{C}\alpha^{2}+D\alpha^{3},

for some constants A>0A>0 and B,C¯,D0B,\bar{C},D\geq 0, and for constant step-sizes 0<α1E0<\alpha\leq\frac{1}{E}, where E>0E>0. Then, there exists a constant step-size α1E\alpha\leq\frac{1}{E} such that

BWTt=0Twtstp0(E+A)exp(AE(T+1))+2C¯lnτ¯A(T+1)+Dln2τ¯A2(T+1)2,\frac{B}{W_{T}}\sum_{t=0}^{T}w_{t}s_{t}\leq p_{0}(E+A)\exp{\left(-\frac{A}{E}(T+1)\right)}+\frac{2\bar{C}\ln{\bar{\tau}}}{A(T+1)}+\frac{D\ln^{2}{\bar{\tau}}}{A^{2}{(T+1)}^{2}}, (129)

for wt(1αA)(t+1)w_{t}\triangleq(1-\alpha A)^{-(t+1)}, WTt=0TwtW_{T}\triangleq\sum_{t=0}^{T}w_{t}, and

τ¯=max{exp(1),min{A2p0(T+1)2/C¯,A3p0(T+1)3/D}}.\bar{\tau}=\max\{\exp{(1)},\min\{A^{2}p_{0}(T+1)^{2}/\bar{C},A^{3}p_{0}(T+1)^{3}/D\}\}.
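The weights wtw_{t} in Lemma 26 place geometrically more mass on later iterates, and their normalized versions w̄t are the convex weights used to form θ̄T in Theorem 6. The short sketch below (ours; the values of α, A, T, and the stand-in iterates are hypothetical) shows how these weights, the averaged iterate, and the Jensen step used at the end of the proof of Theorem 6 can be computed and checked numerically:

import numpy as np

# Illustrative computation (ours) of the Lemma 26 weights and the averaged iterate.
alpha, A, T, K = 0.01, 0.5, 200, 4
rng = np.random.default_rng(1)
thetas = rng.normal(size=(T + 1, K))              # stand-ins for the iterates theta_0, ..., theta_T

w = (1 - alpha * A) ** (-(np.arange(T + 1) + 1))  # w_t = (1 - alpha * A)^{-(t+1)}
w_bar = w / w.sum()                               # convex weights: nonnegative, summing to one
theta_bar = w_bar @ thetas                        # theta_bar_T = sum_t w_bar_t * theta_t

# Jensen step: || sum_t w_bar_t x_t ||^2 <= sum_t w_bar_t ||x_t||^2 for convex weights.
lhs = np.sum(theta_bar ** 2)
rhs = np.sum(w_bar * np.sum(thetas ** 2, axis=1))
assert lhs <= rhs
print(theta_bar, lhs, rhs)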

We start with the following result.

Lemma 27.

Suppose the step-size α\alpha is chosen such that α(1γ)/112\alpha\leq(1-\gamma)/112. Then, the iterates generated by Algorithm 2 satisfy the following under the i.i.d. observation model:

𝔼[θ~t+1θ2](1αω(1γ)8)𝔼[θ~tθ2]α(1γ)4𝔼[V^θtV^θD2]+5α3(1γ)Et1+8α2σ2M.\mathbb{E}\left[\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}\right]\leq\left(1-\frac{\alpha\omega(1-\gamma)}{8}\right)\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]-\frac{\alpha(1-\gamma)}{4}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\frac{5\alpha^{3}}{(1-\gamma)}E_{t-1}+\frac{8\alpha^{2}\sigma^{2}}{M}. (130)
Proof.

Starting from equation 106, we have:

𝔼[θ~t+1θ2]=𝔼[θ~tθ2]+2αM𝔼[θ~tθ,i[M]gi,t(θt)]T1+α2𝔼[1Mi[M]gi,t(θt)2]T2.\mathbb{E}\left[\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}\right]=\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]+\underbrace{\frac{2\alpha}{M}\mathbb{E}\left[\langle\tilde{\theta}_{t}-\theta^{*},\sum_{i\in[M]}g_{i,t}(\theta_{t})\rangle\right]}_{T_{1}}+\underbrace{\alpha^{2}\mathbb{E}\left[\left\|\frac{1}{M}\sum_{i\in[M]}g_{i,t}(\theta_{t})\right\|^{2}\right]}_{T_{2}}. (131)

To bound T1T_{1} and T2T_{2}, let t1\mathcal{F}_{t-1} denote the sigma-algebra generated by all the agents’ observations up to time-step t1t-1, i.e., the sigma-algebra generated by {Xi,k}i[M],k=0,1,,t1\{X_{i,k}\}_{i\in[M],k=0,1,\ldots,t-1}. From the dynamics of Algorithm 2, observe that θt,θ~t\theta_{t},\tilde{\theta}_{t}, and {ei,t1}i[M]\{e_{i,t-1}\}_{i\in[M]} are all t1\mathcal{F}_{t-1}-measurable. Using this fact, we bound T1T_{1} as follows:

T1\displaystyle T_{1} =2αM𝔼[𝔼[θ~tθ,i[M]gi,t(θt)|t1]]\displaystyle=\frac{2\alpha}{M}\mathbb{E}\left[\mathbb{E}\left[\langle\tilde{\theta}_{t}-\theta^{*},\sum_{i\in[M]}g_{i,t}(\theta_{t})\rangle|\mathcal{F}_{t-1}\right]\right] (132)
=(a)2α𝔼[θ~tθ,g¯(θt)]\displaystyle\overset{(a)}{=}2\alpha\mathbb{E}\left[\langle\tilde{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle\right]
=2α𝔼[θtθ,g¯(θt)]+2α𝔼[θ~tθt,g¯(θt)]\displaystyle=2\alpha\mathbb{E}\left[\langle{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle\right]+2\alpha\mathbb{E}\left[\langle\tilde{\theta}_{t}-\theta_{t},\bar{g}(\theta_{t})\rangle\right]
2α𝔼[θtθ,g¯(θt)]+α(1γ)4𝔼[g¯(θt)2]+4α(1γ)𝔼[θtθ~t2]\displaystyle\leq 2\alpha\mathbb{E}\left[\langle{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle\right]+\frac{\alpha(1-\gamma)}{4}\mathbb{E}\left[\|\bar{g}(\theta_{t})\|^{2}\right]+\frac{4\alpha}{(1-\gamma)}\mathbb{E}\left[\|\theta_{t}-\tilde{\theta}_{t}\|^{2}\right]
2α𝔼[θtθ,g¯(θt)]+α(1γ)4𝔼[g¯(θt)2]+4α3(1γ)𝔼[e¯t12]\displaystyle\leq 2\alpha\mathbb{E}\left[\langle{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle\right]+\frac{\alpha(1-\gamma)}{4}\mathbb{E}\left[\|\bar{g}(\theta_{t})\|^{2}\right]+\frac{4\alpha^{3}}{(1-\gamma)}\mathbb{E}\left[\|\bar{e}_{t-1}\|^{2}\right]
(b)2α𝔼[θtθ,g¯(θt)]+α(1γ)4𝔼[g¯(θt)2]+4α3(1γ)Et1\displaystyle\overset{(b)}{\leq}2\alpha\mathbb{E}\left[\langle{\theta}_{t}-\theta^{*},\bar{g}(\theta_{t})\rangle\right]+\frac{\alpha(1-\gamma)}{4}\mathbb{E}\left[\|\bar{g}(\theta_{t})\|^{2}\right]+\frac{4\alpha^{3}}{(1-\gamma)}E_{t-1}
(c)α(1γ)𝔼[V^θtV^θD2]+4α3(1γ)Et1.\displaystyle\overset{(c)}{\leq}-\alpha(1-\gamma)\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\frac{4\alpha^{3}}{(1-\gamma)}E_{t-1}.

In the above steps, (a) follows from the fact that the agents’ observations are assumed to have been drawn i.i.d. (over time and across agents) from the stationary distribution π\pi; for (b), we applied equation 19; and (c) follows from Lemmas 4 and 5. To bound T2T_{2}, we first split it into two parts as follows:

T2\displaystyle T_{2} =α2M2𝔼[i[M](gi,t(θt)g¯(θt))+Mg¯(θt)2]\displaystyle=\frac{\alpha^{2}}{M^{2}}\mathbb{E}\left[\left\|\sum_{i\in[M]}\left(g_{i,t}(\theta_{t})-\bar{g}(\theta_{t})\right)+M\bar{g}(\theta_{t})\right\|^{2}\right] (133)
2α2M2𝔼[i[M](gi,t(θt)g¯(θt))2]+2α2𝔼[g¯(θt)2].\displaystyle\leq\frac{2\alpha^{2}}{M^{2}}\mathbb{E}\left[\left\|\sum_{i\in[M]}\left(g_{i,t}(\theta_{t})-\bar{g}(\theta_{t})\right)\right\|^{2}\right]+2\alpha^{2}\mathbb{E}\left[\left\|\bar{g}(\theta_{t})\right\|^{2}\right].

To simplify the first term in the above inequality further, let us define Yi,tgi,t(θt)g¯(θt),i[M]Y_{i,t}\triangleq g_{i,t}(\theta_{t})-\bar{g}(\theta_{t}),\forall i\in[M]. Conditioned on t1{\mathcal{F}}_{t-1}, observe that (i) Yi,tY_{i,t} has zero mean for all i[M]i\in[M]; and (ii) Yi,tY_{i,t} and Yj,tY_{j,t} are independent for iji\neq j (this follows from the fact that Xi,tX_{i,t} and Xj,tX_{j,t} are independent by assumption). As a consequence of the two facts above, we immediately have

𝔼[Yi,t,Yj,t|t1]=0,i,j[M]s.t.ij.\mathbb{E}\left[\langle Y_{i,t},Y_{j,t}\rangle|\mathcal{F}_{t-1}\right]=0,\forall i,j\in[M]\,s.t.\,i\neq j.

We thus conclude that

𝔼[i[M](gi,t(θt)g¯(θt))2]\displaystyle\mathbb{E}\left[\left\|\sum_{i\in[M]}\left(g_{i,t}(\theta_{t})-\bar{g}(\theta_{t})\right)\right\|^{2}\right] =𝔼[𝔼[i[M]Yi,t2|t1]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\|\sum_{i\in[M]}Y_{i,t}\|^{2}|\mathcal{F}_{t-1}\right]\right] (134)
=𝔼[i[M]𝔼[Yi,t2|t1]]\displaystyle=\mathbb{E}\left[\sum_{i\in[M]}\mathbb{E}\left[\|Y_{i,t}\|^{2}|\mathcal{F}_{t-1}\right]\right]
=(a)M(𝔼[𝔼[Yi,t2|t1]])\displaystyle\overset{(a)}{=}M\left(\mathbb{E}\left[\mathbb{E}\left[\|Y_{i,t}\|^{2}|\mathcal{F}_{t-1}\right]\right]\right)
=M(𝔼[Yi,t2]),\displaystyle=M\left(\mathbb{E}\left[\|Y_{i,t}\|^{2}\right]\right),

where (a) follows from the fact that conditioned on t1\mathcal{F}_{t-1}, Yi,tY_{i,t} and Yj,tY_{j,t} are identically distributed for all i,j[M]i,j\in[M] with iji\neq j. Plugging the result in equation 134 back in equation 133, we obtain

T2\displaystyle T_{2} 2α2M𝔼[(gi,t(θt)g¯(θt))2]+2α2𝔼[g¯(θt)2]\displaystyle\leq\frac{2\alpha^{2}}{M}\mathbb{E}\left[\left\|\left(g_{i,t}(\theta_{t})-\bar{g}(\theta_{t})\right)\right\|^{2}\right]+2\alpha^{2}\mathbb{E}\left[\left\|\bar{g}(\theta_{t})\right\|^{2}\right] (135)
4α2M𝔼[gi,t(θt)2]+2α2(1+2M)𝔼[g¯(θt)2]\displaystyle\leq\frac{4\alpha^{2}}{M}\mathbb{E}\left[\left\|g_{i,t}(\theta_{t})\right\|^{2}\right]+2\alpha^{2}\left(1+\frac{2}{M}\right)\mathbb{E}\left[\left\|\bar{g}(\theta_{t})\right\|^{2}\right]
56α2𝔼[V^θtV^θD2]+8α2σ2M,\displaystyle\leq 56\alpha^{2}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\frac{8\alpha^{2}\sigma^{2}}{M},

where in the last step, we used Lemmas 5 and 25. Now that we have bounds on each of the terms T1T_{1} and T2T_{2}, we plug them back in equation 131 to obtain

𝔼[θ~t+1θ2]\displaystyle\mathbb{E}\left[\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}\right] 𝔼[θ~tθ2]α(1γ)(156α(1γ))𝔼[V^θtV^θD2]+4α3(1γ)Et1\displaystyle\leq\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]-\alpha(1-\gamma)\left(1-\frac{56\alpha}{(1-\gamma)}\right)\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\frac{4\alpha^{3}}{(1-\gamma)}E_{t-1} (136)
+8α2σ2M\displaystyle\hskip 5.69054pt+\frac{8\alpha^{2}\sigma^{2}}{M}
𝔼[θ~tθ2]α(1γ)2𝔼[V^θtV^θD2]+4α3(1γ)Et1+8α2σ2M,\displaystyle\leq\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]-\frac{\alpha(1-\gamma)}{2}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\frac{4\alpha^{3}}{(1-\gamma)}E_{t-1}+\frac{8\alpha^{2}\sigma^{2}}{M},

where in the last step, we used the fact that α(1γ)/112\alpha\leq(1-\gamma)/112. By splitting the second term in the above inequality into two equal parts, using Lemma 3, and the fact that

𝔼[V^θtV^θD2]12𝔼[V^θ~tV^θD2]+α2Et1,-\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]\leq-\frac{1}{2}\mathbb{E}\left[\|\hat{V}_{\tilde{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\alpha^{2}E_{t-1},

we further obtain that

𝔼[θ~t+1θ2](1αω(1γ)8)𝔼[θ~tθ2]α(1γ)4𝔼[V^θtV^θD2]+5α3(1γ)Et1+8α2σ2M,\mathbb{E}\left[\|\tilde{\theta}_{t+1}-\theta^{*}\|^{2}\right]\leq\left(1-\frac{\alpha\omega(1-\gamma)}{8}\right)\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]-\frac{\alpha(1-\gamma)}{4}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\frac{5\alpha^{3}}{(1-\gamma)}E_{t-1}+\frac{8\alpha^{2}\sigma^{2}}{M},

which is the desired conclusion. ∎
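The bound on T2T_{2} above is where the linear speedup in MM originates: the cross terms between agents vanish in expectation, so averaging MM independent, zero-mean noise vectors shrinks the mean squared norm by a factor of MM. The following Monte Carlo sketch (ours, with arbitrary dimensions and Gaussian noise as a stand-in) illustrates this effect:

import numpy as np

# Monte Carlo check (ours): for independent zero-mean Y_1, ..., Y_M,
# E || (1/M) * sum_i Y_i ||^2 = E || Y_1 ||^2 / M.
rng = np.random.default_rng(0)
dim, M, trials = 8, 16, 50000
Y = rng.normal(size=(trials, M, dim))                    # i.i.d. zero-mean noise per agent
per_agent = np.mean(np.sum(Y[:, 0, :] ** 2, axis=1))     # estimate of E ||Y_1||^2
averaged = np.mean(np.sum(Y.mean(axis=1) ** 2, axis=1))  # estimate of E ||(1/M) sum_i Y_i||^2
print(per_agent / M, averaged)                           # the two estimates should nearly match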

We now complete the proof of Theorem 6 as follows.

Proof.

(Proof of Theorem 6) Our goal is to establish a recursion of the form in Lemma 26. To that end, we first need to control the aggregate effect of the memory variables of all agents, as captured by the term EtE_{t}. This is easily done by first using the same analysis as in Lemma 17, and then appealing to Lemma 25, to conclude

𝔼[ei,t2](112δ)𝔼[ei,t12]+16δ𝔼[V^θtV^θD2]+4δσ2,i[M].\mathbb{E}\left[\|e_{i,t}\|^{2}\right]\leq\left(1-\frac{1}{2\delta}\right)\mathbb{E}\left[\|e_{i,t-1}\|^{2}\right]+16\delta\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{{\theta}^{*}}\|^{2}_{D}\right]+4\delta\sigma^{2},\forall i\in[M].

Averaging the above inequality over all agents, and using the definition of EtE_{t} yields:

Et(112δ)Et1+16δ𝔼[V^θtV^θD2]+4δσ2.E_{t}\leq\left(1-\frac{1}{2\delta}\right)E_{t-1}+16\delta\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{{\theta}^{*}}\|^{2}_{D}\right]+4\delta\sigma^{2}.

Using the above bound along with Lemma 27, we obtain:

Ξt+1\displaystyle\Xi_{t+1} (1αω(1γ)8)𝔼[θ~tθ2]α(1γ)4𝔼[V^θtV^θD2]+5α3(1γ)Et1+8α2σ2M+Cα3Et\displaystyle\leq\left(1-\frac{\alpha\omega(1-\gamma)}{8}\right)\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]-\frac{\alpha(1-\gamma)}{4}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\frac{5\alpha^{3}}{(1-\gamma)}E_{t-1}+\frac{8\alpha^{2}\sigma^{2}}{M}+C\alpha^{3}E_{t} (137)
(1αω(1γ)8)𝔼[θ~tθ2](α(1γ)416Cα3δ)𝔼[V^θtV^θD2]\displaystyle\leq\left(1-\frac{\alpha\omega(1-\gamma)}{8}\right)\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]-\left(\frac{\alpha(1-\gamma)}{4}-16C\alpha^{3}\delta\right)\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]
+Cα3(112δ+5C(1γ))Et1+8α2σ2M+4Cα3δσ2.\displaystyle\hskip 5.69054pt+C\alpha^{3}\left(1-\frac{1}{2\delta}+\frac{5}{C(1-\gamma)}\right)E_{t-1}+\frac{8\alpha^{2}\sigma^{2}}{M}+4C\alpha^{3}\delta\sigma^{2}.

Based on the above inequality, our goal now is to choose α\alpha and CC so that we can establish a contraction (up to higher-order noise terms). Accordingly, let us pick these parameters as follows:

C=20δ(1γ);α(1γ)55δ.C=\frac{20\delta}{(1-\gamma)};\hskip 5.69054pt\alpha\leq\frac{(1-\gamma)}{55\delta}.

With some simple algebra, it is then easy to verify that:

Ξt+1\displaystyle\Xi_{t+1} (1αω(1γ)8)𝔼[θ~tθ2]α(1γ)8𝔼[V^θtV^θD2]+Cα3(114δ)Et1\displaystyle\leq\left(1-\frac{\alpha\omega(1-\gamma)}{8}\right)\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]-\frac{\alpha(1-\gamma)}{8}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+C\alpha^{3}\left(1-\frac{1}{4\delta}\right)E_{t-1} (138)
+8α2σ2M+80α3δ2σ2(1γ)\displaystyle\hskip 5.69054pt+\frac{8\alpha^{2}\sigma^{2}}{M}+\frac{80\alpha^{3}\delta^{2}\sigma^{2}}{(1-\gamma)}
(1αω(1γ)8)(𝔼[θ~tθ2]+Cα3Et1)α(1γ)8𝔼[V^θtV^θD2]\displaystyle\leq\left(1-\frac{\alpha\omega(1-\gamma)}{8}\right){\left(\mathbb{E}\left[\|\tilde{\theta}_{t}-\theta^{*}\|^{2}\right]+C\alpha^{3}E_{t-1}\right)}-\frac{\alpha(1-\gamma)}{8}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]
+8α2σ2M+80α3δ2σ2(1γ)\displaystyle\hskip 5.69054pt+\frac{8\alpha^{2}\sigma^{2}}{M}+\frac{80\alpha^{3}\delta^{2}\sigma^{2}}{(1-\gamma)}
=(1αω(1γ)8)Ξtα(1γ)8𝔼[V^θtV^θD2]+8α2σ2M+80α3δ2σ2(1γ).\displaystyle=\left(1-\frac{\alpha\omega(1-\gamma)}{8}\right)\Xi_{t}-\frac{\alpha(1-\gamma)}{8}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]+\frac{8\alpha^{2}\sigma^{2}}{M}+\frac{80\alpha^{3}\delta^{2}\sigma^{2}}{(1-\gamma)}.

We have thus succeeded in establishing a recursion of the form in Lemma 26. To spell things out explicitly in the language of Lemma 26, we have

pt=Ξt;st=𝔼[V^θtV^θD2];A=ω(1γ)8;B=(1γ)8;C¯=8σ2M;D=80δ2σ2(1γ),p_{t}=\Xi_{t};s_{t}=\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right];A=\frac{\omega(1-\gamma)}{8};B=\frac{(1-\gamma)}{8};\bar{C}=\frac{8\sigma^{2}}{M};D=\frac{80\delta^{2}\sigma^{2}}{(1-\gamma)},

and α(1γ)/(112δ)\alpha\leq(1-\gamma)/(112\delta) suffices for the recursion in equation 138 to hold. Thus, for us, E=112δ(1γ)E=\frac{112\delta}{(1-\gamma)}. Applying Lemma 26 along with some simplifications then yields:

t=0Tw¯t𝔼[V^θtV^θD2]\displaystyle\sum_{t=0}^{T}\bar{w}_{t}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right] O(θ0θ2δ(1γ)2)exp(ω(1γ)2TCδ)+O~(σ2ω(1γ)2MT)\displaystyle\leq O\left(\frac{\|\theta_{0}-\theta^{*}\|^{2}\delta}{(1-\gamma)^{2}}\right)\exp{\left(\frac{-\omega(1-\gamma)^{2}T}{C^{\prime}\delta}\right)}+\tilde{O}\left(\frac{\sigma^{2}}{\omega(1-\gamma)^{2}MT}\right) (139)
+O~(σ2δ2ω2(1γ)4T2),\displaystyle\hskip 5.69054pt+\tilde{O}\left(\frac{\sigma^{2}\delta^{2}}{\omega^{2}(1-\gamma)^{4}T^{2}}\right),

where w¯twt/WT\bar{w}_{t}\triangleq w_{t}/W_{T}, and CC^{\prime} is a suitably large constant. The result follows by noting that

𝔼[V^θ¯TV^θD2]t=0Tw¯t𝔼[V^θtV^θD2],\mathbb{E}\left[\|\hat{V}_{\bar{\theta}_{T}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right]\leq\sum_{t=0}^{T}\bar{w}_{t}\mathbb{E}\left[\|\hat{V}_{{\theta}_{t}}-\hat{V}_{\theta^{*}}\|^{2}_{D}\right],

where θ¯T=t=0Tw¯tθt\bar{\theta}_{T}=\sum_{t=0}^{T}\bar{w}_{t}\theta_{t}, and the inequality follows from Jensen’s inequality (convexity of the squared norm D2\|\cdot\|^{2}_{D}) together with the fact that the weights {w¯t}\{\bar{w}_{t}\} are convex. This completes the proof. ∎