
Deflated Dynamics Value Iteration

Jongmin Lee1, Amin Rakhsha2, Ernest K. Ryu3, Amir-massoud Farahmand2

1 Seoul National University, South Korea
2 University of Toronto & Vector Institute, Canada
3 University of California, Los Angeles, USA

Abstract

The Value Iteration (VI) algorithm is an iterative procedure to compute the value function of a Markov decision process, and it is also the basis of many reinforcement learning (RL) algorithms. Since the error convergence rate of VI as a function of the iteration $k$ is $O(\gamma^{k})$, it is slow when the discount factor $\gamma$ is close to $1$. To accelerate the computation of the value function, we propose Deflated Dynamics Value Iteration (DDVI). DDVI uses matrix splitting and matrix deflation techniques to effectively remove (deflate) the top $s$ dominant eigen-structure of the transition matrix $\mathcal{P}^{\pi}$. We prove that this leads to a $\tilde{O}(\gamma^{k}|\lambda_{s+1}|^{k})$ convergence rate, where $\lambda_{s+1}$ is the $(s+1)$-th largest eigenvalue of the dynamics matrix. We then extend DDVI to the RL setting and present the Deflated Dynamics Temporal Difference (DDTD) algorithm. We empirically show the effectiveness of the proposed algorithms.

1.  Introduction

Computing the value function $V^{\pi}$ for a policy $\pi$ or the optimal value function $V^{\star}$ is an integral step of many planning and reinforcement learning (RL) algorithms. Value Iteration (VI) is a fundamental dynamic programming algorithm for computing the value functions, and its approximate and sample-based variants, such as Temporal Difference Learning [Sutton, 1988], Fitted Value Iteration [Ernst et al., 2005, Munos and Szepesvári, 2008], and Deep Q-Network [Mnih et al., 2015], are the workhorses of modern RL algorithms [Bertsekas and Tsitsiklis, 1996, Sutton and Barto, 2018, Szepesvári, 2010, Meyn, 2022].

The VI algorithm, however, can be slow for problems with a long effective planning horizon, when the agent has to look far into the future in order to make good decisions. Within the discounted Markov Decision Process formalism, the discount factor $\gamma$ determines the effective planning horizon, with $\gamma$ closer to $1$ corresponding to a longer planning horizon. The error convergence rate of the conventional VI as a function of the iteration $k$ is $O(\gamma^{k})$, which is slow when $\gamma$ is close to $1$.

Recently, there has been a growing body of research that explores the application of acceleration techniques from other areas of applied math to planning and RL: Geist and Scherrer [2018], Sun et al. [2021], Ermis and Yang [2020], Park et al. [2022], Ermis et al. [2021], Shi et al. [2019] apply Anderson acceleration of fixed-point iterations, Lee and Ryu [2023] apply Anchor acceleration of minimax optimization, Vieillard et al. [2020], Goyal and Grand-Clément [2022], Grand-Clément [2021], Bowen et al. [2021], Akian et al. [2022] apply Nesterov acceleration of convex optimization, and Farahmand and Ghavamzadeh [2021] apply ideas inspired by PID controllers in control theory.

More detail is provided in Appendix 7.1.

We introduce a novel approach to accelerate VI based on modification of the eigenvalues of the transition dynamics, which are closely related to the convergence of VI. To see this connection, consider the policy evaluation problem where the goal is to find the value function $V^{\pi}$ for a given policy $\pi$. VI starts from an arbitrary $V^{0}$ and iteratively sets $V^{k+1}\leftarrow r^{\pi}+\gamma\mathcal{P}^{\pi}V^{k}$, where $r^{\pi}$ and $\mathcal{P}^{\pi}$ are the reward vector and the transition matrix of policy $\pi$, respectively. If $V^{0}=0$, the error vector at iteration $k$ can be shown to be $V^{\pi}-V^{k}=\sum_{i=k}^{\infty}(\gamma\mathcal{P}^{\pi})^{i}r^{\pi}$. Let us take a closer look at this difference.

For simplicity, assume that $\mathcal{P}^{\pi}$ is a diagonalizable matrix, so we can write $\mathcal{P}^{\pi}=UDU^{-1}$ with $U$ consisting of the (right) eigenvectors of $\mathcal{P}^{\pi}$ and $D$ being the diagonal matrix $\mathrm{diag}(\lambda_{1},\dotsc,\lambda_{n})$ with $1=|\lambda_{1}|\geq\cdots\geq|\lambda_{n}|$. Since $(\mathcal{P}^{\pi})^{i}=UD^{i}U^{-1}$, after some manipulations, we see that

V^{\pi}-V^{k}=U\begin{bmatrix}\frac{(\gamma\lambda_{1})^{k}}{1-\gamma\lambda_{1}}&0&\cdots&0\\ 0&\frac{(\gamma\lambda_{2})^{k}}{1-\gamma\lambda_{2}}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&\cdots&0&\frac{(\gamma\lambda_{n})^{k}}{1-\gamma\lambda_{n}}\end{bmatrix}U^{-1}r^{\pi}.

The diagonal terms are of the $(\gamma\lambda_{i})^{k}$ form, so they all converge to zero. The dominant term is $(\gamma\lambda_{1})^{k}$, which corresponds to the largest eigenvalue. As the largest eigenvalue $\lambda_{1}$ of the stochastic matrix $\mathcal{P}^{\pi}$ is $1$, this leads to the dominant behaviour of $O(\gamma^{k})$, which we also get from the norm-based contraction mapping analysis. The second dominant term behaves as $O((\gamma|\lambda_{2}|)^{k})$, and so on.
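As a quick numerical check of this decomposition, the following NumPy sketch compares the VI error after $k$ iterations with the closed-form expression above; the random MDP and all numbers are illustrative assumptions (not from the paper), and a diagonalizable $\mathcal{P}^{\pi}$ is assumed.

import numpy as np

# Minimal sketch: verify V^pi - V^k = U diag((gamma*lam_i)^k / (1 - gamma*lam_i)) U^{-1} r^pi.
rng = np.random.default_rng(0)
n, gamma, k = 5, 0.9, 25
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)                   # row-stochastic P^pi
r = rng.random(n)                                   # reward vector r^pi

V_pi = np.linalg.solve(np.eye(n) - gamma * P, r)    # exact value function
V = np.zeros(n)                                     # VI from V^0 = 0
for _ in range(k):
    V = r + gamma * P @ V

lam, U = np.linalg.eig(P)                           # P = U D U^{-1}
diag = (gamma * lam) ** k / (1 - gamma * lam)
predicted = (U @ np.diag(diag) @ np.linalg.inv(U) @ r).real

print(np.max(np.abs((V_pi - V) - predicted)))       # ~1e-15: the two expressions agree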

If we could somehow remove from $\mathcal{P}^{\pi}$ the subspace corresponding to the top $s$ eigen-structure with eigenvalues $\lambda_{1},\dotsc,\lambda_{s}$, the dominant behaviour of the new procedure would be $O((\gamma|\lambda_{s+1}|)^{k})$, which can be much faster than the $O(\gamma^{k})$ of the conventional VI. Although this may seem too good to be feasible, it is exactly what the proposed Deflated Dynamics Value Iteration (DDVI) algorithm achieves. DDVI works even if $\mathcal{P}^{\pi}$ is not a diagonalizable matrix. DDVI is based on two main ideas.

The first idea is the deflation technique, well-studied in linear algebra (see Section 4.2 of Saad [2011]), which allows us to remove large eigenvalues of $\mathcal{P}^{\pi}$. This is done by subtracting a matrix $E$ from $\mathcal{P}^{\pi}$. This gives us a “deflated dynamics” $\mathcal{P}^{\pi}-E$ that does not have eigenvalues $\lambda_{1},\dotsc,\lambda_{s}$. The second idea is based on matrix splitting (see Section 11.2 of Golub and Van Loan [2013]), which allows us to use the modified dynamics $\mathcal{P}^{\pi}-E$ to define a VI-like iterative procedure that still converges to the same solution $V^{\pi}$ to which the conventional VI converges.

Deflation has been applied to various algorithms such as the conjugate gradient algorithm [Saad et al., 2000], principal component analysis [Mackey, 2008], the generalized minimal residual method [Morgan, 1995], and nonlinear fixed-point iteration [Shroff and Keller, 1993] to improve convergence. While some prior work in accelerated planning can be considered as special cases of deflation [Bertsekas, 1995, White, 1963], to the best of our knowledge, this is the first application of a deflation technique that eliminates multiple eigenvalues in the context of RL. On the other hand, multiple planning algorithms have been introduced based on the general idea of matrix splitting [Hastings, 1968, Kushner and Kleinman, 1968, 1971, Reetz, 1973, Porteus, 1975, Rakhsha et al., 2022], though with a different splitting than the one in this work.

After a review of relevant background in Section 2, we present DDVI and prove its convergence rate for the Policy Evaluation (PE) problem in Section 3. Next, in Section 4, we discuss the practical computation of the deflation matrix. In Section 5, we explain how DDVI can be extended to its sample-based variant and introduce the Deflated Dynamics Temporal Difference (DDTD) algorithm. Finally, in Section 6, we empirically evaluate the proposed methods and show their practical feasibility.

2.  Background

We first briefly review basic definitions and concepts of Markov Decision Processes (MDP) and Reinforcement Learning (RL) [Bertsekas and Tsitsiklis, 1996, Sutton and Barto, 2018, Szepesvári, 2010, Meyn, 2022]. We then describe the Power Iteration and the QR Iteration, which can be used to compute the eigenvalues of a matrix.

For a generic set $\Omega$, we denote by $\mathcal{M}(\Omega)$ the space of probability distributions over the set $\Omega$. We use $\mathcal{F}(\Omega)$ to denote the space of bounded measurable real-valued functions over $\Omega$. For a matrix $A$, we use $\mathrm{spec}(A)$ to denote its spectrum (the set of eigenvalues), $\rho(A)$ to denote its spectral radius (the maximum of the absolute values of its eigenvalues), and $\|\cdot\|$ to denote its $\ell^{\infty}$ norm.

Markov Decision Process.

The discounted MDP is defined by the tuple $(\mathcal{X},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$, where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}\colon\mathcal{X}\times\mathcal{A}\rightarrow\mathcal{M}(\mathcal{X})$ is the transition probability kernel, $\mathcal{R}\colon\mathcal{X}\times\mathcal{A}\rightarrow\mathcal{M}(\mathbb{R})$ is the reward kernel, and $\gamma\in[0,1)$ is the discount factor. In this work, we assume that the MDP has a finite number of states. We use $r\colon\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{R}$ to denote the expected reward at a given state-action pair. We write $\pi\colon\mathcal{X}\rightarrow\mathcal{M}(\mathcal{A})$ for a policy. We define the reward function for a policy $\pi$ as $r^{\pi}(x)=\mathbf{E}_{a\sim\pi(\cdot\mid x)}[r(x,a)]$. The transition kernel of following policy $\pi$ is denoted by $\mathcal{P}^{\pi}\colon\mathcal{X}\rightarrow\mathcal{M}(\mathcal{X})$ and is defined as

\mathcal{P}^{\pi}(x\rightarrow x^{\prime}) = \mathrm{Prob}(x\rightarrow x^{\prime}\,|\,a\sim\pi(\cdot\mid x),\,x^{\prime}\sim\mathcal{P}(\cdot\mid x,a)).

The value function for a policy $\pi$ is $V^{\pi}(x)=\mathbf{E}_{\pi}[\sum^{\infty}_{t=0}\gamma^{t}r(x_{t},a_{t})\mid x_{0}=x]$, where $\mathbf{E}_{\pi}$ denotes the expected value over all trajectories $(x_{0},a_{0},x_{1},a_{1},\dots)$ induced by $\mathcal{P}$ and $\pi$. We say that $V^{\star}$ is the optimal value function if $V^{\star}=\sup_{\pi}V^{\pi}$, and that $\pi^{\star}$ is an optimal policy if $\pi^{\star}=\operatorname*{argmax}_{\pi}V^{\pi}$.

The Bellman operator $T^{\pi}$ for policy $\pi$ is defined as the mapping that takes $V\in\mathcal{F}(\mathcal{X})$ and returns a new function whose value at state $x$ is $(T^{\pi}V)(x)=\mathbf{E}_{a\sim\pi(\cdot\mid x),x^{\prime}\sim\mathcal{P}(\cdot\mid x,a)}[r(x,a)+\gamma V(x^{\prime})]$. The Bellman optimality operator $T^{\star}$ is defined as $T^{\star}V(x)=\sup_{a\in\mathcal{A}}\{r(x,a)+\gamma\mathbf{E}_{x^{\prime}\sim\mathcal{P}(\cdot\,|\,x,a)}[V(x^{\prime})]\}$. The value functions $V^{\pi}$ and $V^{\star}$ are the fixed points of the Bellman operators, that is, $V^{\pi}=T^{\pi}V^{\pi}$ and $V^{\star}=T^{\star}V^{\star}$.

Value Iteration.

The Value Iteration algorithm is one of the main methods in dynamic programming and planning for computing the value function $V^{\pi}$ of a policy (the Policy Evaluation (PE) problem) or the optimal value function $V^{\star}$ (the Control problem). It is iteratively defined as

V^{k+1}\leftarrow\begin{cases}T^{\pi}V^{k}&\text{(Policy Evaluation)}\\ T^{\star}V^{k}&\text{(Control)},\end{cases}

where $V^{0}$ is the initial function. For discounted MDPs where $\gamma<1$, the Bellman operators are contractions, so by the Banach fixed-point theorem [Banach, 1922, Hunter and Nachtergaele, 2001, Hillen, 2023], VI converges to the unique fixed point, $V^{\pi}$ or $V^{\star}$, with the convergence rate of $\mathcal{O}(\gamma^{k})$.

Let us now recall some concepts and methods from (numerical) linear algebra; see Golub and Van Loan [2013] for more detail. For any $A\in\mathbb{R}^{n\times n}$ (not necessarily symmetric nor diagonalizable) and for any matrix norm $\|\cdot\|$, Gelfand’s formula states that $\rho(A)=\lim_{k\rightarrow\infty}\|A^{k}\|^{1/k}$. Hence, $\|A^{k}\|=O((\rho(A)+\epsilon)^{k})$ for any $\epsilon>0$, which we denote by $\|A^{k}\|=\tilde{O}((\rho(A))^{k})$. Furthermore, if $A$ is diagonalizable, then $\|A^{k}\|=O(\rho(A)^{k})$.

Power Iteration.

Powers of $A$ can be used to compute the eigenvalues of $A$. The Power Iteration starts with an initial vector $b^{0}\in\mathbb{R}^{n}$ and for $k=0,1,2,\dotsc$ computes

b^{k+1}=\frac{Ab^{k}}{\|Ab^{k}\|_{2}}.

If $\lambda_{1}$ is the eigenvalue such that $|\lambda_{1}|=\rho(A)$ and the other eigenvalues $\lambda_{2},\dots,\lambda_{n}$ have strictly smaller magnitude, then $(b^{k})^{\top}Ab^{k}\rightarrow\lambda_{1}$ for almost all starting points $b^{0}$ (Section 7.3.1 of Golub and Van Loan [2013]).
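For concreteness, a minimal NumPy sketch of the Power Iteration (on an arbitrary illustrative matrix, not one from the paper) could look as follows.

import numpy as np

def power_iteration(A, num_iters=1000, seed=0):
    """Return an approximation of the dominant eigenvalue/eigenvector of A."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal(A.shape[0])
    for _ in range(num_iters):
        b = A @ b
        b /= np.linalg.norm(b)              # b^{k+1} = A b^k / ||A b^k||
    return b @ A @ b, b                     # Rayleigh quotient approximates lambda_1

A = np.array([[2.0, 1.0], [1.0, 3.0]])
lam1, _ = power_iteration(A)
print(lam1)                                 # approx. 3.618, the dominant eigenvalue of A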

QR iteration.

The QR Iteration (or Orthogonal Iteration) algorithm [Golub and Van Loan, 2013, Sections 7.3 and 8.2] is a generalization of the Power Iteration that finds eigenvalues. For any $A\in\mathbb{C}^{n\times n}$, let $U^{0}\in\mathbb{C}^{n\times s}$ have orthonormal columns and perform

Z^{k+1}=AU^{k}
U^{k+1}R^{k+1}=Z^{k+1}\quad\text{(QR factorization)}

for $k=0,1,\dotsc$. If $|\lambda_{s}|>|\lambda_{s+1}|$, the columns of $U^{k}$ converge to an orthonormal basis for the dominant $s$-dimensional invariant subspace associated with the eigenvalues $\lambda_{1},\dots,\lambda_{s}$, and the diagonal entries of $(U^{k})^{\top}AU^{k}$ converge to $\lambda_{1},\dots,\lambda_{s}$ for almost all starting $U^{0}$ [Golub and Van Loan, 2013, Theorem 7.3.1].
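A minimal sketch of this Orthogonal Iteration (again on an illustrative matrix with a real, well-separated spectrum; not the paper's code) is given below.

import numpy as np

def orthogonal_iteration(A, s, num_iters=500, seed=0):
    """Approximate the top-s eigenvalues and an orthonormal basis (Schur vectors)."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((A.shape[0], s)))
    for _ in range(num_iters):
        Z = A @ U                           # Z^{k+1} = A U^k
        U, _ = np.linalg.qr(Z)              # U^{k+1} R^{k+1} = Z^{k+1}
    return np.diag(U.T @ A @ U), U

A = np.diag([1.0, 0.7, 0.3, 0.1]) + 0.01 * np.random.default_rng(1).standard_normal((4, 4))
vals, U = orthogonal_iteration(A, s=2)
print(vals)                                 # approx. the two largest eigenvalues of A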

3.  Deflated Dynamics Value Iteration

We present Deflated Dynamics Value Iteration (DDVI), after introducing its two key ingredients, matrix deflation and matrix splitting.

3.1.  Matrix Deflation

Let $\mathcal{P}^{\pi}$ be an $n\times n$ matrix with eigenvalues $\lambda_{1},\dots,\lambda_{n}$, sorted in decreasing order of magnitude with ties broken arbitrarily. Let $u^{\top}$ denote the conjugate transpose of $u\in\mathbb{C}^{n}$. We describe three ways of deflating this matrix.

Fact 1 (Hotelling’s deflation, [Meirovitch, 1980, Section 5.6]).

Assume $s\leq n$ linearly independent eigenvectors corresponding to $\{\lambda_{i}\}^{s}_{i=1}$ exist. Write $\{u_{i}\}^{s}_{i=1}$ and $\{v_{i}\}^{s}_{i=1}$ to denote the top $s$ right and left eigenvectors, scaled to satisfy $u^{\top}_{i}v_{i}=1$ and $u^{\top}_{i}v_{j}=0$ for all $1\leq i\neq j\leq s$. If $E_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top}$, then $\rho(\mathcal{P}^{\pi}-E_{s})=|\lambda_{s+1}|$.

Hotelling’s deflation makes $\rho(\mathcal{P}^{\pi}-E_{s})$ small by eliminating the top $s$ eigenvalues of $\mathcal{P}^{\pi}$, but it requires both right and left eigenvectors of $\mathcal{P}^{\pi}$. Wielandt’s deflation, in contrast, requires only right eigenvectors.

Fact 2 (Wielandt’s deflation, [Soto and Rojo, 2006, Theorem 5]).

Assume $s\leq n$ linearly independent eigenvectors corresponding to $\{\lambda_{i}\}^{s}_{i=1}$ exist. Write $\{u_{i}\}^{s}_{i=1}$ to denote the top $s$ linearly independent right eigenvectors. Assume that the vectors $\{v_{i}\}^{s}_{i=1}$, which are not necessarily the left eigenvectors, satisfy $u^{\top}_{i}v_{i}=1$ and $u^{\top}_{i}v_{j}=0$ for all $1\leq i\neq j\leq s$. If $E_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top}$, then $\rho(\mathcal{P}^{\pi}-E_{s})=|\lambda_{s+1}|$.

Wielandt’s deflation, however, still requires right eigenvectors, which are sometimes numerically unstable to compute. The Schur deflation, in contrast, only requires Schur vectors, which are stable to compute [Golub and Van Loan, 2013, Section 7.3]. For any $\mathcal{P}^{\pi}\in\mathbb{R}^{n\times n}$, the Schur decomposition has the form $\mathcal{P}^{\pi}=URU^{\top}$, where $R$ is an upper triangular matrix with $R_{ii}=\lambda_{i}$ for $i=1,\dots,n$, and $U$ is a unitary matrix. We write $u_{i}$ to denote the $i$-th column of $U$ and call it the $i$-th Schur vector for $i=1,\dots,n$. Specifically, the QR Iteration computes the top $s$ eigenvalues and Schur vectors.

Fact 3 (Schur deflation, [Saad, 2011, Proposition 4.2]).

Let $s\leq n$. Write $\{u_{i}\}^{s}_{i=1}$ to denote the top $s$ Schur vectors. If $E_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}u_{i}^{\top}$, then $\rho(\mathcal{P}^{\pi}-E_{s})=|\lambda_{s+1}|$.

If an $E_{s}$ satisfies the conditions of Facts 1, 2, or 3, we say it is a rank-$s$ deflation matrix. Later, it will be needed that for $E_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top}$,

(I-\alpha\gamma E_{s})^{-1}=I+\sum_{i=1}^{s}\frac{\alpha\gamma\lambda_{i}}{1-\alpha\gamma\lambda_{i}}u_{i}v_{i}^{\top}    (1)

holds. Indeed, the conditions of Facts 1, 2, or 3 do imply (1).
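The following minimal sketch (with a random row-stochastic matrix, the rank-$1$ Wielandt choice $u_{1}=\mathbf{1}$, $\lambda_{1}=1$, and an arbitrary $v_{1}$ with $v_{1}^{\top}u_{1}=1$; all quantities are illustrative assumptions) numerically checks both the spectral-radius claim of Fact 2 and identity (1).

import numpy as np

rng = np.random.default_rng(0)
n, gamma, alpha = 6, 0.99, 1.0
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)                  # row-stochastic P^pi

u1 = np.ones(n)                                    # right eigenvector for lambda_1 = 1
v1 = np.full(n, 1.0 / n)                           # any v1 with v1^T u1 = 1
E1 = 1.0 * np.outer(u1, v1)                        # rank-1 Wielandt deflation matrix

abs_eigs_P = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
abs_eigs_def = np.sort(np.abs(np.linalg.eigvals(P - E1)))[::-1]
print(abs_eigs_P[0], abs_eigs_def[0], abs_eigs_P[1])   # 1.0, |lambda_2|, |lambda_2|

lhs = np.linalg.inv(np.eye(n) - alpha * gamma * E1)
rhs = np.eye(n) + (alpha * gamma / (1 - alpha * gamma)) * np.outer(u1, v1)   # identity (1) with lambda_1 = 1
print(np.max(np.abs(lhs - rhs)))                       # ~0: identity (1) holds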

3.2.  Matrix Splitting: SOR

Consider the splitting of a matrix $A\in\mathbb{R}^{n\times n}$ in the form $A=B+C+D$. Let $\alpha\neq 0$. Then, $z$ solves the linear system $Az=b$ if and only if

(D+\alpha B)z=\alpha b-(\alpha C-(\alpha-1)D)z,

which, in turn, holds if and only if

z=(D+\alpha B)^{-1}(\alpha b-(\alpha C-(\alpha-1)D)z),

provided that $D+\alpha B$ is invertible. Successive over-relaxation (SOR) attempts to find a solution through the fixed-point iteration

z^{k+1}=(D+\alpha B)^{-1}(\alpha b-(\alpha C-(\alpha-1)D)z^{k}).

Note that classical SOR uses a lower triangular $B$, an upper triangular $C$, and a diagonal $D$. Here, we generalize the standard derivation of SOR to any splitting $A=B+C+D$.
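As a small illustration of this fixed-point iteration (on an arbitrary well-conditioned linear system with the classical triangular/diagonal splitting; the system and all parameters are illustrative assumptions), consider the sketch below.

import numpy as np

rng = np.random.default_rng(0)
n, alpha = 5, 1.0
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # a (nearly) diagonally dominant system
b = rng.random(n)
z_star = np.linalg.solve(A, b)

# Classical splitting: B strictly lower triangular, C strictly upper, D diagonal.
B, C, D = np.tril(A, -1), np.triu(A, 1), np.diag(np.diag(A))

M = np.linalg.inv(D + alpha * B)
z = np.zeros(n)
for _ in range(200):
    z = M @ (alpha * b - (alpha * C - (alpha - 1) * D) @ z)

print(np.max(np.abs(z - z_star)))                   # small: the iteration converges to A^{-1} b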

Policy evaluation is the problem of finding a unique $V^{\pi}$ that satisfies the Bellman equation $V^{\pi}=T^{\pi}V^{\pi}$, or in an expanded form, $V^{\pi}=r^{\pi}+\gamma\mathcal{P}^{\pi}V^{\pi}$. Notice that this is equivalent to

(I-\gamma E-\gamma(\mathcal{P}^{\pi}-E))V^{\pi}=r^{\pi}

for any $E\in\mathbb{R}^{n\times n}$. The SOR iteration for PE is

V^{k+1}=(I-\alpha\gamma E)^{-1}(\alpha r^{\pi}+((1-\alpha)I+\alpha\gamma(\mathcal{P}^{\pi}-E))V^{k}).

As an example, notice that for $\alpha=1$ and $E=0$, we recover the original VI. The convergence behaviour of this procedure depends on the choice of $\alpha$ and $E$. Soon we propose particular choices that can lead to acceleration.

3.3.  Deflated Dynamics Value Iteration

We are now ready to introduce DDVI. Let $E_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top}$ be a rank-$s$ deflation matrix of $\mathcal{P}^{\pi}$ satisfying the conditions of Facts 1, 2, or 3. Using (1), we can express the SOR iteration for PE with deflation matrix $E_{s}$ as

W^{k+1}=(1-\alpha)V^{k}+\alpha r^{\pi}+\alpha\gamma\Big(\mathcal{P}^{\pi}-\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top}\Big)V^{k}
V^{k+1}=\Big(I+\sum^{s}_{i=1}\frac{\alpha\gamma\lambda_{i}}{1-\alpha\gamma\lambda_{i}}u_{i}v_{i}^{\top}\Big)W^{k+1}.    (2)

We call this method Deflated Dynamics Value Iteration (DDVI). Theorem 3.1 and Corollary 3.2 describe the rate of convergence of DDVI for PE.
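To make the update concrete, here is a minimal NumPy sketch of DDVI (2) for PE with $\alpha=1$ and the rank-$1$ Wielandt deflation $E_{1}=\mathbf{1}v^{\top}$ (so it matches the simplified form in Corollary 3.2 below); the MDP is a random illustrative one rather than an example from the paper.

import numpy as np

rng = np.random.default_rng(0)
n, gamma = 30, 0.99
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)       # P^pi
r = rng.random(n)                                               # r^pi
V_pi = np.linalg.solve(np.eye(n) - gamma * P, r)

u1 = np.ones(n)
v1 = np.full(n, 1.0 / n)               # any v with v^T u1 = 1
E1 = np.outer(u1, v1)                  # lambda_1 u1 v1^T with lambda_1 = 1

V_vi, V_ddvi = np.zeros(n), np.zeros(n)
for k in range(300):
    V_vi = r + gamma * P @ V_vi                                   # plain VI
    W = r + gamma * (P - E1) @ V_ddvi                             # W^{k+1}
    V_ddvi = W + (gamma / (1 - gamma)) * np.outer(u1, v1) @ W     # V^{k+1}

print(np.max(np.abs(V_vi - V_pi)), np.max(np.abs(V_ddvi - V_pi)))
# DDVI's error decays like (gamma |lambda_2|)^k, far faster than gamma^k here.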

Theorem 3.1.

Let $\pi$ be a policy and let $\lambda_{1},\dots,\lambda_{n}$ be the eigenvalues of $\mathcal{P}^{\pi}$ sorted in decreasing order of magnitude with ties broken arbitrarily. Let $s\leq n$. Let $E_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top}$ be a rank-$s$ deflation matrix of $\mathcal{P}^{\pi}$ satisfying the conditions of Facts 1, 2, or 3. For $0<\alpha\leq 1$, DDVI (2) exhibits the rate

\|V^{k}-V^{\pi}\|=\tilde{O}(|\lambda|^{k}\|V^{0}-V^{\pi}\|)

as $k\rightarrow\infty$, where

\lambda=\max_{1\leq i\leq s,\;s+1\leq j\leq n}\Big\{\Big|\frac{1-\alpha}{1-\alpha\gamma\lambda_{i}}\Big|,\;|1-\alpha+\alpha\gamma\lambda_{j}|\Big\}.

(Here, the precise meaning of the $\tilde{O}$ notation is $\|V^{k}-V^{\pi}\|=O(|\lambda+\epsilon|^{k}\|V^{0}-V^{\pi}\|)$ as $k\rightarrow\infty$ for any $\epsilon>0$.)

When $\alpha=1$, we can simplify DDVI (2) as follows.

Corollary 3.2.

In the setting of Theorem 3.1, if $\alpha=1$, then DDVI (2) simplifies to

W^{k+1}=\gamma(\mathcal{P}^{\pi}-E_{s})V^{k}+r^{\pi},\qquad V^{k+1}=\Big(I+\sum^{s}_{i=1}\frac{\gamma\lambda_{i}}{1-\gamma\lambda_{i}}u_{i}v_{i}^{\top}\Big)W^{k+1}

and the rate reduces to $\|V^{k}-V^{\pi}\|=\tilde{O}(|\gamma\lambda_{s+1}|^{k}\|V^{0}-V^{\pi}\|)$ as $k\rightarrow\infty$.

Let us compare this convergence rate with the original VI’s. The rate for VI is $O(\gamma^{k})$, which is slow when $\gamma\approx 1$. By choosing $E$ to deflate the top $s$ eigenvalues of $\mathcal{P}^{\pi}$, DDVI has a rate of $O(|\gamma\lambda_{s+1}|^{k})$, which is exponentially faster whenever $|\lambda_{s+1}|$ is smaller than $1$. The exact behaviour depends on the spectrum of the Markov chain (whether the eigenvalues are close to $1$ or far from it) and the number of eigenvalues we decide to deflate by $E$.

In Section 4, we discuss a practical implementation of DDVI based on the power iteration and the orthogonal iteration and implement it in the experiments of Section 6. We find that smaller values of $\alpha$ lead to more stable iterations when using approximate deflation matrices. We also show how we can use DDVI in the RL setting in Section 5.

This version of DDVI (2) applies to the policy evaluation (PE) setup. The challenge in extending DDVI to the Control setup is that $\mathcal{P}^{\pi}$ changes throughout the iterations, and so should the deflation matrix $E_{s}$. However, when $s=1$, the deflation matrix $E_{1}$ can be kept constant, and we utilize this fact in Section 4.1.

4.  Computing the Deflation Matrix $E$

The application of DDVI requires practical means of computing the deflation matrix $E_{s}$. In this section, we provide three approaches as examples.

4.1.  Rank-$1$ DDVI for PE and Control

Recall that for any stochastic $n\times n$ matrix, the vector $\mathbf{1}=[1,\dots,1]^{\top}\in\mathbb{R}^{n}$ is a right eigenvector corresponding to the eigenvalue $1$. This allows us to easily obtain a rank-$1$ deflation matrix $E_{1}$ for $\mathcal{P}^{\pi}$ for any policy $\pi$.

Let $E_{1}=\mathbf{1}v^{\top}$, where $v\in\mathbb{R}^{n}$ is a vector with non-negative entries satisfying $v^{\top}\mathbf{1}=1$. This rank-$1$ Wielandt’s deflation matrix can be used for DDVI (PE) as in Section 3, but we can also use it for the Control version of DDVI.

We define the rank-$1$ DDVI for Control as

W^{k+1}=\max_{\pi}\{r^{\pi}+\gamma(\mathcal{P}^{\pi}-E_{1})W^{k}\}=\max_{\pi}\{r^{\pi}+\gamma\mathcal{P}^{\pi}W^{k}\}-\gamma(v^{\top}W^{k})\mathbf{1}.    (3)

Here, we benefit from the fact that $E_{1}$ is not a function of $\pi$ in order to take it out of the $\max_{\pi}$. Compared with (2) of DDVI (PE), we set $\alpha=1$ for simplicity. We have the following guarantee for rank-$1$ DDVI for Control.

Theorem 4.1.

The DDVI for Control algorithm (3) exhibits the rate $\|V^{k}-V^{\star}\|\leq\frac{2}{1-\gamma}\gamma^{k}\|V^{0}-V^{\star}\|$ for $k=0,1,\dots$, where $V^{k}=(I+\frac{\gamma}{1-\gamma}\mathbf{1}v^{\top})W^{k}$. Furthermore, if there exists a unique optimal policy $\pi^{\star}$, then $\|V^{k}-V^{\star}\|=\tilde{O}(|\gamma\lambda_{2}|^{k}\|V^{0}-V^{\star}\|)$ as $k\rightarrow\infty$, where $\lambda_{2}$ is the second eigenvalue of $\mathcal{P}^{\pi^{\star}}$ (the $\tilde{O}$ notation is as defined in Theorem 3.1).

Discussion.

Although rank-$1$ DDVI for Control does accelerate the convergence $V^{k}\rightarrow V^{\star}$, a subtle point to note is that the greedy policy $\pi^{k}$ obtained from $V^{k}$ is not affected by the rank-$1$ deflation. Briefly speaking, this is because adding a uniform constant to $V^{k}$ through $\mathbf{1}$ has no effect on the $\operatorname*{argmax}$. Indeed, the maximizer $\pi^{k+1}$ in (3) is the same as $\operatorname*{argmax}_{\pi}\{r^{\pi}+\gamma\mathcal{P}^{\pi}V^{k}\}$, produced by the (non-deflated) value iteration, when $W^{0}=V^{0}$. Having the term $\gamma(v^{\top}W^{k})\mathbf{1}$ in the update $W^{k+1}=r^{\pi^{k+1}}+\gamma(\mathcal{P}^{\pi^{k+1}}-\mathbf{1}v^{\top})W^{k}$ adds the same constant to all states, so it does not change the maximizer of the next policy.
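The following minimal sketch implements rank-$1$ DDVI for Control (3) on a random illustrative tabular MDP (the states, actions, and choice of $v$ are all assumptions for illustration) and checks that it recovers $V^{\star}$.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 20, 4, 0.99
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)   # P[a, x, x']
r = rng.random((nS, nA))                                          # r(x, a)

v = np.full(nS, 1.0 / nS)                                         # any v with v^T 1 = 1
W = np.zeros(nS)
for k in range(500):
    Q = r + gamma * np.einsum('axy,y->xa', P, W)                  # r(x,a) + gamma sum_x' P(x'|x,a) W(x')
    W = Q.max(axis=1) - gamma * (v @ W)                           # update (3)
V = W + (gamma / (1 - gamma)) * (v @ W)                           # V^k = (I + gamma/(1-gamma) 1 v^T) W^k

V_star = np.zeros(nS)                                             # reference: plain VI for V*
for k in range(5000):
    V_star = (r + gamma * np.einsum('axy,y->xa', P, V_star)).max(axis=1)
print(np.max(np.abs(V - V_star)))                                 # small: rank-1 DDVI recovers V*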

4.2.  DDVI with Automatic Power Iteration

Assume that the top $s+1$ eigenvalues of $\mathcal{P}^{\pi}$ have distinct magnitudes, i.e., $1=\lambda_{1}>|\lambda_{2}|>\cdots>|\lambda_{s+1}|$. Let $E_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top}$ be a rank-$s$ deflation matrix of $\mathcal{P}^{\pi}$ as in Fact 2.

Consider the DDVI algorithm (2) with $\alpha=1$ as in Corollary 3.2. If $V^{0}=0$, since $(\mathcal{P}^{\pi}-E_{s})(I-\gamma E_{s})^{-1}=\mathcal{P}^{\pi}-E_{s}$ (verified in Appendix 8), we have $W^{k}=\sum_{i=0}^{k-1}\gamma^{i}(\mathcal{P}^{\pi}-E_{s})^{i}r^{\pi}$. Therefore, $\frac{1}{\gamma^{k}}(W^{k+1}-W^{k})=(\mathcal{P}^{\pi}-E_{s})^{k}r^{\pi}$ gives the iterates of a power iteration for the matrix $\mathcal{P}^{\pi}-E_{s}$ starting from the initial vector $r^{\pi}$. As discussed in Fact 2, the top eigenvalue of $\mathcal{P}^{\pi}-E_{s}$ is $\lambda_{s+1}$. For large $k$, we expect $\frac{W^{k+1}-W^{k}}{\|W^{k+1}-W^{k}\|}\approx w$, where $w$ is the top eigenvector of $\mathcal{P}^{\pi}-E_{s}$. With $w$, we can recover the $(s+1)$-th right eigenvector of $\mathcal{P}^{\pi}$ through the formula [Bru et al., 2012, Proposition 5] $u_{s+1}=w-\sum^{s}_{i=1}\frac{\lambda_{i}v_{i}^{\top}w}{\lambda_{i}-\lambda_{s+1}}u_{i}$.

Leveraging this observation, DDVI with Automatic Power Iteration (AutoPI) computes an approximate rank-$s$ deflation matrix $E_{s}$ while performing DDVI: start with a rank-$1$ deflation matrix, using the first right eigenvector $u_{1}=\mathbf{1}$, and carry out DDVI iterations. If a certain error criterion is satisfied, use $W^{k+1}-W^{k}$ to approximate the second right eigenvector $u_{2}$. Then, update the deflation matrix to be rank $2$, and gradually increase the deflation rank. Algorithm 3 in Appendix 9 has the details.
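A minimal sketch of the first rank increase of this idea is shown below: run rank-$1$ DDVI (with $u_{1}=\mathbf{1}$, $\lambda_{1}=1$), then read $\lambda_{2}$ and $u_{2}$ off the residuals $W^{k+1}-W^{k}$. The environment (a slowly mixing two-block chain, chosen so that $\lambda_{2}$ is real and close to $1$), the iteration count, and the switching point are illustrative assumptions; the actual criteria are given in Algorithm 3 (Appendix 9).

import numpy as np

rng = np.random.default_rng(0)
n, gamma, eps = 30, 0.99, 0.01
A = rng.random((n, n))
A[:15, 15:] *= eps; A[15:, :15] *= eps        # two weakly coupled blocks -> lambda_2 near 1
P = A / A.sum(axis=1, keepdims=True)
r = rng.random(n)

u1, v1 = np.ones(n), np.full(n, 1.0 / n)
E1 = np.outer(u1, v1)                         # rank-1 Wielandt deflation

W_prev, W = np.zeros(n), r.copy()             # W^1 = r^pi when V^0 = 0
for k in range(100):
    W_prev, W = W, r + gamma * (P - E1) @ W   # rank-1 DDVI with alpha = 1

d = W - W_prev                                # proportional to (P - E1)^k r^pi
w = d / np.linalg.norm(d)                     # approx. top eigenvector of P - E1
lam2 = w @ (P - E1) @ w                       # approx. lambda_2 of P^pi
u2 = w - ((v1 @ w) / (1.0 - lam2)) * u1       # Bru et al. recovery of u_2 (lambda_1 = 1)
print(lam2, np.linalg.norm(P @ u2 - lam2 * u2))   # u2 is approximately an eigenvector of P^pi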

Rank-$s$ DDVI with the QR Iteration.

We can also use the QR Iteration to compute the top $s$ Schur vectors, construct the rank-$s$ Schur deflation matrix, and perform DDVI. We can also perform the QR Iteration in an automated manner, similar to AutoPI. This results in the AutoQR algorithm. The details are in Appendix 9.

5.  Deflated Dynamics Temporal Difference Learning

Many practical RL algorithms such as Temporal Difference (TD) Learning [Sutton, 1988, Tsitsiklis and V. Roy, 1997], Q-Learning [Watkins, 1989], Fitted Value Iteration [Gordon, 1995], and DQN [Mnih et al., 2015] can be viewed as sample-based variants of VI. Consequently, the slow convergence in the case of $\gamma\approx 1$ has also been observed for these algorithms, by Szepesvári [1997], Even-Dar and Mansour [2003], Wainwright [2019] for TD Learning and by Munos and Szepesvári [2008], Farahmand et al. [2010], Chen and Jiang [2019], Fan et al. [2020] for Fitted VI and DQN. Here we introduce Deflated Dynamics Temporal Difference Learning (DDTD) as a sample-based variant of DDVI. To start, recall that tabular temporal difference (TD) learning performs the updates

V^{k+1}(X_{k})=V^{k}(X_{k})+\eta_{k}[r^{\pi}(X_{k})+\gamma V^{k}(X^{\prime}_{k})-V^{k}(X_{k})],

where $\eta_{k}$ is the stepsize, $(X_{k},X^{\prime}_{k})$ are random samples from the environment such that $X_{k}^{\prime}$ is the subsequent state following $X_{k}$, and $r^{\pi}(X_{k})$ is the expected $1$-step reward obtained by following policy $\pi$ from state $X_{k}$. TD is a sample-based variant of VI for PE. Two key ingredients of TD learning are the random coordinate updates and the temporal difference error $r^{\pi}(X)+\gamma V(X^{\prime})-V(X)$, whose conditional expectation is equal to $(T^{\pi}V-V)(X)$. Applying the same ingredients to DDVI, we obtain DDTD. The improved convergence rate of DDVI compared to VI suggests that DDTD will also exhibit improved convergence rates.

Specifically, we define DDTD as

W^{k+1}(X_{k})=W^{k}(X_{k})+\eta_{k}\big[(1-\alpha)V^{k}(X_{k})+\alpha\left(r^{\pi}(X_{k})+\gamma V^{k}(X^{\prime}_{k})-\gamma(E_{s}V^{k})(X_{k})\right)-W^{k}(X_{k})\big]
V^{k+1}=\Big(I+\sum^{s}_{i=1}\frac{\alpha\gamma\lambda_{i}}{1-\alpha\gamma\lambda_{i}}u_{i}v_{i}^{\top}\Big)W^{k+1},    (4)

where $E_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top}$ is a rank-$s$ deflation matrix of $\mathcal{P}^{\pi}$ and $\{X_{k},X_{k}^{\prime}\}_{k=0,1,\dots}$ are i.i.d. random variables such that $X_{k}\sim\text{Unif}(\mathcal{X})$ and $X^{\prime}_{k}\sim\mathcal{P}^{\pi}(\cdot\,|\,X_{k})$. The $W^{k+1}$-update notation means $W^{k+1}(x)=W^{k}(x)$ for all $x\neq X_{k}$.

DDTD is the sample-based variant of DDVI performing asynchronous updates. The following result provides almost sure convergence of DDTD.

Theorem 5.1.

Let $\eta_{k}=(\sum^{k}_{i=0}\mathbf{1}_{X_{i}=x})^{-1}$ when $X_{k}=x$. For $\alpha=1$, DDTD (4) converges to $V^{\pi}$ almost surely.

The following result describes the asymptotic convergence rate of DDTD.

Theorem 5.2.

Let $\lambda_{1},\dots,\lambda_{n}$ be the eigenvalues of $\mathcal{P}^{\pi}$ sorted in decreasing order of magnitude with ties broken arbitrarily. Let $\eta_{k}=C/(k+1)$ for $C>\frac{n}{2\lambda_{\text{DDTD}}}$, where $\lambda_{\text{DDTD}}=\min_{\lambda\in\{\lambda_{s+1},\dots,\lambda_{n}\}}\text{Re}(1-\gamma\lambda)$. For $\alpha=1$, DDTD (4) exhibits the rate $\mathbf{E}[\|V^{k}-V^{\pi}\|^{2}]=O(k^{-1})$.

We note that in Theorem 5.2, as the rank of DDTD increases, $\lambda_{\text{DDTD}}$ also increases, which implies that higher-rank DDTD has a larger range of convergent step sizes than plain TD learning.

5.1.  Implementation of DDTD

To implement DDTD practically, we take a model-based approach for obtaining $E_{s}$. At each iteration, the agent uses the new samples to update an approximate model $\hat{\mathcal{P}}^{\pi}$ of the true dynamics. This updated approximate model is used to compute $E_{s}$. We update $E_{s}$ only every $K$ iterations, since the change in $\hat{\mathcal{P}}^{\pi}$, and consequently in the resulting $E_{s}$, at each iteration is small. Whenever a new $E_{s}$ is computed, we set $W$ to be $(I-\alpha\gamma E_{s})V$. This step ensures that $V$ smoothly converges to $V^{\pi}$ by updating all coordinates of $W$ at once based on the new $E_{s}$. Also, we use the random sampled reward $R$ in place of the expected reward $r^{\pi}(X_{k})$. We formally state the algorithm as Algorithm 1.

Algorithm 1 Rank-$s$ DDTD with QR iteration
1: Initialize $\alpha$, $s$, $W$, $V$, $E_{s}$, $\hat{\mathcal{P}}^{\pi}$, and $K$
2: for $k=1,2,\dots$ do
3:     Choose $X_{k}$ uniformly at random.
4:     Sample $(\pi(X_{k}),R_{k},X^{\prime}_{k})$ from the environment
5:     Update $\hat{\mathcal{P}}^{\pi}$ with $(X_{k},\pi(X_{k}),R_{k},X^{\prime}_{k})$
6:     if $k$ mod $K=0$ then
7:         $\{\lambda_{i}\}^{s}_{i=1},\{u_{i}\}^{s}_{i=1}$ = QRiteration($\hat{\mathcal{P}}^{\pi}$, $s$)
8:         Update $E_{s}\leftarrow\sum_{i=1}^{s}\lambda_{i}u_{i}u_{i}^{\top}$
9:         $W\leftarrow(I-\alpha\gamma E_{s})V$
10:     $W(X_{k})\leftarrow W(X_{k})+\eta_{k}\big[\alpha R_{k}+\alpha\gamma V(X_{k}^{\prime})-\alpha\gamma(E_{s}V)(X_{k})+(1-\alpha)V(X_{k})-W(X_{k})\big]$
11:     $V\leftarrow\left(I+\sum^{s}_{i=1}\frac{\alpha\gamma\lambda_{i}}{1-\alpha\gamma\lambda_{i}}u_{i}u_{i}^{\top}\right)W$
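For concreteness, the sketch below implements Algorithm 1 on a random illustrative MDP with a count-based empirical model; the environment, the hyperparameters ($\alpha$, $K$, step sizes), and the reversible-chain construction (used so the spectrum is real) are all assumptions for illustration, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)
n, s, gamma, alpha, K, T = 20, 2, 0.99, 0.5, 500, 100_000
S = rng.random((n, n)); S = S + S.T                        # reversible chain -> real spectrum
P = S / S.sum(axis=1, keepdims=True)                       # true P^pi (unknown to the agent)
r = rng.random(n)                                          # deterministic rewards, for simplicity
V_pi = np.linalg.solve(np.eye(n) - gamma * P, r)

def top_schur(P_hat, s, iters=200):
    # Orthogonal (QR) iteration for the top-s eigenvalues and Schur vectors of P_hat.
    U, _ = np.linalg.qr(rng.standard_normal((n, s)))
    for _ in range(iters):
        U, _ = np.linalg.qr(P_hat @ U)
    return np.diag(U.T @ P_hat @ U), U

counts = np.ones((n, n))                                   # smoothed empirical model of P^pi
W, V = np.zeros(n), np.zeros(n)
lam, U = np.zeros(s), np.zeros((n, s))
E = np.zeros((n, n))
visits = np.zeros(n)

for k in range(1, T + 1):
    x = rng.integers(n)                                    # X_k ~ Unif(X)
    x_next = rng.choice(n, p=P[x])                         # X'_k ~ P^pi(. | X_k)
    counts[x, x_next] += 1
    if k % K == 0:                                         # refresh E_s from the current model
        P_hat = counts / counts.sum(axis=1, keepdims=True)
        lam, U = top_schur(P_hat, s)
        E = U @ np.diag(lam) @ U.T
        W = (np.eye(n) - alpha * gamma * E) @ V
    visits[x] += 1
    eta = 1.0 / visits[x]
    td = alpha * r[x] + alpha * gamma * V[x_next] - alpha * gamma * (E @ V)[x] \
         + (1 - alpha) * V[x] - W[x]
    W[x] += eta * td
    V = (np.eye(n) + U @ np.diag(alpha * gamma * lam / (1 - alpha * gamma * lam)) @ U.T) @ W

print(np.max(np.abs(V - V_pi)))                            # decreases toward 0 as T grows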
Figure 1: Comparison of DDVI with different ranks, AutoPI, and AutoQR against VI in (left) Chain Walk and (right) Maze. The rate of DDVI with AutoPI and AutoQR changes when $E_{s}$ is updated.

Figure 2: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (top-left) and wall-clock time (top-right). Runtimes to reach a normalized error of $10^{-4}$ are shown against the number of states (bottom-left) and the horizon $1/(1-\gamma)$ (bottom-right). Plots are averages over 20 randomly generated Garnet MDPs, with shaded areas showing the standard error.

Figure 3: Comparison of DDTD with QR iteration, Dyna, and TD learning in (left) Chain Walk and (right) Maze.

6.  Experiments

For our experiments, we use the following environments: Maze with $5\times 5$ states and $4$ actions, Cliffwalk with $3\times 7$ states and $4$ actions, Chain Walk with $50$ states and $2$ actions, and random Garnet MDPs [Bhatnagar et al., 2009] with $200$ states. The discount factor is set to $\gamma=0.99$ for the DDTD experiments and $\gamma=0.995$ in the other experiments. Appendix 10 provides full definitions of the environments and policies we used. All experiments were carried out on local CPUs. For comparisons, we use the normalized error of $V^{k}$, defined as $\|V^{k}-V^{\pi}\|_{1}/\|V^{\pi}\|_{1}$.

DDVI with AutoPI, AutoQR, and different fixed ranks.

In Figure 1, we compare rank-$s$ DDVI for multiple values of $s$, and DDVI with AutoPI and AutoQR, against VI in Chain Walk and Maze. The QR Iteration (Section 2) is used to calculate $E_{s}$ with Schur vectors. In almost all cases, DDVI exhibits a significantly faster convergence rate than VI. Aligned with the theory, we observe that higher ranks achieve better convergence rates. Also, as DDVI with AutoPI and AutoQR progresses and updates $E_{s}$, its convergence rate improves.

DDVI and other baselines.

We perform an extensive comparison of DDVI against prior accelerated VI methods: Safe Accelerated VI (S-AVI) [Goyal and Grand-Clément, 2022], Anderson VI [Geist and Scherrer, 2018], PID VI [Farahmand and Ghavamzadeh, 2021], and Anchored VI [Lee and Ryu, 2023]. In this experiment, we use the Implicitly Restarted Arnoldi Method [Lehoucq et al., 1998] from the SciPy package to calculate the eigenvalues and eigenvectors for DDVI. Figure 2 (top-left) shows the convergence behaviour of the algorithms by iteration count in 20 randomly generated Garnet environments, where our algorithms outperform all the baselines. This comparison might not be fair, as the amount of computation needed for each iteration of the algorithms is not the same. Therefore, in Figure 2 (top-right), we compare the algorithms by wall-clock time. It can be seen that rank-$2$ DDVI initially has to spend time calculating $E_{s}$ before it starts to update the value function, but after the slow start, it has the fastest rate. The fast rate can compensate for the initial time if high accuracy is needed. DDVI with AutoQR and rank-$1$ DDVI show fast convergence from the beginning.

We further investigate how the algorithms scale with the size of the MDP and the discount factor. Figure 2 (bottom-left) shows the runtime to reach a normalized error of $10^{-4}$ against the number of states. DDVI with AutoQR and rank-$1$ DDVI show the best scaling. Note that even rank-$2$ DDVI, which calculates a non-trivial $E_{s}$, also scales competitively. Figure 2 (bottom-right) shows the scaling behaviour with the horizon of the MDP, which is $1/(1-\gamma)$. We measure the runtime to reach a normalized error of $10^{-4}$ for horizons ranging from $100$ to $1000$, corresponding to $\gamma$ ranging from $0.99$ to $0.999$. Remarkably, the DDVI algorithms have a low, nearly constant runtime even for long-horizon tasks with $\gamma\approx 1$. This is aligned with our theoretical result, as the rate $\gamma|\lambda_{s+1}|$ remains small as $\gamma$ approaches $1$.

Rank-$s$ DDTD with QR iteration.

We compare DDTD with TD Learning and Dyna. To make the setting more realistic, we consider the case where the approximate model $\hat{\mathcal{P}}$ cannot exactly learn the true dynamics. Figure 3 shows that DDTD with a large enough rank can outperform TD. Note that unlike Dyna, DDTD does not suffer from model error, which shows that DDTD only uses the model for acceleration and is not a purely model-based algorithm.

7.  Conclusion

In this work, we proposed a framework for accelerating VI through matrix deflation. We theoretically analyzed our proposed methods, DDVI and DDTD, and presented experimental results showing speedups in various setups.

The positive experimental results demonstrate matrix deflation to be a promising technique that may be applicable to a broader range of RL algorithms. One direction of future work is to extend the theoretical analysis of DDVI for Control to a general rank-$s$ deflation matrix. Theoretically analyzing other RL methods combined with matrix deflation is another interesting direction.

Acknowledgements

AMF acknowledges the funding from the Natural Sciences and Engineering Research Council of Canada (NSERC) through the Discovery Grant program (2021-03701) and the Canada CIFAR AI Chairs program.

References

  • Akian et al. [2022] M. Akian, S. Gaubert, Z. Qu, and O. Saadi. Multiply accelerated value iteration for non-symmetric affine fixed point problems and application to Markov decision processes. SIAM Journal on Matrix Analysis and Applications, 43(1):199–232, 2022.
  • Andre et al. [1997] D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. Neural Information Processing Systems, 1997.
  • Azar et al. [2011] M. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. Neural Information Processing Systems, 2011.
  • Bacon [2018] P. Bacon. Temporal Representation Learning. PhD thesis, McGill University, 2018.
  • Bacon and Precup [2016] P. Bacon and D. Precup. A matrix splitting perspective on planning with options. arXiv:1612.00916, 2016.
  • Banach [1922] S. Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.
  • Bertsekas [1995] D. P. Bertsekas. Generic rank-one corrections for value iteration in markovian decision problems. Operations research letters, 17(3):111–119, 1995.
  • Bertsekas and Tsitsiklis [1996] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
  • Bhandari et al. [2018] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. 2018.
  • Bhatnagar et al. [2009] S. Bhatnagar, R. S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.
  • Borkar [1998] Vivek S Borkar. Asynchronous stochastic approximations. SIAM Journal on Control and Optimization, 36(3):840–851, 1998.
  • Borkar and Meyn [2000] Vivek S Borkar and Sean P Meyn. The ode method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
  • Bowen et al. [2021] W. Bowen, X. Huaqing, Z. Lin, L. Yingbin, and Z. Wei. Finite-time theory for momentum Q-learning. Conference on Uncertainty in Artificial Intelligence, 2021.
  • Bru et al. [2012] R. Bru, R. Canto, R. L. Soto, and A. M. Urbano. A Brauer’s theorem and related results. Central European Journal of Mathematics, 10:312–321, 2012.
  • Chen and Jiang [2019] J. Chen and N. Jiang. Information-theoretic considerations in batch reinforcement learning. International Conference on Machine Learning, 2019.
  • Chen et al. [2020] S. Chen, A. Devraj, A. Busic, and S. Meyn. Explicit mean-square error bounds for monte-carlo and linear stochastic approximation. International Conference on Artificial Intelligence and Statistics, 2020.
  • Chen et al. [2023] Z. Chen, S. T Maguluri, S. Shakkottai, and K. Shanmugam. A lyapunov theory for finite-sample guarantees of markovian stochastic approximation. Operations Research, 2023.
  • Dai et al. [2011] P. Dai, D. S. Weld, J. Goldsmith, et al. Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181–209, 2011.
  • Dalal et al. [2018] G. Dalal, B. Szörényi, G. Thoppe, and S. Mannor. Finite sample analyses for td (0) with function approximation. Association for the Advancement of Artificial Intelligence, 2018.
  • Devraj and Meyn [2021] A. M. Devraj and S. P. Meyn. Q-learning with uniformly bounded variance. IEEE Transactions on Automatic Control, 67(11):5948–5963, 2021.
  • Ermis and Yang [2020] M. Ermis and I. Yang. A3DQN: Adaptive Anderson acceleration for deep Q-networks. IEEE Symposium Series on Computational Intelligence, 2020.
  • Ermis et al. [2021] M. Ermis, M. Park, and I. Yang. On Anderson acceleration for partially observable Markov decision processes. IEEE Conference on Decision and Control, 2021.
  • Ernst et al. [2005] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 2005.
  • Even-Dar and Mansour [2003] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 2003.
  • Fan et al. [2020] J. Fan, Z. Wang, Y. Xie, and Z. Yang. A theoretical analysis of deep Q-learning. Learning for dynamics and control, 2020.
  • Farahmand and Ghavamzadeh [2021] A. Farahmand and M. Ghavamzadeh. PID accelerated value iteration algorithm. International Conference on Machine Learning, 2021.
  • Farahmand et al. [2010] A. Farahmand, R. Munos, and C. Szepesvári. Error propagation for approximate policy and value iteration. Neural Information Processing Systems, 2010.
  • Geist and Scherrer [2018] M. Geist and B. Scherrer. Anderson acceleration for reinforcement learning. European Workshop on Reinforcement Learning, 2018.
  • Golub and Van Loan [2013] G. H. Golub and C. F. Van Loan. Matrix Computations. The John Hopkins University Press, 4th edition, 2013.
  • Gordon [1995] G. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995.
  • Goyal and Grand-Clément [2022] V. Goyal and J. Grand-Clément. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2022.
  • Grand-Clément [2021] J. Grand-Clément. From convex optimization to MDPs: A review of first-order, second-order and quasi-newton methods for MDPs. arXiv:2104.10677, 2021.
  • Gupta et al. [2015] A. Gupta, R. Jain, and P. W Glynn. An empirical algorithm for relative value iteration for average-cost MDPs. IEEE Conference on Decision and Control, 2015.
  • Hastings [1968] N. A. J. Hastings. Some notes on dynamic programming and replacement. Journal of the Operational Research Society, 19:453–464, 1968.
  • Hillen [2023] T. Hillen. Elements of Applied Functional Analysis. AMS Open Notes, 2023.
  • Hunter and Nachtergaele [2001] J. K. Hunter and B. Nachtergaele. Applied analysis. World Scientific Publishing Company, 2001.
  • Jaakkola et al. [1993] T. Jaakkola, M. Jordan, and S. Singh. Convergence of stochastic iterative dynamic programming algorithms. Advances in neural information processing systems, 1993.
  • Kushner and Kleinman [1968] H. Kushner and A. Kleinman. Numerical methods for the solution of the degenerate nonlinear elliptic equations arising in optimal stochastic control theory. IEEE Transactions on Automatic Control, 1968.
  • Kushner and Kleinman [1971] H. Kushner and A. Kleinman. Accelerated procedures for the solution of discrete Markov control problems. IEEE Transactions on Automatic Control, 1971.
  • Lagoudakis and Parr [2003] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 2003.
  • Lakshminarayanan and Szepesvari [2018] C. Lakshminarayanan and C. Szepesvari. Linear stochastic approximation: How far does constant step-size and iterate averaging go? International Conference on Artificial Intelligence and Statistics, 2018.
  • Lee and Ryu [2023] J. Lee and E. K. Ryu. Accelerating value iteration with anchoring. Neural Information Processing Systems, 2023.
  • Lehoucq et al. [1998] R.B. Lehoucq, D.C. Sorensen, and C. Yang. ARPACK Users’ Guide: Solution of Large-scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. Society for Industrial and Applied Mathematics, 1998.
  • Mackey [2008] L. Mackey. Deflation methods for sparse PCA. In Neural Information Processing Systems, 2008.
  • McMahan and Gordon [2005] H. B. McMahan and G. J. Gordon. Fast exact planning in Markov decision processes. International Conference on Automated Planning and Scheduling, 2005.
  • Meirovitch [1980] L. Meirovitch. Computational methods in structural dynamics. Springer Science & Business Media, 1980.
  • Meyn [2022] S. Meyn. Control Systems and Reinforcement Learning. Cambridge University Press, 2022.
  • Mnih et al. [2015] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, and et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Moore and Atkeson [1993] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 1993.
  • Morgan [1995] R. B. Morgan. A restarted gmres method augmented with eigenvectors. SIAM Journal on Matrix Analysis and Applications, 16(4):1154–1171, 1995.
  • Munos and Szepesvári [2008] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 2008.
  • Park et al. [2022] M. Park, J. Shin, and I. Yang. Anderson acceleration for partially observable Markov decision processes: A maximum entropy approach. arXiv:2211.14998, 2022.
  • Peng and Williams [1993] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
  • Porteus [1975] E. L. Porteus. Bounds and transformations for discounted finite markov decision chains. Operations Research, 23(4):761–784, 1975.
  • Rakhsha et al. [2022] A. Rakhsha, A. Wang, M. Ghavamzadeh, and A. Farahmand. Operator splitting value iteration. Neural Information Processing Systems, 2022.
  • Reetz [1973] D. Reetz. Solution of a markovian decision problem by successive overrelaxation. Zeitschrift für Operations Research, 17:29–32, 1973.
  • Saad [2011] Y. Saad. Numerical methods for large eigenvalue problems: revised edition. SIAM, 2011.
  • Saad et al. [2000] Y. Saad, M. Yeung, J. Erhel, and F. Guyomarc’h. A deflated version of the conjugate gradient algorithm. SIAM Journal on Scientific Computing, 21(5):1909–1926, 2000.
  • Sharma et al. [2020] H. Sharma, M. Jafarnia-Jahromi, and R. Jain. Approximate relative value learning for average-reward continuous state mdps. Uncertainty in Artificial Intelligence, 2020.
  • Shi et al. [2019] W. Shi, S. Song, H. Wu, Y. Hsu, C. Wu, and G. Huang. Regularized Anderson acceleration for off-policy deep reinforcement learning. Neural Information Processing Systems, 2019.
  • Shroff and Keller [1993] G. M. Shroff and H. B. Keller. Stabilization of unstable procedures: the recursive projection method. SIAM Journal on numerical analysis, 30(4):1099–1120, 1993.
  • Sidford et al. [2023] A. Sidford, M. Wang, X. Wu, and Y. Ye. Variance reduced value iteration and faster algorithms for solving markov decision processes. Naval Research Logistics, 70(5):423–442, 2023.
  • Soto and Rojo [2006] R. L. Soto and O. Rojo. Applications of a Brauer theorem in the nonnegative inverse eigenvalue problem. Linear algebra and its applications, 416(2-3):844–856, 2006.
  • Sun et al. [2021] K. Sun, Y. Wang, Y. Liu, B. Pan, S. Jui, B. Jiang, L. Kong, et al. Damped Anderson mixing for deep reinforcement learning: Acceleration, convergence, and stabilization. Neural Information Processing Systems, 2021.
  • Sutton [1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.
  • Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
  • Szepesvári [1997] C. Szepesvári. The asymptotic convergence-rate of Q-learning. Neural Information Processing Systems, 1997.
  • Szepesvári [2010] C. Szepesvári. Algorithms for Reinforcement Learning. Morgan Claypool Publishers, 2010.
  • Tsitsiklis and V. Roy [1997] J. N. Tsitsiklis and B. V. Roy. An analysis of temporal difference learning with function approximation. IEEE Transactions on Automatic Control, 1997.
  • Vieillard et al. [2020] N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.
  • Wainwright [2019] M. J. Wainwright. Stochastic approximation with cone-contractive operators: Sharp \ell_{\infty}-bounds for qq-learning. arXiv:1905.06265, 2019.
  • Watkins [1989] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, University of Cambridge, 1989.
  • White [1963] D. J. White. Dynamic programming, markov chains, and the method of successive approximations. J. Math. Anal. Appl, 6(3):373–376, 1963.
  • Wingate et al. [2005] D. Wingate, K. D. Seppi, and S. Mahadevan. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 2005.

Appendix

7.1.  Prior works

Acceleration in RL.

Prioritized sweeping and its several variants [Moore and Atkeson, 1993, Peng and Williams, 1993, McMahan and Gordon, 2005, Wingate et al., 2005, Andre et al., 1997, Dai et al., 2011] specify the order of asynchronous value function updates, which may lead to accelerated convergence to the true value function. Speedy Q-learning [Azar et al., 2011] changes the update rule of Q-learning and employs aggressive learning rates to accelerate convergence. Sidford et al. [2023] accelerate the computation of $\mathcal{P}^{\pi}V^{k}$ by sampling it with variance reduction techniques. Recently, there has been a growing body of research that explores the application of acceleration techniques from other areas of applied math to planning and RL: Geist and Scherrer [2018], Sun et al. [2021], Ermis and Yang [2020], Park et al. [2022], Ermis et al. [2021], Shi et al. [2019] apply Anderson acceleration of fixed-point iterations, Lee and Ryu [2023] apply Anchor acceleration of minimax optimization, Vieillard et al. [2020], Goyal and Grand-Clément [2022], Grand-Clément [2021], Bowen et al. [2021], Akian et al. [2022] apply Nesterov acceleration of convex optimization, and Farahmand and Ghavamzadeh [2021] apply ideas inspired by PID controllers in control theory.

Matrix deflation.

Deflation techniques were first developed for eigenvalue computation [Meirovitch, 1980, Saad, 2011, Golub and Van Loan, 2013]. Matrix deflation, which eliminates the top eigenvalues of a given matrix while leaving the rest of the eigenvalues untouched, has been applied to various algorithms such as the conjugate gradient algorithm [Saad et al., 2000], principal component analysis [Mackey, 2008], the generalized minimal residual method [Morgan, 1995], and nonlinear fixed-point iteration [Shroff and Keller, 1993] to improve convergence.

Some prior work in RL has explored subtracting constants from the iterates of VI, resulting in methods that resemble our rank-$1$ DDVI. For discounted MDPs, Devraj and Meyn [2021] propose relative Q-learning, which can roughly be seen as the Q-learning version of rank-$1$ DDVI for Control, and Bertsekas [1995] proposes an extrapolation method for VI, which can be seen as rank-$1$ DDVI with a Schur deflation matrix for PE. White [1963] introduced relative value iteration in the context of undiscounted MDPs with average reward, and several variants were proposed [Gupta et al., 2015, Sharma et al., 2020].

Matrix splitting of value iteration.

Matrix splitting has been studied in the RL literature to obtain acceleration. Hastings [1968] and Kushner and Kleinman [1968] first suggested the Gauss-Seidel iteration for computing the value function. Kushner and Kleinman [1971] and Reetz [1973] applied Successive Over-Relaxation to VI and generalized the Jacobi and Gauss-Seidel iterations. Porteus [1975] proposed several transformations of MDPs that can be seen as a Gauss–Seidel variant of VI. Rakhsha et al. [2022] used matrix splitting with the approximate dynamics of the environment and extended it to nonlinear operator splitting. Bacon and Precup [2016], Bacon [2018] analyzed planning with options through the matrix splitting perspective.

Convergence analysis of TD learning.

Jaakkola et al. [1993] first proved the convergence of TD learning using the stochastic approximation (SA) technique. Borkar and Meyn [2000], Borkar [1998] suggested an ODE-based framework that provides asymptotic convergence of SA, including TD learning with asynchronous updates. Lakshminarayanan and Szepesvari [2018] study linear SA under i.i.d. noise with respect to the mean square error, and Chen et al. [2020] study the asymptotic convergence rate of linear SA under Markovian noise. A finite-time analysis of TD learning was first provided by Dalal et al. [2018] and extended to the Markovian noise setting by Bhandari et al. [2018]. Leveraging Lyapunov theory, Chen et al. [2023] establish a finite-time analysis of Markovian SA with explicit bounds.

8.  Proof of Theoretical Results

8.1.  Proof of Theorem 3.1

By the definition of DDVI (2) and the fixed point $V^{\pi}$, we have

V^{k}-V^{\pi}
=(I-\alpha\gamma E_{s})^{-1}((1-\alpha)I+\alpha\gamma(\mathcal{P}^{\pi}-E_{s}))(V^{k-1}-V^{\pi})
=(I-\alpha\gamma E_{s})^{-1}((1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1})((1-\alpha)I+\alpha\gamma(\mathcal{P}^{\pi}-E_{s}))(V^{k-2}-V^{\pi})
=(I-\alpha\gamma E_{s})^{-1}((1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1})^{k-1}((1-\alpha)I+\alpha\gamma(\mathcal{P}^{\pi}-E_{s}))(V^{0}-V^{\pi}),

where $E_{s}$ is a rank-$s$ deflation matrix of $\mathcal{P}^{\pi}$ satisfying the conditions of Facts 1, 2, or 3. This implies that

\|V^{k}-V^{\pi}\|\leq C\left\|((1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1})^{k-1}\right\|\|V^{0}-V^{\pi}\|

for some constant $C\in\mathbb{R}$.

First, suppose $E_{s}$ satisfies Fact 1 or 2. Let $U_{s}=[u_{1},\dots,u_{s}]$ and $V_{s}=[v_{1},\dots,v_{s}]$. Define $D_{s,f(\lambda_{i})}=\text{diag}(f(\lambda_{1}),\dots,f(\lambda_{s}))$ for a function $f$. Then $U_{s}$ and $V_{s}$ are $n\times s$ matrices, $D_{s,f(\lambda_{i})}$ is an $s\times s$ matrix, and $E_{s}=U_{s}D_{s,\lambda_{i}}V_{s}^{\top}$. By the Jordan decomposition, $\mathcal{P}^{\pi}=UJU^{-1}$, where $J$ is a Jordan matrix with $J_{ii}=\lambda_{i}$ for $i=1,\dots,n$. Let $J_{s}$ be the $s\times s$ submatrix of $J$ satisfying $(J_{s})_{ij}=J_{ij}$ for $1\leq i,j\leq s$. Then, by the conditions of Theorem 3.1, $J_{s}=\text{diag}(\lambda_{1},\dots,\lambda_{s})=D_{s,\lambda_{i}}$ and the $i$-th column of $U$ is $u_{i}$ for $1\leq i\leq s$. By a simple calculation, we get $\mathcal{P}^{\pi}U_{s}=U_{s}J_{s}=U_{s}D_{s,\lambda_{i}}$ and $(I-\alpha\gamma E_{s})^{-1}=I+U_{s}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s}$.

Since

(\mathcal{P}^{\pi}-U_{s}D_{s,\lambda_{i}}V^{\top}_{s})(U_{s}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s}) = U_{s}D_{s,\lambda_{i}}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s}-U_{s}D_{s,\lambda_{i}}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s} = 0,

we get

(𝒫πEs)(IαγEs)1\displaystyle(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1} =𝒫πEs.\displaystyle=\mathcal{P}^{\pi}-E_{s}.

Then,

(1α)(IαγEs)1+αγ(𝒫πEs)(IαγEs)1\displaystyle(1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1}
=(1α)(I+UsDs,γαλi1γαλiVs)+αγ(UJU1UsDs,λiVs)\displaystyle=(1-\alpha)\left(I+U_{s}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s}\right)+\alpha\gamma\left(UJU^{-1}-U_{s}D_{s,\lambda_{i}}V^{\top}_{s}\right)
=(1α)I+αγUJU1+UsDs,(λiγ1)γα2λi1γαλiVs\displaystyle=(1-\alpha)I+\alpha\gamma UJU^{-1}+U_{s}D_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s}
=U((1α)I+αγJ+e1:sDs,(λiγ1)γα2λi1γαλiVsU)U1,\displaystyle=U\left((1-\alpha)I+\alpha\gamma J+e_{1:s}D_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V_{s}^{\top}U\right)U^{-1},

where e_{i}\in\mathbb{R}^{n} is the i-th unit vector and e_{1:s}=[e_{1},\dots,e_{s}], and (1-\alpha)I+\alpha\gamma J+e_{1:s}D_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V_{s}^{\top}U is an upper triangular matrix with diagonal entries \frac{(1-\alpha)}{1-\alpha\gamma\lambda_{1}},\dots,\frac{(1-\alpha)}{1-\alpha\gamma\lambda_{s}},(1-\alpha)+\alpha\gamma\lambda_{s+1},\dots,(1-\alpha)+\alpha\gamma\lambda_{n}.

Now suppose Es=i=1sλiuiuiE_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}u_{i}^{\top} satisfies Fact  3. Similarly, let Us=[u1,,us]U_{s}=[u_{1},\dots,u_{s}]. Then, Es=UsDs,λiUsE_{s}=U_{s}D_{s,\lambda_{i}}U^{\top}_{s}. By Schur decomposition, 𝒫π=URU\mathcal{P}^{\pi}=URU^{\top} where RR is an upper triangular matrix with Rii=λiR_{ii}=\lambda_{i} for i=1,,ni=1,\dots,n, and UU is a unitary matrix. By simple calculation, we have 𝒫πUs=UsRs\mathcal{P}^{\pi}U_{s}=U_{s}R_{s} where RsR_{s} is s×ss\times s submatrix of RR such that (Rs)ij=Rij(R_{s})_{ij}=R_{ij} for 1i,js1\leq i,j\leq s and (IαγEs)1=I+UsDs,γαλi1γαλiUs(I-\alpha\gamma E_{s})^{-1}=I+U_{s}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}.

Since

(𝒫πUsDs,λiUs)(UsDs,γαλi1γαλiUs)\displaystyle(\mathcal{P}^{\pi}-U_{s}D_{s,\lambda_{i}}U^{\top}_{s})(U_{s}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}) =UsRsDs,γαλi1γαλiUsUsDs,λiDs,γαλi1γαλiUs\displaystyle=U_{s}R_{s}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}-U_{s}D_{s,\lambda_{i}}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}
=Us(RsDs,λi)Ds,γαλi1γαλiUs,\displaystyle=U_{s}(R_{s}-D_{s,\lambda_{i}})D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s},

we get

(𝒫πEs)(IαγEs)1\displaystyle(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1} =(𝒫πUsDs,λiUs)(I+UsDs,γαλi1γαλiUs)\displaystyle=(\mathcal{P}^{\pi}-U_{s}D_{s,\lambda_{i}}U^{\top}_{s})(I+U_{s}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s})
=(𝒫πUsDs,λiUs)+Us(RsDs,λi)Ds,γαλi1γαλiUs.\displaystyle=(\mathcal{P}^{\pi}-U_{s}D_{s,\lambda_{i}}U^{\top}_{s})+U_{s}(R_{s}-D_{s,\lambda_{i}})D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}.

Then,

(1α)(IαγEs)1+αγ(𝒫πEs)(IαγEs)1\displaystyle(1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1}
=(1α)I+(1α)UsDs,γαλi1γαλiUs+αγURUαγUsDs,λiUs+αγUs(RsDs,λi)Ds,γαλi1γαλiUs\displaystyle=(1-\alpha)I+(1-\alpha)U_{s}D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}+\alpha\gamma URU^{\top}-\alpha\gamma U_{s}D_{s,\lambda_{i}}U^{\top}_{s}+\alpha\gamma U_{s}(R_{s}-D_{s,\lambda_{i}})D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}
=(1α)I+αγURU+Us(Ds,(1α)γαλi1γαλiDs,αγλi+αγ(RsDs,λi)Ds,γαλi1γαλi)Us\displaystyle=(1-\alpha)I+\alpha\gamma URU^{\top}+U_{s}\left(D_{s,\frac{(1-\alpha)\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}-D_{s,\alpha\gamma\lambda_{i}}+\alpha\gamma(R_{s}-D_{s,\lambda_{i}})D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}\right)U^{\top}_{s}
=(1α)I+αγURU+Us(D(λiγ1)γα2λi1γαλi+αγ(RsDλi)Dγαλi1γαλi)Us\displaystyle=(1-\alpha)I+\alpha\gamma URU^{\top}+U_{s}\left(D_{\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}+\alpha\gamma(R_{s}-D_{\lambda_{i}})D_{\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}\right)U^{\top}_{s}
=(1α)I+αγURU+UsRs,(λiγ1)γα2λi1γαλiUs\displaystyle=(1-\alpha)I+\alpha\gamma URU^{\top}+U_{s}R_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}
=U((1α)I+αγR+e1:sRs,(λiγ1)γα2λi1γαλie1:s)U\displaystyle=U\left((1-\alpha)I+\alpha\gamma R+e_{1:s}R_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}e_{1:s}^{\top}\right)U^{\top}

where R_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}} is an s\times s upper triangular matrix with diagonal entries \frac{(\lambda_{1}\gamma-1)\gamma\alpha^{2}\lambda_{1}}{1-\gamma\alpha\lambda_{1}},\dots,\frac{(\lambda_{s}\gamma-1)\gamma\alpha^{2}\lambda_{s}}{1-\gamma\alpha\lambda_{s}}, satisfying R_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}=D_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}+\alpha\gamma(R_{s}-D_{s,\lambda_{i}})D_{s,\frac{\gamma\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}. By a simple calculation, we know that (1-\alpha)I+\alpha\gamma R+e_{1:s}R_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}e_{1:s}^{\top} is an n\times n upper triangular matrix with diagonal entries \frac{(1-\alpha)}{1-\alpha\gamma\lambda_{1}},\dots,\frac{(1-\alpha)}{1-\alpha\gamma\lambda_{s}},(1-\alpha)+\alpha\gamma\lambda_{s+1},\dots,(1-\alpha)+\alpha\gamma\lambda_{n}.

Therefore, by spectral analysis, we conclude that

VkVπ=O~(|λ|kV0Vπ)\displaystyle\|V^{k}-V^{\pi}\|=\tilde{O}(|\lambda|^{k}\|V^{0}-V^{\pi}\|)

as kk\rightarrow\infty, where

λ=max1is,s+1jn{|1α1αγλi|,|1α+αγλj|}.\lambda=\max_{1\leq i\leq s,s+1\leq j\leq n}\Big{\{}\left|\frac{1-\alpha}{1-\alpha\gamma\lambda_{i}}\right|,\left|1-\alpha+\alpha\gamma\lambda_{j}\right|\Big{\}}.
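As a numerical sanity check on this rate (not part of the proof), the following NumPy sketch builds a random transition matrix, forms the rank-1 deflation E_{1}=\mathbf{1}\mu^{\top} from its stationary distribution \mu (a special case of Fact 2 with s=1), and compares the spectral radius of the DDVI iteration matrix with the factor \lambda above. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, alpha = 30, 0.99, 0.9

# Random transition matrix P (rows sum to 1) standing in for P^pi.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)

# Rank-1 deflation E_1 = 1 mu^T built from the stationary distribution mu
# (lambda_1 = 1, right eigenvector 1, left eigenvector mu), as in Fact 2.
evals, left_vecs = np.linalg.eig(P.T)
mu = np.real(left_vecs[:, np.argmax(np.real(evals))])
mu /= mu.sum()
E1 = np.outer(np.ones(n), mu)

# DDVI iteration matrix (1-alpha)(I - alpha*gamma*E_1)^{-1}
#   + alpha*gamma*(P - E_1)(I - alpha*gamma*E_1)^{-1} and its spectral radius.
Minv = np.linalg.inv(np.eye(n) - alpha * gamma * E1)
M = ((1 - alpha) * np.eye(n) + alpha * gamma * (P - E1)) @ Minv
rho_M = np.abs(np.linalg.eigvals(M)).max()

# Rate factor lambda from Theorem 3.1 with s = 1.
lams = np.linalg.eigvals(P)
lams = lams[np.argsort(-np.abs(lams))]            # sort by magnitude
rate = max(abs(1 - alpha) / abs(1 - alpha * gamma * lams[0]),
           np.abs(1 - alpha + alpha * gamma * lams[1:]).max())
print(rho_M, rate)   # the two values should agree up to numerical error
```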

8.2.  Proof of Corollary 3.2

Putting α=1\alpha=1 in Theorem 3.1, we conclude Corollary 3.2.

8.3.  Proof of Theorem 4.1

By the definition of DDVI for Control (3),

VkV=(I+γ1γE1)(Wk(IγE1)V)\displaystyle V^{k}-V^{\star}=\left(I+\frac{\gamma}{1-\gamma}E_{1}\right)\left(W^{k}-\left(I-\gamma E_{1}\right)V^{\star}\right)

where E1=𝟏vE_{1}=\mathbf{1}v^{\top}. Let W=(IγE1)VW^{\star}=\left(I-\gamma E_{1}\right)V^{\star}. Note that

maxπ{rπ+γ(𝒫πE1)W}\displaystyle\max_{\pi}\{r^{\pi}+\gamma({\mathcal{P}}^{\pi}-E_{1})W^{\star}\} =maxπ{rπ+γ(𝒫πE1)V}\displaystyle=\max_{\pi}\{r^{\pi}+\gamma({\mathcal{P}}^{\pi}-E_{1})V^{\star}\}
=maxπ{rπ+γ𝒫πV}γE1V\displaystyle=\max_{\pi}\{r^{\pi}+\gamma{\mathcal{P}}^{\pi}V^{\star}\}-\gamma E_{1}V^{\star}
=W.\displaystyle=W^{\star}.

Thus,

WkW=γ(𝒫πgE1)Wk1+rπgγ(𝒫πE1)Wrπ\displaystyle W^{k}-W^{\star}=\gamma\left({\mathcal{P}^{\pi_{g}}-E_{1}}\right)W^{k-1}+r^{\pi_{g}}-\gamma\left({\mathcal{P}^{\pi^{\star}}-E_{1}}\right)W^{\star}-r^{\pi^{\star}}

where \pi_{g}=\operatorname*{argmax}_{\pi}\{r^{\pi}+\gamma({\mathcal{P}}^{\pi}-E_{1})W^{k-1}\} and \pi^{\star} is the optimal policy. This implies that

WkW\displaystyle W^{k}-W^{\star} γ(𝒫πgE1)(Wk1W),\displaystyle\leq\gamma\left({\mathcal{P}^{\pi_{g}}-E_{1}}\right)(W^{k-1}-W^{\star}),
WkW\displaystyle W^{k}-W^{\star} γ(𝒫πE1)(Wk1W),\displaystyle\geq\gamma\left({\mathcal{P}^{\pi^{\star}}-E_{1}}\right)(W^{k-1}-W^{\star}),

and

γ𝒫π(Wk1W)WkW+γE1(Wk1W)\displaystyle\gamma\mathcal{P}^{\pi^{\star}}(W^{k-1}-W^{\star})\leq W^{k}-W^{\star}+\gamma E_{1}(W^{k-1}-W^{\star}) γ𝒫πg(Wk1W).\displaystyle\leq\gamma\mathcal{P}^{\pi_{g}}(W^{k-1}-W^{\star}).

Then, for each coordinate i, there exists t_{i}\in[0,1] such that

ti(γ𝒫π(Wk1W))i+(1ti)(γ𝒫πg(Wk1W))i=(WkW+γE1(Wk1W))it_{i}\left({\gamma\mathcal{P}^{\pi^{\star}}(W^{k-1}-W^{\star})}\right)_{i}+(1-t_{i})\left({\gamma\mathcal{P}^{\pi_{g}}(W^{k-1}-W^{\star})}\right)_{i}=\left({W^{k}-W^{\star}+\gamma E_{1}(W^{k-1}-W^{\star})}\right)_{i}

for 1in1\leq i\leq n. Define πk(a|i)=tiπ(a|i)+(1ti)πg(a|i)\pi_{k}(a\,|\,i)=t_{i}\pi^{\star}(a\,|\,i)+(1-t_{i})\pi_{g}(a\,|\,i) for all a𝒜a\in\mathcal{A} and 1in1\leq i\leq n. Then πk\pi_{k} satisfies

WkW=γ(𝒫πkE1)(Wk1W).\displaystyle W^{k}-W^{\star}=\gamma\left({\mathcal{P}^{\pi_{k}}-E_{1}}\right)(W^{k-1}-W^{\star}).

Thus, we get

WkW\displaystyle W^{k}-W^{\star} =γki=1k(𝒫πiE1)(W0W)\displaystyle=\gamma^{k}\prod_{i=1}^{k}\left({\mathcal{P}^{\pi_{i}}-E_{1}}\right)(W^{0}-W^{\star})
=γki=1k(𝒫πiE1)(IγE1)(V0V)\displaystyle=\gamma^{k}\prod_{i=1}^{k}\left({\mathcal{P}^{\pi_{i}}-E_{1}}\right)(I-\gamma E_{1})(V^{0}-V^{\star})
=γk(𝒫πkE1)i=1k1𝒫πi(V0V),\displaystyle=\gamma^{k}\left({\mathcal{P}^{\pi_{k}}-E_{1}}\right)\prod_{i=1}^{k-1}\mathcal{P}^{\pi_{i}}(V^{0}-V^{\star}),

and this implies that

VkV\displaystyle V^{k}-V^{\star} =γk(I+γ1γE1)(𝒫πkE1)i=1k1𝒫πi(V0V)\displaystyle=\gamma^{k}\left(I+\frac{\gamma}{1-\gamma}E_{1}\right)\left({\mathcal{P}^{\pi_{k}}-E_{1}}\right)\prod_{i=1}^{k-1}\mathcal{P}^{\pi_{i}}(V^{0}-V^{\star})
=γk(𝒫πk+γ1γE1𝒫πk11γ𝟏v)i=1k1𝒫πi(V0V).\displaystyle=\gamma^{k}\left(\mathcal{P}^{\pi_{k}}+\frac{\gamma}{1-\gamma}E_{1}\mathcal{P}^{\pi_{k}}-\frac{1}{1-\gamma}\mathbf{1}v^{\top}\right)\prod_{i=1}^{k-1}\mathcal{P}^{\pi_{i}}(V^{0}-V^{\star}).

Then, we have

VkV21γγkV0V\|V^{k}-V^{\star}\|\leq\frac{2}{1-\gamma}\gamma^{k}\|V^{0}-V^{\star}\|

since E_{1}\mathcal{P}^{\pi_{k}} is a stochastic matrix with \|E_{1}\mathcal{P}^{\pi_{k}}\|=1.

Suppose there exists a unique optimal policy \pi^{\star}. Then, if (non-deflated) VI generates V^{k} for k=0,1,\dots, there exists K such that \operatorname*{argmax}_{\pi}T^{\pi}V^{k}=\pi^{\star} for k>K. Since DDVI for Control generates the same greedy policies as VI for Control, DDVI for Control also uses the greedy policy \pi^{\star} for iterations k>K. Therefore, we get

VkV\displaystyle\|V^{k}-V^{\star}\| =γk(I+γ1γ𝟏v)(𝒫πE1)kKi=1K(𝒫πiE1)(V0V)\displaystyle=\left\|\gamma^{k}\left(I+\frac{\gamma}{1-\gamma}\mathbf{1}v^{\top}\right)\left(\mathcal{P}^{\pi^{\star}}-E_{1}\right)^{k-K}\prod_{i=1}^{K}\left({\mathcal{P}^{\pi_{i}}-E_{1}}\right)(V^{0}-V^{\star})\right\|
\displaystyle\leq C\gamma^{k}\left\|\left(\mathcal{P}^{\pi^{\star}}-E_{1}\right)^{k-K}\right\|\|V^{0}-V^{\star}\|
=O~(|γλ2|kV0V)\displaystyle=\tilde{O}(|\gamma\lambda_{2}|^{k}\|V^{0}-V^{\star}\|)

for k>K and some constant C>0.
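For concreteness, a minimal NumPy sketch of the rank-1 DDVI for Control update analyzed in this proof is given below, with W^{k+1}=\max_{\pi}\{r^{\pi}+\gamma(\mathcal{P}^{\pi}-E_{1})W^{k}\} and V^{k}=(I+\frac{\gamma}{1-\gamma}E_{1})W^{k}. The function name and array layout are our own illustrative choices, not the paper's code.

```python
import numpy as np

def ddvi_control_rank1(P, r, v, gamma=0.99, iters=500):
    """Hypothetical sketch of rank-1 DDVI for Control.
    P has shape (A, n, n) with P[a] the transition matrix of action a,
    r has shape (A, n), and v is a probability vector (v^T 1 = 1)
    defining the deflation matrix E_1 = 1 v^T."""
    A, n, _ = P.shape
    E1 = np.outer(np.ones(n), v)                 # rank-1 deflation E_1 = 1 v^T
    W = np.zeros(n)
    for _ in range(iters):
        # componentwise max over actions of r^pi + gamma (P^pi - E_1) W
        W = (r + gamma * (P - E1) @ W).max(axis=0)
    # (I - gamma E_1)^{-1} = I + gamma/(1-gamma) E_1  since v^T 1 = 1
    return W + gamma / (1.0 - gamma) * E1 @ W
```

For instance, v can be taken as the uniform distribution over states, which corresponds to the deflation E_{1}=\frac{1}{n}\mathbf{1}\mathbf{1}^{\top} used in the experiments of Section 11.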

8.4.  Proof of Theorem 5.1

Consider the following stochastic approximation algorithm

Yk+1=Yk+ηkf(Yk,ζk)\displaystyle Y^{k+1}=Y^{k}+\eta_{k}f(Y^{k},\zeta_{k}) (5)

for k=0,1,\dots, where Y^{k}\in\mathbb{R}^{n}, \eta_{k}\in\mathbb{R}^{+}, f(\cdot,z):\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} is uniformly Lipschitz, and \{\zeta_{k}\}_{k=0,1,\dots} are i.i.d. random variables. Let F(y)=\mathbf{E}[f(y,\zeta_{k})], F_{\infty}(y)=\lim_{r\rightarrow\infty}F(ry)/r, M^{k+1}=f(Y^{k},\zeta_{k})-F(Y^{k}), and \mathcal{F}^{k}=\sigma(Y^{i},M^{i},1\leq i\leq k). Then the following proposition holds.

Proposition 8.1.

[Borkar and Meyn, 2000, Theorems 2.2 and 2.5] If (i) \eta_{k}=(k+1)^{-1}, (ii) \{M^{k},\mathcal{F}^{k}\}_{k=0,1,\dots} is a martingale difference sequence, (iii) \mathbf{E}[\|M^{k+1}\|^{2}\,|\,\mathcal{F}^{k}]\leq K(1+\|Y^{k}\|^{2}) for some constant K, (iv) \dot{y}(t)=F_{\infty}(y(t)) has the origin as an asymptotically stable equilibrium, and (v) \dot{y}(t)=F(y(t)) has a unique globally asymptotically stable equilibrium y^{\star}, then \{Y^{k}\}_{k=0,1,\dots} of (5) converges to y^{\star} almost surely, and furthermore, \{Y^{k}\}_{k=0,1,\dots} of

\displaystyle Y^{k+1}(i_{k})=Y^{k}(i_{k})+\eta_{\nu_{i_{k},k}}f(Y^{k},\zeta_{k})(i_{k}) (6)

for k=0,1,\dots, where i_{k} is a random variable taking values in \{1,2,\cdots,n\} and \nu_{i,k}=\sum^{k}_{j=0}\mathbf{1}_{\{i=i_{j}\}} satisfies \liminf_{k}\nu_{i,k}/k>c for some constant c>0, also converges to y^{\star} almost surely.

In (6), Y^{k}(i_{k}) and f(Y^{k},\zeta_{k})(i_{k}) are the i_{k}-th coordinates of Y^{k} and f(Y^{k},\zeta_{k}), respectively. Equation (6) can be interpreted as an asynchronous version of (5), and we note that Proposition 8.1 is a simplified version, with stronger conditions, of Theorems 2.2 and 2.5 of Borkar and Meyn [2000].

Recall that DDTD (4) with α=1\alpha=1 is

\displaystyle W^{k+1}(X_{k})=W^{k}(X_{k})+\eta_{\nu_{X_{k},k}}\big[r^{\pi}(X_{k})+\gamma V^{k}(X_{k}^{\prime})-\gamma(E_{s}V^{k})(X_{k})-W^{k}(X_{k})\big]
Vk+1\displaystyle V^{k+1} =(IγEs)1Wk+1.\displaystyle=\left(I-\gamma E_{s}\right)^{-1}W^{k+1}. (7)
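To make the update concrete, here is a minimal NumPy sketch of (7) with synchronous sampling X_{k}\sim\text{Unif}(\mathcal{X}) and the stepsize \eta_{\nu_{X_{k},k}} taken as the inverse visit count. It assumes the rank-s deflation matrix E_{s} is available and uses the true \mathcal{P}^{\pi} only to sample the next state; the function and variable names are illustrative.

```python
import numpy as np

def ddtd_alpha1(P, r, Es, gamma, n_steps, seed=0):
    """Hypothetical sketch of the DDTD update (7) with alpha = 1.
    P is the transition matrix P^pi (used only to sample X'), r = r^pi,
    and Es is a rank-s deflation matrix assumed to be known."""
    rng = np.random.default_rng(seed)
    n = len(r)
    Minv = np.linalg.inv(np.eye(n) - gamma * Es)   # (I - gamma E_s)^{-1}
    W = np.zeros(n)
    visits = np.zeros(n)                           # nu_{x,k}: per-state visit counts
    for _ in range(n_steps):
        V = Minv @ W                               # V^k = (I - gamma E_s)^{-1} W^k
        x = rng.integers(n)                        # X_k ~ Unif(X)
        x_next = rng.choice(n, p=P[x])             # X'_k ~ P^pi(. | X_k)
        visits[x] += 1
        eta = 1.0 / visits[x]                      # eta_{nu_{X_k,k}}
        td = r[x] + gamma * V[x_next] - gamma * (Es @ V)[x] - W[x]
        W[x] += eta * td
    return Minv @ W                                # V^k after n_steps updates
```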

We will first show that W^{k}\rightarrow(I-\gamma E_{s})V^{\pi}, which directly implies that V^{k}\rightarrow V^{\pi}. To apply Proposition 8.1, consider

\displaystyle W^{k+1}=W^{k}+\eta_{k}f(W^{k},Z_{k})

where \eta_{k}=(k+1)^{-1}, Z_{k} is a random vector whose x-th coordinate satisfies Z_{k}(x)\sim\mathcal{P}^{\pi}(\cdot\,|\,x), \{Z_{k}\}_{k=0,1,\dots} are i.i.d. random variables, and f(W^{k},Z_{k})(x)=r^{\pi}(x)+\gamma(\left(I-\gamma E_{s}\right)^{-1}W^{k})(Z_{k}(x))-(\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}W^{k})(x)-W^{k}(x). Then, F(W)=\mathbf{E}[f(W,Z_{k})]=[\gamma(\mathcal{P}^{\pi}-E_{s})(I-\gamma E_{s})^{-1}-I]W+r^{\pi}, M^{k+1}=f(W^{k},Z_{k})-F(W^{k}), and \mathcal{F}^{k}=\sigma(W^{i},M^{i},1\leq i\leq k).

We now check the conditions. First, since f(W,z)(x)-f(W^{\prime},z)(x)=\gamma(\left(I-\gamma E_{s}\right)^{-1}(W-W^{\prime}))(z(x))-((I+\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1})(W-W^{\prime}))(x), we have \|f(W,z)-f(W^{\prime},z)\|\leq(\gamma\|\left(I-\gamma E_{s}\right)^{-1}\|+\|I+\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}\|)\|W-W^{\prime}\|, so f(\cdot,z) is uniformly Lipschitz.

\mathbf{E}[M^{k+1}(x)\,|\,\mathcal{F}^{k}]
\displaystyle=\mathbf{E}[\gamma(\left(I-\gamma E_{s}\right)^{-1}W^{k})(Z_{k}(x))-\gamma(E_{s}\left(I-\gamma E_{s}\right)^{-1}W^{k})(x)-\gamma((\mathcal{P}^{\pi}-E_{s})(I-\gamma E_{s})^{-1}W^{k})(x)\,|\,\mathcal{F}^{k}]
\displaystyle=\gamma(\mathcal{P}^{\pi}\left(I-\gamma E_{s}\right)^{-1}W^{k})(x)-\gamma(\mathcal{P}^{\pi}(I-\gamma E_{s})^{-1}W^{k})(x)
=0\displaystyle=0

for all x\in\mathcal{X}. This implies that \{M^{k+1},\mathcal{F}^{k}\}_{k=0,1,\dots} is a martingale difference sequence. Also, \mathbf{E}[\|M^{k+1}\|^{2}\,|\,\mathcal{F}^{k}]\leq(\gamma\|\left(I-\gamma E_{s}\right)^{-1}\|+\gamma\|\mathcal{P}^{\pi}(I-\gamma E_{s})^{-1}\|)^{2}\|W^{k}\|^{2}\leq K(1+\|W^{k}\|^{2}) for the constant K=(\gamma\|\left(I-\gamma E_{s}\right)^{-1}\|+\gamma\|\mathcal{P}^{\pi}(I-\gamma E_{s})^{-1}\|)^{2}.

By definition, F_{\infty}(y)=\lim_{r\rightarrow\infty}F(ry)/r=[\gamma(\mathcal{P}^{\pi}-E_{s})(I-\gamma E_{s})^{-1}-I]y. In Section 8.1, we showed that \text{spec}(F_{\infty}+I)=\{\gamma\lambda_{s+1},\dots,\gamma\lambda_{n},0\}. Hence, \text{spec}(F_{\infty})=\{\gamma\lambda_{s+1}-1,\dots,\gamma\lambda_{n}-1,-1\}. Since |\lambda_{i}|\leq 1 for all s+1\leq i\leq n, the real parts of all eigenvalues of F_{\infty} are negative. This implies that \dot{y}(t)=F_{\infty}(y(t)) has the origin as an asymptotically stable equilibrium and \dot{y}(t)=F(y(t)) has a unique globally asymptotically stable equilibrium W^{\pi}=[I-\gamma(\mathcal{P}^{\pi}-E_{s})(I-\gamma E_{s})^{-1}]^{-1}r^{\pi}.

Lastly, since X_{k}\sim\text{Unif}(\mathcal{X}) and \{X_{k}\}_{k=0,1,\dots} are i.i.d. random variables, \nu_{i,k}/k converges to \frac{1}{n} almost surely by the law of large numbers. Moreover, \eta_{\nu_{X_{k},k}}=(\sum^{k}_{i=0}\mathbf{1}_{\{X_{i}=X_{k}\}})^{-1}. Therefore, by Proposition 8.1, \{W^{k}\}_{k=0,1,\dots} of (7) converges to W^{\pi} almost surely, and this implies that V^{k}\rightarrow(I-\gamma E_{s})^{-1}W^{\pi}=(I-\gamma\mathcal{P}^{\pi})^{-1}r^{\pi}=V^{\pi} almost surely.

8.5.  Proof of Theorem 5.2

Consider the following linear stochastic approximation algorithm

\displaystyle Y^{k+1}=Y^{k}+\eta_{k}(A(\zeta_{k})Y^{k}+b(\zeta_{k})) (8)

for k=0,1,\dots, where Y^{k}\in\mathbb{R}^{n}, \{\zeta_{k}\}_{k=0,1,\dots} are random variables, b(\zeta_{k})\in\mathbb{R}^{n}, A(\zeta_{k})\in\mathbb{R}^{n\times n}, and the entries of A(\zeta_{k}) and b(\zeta_{k}) are bounded. Then, the following proposition holds.

Proposition 8.2.

[Chen et al., 2020, Theorems 2.5 and 2.7] If (i) \eta_{k}=g(k+1)^{-1} for g>0, (ii) \{\zeta_{k}\}_{k=0,1,\dots} is an ergodic (aperiodic and irreducible) Markov process with unique invariant measure \mu, (iii) \mathbf{E}_{\mu}[A(\zeta_{k})]=A and \mathbf{E}_{\mu}[b(\zeta_{k})]=b, (iv) A is Hurwitz, and (v) \mathrm{Re}(\lambda)<-1/2 for all \lambda\in\text{spec}(gA), then \mathbf{E}[\|Y^{k}-Y^{\star}\|^{2}]=O(k^{-1}) where Y^{\star}=-A^{-1}b.

We note that Proposition 8.2 is also a simplified version, with stronger conditions, of Theorems 2.5 and 2.7 of Chen et al. [2020].

Define ψ(Xk)n\psi(X_{k})\in^{n} as ψ(Xk)(x)=𝟏{Xk=x}\psi(X_{k})(x)=\mathbf{1}_{\{X_{k}=x\}}. Then, DDTD (4) with α=1\alpha=1 is equivalent to

Wk+1=Wk+ηk(A(Xk,Xk)Wk+b(Xk,Xk)),W^{k+1}=W^{k}+\eta_{k}(A(X_{k},X^{\prime}_{k})W^{k}+b(X_{k},X^{\prime}_{k})),

where A(X,X^{\prime})=\psi(X)(\psi(X^{\prime})^{\top}\gamma\left(I-\gamma E_{s}\right)^{-1}-\psi(X)^{\top}\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}-\psi(X)^{\top}) and b(X,X^{\prime})=\psi(X)r^{\pi}(X). Hence, all entries of A(X,X^{\prime}) and b(X,X^{\prime}) are bounded.
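As an illustration (with hypothetical function and variable names), the matrices A(X,X^{\prime}) and b(X,X^{\prime}) can be formed explicitly from one-hot vectors as follows; averaging A over many uniformly drawn pairs (X,X^{\prime}) with X^{\prime}\sim\mathcal{P}^{\pi}(\cdot\,|\,X) should approach the expectation computed next.

```python
import numpy as np

def ddtd_sa_matrices(x, x_next, r, Es, gamma):
    """Hypothetical sketch of A(X_k, X'_k) and b(X_k, X'_k) above,
    with psi(x) the one-hot indicator vector of state x."""
    n = len(r)
    psi_x, psi_xn = np.eye(n)[x], np.eye(n)[x_next]
    Minv = np.linalg.inv(np.eye(n) - gamma * Es)          # (I - gamma E_s)^{-1}
    # A = psi(X) (gamma psi(X')^T Minv - gamma psi(X)^T E_s Minv - psi(X)^T)
    A = np.outer(psi_x,
                 gamma * (psi_xn @ Minv) - gamma * (psi_x @ (Es @ Minv)) - psi_x)
    b = r[x] * psi_x                                      # b = psi(X) r^pi(X)
    return A, b
```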

Since \{X_{k},X^{\prime}_{k}\}_{k=0,1,\dots} are i.i.d. random variables, \{X_{k},X^{\prime}_{k}\}_{k=0,1,\dots} is a Markov process. Let \mu be the distribution of (X_{k},X^{\prime}_{k}) and \mathcal{P}(x_{1},x^{\prime}_{1}\,|\,x_{2},x^{\prime}_{2}) be the transition matrix of \{X_{k},X^{\prime}_{k}\}_{k=0,1,\dots}. Then \mathcal{P}(x_{1},x^{\prime}_{1}\,|\,x_{2},x^{\prime}_{2})=\mu(x_{1},x_{1}^{\prime})=\frac{1}{n}\mathcal{P}^{\pi}(x^{\prime}_{1}\,|\,x_{1}). Since (\mu^{\prime})^{\top}\mathcal{P}(\cdot,\cdot\,|\,\cdot,\cdot)=\mu for any distribution \mu^{\prime}, \mu is the unique invariant measure. Hence, \{X_{k},X^{\prime}_{k}\}_{k=0,1,\dots} is irreducible and aperiodic.

\mathbf{E}_{\mu}[A(X_{k},X^{\prime}_{k})]
\displaystyle=\mathbf{E}_{\mu}\big[\mathbf{E}[\psi(X_{k})(\psi(X^{\prime}_{k})^{\top}\gamma\left(I-\gamma E_{s}\right)^{-1}-\psi(X_{k})^{\top}\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}-\psi(X_{k})^{\top})\,|\,X_{k}]\big]
\displaystyle=\mathbf{E}_{\mu}[\psi(X_{k})(\psi(X_{k})^{\top}\gamma\mathcal{P}^{\pi}\left(I-\gamma E_{s}\right)^{-1}-\psi(X_{k})^{\top}\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}-\psi(X_{k})^{\top})]
\displaystyle=\mathbf{E}_{\mu}[\psi(X_{k})\psi(X_{k})^{\top}]\left(\gamma\mathcal{P}^{\pi}\left(I-\gamma E_{s}\right)^{-1}-\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}-I\right)
\displaystyle=n^{-1}\left(\gamma(\mathcal{P}^{\pi}-E_{s})\left(I-\gamma E_{s}\right)^{-1}-I\right)

and \mathbf{E}_{\mu}[b(X_{k},X^{\prime}_{k})]=n^{-1}r^{\pi}. Then, Y^{\star}=-A^{-1}b=[I-\gamma(\mathcal{P}^{\pi}-E_{s})(I-\gamma E_{s})^{-1}]^{-1}r^{\pi}=W^{\pi}.

As we discussed in Section 8.4, \text{spec}(nA)=\{\gamma\lambda_{s+1}-1,\dots,\gamma\lambda_{n}-1,-1\}. Let \lambda_{\text{DDTD}}=\min_{\lambda\in\{\lambda_{s+1},\dots,\lambda_{n}\}}\text{Re}(1-\gamma\lambda). Then, if g>\frac{n}{2\lambda_{\text{DDTD}}}, we have \mathrm{Re}(\lambda)<-1/2 for all \lambda\in\text{spec}(gA). Therefore, by Proposition 8.2, \mathbf{E}[\|W^{k}-W^{\pi}\|^{2}]=O(k^{-1}), and this implies that \mathbf{E}[\|V^{k}-V^{\pi}\|^{2}]\leq\|(I-\gamma E_{s})^{-1}\|^{2}\,\mathbf{E}[\|W^{k}-W^{\pi}\|^{2}]=O(k^{-1}).

8.6.  Non-asymptotic convergence analysis of DDTD

Consider the following stochastic approximation algorithm

Yk+1=Yk+ηk(f(Yk,ζk)Yk+Zk)\displaystyle Y^{k+1}=Y^{k}+\eta_{k}(f(Y^{k},\zeta_{k})-Y^{k}+Z_{k}) (9)

for k=0,1,\dots, where Y^{k}\in\mathbb{R}^{n}, \eta_{k}\in\mathbb{R}^{+}, and \{\zeta_{k}\}_{k=0,1,\dots} is a Markov chain with unique stationary distribution \mu and transition matrix P. Let F(y)=\mathbf{E}_{\mu}[f(y,\zeta_{k})] and \mathcal{F}^{k}=\sigma(Y^{i},M^{i},Z_{i},\eta_{i},\,1\leq i\leq k). Define the mixing time as t_{\delta}=\min\{k\geq 0:\max_{y}\|P^{k}(y,\cdot)-\mu(\cdot)\|_{\text{TV}}<\delta\}, where \|\cdot\|_{\text{TV}} stands for the total variation distance. Let Y^{\star} be the fixed point of F. Then the following proposition holds.

Proposition 8.3.

[Chen et al., 2023, Theorem 1] Suppose (i) \{\zeta_{k}\}_{k=0,1,\dots} is an aperiodic and irreducible Markov chain, (ii) F is a \beta-contraction with respect to some norm \|\cdot\|_{c}, (iii) f(\cdot,z):\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} is A_{1}-uniformly Lipschitz with respect to \|\cdot\|_{c} and \|f(0,y)\|_{c}\leq B_{1} for any y, (iv) \{M^{k},\mathcal{F}^{k}\}_{k=0,1,\dots} is a martingale difference sequence, and (v) \mathbf{E}[\|Z^{k+1}\|\,|\,\mathcal{F}^{k}]\leq A_{2}+B_{2}\|Y^{k}\|^{2} for some A_{2},B_{2}>0. Then, if \eta_{k}=\eta,

𝐄[YkYc2]c1d1(1d2η)ktα+d3d2ηtα\mathbf{E}[\|Y^{k}-Y^{\star}\|^{2}_{c}]\leq c_{1}d_{1}(1-d_{2}\eta)^{k-t_{\alpha}}+\frac{d_{3}}{d_{2}}\eta t_{\alpha}

for all kKk\geq K, and if ηk=1d2(k+h)\eta_{k}=\frac{1}{d_{2}(k+h)},

𝐄[YkYc2]c1d1K+hk+h+8η2d3c2tklog(k+h)k+h\mathbf{E}[\|Y^{k}-Y^{\star}\|^{2}_{c}]\leq c_{1}d_{1}\frac{K+h}{k+h}+\frac{8\eta^{2}d_{3}c_{2}t_{k}\text{log}(k+h)}{k+h}

for all k\geq K, where \{\eta_{k}\}_{k=0,1,\dots} is a nonincreasing sequence satisfying \sum^{k-1}_{i=k-t_{k}}\eta_{i}\leq\min\left\{\frac{d_{2}}{d_{3}A^{2}},\frac{1}{4A}\right\} for all k\geq t_{k}, K=\min\{k\geq 0:k\geq t_{k}\}, A=A_{1}+A_{2}+1, B=B_{1}+B_{2}, c_{1}=(\|Y^{0}-Y^{\star}\|_{c}+\|Y^{0}\|_{c}+A/B)^{2}, c_{2}=(A\|Y^{\star}\|_{c}+B)^{2}, l_{cs}\|\cdot\|_{s}\leq\|\cdot\|_{c}\leq u_{cs}\|\cdot\|_{s} for a chosen norm \|\cdot\|_{s} such that \frac{1}{2}\|\cdot\|^{2}_{s} is an L-smooth function with respect to \|\cdot\|_{s}, d_{1}=\frac{1+\theta u^{2}_{cs}}{1+\theta l^{2}_{cs}}, d_{2}=1-\beta d^{1/2}_{1}, and d_{3}=\frac{114L(1+\theta u^{2}_{cs})}{\theta l^{2}_{cs}}, where \theta is chosen so that \beta^{2}\leq(d_{1})^{-1}.

DDTD (4) with α=1\alpha=1 is equivalent to

Wk+1=Wk+ηk(f(Wk,Xk,Xk)Wk+Zk)W^{k+1}=W^{k}+\eta_{k}(f(W^{k},X_{k},X^{\prime}_{k})-W^{k}+Z_{k})

where Z_{k}=0 and f(W^{k},X,X^{\prime})(x)=\mathbf{1}_{\{X=x\}}(r^{\pi}(x)+\gamma(\left(I-\gamma E_{s}\right)^{-1}W^{k})(X^{\prime})-\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}W^{k}(x)-W^{k}(x))+W^{k}(x). As we showed in Section 8.5, \{X_{k},X^{\prime}_{k}\}_{k=0,1,\dots} is an irreducible and aperiodic Markov chain with invariant measure \mu. Furthermore, the mixing time of \{X_{k},X^{\prime}_{k}\}_{k=0,1,\dots} is 1.

F(W)(x)
\displaystyle=\mathbf{E}_{(X,X^{\prime})\sim\mu}[\mathbf{1}_{\{X=x\}}(r^{\pi}(x)+\gamma(\left(I-\gamma E_{s}\right)^{-1}W)(X^{\prime})-\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}W(x)-W(x))+W(x)]
\displaystyle=\mathbf{E}_{(X,X^{\prime})\sim\mu}\big[\mathbf{E}[\mathbf{1}_{\{X=x\}}(r^{\pi}(x)+\gamma(\left(I-\gamma E_{s}\right)^{-1}W)(X^{\prime})-\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}W(x)-W(x))+W(x)\,|\,X]\big]
\displaystyle=\mathbf{E}_{(X,X^{\prime})\sim\mu}[\mathbf{1}_{\{X=x\}}(r^{\pi}(x)+\gamma(\mathcal{P}^{\pi}\left(I-\gamma E_{s}\right)^{-1}W)(X)-\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}W(x)-W(x))+W(x)]
\displaystyle=\frac{1}{n}\left(r^{\pi}(x)+\gamma((\mathcal{P}^{\pi}-E_{s})\left(I-\gamma E_{s}\right)^{-1}W)(x)\right)+\left(1-\frac{1}{n}\right)W(x).

Let \lambda_{1},\dots,\lambda_{n} be the eigenvalues of \mathcal{P}^{\pi} sorted in decreasing order of magnitude with ties broken arbitrarily. Let A_{\text{DDTD}}=\gamma(\mathcal{P}^{\pi}-E_{s})\left(I-\gamma E_{s}\right)^{-1}. Since the spectral radius is the infimum of induced matrix norms, for any \epsilon>0 there exists a norm \|\cdot\|_{\epsilon} such that \rho(\frac{1}{n}A_{\text{DDTD}}+(1-\frac{1}{n})I)\leq\|\frac{1}{n}A_{\text{DDTD}}+(1-\frac{1}{n})I\|_{\epsilon}\leq\rho(\frac{1}{n}A_{\text{DDTD}}+(1-\frac{1}{n})I)+\epsilon, where \rho(\frac{1}{n}A_{\text{DDTD}}+(1-\frac{1}{n})I)=\max_{\lambda\in\{\lambda_{s+1},\dots,\lambda_{n}\}}\{|1-\frac{1}{n}+\frac{\gamma}{n}\lambda|\}.

\|f(0,X,X^{\prime})\|_{\epsilon}=\|\mathbf{1}_{\{X=x\}}r^{\pi}(x)\|_{\epsilon}\leq\|r^{\pi}\|_{\epsilon}=B_{\epsilon}, and f(W,X,X^{\prime})(x)-f(W^{\prime},X,X^{\prime})(x)=\mathbf{1}_{\{X=x\}}(\gamma\left(I-\gamma E_{s}\right)^{-1}(W-W^{\prime})(X^{\prime})-(\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}(W-W^{\prime}))(x))+(1-\mathbf{1}_{\{X=x\}})(W-W^{\prime})(x) implies that \|f(W,X,X^{\prime})-f(W^{\prime},X,X^{\prime})\|_{\epsilon}\leq(\|\gamma\left(I-\gamma E_{s}\right)^{-1}\|_{\epsilon}+\|\gamma E_{s}\left(I-\gamma E_{s}\right)^{-1}\|_{\epsilon}+1)\|W-W^{\prime}\|_{\epsilon}=A_{\epsilon}\|W-W^{\prime}\|_{\epsilon}.

F(W)=\mathbf{E}_{(X,X^{\prime})\sim\mu}[f(W,X,X^{\prime})]=\frac{1}{n}(r^{\pi}+\gamma(\mathcal{P}^{\pi}-E_{s})\left(I-\gamma E_{s}\right)^{-1}W)+(1-\frac{1}{n})W. Moreover, \|F(W_{1})-F(W_{2})\|_{\epsilon}=\|(\frac{1}{n}A_{\text{DDTD}}+(1-\frac{1}{n})I)(W_{1}-W_{2})\|_{\epsilon}\leq(\frac{1}{n}\rho+\epsilon+(1-\frac{1}{n}))\|W_{1}-W_{2}\|_{\epsilon}, and F has the fixed point W^{\pi}, since W\mapsto\frac{1}{n}(A_{\text{DDTD}}W+r^{\pi})+(1-\frac{1}{n})W and W\mapsto A_{\text{DDTD}}W+r^{\pi} share the same fixed point.

Thus, by Proposition 8.3 and \mathbf{E}[\|V^{k}-V^{\pi}\|_{\epsilon}^{2}]\leq\|(I-\gamma E_{s})^{-1}\|_{\epsilon}^{2}\,\mathbf{E}[\|W^{k}-W^{\pi}\|_{\epsilon}^{2}], we have the following non-asymptotic convergence results for DDTD.

Theorem 8.4 (DDTD with constant stepsize).

Let λ1,,λn\lambda_{1},\dots,\lambda_{n} be the eigenvalues of 𝒫π\mathcal{P}^{\pi} sorted in decreasing order of magnitude with ties broken arbitrarily. Let ηk=η\eta_{k}=\eta. For any ϵ>0\epsilon>0, there exist aϵ,bϵ,cϵ,dϵ>0a_{\epsilon},b_{\epsilon},c_{\epsilon},d_{\epsilon}>0 and ϵ\|\cdot\|_{\epsilon} such that for α=1\alpha=1 and 0<η<dϵ0<\eta<d_{\epsilon}, DDTD exhibits the rate

𝐄[VkVπϵ2]aϵ(1η+ρbϵη)k1+cϵη\mathbf{E}[\|V^{k}-V^{\pi}\|^{2}_{\epsilon}]\leq a_{\epsilon}(1-\eta+\rho b_{\epsilon}\eta)^{k-1}+c_{\epsilon}\eta

for all k1k\geq 1 where ρ=11n+γnmax{|λs+1|,,|λn|}+ϵ\rho=1-\frac{1}{n}+\frac{\gamma}{n}\max\{|\lambda_{s+1}|,\dots,|\lambda_{n}|\}+\epsilon.

We note that aϵ,bϵ,cϵ,dϵa_{\epsilon},b_{\epsilon},c_{\epsilon},d_{\epsilon} of Theorem 8.4 can be obtained by plugging A1=Aϵ,A2=0,B1=Bϵ,B2=0A_{1}=A_{\epsilon},A_{2}=0,B_{1}=B_{\epsilon},B_{2}=0 in Proposition 8.3.

Theorem 8.5 (DDTD with diminishing stepsize).

Let λ1,,λn\lambda_{1},\dots,\lambda_{n} be the eigenvalues of 𝒫π\mathcal{P}^{\pi} sorted in decreasing order of magnitude with ties broken arbitrarily. For any ϵ>0\epsilon>0, there exist aϵ,bϵ,cϵ,dϵ>0a_{\epsilon},b_{\epsilon},c_{\epsilon},d_{\epsilon}>0 and ϵ\|\cdot\|_{\epsilon} such that for α=1\alpha=1 and ηk=1(k+bϵ)(1cϵρ)\eta_{k}=\frac{1}{(k+b_{\epsilon})(1-c_{\epsilon}\rho)}, DDTD exhibits the rate

\mathbf{E}[\|V^{k}-V^{\pi}\|^{2}_{\epsilon}]\leq\frac{a_{\epsilon}}{k+b_{\epsilon}}+\left(\frac{1}{1-c_{\epsilon}\rho}\right)^{2}\frac{d_{\epsilon}}{k+b_{\epsilon}}

for all k1k\geq 1 where ρ=11n+γnmax{|λs+1|,,|λn|}+ϵ\rho=1-\frac{1}{n}+\frac{\gamma}{n}\max\{|\lambda_{s+1}|,\dots,|\lambda_{n}|\}+\epsilon.

Again, we note that a_{\epsilon},b_{\epsilon},c_{\epsilon},d_{\epsilon} of Theorem 8.5 can be obtained by plugging A_{1}=A_{\epsilon},A_{2}=0,B_{1}=B_{\epsilon},B_{2}=0 into Proposition 8.3.

9.  DDVI with QR Iteration, AutoPI and AutoQR

9.1.  DDVI with QR Iteration

Recall that the QR iteration approximates the top-s Schur vectors. Algorithm 2 uses the QR iteration to construct the rank-s Schur deflation matrix and then performs DDVI. Compared to standard VI, rank-s DDVI requires additional computation for the QR iteration.

Algorithm 2 Rank-ss DDVI with QR Iteration
Initialize α\alpha, ss, and V0V^{0}
\{\lambda_{i}\}^{s}_{i=1},\{u_{i}\}^{s}_{i=1} = QRiteration(\mathcal{P}^{\pi}, s)
for k=0,,K1k=0,\dots,K-1 do
     Wk+1=αγ(𝒫πi=1sλiuiui)Vk+(1α)Vk+αrπW^{k+1}=\alpha\gamma(\mathcal{P}^{\pi}-\sum_{i=1}^{s}\lambda_{i}u_{i}u_{i}^{\top})V^{k}+(1-\alpha)V^{k}+\alpha r^{\pi}
     Vk+1=(I+i=1sαγλi1αγλiuiui)Wk+1V^{k+1}=\left(I+\sum^{s}_{i=1}\frac{\alpha\gamma\lambda_{i}}{1-\alpha\gamma\lambda_{i}}u_{i}u_{i}^{\top}\right)W^{k+1}
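A minimal NumPy sketch of Algorithm 2 is given below. It implements QRiteration as a plain orthogonal (subspace) iteration and assumes the top-s eigenvalues of \mathcal{P}^{\pi} are real and separated in magnitude; the function names and the number of inner iterations are illustrative choices rather than part of the algorithm specification.

```python
import numpy as np

def orthogonal_iteration(P, s, n_iters=600, seed=0):
    """Hypothetical sketch of QRiteration(P^pi, s): subspace iteration that
    approximates the top-s Schur vectors of P and the corresponding
    eigenvalues, assuming those eigenvalues are real and well separated."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((n, s)))
    for _ in range(n_iters):
        Q, _ = np.linalg.qr(P @ Q)
    lams = np.diag(Q.T @ P @ Q)          # Ritz values on the diagonal
    return lams, Q

def ddvi_qr(P, r, s, gamma, alpha, K):
    """Hypothetical sketch of Algorithm 2 (rank-s DDVI with QR iteration)."""
    n = len(r)
    lams, U = orthogonal_iteration(P, s)
    Es = sum(lams[i] * np.outer(U[:, i], U[:, i]) for i in range(s))
    V = np.zeros(n)
    for _ in range(K):
        W = alpha * gamma * (P - Es) @ V + (1 - alpha) * V + alpha * r
        V = W + sum(alpha * gamma * lams[i] / (1 - alpha * gamma * lams[i])
                    * np.outer(U[:, i], U[:, i]) for i in range(s)) @ W
    return V
```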

The QR iteration in Algorithm 2 can be carried out in an automated manner similar to AutoPI. We refer to this as AutoQR and formally describe the algorithm in Section 9.2.

9.2.  DDVI with AutoPI and AutoQR

We first introduce DDVI with AutoPI for 0<α10<\alpha\leq 1. If W0=0W^{0}=0, we have

Wk=i=0k1((1α)(IαγEs)1+αγ(𝒫πEs)(IαγEs)1)iαrπ\displaystyle W^{k}=\sum_{i=0}^{k-1}((1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1})^{i}\alpha r^{\pi}

by (2), and

(Wk+1Wk)=((1α)(IαγEs)1+αγ(𝒫πEs)(IαγEs)1)kαrπ(W^{k+1}-W^{k})=((1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1})^{k}\alpha r^{\pi}

are the iterates of a power iteration with respect to (1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1}. In Section 8.1, we show that

(1α)(IαγEs)1+αγ(𝒫πEs)(IαγEs)1=(1α)I+αγ𝒫π+UsDs,(λiγ1)γα2λi1γαλiVs.(1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1}=(1-\alpha)I+\alpha\gamma\mathcal{P}^{\pi}+U_{s}D_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s}.

For \alpha\approx 1, we expect the spectral radius of this matrix to be 1-\alpha+\gamma\alpha\lambda_{s+1}, and (1-\alpha)I+\alpha\gamma\mathcal{P}^{\pi}+U_{s}D_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s} and \mathcal{P}^{\pi}+U_{s}D_{s,\frac{(\lambda_{i}\gamma-1)\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}V^{\top}_{s} to have the same top eigenvector. By the same argument as in Section 4.2, we can recover the (s+1)-th eigenvector of \mathcal{P}^{\pi} by [Bru et al., 2012, Proposition 5]. Leveraging this observation, we formalize this approach in Algorithm 3.

Algorithm 3 DDVI with AutoPI (Detailed)
1:Initialize C=10C=10, ϵ=104\epsilon=10^{-4}
2:function DDVI((s,V,K,{λi}i=1s,{ui}i=1s,αss,V,K,\{\lambda_{i}\}^{s}_{i=1},\{u_{i}\}^{s}_{i=1},\alpha_{s} ))
3:     V0=VV^{0}=V, c=0c=0
4:     Es=i=1sλiuiviE_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}v_{i}^{\top} as in Fact 2
5:     for k=0,,K1k=0,\dots,K-1 do
6:         if c\geq C and \left\|\frac{w^{k+1}}{\|w^{k+1}\|_{2}}-\frac{w^{k}}{\|w^{k}\|_{2}}\right\|_{2}<\epsilon then
7:              λs+1=(wk)wk+1/wk22\lambda^{\prime}_{s+1}=(w^{k})^{\top}w^{k+1}/\|w^{k}\|^{2}_{2}
8:              λs+1=(λs+11+αs)/(αsγ)\lambda_{s+1}=(\lambda^{\prime}_{s+1}-1+\alpha_{s})/(\alpha_{s}\gamma)
9:              us+1=wk+1i=1sαλi(1γλi)1αsγλiviwk+1λiλs+1uiu_{s+1}=w^{k+1}-\sum^{s}_{i=1}\frac{\alpha\lambda_{i}(1-\gamma\lambda_{i})}{1-\alpha_{s}\gamma\lambda_{i}}\frac{v_{i}^{\top}w^{k+1}}{\lambda_{i}-\lambda_{s+1}}u_{i}
10:              Return     DDVI(s+1,Vk,Kc,{λi}i=1s+1,{ui}i=1s+1,αs+1s\!+\!1,V^{k}\!,K\!-\!\text{c},\{\lambda_{i}\}^{s\!+\!1}_{i=1},\{u_{i}\}^{s\!+\!1}_{i=1},\alpha_{s+1}) ​​​​​​​​​​​​​​​
11:         else
12:              W^{k+1}=(1-\alpha_{s})V^{k}+\alpha_{s}\gamma(\mathcal{P}^{\pi}-E_{s})V^{k}+\alpha_{s}r^{\pi}
13:              Vk+1=(IαsγEs)1Wk+1V^{k+1}=(I-\alpha_{s}\gamma E_{s})^{-1}W^{k+1}
14:              c = c+1
15:              wk+1=Wk+1Wkw^{k+1}=W^{k+1}-W^{k}
16:              wk=WkWk1w^{k}=W^{k}-W^{k-1}               
17:     Return VKV^{K}
18:Initialize V0V^{0} and u1=𝟏u_{1}=\mathbf{1}, λ1=1\lambda_{1}=1
19:DDVI(1,V0,K,λ1,u1,1)(1,V^{0},K,\lambda_{1},u_{1},1)

Now, we introduce DDVI with AutoQR for 0<α10<\alpha\leq 1.

Assume the top s+1s+1 eigenvalues of 𝒫π\mathcal{P}^{\pi} are distinct, i.e., 1=λ1>|λ2|>>|λs+1|1=\lambda_{1}>|\lambda_{2}|>\cdots>|\lambda_{s+1}|. Let Es=i=1sλiuiuiE_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}u_{i}^{\top} be a rank-ss deflation matrix of 𝒫π\mathcal{P}^{\pi} as in Fact 3. If W0=0W^{0}=0, we again have

Wk=i=0k1((1α)(IαγEs)1+αγ(𝒫πEs)(IαγEs)1)iαrπ,\displaystyle W^{k}=\sum_{i=0}^{k-1}((1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1})^{i}\alpha r^{\pi},

and

(Wk+1Wk)=((1α)(IαγEs)1+αγ(𝒫πEs)(IαγEs)1)kαrπ(W^{k+1}-W^{k})=((1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1})^{k}\alpha r^{\pi}

are the iterates of a power iteration with respect to (1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1}. In Section 8.1, we show that

(1α)(IαγEs)1+αγ(𝒫πEs)(IαγEs)1=(1α)I+αγ𝒫π+UsRs,(λiγ1)γα2λi1γαλiUs.(1-\alpha)(I-\alpha\gamma E_{s})^{-1}+\alpha\gamma(\mathcal{P}^{\pi}-E_{s})(I-\alpha\gamma E_{s})^{-1}=(1-\alpha)I+\alpha\gamma\mathcal{P}^{\pi}+U_{s}R_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}.

For \alpha\approx 1, we expect the spectral radius of this matrix to be 1-\alpha+\gamma\alpha\lambda_{s+1}, and (1-\alpha)I+\alpha\gamma\mathcal{P}^{\pi}+U_{s}R_{s,\frac{(\lambda_{i}\gamma-1)\gamma\alpha^{2}\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s} and \mathcal{P}^{\pi}+U_{s}R_{s,\frac{(\lambda_{i}\gamma-1)\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s} to have the same top eigenvector. By the same argument as in Section 4.2, for large k, we expect \frac{W^{k+1}-W^{k}}{\|W^{k+1}-W^{k}\|}\approx w, where w is the top eigenvector of \mathcal{P}^{\pi}+U_{s}R_{s,\frac{(\lambda_{i}\gamma-1)\alpha\lambda_{i}}{1-\gamma\alpha\lambda_{i}}}U^{\top}_{s}. Thus, by orthonormalizing w against u_{1},\dots,u_{s}, we can obtain the (s+1)-th Schur vector of \mathcal{P}^{\pi} [Saad, 2011, Section 4.2.4]. Leveraging this observation, we propose DDVI with Automatic QR Iteration (AutoQR) and formalize it in Algorithm 4.
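The extraction step used on lines 7–10 of Algorithm 4 below amounts to a power-iteration eigenvalue estimate followed by Gram–Schmidt orthogonalization against the known Schur vectors. A small illustrative sketch, with hypothetical names, is:

```python
import numpy as np

def autoqr_extract(w_prev, w_curr, U, alpha, gamma):
    """Hypothetical sketch of the AutoQR extraction step: estimate the
    (s+1)-th eigenvalue from successive power-iteration differences and
    orthonormalize the direction against the Schur vectors u_1,...,u_s
    stored as the columns of U."""
    lam_shifted = w_prev @ w_curr / (w_prev @ w_prev)       # eigenvalue of the DDVI matrix
    lam_next = (lam_shifted - 1 + alpha) / (alpha * gamma)  # undo the (1-alpha)+alpha*gamma shift
    u_next = w_curr - U @ (U.T @ w_curr)                    # Gram-Schmidt against u_1..u_s
    u_next /= np.linalg.norm(u_next)
    return lam_next, u_next
```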

Algorithm 4 DDVI with AutoQR
1:Initialize C=10C=10, ϵ=104\epsilon=10^{-4}
2:function DDVI((s,V,K,{λi}i=1s,{ui}i=1s,αss,V,K,\{\lambda_{i}\}^{s}_{i=1},\{u_{i}\}^{s}_{i=1},\alpha_{s}))
3:     V0=VV^{0}=V, c=0c=0,
4:     Es=i=1sλiuiuiE_{s}=\sum_{i=1}^{s}\lambda_{i}u_{i}u_{i}^{\top} as in Fact 3
5:     for k=0,,K1k=0,\dots,K-1 do
6:         if c\geq C and \left\|\frac{w^{k+1}}{\|w^{k+1}\|_{2}}-\frac{w^{k}}{\|w^{k}\|_{2}}\right\|_{2}<\epsilon then
7:              λs+1=(wk)wk+1/wk22\lambda^{\prime}_{s+1}=(w^{k})^{\top}w^{k+1}/\|w^{k}\|^{2}_{2}
8:              λs+1=(λs+11+αs)/(αsγ)\lambda_{s+1}=(\lambda^{\prime}_{s+1}-1+\alpha_{s})/(\alpha_{s}\gamma)
9:              us+1=wk+1i=1suiwk+1uiu^{\prime}_{s+1}=w^{k+1}-\sum^{s}_{i=1}u^{\top}_{i}w^{k+1}u_{i}
10:              us+1=1(us+1)us+1us+1u_{s+1}=\frac{1}{\sqrt{(u^{\prime}_{s+1})^{\top}u^{\prime}_{s+1}}}u^{\prime}_{s+1}
11:              Return     DDVI(s+1,Vk,Kc,{λi}i=1s+1,{ui}i=1s+1,αss\!+\!1,V^{k}\!,K\!-\!\text{c},\{\lambda_{i}\}^{s\!+\!1}_{i=1},\{u_{i}\}^{s\!+\!1}_{i=1},\alpha_{s}) ​​​​​​​​​​​​​​​
12:         else
13:              W^{k+1}=(1-\alpha_{s})V^{k}+\alpha_{s}\gamma(\mathcal{P}^{\pi}-E_{s})V^{k}+\alpha_{s}r^{\pi}
14:              Vk+1=(IαsγEs)1Wk+1V^{k+1}=(I-\alpha_{s}\gamma E_{s})^{-1}W^{k+1}
15:              c = c+1
16:              wk+1=Wk+1Wkw^{k+1}=W^{k+1}-W^{k}
17:              wk=WkWk1w^{k}=W^{k}-W^{k-1}               
18:     Return VKV^{K}
19:Initialize V0V^{0} and u1=𝟏u_{1}=\mathbf{1}, λ1=1\lambda_{1}=1
20:DDVI(1,V0,K,λ1,u1,1)(1,V^{0},K,\lambda_{1},u_{1},1)

10.  Environments

Figure 4: Maze environment. The dark shaded region is the goal state with a positive reward. Arrows show the optimal policy.

Figure 5: Chain Walk environment. The dark shaded region is the goal state with a positive reward, and the grey shaded region is a state with a negative reward. Arrows show the optimal policy.

Figure 6: Cliffwalk environment. The dark shaded region is the goal state with a positive reward. Gray shaded regions are terminal states with a negative reward. Arrows show the optimal policy.

We introduce the four environments and policies used in the experiments.

Cliffwalk

The Cliffwalk is a 3\times 7 grid world. The top-left corner is the initial state, the top-right corner is the goal state with a reward of 10, and the other states in the first row are terminal states with a reward of -10. There is a penalty of -1 in all other states. The agent has four actions: UP(0), RIGHT(1), DOWN(2), and LEFT(3). Each action has a 90\% chance of moving successfully, but with probability 10\% one of the other three directions is randomly chosen and the agent moves in that direction. If the agent attempts to move out of the boundary, it stays in place. The optimal and non-optimal policies of Cliffwalk used in the experiments are as follows:

Optimal policy: [2,0,0,0,0,0,0,2,2,2,2,2,1,0,1,1,1,1,1,1,0][2,0,0,0,0,0,0,2,2,2,2,2,1,0,1,1,1,1,1,1,0]

Non optimal policy: [0,0,2,1,2,1,2,1,0,2,0,3,2,0,3,1,0,1,1,1,3][0,0,2,1,2,1,2,1,0,2,0,3,2,0,3,1,0,1,1,1,3]

Maze

The Maze is a 5\times 5 grid world with 16 walls. The top-left corner is the initial state, and the bottom-left corner is the goal state with a reward of 10. Similar to the Cliffwalk, there is a penalty of -1 in all other states, and the agent has four actions: UP(0), RIGHT(1), DOWN(2), and LEFT(3). Each action has a 90\% chance of moving successfully, but with probability 10\% one of the other three directions is randomly chosen and the agent moves in that direction. If the agent attempts to move out of the boundary or hits a wall, it stays in place. The optimal and non-optimal policies of Maze used in the experiments are as follows:

Optimal policy: [2,1,1,1,2,2,0,2,3,2,2,0,2,0,3,1,0,1,1,2,0,3,3,3,3][2,1,1,1,2,2,0,2,3,2,2,0,2,0,3,1,0,1,1,2,0,3,3,3,3]

Non optimal policy: [2,2,3,0,3,0,2,1,3,2,2,2,3,3,1,0,3,0,3,3,3,3,1,1,0][2,2,3,0,3,0,2,1,3,2,2,2,3,3,1,0,3,0,3,3,3,3,1,1,0]

Chain Walk

We use the Chain Walk environment, as described by Farahmand and Ghavamzadeh [2021], which is similar to the formulation by Lagoudakis and Parr [2003]. It is a circular chain with 50 states, where states 1 and 50 are connected. The reward is state-dependent: state 3 is the goal state with reward 1, state 50 gives reward -1, and all other states give reward 0. The agent has two actions: RIGHT(0) and LEFT(1). Each action has an 80\% chance of moving successfully; with probability 40/3\% the agent stays in place, and with probability 20/3\% the agent moves in the opposite direction. The optimal and non-optimal policies of Chain Walk used in the experiments are as follows:

Optimal policy: [0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,[0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0]0,0,0,0,0,0,0,0,0,0,0,0]

Non optimal policy: [1,0,0,1,1,1,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,1,1,0,1,1,1,0,0,0,1,1,1,1,1,[1,0,0,1,1,1,1,1,1,1,0,1,0,0,0,0,0,0,1,1,1,1,1,1,0,1,1,1,0,0,0,1,1,1,1,1,

0,0,1,1,0,1,0,0,1,1,0,0,1,0]0,0,1,1,0,1,0,0,1,1,0,0,1,0], [0,1,1,0,1,1,0,0,1,1,0,1,1,1,1,0,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,[0,1,1,0,1,1,0,0,1,1,0,1,1,1,1,0,1,1,1,0,0,0,0,0,1,1,1,1,1,1,1,

1,1,1,1,1,1,0,0,1,0,0,1,0,1,0,1,0,1,1]1,1,1,1,1,1,0,0,1,0,0,1,0,1,0,1,0,1,1]

Garnet

We use the Garnet environment as described by Farahmand and Ghavamzadeh [2021], Rakhsha et al. [2022], which is based on Bhatnagar et al. [2009]. Garnet is parameterized by the tuple (|\mathcal{X}|,|\mathcal{A}|,b_{p},b_{r}). Here |\mathcal{X}| and |\mathcal{A}| are the numbers of states and actions, and b_{p} is the branching factor of the environment, i.e., the number of possible next states for each state-action pair. For each state-action pair, we randomly select b_{p} next states without replacement and then generate the transition distribution over them randomly. We also select b_{r} states without replacement, and for each selected state x we assign a state-dependent reward uniformly sampled from (0,1).
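For reproducibility, a minimal sketch of a Garnet generator consistent with this description is shown below. The use of a Dirichlet distribution for the random transition probabilities is our assumption, as the exact sampling scheme is not pinned down above.

```python
import numpy as np

def make_garnet(n_states, n_actions, b_p, b_r, seed=0):
    """Hypothetical sketch of a Garnet(|X|, |A|, b_p, b_r) generator."""
    rng = np.random.default_rng(seed)
    P = np.zeros((n_actions, n_states, n_states))
    for a in range(n_actions):
        for x in range(n_states):
            nxt = rng.choice(n_states, size=b_p, replace=False)  # b_p next states
            P[a, x, nxt] = rng.dirichlet(np.ones(b_p))           # random transition dist.
    r = np.zeros(n_states)
    rewarded = rng.choice(n_states, size=b_r, replace=False)     # b_r rewarded states
    r[rewarded] = rng.uniform(0.0, 1.0, size=b_r)                # rewards sampled in (0, 1)
    return P, r
```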

11.  Rank-11 DDVI for Control Experiments

Figure 7: Comparison of rank-1 DDVI for control and VI in (a) Chain Walk (left) and (b) Garnet (right).

In this experiment, we consider Chain Walk and Garnet. We use Garnet with 100 states, 8 actions, a branching factor of 6, and 10 non-zero rewards throughout the state space. We use normalized errors \|V^{k}-V^{\star}\|_{1}/\|V^{\star}\|_{1}, \gamma=0.995, and E_{1}=\frac{1}{n}\mathbf{1}\mathbf{1}^{\top}, where n is the number of states. For Garnet, we plot the average over 100 Garnet instances and denote one standard error with the shaded area. Figures 7(a) and 7(b) show that rank-1 DDVI for Control does indeed provide an acceleration.

12.  Additional DDVI Experiments and Details

In all experiments, we set DDVI's \alpha=0.99. In the experiments for Figure 1, the QR iteration is run for 600 iterations. For PID VI, we set \eta=0.05 and \epsilon=10^{-10}. In Anderson VI, we have m=5. In Figure 2, we use 20 Garnet MDPs with branching factor b_{p}=3 and b_{r}=0.1|\mathcal{X}|.

We perform further comparisons of convergence in Figures 8, 9, 10, and 11. Figure 8 is run on the Garnet environment with 50 states and a branching factor of 40, which matches the setting in Goyal and Grand-Clément [2022].

Figure 8: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (left) and wall clock time (right). Plots are averages over 20 randomly generated Garnet MDPs with 50 states and a branching factor of 40, with shaded areas showing the standard error.

Figure 9: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (left) and wall clock time (right) in Maze.

Figure 10: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (left) and wall clock time (right) in Chain Walk.

Figure 11: Comparison of DDVI with other accelerated VIs. Normalized errors are shown against the iteration number (left) and wall clock time (right) in Cliffwalk.

13.  DDTD Experiments

A key part of our experimental setup is the model \hat{\mathcal{P}}. To show the robustness of DDTD to model error, and also its advantage over Dyna, we consider the scenario in which \hat{\mathcal{P}} has some non-diminishing error. We achieve this with the same technique as Rakhsha et al. [2022]. Assume that at each iteration t, the empirical distribution of the next state from (x,a) is \mathcal{P}_{\mathrm{MLE}}(\cdot|x,a), which is updated with every sample. For some hyperparameter \theta\in[0,1], we set the approximate dynamics \hat{\mathcal{P}}(\cdot|x,a) as

𝒫^(|x,a)=(1θ)𝒫MLE(|x,a)+θU({x|𝒫MLE(x|x,a)>0})\displaystyle\hat{\mathcal{P}}(\cdot|x,a)=(1-\theta)\cdot\mathcal{P}_{\mathrm{MLE}}(\cdot|x,a)+\theta\cdot U(\{x^{\prime}|\mathcal{P}_{\mathrm{MLE}}(x^{\prime}|x,a)>0\}) (10)

where U(S)U(S) for S𝒳S\subseteq\mathcal{X} is the uniform distribution over SS. The hyperparameter θ\theta controls the amount of error introduced in 𝒫^\hat{\mathcal{P}}. If θ=0\theta=0, we have 𝒫^=𝒫MLE\hat{\mathcal{P}}=\mathcal{P}_{\mathrm{MLE}} which becomes arbitrarily accurate with more samples. With larger θ\theta, 𝒫^(|x,a)\hat{\mathcal{P}}(\cdot|x,a) will be smoothed towards the uniform distribution more, which leads to a larger model error. In Dyna, we keep the empirical average of past rewards for each x,ax,a in r^:𝒳×𝒜\hat{r}\colon\mathcal{X}\times\mathcal{A}\to\mathbb{R} and perform planning with 𝒫^,r^\hat{\mathcal{P}},\hat{r} at each iteration to calculate the value function, that is V=(Iγ𝒫^π)1r^πV=(I-\gamma\hat{\mathcal{P}}^{\pi})^{-1}\hat{r}^{\pi}. In Figure 3, we have shown the result for θ=0.3\theta=0.3. In Figure 12 and Figure 13 we show the results for θ=0.1,0.3,0.5\theta=0.1,0.3,0.5 in both Maze and Chain Walk environments.
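A minimal sketch of the smoothing in (10), assuming \mathcal{P}_{\mathrm{MLE}} is stored as an array of empirical next-state distributions indexed along the last axis, is given below; the array layout is our assumption.

```python
import numpy as np

def smoothed_model(P_mle, theta):
    """Hypothetical sketch of Eq. (10): blend the MLE next-state distribution
    with the uniform distribution over its support, for every (x, a).
    The last axis of P_mle indexes the next state x'."""
    support = (P_mle > 0).astype(float)                        # {x' : P_MLE(x'|x,a) > 0}
    # uniform distribution over the support of each row; rows with no
    # observed samples yet are left as all zeros
    uniform = support / np.maximum(support.sum(axis=-1, keepdims=True), 1)
    return (1 - theta) * P_mle + theta * uniform
```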

As we see in Figure 12 and Figure 13, DDTD converges faster than conventional TD. In Maze, this is only achieved with the higher-rank versions of DDTD, but in Chain Walk, even rank-1 DDTD is able to significantly accelerate learning. Also note that, unlike Dyna, DDTD converges to the true value function despite the model error. We observe that the impact of model error on DDTD is very mild. In some cases, such as rank-3 DDTD in Maze, we observe slightly slower convergence with the higher model error between \theta=0.3 and \theta=0.5. The hyperparameters of TD Learning and DDTD are given in Tables 1 and 2.

Figure 12: Comparison of DDTD with TD Learning and Dyna in Maze. Left: low model error with \theta=0.1. Middle: medium model error with \theta=0.3. Right: high model error with \theta=0.5. Each curve is the average of 20 runs. The shaded area shows one standard error.
Table 1: Hyperparameters for the Maze environment. Cells with multiple values provide the value of the hyperparameter for \theta=0.1, \theta=0.3, and \theta=0.5, respectively.

                      | rank-1 DDTD   | rank-2 DDTD   | rank-3 DDTD      | rank-4 DDTD      | TD
\eta (learning rate)  | 0.3, 0.3, 0.3 | 0.3, 0.3, 0.3 | 0.07, 0.07, 0.07 | 0.07, 0.07, 0.07 | 0.3
\alpha                | 0.8, 0.8, 0.8 | 0.7, 0.8, 0.8 | 0.9, 0.9, 0.9    | 0.9, 0.9, 0.9    | -
K                     | 10            | 10            | 10               | 10               | -
Figure 13: Comparison of DDTD with TD Learning and Dyna in Chain Walk. Left: low model error with \theta=0.1. Middle: medium model error with \theta=0.3. Right: high model error with \theta=0.5. Each curve is the average of 20 runs. The shaded area shows one standard error.
Table 2: Hyperparameters for the Chain Walk environment. Cells with multiple values provide the value of the hyperparameter for \theta=0.1, \theta=0.3, and \theta=0.5, respectively.

                      | rank-1 DDTD   | rank-2 DDTD   | rank-3 DDTD   | rank-4 DDTD   | TD
\eta (learning rate)  | 1             | 1             | 1             | 1             | 1
\alpha                | 0.9, 0.9, 0.9 | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.8 | 0.9, 0.8, 0.8 | -
K                     | 10            | 10            | 10            | 10            | -