
Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes

Ali Devran Kara, alikara@umich.edu
Department of Mathematics
University of Michigan
Ann Arbor, MI 48109-1043, USA

Serdar Yüksel, yuksel@queensu.ca
Department of Mathematics and Statistics
Queen's University
Kingston, ON, Canada
Abstract

In the theory of Partially Observed Markov Decision Processes (POMDPs), the existence of optimal policies has in general been established by converting the original partially observed stochastic control problem into a fully observed one on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and hence for the original POMDP, using classical dynamic or linear programming methods is challenging even if the original system has finite state and action spaces, since the state space of the fully observed belief-MDP model is always uncountable. Furthermore, there exist very few rigorous value function approximation and optimal policy approximation results, as the required regularity conditions often entail a tedious study involving the spaces of probability measures, leading to properties such as Feller continuity. In this paper, we study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known. We construct an approximate belief model by discretizing the belief space using only finite window information variables. We then find optimal policies for the approximate model and rigorously establish near optimality of the constructed finite window control policies in POMDPs under mild non-linear filter stability conditions and the assumption that the measurement and action sets are finite (and the state space is real vector valued). We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound, where the rate of convergence is exponential under explicit and testable exponential filter stability conditions. While there exist many experimental results and few rigorous asymptotic convergence results, an explicit rate of convergence result is new in the literature, to our knowledge.

Keywords: POMDPs, nonlinear filters, stochastic control

1 Introduction

For Partially Observed Stochastic Control problems, also known as Partially Observed Markov Decision Problems (POMDPs), the existence of optimal policies has in general been established by converting the original partially observed stochastic control problem into a fully observed Markov Decision Problem (MDP) on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and hence for the original POMDP, using classical methods (such as dynamic programming, policy iteration, or linear programming) is challenging even if the original system has finite state and action spaces, since the state space of the fully observed model is always uncountable.

In the MDP theory, various methods have been developed to compute approximately optimal policies by reducing the original problem to a simpler one. A partial list of these techniques is as follows: approximate dynamic programming, approximate value or policy iteration, simulation-based techniques, neuro-dynamic programming (or reinforcement learning), state aggregation, etc. (Dufour and Prieto-Rumeau, 2012; Bertsekas, 1975; Chow and Tsitsiklis, 1991). Saldi et al. (2017) investigated finite action and state approximations of fully observed stochastic control problems with general state and action spaces under the discounted cost and average cost optimality criteria. Under a weak continuity assumption on the controlled transition kernel, they showed that finite state approximations are near optimal, in the sense that optimal policies obtained from these models asymptotically achieve the optimal cost for the original problem.

On POMDPs, however, the problem of approximation is significantly more challenging. Most of the studies in the literature are algorithmic and computational contributions (Porta et al., 2006; Zhou and Hansen, 2001). These studies develop computational algorithms, utilizing structural convexity/concavity properties of the value function under the discounted cost criterion. Vlassis and Spaan (2005) provide an insightful algorithm which may be regarded as a quantization of the belief space; however, no rigorous convergence results are provided. Smith and Simmons (2012); Pineau et al. (2006) also present quantization based algorithms for the belief state, where the state, measurement, and action sets are finite. Zhang et al. (2014) also provide a computationally efficient approximation scheme by quantizing the belief space uniformly under the $L_{1}$ distance.

For partially observed setups, Saldi et al. (2020, 2017) introduce a rigorous approximation analysis after establishing weak continuity conditions on the transition kernel defining the belief-MDP via the non-linear filter (Feinberg et al., 2012; Kara et al., 2019), and show that finite model approximations obtained through quantization are asymptotically optimal and that the control policies obtained from the finite model can be applied to the actual system with asymptotically vanishing error as the number of quantization bins increases. Another rigorous set of studies is by Zhou et al. (2008, 2010), where the authors provide an explicit quantization method for the set of probability measures containing the belief states, where the state space is parametrically representable under strong density regularity conditions. The quantization is done through approximations as measured by the Kullback-Leibler divergence (relative entropy) between probability density functions. Further recent studies include Mao et al. (2020); Subramanian and Mahajan (2019). Subramanian and Mahajan (2019) present a notion of approximate information variable and study near optimality of policies that satisfy the approximate information state property. Mao et al. (2020) analyze a similar problem under a decentralized setup. Our explicit approximation results in this paper will find applications in both of these studies.

We refer the reader to the survey papers by Lovejoy (1991); White (1991); Hansen (2013) and the recent book by Krishnamurthy (2016) for further structural results as well as algorithmic and computational methods for approximating POMDPs. Notably, for POMDPs Krishnamurthy (2016) presents structural results on optimal policies under monotonicity conditions of the value function in the belief variable.

For our work, we specifically focus on finite memory approximations. With regard to approximations based on finite memory, the following two papers are particularly relevant to our paper:

Yu and Bertsekas (2008) study near optimality of finite window policies for average cost problems where the state, action and observation spaces are finite; under the condition that the liminf and limsup of the average cost are equal and independent of the initial state, the paper establishes the near-optimality of (non-stationary) finite memory policies. Here, a concavity argument building on a work of Feinberg (1982) (which becomes consequential by the equality assumption) and the finiteness of the state space is crucial. The paper shows that for any given $\epsilon>0$, there exists an $\epsilon$-optimal finite window policy. However, the authors do not provide a performance bound related to the length of the window, and in fact the proof method builds on convex analysis. Nonetheless, the constant property of the value functions over initial priors is related to unique ergodicity, and thus the stability problem of non-linear filters, which is a topic of current investigation particularly in the controlled setup.

In another related direction, White-III and Scherer (1994) study finite memory approximation techniques for POMDPs with finite state, action and measurement spaces. The POMDP is reduced to a belief MDP, and the worst and best case predictors prior to the $N$ most recent information variables are considered to build an approximate belief MDP. The original value function is bounded using these approximate belief MDPs that use only finite memory, where the finiteness of the state space is critically used. Furthermore, a loss bound is provided for a suboptimally constructed policy that only uses finite history, where the bound depends on a more specific ergodicity coefficient (which requires restrictive, sample pathwise, contraction properties). In our paper, we consider more general signal spaces and more relaxed filter stability conditions, and establish explicit rates of convergence results. We also rigorously establish the relation of the loss bound to nonlinear filter stability and to state space reduction techniques for MDPs.

A recent work by the authors, Kara and Yüksel (2021), introduces a different finite history approximation technique where the approximation is done via an alternative belief MDP-reduction method rather than direct discretization of the space of probability measures. It is shown that the approximation error can be related to controlled filter stability in terms of the total variation distance, whereas in this paper, the error bound is also shown to be related to more general, and in particular weak convergence inducing, metrics. Although the approximation technique introduced in Kara and Yüksel (2021) provides an error upper bound in terms of the more stringent total variation distance, it proves to be numerically efficient, as it is shown that a finite history Q-learning algorithm converges to the optimality equation of an approximate model. Our analysis here requires less stringent conditions on filter stability; however, the use of the bounded-Lipschitz metric on probability measures leads to a significantly more tedious analysis.

Contributions. In this paper, we rigorously establish near optimality of finite memory feedback control policies for the case where the actions and measurements are finite (with the state being real vector valued), provided that the controlled non-linear filter is stable in a sense to be presented in the paper. We also explicitly relate the approximation error with the window size. This is the first rigorous result, to our knowledge, where finite window policies are shown to be $\epsilon$-optimal with an explicit rate of convergence with respect to the window size.

1.1 Preliminaries and the Main Results

Let $\mathds{X}\subset\mathds{R}^{m}$ denote a Borel set which is the state space of a partially observed controlled Markov process. Here and throughout the paper, $\mathbb{Z}_{+}$ denotes the set of non-negative integers and $\mathds{N}$ denotes the set of positive integers. Let $\mathds{Y}$ be a finite set denoting the observation space of the model, and let the state be observed through an observation channel $Q$. The observation channel $Q$ is defined as a stochastic kernel (regular conditional probability) from $\mathds{X}$ to $\mathds{Y}$, such that $Q(\,\cdot\,|x)$ is a probability measure on the power set $P(\mathds{Y})$ of $\mathds{Y}$ for every $x\in\mathds{X}$, and $Q(A|\,\cdot\,):\mathds{X}\to[0,1]$ is a Borel measurable function for every $A\in P(\mathds{Y})$. A decision maker (DM) is located at the output of the channel $Q$, and hence it only sees the observations $\{Y_{t},\,t\in\mathbb{Z}_{+}\}$ and chooses its actions from $\mathds{U}$, the action space, which is a finite subset of some Euclidean space. An admissible policy $\gamma$ is a sequence of control functions $\{\gamma_{t},\,t\in\mathbb{Z}_{+}\}$ such that $\gamma_{t}$ is measurable with respect to the $\sigma$-algebra generated by the information variables $I_{t}=\{Y_{[0,t]},U_{[0,t-1]}\},\ t\in\mathds{N}$, $I_{0}=\{Y_{0}\}$, where

U_{t}=\gamma_{t}(I_{t}),\quad t\in\mathbb{Z}_{+},\qquad (1)

are the $\mathds{U}$-valued control actions and $Y_{[0,t]}=\{Y_{s},\,0\leq s\leq t\}$, $U_{[0,t-1]}=\{U_{s},\,0\leq s\leq t-1\}$.

We define $\Gamma$ to be the set of all such admissible policies. The update rules of the system are determined by (1) and the following relationships:

\mathop{\rm Pr}\bigl((X_{0},Y_{0})\in B\bigr)=\int_{B}\mu(dx_{0})Q(dy_{0}|x_{0}),\quad B\in\mathcal{B}(\mathds{X}\times\mathds{Y}),

where $\mu$ is the (prior) distribution of the initial state $X_{0}$, and

\mathop{\rm Pr}\biggl((X_{t},Y_{t})\in B\,\bigg|\,(X,Y,U)_{[0,t-1]}=(x,y,u)_{[0,t-1]}\biggr)=\int_{B}\mathcal{T}(dx_{t}|x_{t-1},u_{t-1})Q(dy_{t}|x_{t}),

$B\in\mathcal{B}(\mathds{X}\times\mathds{Y})$, $t\in\mathds{N}$, where $\mathcal{T}$ is the transition kernel of the model, which is a stochastic kernel from $\mathds{X}\times\mathds{U}$ to $\mathds{X}$. Note that, although $\mathds{Y}$ is finite, we use the integral sign instead of the summation sign for notational convenience, by letting the measure be a sum of Dirac delta measures. We let the objective of the agent (decision maker) be the minimization of the infinite horizon discounted cost,

J_{\beta}(\mu,{\cal T},\gamma)=E_{\mu}^{{\cal T},\gamma}\left[\sum_{t=0}^{\infty}\beta^{t}c(X_{t},U_{t})\right]

for some discount factor $\beta\in(0,1)$, over the set of admissible policies $\gamma\in\Gamma$, where $c:\mathds{X}\times\mathds{U}\to\mathds{R}$ is a Borel-measurable stage-wise cost function and $E_{\mu}^{{\cal T},\gamma}$ denotes the expectation with initial state probability measure $\mu$ and transition kernel ${\cal T}$ under policy $\gamma$. Note that $\mu\in\mathcal{P}(\mathds{X})$, where we let $\mathcal{P}(\mathds{X})$ denote the set of probability measures on $\mathds{X}$. We define the optimal cost for the discounted infinite horizon setup as a function of the priors and the transition kernels as

J_{\beta}^{*}(\mu,{\cal T})=\inf_{\gamma\in\Gamma}J_{\beta}(\mu,{\cal T},\gamma).

For partially observed stochastic problems, optimal policies use all the available information in general. The question we ask is the following: suppose we define an $N$-memory admissible policy $\gamma^{N}$ as a sequence of control functions $\{\gamma_{t},\,t\in\mathbb{Z}_{+}\}$ such that $\gamma_{t}$ is measurable with respect to the $\sigma$-algebra generated by the information variables

I_{t}^{N}=\{Y_{[t-N,t]},U_{[t-N,t-1]}\},\quad\text{if }t\geq N,
I_{t}^{N}=\{Y_{[0,t]},U_{[0,t-1]}\},\quad\text{if }0<t<N,
I_{0}=\{Y_{0}\},\qquad (2)

that is, the controller can only access the information variables through a window of length $N$. We define $\Gamma^{N}$ to be the set of all such $N$-memory admissible policies. Similarly, we define the optimal cost function under $N$-memory admissible policies as

J_{\beta}^{N}(\mu,\mathcal{T})=\inf_{\gamma^{N}\in\Gamma^{N}}J_{\beta}(\mu,\mathcal{T},\gamma^{N}).

Under this setup, we will study the following problem.

  • Problem:

    Under suitable conditions, can we find explicit bounds on $J_{\beta}^{N}(\mu,\mathcal{T})-J_{\beta}^{*}(\mu,\mathcal{T})$ in terms of $N$, and a constructive approximate solution achieving this bound?

Our goal is to find the best possible control policy in $\Gamma^{N}$, that is, in the set of policies that use only a finite history of information variables, in an offline setting by reducing the problem to a simpler approximate setup where we assume that the system dynamics are known to the designer. A general summary of the approach we follow to answer this problem is as follows: We first define the belief MDP counterpart of the partially observed system. Then, we construct a finite subset of the belief state space using the probability distributions that can be achieved using the finite window information variables (the $I_{t}^{N}$'s) from a fixed probability distribution. This finite subset leads to an approximate MDP model for which we find optimal policies. The calculation of these policies is greatly simplified compared to the calculation of optimal policies for the original POMDP model. Finally, we show that the loss incurred by applying this approximate policy to the original model can be upper bounded by the expected error of the discretization of the belief space. The loss is evaluated against the best possible admissible policy in the set $\Gamma$. The accumulating error can then be represented in relation to the filter stability problem, that is, how fast the controlled process forgets its initial distribution as it observes the information variables from the system.

Note that although we take the infimum over all $N$-memory admissible policies, we will explicitly construct finite window policies which will be time-invariant, or a finite-state probabilistic automaton (as it is also referred to by Yu and Bertsekas, 2008) that accepts as inputs the finite window of observations and actions, and produces as outputs the control actions in a time-invariant/stationary fashion. Accordingly, the infimum above for $J_{\beta}^{N}(\mu,\mathcal{T})$ can be replaced with the minimum over such policies.
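As a minimal illustration, and not the paper's construction itself, such a stationary finite window policy can be realized as a lookup table keyed by the last $N+1$ observations and $N$ actions, fed in a sliding-window fashion; the table, the binary alphabets, and the default action used before the window fills up are all hypothetical:

```python
from collections import deque

class FiniteWindowPolicy:
    """Stationary N-memory policy: a map from the window
    (y_{t-N}, ..., y_t, u_{t-N}, ..., u_{t-1}) to a control action.
    Before the window fills up (t < N) we fall back to a fixed default
    action; the paper instead allows arbitrary behavior for t < N.
    """
    def __init__(self, N, table, default_action):
        self.N = N
        self.table = table             # dict: (y-window, u-window) -> action
        self.default = default_action
        self.ys = deque(maxlen=N + 1)  # last N+1 observations
        self.us = deque(maxlen=N)      # last N actions

    def act(self, y):
        self.ys.append(y)
        key = (tuple(self.ys), tuple(self.us))
        u = self.table.get(key, self.default) if len(self.us) == self.N \
            else self.default
        self.us.append(u)
        return u

# Hypothetical binary example with window N = 1:
# choose u = most recent observation once the window is full.
policy = FiniteWindowPolicy(1, {((y0, y1), (u0,)): y1
                                for y0 in (0, 1) for y1 in (0, 1)
                                for u0 in (0, 1)}, default_action=0)
print([policy.act(y) for y in [1, 0, 1, 1]])  # → [0, 0, 1, 1]
```

The first action is the default because the window is not yet full; afterwards the automaton is time-invariant, as required above.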

We answer the problem above affirmatively in Theorems 12, 16, and 17 under complementary conditions.

2 Regularity and Stability Properties of the Belief-MDP

In this section, we introduce the belief MDP reduction of POMDPs and provide regularity properties of the belief MDPs.

2.1 Convergence Notions for Probability Measures

For the analysis of the technical results, we will use different notions of convergence for sequences of probability measures.

Two important notions of convergence for sequences of probability measures are weak convergence and convergence under total variation. For some $N\in\mathbb{N}$, a sequence $\{\mu_{n},n\in\mathbb{N}\}$ in $\mathcal{P}(\mathds{R}^{N})$ is said to converge to $\mu\in\mathcal{P}(\mathds{R}^{N})$ weakly if $\int_{\mathds{R}^{N}}c(x)\mu_{n}(dx)\to\int_{\mathds{R}^{N}}c(x)\mu(dx)$ for every continuous and bounded $c:\mathds{R}^{N}\to\mathds{R}$. One important property of weak convergence is that the space of probability measures on a complete, separable, metric (Polish) space endowed with the topology of weak convergence is itself complete, separable, and metric (Parthasarathy, 1967). One such metric is the bounded Lipschitz metric (Villani, 2008, p.109), which is defined for $\mu,\nu\in{\mathcal{P}}(\mathds{X})$ as

\rho_{BL}(\mu,\nu):=\sup_{\|f\|_{BL}\leq 1}\left|\int f\,d\mu-\int f\,d\nu\right|\qquad (3)

where

\|f\|_{BL}:=\|f\|_{\infty}+\sup_{x\neq y}\frac{|f(x)-f(y)|}{d(x,y)}

and $\|f\|_{\infty}=\sup_{x\in\mathds{X}}|f(x)|$.

For probability measures $\mu,\nu\in\mathcal{P}(\mathds{R}^{N})$, the total variation metric is given by

\|\mu-\nu\|_{TV}=2\sup_{B\in\mathcal{B}(\mathds{R}^{N})}|\mu(B)-\nu(B)|=\sup_{f:\|f\|_{\infty}\leq 1}\left|\int f(x)\mu(\mathrm{d}x)-\int f(x)\nu(\mathrm{d}x)\right|,

where the supremum is taken over all measurable real $f$ such that $\|f\|_{\infty}=\sup_{x\in\mathds{R}^{N}}|f(x)|\leq 1$. A sequence $\{\mu_{n}\}$ is said to converge in total variation to $\mu\in\mathcal{P}(\mathds{R}^{N})$ if $\|\mu_{n}-\mu\|_{TV}\to 0$.
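For measures with finite support the two expressions above coincide and reduce to the $\ell_{1}$ distance between the probability vectors; a small sanity check, with hypothetical distributions p and q:

```python
from itertools import product

def tv_via_sets(p, q):
    """2 * sup_B |mu(B) - nu(B)|, brute-forced over all subsets B."""
    n = len(p)
    best = 0.0
    for mask in product((0, 1), repeat=n):
        diff = abs(sum((p[i] - q[i]) * mask[i] for i in range(n)))
        best = max(best, diff)
    return 2 * best

def tv_via_functions(p, q):
    """sup over |f| <= 1 of |int f dmu - int f dnu|; the supremum is
    attained at f = sign(p - q), giving the l1 distance."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.4, 0.4]
print(tv_via_sets(p, q), tv_via_functions(p, q))  # both equal 0.6
```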

2.2 Ergodicity and Filter Stability Properties of Partially Observed MDPs

Given a prior $\mu\in{\mathcal{P}}(\mathds{X})$ and a policy $\gamma\in\Gamma$, we define the filter and the predictor for a POMDP in the following.

Definition 1

The one step predictor process is defined as the sequence of conditional probability measures

\pi_{n-}^{\mu,\gamma}(\cdot)=P^{\mu,\gamma}(X_{n}\in\cdot\,|Y_{[0,n-1]},U_{[0,n-1]}=\gamma_{n}(Y_{[0,n-1]},U_{[0,n-2]}))=P^{\mu,\gamma}(X_{n}\in\cdot\,|Y_{[0,n-1]}),\quad n\in\mathbb{N},

where $P^{\mu,\gamma}$ is the probability measure induced by the prior $\mu$ and the policy $\gamma$, when $\mu$ is the probability measure on $X_{0}$.

Definition 2

The filter process is defined as the sequence of conditional probability measures

\pi_{n}^{\mu,\gamma}(\cdot)=P^{\mu,\gamma}(X_{n}\in\cdot\,|Y_{[0,n]},U_{[0,n-1]}=\gamma_{n}(Y_{[0,n-1]},U_{[0,n-2]}))=P^{\mu,\gamma}(X_{n}\in\cdot\,|Y_{[0,n]}),\quad n\in\mathbb{N},\qquad (4)

where $P^{\mu,\gamma}$ is the probability measure induced by the prior $\mu$ and the policy $\gamma$.

Definition 3

(Dobrushin, 1956, Equation 1.16) For a kernel operator $K:S_{1}\to\mathcal{P}(S_{2})$ (that is, a regular conditional probability from $S_{1}$ to $S_{2}$) for standard Borel spaces $S_{1},S_{2}$, we define the Dobrushin coefficient as

\delta(K)=\inf\sum_{i=1}^{n}\min(K(x,A_{i}),K(y,A_{i}))\qquad (5)

where the infimum is over all $x,y\in S_{1}$ and all partitions $\{A_{i}\}_{i=1}^{n}$ of $S_{2}$.

We note that this definition holds for continuous or finite/countable spaces $S_{1}$ and $S_{2}$, and $0\leq\delta(K)\leq 1$ for any kernel operator.

Example 1

Assume that, for a finite setup, we have the following stochastic transition matrix:

K=\begin{pmatrix}\frac{1}{3}&\frac{1}{3}&\frac{1}{3}\\ 0&\frac{1}{2}&\frac{1}{2}\\ \frac{3}{4}&0&\frac{1}{4}\end{pmatrix}

The Dobrushin coefficient is the minimum over any two rows of the sum of the minimum elements among those rows. For this example, the first and the second rows give $\frac{2}{3}$, the first and the third rows give $\frac{7}{12}$, and the second and the third rows give $\frac{1}{4}$. Then the Dobrushin coefficient is $\frac{1}{4}$.
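The pairwise-minimum computation above can be sketched in a few lines; for a finite state space the infimum over partitions in (5) is attained at the partition into singletons, so it suffices to scan all row pairs:

```python
from itertools import combinations

def dobrushin(K):
    """Dobrushin coefficient of a row-stochastic matrix K: the minimum,
    over all pairs of rows, of the sum of element-wise minima."""
    return min(sum(min(a, b) for a, b in zip(K[i], K[j]))
               for i, j in combinations(range(len(K)), 2))

K = [[1/3, 1/3, 1/3],
     [0.0, 1/2, 1/2],
     [3/4, 0.0, 1/4]]
print(dobrushin(K))  # 0.25, matching the example
```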

Let

\tilde{\delta}({\cal T}):=\inf_{u\in\mathds{U}}\delta({\cal T}(\cdot|\cdot,u)).
Definition 4

For $\mathds{X}\subset\mathds{R}^{m}$ for some $m\in\mathds{N}$, and for two probability measures $\mu,\nu\in{\mathcal{P}}(\mathds{X})$, $\mu$ is said to be absolutely continuous with respect to $\nu$ if $\mu(A)=0$ for every set $A\in{\mathcal{B}}(\mathds{X})$ for which $\nu(A)=0$. We denote the absolute continuity of $\mu$ with respect to $\nu$ by $\mu\ll\nu$.

Theorem 5

(McDonald and Yüksel, 2020, Theorem 3.3) Assume that for $\mu,\nu\in{\mathcal{P}}(\mathds{X})$, we have $\mu\ll\nu$. Then we have (exponential filter stability)

E^{\mu,\gamma}\left[\|\pi_{n+1}^{\mu,\gamma}-\pi_{n+1}^{\nu,\gamma}\|_{TV}\right]\leq(1-\tilde{\delta}(\mathcal{T}))(2-\delta(Q))E^{\mu,\gamma}\left[\|\pi_{n}^{\mu,\gamma}-\pi_{n}^{\nu,\gamma}\|_{TV}\right].

In particular, defining $\alpha:=(1-\tilde{\delta}(\mathcal{T}))(2-\delta(Q))$, we have

E^{\mu,\gamma}\left[\|\pi_{n}^{\mu,\gamma}-\pi_{n}^{\nu,\gamma}\|_{TV}\right]\leq 2\alpha^{n}.

This result will be a key ingredient for our main results. It provides conditions under which the belief-state processes for a given POMDP under different priors get closer, in expectation under the true probability space, when they are fed with the same observation process. In a vague sense, if the state process is tracked only using a finite window of recent measurement and control variables (and forgets the past observations and actions), then the amount of mismatch from the true filter can be bounded by an error that diminishes exponentially with the window size. The relationship is via the term $\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]}),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]})\right)\right]$, which appears crucially in (4.1). We note that if one does not wish to have an explicit rate of convergence result, one could have more relaxed conditions for filter stability, which will still lead to rigorous approximation results on the performance of finite window policies via the controlled filter stability analysis in McDonald and Yüksel (2022).
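As a numerical illustration (with hypothetical transition kernels and channel, reusing the pairwise-minimum computation of the Dobrushin coefficient from Example 1), the contraction rate $\alpha$ of Theorem 5 and the resulting bound $2\alpha^{n}$ can be evaluated directly:

```python
from itertools import combinations

def dobrushin(K):
    # minimum over row pairs of the summed element-wise minima
    return min(sum(min(a, b) for a, b in zip(K[i], K[j]))
               for i, j in combinations(range(len(K)), 2))

# Hypothetical kernels: one transition matrix per action u, and a channel Q.
T = {0: [[1/3, 1/3, 1/3], [0.0, 1/2, 1/2], [3/4, 0.0, 1/4]],
     1: [[1/2, 1/4, 1/4], [1/4, 1/2, 1/4], [1/4, 1/4, 1/2]]}
Q = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]]

delta_T = min(dobrushin(T[u]) for u in T)   # tilde-delta(T): inf over actions
alpha = (1 - delta_T) * (2 - dobrushin(Q))  # Theorem 5 contraction rate

print(alpha)            # 0.825 here: alpha < 1, so the filter is stable
print(2 * alpha ** 10)  # bound on the expected TV mismatch after 10 steps
```

Note that $\alpha<1$ is not automatic; it requires the transition kernel to be sufficiently mixing and the channel to be sufficiently informative, which is exactly the "explicit and testable" condition referred to in the abstract.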

2.3 Reduction to Fully Observed Models and Regularity Properties of Belief-MDPs

It is by now a standard result that, for optimality analysis, any POMDP can be reduced to a completely observable Markov decision process (Yushkevich, 1976; Rhenius, 1974), whose states are the posterior state distributions or beliefs of the observer, i.e., the filter process as defined in (4); that is, the state at time $n$ is

{\mathsf{Pr}}\{X_{n}\in\,\cdot\,|Y_{0},\ldots,Y_{n},U_{0},\ldots,U_{n-1}\}\in{\mathcal{P}}({\mathds{X}}).

We call this equivalent process the filter process. The filter process has state space $\mathcal{Z}={\mathcal{P}}({\mathds{X}})$ and action space ${\mathds{U}}$. Here, $\mathcal{Z}$ is equipped with the Borel $\sigma$-algebra generated by the topology of weak convergence (Billingsley, 1999). As noted earlier, under this topology, $\mathcal{Z}$ is a standard Borel space (Parthasarathy, 1967). Then, the transition probability $\eta$ of the filter process can be constructed as follows (see also Hernández-Lerma, 1989). If we define the measurable function

F(z,u,y):=F(\,\cdot\,|y,u,z)=\mathop{\rm Pr}\{X_{n+1}\in\,\cdot\,|Z_{n}=z,U_{n}=u,Y_{n+1}=y\}

from ${\cal P}(\mathds{X})\times\mathds{U}\times\mathds{Y}$ to ${\cal P}(\mathds{X})$ and use the stochastic kernel $P(\,\cdot\,|z,u)=\mathop{\rm Pr}\{Y_{n+1}\in\,\cdot\,|Z_{n}=z,U_{n}=u\}$ from ${\cal P}(\mathds{X})\times\mathds{U}$ to $\mathds{Y}$, we can write $\eta$ as

\eta(\,\cdot\,|z,u)=\int_{\mathds{Y}}1_{\{F(z,u,y)\in\,\cdot\,\}}P(dy|z,u).\qquad (6)

The one-stage cost function $\tilde{c}:{\cal P}(\mathds{X})\times\mathds{U}\rightarrow[0,\infty)$ of the filter process is given by

\tilde{c}(z,u):=\int_{{\mathds{X}}}c(x,u)z(dx),\qquad (7)

which is a Borel measurable function. Hence, the filter process is a completely observable Markov process with the components $(\mathcal{Z},{\mathds{U}},\tilde{c},\eta)$.
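For intuition, here is a minimal sketch of the update map $F(z,u,y)$ and the observation kernel $P(\cdot|z,u)$, assuming a hypothetical finite state space (the paper's state space is a Borel subset of $\mathds{R}^{m}$, but the Bayes recursion takes the same form):

```python
import numpy as np

def filter_update(z, u, y, T, Q):
    """F(z, u, y): Bayesian update of the belief z after applying action u
    and observing y. T[u][x, x'] is the transition kernel and Q[x', y] the
    observation channel (both hypothetical finite matrices)."""
    predictor = z @ T[u]           # Pr{X_{n+1} = . | Z_n = z, U_n = u}
    unnorm = predictor * Q[:, y]   # multiply by the channel likelihood
    return unnorm / unnorm.sum()

def obs_kernel(z, u, T, Q):
    """P(. | z, u) = Pr{Y_{n+1} = . | Z_n = z, U_n = u}; together with
    filter_update, this determines eta(. | z, u) via (6)."""
    return (z @ T[u]) @ Q

T = {0: np.array([[0.9, 0.1], [0.2, 0.8]])}
Q = np.array([[0.7, 0.3], [0.4, 0.6]])
z = np.array([0.5, 0.5])

print(obs_kernel(z, 0, T, Q))        # distribution of the next observation
print(filter_update(z, 0, 0, T, Q))  # next belief, given y = 0
```

Since $\mathds{Y}$ is finite, $\eta(\cdot|z,u)$ places mass $P(y|z,u)$ on each candidate next belief $F(z,u,y)$, exactly as in (6).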

For the filter process, the information variables are defined as

\tilde{I}_{t}=\{Z_{[0,t]},U_{[0,t-1]}\},\quad t\in\mathds{N},\qquad\tilde{I}_{0}=\{Z_{0}\}.

It is well known that an optimal control policy of the original POMDP can use the belief $Z_{t}$ as a sufficient statistic (see Yushkevich, 1976; Rhenius, 1974), provided optimal policies exist. More precisely, the filter process is equivalent to the original POMDP in the sense that, from any optimal policy for the filter process, one can construct a policy for the original POMDP which is optimal. On existence, we note the following.

With the recent results by Feinberg et al. (2016); Kara et al. (2019), the transition model of the belief-MDP can be shown to satisfy weak continuity conditions in the belief state and action variables, and accordingly the measurable selection conditions (Hernandez-Lerma and Lasserre, 1996, Chapter 3) apply. Notably, we state the following.

Assumption 1
  • (i)

    The transition probability $\mathcal{T}(\cdot|x,u)$ is weakly continuous in $(x,u)$, i.e., for any $(x_{n},u_{n})\to(x,u)$, $\mathcal{T}(\cdot|x_{n},u_{n})\to\mathcal{T}(\cdot|x,u)$ weakly.

  • (ii)

    The observation channel $Q(\cdot|x,u)$ is continuous in total variation, i.e., for any $(x_{n},u_{n})\to(x,u)$, $Q(\cdot|x_{n},u_{n})\rightarrow Q(\cdot|x,u)$ in total variation.

Assumption 2
  • (i)

    The transition probability $\mathcal{T}(\cdot|x,u)$ is continuous in total variation in $(x,u)$, i.e., for any $(x_{n},u_{n})\to(x,u)$, $\mathcal{T}(\cdot|x_{n},u_{n})\to\mathcal{T}(\cdot|x,u)$ in total variation.

  • (ii)

    The observation channel $Q(\cdot|x)$ is independent of the control variable.

Theorem 6
  • (i)

    (Feinberg et al., 2016) Under Assumption 1, the transition probability $\eta(\cdot|z,u)$ of the filter process is weakly continuous in $(z,u)$.

  • (ii)

    (Kara et al., 2019) Under Assumption 2, the transition probability $\eta(\cdot|z,u)$ of the filter process is weakly continuous in $(z,u)$.

Under the above weak continuity conditions, the measurable selection conditions (Hernandez-Lerma and Lasserre, 1996, Chapter 3) apply, a solution to the discounted cost optimality equation exists, and accordingly an optimal control policy exists. This policy is stationary (in the belief state). Thus, there exists a function $\Phi:{\mathcal{P}}(\mathds{X})\to\mathds{U}$ such that, for any policy $\gamma$ for a prior $\mu$,

\gamma(y_{[0,n]})=\Phi\left(P^{\mu,\gamma}(X_{n}\in\cdot|Y_{[0,n]}=y_{[0,n]})\right)=\Phi(\pi_{n}^{\mu,\gamma}).

In particular, we have that $J_{\beta}^{*}(\mu,\mathcal{T},Q)=J_{\beta}^{*}(\mu,\eta)$. This will be the case in our paper, under the assumptions we will work with.

For the rest of the paper, we will use $\gamma$ for the belief process policy $\Phi$, for consistency of notation.

The following supporting result, to be used later in the paper, provides further regularity properties of the transition model of the belief-MDP under mild conditions on the fully observed model. This result may be useful for POMDP theory beyond the application considered in this paper. We also note that the bounds in (i) and (iii) below are applicable when we only have filter stability, but not exponential filter stability (McDonald and Yüksel, 2022), whereas items (ii) and (iv) will be used under exponential filter stability in this paper.

Theorem 7
  • i.

    Assume that

    \|\mathcal{T}(\cdot|x,u)-\mathcal{T}(\cdot|x^{\prime},u)\|_{TV}\leq\alpha_{\mathds{X}}|x-x^{\prime}|

    for some $\alpha_{\mathds{X}}<\infty$, for all $u\in\mathds{U}$. Then we have

    \rho_{BL}\left(\eta(\cdot|z,u),\eta(\cdot|z^{\prime},u)\right)\leq 3(1+\alpha_{\mathds{X}})\rho_{BL}(z,z^{\prime}).
  • ii.

    Assume that

    \|\mathcal{T}(\cdot|x,u)-\mathcal{T}(\cdot|x^{\prime},u)\|_{TV}\leq\alpha_{\mathds{X}}|x-x^{\prime}|

    for some $\alpha_{\mathds{X}}<\infty$, for all $u\in\mathds{U}$. Then, under the conditions of Theorem 5, we have

    \rho_{BL}\left(\eta(\cdot|z,u),\eta(\cdot|z^{\prime},u)\right)\leq(3-2\delta(Q))(1+\alpha_{\mathds{X}})\rho_{BL}(z,z^{\prime}).
  • iii.

    Without any assumptions,

    \rho_{BL}\left(\eta(\cdot|z,u),\eta(\cdot|z^{\prime},u)\right)\leq 3\|z-z^{\prime}\|_{TV}.
  • iv.

    Under the conditions of Theorem 5,

    \rho_{BL}\left(\eta(\cdot|z,u),\eta(\cdot|z^{\prime},u)\right)\leq(3-2\delta(Q))(1-\tilde{\delta}(\mathcal{T}))\|z-z^{\prime}\|_{TV}.

Proof  See Section 6.1.  

3 Approximate Model Construction: Finite Belief-MDP through Finite Memory

In this section, we will construct a finite state space by quantizing the belief state space so that the approximate finite model is obtained using only a finite memory.

Our construction builds on but significantly differs from the approach by Saldi et al. (2018, 2017). As we will explain, we cannot afford to use uniform quantization in our setup, which was a crucial tool used by Saldi et al. (2018, 2017).

As we discussed in the previous section, we can write the infinite horizon cost as

$$J_{\beta}(\mathcal{T},Q,\gamma,\mu)=\sum_{t=0}^{\infty}\beta^{t}E_{\mu}\left[\tilde{c}\left(\pi_{t},\gamma(\pi_{t})\right)\right]=\sum_{t=0}^{N-1}\beta^{t}E_{\mu}\left[\tilde{c}\left(\pi_{t},\gamma(\pi_{t})\right)\right]+\sum_{t=N}^{\infty}\beta^{t}E_{\mu}\left[\tilde{c}\left(\pi_{t},\gamma(\pi_{t})\right)\right].$$

Now we focus on the second term:

$$\sum_{t=N}^{\infty}\beta^{t}E_{\mu}\left[\tilde{c}\left(\pi_{t},\gamma(\pi_{t})\right)\right]=\sum_{t=N}^{\infty}\beta^{t}E_{\mu}\left[\tilde{c}\left(P^{\mu,\gamma}(X_{t}\in\cdot|Y_{[0,t]},U_{[0,t-1]}),\gamma(P^{\mu,\gamma}(X_{t}\in\cdot|Y_{[0,t]},U_{[0,t-1]}))\right)\right].$$

Notice that for any time step $t\geq N$, for a fixed observation realization sequence $y_{[0,t]}$ and a fixed control action sequence $u_{[0,t-1]}$, the state process can be viewed as

$$P^{\mu}(X_{t}\in\cdot|Y_{[0,t]}=y_{[0,t]},U_{[0,t-1]}=u_{[0,t-1]})=P^{\pi_{(t-N)_{-}}}(X_{t}\in\cdot|Y_{[t-N,t]}=y_{[t-N,t]},U_{[t-N,t-1]}=u_{[t-N,t-1]}),$$

where

$$\pi_{(t-N)_{-}}(\cdot)=P^{\mu}(X_{t-N}\in\cdot|Y_{[0,t-N-1]}=y_{[0,t-N-1]},U_{[0,t-N-1]}=u_{[0,t-N-1]}).$$

That is, we can view the state as the Bayesian update of $\pi_{(t-N)_{-}}$, the predictor at time $t-N$, using the observations $Y_{t-N},\dots,Y_{t}$. Notice that with this representation only the most recent window of observation and action realizations is used for the update, and the past information of the observations is embedded in $\pi_{(t-N)_{-}}$.
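This windowed representation can be checked numerically. The following minimal sketch uses an illustrative two-state model (the kernels `T`, `Q`, the prior, and all numbers are assumptions of this sketch, not a model from the paper) and verifies that the full-history filter agrees with the $N$-window Bayesian update started from the predictor $\pi_{(t-N)_{-}}$:

```python
import numpy as np

# Toy finite POMDP (illustrative assumptions):
# T[u][x, x'] = P(X_{t+1}=x' | X_t=x, U_t=u),  Q[x, y] = P(Y=y | X=x).
T = [np.array([[0.8, 0.2], [0.3, 0.7]]),
     np.array([[0.5, 0.5], [0.1, 0.9]])]
Q = np.array([[0.9, 0.1], [0.2, 0.8]])

def measurement_update(pi, y):
    """Bayes update of a prior/predictor pi with observation y."""
    post = pi * Q[:, y]
    return post / post.sum()

def prediction(pi, u):
    """One-step predictor: push pi through the transition kernel."""
    return pi @ T[u]

def filter_from(prior, ys, us):
    """Posterior P(X_t | y_{0:t}, u_{0:t-1}) started from `prior`;
    requires len(ys) == len(us) + 1."""
    pi = measurement_update(prior, ys[0])
    for u, y in zip(us, ys[1:]):
        pi = measurement_update(prediction(pi, u), y)
    return pi

rng = np.random.default_rng(0)
mu = np.array([0.6, 0.4])
ys = rng.integers(0, 2, size=11).tolist()   # y_0, ..., y_10
us = rng.integers(0, 2, size=10).tolist()   # u_0, ..., u_9
t, N = 10, 4

# Full-history filter: P(X_t | y_{0:t}, u_{0:t-1}).
full = filter_from(mu, ys[: t + 1], us[:t])

# Predictor N steps back: P(X_{t-N} | y_{0:t-N-1}, u_{0:t-N-1}).
pred = mu.copy()
if t - N > 0:
    pred = filter_from(mu, ys[: t - N], us[: t - N - 1])
    pred = prediction(pred, us[t - N - 1])

# N-window Bayesian update started from the predictor.
window = filter_from(pred, ys[t - N : t + 1], us[t - N : t])

assert np.allclose(full, window)  # the two computations agree
```

The two posteriors coincide exactly, reflecting that the window variables together with $\pi_{(t-N)_{-}}$ are a sufficient statistic for the belief.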

Hence, for time stages $t\geq N$, we can view the state space as

$$\mathcal{Z}=\left\{P^{\pi}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]}):\pi\in{\mathcal{P}}(\mathds{X}),Y_{[0,N]}\in\mathds{Y}^{N+1},U_{[0,N-1]}\in\mathds{U}^{N}\right\}.$$

For a fixed probability measure $\hat{\pi}\in{\mathcal{P}}(\mathds{X})$, consider the following finite set $\mathcal{Z}_{\hat{\pi}}^{N}$:

$$\mathcal{Z}_{\hat{\pi}}^{N}:=\left\{P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]}):Y_{[0,N]}\in\mathds{Y}^{N+1},U_{[0,N-1]}\in\mathds{U}^{N}\right\}.$$

Define the map $F:\mathcal{Z}\to\mathcal{Z}_{\hat{\pi}}^{N}$, which will serve as the quantizer, by

$$F(z):=\mathop{\rm arg\,min}_{y\in\mathcal{Z}_{\hat{\pi}}^{N}}\rho_{BL}(z,y). \qquad (8)$$

This map partitions the set $\mathcal{Z}$ into $|\mathds{Y}|^{N+1}\times|\mathds{U}|^{N}$ cells and thus quantizes the set of probability measures.
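A nearest-neighbor quantizer of this kind is straightforward to sketch. Since $\rho_{BL}$ is not convenient to evaluate in closed form, the illustration below substitutes the total variation distance, which upper-bounds $\rho_{BL}$; the candidate beliefs are made-up values, not elements of any particular $\mathcal{Z}_{\hat{\pi}}^{N}$:

```python
import numpy as np

def tv(p, q):
    """Total variation distance between finite distributions (halved L1)."""
    return 0.5 * np.abs(p - q).sum()

def nearest_neighbor(z, candidates):
    """F(z): index of the closest element of the finite candidate set;
    total variation stands in for rho_BL (it upper-bounds rho_BL)."""
    return int(np.argmin([tv(z, c) for c in candidates]))

# A hypothetical finite set of beliefs and a query belief.
Z_hat = [np.array([0.9, 0.1]), np.array([0.5, 0.5]), np.array([0.1, 0.9])]
z = np.array([0.75, 0.25])
print(nearest_neighbor(z, Z_hat))  # -> 0
```

The map induces a partition of the belief simplex into nearest-neighbor cells, one per element of the finite set.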

To complete our approximate controlled Markov model, we now define the one-stage cost function $c^{N}:\mathcal{Z}_{\hat{\pi}}^{N}\times\mathds{U}\to[0,\infty)$ and the transition probability $\eta^{N}$ on $\mathcal{Z}_{\hat{\pi}}^{N}$ given realizations in $\mathcal{Z}_{\hat{\pi}}^{N}\times\mathds{U}$. For a given $z_{i}=P^{\hat{\pi}}(X_{N}\in\cdot|y^{i}_{[0,N]},u^{i}_{[0,N-1]})$ and control action $u$,

$$c^{N}(z_{i},u):=\tilde{c}(P^{\hat{\pi}}(X_{N}\in\cdot|y^{i}_{[0,N]},u^{i}_{[0,N-1]}),u),$$

where $\tilde{c}$ is defined in (7), and

$$\eta^{N}(\cdot|z_{i},u):=F\ast\eta(\cdot|P^{\hat{\pi}}(X_{N}\in\cdot|y^{i}_{[0,N]},u^{i}_{[0,N-1]}),u),$$

where

$$F\ast\eta(z^{\prime}|z,u)=\eta\left(\{z^{\prime\prime}\in{\cal P}(\mathds{X}):F(z^{\prime\prime})=z^{\prime}\}\,\middle|\,z,u\right)$$

for all $z^{\prime}\in\mathcal{Z}_{\hat{\pi}}^{N}$.

We have thus defined a finite state MDP with state space $\mathcal{Z}_{\hat{\pi}}^{N}$, action space $\mathds{U}$, cost function $c^{N}$ and transition probability $\eta^{N}$.

An optimal policy $\gamma_{N}^{*}$ for this finite state model is a function taking values from the finite state space; hence, at any time step $t\geq N$ it only uses the $N$ most recent observation and control action variables. That is, $\gamma_{N}^{*}$ is a measurable function of the information set $I_{t}^{N}$ defined in (1.1) for all $t\geq N$.

We list the steps to construct the approximate model informally as follows:

  • Fix a probability distribution $\hat{\pi}\in{\mathcal{P}}(\mathds{X})$ as an estimator for the predictor from $N$ steps back ($\pi_{(t-N)_{-}}$).

  • Calculate the Bayesian updates for all possible realizations $y_{[0,N]},u_{[0,N-1]}$ ($|\mathds{Y}|^{N+1}\times|\mathds{U}|^{N}$ many) starting from the prior $\hat{\pi}$, to form the finite subset $\mathcal{Z}_{\hat{\pi}}^{N}$.

  • Calculate $\eta^{N}$ (the approximate transition model) and $c^{N}$ (the approximate cost function) using a nearest neighbor map. Note that from any $z_{i}\in\mathcal{Z}_{\hat{\pi}}^{N}$ there are only $|\mathds{Y}|$ possible transitions under the true dynamics; to construct $\eta^{N}$, these possible transitions are mapped to the closest elements of $\mathcal{Z}_{\hat{\pi}}^{N}$.

  • Calculate the value functions and the optimal policies for the finite model with state space $\mathcal{Z}_{\hat{\pi}}^{N}$, transition kernel $\eta^{N}$ and cost function $c^{N}$.
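The four steps above can be sketched end to end for a small finite model. In the sketch below, the two-state kernels, cost matrix, discount factor, and the substitution of total variation for $\rho_{BL}$ in the nearest-neighbor map are all assumptions made for illustration:

```python
import numpy as np
from itertools import product

# Illustrative two-state POMDP; all kernels and constants are assumptions.
T = [np.array([[0.8, 0.2], [0.3, 0.7]]),
     np.array([[0.5, 0.5], [0.1, 0.9]])]        # T[u][x, x']
Q = np.array([[0.9, 0.1], [0.2, 0.8]])          # Q[x, y]
c = np.array([[0.0, 1.0], [1.0, 0.0]])          # c[x, u]
nX, nY, nU, N, beta = 2, 2, 2, 2, 0.3

def m_update(pi, y):
    """Measurement (Bayes) update of a belief pi with observation y."""
    post = pi * Q[:, y]
    return post / post.sum()

def bayes_window(prior, ys, us):
    """P^prior(X_N in . | y_[0,N], u_[0,N-1]): N+1 observations, N actions."""
    pi = m_update(prior, ys[0])
    for u, y in zip(us, ys[1:]):
        pi = m_update(pi @ T[u], y)
    return pi

pi_hat = np.array([0.5, 0.5])                   # step 1: fix the prior

# Step 2: Bayesian updates for all |Y|^{N+1} x |U|^N window realizations.
windows = list(product(product(range(nY), repeat=N + 1),
                       product(range(nU), repeat=N)))
Z = [bayes_window(pi_hat, ys, us) for ys, us in windows]

def F(z):
    """Nearest-neighbor quantizer; total variation stands in for rho_BL."""
    return int(np.argmin([0.5 * np.abs(z - zi).sum() for zi in Z]))

# Step 3: approximate cost c^N and kernel eta^N; from any z_i and u there
# are only |Y| possible transitions, one per next observation.
nZ = len(Z)
cN = np.array([[zi @ c[:, u] for u in range(nU)] for zi in Z])
etaN = np.zeros((nZ, nU, nZ))
for i, zi in enumerate(Z):
    for u in range(nU):
        pred = zi @ T[u]
        for y in range(nY):
            p_y = pred @ Q[:, y]                # P(next observation = y)
            if p_y > 0:
                etaN[i, u, F(m_update(pred, y))] += p_y

# Step 4: value iteration and the finite-window policy gamma_N^*.
V = np.zeros(nZ)
for _ in range(500):
    V = (cN + beta * etaN @ V).min(axis=1)
policy = (cN + beta * etaN @ V).argmin(axis=1)
```

Each element of `Z` is indexed by one of the $|\mathds{Y}|^{N+1}\times|\mathds{U}|^{N}$ window realizations, so `policy` is directly a map from finite windows to actions.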

As we have noted before, the complexity of POMDPs in general arises from the structure of the belief state space $\mathcal{Z}$, which is a set of probability measures on $\mathds{X}$. This set is always uncountable and needs to be equipped with suitable topologies to make the analysis feasible. Approximations for POMDPs are usually carried out by choosing a finite subset, say $\hat{\mathcal{Z}}$, of the belief state space $\mathcal{Z}$ (Smith and Simmons, 2012; Pineau et al., 2006; Saldi et al., 2017; Zhou et al., 2008, 2010; Subramanian and Mahajan, 2019; Zhang et al., 2014), and finding an approximate MDP model on this finite set. To choose the finite set, the aforementioned works use a uniform quantization scheme, in various topologies on $\mathcal{Z}$. In other words, the quantization is made such that for any $z\in\mathcal{Z}$, there exists an element $\hat{z}\in\hat{\mathcal{Z}}$ with $\|z-\hat{z}\|\leq\epsilon$ for a fixed $\epsilon>0$. The metric used to measure distances between belief states varies across these works, although for finite $\mathds{X}$ the $L_{1}$ distance of distributions is generally used for the quantization of $\mathcal{Z}$; this coincides with the total variation and weak convergence topologies on $\mathcal{Z}$ when $\mathds{X}$ is finite. For general $\mathds{X}$, a more appropriate and natural topology is the weak convergence topology on $\mathcal{Z}$, which is what we work with in this paper since $\rho_{BL}$ metrizes weak convergence.

In this paper, instead of quantizing $\mathcal{Z}$ directly and uniformly, we use finite window information variables (the $I_{t}^{N}$'s) to construct the finite subset of $\mathcal{Z}$, since our goal is to analyze the effect of the window size on the approximation performance. That is, we use the finite set

$$\mathcal{Z}_{\hat{\pi}}^{N}:=\left\{P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]}):Y_{[0,N]}\in\mathds{Y}^{N+1},U_{[0,N-1]}\in\mathds{U}^{N}\right\}$$

constructed using $Y_{[0,N]},U_{[0,N-1]}$. For this set, we cannot afford a uniform discretization scheme. A uniform quantization would mean that for a fixed $\epsilon>0$,

$$\rho_{BL}\left(P^{\hat{\pi}}(X_{N}\in\cdot|y_{[0,N]},u_{[0,N-1]}),P^{\pi}(X_{N}\in\cdot|y_{[0,N]},u_{[0,N-1]})\right)<\epsilon$$

uniformly over all $\pi\in\mathcal{Z}$ and all $y_{[0,N]}\in\mathds{Y}^{N+1},u_{[0,N-1]}\in\mathds{U}^{N}$. However, this is in general unattainable for filter stability problems, as it would require the processes with different starting points to converge uniformly for all realizations of the information variables (even for highly unlikely ones). This is why we follow a different approach and show that we do not have to force a uniform quantization: the error of the approximate value functions can be related to an expected error of the form

$$E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]}),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]})\right)\right],$$

which in turn can be bounded using Theorem 5. Our technical analysis, accordingly, is slightly more tedious, with the benefit of arriving at a practical and intuitive finite-memory method whose near optimality is rigorously established.

Remark 8

The finite subset $\mathcal{Z}_{\hat{\pi}}^{N}$ is crucial for the approximation of the belief MDP we construct in this section. This set depends on the choice of $\hat{\pi}$ as the starting probability distribution and on the finite window information variables $Y_{[0,N]},U_{[0,N-1]}$. Kara and Yüksel (2021) consider a similar approach for the construction of an approximate POMDP model using the same finite set $\mathcal{Z}_{\hat{\pi}}^{N}$. However, instead of using a nearest neighbor map for the correspondence between the original belief space and the finite set $\mathcal{Z}_{\hat{\pi}}^{N}$, there, the states with matching finite histories are used for the correspondence, by placing a different topology on the belief space and on $\mathcal{Z}_{\hat{\pi}}^{N}$. The method used in this paper naturally results in a smaller approximation error due to the nature of the nearest neighbor map, and lets us work with the weak convergence topology and the $\rho_{BL}$ metric. On the one hand, Kara and Yüksel (2021) arrive at a larger error term, stated in terms of the total variation distance, which always upper bounds the $\rho_{BL}$ metric; on the other hand, the method used there is computationally more efficient, and the approximate model can actually be learned with a reinforcement learning algorithm that uses finite window information variables.

On the choice of $\hat{\pi}$, we note that Theorem 12 provides a bound that is independent of the choice of $\hat{\pi}$. However, since this is only an upper bound, the true value of the error may well depend on the particular choice of $\hat{\pi}$. For the approximate model, we use $\hat{\pi}$ as an estimator for the true predictor $\pi_{t_{-}}^{\mu,\gamma}(\cdot)=P^{\mu,\gamma}(X_{t}\in\cdot|Y_{[0,t-1]})$ at any given time $t$ under some policy $\gamma$. Since $\hat{\pi}$ is a fixed probability distribution and does not vary with time, a reasonable choice would be the time average of $\pi_{t_{-}}^{\mu,\gamma^{*}}(\cdot)$ under an optimal policy $\gamma^{*}$; if the hidden state process $X_{t}$ admits an invariant measure under this optimal policy, the time averages of $\pi_{t_{-}}^{\mu,\gamma^{*}}(\cdot)$ will converge to that invariant measure. However, one issue with this approach is that the designer does not have access to the optimal policy and thus cannot compute the invariant measure under $\gamma^{*}$. The designer can instead use the approximate optimal policy $\gamma^{N}$, but the choice of $\hat{\pi}$ also affects the approximate policy $\gamma^{N}$, and hence this approach may not be feasible for practical purposes. In any case, our result provides a bound that is uniform over the prior selections.
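As an illustration of this choice, the invariant measure of the hidden state under a fixed stationary policy can be computed by iterating the closed-loop kernel. In the sketch below, the kernels and the policy `gamma` are hypothetical numbers chosen only for the example:

```python
import numpy as np

# Candidate pi_hat: invariant measure of X_t under a fixed stationary
# policy gamma (all kernels and the policy are illustrative assumptions).
T = [np.array([[0.8, 0.2], [0.3, 0.7]]),
     np.array([[0.5, 0.5], [0.1, 0.9]])]          # T[u][x, x']
gamma = [0, 1]                                     # hypothetical map x -> u
P = np.array([T[gamma[x]][x] for x in range(2)])   # closed-loop kernel

pi = np.array([0.5, 0.5])
for _ in range(1000):                              # power iteration
    pi = pi @ P
pi /= pi.sum()                                     # guard against drift
assert np.allclose(pi, pi @ P)                     # pi is invariant for P
```

In practice the optimal policy is unknown, so one would run this with an approximate policy, accepting the circularity discussed above.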

We now discuss the role of the realizations of $Y_{[0,N]},U_{[0,N-1]}$. Notice that, to construct $\mathcal{Z}_{\hat{\pi}}^{N}$, we consider all possible $|\mathds{Y}|^{N+1}\times|\mathds{U}|^{N}$ realizations, which reduces the uncountable state space to a finite one, but the size of this finite subset grows quickly with the window size $N$ and with the sizes of the spaces $\mathds{Y}$ and $\mathds{U}$. The bound we provide in Theorem 9 reveals that the approximation error is related to the expectation over the realizations of $Y_{[0,N]},U_{[0,N-1]}$ under the true dynamics, which suggests that if a realization $y_{[0,N]},u_{[0,N-1]}$ is highly unlikely under the true dynamics, it can be discarded when constructing $\mathcal{Z}_{\hat{\pi}}^{N}$, reducing the size of $\mathcal{Z}_{\hat{\pi}}^{N}$ for computational ease.

In fact, this question motivates the following direction. A close look at Definition 3 reveals that the Dobrushin coefficient of a channel is a lower bound for that of any channel obtained by quantizing its measurements. Therefore, one can always quantize the measurement channel further if the original channel satisfies the contraction condition presented for filter stability, with the quantized channel also satisfying the contraction property. This presents a recipe for dealing with both low probability channel outcomes and even continuous space measurement channels; however, one would additionally need to show that the value function of the approximate model with quantized measurements is close to the value function of the original model. This is a possible future direction, using the weak Feller property of the belief-MDP (as shown e.g. by Saldi et al., 2017) and continuity properties of filter updates in measurement realizations (McDonald and Yüksel, 2022).

4 Approximation Error Analysis and Rates of Convergence

An optimal policy for the constructed finite model, $\gamma_{N}^{*}$, can be extended to ${\cal P}(\mathds{X})$ and used for the original MDP.

4.1 Analysis via Expected Filter Approximation Error

The next result, which is related to the construction given by Saldi et al. (2018, Theorem 4.38) (see also Saldi et al., 2017), provides a mismatch error for using this policy. This result will be the key supporting tool for the main theorem of the paper, which is presented immediately after. The proof requires multiple technical lemmas and is presented in Section 6.2 (with some supporting but tedious technical steps moved to the Appendix).

Assumption 3
  • $\rho_{BL}(\eta(\cdot|z,u),\eta(\cdot|z^{\prime},u))\leq\alpha_{\mathcal{Z}}\rho_{BL}(z,z^{\prime})$ (see Theorem 7).

  • $|\tilde{c}(z,u)-\tilde{c}(z^{\prime},u)|\leq\alpha_{\tilde{c}}\rho_{BL}(z,z^{\prime})$ for some $\alpha_{\tilde{c}}<\infty$, for all $u\in\mathds{U}$.

Theorem 9

Under Assumption 3, for all $z\in{\cal P}(\mathds{X})$ of the form

$$z=P^{\pi}(X_{N}\in\cdot|y_{[0,N]},u_{[0,N-1]}),$$

the following holds if $\beta<\frac{1}{4\alpha_{\mathcal{Z}}+1}$:

$$\sup_{\gamma\in\Gamma}E\left[J_{\beta}(\eta,\gamma_{N}^{*},z)-J_{\beta}(\eta,\gamma^{*},z)\,\middle|\,Y_{[0,N]},\gamma(Y_{[0,N-1]})\right]\leq K\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]}),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]})\right)\right],$$

where $K$ is a constant that depends on $\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}}$ and $\|\tilde{c}\|_{\infty}$ (the exact expression for the constant can be found in the proof).

Remark 10

We note that the upper bound for $\beta$ can be chosen as $\frac{1}{(2+\|L\|_{\infty})\alpha_{\mathcal{Z}}+1}$ instead of $\frac{1}{4\alpha_{\mathcal{Z}}+1}$ (see Remark 21 in the Appendix), where

$$\|L\|_{\infty}=\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}\sup_{Y_{[0,N]},U_{[0,N-1]}}\rho_{BL}\left(P^{\pi}(\cdot|Y_{[0,N]},U_{[0,N-1]}),P^{\hat{\pi}}(\cdot|Y_{[0,N]},U_{[0,N-1]})\right).$$

Hence, under a uniform filter stability bound, the upper bound can be chosen near $\frac{1}{2\alpha_{\mathcal{Z}}+1}$.

The proof of this theorem is rather long, and accordingly, it is presented in Section 6.2.

Before presenting the main result of the paper, we provide further supporting results that will let us work with a probability density function.

Lemma 11

Assume that the transition kernel $\mathcal{T}(dx_{1}|x_{0},u_{0})$ admits a density function $f$ with respect to a reference measure $\phi$ such that $\mathcal{T}(dx_{1}|x_{0},u_{0})=f(x_{1},x_{0},u_{0})\phi(dx_{1})$. If $|f(x_{1},x_{0},u_{0})-f(x_{1},x_{0}^{\prime},u_{0})|\leq\alpha_{\mathds{X}}|x_{0}-x_{0}^{\prime}|$ for all $x_{1},x_{0},x_{0}^{\prime}\in\mathds{X}$ and $u_{0}\in\mathds{U}$, then

$$\|\mathcal{T}(\cdot|x_{0},u_{0})-\mathcal{T}(\cdot|x_{0}^{\prime},u_{0})\|_{TV}\leq\alpha_{\mathds{X}}|x_{0}-x_{0}^{\prime}|.$$

We note that using this result, assumptions of Theorem 7 can be expressed with the Lipschitz condition on the density function noted above. We now restate the assumptions that will be used for the main result.

Assumption 4
  • 1.

    The transition kernel $\mathcal{T}(dx_{1}|x_{0},u_{0})$ admits a density function $f$ with respect to a reference measure $\phi$ such that $\mathcal{T}(dx_{1}|x_{0},u_{0})=f(x_{1},x_{0},u_{0})\phi(dx_{1})$.

  • 2.

    There exists some $\alpha_{\mathds{X}}<\infty$ such that $|f(x_{1},x_{0},u_{0})-f(x_{1},x_{0}^{\prime},u_{0})|\leq\alpha_{\mathds{X}}|x_{0}-x_{0}^{\prime}|$.

  • 3.

    There exists some $\alpha_{c}<\infty$ such that for all $u\in\mathds{U}$, $|c(x,u)-c(x^{\prime},u)|\leq\alpha_{c}|x-x^{\prime}|$.

  • 4.

    $\alpha:=(1-\tilde{\delta}(\mathcal{T}))(2-\delta(Q))<1$.

  • 5.

    The transition kernel $\mathcal{T}$ is dominated, i.e. there exists a dominating measure $\hat{\pi}\in{\mathcal{P}}(\mathds{X})$ such that for every $x\in\mathds{X}$ and $u\in\mathds{U}$, $\mathcal{T}(\cdot|x,u)\ll\hat{\pi}(\cdot)$, that is, $\mathcal{T}(\cdot|x,u)$ is absolutely continuous with respect to $\hat{\pi}$.

Before the result, we discuss the Lipschitz constants of interest and their relation to each other. First, note that by Lemma 11, the second item of Assumption 4 implies that $\|\mathcal{T}(\cdot|x_{0},u_{0})-\mathcal{T}(\cdot|x_{0}^{\prime},u_{0})\|_{TV}\leq\alpha_{\mathds{X}}|x_{0}-x_{0}^{\prime}|$. Hence, by Theorem 7, we can obtain various bounds on the constant $\alpha_{\mathcal{Z}}$ for which $\rho_{BL}(\eta(\cdot|z,u),\eta(\cdot|z^{\prime},u))\leq\alpha_{\mathcal{Z}}\rho_{BL}(z,z^{\prime})$. In particular, we have $\alpha_{\mathcal{Z}}\leq(3-2\delta(Q))(1+\alpha_{\mathds{X}})$.

Theorem 12

Assume that we let the system run for $N$ time steps under a known policy $\gamma$ (which may not be optimal), so that the finite window policy starts acting after observing $N$ information variables at time $t=N$. Under Assumption 4, if $\beta<\frac{1}{4\alpha_{\mathcal{Z}}+1}$, we have

$$E_{\mu}^{\gamma}\left[J_{\beta}(\pi_{N_{-}},\mathcal{T},\gamma^{*}_{N})-J_{\beta}(\pi_{N_{-}},\mathcal{T},\gamma^{*})\,\middle|\,Y_{[0,N]},U_{[0,N-1]}\right]\leq K(\beta,\alpha_{\mathds{X}},\alpha_{c},\|c\|_{\infty})\,\alpha^{N},$$

where $\gamma_{N}^{*}$ is the optimal finite window policy and $K(\beta,\alpha_{\mathds{X}},\alpha_{c},\|c\|_{\infty})$ is a constant that depends on $\beta,\alpha_{\mathds{X}},\alpha_{c}$ and $\|c\|_{\infty}$ (the exact formula for the constant can be found in the appendix, in (28)).

In the statement, $J_{\beta}(\pi_{N_{-}},\mathcal{T},\gamma^{*}_{N})$ (respectively $J_{\beta}(\pi_{N_{-}},\mathcal{T},\gamma^{*})$) denotes the cost under the policy $\gamma^{*}_{N}$ (respectively $\gamma^{*}$) when the initial prior distribution is $\pi_{N_{-}}$, where $\pi_{N_{-}}=Pr(X_{N}\in\cdot|Y_{[0,N-1]},U_{[0,N-1]})$, and the expectation is with respect to the realizations of $Y_{[0,N-1]},U_{[0,N-1]}$.

Remark 13

We note that Theorem 12 applies to all finite state, measurement, and action models as long as $\alpha=(1-\tilde{\delta}(\mathcal{T}))(2-\delta(Q))<1$ and $\beta$ satisfies the condition noted. Example 1 shows how to calculate the Dobrushin coefficient of transition matrices in finite setups. All other conditions hold automatically, since every probability measure on a finite or countable set is majorized by a probability measure that places positive mass on every point. Notice that the third condition only requires that the difference in the cost be bounded for every fixed control action. Conditions 1 and 5 of Assumption 4 coincide in the finite case.
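Such a finite-setup check can be sketched directly. The kernels below are made-up, and reading $\tilde{\delta}(\mathcal{T})$ as the worst-case row overlap over all state-action pairs is an assumption of this sketch (one plausible interpretation of the controlled-kernel coefficient):

```python
import numpy as np
from itertools import product

def dobrushin(K):
    """Dobrushin coefficient of a row-stochastic matrix: the minimum overlap
    sum_j min(K[i, j], K[k, j]) over all row pairs (i, k)."""
    n = K.shape[0]
    return min(np.minimum(K[i], K[k]).sum()
               for i, k in product(range(n), repeat=2))

# Illustrative kernels (assumptions, not from the paper).
Q = np.array([[0.9, 0.1], [0.2, 0.8]])
T = [np.array([[0.60, 0.40], [0.45, 0.55]]),
     np.array([[0.55, 0.45], [0.50, 0.50]])]

delta_Q = dobrushin(Q)                 # approx 0.3
# Worst-case reading of delta_tilde(T): stack the rows of all action
# matrices and take the minimum pairwise overlap.
delta_T = dobrushin(np.vstack(T))      # approx 0.85
alpha = (1 - delta_T) * (2 - delta_Q)  # approx 0.255 < 1: condition holds
```

With these numbers $\alpha<1$, so the exponential rate of Theorem 12 is in force for this toy model.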

Remark 14

We note that the error bound of the result is independent of the chosen $\hat{\pi}$. As we will see in the following proof, the first upper bound is a result of Theorem 9, which does depend on the $\hat{\pi}$ chosen by the user. However, thanks to Theorem 5, we can get a further upper bound on the error which is uniform over any $\hat{\pi}$, as long as $\hat{\pi}$ is a dominating measure.

One can interpret the absolute continuity assumption $\pi\ll\hat{\pi}$ as follows: assume that the starting distribution of the process is $\pi$, but we start the update from the fixed prior $\hat{\pi}$. The information $y_{[0,t]},u_{[0,t-1]}$ can eventually correct the starting error, uniformly over all such $\hat{\pi}$, as long as the fixed starting distribution $\hat{\pi}$ assigns positive measure to every event to which the true starting distribution $\pi$ assigns positive measure. If this is not the case, that is, if the incorrect starting distribution $\hat{\pi}$ assigns zero measure to some event to which $\pi$ assigns positive measure, the information variables are not sufficient to correct the starting error arising from that zero measure event. Of course, such a choice would not be sensible, as the prior would not be compatible with the measured data. In any case, in our setup the fixed prior $\hat{\pi}$ serves as an approximation, and it can be made to satisfy the absolute continuity condition by design.

Proof  When we reduce a partially observed MDP to a fully observed process, the initial state of the belief process becomes the Bayesian update of the prior distribution of the state process of the POMDP. Hence, we can write

$$E_{\mu}^{\gamma}\left[J_{\beta}(\pi_{N_{-}},\mathcal{T},\gamma^{*}_{N})-J_{\beta}(\pi_{N_{-}},\mathcal{T},\gamma^{*})\,\middle|\,Y_{[0,N]},U_{[0,N-1]}\right]=E_{\mu}^{\gamma}\left[J_{\beta}(\pi_{N},\eta,\gamma_{N}^{*})-J_{\beta}(\pi_{N},\eta,\gamma^{*})\,\middle|\,Y_{[0,N]},U_{[0,N-1]}\right].$$

Notice that, using Theorem 7, Condition 2 of Assumption 4 implies that

$$\rho_{BL}\left(\eta(\cdot|z,u),\eta(\cdot|z^{\prime},u)\right)\leq 3(1+\alpha_{\mathds{X}})\rho_{BL}(z,z^{\prime}),$$

and we also have that

$$|\tilde{c}(z,u)-\tilde{c}(z^{\prime},u)|=\left|\int c(x,u)z(dx)-\int c(x,u)z^{\prime}(dx)\right|\leq(\alpha_{c}+\|c\|_{\infty})\rho_{BL}(z,z^{\prime}).$$

Thus, we can use Theorems 7 and 9 to write

$$E_{\mu}^{\gamma}\left[J_{\beta}(\pi_{N},\eta,\gamma_{N}^{*})-J_{\beta}(\pi_{N},\eta,\gamma^{*})\,\middle|\,Y_{[0,N]},U_{[0,N-1]}\right]\leq K(\beta,\alpha_{\mathds{X}},\alpha_{c},\|c\|_{\infty})\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]}),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]})\right)\right]. \qquad (9)$$

We set $\hat{\pi}$ in Assumption 4 as our representative probability measure for the quantization of the belief space. Notice that, by our choice, $\hat{\pi}$ is a dominating measure, and therefore $\pi_{t_{-}}\ll\hat{\pi}$ for any time step $t$, where $\pi_{t_{-}}$ is the predictor at time $t$. Thus, Theorem 5 yields that under Assumption 4,

$$E_{\pi}^{\gamma}\left[\|P^{\pi}(\cdot|Y_{[0,N]})-P^{\hat{\pi}}(\cdot|Y_{[0,N]})\|_{TV}\right]\leq\alpha^{N}$$

for any predictor $\pi$. We also have by definition that

$$\rho_{BL}\left(P^{\pi,\gamma}(\cdot|Y_{[0,N]}),P^{\hat{\pi},\gamma}(\cdot|Y_{[0,N]})\right)\leq\|P^{\pi,\gamma}(\cdot|Y_{[0,N]})-P^{\hat{\pi},\gamma}(\cdot|Y_{[0,N]})\|_{TV}.$$

Hence the result follows.  
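The exponential forgetting that drives this bound can also be observed empirically: run two filters started from different priors on the same observation record and watch the total variation gap contract with the window length. The sketch below uses a toy uncontrolled two-state model with assumed kernels:

```python
import numpy as np

# Toy uncontrolled model (all numbers are illustrative assumptions).
T = np.array([[0.60, 0.40], [0.45, 0.55]])   # strongly mixing kernel
Q = np.array([[0.9, 0.1], [0.2, 0.8]])
rng = np.random.default_rng(1)

def step(pi, y):
    """One filter step: prediction through T, then Bayes update with y."""
    post = (pi @ T) * Q[:, y]
    return post / post.sum()

# Mismatched priors pi and pi_hat, driven by the SAME observation record.
pi, pi_hat = np.array([0.99, 0.01]), np.array([0.5, 0.5])
x = 0
gaps = []
for _ in range(30):
    x = rng.choice(2, p=T[x])                # hidden state transition
    y = rng.choice(2, p=Q[x])                # shared observation
    pi, pi_hat = step(pi, y), step(pi_hat, y)
    gaps.append(np.abs(pi - pi_hat).sum())   # (unnormalized) TV gap

# The gap shrinks geometrically as the window length grows.
```

The decay rate seen in `gaps` is the empirical counterpart of the $\alpha^{N}$ bound above.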

Remark 15

By studying (9) and utilizing Theorem 7 (i) and (iii), we can arrive at complementary results when we only have filter stability but not exponential filter stability.

We note that the initial $N$ time steps of the control problem must be treated separately, as our approximation technique uses information variables of window size $N$; for the initial $N$ steps, there are not enough observation or control action variables to make use of the approximate finite window policies. In the following result, for the first $N$ time steps, we use a control policy that is found with a policy iteration type argument, where the terminal cost is estimated by $\beta^{N}E[J_{\beta}(\gamma^{*}_{N})]$, with $\gamma_{N}^{*}$ being the approximate finite window policy.

The following result is stated for the best possible policy in the set $\Gamma^{N}$ (see Equation 1.1), since $\inf_{\gamma^{N}\in\Gamma^{N}}J_{\beta}(\mu,\mathcal{T},\gamma^{N})=J_{\beta}^{N}(\mu,\mathcal{T})$. However, in the proof, instead of working with the policy achieving this infimum, we construct a (possibly sub-optimal) policy which achieves the stated upper bound; this upper bound then naturally applies to the best possible finite window policy in $\Gamma^{N}$.

Theorem 16

Under Assumption 4, if $\beta<\frac{1}{4\alpha_{\mathcal{Z}}+1}$, we have

$$J_{\beta}^{N}(\mu,\mathcal{T})-J_{\beta}^{*}(\mu,\mathcal{T})\leq K(\beta,\alpha_{\mathds{X}},\alpha_{c},\|c\|_{\infty})\,\alpha^{N}\beta^{N},$$

where $K(\beta,\alpha_{\mathds{X}},\alpha_{c},\|c\|_{\infty})$ is a constant that depends on $\beta,\alpha_{\mathds{X}},\alpha_{c}$ and $\|c\|_{\infty}$ (the exact formula for the constant can be found in the appendix).

Proof  Recall that the optimal policy for the finite model constructed in Section 3 is denoted by $\gamma_{N}^{*}$. Notice that $\gamma_{N}^{*}$ is a stationary policy and is optimal for any initial point, since it solves the discounted cost optimality equation.

Now, we construct the following policy: use the policy $\gamma_{N}^{*}$ unaltered after time $N$, but modify the first $N$ time-stage policies, as a batch update, which can be solved via a finite dynamic programming algorithm:

$$\tilde{\gamma}_{0},\dots,\tilde{\gamma}_{N-1}=\mathop{\rm arg\,min}_{\gamma_{0},\dots,\gamma_{N-1}}E\left[\sum_{k=0}^{N-1}\beta^{k}c(x_{k},u_{k})\right]+\beta^{N}E[J_{\beta}(\gamma^{*}_{N})|I_{N}],$$

where $I_{N}$ is the history by time $N$. We denote this policy by $\gamma^{++}:=\{\tilde{\gamma}_{0},\dots,\tilde{\gamma}_{N-1},\gamma^{*}_{N},\gamma^{*}_{N},\dots\}$.

We denote the true optimal policy for the original model by $\gamma^{*}$. We now define the policy $\gamma^{+}:=\{\gamma^{*}_{[0,N-1]},\gamma^{*}_{N},\gamma^{*}_{N},\dots\}$: apply $\gamma^{*}$ until time $N$, and then use our finite window policy $\gamma^{*}_{N}$. Note that this policy is not practical, as it assumes that the controller already knows the true optimal policy $\gamma^{*}$; however, we will make use of this hypothetical policy for the analysis.

From the way $\gamma^{++}$ is constructed, we clearly have $J_{\beta}(\gamma^{++})\leq J_{\beta}(\gamma^{+})$, and thus

$$J_{\beta}(\gamma^{++})-J_{\beta}(\gamma^{*})\leq J_{\beta}(\gamma^{+})-J_{\beta}(\gamma^{*}).$$

Furthermore, because of the way we constructed them, we have $\gamma^{++}\in\Gamma^{N}$. Hence, we can write

$$J_{\beta}^{N}(\mu,\mathcal{T})-J_{\beta}^{*}(\mu,\mathcal{T})=\inf_{\gamma\in\Gamma^{N}}J_{\beta}(\mu,\mathcal{T},\gamma)-J_{\beta}(\mu,\mathcal{T},\gamma^{*})\leq J_{\beta}(\mu,\mathcal{T},\gamma^{++})-J_{\beta}(\mu,\mathcal{T},\gamma^{*})\leq J_{\beta}(\mu,\mathcal{T},\gamma^{+})-J_{\beta}(\mu,\mathcal{T},\gamma^{*}).$$

We now analyze the error term. Because $\gamma^{+}$ and $\gamma^{*}$ use the same policy before time $N$, we have

$$J_{\beta}(\mu,\mathcal{T},\gamma^{+})-J_{\beta}(\mu,\mathcal{T},\gamma^{*})=\sum_{t=N}^{\infty}\beta^{t}E_{\mu}\left[c(X_{t},\gamma_{N}^{*}(Y_{[t-N,t]},U_{[t-N,t-1]}))\right]-\sum_{t=N}^{\infty}\beta^{t}E_{\mu}\left[c(X_{t},\gamma^{*}(Y_{[0,t]},U_{[0,t-1]}))\right]$$
$$=\beta^{N}\sum_{t=N}^{\infty}\beta^{t-N}E_{\mu}^{\gamma^{*}}\left[E_{\mu}\left[c(X_{t},\gamma_{N}^{*}(Y_{[t-N,t]},U_{[t-N,t-1]}))\right]-E_{\mu}\left[c(X_{t},\gamma^{*}(Y_{[0,t]},U_{[0,t-1]}))\right]\,\middle|\,Y_{[0,N]},U_{[0,N-1]}\right]$$
$$=\beta^{N}E_{\mu}^{\gamma^{*}}\left[J_{\beta}(\pi_{N_{-}},\mathcal{T},\gamma^{*}_{N})-J_{\beta}(\pi_{N_{-}},\mathcal{T},\gamma^{*})\,\middle|\,Y_{[0,N]},U_{[0,N-1]}\right]. \qquad (10)$$

The last step follows from the observation that, conditioned on the observations and control actions $Y_{[0,N]},U_{[0,N-1]}$, the state process can be thought of as starting at time $t=N$ with prior measure $\pi_{N_{-}}$, and from the fact that the probability measure of $Y_{[0,N]},U_{[0,N-1]}$ is determined by the initial measure $\mu$ and the policy $\gamma^{*}$, since $\gamma_{t}^{N}=\gamma_{t}^{*}$ for $t\leq N$.

The result then follows from Theorem 12.  

4.2 Analysis via Uniform Bounds on Filter Approximation Error

The following result applies to an alternative setup, in which the filter stability error is given as a uniform bound on the filter approximation error under the total variation distance. We define this bound, arising from the filter stability error, as follows:

$$\bar{L}_{TV}:=\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}\sup_{y_{[0,N]},u_{[0,N-1]}}\left\|P^{\pi}(\cdot|y_{[0,N]},u_{[0,N-1]})-P^{\hat{\pi}}(\cdot|y_{[0,N]},u_{[0,N-1]})\right\|_{TV}. \qquad (11)$$
Theorem 17

If β<1α𝒵\beta<\frac{1}{\alpha_{\mathcal{Z}}}, where α𝒵\alpha_{\mathcal{Z}} can be chosen as (32δ(Q))(1δ~(𝒯))(3-2\delta(Q))(1-\tilde{\delta}(\mathcal{T})) by Theorem 7, we have that

  • (i)
    \displaystyle\sup_{z}\left|J_{\beta}^{N}(z)-J^{*}_{\beta}(z)\right|\leq\frac{(\alpha_{\mathcal{Z}}-1)\beta+1}{(1-\beta)^{2}(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}\bar{L}_{TV}.
  • (ii)
    \displaystyle\sup_{z}\left|J_{\beta}(z,\gamma_{N})-J^{*}_{\beta}(z)\right|\leq\frac{2(1+(\alpha_{\mathcal{Z}}-1)\beta)}{(1-\beta)^{3}(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}\bar{L}_{TV}.

Before the proof, we note that the bound applies for all $\beta<1$ if $(3-2\delta(Q))(1-\tilde{\delta}(\mathcal{T}))<1$. Furthermore, since $(3-2\delta(Q))(1-\tilde{\delta}(\mathcal{T}))=2\alpha-(1-\tilde{\delta}(\mathcal{T}))$, where $\alpha=(2-\delta(Q))(1-\tilde{\delta}(\mathcal{T}))$, it follows that if $\alpha<1$, the result applies for all $\beta<1/(1+\tilde{\delta}(\mathcal{T}))$ independently of $\delta(Q)$, though this is a conservative bound. If $\delta(Q)=1$, the bound applies for all $\beta<1$ as long as $\alpha<1$.
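To get a feel for how conservative the constants in Theorem 17 are, the following Python sketch evaluates the bounds of parts (i) and (ii) for illustrative, purely hypothetical values of $\delta(Q)$, $\tilde{\delta}(\mathcal{T})$, $\beta$, $\|c\|_{\infty}$, and $\bar{L}_{TV}$; it is not part of the proof.

```python
def alpha_Z(delta_Q, delta_T_tilde):
    # the choice of alpha_Z from Theorem 7: (3 - 2*delta(Q)) * (1 - delta_tilde(T))
    return (3 - 2 * delta_Q) * (1 - delta_T_tilde)

def value_bound(beta, a, c_inf, L_tv):
    # Theorem 17(i): ((a-1)*beta + 1) / ((1-beta)^2 * (1 - a*beta)) * ||c|| * L_TV
    assert 0 < beta < 1 and beta * a < 1, "Theorem 17 requires beta < 1/alpha_Z"
    return ((a - 1) * beta + 1) / ((1 - beta) ** 2 * (1 - a * beta)) * c_inf * L_tv

def robustness_bound(beta, a, c_inf, L_tv):
    # Theorem 17(ii): 2*(1 + (a-1)*beta) / ((1-beta)^3 * (1 - a*beta)) * ||c|| * L_TV
    assert 0 < beta < 1 and beta * a < 1, "Theorem 17 requires beta < 1/alpha_Z"
    return 2 * (1 + (a - 1) * beta) / ((1 - beta) ** 3 * (1 - a * beta)) * c_inf * L_tv

# hypothetical values: delta(Q) = 0.9, delta_tilde(T) = 0.4, so alpha_Z = 0.72 < 1/beta
a = alpha_Z(0.9, 0.4)
print(value_bound(0.8, a, 1.0, 0.01), robustness_bound(0.8, a, 1.0, 0.01))
```

Note that the part (ii) bound exceeds the part (i) bound by roughly a factor of $2/(1-\beta)$; in both cases it is the filter stability term $\bar{L}_{TV}$ that makes the bound small.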

Proof  (i) We start by writing the fixed point equations

\displaystyle J_{\beta}^{*}(z)=\min_{u}\left(\tilde{c}(z,u)+\beta\int J_{\beta}^{*}(z_{1})\eta(dz_{1}|z,u)\right)
\displaystyle J^{N}_{\beta}(z)=\min_{u}\left(\tilde{c}(F(z),u)+\beta\int J_{\beta}^{N}(z_{1})\eta(dz_{1}|F(z),u)\right).

Hence we can write that

\displaystyle\sup_{z}\left|J^{*}_{\beta}(z)-J_{\beta}^{N}(z)\right|\leq\sup_{u}|\tilde{c}(z,u)-\tilde{c}(F(z),u)|
\displaystyle\qquad+\beta\sup_{u}\left|\int J_{\beta}^{*}(z_{1})\eta(dz_{1}|z,u)-\int J_{\beta}^{N}(z_{1})\eta(dz_{1}|F(z),u)\right|
\displaystyle\leq\|c\|_{\infty}\|z-F(z)\|_{TV}+\beta\sup_{u}\int\left|J_{\beta}^{*}(z_{1})-J_{\beta}^{N}(z_{1})\right|\eta(dz_{1}|F(z),u)
\displaystyle\qquad+\beta\sup_{u}\left|\int J_{\beta}^{*}(z_{1})\eta(dz_{1}|z,u)-\int J_{\beta}^{*}(z_{1})\eta(dz_{1}|F(z),u)\right|
\displaystyle\leq\|c\|_{\infty}\|z-F(z)\|_{TV}+\beta\sup_{z}\left|J^{*}_{\beta}(z)-J_{\beta}^{N}(z)\right|+\beta\|J_{\beta}^{*}\|_{BL}\alpha_{\mathcal{Z}}\|z-F(z)\|_{TV}. (12)

Note that by Theorem 7, $\alpha_{\mathcal{Z}}\leq(3-2\delta(Q))(1-\tilde{\delta}(\mathcal{T}))$ when $\mathcal{Z}$ is associated with the total variation distance. We also have that $\|J_{\beta}^{*}\|_{BL}\leq\frac{2-\beta}{(1-\beta)(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}$ when $\mathcal{Z}$ is metrized by the total variation distance (see Lemma 24). Hence, noting that $\|z-F(z)\|_{TV}\leq\bar{L}_{TV}$, we can conclude that

\displaystyle\sup_{z}\left|J^{*}_{\beta}(z)-J_{\beta}^{N}(z)\right|\leq\frac{(\alpha_{\mathcal{Z}}-1)\beta+1}{(1-\beta)^{2}(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}\bar{L}_{TV}.

(ii) We start by writing

\displaystyle\sup_{z}\left|J_{\beta}(z,\gamma_{N})-J^{*}_{\beta}(z)\right|\leq\sup_{z}\left|J_{\beta}(z,\gamma_{N})-J^{N}_{\beta}(z)\right|+\sup_{z}\left|J^{N}_{\beta}(z)-J^{*}_{\beta}(z)\right|

where the second term is bounded by (i). We now focus on the first term:

\displaystyle\left|J_{\beta}(z,\gamma_{N})-J^{N}_{\beta}(z)\right|\leq\sup_{u}\left|\tilde{c}(z,u)-\tilde{c}(F(z),u)\right|
\displaystyle+\beta\left|\int J_{\beta}(z_{1},\gamma_{N})\eta(dz_{1}|z,\gamma_{N}(z))-\int J_{\beta}^{N}(z_{1})\eta(dz_{1}|F(z),\gamma_{N}(z))\right|
\displaystyle\leq\|c\|_{\infty}\left\|z-F(z)\right\|_{TV}+\beta\int\left|J_{\beta}(z_{1},\gamma_{N})-J_{\beta}^{N}(z_{1})\right|\eta(dz_{1}|z,\gamma_{N}(z))
\displaystyle\qquad+\beta\int\left|J_{\beta}^{N}(z_{1})-J_{\beta}^{*}(z_{1})\right|\eta(dz_{1}|z,\gamma_{N}(z))
\displaystyle\qquad+\beta\left|\int J_{\beta}^{*}(z_{1})\eta(dz_{1}|z,\gamma_{N}(z))-\int J_{\beta}^{*}(z_{1})\eta(dz_{1}|F(z),\gamma_{N}(z))\right|
\displaystyle\qquad+\beta\int\left|J_{\beta}^{*}(z_{1})-J_{\beta}^{N}(z_{1})\right|\eta(dz_{1}|F(z),\gamma_{N}(z))
\displaystyle\leq\|c\|_{\infty}\left\|z-F(z)\right\|_{TV}+\beta\sup_{z}\left|J_{\beta}(z,\gamma_{N})-J^{N}_{\beta}(z)\right|
\displaystyle+2\beta\sup_{z}\left|J^{N}_{\beta}(z)-J^{*}_{\beta}(z)\right|+\beta\|J_{\beta}^{*}\|_{BL}\alpha_{\mathcal{Z}}\left\|z-F(z)\right\|_{TV}.

Thus, using (i) and $\|J_{\beta}^{*}\|_{BL}\leq\frac{2-\beta}{(1-\beta)(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}$, we can write

\displaystyle\sup_{z}\left|J_{\beta}(z,\gamma_{N})-J^{N}_{\beta}(z)\right|\leq\frac{(1+\beta)((\alpha_{\mathcal{Z}}-1)\beta+1)}{(1-\beta)^{3}(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}\bar{L}_{TV}.

We then conclude that

\displaystyle\sup_{z}\left|J_{\beta}(z,\gamma_{N})-J^{*}_{\beta}(z)\right|\leq\sup_{z}\left|J_{\beta}(z,\gamma_{N})-J^{N}_{\beta}(z)\right|+\sup_{z}\left|J^{N}_{\beta}(z)-J^{*}_{\beta}(z)\right|
\displaystyle\leq\frac{(\alpha_{\mathcal{Z}}-1)\beta+1}{(1-\beta)^{2}(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}\bar{L}_{TV}+\frac{(1+\beta)((\alpha_{\mathcal{Z}}-1)\beta+1)}{(1-\beta)^{3}(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}\bar{L}_{TV}
\displaystyle=\frac{2(1+(\alpha_{\mathcal{Z}}-1)\beta)}{(1-\beta)^{3}(1-\alpha_{\mathcal{Z}}\beta)}\|c\|_{\infty}\bar{L}_{TV}.

 

4.3 A Discussion on the Controlled Filter Stability Problem

The above results suggest that the loss occurring from applying a finite window policy is mainly controlled by the term

\displaystyle\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(\cdot|Y_{[0,N]},U_{[0,N-1]}),P^{\hat{\pi}}(\cdot|Y_{[0,N]},U_{[0,N-1]})\right)\right], (13)

or its variations where the $BL$ metric is replaced with total variation, or the expectation is replaced with a uniform bound (as in (11)). All these terms describe how fast two different belief-state processes forget their initial priors when fed with the same observations and control actions under the true distribution. Thus, any bound on this term directly applies to the main results we presented for the loss caused by a finite window policy. This term is precisely the subject of the filter stability problem, and our approximation results point out the close relation between filter stability and the performance of finite window policies. In a way, the main message of this paper is to explicitly relate finite window approximations for POMDPs to the filter stability problem.

To bound the term (13), we use Theorem 5 to achieve an exponential convergence rate in the window size for a controlled setup. However, we note that Theorem 5 provides only a sufficient condition for bounding the filter stability term geometrically fast in the total variation distance. This condition may be stronger than necessary if one is only interested in making the $\rho_{BL}$ distance in (13) smaller with increasing window size. In fact, as we will see in Section 5, even when the assumptions of Theorem 5 are not satisfied, the filter stability term still converges to 0. In the literature, there are various sets of assumptions under which filter stability holds. The two main approaches have been:

  • The transition kernel is in some sense sufficiently ergodic, forgetting the initial measure and therefore passing this insensitivity (to incorrect initializations) on to the filter process. This condition is often tailored towards control-free models.

  • The measurement channel provides sufficient information about the underlying state, allowing the filter to track the true state process. This approach is typically based on martingale methods and accordingly does not often lead to rates of convergence for the filter stability problem, but only asymptotic filter stability.

The result we use in this paper (Theorem 5) provides exponential filter stability, using a joint contraction property of the Bayesian filter update and measurement update steps through the Dobrushin coefficient. When these requirements are not satisfied, filter stability can be checked via different assumptions from the literature. However, we also note that, for the controlled setup, filter stability results are limited compared to the control-free setup. A comprehensive review of filter stability in the control-free case is available in Chigansky et al. (2009). In the controlled case, recent studies include McDonald and Yüksel (2019, 2022), where martingale methods are used to establish controlled filter stability, which in turn can lead to weaker conditions (though without rates of convergence) for near optimality of finite window policies.

With regard to Theorem 5, the following example studies an additive Gaussian model (not necessarily linear) and provides different parameters for the condition $(1-\delta(\mathcal{T}))\times(2-\delta(\hat{Q}))<1$ to hold.

Example 2

Consider a system where $\mathds{X}=\mathds{Y}=\mathds{R}$ and the transition and measurement kernels are given by

\displaystyle x_{n+1}=f(x_{n},u_{n})+N(0,\sigma_{t}^{2}),\qquad y_{n}=g(x_{n})+N(0,\sigma_{q}^{2})

where the functions $f$ and $g$ are measurable and bounded such that $f(x,u)\in[-t,t]$ and $g(x)\in[-q,q]$.

Note that in this paper we assume, and present our results for, a finite $\mathds{Y}$. Therefore, to make the example compatible with our results, we discretize the observation space $\mathds{Y}$. We provide two discretization schemes, one with $\hat{\mathds{Y}}_{1}=\{-q,q\}$ and the other with $\hat{\mathds{Y}}_{2}=\{-q,0,q\}$. For discretization, we use a nearest neighbor mapping.

First, we study the observation space $\hat{\mathds{Y}}_{1}=\{-q,q\}$. Using the nearest neighbor mapping, we have that

\displaystyle\hat{y}_{n}=-q\quad\text{ if }y_{n}=g(x_{n})+N(0,\sigma_{q}^{2})\leq 0,
\displaystyle\hat{y}_{n}=q\quad\text{ if }y_{n}=g(x_{n})+N(0,\sigma_{q}^{2})>0.

We can then write the following for the transition kernel $\mathcal{T}(\cdot|x,u)\in{\mathcal{P}}(\mathds{X})$:

\displaystyle\mathcal{T}(dx_{n+1}|x_{n},u_{n})\sim N(f(x_{n},u_{n}),\sigma_{t}^{2})

and for the compound channel $\hat{Q}(\cdot|x)\in{\mathcal{P}}(\hat{\mathds{Y}}_{1})$:

\displaystyle\hat{Q}(q|x_{n})=Pr(N(g(x_{n}),\sigma_{q}^{2})>0),\qquad\hat{Q}(-q|x_{n})=Pr(N(g(x_{n}),\sigma_{q}^{2})\leq 0).

For these kernels, the Dobrushin coefficients can be calculated as

\displaystyle\delta(\mathcal{T})\geq 2Pr(N(t,\sigma_{t}^{2})\leq 0),\qquad\delta(\hat{Q})=2Pr(N(q,\sigma_{q}^{2})\leq 0).

Notice that these probabilities are fully determined by the ratios of the standard deviation to the mean of the Gaussians in question, $\sigma_{t}/t$ and $\sigma_{q}/q$. The higher the ratio, the higher the Dobrushin coefficient. Below, for each ratio of the transition kernel, we list the lowest possible ratio of the measurement kernel such that $(1-\delta(\mathcal{T}))\times(2-\delta(\hat{Q}))<1$. If the ratio $\sigma_{q}/q$ is higher than the stated value, we get exponential stability for the given transition kernel. Note that if $\sigma_{t}/t>1.5$, then $\delta(\mathcal{T})>0.5$, which makes $(1-\delta(\mathcal{T}))\times(2-\delta(\hat{Q}))<1$ regardless of the channel; we use 'any' in the table to indicate that any channel would lead to exponential filter stability.

$\frac{\sigma_{t}}{t}$:  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
$\frac{\sigma_{q}}{q}$:  any  0.6  0.8  1.01  1.3  1.65  2.13  3.25  5.5  8.0  20.0  70.0  1000.0
$\delta(\mathcal{T})$:  0.50  0.48  0.44  0.40  0.36  0.32  0.27  0.21  0.15  0.10  0.05  0.01  0.00
$\delta(\hat{Q})$:  any  0.1  0.21  0.32  0.44  0.54  0.64  0.76  0.86  0.90  0.96  0.99  1.00
Table 1: Approximate minimum ratio of $\frac{\sigma_{q}}{q}$ for exponential filter stability for $\hat{\mathds{Y}}_{1}=\{-q,q\}$
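The table entries above can be checked numerically. The following Python sketch (standard library only; the particular ratios passed in are illustrative) computes the lower bound on $\delta(\mathcal{T})$ and the value of $\delta(\hat{Q})$ for $\hat{\mathds{Y}}_{1}=\{-q,q\}$ from the two ratios, and tests the condition $(1-\delta(\mathcal{T}))(2-\delta(\hat{Q}))<1$.

```python
from math import erf, sqrt

def norm_cdf(x, mean, sigma):
    # CDF of N(mean, sigma^2) evaluated at x
    return 0.5 * (1 + erf((x - mean) / (sigma * sqrt(2))))

def delta_T_lower(sigma_over_t):
    # delta(T) >= 2 * Pr(N(t, sigma_t^2) <= 0); depends only on sigma_t / t
    return 2 * norm_cdf(0.0, 1.0, sigma_over_t)

def delta_Q_hat(sigma_over_q):
    # delta(Q_hat) = 2 * Pr(N(q, sigma_q^2) <= 0) for Y_hat_1 = {-q, q}
    return 2 * norm_cdf(0.0, 1.0, sigma_over_q)

def stable(sigma_t_ratio, sigma_q_ratio):
    # exponential filter stability condition: (1 - delta(T)) * (2 - delta(Q_hat)) < 1
    return (1 - delta_T_lower(sigma_t_ratio)) * (2 - delta_Q_hat(sigma_q_ratio)) < 1

print(stable(1.2, 1.1), stable(1.2, 0.9))  # just inside / outside Table 1's region
```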

We now analyze the problem for the observation space $\hat{\mathds{Y}}_{2}=\{-q,0,q\}$. For this set, the nearest neighbor mapping yields that

\displaystyle\hat{y}_{n}=-q\quad\text{ if }y_{n}=g(x_{n})+N(0,\sigma_{q}^{2})\leq\frac{-q}{2},
\displaystyle\hat{y}_{n}=q\quad\text{ if }y_{n}=g(x_{n})+N(0,\sigma_{q}^{2})>\frac{q}{2},
\displaystyle\hat{y}_{n}=0\quad\text{ else}.

For the compound channel, we then have

\displaystyle\hat{Q}(q|x_{n})=Pr(N(g(x_{n}),\sigma_{q}^{2})>\frac{q}{2}),\qquad\hat{Q}(-q|x_{n})=Pr(N(g(x_{n}),\sigma_{q}^{2})\leq\frac{-q}{2}),
\displaystyle\hat{Q}(0|x_{n})=Pr(\frac{q}{2}\geq N(g(x_{n}),\sigma_{q}^{2})>\frac{-q}{2}).

For these kernels, the Dobrushin coefficients can be calculated as

\displaystyle\delta(\mathcal{T})=2Pr(N(t,\sigma_{t}^{2})\leq 0),
\displaystyle\delta(\hat{Q})=2Pr(N(q,\sigma_{q}^{2})\leq\frac{-q}{2})+Pr(\frac{-q}{2}<N(q,\sigma_{q}^{2})<\frac{q}{2}).

Below, for the observation space $\hat{\mathds{Y}}_{2}=\{-q,0,q\}$, we again list the ratio of the transition kernel and the lowest possible ratio of the measurement kernel such that $(1-\delta(\mathcal{T}))\times(2-\delta(\hat{Q}))<1$.

$\frac{\sigma_{t}}{t}$:  1.5  1.4  1.3  1.2  1.1  1.0  0.9  0.8  0.7  0.6  0.5  0.4  0.3
$\frac{\sigma_{q}}{q}$:  any  0.39  0.6  0.85  1.2  1.54  2.1  3.2  5.9  8.0  20.0  80.0  1000.0
$\delta(\mathcal{T})$:  0.50  0.48  0.44  0.40  0.36  0.32  0.27  0.21  0.15  0.10  0.05  0.01  0.00
$\delta(\hat{Q})$:  any  0.1  0.21  0.32  0.44  0.54  0.64  0.76  0.86  0.90  0.96  0.99  1.00
Table 2: Approximate minimum ratio of $\frac{\sigma_{q}}{q}$ for exponential filter stability for $\hat{\mathds{Y}}_{2}=\{-q,0,q\}$
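The ternary-scheme coefficient can be computed the same way. Here is a small sketch (standard library only; the ratio 0.85 used below is the Table 2 entry for $\sigma_{t}/t=1.2$, and the table values are approximate):

```python
from math import erf, sqrt

def norm_cdf(x, mean, sigma):
    # CDF of N(mean, sigma^2) evaluated at x
    return 0.5 * (1 + erf((x - mean) / (sigma * sqrt(2))))

def delta_Q_hat_ternary(r):
    # r = sigma_q / q; for Y_hat_2 = {-q, 0, q}:
    # delta(Q_hat) = 2*Pr(N(q, s^2) <= -q/2) + Pr(-q/2 < N(q, s^2) < q/2),
    # which depends only on the ratio r (working in units of q)
    left = norm_cdf(-0.5, 1.0, r)
    mid = norm_cdf(0.5, 1.0, r) - left
    return 2 * left + mid

# the minimum delta(Q_hat) needed for a given transition kernel is the same as in
# Table 1; only the sigma_q / q ratio achieving it changes with the finer scheme
print(delta_Q_hat_ternary(0.85))
```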

5 Numerical Study

In this section, we give an outline of the algorithm used to determine the approximate belief MDP and the finite window policy. We also present the performance of the finite window policies and the value function error for the approximate belief MDP in relation to the window size.

A summary of the algorithm is as follows:

  • We determine a $\hat{\pi}\in{\mathcal{P}}(\mathds{X})$ such that it puts positive measure over the state space $\mathds{X}$.

  • Following Section 3, we construct the finite belief space $\mathcal{Z}_{\hat{\pi}}^{N}$ by taking the Bayesian update of $\hat{\pi}$ for all possible realizations of the form $\{y_{0},\dots,y_{N},u_{0},\dots,u_{N-1}\}$. Hence we get an approximate finite belief space of size $|\mathds{Y}|^{N+1}\times|\mathds{U}|^{N}$.

  • We calculate all transitions from each $z\in\mathcal{Z}_{\hat{\pi}}^{N}$ by considering every possible observation $y$ and control action $u$; we map them to $\mathcal{Z}_{\hat{\pi}}^{N}$ using a nearest neighbor map and construct the transition kernels $\eta^{N}$ for the finite model.

  • For the finite model obtained, we calculate the value function and an optimal policy through value or policy iteration.

The example we use is a machine repair problem. In this model, we have $\mathds{X},\mathds{Y},\mathds{U}=\{0,1\}$ with

\displaystyle x_{t}=\begin{cases}1\quad\text{ machine is working at time }t\\ 0\quad\text{ machine is not working at time }t\end{cases}\qquad u_{t}=\begin{cases}1\quad\text{ machine is being repaired at time }t\\ 0\quad\text{ machine is not being repaired at time }t.\end{cases}

The probability that the repair is successful, given that the machine was not working, is given by $\kappa$:

\displaystyle Pr(x_{t+1}=1|x_{t}=0,u_{t}=1)=\kappa

The probability that the machine breaks down while in a working state is given by $\theta$:

\displaystyle Pr(x_{t+1}=0|x_{t}=1,u_{t}=0)=\theta

The probability that the channel gives an incorrect measurement is given by $\epsilon$:

\displaystyle Pr(y_{t}=1|x_{t}=0)=Pr(y_{t}=0|x_{t}=1)=\epsilon

The one-stage cost function is given by

\displaystyle c(x,u)=\begin{cases}R+E\quad&x=0,u=1\\ E\quad&x=0,u=0\\ 0\quad&x=1,u=0\\ R\quad&x=1,u=1\end{cases}

where $R$ is the cost of repair and $E$ is the cost incurred by a broken machine.
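For concreteness, the following Python sketch implements the algorithm outline above for this machine repair model. The transitions the text leaves unspecified (broken and unrepaired, working and repaired) are assumed here to keep the state unchanged, and the window construction is simplified to length-$N$ action-observation pairs; treat it as an illustration, not the exact implementation used for the figures.

```python
import itertools

# machine repair model (first parameter set used in the experiments)
kappa, theta, eps = 0.2, 0.1, 0.3   # repair success, breakdown, channel error
beta, R, E = 0.8, 5.0, 1.0          # discount factor, repair cost, breakdown cost

def predict(p, u):
    """Pr(x_{t+1} = 1) given belief p = Pr(x_t = 1) and action u (assumed dynamics)."""
    return p + (1 - p) * kappa if u == 1 else p * (1 - theta)

def likelihood(y, x):
    return 1 - eps if y == x else eps

def bayes_update(p, u, y):
    q = predict(p, u)
    num = likelihood(y, 1) * q
    return num / (num + likelihood(y, 0) * (1 - q))

def prob_y(p, u, y):
    q = predict(p, u)
    return likelihood(y, 1) * q + likelihood(y, 0) * (1 - q)

def cost(p, u):
    # expected one-stage cost under belief p: c(0,1)=R+E, c(0,0)=E, c(1,0)=0, c(1,1)=R
    return (1 - p) * (R * u + E) + p * R * u

# finite belief set: Bayesian updates of p_hat = 0.9 over all length-N windows
p_hat, N = 0.9, 3
Z = set()
for us in itertools.product((0, 1), repeat=N):
    for ys in itertools.product((0, 1), repeat=N):
        p = p_hat
        for u, y in zip(us, ys):
            p = bayes_update(p, u, y)
        Z.add(round(p, 12))
Z = sorted(Z)

def nearest(p):
    # nearest neighbor map back onto the finite belief set
    return min(Z, key=lambda z: abs(z - p))

# value iteration on the finite model
V = {z: 0.0 for z in Z}
for _ in range(300):
    V = {z: min(cost(z, u)
                + beta * sum(prob_y(z, u, y) * V[nearest(bayes_update(z, u, y))]
                             for y in (0, 1))
                for u in (0, 1))
         for z in Z}

policy = {z: min((0, 1), key=lambda u: cost(z, u)
                 + beta * sum(prob_y(z, u, y) * V[nearest(bayes_update(z, u, y))]
                              for y in (0, 1)))
          for z in Z}
```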

We study the example with discount factor $\beta=0.8$ and $R=5$, $E=1$, and present three different results by changing the other parameters. For all cases, we choose $\hat{\pi}(\cdot)=0.1\delta_{0}(\cdot)+0.9\delta_{1}(\cdot)$.

For the first case, we take $\epsilon=0.3$, $\kappa=0.2$, $\theta=0.1$. We analyze the performance for $N\in\{0,\dots,5\}$. To obtain a proper finite window policy for every $N$ we use, we let the system run for 5 time steps under the policy $\gamma_{0}(I_{t})=0$, i.e. $u_{t}=0$ for the first 5 time steps regardless of the observations. Then we start applying the policy and calculating the cost. In Figure 1, we plot the approximate value functions and the cost incurred by the finite window policies, that is, we plot the terms $E^{\gamma_{0}}[J_{\beta}^{N}(Z_{5})|Y_{[0,5]},U_{[0,4]}]$ and $E^{\gamma_{0}}[J_{\beta}(Z_{5},\gamma^{N})|Y_{[0,5]},U_{[0,4]}]$.

Figure 1: Approximate value function and the performance of finite window policy for different window sizes

Recall that our main result, Theorem 12, emphasizes the relation between the filter stability term and the 'value error' and 'robustness error' terms, both of which are guaranteed to converge to zero with increasing window length, where

\displaystyle\text{Filter stability term: }\sup_{\pi}\sup_{\gamma}E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(\cdot|Y_{[0,N]}),P^{\hat{\pi}}(\cdot|Y_{[0,N]})\right)\right]
\displaystyle\text{Value error: }E^{\gamma_{0}}\left[|J_{\beta}^{N}(Z_{5})-J_{\beta}^{*}(Z_{5})|\,\big{|}\,Y_{[0,5]},U_{[0,4]}\right]
\displaystyle\text{Robustness error: }E^{\gamma_{0}}\left[|J_{\beta}(Z_{5},\gamma^{N})-J_{\beta}^{*}(Z_{5})|\,\big{|}\,Y_{[0,5]},U_{[0,4]}\right]

To obtain the errors, we simply subtract the cost values from their minimum (attained at the largest window), which serves as an approximation of the value function. Furthermore, we scale the errors according to the filter stability term, that is, we make them start from the same value, to see the decrease rates more clearly in relation to the filter stability constant. Figure 2 shows the relation.

Figure 2: Comparison of approximation error and filter stability term for different window sizes

It can be seen from Figure 2 that the error rate stays below the filter stability convergence rate and goes to 0 as the window size increases.

For the second case, we take $\epsilon=0.01$, $\kappa=0.3$, $\theta=0.1$. For this case, we directly plot the scaled errors in relation to the filter stability term in Figure 3.

Figure 3: Comparison of approximation error and filter stability term for different window sizes for an informative channel

It can be seen that as the channel becomes more informative, the filter stability term gets smaller much faster and the recent observations carry more information. As a result, we get a faster decrease in the error with increasing window size.

One can notice that for the first two cases, the Dobrushin coefficient $\alpha$ we defined in Assumption 4 is greater than 1; however, the error terms still converge to 0 with increasing window size. This is because the filter stability term gets smaller even though the Dobrushin constant satisfies $\alpha>1$. For the last case, we select the parameters so as to make $\alpha<1$.

We take $\epsilon=0.3$, $\kappa=0.4$, $\theta=0.3$ so that $\alpha=0.7$.

Figure 4: Comparison of approximation error, filter stability term and Dobrushin term for different window sizes

In Figure 4, it can be observed that the error terms and the filter stability term converge to 0 at similar rates. $\alpha^{N}$, however, decreases at a slower rate and upper bounds the convergence rate of the error terms.

6 Proofs of Main Technical Results

In this section, we provide the technical proofs.

6.1 Proof of Theorem 7

We will build on the proof by Kara et al. (2019). Let $\mathds{X}$ be a separable metric space. Another metric that metrizes the weak topology on ${\mathcal{P}}(\mathds{X})$ is the following:

\displaystyle\rho(\mu,\nu)=\sum_{m=1}^{\infty}2^{-(m+1)}\bigg{|}\int_{\mathds{X}}f_{m}(x)\mu(dx)-\int_{\mathds{X}}f_{m}(x)\nu(dx)\bigg{|}, (14)

where $\{f_{m}\}_{m\geq 1}$ is an appropriate sequence of continuous and bounded functions such that $\|f_{m}\|_{\infty}\leq 1$ for all $m\geq 1$ (see Parthasarathy, 1967, Theorem 6.6, p. 47).

We equip ${\cal P}(\mathds{X})$ with the metric $\rho$ to define the bounded-Lipschitz norm $\|f\|_{BL}$ of any Borel measurable function $f:{\cal P}(\mathds{X})\rightarrow\mathds{R}$. With this metric, we can start the proof:

\displaystyle\sup_{\|f\|_{BL}\leq 1}\bigg{|}\int_{{\cal P}(\mathds{X})}f(z_{1})\eta(dz_{1}|z_{0}^{\prime},u)-\int_{{\cal P}(\mathds{X})}f(z_{1})\eta(dz_{1}|z_{0},u)\bigg{|}
\displaystyle=\sup_{\|f\|_{BL}\leq 1}\bigg{|}\int_{\mathds{Y}}f(z_{1}(z_{0}^{\prime},u,y_{1}))P(dy_{1}|z_{0}^{\prime},u)-\int_{\mathds{Y}}f(z_{1}(z_{0},u,y_{1}))P(dy_{1}|z_{0},u)\bigg{|}
\displaystyle\leq\sup_{\|f\|_{BL}\leq 1}\bigg{|}\int_{\mathds{Y}}f(z_{1}(z_{0}^{\prime},u,y_{1}))P(dy_{1}|z_{0}^{\prime},u)-\int_{\mathds{Y}}f(z_{1}(z_{0}^{\prime},u,y_{1}))P(dy_{1}|z_{0},u)\bigg{|}
\displaystyle\qquad+\sup_{\|f\|_{BL}\leq 1}\int_{\mathds{Y}}\big{|}f(z_{1}(z_{0}^{\prime},u,y_{1}))-f(z_{1}(z_{0},u,y_{1}))\big{|}P(dy_{1}|z_{0},u)
\displaystyle\leq\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV} (15)
\displaystyle\qquad+\sup_{\|f\|_{BL}\leq 1}\int_{\mathds{Y}}\big{|}f(z_{1}(z_{0}^{\prime},u,y_{1}))-f(z_{1}(z_{0},u,y_{1}))\big{|}P(dy_{1}|z_{0},u) (16)

where, in the last inequality, we have used $\|f\|_{\infty}\leq\|f\|_{BL}\leq 1$.

We first analyze the first term:

\displaystyle\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}=\sup_{\|f\|_{\infty}\leq 1}\left|\int f(y_{1})P(dy_{1}|z_{0}^{\prime},u)-\int f(y_{1})P(dy_{1}|z_{0},u)\right|
\displaystyle=\sup_{\|f\|_{\infty}\leq 1}\bigg{|}\int_{\mathds{X}}\int_{\mathds{X}}\int_{\mathds{Y}}f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|x_{0},u)z_{0}^{\prime}(dx_{0})
\displaystyle\phantom{xxxxxxxxxxxxx}-\int_{\mathds{X}}\int_{\mathds{X}}\int_{\mathds{Y}}f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|x_{0},u)z_{0}(dx_{0})\bigg{|}.

For all $x_{0}^{\prime},x_{0}$ and for any $\|f\|_{\infty}\leq 1$, we have

\displaystyle\left|\int_{\mathds{X}}\int_{\mathds{Y}}f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|x_{0}^{\prime},u)-\int_{\mathds{X}}\int_{\mathds{Y}}f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|x_{0},u)\right|
\displaystyle\leq\|\mathcal{T}(\cdot|x_{0}^{\prime},u)-\mathcal{T}(\cdot|x_{0},u)\|_{TV}\leq\alpha_{\mathds{X}}|x_{0}^{\prime}-x_{0}|.

Then, we can write

\displaystyle\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}\leq(1+\alpha_{\mathds{X}})\rho_{BL}(z_{0}^{\prime},z_{0}).

Recall the metric introduced in (14); using this metric, we now focus on the term (16):

\displaystyle\sup_{\|f\|_{BL}\leq 1}\int_{\mathds{Y}}\big{|}f(z_{1}(z_{0}^{\prime},u,y_{1}))-f(z_{1}(z_{0},u,y_{1}))\big{|}P(dy_{1}|z_{0},u)
\displaystyle\leq\int_{\mathds{Y}}\rho(z_{1}(z_{0}^{\prime},u,y_{1}),z_{1}(z_{0},u,y_{1}))P(dy_{1}|z_{0},u)
\displaystyle=\int_{\mathds{Y}}\sum_{m=1}^{\infty}2^{-(m+1)}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})
\displaystyle\phantom{xxxxxxxxxxxxxxxxx}-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u)
\displaystyle=\sum_{m=1}^{\infty}2^{-(m+1)}\int_{\mathds{Y}}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})
\displaystyle\phantom{xxxxxxxxxxxxxxxxx}-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u),

where we have used Fubini's theorem with the fact that $\sup_{m}\|f_{m}\|_{\infty}\leq 1$. For each $m$, let us define

\displaystyle I_{+}:=\biggl{\{}y_{1}\in\mathds{Y}:\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})>\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\biggr{\}}
\displaystyle I_{-}:=\biggl{\{}y_{1}\in\mathds{Y}:\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})\leq\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\biggr{\}}. (17)

Then, we can write

\displaystyle\int_{\mathds{Y}}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u)
\displaystyle=\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0},u)-\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0}^{\prime},u)
\displaystyle+\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0}^{\prime},u)-\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})P(dy_{1}|z_{0},u)
\displaystyle+\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})P(dy_{1}|z_{0},u)-\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0}^{\prime},u)
\displaystyle+\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0}^{\prime},u)-\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0},u)

For the first and the last term, we can write

\displaystyle\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0},u)-\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0}^{\prime},u)
\displaystyle+\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0}^{\prime},u)-\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0},u)
\displaystyle=\int_{\mathds{Y}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})\left(\mathds{1}_{I_{-}}(y_{1})-\mathds{1}_{I_{+}}(y_{1})\right)P(dy_{1}|z_{0}^{\prime},u)
\displaystyle\qquad-\int_{\mathds{Y}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})\left(\mathds{1}_{I_{-}}(y_{1})-\mathds{1}_{I_{+}}(y_{1})\right)P(dy_{1}|z_{0},u)
\displaystyle\leq\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}\leq(1+\alpha_{\mathds{X}})\rho_{BL}(z_{0}^{\prime},z_{0}). (18)

For the second and the third term:

\displaystyle\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0}^{\prime},u)-\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})P(dy_{1}|z_{0},u)
\displaystyle+\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})P(dy_{1}|z_{0},u)-\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})P(dy_{1}|z_{0}^{\prime},u)
\displaystyle=\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|z_{0}^{\prime},u)-\int_{I_{+}}\int_{\mathds{X}}f_{m}(x_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|z_{0},u)
\displaystyle\qquad+\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|z_{0},u)-\int_{I_{-}}\int_{\mathds{X}}f_{m}(x_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|z_{0}^{\prime},u)
\displaystyle=\int_{\mathds{X}}f_{m}(x_{1})\left(Q(I_{+}|x_{1})-Q(I_{-}|x_{1})\right)\mathcal{T}(dx_{1}|z_{0}^{\prime},u)
\displaystyle\qquad-\int_{\mathds{X}}f_{m}(x_{1})\left(Q(I_{+}|x_{1})-Q(I_{-}|x_{1})\right)\mathcal{T}(dx_{1}|z_{0},u) (19)
\displaystyle\leq(1+\alpha_{\mathds{X}})\rho_{BL}(z_{0}^{\prime},z_{0}).

Hence, we can write that

\displaystyle\int_{\mathds{Y}}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u)
\displaystyle\leq 2(1+\alpha_{\mathds{X}})\rho_{BL}(z_{0}^{\prime},z_{0}).

Now we go back to the term (16):

\displaystyle\sup_{\|f\|_{BL}\leq 1}\int_{\mathds{Y}}\big{|}f(z_{1}(z_{0}^{\prime},u,y_{1}))-f(z_{1}(z_{0},u,y_{1}))\big{|}P(dy_{1}|z_{0},u)
\displaystyle\leq\int_{\mathds{Y}}\rho(z_{1}(z_{0}^{\prime},u,y_{1}),z_{1}(z_{0},u,y_{1}))P(dy_{1}|z_{0},u)
\displaystyle=\int_{\mathds{Y}}\sum_{m=1}^{\infty}2^{-(m+1)}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u)
\displaystyle=\sum_{m=1}^{\infty}2^{-(m+1)}\int_{\mathds{Y}}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u)
\displaystyle\leq 2(1+\alpha_{\mathds{X}})\rho_{BL}(z_{0}^{\prime},z_{0}).

Thus, combining all the results:

\displaystyle\sup_{\|f\|_{BL}\leq 1}\bigg{|}\int_{{\cal P}(\mathds{X})}f(z_{1})\eta(dz_{1}|z_{0}^{\prime},u)-\int_{{\cal P}(\mathds{X})}f(z_{1})\eta(dz_{1}|z_{0},u)\bigg{|}
\displaystyle\leq\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}
\displaystyle\qquad+\sup_{\|f\|_{BL}\leq 1}\int_{\mathds{Y}}\big{|}f(z_{1}(z_{0}^{\prime},u,y_{1}))-f(z_{1}(z_{0},u,y_{1}))\big{|}P(dy_{1}|z_{0},u)
\displaystyle\leq 3(1+\alpha_{\mathds{X}})\rho_{BL}(z_{0}^{\prime},z_{0}).

We now prove part (ii); we first note that it suffices to show $\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}\leq 2(1-\delta(Q))\rho_{BL}(z_{0}^{\prime},z_{0})$, from which the result follows. Hence, we write (by Dobrushin, 1956)

\displaystyle\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}=\sup_{\|f\|_{\infty}\leq 1}\left|\int f(y_{1})P(dy_{1}|z_{0}^{\prime},u)-\int f(y_{1})P(dy_{1}|z_{0},u)\right|
\displaystyle=\sup_{\|f\|_{\infty}\leq 1}\bigg{|}\int f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|z_{0}^{\prime},u)-\int f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|z_{0},u)\bigg{|}
\displaystyle\leq(1-\delta(Q))\|\mathcal{T}(dx_{1}|z_{0}^{\prime},u)-\mathcal{T}(dx_{1}|z_{0},u)\|_{TV}\leq(1-\delta(Q))(1+\alpha_{\mathds{X}})\rho_{BL}(z_{0}^{\prime},z_{0}).
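The Dobrushin contraction step used above can be checked numerically on a small finite channel. The sketch below is illustrative only (the channel $Q$ is randomly generated, not from the paper): it computes the Dobrushin coefficient $\delta(Q)=\min_{x,x'}\sum_{y}\min(Q(y|x),Q(y|x'))$ and verifies the total variation contraction, with total variation taken in the $\sup_{\|f\|_{\infty}\leq 1}$ form used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical finite measurement channel Q(y|x): rows are states, columns observations.
Q = rng.dirichlet(np.ones(4), size=3)  # 3 states, 4 observations

def dobrushin(Q):
    # delta(Q): minimum overlap between any two rows of the channel.
    n = Q.shape[0]
    return min(np.minimum(Q[i], Q[j]).sum() for i in range(n) for j in range(n))

def tv(mu, nu):
    # Total variation in the sup_{|f| <= 1} form: sum of absolute differences.
    return np.abs(mu - nu).sum()

delta = dobrushin(Q)
for _ in range(100):
    mu, nu = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
    # ||mu Q - nu Q||_TV <= (1 - delta(Q)) ||mu - nu||_TV
    assert tv(mu @ Q, nu @ Q) <= (1 - delta) * tv(mu, nu) + 1e-12
```

The same inequality is what drives the $(1-\delta(Q))$ factors in parts (ii) and (iv).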

To prove part (iii) of the theorem, we start from the terms (15) and (16). For (15), we have

\displaystyle\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}
\displaystyle=\sup_{\|f\|_{\infty}\leq 1}\bigg{|}\int f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|x_{0},u)z_{0}^{\prime}(dx_{0})-\int f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|x_{0},u)z_{0}(dx_{0})\bigg{|}
\displaystyle\leq\sup_{\|f\|_{\infty}\leq 1}\left|\int f(x_{0})z_{0}(dx_{0})-\int f(x_{0})z_{0}^{\prime}(dx_{0})\right|=\|z_{0}-z_{0}^{\prime}\|_{TV}.

For (16), as before, we start by writing

\displaystyle\sup_{\|f\|_{BL}\leq 1}\int_{\mathds{Y}}\big{|}f(z_{1}(z_{0}^{\prime},u,y_{1}))-f(z_{1}(z_{0},u,y_{1}))\big{|}P(dy_{1}|z_{0},u)
\displaystyle\leq\sum_{m=1}^{\infty}2^{-m+1}\int_{\mathds{Y}}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u).

For each $m$, using the bounding terms (18) and (19), we have

\displaystyle\int_{\mathds{Y}}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u)
\displaystyle\leq\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}+\|\mathcal{T}(\cdot|z_{0}^{\prime},u)-\mathcal{T}(\cdot|z_{0},u)\|_{TV}\leq 2\|z_{0}^{\prime}-z_{0}\|_{TV},

which completes the proof of part (iii).

For part (iv), for (15) we have (by Dobrushin, 1956)

\displaystyle\|P(\cdot|z_{0}^{\prime},u)-P(\cdot|z_{0},u)\|_{TV}
\displaystyle=\sup_{\|f\|_{\infty}\leq 1}\bigg{|}\int f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|z_{0}^{\prime},u)-\int f(y_{1})Q(dy_{1}|x_{1})\mathcal{T}(dx_{1}|z_{0},u)\bigg{|}
\displaystyle\leq(1-\delta(Q))\|\mathcal{T}(dx_{1}|z_{0}^{\prime},u)-\mathcal{T}(dx_{1}|z_{0},u)\|_{TV}
\displaystyle=(1-\delta(Q))\sup_{\|f\|_{\infty}\leq 1}\bigg{|}\int f(x_{1})\mathcal{T}(dx_{1}|x_{0},u)z_{0}^{\prime}(dx_{0})-\int f(x_{1})\mathcal{T}(dx_{1}|x_{0},u)z_{0}(dx_{0})\bigg{|}
\displaystyle\leq(1-\delta(Q))(1-\tilde{\delta}(\mathcal{T}))\|z_{0}^{\prime}-z_{0}\|_{TV}.

For (16), using Lemma 3.2 of McDonald and Yüksel (2020) and the analysis above,

\displaystyle\int_{\mathds{Y}}\bigg{|}\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0}^{\prime},u,y_{1})(dx_{1})-\int_{\mathds{X}}f_{m}(x_{1})z_{1}(z_{0},u,y_{1})(dx_{1})\bigg{|}P(dy_{1}|z_{0},u)
\displaystyle\leq(2-\delta(Q))\|\mathcal{T}(\cdot|z_{0}^{\prime},u)-\mathcal{T}(\cdot|z_{0},u)\|_{TV}\leq(2-\delta(Q))(1-\tilde{\delta}(\mathcal{T}))\|z_{0}^{\prime}-z_{0}\|_{TV},

which completes the proof.

6.2 Proof of Theorem 9

Before presenting our proof program and the series of technical results needed, we introduce some notation.

We denote the loss function due to the quantization by $L(z)$, i.e.,

\displaystyle L(z)=\rho_{BL}(z,F(z))

with $F$ defined in Equation 8.

In the following, to specify the probability measures with respect to which expectations are taken, we use the following notation: for the kernel $\eta^{N}$ and a policy $\gamma$ we use $E_{N}^{\gamma}$, and for the kernel $\eta$ and a policy $\gamma$ we use $E^{\gamma}$.

The last notation we introduce is the following. Recall that we denote the optimal cost for the finite model by $J^{N}_{\beta}$ and the optimal policy by $\gamma_{N}^{*}$. These are defined on the finite set $\mathcal{Z}_{\hat{\pi}}^{N}$; however, we can always extend them to the original state space $\mathcal{Z}$ so that they are constant over the quantization bins. We denote the extended versions by $\tilde{J}^{N}_{\beta}$ and $\tilde{\gamma}_{N}^{*}$.
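As an illustrative sketch of the quantization loss $L(z)$ (with a hypothetical set of representative beliefs, and the 1-norm standing in for $\rho_{BL}$, to which it is comparable up to constants on a finite state space):

```python
import numpy as np

# Hypothetical finite set of representative beliefs (the quantizer's output set).
reps = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])

def F(z, reps):
    # Nearest-neighbor quantizer: map the belief z to the closest representative;
    # the achieved distance is the quantization loss L(z).
    d = np.abs(reps - z).sum(axis=1)
    return reps[np.argmin(d)], d.min()

z = np.array([0.7, 0.3])
z_hat, L = F(z, reps)   # -> z_hat = [0.5, 0.5], L(z) = 0.4
```

The extended policy $\tilde{\gamma}_{N}^{*}$ then simply evaluates the finite-model policy at $F(z)$, so it is constant over each bin.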

Now, we introduce our value iteration approach for the original model and the finite model. We write, for any $k<\infty$,

\displaystyle v_{k+1}(z)=\min_{u}\left(\tilde{c}(z,u)+\beta\int v_{k}(z_{1})\eta(dz_{1}|z,u)\right)\quad\forall z\in\mathcal{Z},
\displaystyle v^{N}_{k+1}(z)=\min_{u}\left(c^{N}(z,u)+\beta\sum_{z_{1}}v^{N}_{k}(z_{1})\eta^{N}(z_{1}|z,u)\right)\quad\forall z\in\mathcal{Z}_{\hat{\pi}}^{N},

where $v_{0},v^{N}_{0}\equiv 0$. It is well known that the above operations define contractions under either model, and hence the value functions converge to the optimal expected discounted cost. In particular, we have that

\displaystyle\big{|}J^{N}_{\beta}(z)-v^{N}_{k}(z)\big{|}\leq\|\tilde{c}\|_{\infty}\frac{\beta^{k}}{1-\beta},\quad\big{|}J^{*}_{\beta}(z)-v_{k}(z)\big{|}\leq\|\tilde{c}\|_{\infty}\frac{\beta^{k}}{1-\beta}.

Notice that if we extend $v^{N}_{k+1}(z)$ and $c^{N}(z,u)$ to the whole state space, then $\tilde{v}^{N}_{k+1}(z)$ and $\tilde{c}^{N}(z,u)$ become constant over the quantization bins. The extended value functions for the finite model can then also be obtained via

\displaystyle\tilde{v}^{N}_{k+1}(z)=\min_{u}\left(\tilde{c}(F(z),u)+\beta\int\tilde{v}^{N}_{k}(z_{1})\eta(dz_{1}|F(z),u)\right)\quad\forall z\in\mathcal{Z}. \qquad (20)

To see this, observe the following:

\displaystyle\tilde{v}^{N}_{k+1}(z)=\min_{u}\left(\tilde{c}(F(z),u)+\beta\int\tilde{v}^{N}_{k}(z_{1})\eta(dz_{1}|F(z),u)\right)
\displaystyle=\min_{u}\left(\tilde{c}(F(z),u)+\beta\sum_{i=1}^{|\mathcal{Z}_{\hat{\pi}}^{N}|}\int_{\mathcal{Z}^{N}_{i}}\tilde{v}^{N}_{k}(z_{1})\eta(dz_{1}|F(z),u)\right)
\displaystyle=\min_{u}\left(\tilde{c}(F(z),u)+\beta\sum_{i=1}^{|\mathcal{Z}_{\hat{\pi}}^{N}|}\tilde{v}^{N}_{k}(z_{1}^{i})\int_{\mathcal{Z}^{N}_{i}}\eta(dz_{1}|F(z),u)\right)
\displaystyle=\min_{u}\left(\tilde{c}(F(z),u)+\beta\sum_{i=1}^{|\mathcal{Z}_{\hat{\pi}}^{N}|}\tilde{v}^{N}_{k}(z_{1}^{i})\eta(\mathcal{Z}^{N}_{i}|F(z),u)\right),

where $\mathcal{Z}^{N}_{i}$ denotes the $i$th quantization bin and $\tilde{v}^{N}_{k}(z_{1}^{i})$ denotes the (constant) value that $\tilde{v}^{N}_{k}$ takes on that bin.
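The two value iteration recursions and the geometric error bound $|J^{N}_{\beta}(z)-v^{N}_{k}(z)|\leq\|\tilde{c}\|_{\infty}\beta^{k}/(1-\beta)$ can be sketched on a small synthetic finite model; all parameters below are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, beta = 5, 2, 0.9

# A hypothetical finite (quantized belief) MDP.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a] = distribution over next states
c = rng.random((nS, nA))                       # bounded one-stage cost

def value_iteration(k):
    # k steps of the contraction v <- min_u (c + beta * P v), starting from v_0 = 0.
    v = np.zeros(nS)
    for _ in range(k):
        v = (c + beta * P @ v).min(axis=1)
    return v

J = value_iteration(2000)  # effectively the fixed point J_beta^N
for k in (5, 10, 20):
    bound = c.max() * beta**k / (1 - beta)   # ||c||_inf * beta^k / (1 - beta)
    assert np.abs(J - value_iteration(k)).max() <= bound + 1e-9
```

The same contraction argument is what yields (21) below.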

For the proof, the goal is to bound the term $|J_{\beta}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)|$. We will see in the following that bounding this term is related to studying the value function approximation error $|\tilde{J}^{N}_{\beta}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)|$. Therefore, in what follows, we first analyze $|\tilde{J}^{N}_{\beta}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)|$ and observe that it can be bounded using the expected loss terms $E[L(Z_{t})]$. Finally, we will observe that upper bounds on these expressions can be written through the filter stability term

\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]},\gamma(Y_{[0,N-1]})),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]},\gamma(Y_{[0,N-1]}))\right)\right].

Next, we present some supporting technical results.

Lemma 18

Under Assumption 3, we have

\displaystyle|\tilde{J}^{N}_{\beta}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)|
\displaystyle\leq\lim_{k\to\infty}\sup_{\gamma\in\Gamma}\Bigg{(}\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t}\left(E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})]\right)
\displaystyle\qquad\qquad+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})]\right)\Bigg{)},

where $v_{k}$ denotes the value function at time step $k$ produced by the value iteration with $v_{0}\equiv 0$.

Proof 

We follow a value iteration approach and write, for any $k<\infty$,

\displaystyle v_{k+1}(z)=\min_{u}\left(\tilde{c}(z,u)+\beta\int v_{k}(y)\eta(dy|z,u)\right)\quad\forall z\in\mathcal{Z},
\displaystyle\tilde{v}^{N}_{k+1}(z)=\min_{u}\left(\tilde{c}(F(z),u)+\beta\int\tilde{v}^{N}_{k}(y)\eta(dy|F(z),u)\right)\quad\forall z\in\mathcal{Z},

where $v_{0},\tilde{v}^{N}_{0}\equiv 0$.

We then write

\displaystyle|\tilde{J}_{\beta}^{N}(z)-J^{*}_{\beta}(z)|\leq|\tilde{J}^{N}_{\beta}(z)-\tilde{v}_{k}^{N}(z)|+|\tilde{v}_{k}^{N}(z)-v_{k}(z)|+|v_{k}(z)-J^{*}_{\beta}(z)|. \qquad (21)

For the first and the last term, using the fact that the dynamic programming operator is a contraction, we have that

\displaystyle\big{|}\tilde{J}^{N}_{\beta}(z)-\tilde{v}^{N}_{k}(z)\big{|}\leq\|\tilde{c}\|_{\infty}\frac{\beta^{k}}{1-\beta},\quad\big{|}J^{*}_{\beta}(z)-v_{k}(z)\big{|}\leq\|\tilde{c}\|_{\infty}\frac{\beta^{k}}{1-\beta}.

Now, we focus on the second term $|\tilde{v}_{k}^{N}(z)-v_{k}(z)|$.

Step 1: We show in the Appendix, Section A.1, that for all $k\geq 1$,

\displaystyle|\tilde{v}_{k}^{N}(z)-v_{k}(z)|
\displaystyle\leq\sup_{\gamma\in\Gamma}\left(\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t}E^{\gamma}_{z,N}\left[L(Z_{t})\right]+\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}E^{\gamma}_{z,N}\left[L(Z_{t})\right]\alpha_{\mathcal{Z}}\right),

where $L(z)$ is the loss function due to the quantization, i.e., $L(z)=\rho_{BL}(z,\hat{z})$, where $\hat{z}$ is the representative point of the bin to which $z$ belongs.

Step 2: We show in the Appendix, Section A.2, that

\displaystyle E^{\gamma}_{z,N}\left[L(Z_{t})\right]\leq E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})].

Step 3: Combining our results, we have that

\displaystyle|\tilde{v}_{k}^{N}(z)-v_{k}(z)|
\displaystyle\leq\sup_{\gamma}\Bigg{(}\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t}E^{\gamma}_{z,N}\left[L(Z_{t})\right]+\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}E^{\gamma}_{z,N}\left[L(Z_{t})\right]\alpha_{\mathcal{Z}}\Bigg{)}
\displaystyle\leq\sup_{\gamma}\Bigg{(}\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t}\left(E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})]\right)
\displaystyle\qquad\qquad+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})]\right)\Bigg{)}. \qquad (22)

Hence, using (21) and (22), we write

\displaystyle|\tilde{J}_{\beta}^{N}(z)-J^{*}_{\beta}(z)|\leq\lim_{k\to\infty}|\tilde{v}_{k}^{N}(z)-v_{k}(z)|
\displaystyle\leq\lim_{k\to\infty}\sup_{\gamma}\Bigg{(}\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t}\left(E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})]\right)
\displaystyle\qquad\qquad+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})]\right)\Bigg{)}.

 

The next result gives a bound on the loss function arising from the quantization (on $L(z)$) and relates the bound to the filter stability problem.

Lemma 19

For $Z_{0}=P^{\pi_{0}}(X_{N}\in\cdot|Y_{[0,N]},\gamma(Y_{[0,N-1]}))$, where $\pi_{0}\in{\mathcal{P}}(\mathds{X})$ is the prior distribution of $X_{0}$, we have that for any $N<t<\infty$

\displaystyle\sup_{\gamma\in\Gamma}E_{\pi_{0}}^{\gamma}\left[E_{Z_{0}}^{\gamma}\left[L(Z_{t})\right]|Y_{[0,N]},\gamma\left(Y_{[0,N-1]}\right)\right]
\displaystyle\leq\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]},\gamma(Y_{[0,N-1]})),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]},\gamma(Y_{[0,N-1]}))\right)\right].

Proof  We first note that, since the quantizer is a nearest-neighbor quantizer, the quantization error is almost surely upper bounded by
$\rho_{BL}\left(P^{\pi_{t_{-}}}(X_{N+t}\in\cdot|Y_{[t,t+N]},U_{[t,t+N-1]}),P^{\hat{\pi}}(X_{N+t}\in\cdot|Y_{[t,t+N]},U_{[t,t+N-1]})\right)$. Using this and the law of total expectation:

\displaystyle E^{\gamma}_{\pi_{0}}\left[E_{z_{0}}^{\gamma}\left[L(Z_{t})\right]|Y_{[0,N]},\gamma(Y_{[0,N-1]})\right]
\displaystyle\leq\sup_{\gamma\in\Gamma}E_{\pi_{0}}^{\gamma}\bigg{[}E_{\pi_{t_{-}}}^{\gamma}\bigg{[}\rho_{BL}\bigg{(}P^{\pi_{t_{-}}}(X_{N+t}\in\cdot|Y_{[t,t+N]},U_{[t,t+N-1]}),P^{\hat{\pi}}(X_{N+t}\in\cdot|Y_{[t,t+N]},U_{[t,t+N-1]})\bigg{)}\bigg{]}\bigg{|}{(U,Y)}_{[0,t-1]}\bigg{]}
\displaystyle\leq\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E_{\pi}^{\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]}),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]},U_{[0,N-1]})\right)\right],

where ${(U,Y)}_{[0,t-1]}=(U_{[0,t-1]},Y_{[0,t-1]})$.

Lemma 20

Under Assumption 3 we have that

\displaystyle\|v_{k}\|_{BL}\leq\alpha_{\tilde{c}}\sum_{t=0}^{k-1}(\beta\alpha_{\mathcal{Z}})^{t}+\|\tilde{c}\|_{\infty}\sum_{t=0}^{k-1}(\beta\alpha_{\mathcal{Z}})^{t}\frac{1-\beta^{k-t}}{1-\beta}.

In particular, we have that for all $k$

\displaystyle\|v_{k}\|_{BL}\leq\frac{1}{1-\beta\alpha_{\mathcal{Z}}}\left(\frac{\|\tilde{c}\|_{\infty}}{1-\beta}+\alpha_{\tilde{c}}\right).
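For illustrative parameters satisfying $\beta\alpha_{\mathcal{Z}}<1$ (chosen here only for the sketch), one can check numerically that the finite-$k$ expression is dominated by the uniform bound:

```python
import numpy as np

# Hypothetical parameters with beta * alpha_Z < 1, as the lemma requires.
beta, alpha_Z, alpha_c, c_inf = 0.6, 1.2, 2.0, 1.0

def bl_bound(k):
    # Finite-k bound: alpha_c * sum (beta*alpha_Z)^t
    #   + ||c|| * sum (beta*alpha_Z)^t (1 - beta^(k-t)) / (1 - beta).
    t = np.arange(k)
    return (alpha_c * ((beta * alpha_Z) ** t).sum()
            + c_inf * (((beta * alpha_Z) ** t) * (1 - beta ** (k - t)) / (1 - beta)).sum())

# Uniform-in-k bound from the second display.
uniform = (c_inf / (1 - beta) + alpha_c) / (1 - beta * alpha_Z)
assert all(bl_bound(k) <= uniform + 1e-12 for k in range(1, 200))
```

The uniform bound follows since each summand is dominated termwise by the corresponding term of the infinite geometric series.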

Proof  It is easy to see that $\|v_{k}\|_{\infty}\leq\|\tilde{c}\|_{\infty}\sum_{t=0}^{k-1}\beta^{t}=\|\tilde{c}\|_{\infty}\frac{1-\beta^{k}}{1-\beta}$. Then, we use an inductive approach: we assume the claim holds for $k$ and analyze the term for $k+1$,

\displaystyle\|v_{k+1}\|_{BL}=\|v_{k+1}\|_{\infty}+\sup_{x\neq y}\frac{|v_{k+1}(x)-v_{k+1}(y)|}{\rho_{BL}(x,y)}.

For the second term, we have

\displaystyle|v_{k+1}(x)-v_{k+1}(y)|\leq\sup_{u}\left(|\tilde{c}(x,u)-\tilde{c}(y,u)|+\beta\left|\int v_{k}(z)\eta(dz|x,u)-\int v_{k}(z)\eta(dz|y,u)\right|\right)
\displaystyle\leq\alpha_{\tilde{c}}\rho_{BL}(x,y)+\beta\|v_{k}\|_{BL}\alpha_{\mathcal{Z}}\rho_{BL}(x,y).

Hence, using the induction hypothesis, we have that

\displaystyle\|v_{k+1}\|_{BL}\leq\|v_{k+1}\|_{\infty}+\alpha_{\tilde{c}}+\beta\alpha_{\mathcal{Z}}\|v_{k}\|_{BL}
\displaystyle\qquad\leq\|\tilde{c}\|_{\infty}\frac{1-\beta^{k+1}}{1-\beta}+\alpha_{\tilde{c}}+\beta\alpha_{\mathcal{Z}}\left(\alpha_{\tilde{c}}\sum_{t=0}^{k-1}(\beta\alpha_{\mathcal{Z}})^{t}+\|\tilde{c}\|_{\infty}\sum_{t=0}^{k-1}(\beta\alpha_{\mathcal{Z}})^{t}\frac{1-\beta^{k-t}}{1-\beta}\right)
\displaystyle\qquad=\alpha_{\tilde{c}}\sum_{t^{\prime}=0}^{k}(\beta\alpha_{\mathcal{Z}})^{t^{\prime}}+\|\tilde{c}\|_{\infty}\sum_{t^{\prime}=0}^{k}(\beta\alpha_{\mathcal{Z}})^{t^{\prime}}\frac{1-\beta^{k-t^{\prime}}}{1-\beta},\quad t^{\prime}=t+1,

which concludes the induction argument. Therefore, we can write

\displaystyle\|v_{k}\|_{BL}\leq\alpha_{\tilde{c}}\sum_{t=0}^{k-1}(\beta\alpha_{\mathcal{Z}})^{t}+\|\tilde{c}\|_{\infty}\sum_{t=0}^{k-1}(\beta\alpha_{\mathcal{Z}})^{t}\frac{1-\beta^{k-t}}{1-\beta}
\displaystyle=\left(\alpha_{\tilde{c}}+\frac{\|\tilde{c}\|_{\infty}}{1-\beta}\right)\frac{1-(\beta\alpha_{\mathcal{Z}})^{k}}{1-\beta\alpha_{\mathcal{Z}}}-\frac{\|\tilde{c}\|_{\infty}\beta^{k}(1-\alpha_{\mathcal{Z}}^{k})}{(1-\beta)(1-\alpha_{\mathcal{Z}})}.

 
Now, we can prove Theorem 9.

Proof of Theorem 9

Consider the following dynamic programming operators $\hat{T}_{n},T_{n}:{\mathcal{B}}_{b}(\mathcal{Z})\to{\mathcal{B}}_{b}(\mathcal{Z})$ (where ${\mathcal{B}}_{b}(\mathcal{Z})$ denotes the set of measurable and bounded functions on $\mathcal{Z}$) such that for any $f\in{\mathcal{B}}_{b}(\mathcal{Z})$ and any $z\in\mathcal{Z}$

\displaystyle(T_{n}(f))(z):=c(z,\tilde{\gamma}^{*}_{N}(z))+\beta\int f(y)\eta(dy|z,\tilde{\gamma}_{N}^{*}(z))
\displaystyle(\hat{T}_{n}(f))(z):=c(F(z),\tilde{\gamma}_{N}^{*}(z))+\beta\int f(y)\eta(dy|F(z),\tilde{\gamma}_{N}^{*}(z))

where $\tilde{\gamma}_{N}^{*}$ denotes the extension of the optimal policy for the finite-state-space model to the belief space $\mathcal{Z}$ (i.e., it is constant over the quantization bins), and $F(z)$ is the nearest-neighbor quantization map taking $z$ to the closest point of the finite state space $\mathcal{Z}_{\hat{\pi}}^{N}$ (recall that $\rho_{BL}$ is used to metrize the belief space).

The optimal cost for the finite model, $J^{N}_{\beta}$, is defined only on a finite set; we denote by $\tilde{J}^{N}_{\beta}$ the extension of this optimal cost to the belief space ${\mathcal{P}}(\mathds{X})$ obtained by assigning the same value over each quantization bin, i.e., it is a piecewise-constant function over the quantization bins. Note that we have

\displaystyle J_{\beta}(\tilde{\gamma}_{N}^{*},z)=T_{n}(J_{\beta}(\tilde{\gamma}^{*}_{N},z)),
\displaystyle\tilde{J}^{N}_{\beta}(\tilde{\gamma}_{N}^{*},z)=\hat{T}_{n}(\tilde{J}^{N}_{\beta}(\tilde{\gamma}_{N}^{*},z)).

Using these equalities, we write

\displaystyle J_{\beta}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)\leq|T_{n}(J_{\beta}(\tilde{\gamma}_{N}^{*},z))-T_{n}(J_{\beta}(\gamma^{*},z))|
\displaystyle\qquad+|T_{n}(J_{\beta}(\gamma^{*},z))-\hat{T}_{n}(J_{\beta}(\gamma^{*},z))|
\displaystyle\qquad+|\hat{T}_{n}(J_{\beta}(\gamma^{*},z))-\hat{T}_{n}(\tilde{J}^{N}_{\beta}(\tilde{\gamma}_{N}^{*},z))|
\displaystyle\qquad+|\tilde{J}^{N}_{\beta}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)|
\displaystyle\leq\beta\int|J_{\beta}(\tilde{\gamma}_{N}^{*},z_{1})-J_{\beta}(\gamma^{*},z_{1})|\eta(dz_{1}|z,\tilde{\gamma}_{N}^{*}(z))
\displaystyle\qquad+|c(z,\tilde{\gamma}_{N}^{*}(z))-c(F(z),\tilde{\gamma}_{N}^{*}(z))|
\displaystyle\qquad+\beta\left|\int J_{\beta}(\gamma^{*},z_{1})\eta(dz_{1}|z,\tilde{\gamma}_{N}^{*}(z))-\int J_{\beta}(\gamma^{*},z_{1})\eta(dz_{1}|F(z),\tilde{\gamma}_{N}^{*}(z))\right|
\displaystyle\qquad+\beta\int|\tilde{J}_{\beta}^{N}(\tilde{\gamma}_{N}^{*},z_{1})-J_{\beta}(\gamma^{*},z_{1})|\eta(dz_{1}|F(z),\tilde{\gamma}_{N}^{*}(z))
\displaystyle\qquad+|\tilde{J}_{\beta}^{N}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)|
\displaystyle\leq\beta\sup_{\gamma}\bigg{(}E^{\gamma}_{z}[|J_{\beta}(\tilde{\gamma}_{N}^{*},Z_{1})-J_{\beta}(\gamma^{*},Z_{1})|]\bigg{)}+\alpha_{c}L(z)+\beta\alpha_{\mathcal{Z}}L(z)\|J_{\beta}^{*}\|_{BL}
\displaystyle\qquad+\beta E^{\gamma}_{z,N}[|\tilde{J}_{\beta}^{N}(\tilde{\gamma}_{N}^{*},Z_{1})-J_{\beta}(\gamma^{*},Z_{1})|]+|\tilde{J}_{\beta}^{N}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)| \qquad (23)

where $E^{\gamma}_{z,N}$ denotes the expectation with respect to the kernel $\eta(dz_{1}|F(z),\gamma(z))$ when the initial point is $z$.

Now, using Lemma 18, we can write that:

\displaystyle|\tilde{J}^{N}_{\beta}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)|
\displaystyle\leq\lim_{k\to\infty}\sup_{\gamma}\Bigg{(}\alpha_{c}\sum_{t=0}^{k-1}\beta^{t}\left(E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})]\right)
\displaystyle\qquad+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(E^{\gamma}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E^{\gamma}_{z}[L(Z_{t-m-1})]\right)\Bigg{)}
\displaystyle:=g(z). \qquad (24)

Along with the function $g(z)$ defined in (24), we introduce the following notation:

\displaystyle f(z):=|J_{\beta}(\tilde{\gamma}_{N}^{*},z)-J_{\beta}(\gamma^{*},z)|.

Notice that with the new notation, we can rewrite the bound (23) as:

\displaystyle f(z)\leq\beta\sup_{\gamma}E_{z}^{\gamma}[f(Z_{1})]+\alpha_{c}L(z)+\beta\alpha_{\mathcal{Z}}L(z)\|J^{*}_{\beta}\|_{BL}+\beta E_{z,N}^{\gamma}[g(Z_{1})]+g(z)
\displaystyle\leq\beta\sup_{\gamma}E_{z}^{\gamma}[f(Z_{1})]+\alpha_{c}L(z)+\beta\alpha_{\mathcal{Z}}L(z)\|J^{*}_{\beta}\|_{BL}+\beta E_{z}^{\gamma}[g(Z_{1})]+\beta\|g\|_{BL}\alpha_{\mathcal{Z}}L(z)+g(z),

where we used the fact that

\displaystyle E_{z,N}^{\gamma}[g(Z_{1})]-E_{z}^{\gamma}[g(Z_{1})]=\int g(z_{1})\eta(dz_{1}|F(z),\gamma(z))-\int g(z_{1})\eta(dz_{1}|z,\gamma(z))
\displaystyle\leq\|g\|_{BL}\alpha_{\mathcal{Z}}L(z).

Using the same bound on $f(Z_{1})$, one can also write that for any initial point $z_{0}$:

\displaystyle f(z_{0})\leq\sup_{\gamma\in\Gamma}\beta^{2}E^{\gamma}_{z_{0}}[f(Z_{2})]+\alpha_{c}\sum_{t=0}^{1}\beta^{t}E^{\gamma}_{z_{0}}[L(Z_{t})]+\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}\sum_{t=0}^{1}\beta^{t+1}E^{\gamma}_{z_{0}}[L(Z_{t})]
\displaystyle\qquad+\sum_{t=0}^{1}\beta^{t+1}E^{\gamma}_{z_{0}}[g(Z_{t})]+\|g\|_{BL}\alpha_{\mathcal{Z}}\sum_{t=0}^{1}\beta^{t+1}E^{\gamma}_{z_{0}}[L(Z_{t})]+\sum_{t=0}^{1}\beta^{t}E^{\gamma}_{z_{0}}[g(Z_{t})].

In general, for any $k<\infty$, we can write

\displaystyle f(z_{0})\leq\sup_{\gamma\in\Gamma}\Bigg{(}\beta^{k}E^{\gamma}_{z_{0}}[f(Z_{k})]+\alpha_{c}\sum_{t=0}^{k-1}\beta^{t}E^{\gamma}_{z_{0}}[L(Z_{t})]+\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}\sum_{t=0}^{k-1}\beta^{t+1}E^{\gamma}_{z_{0}}[L(Z_{t})]
\displaystyle\qquad+\sum_{t=0}^{k-1}\beta^{t+1}E^{\gamma}_{z_{0}}[g(Z_{t})]+\|g\|_{BL}\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}E^{\gamma}_{z_{0}}[L(Z_{t})]+\sum_{t=0}^{k-1}\beta^{t}E^{\gamma}_{z_{0}}[g(Z_{t})]\Bigg{)}. \qquad (25)

Recall that our main goal is to bound $E_{\pi_{0}}[f(Z_{0})|Y_{[0,N]},\gamma(Y_{[0,N-1]})]$, where $Z_{0}=P^{\pi_{0}}(X_{N}\in\cdot|Y_{[0,N]},\gamma(Y_{[0,N-1]}))$. To this end, first notice that, using Lemma 19, we have for any $t$

\displaystyle E_{\pi_{0}}\left[E^{\gamma}_{Z_{0}}[L(Z_{t})]|Y_{[0,N]},\gamma(Y_{[0,N-1]})\right]
\displaystyle\leq\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E^{\pi,\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]}),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]})\right)\right]:=B. \qquad (26)

Using this bound and Lemma 20 for $\|v_{k-t-1}\|_{BL}$, we can write that

\displaystyle E\left[E^{\gamma}_{Z_{0}}[g(Z_{t})]\big{|}Y_{[0,N]},\gamma(Y_{[0,N-1]})\right]
\displaystyle\leq\lim_{k\to\infty}\Bigg{(}\alpha_{c}\sum_{t=0}^{k-1}\beta^{t}\left(B+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}B(4\alpha_{\mathcal{Z}}+1)^{m}\right)
\displaystyle\qquad+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(B+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}B(4\alpha_{\mathcal{Z}}+1)^{m}\right)\Bigg{)}
\displaystyle\leq B\left(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}\right)\lim_{k\to\infty}\sum_{t=0}^{k-1}\beta^{t}\left(1+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}\right)
\displaystyle\leq B\left(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}\right)\lim_{k\to\infty}\sum_{t=0}^{k-1}\beta^{t}\left(1+3\alpha_{\mathcal{Z}}\frac{(4\alpha_{\mathcal{Z}}+1)^{t}-1}{4\alpha_{\mathcal{Z}}}\right)
\displaystyle\leq B\left(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}\right)\lim_{k\to\infty}\sum_{t=0}^{k-1}\beta^{t}(4\alpha_{\mathcal{Z}}+1)^{t}
\displaystyle=BK_{0}(\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}},\|\tilde{c}\|_{\infty}) \qquad (27)

where

\displaystyle K_{0}(\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}},\|\tilde{c}\|_{\infty})=\left(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}\right)\frac{1}{1-\beta(4\alpha_{\mathcal{Z}}+1)}.

By Lemma 23 we also have that

\displaystyle\|g\|_{BL}\leq(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}\|_{BL})\left(\frac{2}{1-\beta(4\alpha_{\mathcal{Z}}+1)}+\frac{3\alpha_{\mathcal{Z}}}{1-\beta\alpha_{\mathcal{Z}}}+\frac{9\alpha_{\mathcal{Z}}^{2}}{(1-\beta(4\alpha_{\mathcal{Z}}+1))^{2}}\right)
\displaystyle:=\hat{K}_{0}(\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}},\|\tilde{c}\|_{\infty}).

Going back to (25), using (26) and (27) and taking the limit $k\to\infty$:

\displaystyle E[f(Z_{0})|Y_{[0,N]},\gamma(Y_{[0,N-1]})]\leq\frac{B\alpha_{c}}{1-\beta}+\frac{\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}B}{1-\beta}+\frac{\beta K_{0}(\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}},\|\tilde{c}\|_{\infty})B}{1-\beta}
\displaystyle\qquad+\frac{\hat{K}_{0}(\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}},\|\tilde{c}\|_{\infty})\alpha_{\mathcal{Z}}\beta B}{1-\beta}+\frac{K_{0}(\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}},\|\tilde{c}\|_{\infty})B}{1-\beta}.

Thus, we can write that

\displaystyle E[f(Z_{0})|Y_{[0,N]},\gamma(Y_{[0,N-1]})]\leq K\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}E^{\pi,\gamma}\left[\rho_{BL}\left(P^{\pi}(X_{N}\in\cdot|Y_{[0,N]}),P^{\hat{\pi}}(X_{N}\in\cdot|Y_{[0,N]})\right)\right],

where

\displaystyle K=\frac{\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J^{*}_{\beta}\|_{BL}+(\beta+1)K_{0}(\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}},\|\tilde{c}\|_{\infty})+\hat{K}_{0}(\beta,\alpha_{\mathcal{Z}},\alpha_{\tilde{c}},\|\tilde{c}\|_{\infty})\beta\alpha_{\mathcal{Z}}}{1-\beta} \qquad (28)

and using Lemma 24

\displaystyle\|J_{\beta}^{*}\|_{BL}\leq\frac{1}{1-\beta\alpha_{\mathcal{Z}}}\left(\frac{\|\tilde{c}\|_{\infty}}{1-\beta}+\alpha_{\tilde{c}}\right).
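Given illustrative parameters satisfying $\beta(4\alpha_{\mathcal{Z}}+1)<1$ (chosen for the sketch only), the constants $K_{0}$, $\hat{K}_{0}$, and $K$ in (28) can be computed directly, using the Lemma 24 bound for $\|J_{\beta}^{*}\|_{BL}$:

```python
# Hypothetical parameters; the bound requires beta * (4*alpha_Z + 1) < 1.
beta, alpha_Z, alpha_c, c_inf = 0.1, 1.0, 1.0, 1.0

J_bl = (c_inf / (1 - beta) + alpha_c) / (1 - beta * alpha_Z)  # Lemma 24 bound on ||J*||_BL
base = alpha_c + beta * alpha_Z * J_bl                        # alpha_c + beta*alpha_Z*||J*||_BL
d = 1 - beta * (4 * alpha_Z + 1)                              # geometric-series denominator

K0 = base / d
K0_hat = base * (2 / d + 3 * alpha_Z / (1 - beta * alpha_Z) + 9 * alpha_Z**2 / d**2)
K = (base + (1 + beta) * K0 + beta * alpha_Z * K0_hat) / (1 - beta)
```

The final bound is then $K$ times the filter stability term $B$, which decays with the window size $N$ under the filter stability conditions.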

7 Conclusion

In this paper, we studied performance bounds for policies that use only a finite window of recent observation and action variables, rather than the entire history, in partially observed stochastic control problems. We rigorously established approximation bounds that are easy to compute, and showed that these bounds critically depend on the ergodicity and stability properties of the belief-state process. We provided the results for continuous state spaces and finite observation and action spaces; however, our studies suggest that these results can also be generalized to real-valued observations and actions. Application to decentralized POMDPs is another direction that will benefit from the analysis presented here.

A Technical Proofs of Supporting Results

In this section, we present the proofs of some supporting technical results.

A.1 Proof of Step 1 in the Proof of Lemma 18

We claim that

|v~kN(z)vk(z)|supγΓ(αc~t=0k1βtEz,γN[L(Zt)]+t=0k1βt+1vkt1BLEz,γN[L(Zt)]α𝒵)\displaystyle|\tilde{v}_{k}^{N}(z)-v_{k}(z)|\leq\sup_{\gamma\in\Gamma}\left(\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t}E^{N}_{z,\gamma}\left[L(Z_{t})\right]+\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}E^{N}_{z,\gamma}\left[L(Z_{t})\right]\alpha_{\mathcal{Z}}\right)

where $L(z)$ is the loss function due to the quantization, i.e., $L(z)=\rho_{BL}(z,F(z))$ with $F$ defined in (8).
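As a side illustration (not part of the proof), the nearest-neighbor quantizer $F$ and the induced loss $L(z)=\rho_{BL}(z,F(z))$ can be sketched numerically. The snippet below is a minimal sketch under stated assumptions: it uses the total variation distance on finite-support beliefs as a computable stand-in for $\rho_{BL}$, and the grid of beliefs is hypothetical.

```python
import numpy as np

def tv_distance(p, q):
    # Total variation distance between two finite-support beliefs;
    # used here as a computable stand-in for the metric rho_BL.
    return 0.5 * np.abs(p - q).sum()

def nearest_neighbor_quantizer(grid):
    # F maps a belief z to the closest grid point, as in (8).
    def F(z):
        dists = [tv_distance(z, g) for g in grid]
        return grid[int(np.argmin(dists))]
    return F

def quantization_loss(z, F):
    # L(z): distance between z and its quantized representative F(z).
    return tv_distance(z, F(z))

# Hypothetical grid of beliefs on a 3-point state space.
grid = [np.array([1.0, 0.0, 0.0]),
        np.array([0.0, 1.0, 0.0]),
        np.array([0.0, 0.0, 1.0]),
        np.array([1/3, 1/3, 1/3])]
F = nearest_neighbor_quantizer(grid)
z = np.array([0.6, 0.3, 0.1])
```

At grid points the loss vanishes, and in general it equals the distance to the nearest grid point.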

We prove the claim using an inductive approach: for $k=1$ we have (noting $v_{0},\tilde{v}^{N}_{0}\equiv 0$)

v~1N(z)=minu(c~(F(z),u)+βv~0N(y)η(dy|F(z),u))=minuc~(F(z),u)\displaystyle\tilde{v}_{1}^{N}(z)=\min_{u}\left(\tilde{c}(F(z),u)+\beta\int\tilde{v}_{0}^{N}(y)\eta(dy|F(z),u)\right)=\min_{u}\tilde{c}(F(z),u)
v1(z)=minu(c~(z,u)+βv0(y)η(dy|z,u))=minuc~(z,u).\displaystyle v_{1}(z)=\min_{u}\left(\tilde{c}(z,u)+\beta\int v_{0}(y)\eta(dy|z,u)\right)=\min_{u}\tilde{c}(z,u).

Note that under the stated assumptions the measurable selection conditions hold, so the minimum can be achieved by a policy $\gamma$ for the original model and by a policy $\gamma^{N}$ for the finite model, the latter being defined only on a finite set. Extending the finite-model policy $\gamma^{N}$ to the entire state space ${\mathcal{P}}(\mathds{X})$, we can write

|v~1N(z)v1(z)|\displaystyle|\tilde{v}_{1}^{N}(z)-v_{1}(z)| max(c~(F(z),γ(z))c~(z,γ(z)),c~(z,γN(z))c~(F(z),γN(z)))\displaystyle\leq\max\bigg{(}\tilde{c}(F(z),\gamma(z))-\tilde{c}(z,\gamma(z)),\tilde{c}(z,\gamma^{N}(z))-\tilde{c}(F(z),\gamma^{N}(z))\bigg{)}
supγ|c~(F(z),γ(z))c~(z,γ(z))|αc~L(z)\displaystyle\leq\sup_{\gamma}|\tilde{c}(F(z),\gamma(z))-\tilde{c}(z,\gamma(z))|\leq\alpha_{\tilde{c}}L(z)

which completes the proof for $k=1$. Now, we assume that the claim holds for $k$ and analyze the step $k+1$. Similarly to the $k=1$ case, we can write

|v~k+1N(z)vk+1(z)|\displaystyle|\tilde{v}_{k+1}^{N}(z)-v_{k+1}(z)|
supγ|c~(F(z),γ(z))c~(z,γ(z))+βv~kN(y)η(dy|F(z),γ(z))βvk(y)η(dy|z,γ(z))|\displaystyle\leq\sup_{\gamma}\left|\tilde{c}(F(z),\gamma(z))-\tilde{c}(z,\gamma(z))+\beta\int\tilde{v}_{k}^{N}(y)\eta(dy|F(z),\gamma(z))-\beta\int v_{k}(y)\eta(dy|z,\gamma(z))\right|
supγ(|c~(F(z),γ(z))c~(z,γ(z))|+β|v~kN(y)vk(y)|η(dy|F(z),γ(z))\displaystyle\leq\sup_{\gamma}\Bigg{(}\left|\tilde{c}(F(z),\gamma(z))-\tilde{c}(z,\gamma(z))\right|+\beta\int|\tilde{v}_{k}^{N}(y)-v_{k}(y)|\eta(dy|F(z),\gamma(z))
+β|vk(y)η(dy|F(z),γ(z))vk(y)η(dy|z,γ(z))|)\displaystyle\qquad\qquad+\beta\left|\int v_{k}(y)\eta(dy|F(z),\gamma(z))-\int v_{k}(y)\eta(dy|z,\gamma(z))\right|\Bigg{)}
αc~L(z)+βvkBLα𝒵L(z)\displaystyle\leq\alpha_{\tilde{c}}L(z)+\beta\|v_{k}\|_{BL}\alpha_{\mathcal{Z}}L(z)
+βsupγ(|αc~t=0k1βtEy,γN[L(Zt)]+t=0k1βt+1vkt1BLEy,γN[L(Zt)]α𝒵]|η(dy|F(z),γ(z)))\displaystyle+\beta\sup_{\gamma}\bigg{(}\int\left|\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t}E^{N}_{y,\gamma}\left[L(Z_{t})\right]+\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}E^{N}_{y,\gamma}\left[L(Z_{t})\right]\alpha_{\mathcal{Z}}]\right|\eta(dy|F(z),\gamma(z))\bigg{)}
supγ(αc~L(z)+αc~t=0k1βt+1Ez,γN[L(Zt+1)]+βvkBLα𝒵L(z)\displaystyle\leq\sup_{\gamma}\bigg{(}\alpha_{\tilde{c}}L(z)+\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t+1}E^{N}_{z,\gamma}\left[L(Z_{t+1})\right]+\beta\|v_{k}\|_{BL}\alpha_{\mathcal{Z}}L(z)
+t=0k1βt+2vkt1BLEz,γN[L(Zt+1)])\displaystyle\qquad\qquad+\sum_{t=0}^{k-1}\beta^{t+2}\|v_{k-t-1}\|_{BL}E^{N}_{z,\gamma}[L(Z_{t+1})]\bigg{)}
=supγ(αc~t=0kβtEz,γN[L(Zt)]+t=0kβt+1vktBLEz,γN[L(Zt)]),(t=t+1).\displaystyle=\sup_{\gamma}\left(\alpha_{\tilde{c}}\sum_{t^{\prime}=0}^{k}\beta^{t^{\prime}}E^{N}_{z,\gamma}\left[L(Z_{t^{\prime}})\right]+\sum_{t^{\prime}=0}^{k}\beta^{t^{\prime}+1}\|v_{k-t^{\prime}}\|_{BL}E^{N}_{z,\gamma}\left[L(Z_{t^{\prime}})\right]\right),\quad(t^{\prime}=t+1).

For the last two steps, note that $E_{y,\gamma}^{N}\left[L(Z_{t})\right]$ denotes the expected loss at time $t$ when the initial state is $Z_{0}=y$; thus, by iterated expectations, we have

\displaystyle\int E_{y,\gamma}^{N}\left[L(Z_{t})\right]\eta(dy|F(z),\gamma(z))=E_{z,\gamma}^{N}\left[L(Z_{t+1})\right].

Hence, we have proved that for all $k\geq 1$

|v~kN(z)vk(z)|supγ(αc~t=0k1βtEz,γN[L(Zt)]\displaystyle|\tilde{v}_{k}^{N}(z)-v_{k}(z)|\leq\sup_{\gamma}\bigg{(}\alpha_{\tilde{c}}\sum_{t=0}^{k-1}\beta^{t}E^{N}_{z,\gamma}\left[L(Z_{t})\right] +t=0k1βt+1vkt1BLEz,γN[L(Zt)]α𝒵).\displaystyle+\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}E^{N}_{z,\gamma}\left[L(Z_{t})\right]\alpha_{\mathcal{Z}}\bigg{)}.
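The iterated-expectation identity used above is the tower property of the Markov chain $(Z_{t})$. As a minimal sanity check, one can verify it on a hypothetical finite-state chain, with a stochastic matrix $P$ standing in for the kernel $\eta(\cdot|F(z),\gamma(z))$ and an arbitrary loss vector:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
# Hypothetical transition matrix and loss function on a finite state space.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
L = rng.random(n)

def expected_loss(y, t):
    # E_y[L(Z_t)]: expected loss at time t when the chain starts at state y.
    dist = np.zeros(n)
    dist[y] = 1.0
    for _ in range(t):
        dist = dist @ P
    return dist @ L

z, t = 2, 3
lhs = sum(P[z, y] * expected_loss(y, t) for y in range(n))  # integral of E_y[L(Z_t)]
rhs = expected_loss(z, t + 1)                               # E_z[L(Z_{t+1})]
```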

A.2 Proof of Step 2 in the Proof of Lemma 18

The claim is that

\displaystyle E^{N}_{z,\gamma}\left[L(Z_{t})\right]\leq E_{z,\gamma}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E_{z,\gamma}[L(Z_{t-m-1})].

We first write

\displaystyle E^{N}_{z,\gamma}\left[L(Z_{t})\right]\leq E_{z,\gamma}\left[L(Z_{t})\right]+\|L\|_{BL}\,\rho_{BL}\left(P^{\gamma}_{z,t},P^{\gamma,N}_{z,t}\right) (29)

where

P^{\gamma}_{z,t},\quad P^{\gamma,N}_{z,t}

are the marginal distributions of the state $z_{t}$ under the true model and the approximate model, respectively, with $Z_{0}=z$.

We focus on the term $\|L\|_{BL}\rho_{BL}\left(P^{\gamma}_{z,t},P^{\gamma,N}_{z,t}\right)$. We first claim that $\|L\|_{BL}\leq 3$, where $L(z)=\rho(z,F(z))$ and $F$ is the quantizer defined in (8).

LBLL+supz,z|ρ(z,F(z))ρ(z,F(z))|ρ(z,z)2+supz,z|ρ(z,F(z))ρ(z,F(z))|ρ(z,z)\displaystyle\|L\|_{BL}\leq\|L\|_{\infty}+\sup_{z,z^{\prime}}\frac{\left|\rho(z,F(z))-\rho(z^{\prime},F(z^{\prime}))\right|}{\rho(z,z^{\prime})}\leq 2+\sup_{z,z^{\prime}}\frac{\left|\rho(z,F(z))-\rho(z^{\prime},F(z^{\prime}))\right|}{\rho(z,z^{\prime})}

where we used the fact that

\displaystyle\|L\|_{\infty}=\sup_{z}\rho_{BL}(z,F(z))\leq 2

since for any two probability measures $\mu,\nu$ we have $\rho_{BL}(\mu,\nu)\leq 2$ (see (3)).

Note that $F$ is a nearest neighbor quantizer as defined in (8); thus we can write

|ρ(z,F(z))ρ(z,F(z))|\displaystyle\left|\rho(z,F(z))-\rho(z^{\prime},F(z^{\prime}))\right| max(ρ(z,F(z))ρ(z,F(z)),ρ(z,F(z))ρ(z,F(z)))\displaystyle\leq\max\left(\rho(z,F(z^{\prime}))-\rho(z^{\prime},F(z^{\prime})),\rho(z^{\prime},F(z))-\rho(z,F(z))\right)
supz^|ρ(z,z^)ρ(z,z^)|ρ(z,z)\displaystyle\leq\sup_{\hat{z}}\left|\rho(z,\hat{z})-\rho(z^{\prime},\hat{z})\right|\leq\rho(z,z^{\prime})

where for the last step we used the triangle inequality. Hence we can conclude that

\displaystyle\|L\|_{BL}\leq 2+\sup_{z,z^{\prime}}\frac{\left|\rho(z,F(z))-\rho(z^{\prime},F(z^{\prime}))\right|}{\rho(z,z^{\prime})}\leq 3. (30)
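The bound (30) rests on $L(\cdot)=\rho(\cdot,F(\cdot))$ being $1$-Lipschitz for any nearest-neighbor quantizer. A quick numerical sanity check of this Lipschitz property on the real line (a hypothetical stand-in for the belief space, with arbitrary grid points):

```python
import numpy as np

# Hypothetical codepoints of a nearest-neighbor quantizer on the real line.
grid = np.array([0.0, 1.0, 2.5])

def L(z):
    # L(z) = distance from z to its nearest codepoint, i.e. rho(z, F(z)).
    return np.min(np.abs(grid - z))

zs = np.linspace(-1.0, 3.5, 200)
# Empirical Lipschitz constant of L over all sampled pairs of points.
lip = max(abs(L(a) - L(b)) / abs(a - b) for a in zs for b in zs if a != b)
```

Since $L$ is a minimum of $1$-Lipschitz functions, the empirical constant never exceeds $1$.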

Now, we show that

\displaystyle\rho_{BL}\left(P^{\gamma}_{z,t},P^{\gamma,N}_{z,t}\right)\leq\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E_{z,\gamma}[L(Z_{t-m-1})]. (31)

We prove the claim by induction: for $t=1$,

\displaystyle\rho_{BL}\left(P^{\gamma}_{z,1},P^{\gamma,N}_{z,1}\right) =\sup_{\|f\|_{BL}\leq 1}\left|\int f(z_{1})\eta(dz_{1}|F(z),u_{1})-\int f(z_{1})\eta(dz_{1}|z,u_{1})\right|
\displaystyle\leq\alpha_{\mathcal{Z}}L(z).

Now assume the claim holds for $t-1$:

\displaystyle\rho_{BL}\left(P^{\gamma}_{z,t},P^{\gamma,N}_{z,t}\right)=\sup_{\|f\|_{BL}\leq 1}\bigg{|}\int f(z_{t})\eta(dz_{t}|z_{t-1},\gamma(z_{t-1}))P^{\gamma}_{z,t-1}(dz_{t-1})
f(zt)η(dzt|F(zt1),γ(zt1))Pz,t1γ,N(dzt1)|\displaystyle\hskip 142.26378pt-\int f(z_{t})\eta(dz_{t}|F(z_{t-1}),\gamma(z_{t-1}))P^{\gamma,N}_{z,t-1}(dz_{t-1})\bigg{|}
supfBL1|f(zt)η(dzt|zt1,γ(zt1))Pz,t1γ(dzt1)\displaystyle\leq\sup_{\|f\|_{BL}\leq 1}\bigg{|}\int f(z_{t})\eta(dz_{t}|z_{t-1},\gamma(z_{t-1}))P^{\gamma}_{z,t-1}(dz_{t-1})
f(zt)η(dzt|zt1,γ(zt1))Pz,t1γ,N(dzt1)|\displaystyle\hskip 71.13188pt-\int f(z_{t})\eta(dz_{t}|z_{t-1},\gamma(z_{t-1}))P^{\gamma,N}_{z,t-1}(dz_{t-1})\bigg{|}
+supfBL1|f(zt)η(dzt|zt1,γ(zt1))Pz,t1γ,N(dzt1)\displaystyle\quad+\sup_{\|f\|_{BL}\leq 1}\bigg{|}\int f(z_{t})\eta(dz_{t}|z_{t-1},\gamma(z_{t-1}))P^{\gamma,N}_{z,t-1}(dz_{t-1})
f(zt)η(dzt|F(zt1),γ(zt1))Pz,t1γ,N(dzt1)|\displaystyle\hskip 71.13188pt-\int f(z_{t})\eta(dz_{t}|F(z_{t-1}),\gamma(z_{t-1}))P^{\gamma,N}_{z,t-1}(dz_{t-1})\bigg{|}
(f+α𝒵)ρBL(Pz,t1γ,Pz,t1γ,N)+α𝒵EzN[L(Zt1)]\displaystyle\leq(\|f\|_{\infty}+\alpha_{\mathcal{Z}})\rho_{BL}\left(P^{\gamma}_{z,t-1},P^{\gamma,N}_{z,t-1}\right)+\alpha_{\mathcal{Z}}E^{N}_{z}[L(Z_{t-1})]
(1+α𝒵)ρBL(Pz,t1γ,Pz,t1γ,N)+α𝒵EzN[L(Zt1)].\displaystyle\leq(1+\alpha_{\mathcal{Z}})\rho_{BL}\left(P^{\gamma}_{z,t-1},P^{\gamma,N}_{z,t-1}\right)+\alpha_{\mathcal{Z}}E^{N}_{z}[L(Z_{t-1})].

Using (29), we can write

ρBL\displaystyle\rho_{BL} (Pz,tγ,Pz,tγ,N)\displaystyle\left(P^{\gamma}_{z,t},P^{\gamma,N}_{z,t}\right)
(1+α𝒵)ρBL(Pz,t1γ,Pz,t1γ,N)+α𝒵(Ez[L(Zt1)]+3ρBL(Pz,t1γ,Pz,t1γ,N))\displaystyle\leq(1+\alpha_{\mathcal{Z}})\rho_{BL}\left(P^{\gamma}_{z,t-1},P^{\gamma,N}_{z,t-1}\right)+\alpha_{\mathcal{Z}}\bigg{(}E_{z}[L(Z_{t-1})]+3\rho_{BL}\left(P^{\gamma}_{z,t-1},P_{z,t-1}^{\gamma,N}\right)\bigg{)}
=(4α𝒵+1)ρBL(Pz,t1γ,Pz,t1γ,N)+α𝒵Ez[L(Zt1)]\displaystyle=(4\alpha_{\mathcal{Z}}+1)\rho_{BL}\left(P^{\gamma}_{z,t-1},P^{\gamma,N}_{z,t-1}\right)+\alpha_{\mathcal{Z}}E_{z}[L(Z_{t-1})]
(4α𝒵+1)(α𝒵m=0t2(4α𝒵+1)mEz[L(Ztm2)])+α𝒵Ez[L(Zt1)]\displaystyle\leq(4\alpha_{\mathcal{Z}}+1)\left(\alpha_{\mathcal{Z}}\sum_{m=0}^{t-2}(4\alpha_{\mathcal{Z}}+1)^{m}E_{z}[L(Z_{t-m-2})]\right)+\alpha_{\mathcal{Z}}E_{z}[L(Z_{t-1})]
\displaystyle=\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E_{z}[L(Z_{t-m-1})]

which completes the proof of (31).
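The induction above amounts to unrolling the linear recursion $\rho_{t}\leq(4\alpha_{\mathcal{Z}}+1)\rho_{t-1}+\alpha_{\mathcal{Z}}E_{z}[L(Z_{t-1})]$ with $\rho_{0}=0$ into the closed-form sum in (31). Treating the recursion as an equality, the unrolling can be checked numerically; the constants below are arbitrary.

```python
alpha = 0.7
e = [0.3, 0.1, 0.5, 0.2]  # hypothetical values of E_z[L(Z_t)] for t = 0..3

# Recursion taken with equality: rho_0 = 0, rho_t = (4a+1) rho_{t-1} + a e_{t-1}.
rho = [0.0]
for t in range(1, len(e) + 1):
    rho.append((4 * alpha + 1) * rho[-1] + alpha * e[t - 1])

def closed_form(t):
    # a * sum_{m=0}^{t-1} (4a+1)^m e_{t-m-1}, the right-hand side of (31).
    return alpha * sum((4 * alpha + 1) ** m * e[t - m - 1] for m in range(t))
```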

Now, we go back to the main claim and start from (29) to write

Ez,γN[L(Zt)]Ez,γ[L(Zt)]+LBLρBL(Pz,tγ,Pz,tγ,N)Ez,γ[L(Zt)]+3ρBL(Pz,tγ,Pz,tγ,N)\displaystyle E^{N}_{z,\gamma}\left[L(Z_{t})\right]\leq E_{z,\gamma}\left[L(Z_{t})\right]+\|L\|_{BL}\rho_{BL}\left(P^{\gamma}_{z,t},P^{\gamma,N}_{z,t}\right)\leq E_{z,\gamma}\left[L(Z_{t})\right]+3\rho_{BL}\left(P^{\gamma}_{z,t},P^{\gamma,N}_{z,t}\right)
Ez,γ[L(Zt)]+3α𝒵m=0t1(4α𝒵+1)mEz,γ[L(Ztm1)].\displaystyle\leq E_{z,\gamma}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E_{z,\gamma}[L(Z_{t-m-1})].
Remark 21

Note that in the previous proof, $(4\alpha_{\mathcal{Z}}+1)$ can be replaced by $1+\alpha_{\mathcal{Z}}(\|L\|_{\infty}+2)$ where

L=supπ𝒫(𝕏)supγΓsupy[0,N],u[0,N1]ρBL(Pπ(|y[0,N],u[0,N1]),Pπ^(|y[0,N],u[0,N1]))\displaystyle\|L\|_{\infty}=\sup_{\pi\in{\mathcal{P}}(\mathds{X})}\sup_{\gamma\in\Gamma}\sup_{y_{[0,N]},u_{[0,N-1]}}\rho_{BL}\left(P^{\pi}(\cdot|y_{[0,N]},u_{[0,N-1]}),P^{\hat{\pi}}(\cdot|y_{[0,N]},u_{[0,N-1]})\right)

which is upper bounded by $2$ but is typically smaller than $2$, provided uniform filter stability holds.

Lemma 22

We introduce the following notation

E^z[L(Zt)]\displaystyle\hat{E}_{z}[L(Z_{t})] =supu0suput2suput1L(zt)η(dzt|zt1,ut1)η(dzt1|zt2,ut2)η(dz1|z,u0).\displaystyle=\sup_{u_{0}}\int\dots\sup_{u_{t-2}}\int\sup_{u_{t-1}}\int L(z_{t})\eta(dz_{t}|z_{t-1},u_{t-1})\eta(dz_{t-1}|z_{t-2},u_{t-2})\dots\eta(dz_{1}|z,u_{0}).

Under Assumption 3, we have that

\displaystyle\sup_{\gamma\in\Gamma}E_{z,\gamma}[L(Z_{t})]=\hat{E}_{z}[L(Z_{t})].

Proof  Recall that for a fixed policy $\gamma=\{\gamma_{t}\}_{t}$,

\displaystyle E_{z,\gamma}[L(Z_{t})]=\int L(z_{t})\eta(dz_{t}|z_{t-1},\gamma_{t-1}(z_{t-1}))\eta(dz_{t-1}|z_{t-2},\gamma_{t-2}(z_{t-2}))\dots\eta(dz_{1}|z,\gamma_{0}(z_{0})).

It is easy to see that

\displaystyle\int L(z_{t})\eta(dz_{t}|z_{t-1},\gamma_{t-1}(z_{t-1}))\eta(dz_{t-1}|z_{t-2},\gamma_{t-2}(z_{t-2}))\dots\eta(dz_{1}|z,\gamma_{0}(z_{0}))
\displaystyle\leq\int\sup_{u_{t-1}}\int L(z_{t})\eta(dz_{t}|z_{t-1},u_{t-1})\eta(dz_{t-1}|z_{t-2},\gamma_{t-2}(z_{t-2}))\dots\eta(dz_{1}|z,\gamma_{0}(z_{0}))

By repeating this step, we obtain

\displaystyle\int L(z_{t})\eta(dz_{t}|z_{t-1},\gamma_{t-1}(z_{t-1}))\eta(dz_{t-1}|z_{t-2},\gamma_{t-2}(z_{t-2}))\dots\eta(dz_{1}|z,\gamma_{0}(z_{0}))
\displaystyle\leq\sup_{u_{0}}\int\dots\sup_{u_{t-2}}\int\sup_{u_{t-1}}\int L(z_{t})\eta(dz_{t}|z_{t-1},u_{t-1})\eta(dz_{t-1}|z_{t-2},u_{t-2})\dots\eta(dz_{1}|z,u_{0}).

Hence by taking supremum over all policies we can write

\displaystyle\sup_{\gamma\in\Gamma}E_{z,\gamma}[L(Z_{t})]\leq\hat{E}_{z}[L(Z_{t})].

For the other direction, we first focus on the innermost term $\sup_{u_{t-1}}\int L(z_{t})\eta(dz_{t}|z_{t-1},u_{t-1})$. Using Assumption 3, one can show that there exists a measurable map $\gamma_{t-1}$ such that

\displaystyle\sup_{u_{t-1}}\int L(z_{t})\eta(dz_{t}|z_{t-1},u_{t-1})=\int L(z_{t})\eta(dz_{t}|z_{t-1},\gamma_{t-1}(z_{t-1})).

Using the same argument, we can see that there exists a sequence of measurable functions
{γ0,γ1,,γt1}\{\gamma_{0},\gamma_{1},\dots,\gamma_{t-1}\} such that

supu0suput2suput1L(zt)η(dzt|zt1,ut1)η(dzt1|zt2,ut2)η(dz1|z,u0)\displaystyle\sup_{u_{0}}\int\dots\sup_{u_{t-2}}\int\sup_{u_{t-1}}\int L(z_{t})\eta(dz_{t}|z_{t-1},u_{t-1})\eta(dz_{t-1}|z_{t-2},u_{t-2})\dots\eta(dz_{1}|z,u_{0})
\displaystyle=\int L(z_{t})\eta(dz_{t}|z_{t-1},\gamma_{t-1}(z_{t-1}))\eta(dz_{t-1}|z_{t-2},\gamma_{t-2}(z_{t-2}))\dots\eta(dz_{1}|z,\gamma_{0}(z_{0})).

Hence we can write that

\displaystyle\hat{E}_{z}[L(Z_{t})]\leq\sup_{\gamma\in\Gamma}E_{z,\gamma}[L(Z_{t})]

which proves the main claim.  
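Lemma 22 says that the worst-case expected loss can be computed by backward nested maximization, i.e., by dynamic programming, rather than by searching over policies. On a small hypothetical finite model this can be verified against brute-force enumeration of deterministic time-varying Markov policies:

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
nz, nu, T = 3, 2, 3
# Hypothetical controlled kernel eta(dz'|z,u) and loss L on a finite space.
eta = rng.random((nz, nu, nz))
eta /= eta.sum(axis=2, keepdims=True)
L = rng.random(nz)

def nested_sup(z0):
    # Backward recursion: V_T = L, V_t(z) = max_u sum_z' eta(z,u,z') V_{t+1}(z').
    V = L.copy()
    for _ in range(T):
        V = np.max(eta @ V, axis=1)
    return V[z0]

def brute_force(z0):
    # Enumerate all time-varying deterministic Markov policies gamma_0..gamma_{T-1}.
    best = -np.inf
    maps = list(itertools.product(range(nu), repeat=nz))
    for policy in itertools.product(maps, repeat=T):
        dist = np.zeros(nz)
        dist[z0] = 1.0
        for gamma in policy:
            dist = sum(dist[z] * eta[z, gamma[z]] for z in range(nz))
        best = max(best, dist @ L)
    return best
```

Both routines return the same value, since for this finite-horizon criterion an optimal deterministic Markov policy exists.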

Lemma 23

Under Assumption 3

gBL(αc+βα𝒵JβBL)(21β(4α𝒵+1)+3α𝒵1βα𝒵+9α𝒵2(1β(4α𝒵+1))2).\displaystyle\|g\|_{BL}\leq(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}\|_{BL})\left(\frac{2}{1-\beta(4\alpha_{\mathcal{Z}}+1)}+\frac{3\alpha_{\mathcal{Z}}}{1-\beta\alpha_{\mathcal{Z}}}+\frac{9\alpha_{\mathcal{Z}}^{2}}{(1-\beta(4\alpha_{\mathcal{Z}}+1))^{2}}\right).

Proof  Using Lemma 22, we can write that

g(z)\displaystyle g(z) =limksupγ(αct=0k1βt(Ez,γ[L(Zt)]+3α𝒵m=0t1(4α𝒵+1)mEz,γ[L(Ztm1)])\displaystyle=\lim_{k\to\infty}\sup_{\gamma}\Bigg{(}\alpha_{c}\sum_{t=0}^{k-1}\beta^{t}\left(E_{z,\gamma}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E_{z,\gamma}[L(Z_{t-m-1})]\right)
+α𝒵t=0k1βt+1vkt1BL(Ez,γ[L(Zt)]+3α𝒵m=0t1(4α𝒵+1)mEz,γ[L(Ztm1)]))\displaystyle+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(E_{z,\gamma}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}E_{z,\gamma}[L(Z_{t-m-1})]\right)\Bigg{)}
=limk(αct=0k1βt(E^z[L(Zt)]+3α𝒵m=0t1(4α𝒵+1)mE^z[L(Ztm1)])\displaystyle=\lim_{k\to\infty}\Bigg{(}\alpha_{c}\sum_{t=0}^{k-1}\beta^{t}\left(\hat{E}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}\hat{E}_{z}[L(Z_{t-m-1})]\right)
+α𝒵t=0k1βt+1vkt1BL(E^z[L(Zt)]+3α𝒵m=0t1(4α𝒵+1)mE^z[L(Ztm1)])).\displaystyle+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(\hat{E}_{z}\left[L(Z_{t})\right]+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}\hat{E}_{z}[L(Z_{t-m-1})]\right)\Bigg{)}.

Using the fact that $\|L\|_{\infty}\leq 2$, we can write

g\displaystyle\|g\|_{\infty} limk(αct=0k1βt(2+3α𝒵m=0t12(4α𝒵+1)m)\displaystyle\leq\lim_{k\to\infty}\Bigg{(}\alpha_{c}\sum_{t=0}^{k-1}\beta^{t}\left(2+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}2(4\alpha_{\mathcal{Z}}+1)^{m}\right)
+α𝒵t=0k1βt+1vkt1BL(2+3α𝒵m=0t12(4α𝒵+1)m))\displaystyle\qquad+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(2+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}2(4\alpha_{\mathcal{Z}}+1)^{m}\right)\Bigg{)}
2(αc+βα𝒵JβBL)11β(4α𝒵+1).\displaystyle\leq 2\left(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}\right)\frac{1}{1-\beta(4\alpha_{\mathcal{Z}}+1)}.

Next, we show that, when viewed as a function of $z$, $\|\hat{E}_{z}[L(Z_{t})]\|_{BL}\leq 3\frac{\alpha_{\mathcal{Z}}^{t+1}-1}{\alpha_{\mathcal{Z}}-1}$. We follow an inductive approach. For $t=1$:

|E^z[L(Z1)]E^z^[L(Z1)]|\displaystyle\big{|}\hat{E}_{z}\left[L(Z_{1})\right]-\hat{E}_{\hat{z}}\left[L(Z_{1})\right]\big{|} supu|L(z1)η(dz1|z,u)L(z1)η(dz1|z^,u)|\displaystyle\leq\sup_{u}\left|\int L(z_{1})\eta(dz_{1}|z,u)-\int L(z_{1})\eta(dz_{1}|\hat{z},u)\right|
LBLα𝒵ρBL(z,z^)3α𝒵ρBL(z,z^)\displaystyle\leq\|L\|_{BL}\alpha_{\mathcal{Z}}\rho_{BL}(z,\hat{z})\leq 3\alpha_{\mathcal{Z}}\rho_{BL}(z,\hat{z})

where $\|L\|_{BL}\leq 3$ follows from (30). Hence we have that $\|\hat{E}_{z}[L(Z_{1})]\|_{BL}\leq 3+3\alpha_{\mathcal{Z}}$. Now we assume the claim holds for $t-1$ and focus on the case for $t$:

|E^z[L(Zt)]E^z^[L(Zt)]|\displaystyle\big{|}\hat{E}_{z}\left[L(Z_{t})\right]-\hat{E}_{\hat{z}}\left[L(Z_{t})\right]\big{|}
=supu0suput2suput1L(zt)η(dzt|zt1,ut1)η(dzt1|zt2,ut2)η(dz1|z,u0)\displaystyle=\sup_{u_{0}}\int\dots\sup_{u_{t-2}}\int\sup_{u_{t-1}}\int L(z_{t})\eta(dz_{t}|z_{t-1},u_{t-1})\eta(dz_{t-1}|z_{t-2},u_{t-2})\dots\eta(dz_{1}|z,u_{0})
supu0suput2suput1L(zt)η(dzt|zt1,ut1)η(dzt1|zt2,ut2)η(dz1|z^,u0)\displaystyle\quad-\sup_{u_{0}}\int\dots\sup_{u_{t-2}}\int\sup_{u_{t-1}}\int L(z_{t})\eta(dz_{t}|z_{t-1},u_{t-1})\eta(dz_{t-1}|z_{t-2},u_{t-2})\dots\eta(dz_{1}|\hat{z},u_{0})
supu0(E^z1[L(Zt)]η(dz1|z,u0)E^z1[L(Zt)]η(dz1|z^,u0))\displaystyle\leq\sup_{u_{0}}\left(\int\hat{E}_{z_{1}}\left[L(Z_{t})\right]\eta(dz_{1}|z,u_{0})-\int\hat{E}_{z_{1}}\left[L(Z_{t})\right]\eta(dz_{1}|\hat{z},u_{0})\right)
α𝒵ρBL(z,z^)E^z[L(Zt1)]BLα𝒵ρBL(z,z^)3α𝒵t1α𝒵1.\displaystyle\leq\alpha_{\mathcal{Z}}\rho_{BL}(z,\hat{z})\|\hat{E}_{z}[L(Z_{t-1})]\|_{BL}\leq\alpha_{\mathcal{Z}}\rho_{BL}(z,\hat{z})3\frac{\alpha_{\mathcal{Z}}^{t}-1}{\alpha_{\mathcal{Z}}-1}.

We then have that

\displaystyle\|\hat{E}_{z}[L(Z_{t})]\|_{BL}\leq\|L\|_{\infty}+3\alpha_{\mathcal{Z}}\frac{\alpha_{\mathcal{Z}}^{t}-1}{\alpha_{\mathcal{Z}}-1}\leq 3+3\alpha_{\mathcal{Z}}\frac{\alpha_{\mathcal{Z}}^{t}-1}{\alpha_{\mathcal{Z}}-1}\leq 3\frac{\alpha_{\mathcal{Z}}^{t+1}-1}{\alpha_{\mathcal{Z}}-1}.

Thus, we can write that

\displaystyle\big{|}\hat{E}_{z}\left[L(Z_{t})\right]-\hat{E}_{\hat{z}}\left[L(Z_{t})\right]\big{|}\leq 3\alpha_{\mathcal{Z}}\frac{\alpha_{\mathcal{Z}}^{t}-1}{\alpha_{\mathcal{Z}}-1}\rho_{BL}(z,\hat{z}).

Hence for any $z,\hat{z}$,

|g(z)g(z^)|limk(αct=0k1βt(|E^z[L(Zt)]E^z^[L(Zt)]|\displaystyle|g(z)-g(\hat{z})|\leq\lim_{k\to\infty}\bigg{(}\alpha_{c}\sum_{t=0}^{k-1}\beta^{t}\big{(}\big{|}\hat{E}_{z}\left[L(Z_{t})\right]-\hat{E}_{\hat{z}}\left[L(Z_{t})\right]\big{|}
+3α𝒵m=0t1(4α𝒵+1)m|E^z[L(Ztm1)]E^z^[L(Ztm1)]|)\displaystyle\hskip 71.13188pt+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}\left|\hat{E}_{z}\left[L(Z_{t-m-1})\right]-\hat{E}_{\hat{z}}\left[L(Z_{t-m-1})\right]\right|\bigg{)}
+α𝒵t=0k1βt+1vkt1BL(|E^z[L(Zt)]E^z^[L(Zt)]|\displaystyle\hskip 71.13188pt+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\bigg{(}\left|\hat{E}_{z}\left[L(Z_{t})\right]-\hat{E}_{\hat{z}}\left[L(Z_{t})\right]\right|
+3α𝒵m=0t1(4α𝒵+1)m|E^z[L(Ztm1)]E^z^[L(Ztm1)]|))\displaystyle\hskip 71.13188pt+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}\left|\hat{E}_{z}\left[L(Z_{t-m-1})\right]-\hat{E}_{\hat{z}}\left[L(Z_{t-m-1})\right]\right|\bigg{)}\bigg{)}
limk(αct=0k1βt(3α𝒵α𝒵t1α𝒵1ρBL(z,z^)+3α𝒵m=0t1(4α𝒵+1)m3α𝒵α𝒵tm11α𝒵1ρBL(z,z^))\displaystyle\leq\lim_{k\to\infty}\Bigg{(}\alpha_{c}\sum_{t=0}^{k-1}\beta^{t}\left(3\alpha_{\mathcal{Z}}\frac{\alpha_{\mathcal{Z}}^{t}-1}{\alpha_{\mathcal{Z}}-1}\rho_{BL}(z,\hat{z})+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}3\alpha_{\mathcal{Z}}\frac{\alpha_{\mathcal{Z}}^{t-m-1}-1}{\alpha_{\mathcal{Z}}-1}\rho_{BL}(z,\hat{z})\right)
+α𝒵t=0k1βt+1vkt1BL(3α𝒵α𝒵t1α𝒵1ρBL(z,z^)+3α𝒵m=0t1(4α𝒵+1)m3α𝒵α𝒵tm11α𝒵1ρBL(z,z^)))\displaystyle+\alpha_{\mathcal{Z}}\sum_{t=0}^{k-1}\beta^{t+1}\|v_{k-t-1}\|_{BL}\left(3\alpha_{\mathcal{Z}}\frac{\alpha_{\mathcal{Z}}^{t}-1}{\alpha_{\mathcal{Z}}-1}\rho_{BL}(z,\hat{z})+3\alpha_{\mathcal{Z}}\sum_{m=0}^{t-1}(4\alpha_{\mathcal{Z}}+1)^{m}3\alpha_{\mathcal{Z}}\frac{\alpha_{\mathcal{Z}}^{t-m-1}-1}{\alpha_{\mathcal{Z}}-1}\rho_{BL}(z,\hat{z})\right)\Bigg{)}
3α𝒵α𝒵1(αc+βα𝒵JβBL)(11βα𝒵+3α𝒵(1β(4α𝒵+1))2)ρBL(z,z^).\displaystyle\leq\frac{3\alpha_{\mathcal{Z}}}{\alpha_{\mathcal{Z}}-1}(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}\|_{BL})\left(\frac{1}{1-\beta\alpha_{\mathcal{Z}}}+\frac{3\alpha_{\mathcal{Z}}}{(1-\beta(4\alpha_{\mathcal{Z}}+1))^{2}}\right)\rho_{BL}(z,\hat{z}).

Hence, we have that

gBL(αc+βα𝒵JβBL)(21β(4α𝒵+1)+3α𝒵1βα𝒵+9α𝒵2(1β(4α𝒵+1))2).\displaystyle\|g\|_{BL}\leq(\alpha_{c}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}\|_{BL})\left(\frac{2}{1-\beta(4\alpha_{\mathcal{Z}}+1)}+\frac{3\alpha_{\mathcal{Z}}}{1-\beta\alpha_{\mathcal{Z}}}+\frac{9\alpha_{\mathcal{Z}}^{2}}{(1-\beta(4\alpha_{\mathcal{Z}}+1))^{2}}\right).

 

Lemma 24
  • i.

    Under Assumption 3, if $\mathcal{Z}$ is metrized with $\rho_{BL}$, we have that

    \displaystyle\|J^{*}_{\beta}\|_{BL}\leq\frac{1}{1-\beta\alpha_{\mathcal{Z}}}\left(\frac{\|\tilde{c}\|_{\infty}}{1-\beta}+\alpha_{\tilde{c}}\right).
  • ii.

    Without any assumption, if $\mathcal{Z}$ is metrized with the total variation distance, we have that

    \displaystyle\|J^{*}_{\beta}\|_{BL}\leq\frac{2-\beta}{(1-\beta)(1-\beta\alpha_{\mathcal{Z}})}\|c\|_{\infty}.

Proof  We start with the first part. For any $x,y\in\mathcal{Z}$,

|Jβ(x)Jβ(y)|\displaystyle|J_{\beta}^{*}(x)-J_{\beta}^{*}(y)| supu(|c~(x,u)c~(y,u)|+β|Jβ(z)η(dz|x,u)Jβ(z)η(dz|y,u)|)\displaystyle\leq\sup_{u}\bigg{(}|\tilde{c}(x,u)-\tilde{c}(y,u)|+\beta|\int J_{\beta}^{*}(z)\eta(dz|x,u)-\int J_{\beta}^{*}(z)\eta(dz|y,u)|\bigg{)}
αc~ρBL(x,y)+βJβBLα𝒵ρBL(x,y).\displaystyle\leq\alpha_{\tilde{c}}\rho_{BL}(x,y)+\beta\|J_{\beta}^{*}\|_{BL}\alpha_{\mathcal{Z}}\rho_{BL}(x,y).

Thus, we can write that

\displaystyle\|J_{\beta}^{*}\|_{BL}\leq\|J_{\beta}^{*}\|_{\infty}+\alpha_{\tilde{c}}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}\leq\frac{\|\tilde{c}\|_{\infty}}{1-\beta}+\alpha_{\tilde{c}}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL}.

Hence, rearranging the terms, we can write (provided $\beta\alpha_{\mathcal{Z}}<1$)

\displaystyle\|J_{\beta}^{*}\|_{BL}\leq\frac{1}{1-\beta\alpha_{\mathcal{Z}}}\left(\frac{\|\tilde{c}\|_{\infty}}{1-\beta}+\alpha_{\tilde{c}}\right).

For the second part, similar arguments lead to the following bound

\displaystyle\|J_{\beta}^{*}\|_{BL}\leq\|J^{*}_{\beta}\|_{\infty}+\|c\|_{\infty}+\beta\alpha_{\mathcal{Z}}\|J_{\beta}^{*}\|_{BL},

hence the result follows after rearranging the terms and noting that $\|J_{\beta}^{*}\|_{\infty}\leq\frac{\|c\|_{\infty}}{1-\beta}$.  
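The last step uses the standard bound $\|J_{\beta}^{*}\|_{\infty}\leq\frac{\|c\|_{\infty}}{1-\beta}$ for bounded costs. A minimal value-iteration check of this bound on a hypothetical finite MDP with arbitrary random cost and kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
nz, nu, beta = 5, 3, 0.9
c = rng.random((nz, nu))              # hypothetical bounded cost c(z,u)
P = rng.random((nz, nu, nz))
P /= P.sum(axis=2, keepdims=True)     # transition kernel, standing in for eta

J = np.zeros(nz)
for _ in range(500):
    # Bellman update: J(z) <- min_u [ c(z,u) + beta * sum_z' P(z,u,z') J(z') ]
    J = np.min(c + beta * (P @ J), axis=1)
```

Since the Bellman operator is a $\beta$-contraction, the iterates converge to $J_{\beta}^{*}$, which stays within $\|c\|_{\infty}/(1-\beta)$ of zero.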
