Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes
Abstract
In the theory of Partially Observed Markov Decision Processes (POMDPs), the existence of optimal policies has in general been established via converting the original partially observed stochastic control problem to a fully observed one on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and so for the original POMDP, using classical dynamic or linear programming methods is challenging even if the original system has finite state and action spaces, since the state space of the fully observed belief-MDP model is always uncountable. Furthermore, there exist very few rigorous value function approximation and optimal policy approximation results, as the regularity conditions needed often require a tedious study involving the spaces of probability measures, leading to properties such as Feller continuity. In this paper, we study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known. We construct an approximate belief model by discretizing the belief space using only finite window information variables. We then find optimal policies for the approximate model and we rigorously establish near optimality of the constructed finite window control policies in POMDPs under mild non-linear filter stability conditions and the assumption that the measurement and action sets are finite (and the state space is real vector valued). We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound, where the rate of convergence is exponential under explicit and testable exponential filter stability conditions. While there exist many experimental results and few rigorous asymptotic convergence results, an explicit rate of convergence result is new in the literature, to our knowledge.
Keywords: POMDPs, nonlinear filters, stochastic control
1 Introduction
For Partially Observed Stochastic Control, also known as Partially Observed Markov Decision Problems (POMDPs), the existence of optimal policies has in general been established via converting the original partially observed stochastic control problem to a fully observed Markov Decision Process (MDP) on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and so for the original POMDP, using classical methods (such as dynamic programming, policy iteration, linear programming) is challenging even if the original system has finite state and action spaces, since the state space of the fully observed model is always uncountable.
In the MDP theory, various methods have been developed to compute approximately optimal policies by reducing the original problem into a simpler one. A partial list of these techniques is as follows: approximate dynamic programming, approximate value or policy iteration, simulation-based techniques, neuro-dynamic programming (or reinforcement learning), state aggregation, etc. (Dufour and Prieto-Rumeau, 2012; Bertsekas, 1975; Chow and Tsitsiklis, 1991). Saldi et al. (2017) investigated finite action and state approximations of fully observed stochastic control problems with general state and action spaces under the discounted cost and average cost optimality criteria, where weak continuity of the controlled transition kernel was shown to be sufficient for near optimality of finite state approximations, in the sense that optimal policies obtained from these models asymptotically achieve the optimal cost for the original problem.
On POMDPs, however, the problem of approximation is significantly more challenging. Most of the studies in the literature are algorithmic and computational contributions (Porta et al., 2006; Zhou and Hansen, 2001). These studies develop computational algorithms, utilizing structural convexity/concavity properties of the value function under the discounted cost criterion. Vlassis and Spaan (2005) provide an insightful algorithm which may be regarded as a quantization of the belief space; however, no rigorous convergence results are provided. Smith and Simmons (2012); Pineau et al. (2006) also present quantization based algorithms for the belief state, where the state, measurement, and the action sets are finite. Zhang et al. (2014) also provides a computationally efficient approximation scheme by quantizing the belief space uniformly under distance.
For partially observed setups, Saldi et al. (2020, 2017) introduce a rigorous approximation analysis after establishing weak continuity conditions on the transition kernel defining the belief-MDP via the non-linear filter (Feinberg et al., 2012; Kara et al., 2019), and show that finite model approximations obtained through quantization are asymptotically optimal and the control policies obtained from the finite model can be applied to the actual system with asymptotically vanishing error as the number of quantization bins increases. Another rigorous set of studies is by Zhou et al. (2008, 2010) where the authors provide an explicit quantization method for the set of probability measures containing the belief states, where the state space is parametrically representable under strong density regularity conditions. The quantization is done through approximations as measured by the Kullback-Leibler divergence (relative entropy) between probability density functions. Further recent studies include Mao et al. (2020); Subramanian and Mahajan (2019). Subramanian and Mahajan (2019) present a notion of an approximate information variable and study near optimality of policies that satisfy the approximate information state property. Mao et al. (2020) analyze a similar problem under a decentralized setup. Our explicit approximation results in this paper will find applications in both of these studies.
We refer the reader to the survey papers by Lovejoy (1991); White (1991); Hansen (2013) and the recent book by Krishnamurthy (2016) for further structural results as well as algorithmic and computational methods for approximating POMDPs. Notably, for POMDPs Krishnamurthy (2016) presents structural results on optimal policies under monotonicity conditions of the value function in the belief variable.
For our work, we specifically focus on finite memory approximations. With regard to approximations based on finite memory, the following two papers are particularly relevant to our paper:
Yu and Bertsekas (2008) study near optimality of finite window policies for average cost problems where the state, action and observation spaces are finite; under the condition that the liminf and limsup of the average cost are equal and independent of the initial state, the paper establishes the near-optimality of (non-stationary) finite memory policies. Here, a concavity argument building on a work of Feinberg (1982) (which becomes consequential by the equality assumption) and the finiteness of the state space are crucial. The paper shows that for any given ε > 0, there exists an ε-optimal finite window policy. However, the authors do not provide a performance bound related to the length of the window, and in fact the proof method builds on convex analysis. Nonetheless, the constant property of the value functions over initial priors is related to unique ergodicity, and thus the stability problem of non-linear filters, which is a topic of current investigation particularly in the controlled setup.
In another related direction, White-III and Scherer (1994) study finite memory approximation techniques for POMDPs with finite state, action, and measurement spaces. The POMDP is reduced to a belief MDP and the worst and best case predictors prior to the most recent information variables are considered to build an approximate belief MDP. The original value function is bounded using these approximate belief MDPs that use only finite memory, where the finiteness of the state space is critically used. Furthermore, a loss bound is provided for a suboptimally constructed policy that only uses finite history, where the bound depends on a more specific ergodicity coefficient (which requires restrictive, sample pathwise, contraction properties). In our paper, we consider more general signal spaces and more relaxed filter stability conditions, and establish explicit rates of convergence results. We also rigorously establish the relation of the loss bound to nonlinear filter stability and state space reduction techniques for MDPs.
A recent work by the authors, Kara and Yüksel (2021), introduces a different finite history approximation technique where the approximation is done via an alternative belief-MDP reduction method rather than direct discretization of the space of probability measures. It is shown that the approximation error can be related to controlled filter stability in terms of the total variation distance, whereas in this paper, the error bound is also shown to be related to more general, and in particular weak convergence inducing, metrics. Although the approximation technique introduced in Kara and Yüksel (2021) provides an error upper bound in terms of the more stringent total variation distance, it proves to be numerically efficient, as it is shown that a finite history Q-learning algorithm converges to the solution of the optimality equation of an approximate model. Our analysis here requires less stringent conditions on filter stability; however, the use of the bounded-Lipschitz metric on probability measures leads to a significantly more tedious analysis.
Contributions. In this paper, we rigorously establish near optimality of finite memory feedback control policies for the case where the actions and measurements are finite (with the state being real vector valued), provided that the controlled non-linear filter is stable in a sense to be presented in the paper. We also explicitly relate the approximation error with the window size. This is the first rigorous result, to our knowledge, where finite window policies are shown to be ε-optimal with an explicit rate of convergence with respect to the window size.
1.1 Preliminaries and the Main Results
Let denote a Borel set which is the state space of a partially observed controlled Markov process. Here and throughout the paper denotes the set of non-negative integers and denotes the set of positive integers. Let be a finite set denoting the observation space of the model, and let the state be observed through an observation channel . The observation channel, , is defined as a stochastic kernel (regular conditional probability) from to , such that is a probability measure on the power set of for every , and is a Borel measurable function for every . A decision maker (DM) is located at the output of the channel , and hence it only sees the observations and chooses its actions from , the action space which is a finite subset of some Euclidean space. An admissible policy is a sequence of control functions such that is measurable with respect to the -algebra generated by the information variables where
(1) |
are the -valued control actions and
We define to be the set of all such admissible policies. The update rules of the system are determined by (1) and the following relationships:
where is the (prior) distribution of the initial state , and
where is the transition kernel of the model which is a stochastic kernel from to . Note that, although is finite, we use the integral sign instead of the summation sign for notational convenience, by letting the measure be a sum of Dirac-delta measures. We let the objective of the agent (decision maker) be the minimization of the infinite horizon discounted cost,
for some discount factor , over the set of admissible policies , where is a Borel-measurable stage-wise cost function and denotes the expectation with initial state probability measure and transition kernel under policy . Note that , where we let denote the set of probability measures on . We define the optimal cost for the discounted infinite horizon setup as a function of the priors and the transition kernels as
For partially observed stochastic problems, the optimal policies use all the available information in general. The question we ask is the following one: suppose we define an -memory admissible policy so that is a sequence of control functions such that is measurable with respect to the -algebra generated by the information variables
(2) |
that is the controller can only have access to the information variables through a window whose length is . We define to be the set of all such -memory admissible policies. Similarly, we define the optimal cost function under -memory admissible policies as
Under this setup, we will study the following problem.
-
Problem:
Under suitable conditions, can we find explicit bounds on in terms of and a constructive approximate solution achieving this bound?
Our goal is to find the best possible control policy in , that is, in the set of policies that use only a finite history of information variables, in an offline setting by reducing the problem to a simpler approximate setup where we assume that the system dynamics are known to the designer. A general summary of the approach we will follow to answer this problem is as follows: We first define the belief MDP counterpart of the partially observed system. Then, we construct a finite subset of the belief state space using the probability distributions that can be achieved using the finite window information variables (’s) from a fixed probability distribution. This finite subset leads to an approximate MDP model for which we find optimal policies. The calculation of the policies is greatly simplified compared to the calculation of optimal policies for the original POMDP model. Finally, we show that the loss occurring from applying this approximate policy to the original model can be upper bounded by the expected error of the discretization of the belief space. The loss is evaluated compared to the best possible admissible policy in the set . The accumulated error can then be related to the filter stability problem, that is, to how fast the controlled process forgets its initial distribution as it observes the information variables from the system.
Note that although we take the infimum over all -memory admissible policies, we will explicitly construct finite window policies which will be time-invariant, or a finite-state probabilistic automaton (as is also referred to by Yu and Bertsekas, 2008) that accepts as inputs the finite window of observations and actions, and produces as outputs the control actions in a time-invariant/stationary fashion. Accordingly, the infimum above for can be replaced with the minimum over such policies.
2 Regularity and Stability Properties of the Belief-MDP
In this section, we introduce the belief MDP reduction of POMDPs and provide regularity properties of the belief MDPs.
2.1 Convergence Notions for Probability Measures
For the analysis of the technical results, we will use different notions of convergence for sequences of probability measures.
Two important notions of convergence for sequences of probability measures are weak convergence and convergence under total variation. For some , a sequence in is said to converge to weakly if for every continuous and bounded . One important property of weak convergence is that the space of probability measures on a complete, separable metric (Polish) space endowed with the topology of weak convergence is itself a complete, separable metric space (Parthasarathy, 1967). One such metric is the bounded Lipschitz metric (Villani, 2008, p.109), which is defined for as
(3) |
where
and .
For probability measures , the total variation metric is given by
where the supremum is taken over all measurable real such that . A sequence is said to converge in total variation to if .
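For finitely supported measures, both metrics can be computed directly. The following is a minimal Python sketch, assuming the total variation convention above and a bounded-Lipschitz variant in which the test functions satisfy ‖f‖∞ ≤ 1 and have Lipschitz constant at most 1 (the paper's exact normalization is the one in (3)); the function names are ours, for illustration only.

```python
import numpy as np
from scipy.optimize import linprog

def total_variation(p, q):
    # With the convention ||mu - nu||_TV = sup_{|f| <= 1} |mu(f) - nu(f)|,
    # the total variation distance on a finite set is the l1 distance.
    return float(np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

def bounded_lipschitz(p, q, points):
    # Computes sup_f sum_i f(x_i) (p_i - q_i) over f with |f| <= 1 and
    # |f(x) - f(x')| <= |x - x'|, as a small linear program.
    p, q = np.asarray(p, float), np.asarray(q, float)
    x = np.asarray(points, float)
    n = len(x)
    c = -(p - q)                      # linprog minimizes, so negate
    A_ub, b_ub = [], []
    for i in range(n):
        for j in range(n):
            if i != j:                # constraint f_i - f_j <= |x_i - x_j|
                row = np.zeros(n)
                row[i], row[j] = 1.0, -1.0
                A_ub.append(row)
                b_ub.append(abs(x[i] - x[j]))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(-1.0, 1.0)] * n, method="highs")
    return float(-res.fun)

# Example: two distributions supported on the points {0, 1, 2}.
print(total_variation([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))
print(bounded_lipschitz([0.5, 0.3, 0.2], [0.4, 0.4, 0.2], [0.0, 1.0, 2.0]))
```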
2.2 Ergodicity and Filter Stability Properties of Partially Observed MDPs
Given a prior and a policy , we define the filter and predictor for a POMDP in the following.
Definition 1
The one step predictor process is defined as the sequence of conditional probability measures
where is the probability measure induced by the prior and the policy , when is the probability measure on .
Definition 2
The filter process is defined as the sequence of conditional probability measures
(4) |
where is the probability measure induced by the prior and the policy .
Definition 3
(Dobrushin, 1956, Equation 1.16) For a kernel operator (that is a regular conditional probability from to ) for standard Borel spaces , we define the Dobrushin coefficient as:
(5) |
where the infimum is over all and all partitions of .
We note that this definition holds for continuous or finite/countable spaces and , and for any kernel operator.
Example 1
Assume for a finite setup, we have the following stochastic transition matrix
The Dobrushin coefficient is computed by taking, for each pair of rows, the sum of the elementwise minima of the two rows, and then minimizing over all such pairs. For this example, the first and the second rows give , the first and the third rows give , and the second and the third rows give . Then the Dobrushin coefficient is .
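For a finite-state kernel written as a row-stochastic matrix, this computation is a few lines of code. The following Python sketch implements the pairwise-row rule just described, with an illustrative matrix of our own choosing (the numerical entries of Example 1 are not reproduced above).

```python
import numpy as np
from itertools import combinations

def dobrushin_coefficient(T):
    """Dobrushin coefficient of a row-stochastic matrix T: the minimum,
    over all pairs of rows, of the sum of the elementwise minima."""
    T = np.asarray(T, dtype=float)
    return min(np.minimum(T[i], T[j]).sum()
               for i, j in combinations(range(T.shape[0]), 2))

# An illustrative 3x3 transition matrix (not the one from Example 1):
T = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.2, 0.7]])
print(dobrushin_coefficient(T))   # pairs give 0.6, 0.4, 0.6, so the result is 0.4
```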
Let
Definition 4
For for some , and for two probability measures , is said to be absolutely continuous with respect to if for every set for which . We denote the absolute continuity of with respect to by .
Theorem 5
(McDonald and Yüksel, 2020, Theorem 3.3) Assume that for , we have . Then we have (exponential filter stability)
In particular, defining , we have
This result will be a key ingredient for our main results. It provides conditions under which the belief-state processes for a given POMDP under different priors get closer when they are fed with the same observation process, in expectation under the true probability space. In a vague sense, if the state process is tracked only using a finite window of recent measurement and control variables (and forgets the past observations and actions), then the amount of mismatch from the true filter can be bounded with an error that is exponentially diminishing in the window size. The relationship is via the term which appears crucially in (4.1). We note that if one does not wish to have an explicit rate of convergence result, one could have more relaxed conditions for filter stability which will still lead to rigorous approximation results on the performance of finite window policies via the controlled filter stability analysis in McDonald and Yüksel (2022).
2.3 Reduction to Fully Observed Models and Regularity Properties of Belief-MDPs
It is by now a standard result that, for optimality analysis, any POMDP can be reduced to a completely observable Markov decision process (Yushkevich, 1976; Rhenius, 1974), whose states are the posterior state distributions or beliefs of the observer or the filter process as defined in (4); that is, the state at time is
We call this equivalent process the filter process . The filter process has state space and action space . Here, is equipped with the Borel -algebra generated by the topology of weak convergence (Billingsley, 1999). As noted earlier, under this topology, is a standard Borel space (Parthasarathy, 1967). Then, the transition probability of the filter process can be constructed as follows (see also Hernández-Lerma, 1989). If we define the measurable function
from to and use the stochastic kernel from to , we can write as
(6) |
The one-stage cost function of the filter process is given by
(7) |
which is a Borel measurable function. Hence, the filter process is a completely observable Markov process with the components .
For the filter process, the information variables is defined as
It is well known that an optimal control policy of the original POMDP can use the belief as a sufficient statistic for optimal policies (see Yushkevich, 1976; Rhenius, 1974), provided they exist. More precisely, the filter process is equivalent to the original POMDP in the sense that for any optimal policy for the filter process, one can construct a policy for the original POMDP which is optimal. On existence, we note the following.
With the recent results by Feinberg et al. (2016); Kara et al. (2019) the transition model of the belief-MDP can be shown to satisfy weak continuity conditions on the belief state and action variables, and accordingly we have that the measurable selection conditions (Hernandez-Lerma and Lasserre, 1996, Chapter 3) apply. Notably, we state the following.
Assumption 1
-
(i)
The transition probability is weakly continuous in , i.e., for any , weakly.
-
(ii)
The observation channel is continuous in total variation, i.e., for any , in total variation.
Assumption 2
-
(i)
The transition probability is continuous in total variation in , i.e., for any , in total variation.
-
(ii)
The observation channel is independent of the control variable.
Theorem 6
Under the above weak continuity conditions, the measurable selection conditions (Hernandez-Lerma and Lasserre, 1996, Chapter 3) apply and a solution to the discounted cost optimality equation exists, and accordingly an optimal control policy exists. This policy is stationary (in the belief state). Thus there exists a function such that for any policy for a prior
In particular, we have that . This will be the case in our paper, under the assumptions we will work with.
For the rest of the paper, we will use for the belief process policy for consistency of notation.
The following supporting result, to be used later in the paper, provides further regularity properties on the transition model for the belief model under mild conditions on the fully observed model. This result may be useful for POMDP theory beyond the application considered in this paper. We also note that the bounds in (i) and (iii) below are applicable when we only have filter stability, but not exponential filter stability (McDonald and Yüksel, 2022), whereas items (ii)-(iv) will be used under exponential filter stability in this paper.
Theorem 7
Proof See Section 6.1.
3 Approximate Model Construction: Finite Belief-MDP through Finite Memory
In this section, we will construct a finite state space by quantizing the belief state space so that the approximate finite model is obtained using only a finite memory.
Our construction builds on but significantly differs from the approach by Saldi et al. (2018, 2017). As we will explain, we cannot afford to use uniform quantization in our setup, which was a crucial tool used by Saldi et al. (2018, 2017).
As we discussed in the previous section, we can write the infinite horizon cost as
Now we focus on the second term:
Notice that for any time step and for a fixed observation realization sequence and for a fixed control action sequence , the state process can be viewed as
where
That is, we can view the state as the Bayesian update of , the predictor at time , using the observations . Notice that with this representation only the most recent observation realizations are used for the update and the past information of the observations is embedded in .
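For a finite state space, this finite-window Bayesian update can be written as a short recursion. The Python sketch below uses hypothetical array names (T[u] for the transition matrix under action u, Q for the channel matrix) and starts from a fixed prior, folding in a window of observation-action pairs; it will be reused in the construction sketch given later in this section.

```python
import numpy as np

def bayes_update(pred, y, T_u, Q):
    """One step of the nonlinear filter for a finite-state POMDP.
    pred : current predictor, a probability vector over the states
    y    : index of the received observation
    T_u  : transition matrix of the applied action, T_u[x, x2] = P(x2 | x, u)
    Q    : channel matrix, Q[x, y] = P(y | x)
    Returns (posterior after seeing y, predictor for the next time step)."""
    post = pred * Q[:, y]
    post = post / post.sum()          # measurement (Bayes) update
    return post, post @ T_u           # time update

def window_update(mu, window, T, Q):
    """Belief reached from the fixed prior mu after a finite window of
    (observation, action) pairs; this is the object used to build the
    finite belief set in this section."""
    pred = np.asarray(mu, dtype=float)
    post = pred
    for y, u in window:
        post, pred = bayes_update(pred, y, T[u], Q)
    return post
```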
Hence, we can view the state space for time stages as
Consider the following finite set defined by, for a fixed probability measure ,
Define the map , which will serve as the quantizer, with
(8) |
This map separates the set into sets, accordingly this quantizes the set of probability measures.
To complete our approximate controlled Markov model, we now define the one-stage cost function and the transition probability on given realizations in : For a given and control action
where is defined in (7) and
where
for all .
We thus have defined a finite state MDP with the state space , action space , cost function and the transition probability .
An optimal policy for this finite state model is a function taking values from the finite state space , and hence at any given time step it only uses the most recent observation and control action variables; that is, it is a measurable function of the information set defined in (1.1) for all .
We list the steps to construct the approximate model informally as follows (a code sketch of this construction is given after the list):
-
•
Fix a probability distribution as an estimator for the predictor of step back ().
-
•
Calculate Bayesian updates for all possible realizations ( many realizations) starting from the prior to form the finite subset .
-
•
Calculate (approximate transition model) and (approximate cost function) using a nearest neighbor map. Note that from any there are only many possible transitions under the true dynamics. To construct the , these possible transitions are mapped to the closest element in .
-
•
Calculate the value functions and the optimal policies for the finite model with state space , transition kernel and the cost function .
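A minimal Python sketch of the construction just listed is given below for a finite state space. It reuses window_update from the earlier sketch, builds the finite belief set from all window realizations, uses the l1 distance for the nearest neighbor map (which, for finite state spaces, is compatible with the metrics discussed below), and assembles the approximate transition kernel and cost. All names are our own illustrative assumptions, not the paper's notation.

```python
import numpy as np
from itertools import product

def build_finite_belief_mdp(mu, T, Q, cost, N):
    """Finite belief set, nearest-neighbor quantizer, and approximate model.
    mu   : fixed prior used as the estimator of the predictor N steps back
    T    : array of transition matrices, T[u][x, x2] = P(x2 | x, u)
    Q    : channel matrix, Q[x, y] = P(y | x)
    cost : stage cost array, cost[x, u]
    N    : window size"""
    n_x, n_y = Q.shape
    n_u = T.shape[0]
    windows = list(product(product(range(n_y), range(n_u)), repeat=N))
    Z = np.array([window_update(mu, w, T, Q) for w in windows])

    def quantize(belief):
        # index of the nearest element of Z under the l1 distance
        return int(np.argmin(np.abs(Z - belief).sum(axis=1)))

    n_z = len(Z)
    C_hat = np.array([[z @ cost[:, u] for u in range(n_u)] for z in Z])
    P_hat = np.zeros((n_z, n_u, n_z))
    for i, z in enumerate(Z):
        for u in range(n_u):
            pred = z @ T[u]                      # predictor after action u
            for y in range(n_y):
                p_y = pred @ Q[:, y]             # probability of observing y
                if p_y > 0:
                    nxt = pred * Q[:, y] / p_y   # resulting belief
                    P_hat[i, u, quantize(nxt)] += p_y
    return Z, quantize, P_hat, C_hat
```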
As we have noted before, the complexity of POMDPs in general arises from the structure of the belief state space which is a set of probability measures on . This set is always uncountable and needs to be associated with proper topologies to make the analysis feasible. Approximations for POMDPs are usually done by choosing a finite subset, say , of the belief state space (Smith and Simmons, 2012; Pineau et al., 2006; Saldi et al., 2017; Zhou et al., 2008, 2010; Subramanian and Mahajan, 2019; Zhang et al., 2014), and finding an approximate MDP model for this finite set. To choose the finite set, the aforementioned works use a uniform quantization scheme, in various topologies on . In other words, the quantization is made such that for any , there exists an element with for a fixed . The metric to measure distances of the belief states varies for different works, although for finite , distance of distributions is what is used in general for the quantization of , which coincides with total variation and weak convergence topology on when is finite; for general , a more appropriate and natural topology is the weak convergence topology on which is what we work with in the paper since metrizes the weak convergence.
In this paper, instead of quantizing directly and uniformly, we use finite window information variables (’s) to construct the finite subset of since our goal is to analyze the effect of the window size on the approximation performance. That is, we use the finite set
constructed using . For this set, we cannot afford a uniform discretization scheme. A uniform quantization would mean that for a fixed
uniformly for any and for any . However, this is in general inapplicable for filter stability problems as it requires the processes with different starting points to converge uniformly over all realizations of the information variables (even for highly unlikely ones). That is why we follow a different approach and show that we do not have to force uniform quantization, and that the error of the approximate value functions can be related to the expected error of the form
which in turn can be bounded using Theorem 5. Our technical analysis, accordingly, is slightly more tedious at the benefit of arriving at a practical and intuitive finite-memory method whose near optimality is rigorously established.
Remark 8
The finite subset is crucial for the approximation of the belief MDP we construct in this section. This set depends on the choice of as the starting probability distribution and the finite window information variables . Kara and Yüksel (2021) consider a similar approach for the construction of an approximate POMDP model by using the same finite set . However, instead of using a nearest neighbor map for the correspondence between the original belief space and the finite set , there, the states with matching finite history are used for the correspondence, by putting a different topology on the belief space and the set . The method used in this paper naturally results in a smaller approximation error due to the nature of the nearest neighbor map, and lets us work with the weak convergence topology and metric. On one hand, Kara and Yüksel (2021) provide a larger error term in terms of the total variation distance, which always upper bounds the metric; on the other hand, the method used there is computationally more efficient, and the approximate model can actually be learned with a reinforcement learning algorithm that uses finite window information variables.
On the choice of , we note that Theorem 12 provides a bound that is independent of the choice of . However, since this is only an upper bound, it is apparent that the true value of the error may depend on different choices of . For the approximate model, we use as an estimator for the true predictor at any given time under some policy . Since is a fixed probability distribution and does not vary with time, we can argue that a reasonable choice would be the time averages of under an optimal policy , and if the hidden state process admits an invariant measure under this optimal policy , the time averages of will converge to the invariant measure of under the optimal policy. However, one issue with this approach is that the designer does not have access to the optimal policy and thus cannot compute the invariant measure under . The designer, instead, can use the approximate optimal policy , but the choice of also affects the approximate policy , and hence this approach may not be feasible for practical purposes. In any case, our result provides a bound uniform over the prior selections.
We now discuss the role of the realizations of . Notice that, to construct , we consider all possible realizations, which reduces the uncountable state space to a finite one, but the size of this finite subset grows fast with the size of the window, , and with the size of the spaces and . The bound we provide in Theorem 9 reveals that the approximation error is related to the expectation over the realizations of under the true dynamics, which suggests that if a realization is highly unlikely under the true dynamics, it can be discarded when constructing , for computational ease, by reducing the size of .
In fact, this question motivates the following direction. A close look at Definition 3 reveals that the Dobrushin coefficient of a channel is a lower bound for the Dobrushin coefficient of a channel obtained by quantizing the measurements of that same channel. Therefore, one can always quantize the measurement channels further if the original channel satisfies the contraction condition presented for filter stability, with the quantized channel also satisfying the contraction property. This presents a recipe for dealing with both low probability channel outcomes and even continuous space measurement channels; however, one would additionally need to show that the value function of the approximate model with quantized measurements would be close to the value function of the original model. This is a possible future direction, using the weak Feller property of the belief-MDP (as shown, e.g., by Saldi et al., 2017) and continuity properties of filter updates in measurement realizations (McDonald and Yüksel, 2022).
4 Approximation Error Analysis and Rates of Convergence
An optimal policy for the constructed finite model can be extended to the original belief space and can be used for the original MDP.
4.1 Analysis via Expected Filter Approximation Error
The next result, which is related to the construction given by Saldi et al. (2018, Theorem 4.38) (see also Saldi et al., 2017), provides a mismatch error for using this policy. This result is going to be the key supporting tool for the main theorem of the paper, which will be presented immediately after. The proof requires multiple technical lemmas and is presented in Section 6.2 (with some supporting but tedious technical steps moved to the Appendix).
Assumption 3
-
•
(see Theorem 7).
-
•
for some for all .
Theorem 9
where is a constant that depends on and (the exact expression for the constant can be found in the proof).
Remark 10
We note that the upper bound for can be chosen as , instead of (see Remark 21 in Appendix), where
Hence, under a uniform filter stability bound, the upper bound can be chosen near .
The proof of this theorem is rather long, and accordingly, it is presented in Section 6.2.
Before presenting the main result of the paper, we provide further supporting results that will let us work with a probability density function.
Lemma 11
Assume that the transition kernel admits a density function with respect to a reference measure such that . If for all and then
We note that using this result, assumptions of Theorem 7 can be expressed with the Lipschitz condition on the density function noted above. We now restate the assumptions that will be used for the main result.
Assumption 4
-
1.
The transition kernel admits a density function with respect to a reference measure such that
-
2.
There exists some such that
-
3.
There exists some such that for all ,
-
4.
,
-
5.
The transition kernel is dominated, i.e. there exists a dominating measure such that for every and , that is is absolutely continuous with respect to for every and .
Before the result, we discuss the Lipschitz constants of interest and their relation to each other. First, note that by Lemma 11, Assumption 4 (second item) implies that . Hence, by Theorem 7, we can have various bounds on , where . In particular, we have .
Theorem 12
Assume that we let the system start running for time steps under a known policy (which may not be optimal), and the finite window policy starts acting after observing information variables at time . Under Assumption 4, if we have
where is the optimal finite window policy and where is a constant that depends on and (The exact formula for the constant can be found in the appendix in (28)).
In the statement, (respectively ) denotes the cost under the policy (respectively ) when the initial prior distribution is , where and the expectation is with respect to the realizations of .
Remark 13
We note that Theorem 12 applies to all finite state, measurement, and action models as long as and satisfy the conditions noted. Example 1 shows how to calculate the Dobrushin coefficient for transition matrices in finite setups. All other conditions apply since all probability measures on a finite/countable set are majorized by a probability measure which places a positive mass on every single point. Notice that the third condition only requires that the difference in the cost is bounded for every fixed control action. Conditions 1 and 5 of Assumption 4 coincide in the finite case.
Remark 14
We note that the error bound of the result is independent of the chosen . As we will see in the following proof, the first upper bound is a result of Theorem 9 which indeed depends on the chosen by the user. However, thanks to Theorem 5, we can get a further upper bound on the error which is uniform over any as long as is a dominating measure.
One can interpret the absolute continuity assumption, that is , as follows: assume that the starting distribution of the process is but we start the update from the fixed prior . The information variables can eventually correct the starting error uniformly for all such priors, as long as the fixed starting distribution puts positive measure on every event to which the real starting distribution puts positive measure. However, if that is not the case, that is, if the incorrect starting distribution assigns zero measure to some event to which assigns positive measure, the information variables are not sufficient to correct the starting error arising from that event. Of course, this would not be feasible as the prior would not be compatible with the measured data. In any case, in our setup, the fixed prior serves as an approximation, and it can be made to satisfy the absolute continuity condition by design.
Proof When we reduce a partially observed MDP to a fully observed process, the initial state of the belief process becomes the Bayesian update of the prior distribution of the state process of the POMDP. Hence, we can write that
Notice that, using Theorem 7, Condition 2 of Assumption 4 implies that
and we also have that
Thus, we can use Theorems 7 and 9 to write
(9) |
We set in Assumption 4 as our representative probability measure for the quantization of the belief space. Notice that by our choice of is a dominating measure and therefore for any time step , where is the predictor at time . Thus, Theorem 5 yields that under Assumption 4
for any predictor . We also have by definition that
Hence the result follows.
Remark 15
We note that the initial time steps of the control problem should be treated separately, as our approximation technique uses information variables of size , but for the initial steps, there are not enough observation or control action variables to make use of the approximate finite window policies. In the following result, for the first time steps, we use a control policy that is found with a policy iteration type argument, where the terminal cost is estimated using , with being the approximate finite window policy.
The following result is stated for the best possible policy in the set (see Equation 1.1), since ; however, in the proof, instead of working with the policy achieving this infimum, we construct a policy (possibly sub-optimal) which achieves the stated upper bound, and this upper bound naturally becomes an upper bound for the best possible finite window policy in .
Theorem 16
Under Assumption 4, if we have
where is a constant that depends on and . (The exact formula for the constant can be found in the appendix).
Proof Recall the optimal policy for the finite model constructed in Section 3 is denoted by . Notice that is a stationary policy and optimal for any initial point since it solves the discounted cost optimality equation.
Now, we construct the following policy. Use the policy , after time unaltered, but modify the first time-stage policies, as a batch update, which can be solved via a finite dynamic programming algorithm.
where is the history by time . We denote this policy by .
Note that we denote the true optimal policy for the original model by . We now define the policy : Apply until time , and then use our finite window policy . Note that this policy is not practical as it assumes that the controller already knows the true optimal policy ; however, we will make use of this hypothetical policy for the analysis.
From the way is constructed, we clearly have . Thus,
Furthermore, because of the way we constructed them, we have . Hence, we can write
Then, we start analyzing the error term: because and use the same policy before time , we have
(10) |
The last step follows from the observation that, conditioned on the observations and control actions , the state process can be thought of as if it started at time with prior measure , and from the fact that the probability measure of is determined by the initial measure and the policy , since for .
The result then follows from Theorem 12.
4.2 Analysis via Uniform Bounds on Filter Approximation Error
The following result provides an alternative setup where our analysis is applicable when the filter stability error is presented in terms of a uniform bound on the filter approximation error under the total variation distance. We define this bound, arising from filter stability error, as follows:
(11) |
Theorem 17
If , where can be chosen as by Theorem 7, we have that
-
(i)
-
(ii)
Before the proof, we note that the bound applies for all , if . Furthermore, since where and, therefore, if we have that for all the result applies independent of , though this is a conservative bound. If , the bound applies for all as long as .
Proof (i) We start by writing the fixed point equations
Hence we can write that
(12) |
Note that by Theorem 7, when is associated with total variation distance. We also have that when metrized by total variation distance (see Lemma 24). Hence, by noting , we can conclude that
(ii) We start by writing
where the second term is bounded by (i). We now focus on the first term:
Thus, using (i), and , we can write
We then conclude that
4.3 A Discussion on the Controlled Filter Stability Problem
The above results suggest that the loss occurring from applying a finite window policy is mainly controlled by the term
(13) |
or its variations where the metric is replaced with total variation, or the expectation is replaced with a uniform bound (given in (11)). All these terms describe how fast two different belief-state processes forget their initial priors when fed with the same observations/control actions under a true distribution. Thus, any bound for this term directly applies to the main results we presented for the loss caused by a finite window policy. This term is related to the filter stability problem and our approximation results point out the close relation between filter stability and the performance of finite window policies. In a way, the main result or message of this paper is perhaps to explicitly relate finite window approximations for POMDPs to the filter stability problem.
To bound the term (13), we use Theorem 5 to achieve an exponential convergence rate in the window size for a controlled setup. However, we should note that Theorem 5 provides only a sufficient condition to bound the filter stability term geometrically fast in the total variation distance. This condition can be seen as too strong if one is only interested in making the distance (13) smaller with increasing window size. In fact, as we will see in Section 5, even when the assumptions of Theorem 5 are not satisfied, the filter stability term still converges to . In the literature, there are various sets of assumptions to achieve filter stability. Two main approaches have been:
-
•
The transition kernel is in some sense sufficiently ergodic, forgetting the initial measure and therefore passing this insensitivity (to incorrect initializations) on to the filter process. This condition is often tailored towards control-free models.
-
•
The measurement channel provides sufficient information about the underlying state, allowing the filter to track the true state process. This approach is typically based on martingale methods and accordingly does not often lead to rates of convergence for the filter stability problem, but only asymptotic filter stability.
The result we use in this paper (Theorem 5) provides exponential filter stability, using a joint contraction property of the Bayesian filter update and measurement update steps through the Dobrushin coefficient. When these requirements are not satisfied, filter stability can be checked via different assumptions from the literature. However, we also note that, for the controlled setup, filter stability results are limited compared to the control-free setup. A comprehensive review of filter stability in the control-free case is provided by Chigansky et al. (2009). In the controlled case, recent additional studies include McDonald and Yüksel (2019, 2022) where martingale methods are used to arrive at controlled filter stability, which in turn can lead to weaker conditions (though without rates of convergence) for near optimality of finite window policies.
With regard to Theorem 5, the following example studies an additive Gaussian model (not necessarily linear) and provides different parameters for the condition to hold.
Example 2
Consider a system where and the transition and measurement kernels are given by
where the functions and are measurable and bounded such that and .
Note that in our paper we assume and present our results for finite . Therefore, to make the example compatible with our results, we discretize the observation space . We provide two discretization schemes, one with and the other with . For discretization, we use a nearest neighbor mapping.
First, we study the observation space . Using the nearest neighbor mapping, we have that
We then can write the following for the transition kernel
and for the compound channel :
For these kernels, the Dobrushin coefficients can be calculated as
Notice that these probabilities are fully determined by the ratio of the mean and standard deviation of the Gaussian in question, and . The higher the ratio, the higher the Dobrushin coefficient. Below, we see a list of the ratio of the transition kernel and the lowest possible ratio of the measurement kernel such that . If the ratio of is higher than the stated value, we will get exponential stability for the given transition kernel. Note that if , then , which makes regardless of the channel, for which we use ’any’ in the table to indicate that any channel would lead to exponential filter stability.
| Ratio for the transition kernel | 1.5 | 1.4 | 1.3 | 1.2 | 1.1 | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minimum ratio for the measurement kernel | any | 0.6 | 0.8 | 1.01 | 1.3 | 1.65 | 2.13 | 3.25 | 5.5 | 8.0 | 20.0 | 70.0 | 1000.0 |
| Dobrushin coefficient of the transition kernel | 0.50 | 0.48 | 0.44 | 0.40 | 0.36 | 0.32 | 0.27 | 0.21 | 0.15 | 0.10 | 0.05 | 0.01 | 0.00 |
| Minimum Dobrushin coefficient of the channel | any | 0.1 | 0.21 | 0.32 | 0.44 | 0.54 | 0.64 | 0.76 | 0.86 | 0.90 | 0.96 | 0.99 | 1.00 |
We now analyze the problem for the observation space . For this set, the nearest neighbor mapping yields that
For the compound channel, we then have
For these kernels, the Dobrushin coefficients can be calculated as
Below, we now see a list of the ratio of the transition kernel and lowest possible ratio of the measurement kernel such that for the observation space .
| Ratio for the transition kernel | 1.5 | 1.4 | 1.3 | 1.2 | 1.1 | 1.0 | 0.9 | 0.8 | 0.7 | 0.6 | 0.5 | 0.4 | 0.3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minimum ratio for the measurement kernel | any | 0.39 | 0.6 | 0.85 | 1.2 | 1.54 | 2.1 | 3.2 | 5.9 | 8.0 | 20.0 | 80.0 | 1000.0 |
| Dobrushin coefficient of the transition kernel | 0.50 | 0.48 | 0.44 | 0.40 | 0.36 | 0.32 | 0.27 | 0.21 | 0.15 | 0.10 | 0.05 | 0.01 | 0.00 |
| Minimum Dobrushin coefficient of the channel | any | 0.1 | 0.21 | 0.32 | 0.44 | 0.54 | 0.64 | 0.76 | 0.86 | 0.90 | 0.96 | 0.99 | 1.00 |
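To illustrate how entries of the kind reported in the first table can be computed, the Python sketch below evaluates the Dobrushin coefficient of an additive-Gaussian kernel with a bounded mean function (in the worst case where the mean attains its bound), and then the smallest channel coefficient and channel ratio compatible with a product-form contraction coefficient (1 − δ(transition))(2 − δ(channel)) < 1. This product-form reading of the table, the interpretation of the ratios, and the function names are our own assumptions based on the discussion above, not the paper's exact expressions.

```python
from scipy.stats import norm

def dobrushin_additive_gaussian(ratio):
    """Dobrushin coefficient of a kernel N(f(x, u), sigma^2) with |f| <= M,
    expressed through ratio = sigma / M: the worst-case overlap of two
    Gaussian densities with means +M and -M is 2 * Phi(-M / sigma)."""
    return 2.0 * norm.cdf(-1.0 / ratio)

def min_channel_coefficient(delta_T):
    """Smallest channel coefficient delta_Q with (1 - delta_T)(2 - delta_Q) < 1."""
    return max(0.0, 2.0 - 1.0 / (1.0 - delta_T)) if delta_T < 1.0 else 0.0

for ratio_T in [1.5, 1.2, 1.0, 0.7, 0.5, 0.3]:
    d_T = dobrushin_additive_gaussian(ratio_T)
    d_Q = min_channel_coefficient(d_T)
    # For the binary-quantized channel y = sign(h(x) + noise) with |h| <= K and
    # noise ~ N(0, sigma^2), the coefficient is again 2 * Phi(-K / sigma);
    # inverting gives the smallest admissible ratio sigma / K ("any" if none).
    ratio_Q = "any" if d_Q == 0.0 else f"{-1.0 / norm.ppf(d_Q / 2.0):.2f}"
    print(f"ratio {ratio_T:3.1f}: delta(T)={d_T:4.2f}, "
          f"min delta(Q)={d_Q:4.2f}, min channel ratio={ratio_Q}")
```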
5 Numerical Study
In this section, we give an outline of the algorithm used to determine the approximate belief MDP and the finite window policy. We also present the performance of the finite window policies and the value function error for the approximate belief MDP in relation to the window size.
A summary of the algorithm is as follows (a code sketch of the final value iteration step is given after the list):
-
•
We determine a such that it puts positive measure over the state space .
-
•
Following Section 3, we construct the finite belief space by taking the Bayesian update of for all possible realizations of the form . Hence, we get an approximate finite belief space with size .
-
•
We calculate all transitions from each by considering every possible observation and control action and we map them to using a nearest neighbor map and construct the transition kernels for the finite model.
-
•
For the finite models obtained, through value or policy iteration, we calculate the value functions and optimal policies.
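The value iteration step on the finite model, and the extraction of the finite window policy as a lookup table indexed by the window realization, can be sketched as follows in Python, building on build_finite_belief_mdp above; the discount factor argument and the stopping tolerance are our own illustrative choices.

```python
import numpy as np

def solve_finite_model(P_hat, C_hat, beta=0.8, tol=1e-8):
    """Value iteration for the finite (quantized) belief MDP."""
    n_z, n_u, _ = P_hat.shape
    V = np.zeros(n_z)
    while True:
        Q_vals = C_hat + beta * (P_hat @ V)     # shape (n_z, n_u)
        V_new = Q_vals.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q_vals.argmin(axis=1)
        V = V_new

# The finite window policy is then a lookup table: a window realization of
# observations and actions indexes a point of the finite belief set (in the
# order the windows were enumerated), and the returned policy array gives the
# action to apply for that window.
```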
The example we use is a machine repair problem. In this model, we have with
The probability that the repair was successful given initially the machine was not working is given by :
The probability that the machine breaks down while in a working state is given by :
The probability that the channel gives an incorrect measurement is given by :
The one stage cost function is given by
where is the cost of repair and is the cost incurred by a broken machine.
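Concretely, the machine repair model above can be encoded with the arrays expected by the earlier sketches as follows; the parameter names, the state/action ordering, and the behavior of the machine when it is repaired while already working are our own illustrative choices, since the paper's symbols and exact kernels are not reproduced here. With these arrays, build_finite_belief_mdp and solve_finite_model above can be used to run an experiment of this kind.

```python
import numpy as np

def machine_repair_model(p_repair, p_break, p_error, cost_repair, cost_broken):
    """Two-state machine repair POMDP.
    States: 0 = working, 1 = broken. Actions: 0 = do nothing, 1 = repair.
    Observations: noisy reading of the state, flipped with prob. p_error."""
    T = np.array([
        [[1.0 - p_break, p_break],            # do nothing, machine working
         [0.0,           1.0]],               # do nothing, machine stays broken
        [[1.0 - p_break, p_break],            # repair a working machine
         [p_repair,      1.0 - p_repair]],    # repair succeeds w.p. p_repair
    ])
    Q = np.array([[1.0 - p_error, p_error],
                  [p_error,       1.0 - p_error]])
    cost = np.array([[0.0,         cost_repair],                 # working
                     [cost_broken, cost_broken + cost_repair]])  # broken
    return T, Q, cost
```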
We study the example with discount factor , and and present three different results by changing the other parameters. For all different cases, we choose .
For the first case, we take , , . We analyze the performance for . To get a proper finite window policy for all ’s we use, we let the system run for time steps under the policy that is no matter what for the first time steps. Then, we start applying the policy and start calculating the cost. In Figure 1, we plot the approximate value functions and the cost incurred by the finite window policies, that is, we plot the terms and .
[Figure 1: approximate value functions and costs incurred by the finite window policies as the window size varies.]
Recall that our main result, Theorem 12, emphasizes the relation between the filter stability term and the ’value error’ and ’robustness error’, both of which are guaranteed to converge to zero with increasing window length, where
To get the errors, we simply subtract the cost values from their minimum (obtained with the largest window), which serves as an approximation of the value function. Furthermore, we scale the errors according to the filter stability term to get a better understanding of the error rate in relation to the filter stability constant; that is, we make them start from the same value to see the decrease rates more clearly. Figure 2 shows the relation.
[Figure 2: scaled error terms compared with the filter stability term as the window size increases.]
It can be seen from Figure 2 that the error rate stays below the filter stability convergence rate and goes to as the window size increases.
For the second case, we take , , . For this case, we directly plot the scaled errors in relation to the filter stability term in Figure 3.
[Figure 3: scaled error terms compared with the filter stability term for the second case.]
It can be seen that as the channel is more informative, the filter stability term gets smaller much faster and the recent information becomes more informative. As a result, we get a faster decrease in the error with the increasing window size.
One can notice that for the first two cases, the Dobrushin coefficient we defined in Assumption 4 is greater than ; however, the error terms still converge to with increasing window size. This is because the filter stability term gets smaller even though the Dobrushin constant . For the last case, we select the parameters in a way to make .
We take , , so that .
[Figure 4: error terms and the filter stability term for the third case.]
In Figure 4, it can be observed that the error terms and the filter stability term converge to at similar rates. , however, gets smaller at a slower rate and upper bounds the convergence rate of the error terms.
6 Proofs of Main Technical Results
In this section, we provide the technical proofs.
6.1 Proof of Theorem 7
We will build on the proof by Kara et al. (2019). Let be a separable metric space. Another metric that metrizes the weak topology on is the following:
(14) |
where is an appropriate sequence of continuous and bounded functions such that for all (see Parthasarathy, 1967, Theorem 6.6, p. 47).
We equip with the metric to define bounded-Lipschitz norm of any Borel measurable function . With this metric, we can start the proof:
(15) | |||
(16) |
where, in the last inequality, we have used .
We first analyze the first term:
For all and for any , we have
Then, we can write
Recall the metric introduced in (14); using this metric, we now focus on the term (16):
where we have used Fubini’s theorem with the fact that . For each , let us define
(17) |
Then, we can write
For the first and the last term, we can write
(18) |
For the second and the third term:
(19) | |||
Hence, we can write that
We now prove part (ii); we note first that if , the result follows. Hence, we write (by Dobrushin, 1956)
6.2 Proof of Theorem 9
Before presenting our proof program and the series of technical results needed, we introduce some notation.
In the following, to specify the probability measures according to which the expectations are taken, we use the following notation; for the kernel and a policy , we use and for the kernel and a policy , we will use .
The last notation we introduce is the following: Recall that we denote the optimal cost for the finite model by and the optimal policy by . These are defined on a finite set , however, we can always extend them over the original state space so that they are constant over the quantization bins. We denote the extended versions by and .
Now, we introduce our value iteration approach for the original model and the finite model. We write for any
where . It is well known that the above operations define contractions under either model and hence the value functions converge to the optimal expected discounted cost. In particular, we have that
Notice that if we extend and over all of the state space , then and become constant over the quantization bins. Then, the extended value functions for the finite model can also be obtained with
(20) |
To see this, observe the following:
where denotes the ith quantization bin and denotes the value of at that quantization bin and it is constant over the bin.
For the proof the goal is to bound the term . We will see in the following that bounding this term is related to studying the term (the value function approximation error). Therefore, in what follows, we first analyze and observe that it can be bounded using the expected loss terms . Finally, we will observe that upper bounds on these expressions can be written through the filter stability term
Next, we present some supporting technical results.
Lemma 18
Under Assumption 3, we have
where denotes the value function at a time step that is produced by the value iteration with .
Proof
We follow a value iteration approach to write for any
where .
We then write
(21) |
For the first and the last term, using the fact that the dynamic programming operator is a contraction, we have that
Now, we focus on the second term .
Step 1: We show in the Appendix, Section A.1, that for all ,
where is the loss function due to the quantization, i.e., where is the representative state that belongs to.
Step 2: We show in the Appendix, Section A.2, that
The next result gives a bound on the loss function occurring from the quantization (on ) and relates the bound to the filter stability problem.
Lemma 19
For , where is the prior distribution of , we have that for any
Proof
We first note that, since the quantizer is a nearest-neighbor map, the quantization error is almost surely upper bounded by . Using this and the law of total expectation:
where .
Lemma 20
Proof It is easy to see that . Then, we use an inductive approach and assume the claim holds for and analyze the term
For the second term, we have
Hence, using the induction hypothesis, we have that
which concludes the induction argument. Therefore, we can write
Now, we can prove Theorem 9.
Proof of Theorem 9
Consider the following dynamic programming operators, ( denotes the set of measurable and bounded functions on ) such that for any and for any
where denotes the extension of the optimal policy for the finite state space model to the belief space , i.e., it is constant over the quantization bins; furthermore, is the nearest neighbor quantization map that maps to the closest point in the finite state space (recall that is used to metrize the belief space).
The optimal cost for the finite model is only defined on a finite set; we denote by the extension of the optimal cost for the finite model to the belief space , obtained by assigning the same values over the quantization bins , i.e., it is a piecewise constant function over the quantization bins. Note that we have
Using these equalities, we write
(23) |
where denotes the expectation with respect to the kernel when the initial point is .
Now, using Lemma 18, we can write that:
(24) |
We first introduce the following notation along with the above notation (6.2)
Notice that with the new notation, we can rewrite the bound on (6.2) as:
where we used the fact that
Using the same bound on one can also write that for any initial point :
In general, for any , we can write
(25) |
Recall that our main goal is to bound where . To this end, first notice that using Lemma 19, we have for any
(26) |
By Lemma 23 we also have that
7 Conclusion
In this paper, we studied performance bounds for policies that use only a finite window of recent observation and action variables, rather than the entire history, in partially observed stochastic control problems. We have rigorously established approximation bounds that are easy to compute, and have shown that these bounds critically depend on the ergodicity and stability properties of the belief-state process. We have provided the results for continuous state spaces and finite observation and action spaces; however, our analysis suggests that these results can also be generalized to real-valued observations and actions. Application to decentralized POMDPs is another direction that will benefit from the analysis presented.
A Technical Proofs of Supporting Results
In this section, we present the proofs of some technical results.
A.1 Proof of Step 1 in the Proof of Lemma 18
We prove the claim using an inductive approach: for we have (noting )
Note that under the stated assumptions the measurable selection conditions hold, and the minimum can be achieved using a policy for the original model and a policy for the finite model which is defined only on a finite set. By extending the finite model policy over the whole state space , we can write that
which completes the proof for . Now, we assume that the claim holds for and analyze the step . Similar to case, we can write
For the last two steps, note that denotes the expected loss at time when the initial state is ; thus, using iterated expectations, we have
Hence, we have proved that for all
A.2 Proof of Step 2 in the Proof of Lemma 18
The claim is that
We first write
(29) |
where
are the marginal distributions of the state for the true model and approximate model respectively, with .
We focus on the term . We first claim that , where . Recall that determines quantization given in (8).
where we used the fact that
as for any two probability measures , we have that  (see (3)).
Note that  is a nearest neighbor quantizer as defined in (8); thus, we can write that
where for the last step we used the triangle inequality. Hence we can conclude that
(30)
Now, we show that
(31)
We prove the claim by induction: for
Now assume the claim holds for
Using (29), we can write
which completes the proof of (31).
Now, we go back to the main claim and start from (29) to write
Remark 21
Note that for the previous proof, can be replaced by where
which is upper bounded by , but is usually smaller than , provided uniform filter stability holds.
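As a rough numerical illustration of the filter stability phenomenon invoked in this remark (not of its specific constants), the following sketch runs the nonlinear filter recursion from two different priors on a small, entirely hypothetical finite-state hidden Markov model and reports the total variation distance between the two posteriors; under suitable mixing conditions this distance shrinks as observations accumulate, which is what drives the bounds above.

```python
import numpy as np

# Hypothetical two-state HMM used only to probe filter stability numerically:
# run the same Bayes filter recursion from two different priors on a common
# observation sequence and track the total variation between the posteriors.
rng = np.random.default_rng(0)

T = np.array([[0.9, 0.1],
              [0.2, 0.8]])          # state transition matrix (placeholder)
O = np.array([[0.7, 0.3],
              [0.4, 0.6]])          # O[x, y] = P(observation y | state x) (placeholder)

def bayes_update(belief, y):
    """One step of the nonlinear filter: predict with T, correct with observation y."""
    predicted = belief @ T
    unnormalized = predicted * O[:, y]
    return unnormalized / unnormalized.sum()

true_prior = np.array([0.5, 0.5])
wrong_prior = np.array([0.95, 0.05])
x = rng.choice(2, p=true_prior)
belief_true, belief_wrong = true_prior.copy(), wrong_prior.copy()
for t in range(50):
    x = rng.choice(2, p=T[x])                  # state transition
    y = rng.choice(2, p=O[x])                  # observation of the new state
    belief_true = bayes_update(belief_true, y)
    belief_wrong = bayes_update(belief_wrong, y)
    tv = 0.5 * np.abs(belief_true - belief_wrong).sum()
    print(f"t={t:2d}  TV between posteriors = {tv:.4f}")
```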
Lemma 22
Proof Recall that for a fixed policy ,
It is easy to see that
By repeating this step, we obtain
Hence, by taking the supremum over all policies, we can write
For the other direction, we first focus on the term . Using Assumption 3, one can show that there exists a measurable map  such that
Using the same argument, we can see that there exists a sequence of measurable functions
such that
Hence we can write that
which proves the main claim.
Lemma 23
Under Assumption 3
Proof Using Lemma 22, we can write that
Using the fact that we can write
Next, we show that  when defined as a function of . We follow an inductive approach. For :
where  follows from (30). Hence, we have that . Now we assume the claim holds for  and focus on the case :
We then have that
Thus, we can write that
Hence for any
Hence, we have that
Lemma 24
i. Under Assumption 3, if  is metrized with , we have that
ii. Without any assumption, if  is metrized with the total variation distance, we have that
Proof We start with the first part: for any
Thus, we can write that
Hence, rearranging the terms, we can write (provided )
For the second part, similar arguments lead to the following bound
hence the result follows after rearranging the terms and noting that .
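For completeness, the rearrangement used at the end of both parts has the following generic form (with placeholder symbols, the explicit constants being those in the displayed bounds): if an error term D satisfies D ≤ K + αβ D with αβ < 1, then (1 − αβ) D ≤ K, and hence D ≤ K/(1 − αβ).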
References
- Bertsekas [1975] D.P. Bertsekas. Convergence of discretization procedures in dynamic programming. IEEE Trans. Autom. Control, 20(3):415–419, Jun. 1975.
- Billingsley [1999] P. Billingsley. Convergence of probability measures. New York: Wiley, 2nd edition, 1999.
- Chigansky et al. [2009] P. Chigansky, R. Liptser, and R. van Handel. Intrinsic methods in filter stability. Handbook of Nonlinear Filtering, 2009.
- Chow and Tsitsiklis [1991] C.S. Chow and J. N. Tsitsiklis. An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control, 36(8):898–914, 1991.
- Dobrushin [1956] R.L. Dobrushin. Central limit theorem for nonstationary Markov chains. i. Theory of Probability & Its Applications, 1(1):65–80, 1956.
- Dufour and Prieto-Rumeau [2012] F. Dufour and T. Prieto-Rumeau. Approximation of Markov decision processes with general state space. J. Math. Anal. Appl., 388:1254–1267, 2012.
- Feinberg [1982] E. A. Feinberg. Controlled Markov processes with arbitrary numerical criteria. Th. Probability and its Appl., pages 486–503, 1982.
- Feinberg et al. [2012] E.A. Feinberg, P.O. Kasyanov, and N.V. Zadioanchuk. Average cost Markov decision processes with weakly continuous transition probabilities. Math. Oper. Res., 37(4):591–607, Nov. 2012.
- Feinberg et al. [2016] E.A. Feinberg, P.O. Kasyanov, and M.Z. Zgurovsky. Partially observable total-cost Markov decision process with weakly continuous transition probabilities. Mathematics of Operations Research, 41(2):656–681, 2016.
- Hansen [2013] E. A. Hansen. Solving POMDPs by searching in policy space. arXiv preprint arXiv:1301.7380, 2013.
- Hernández-Lerma [1989] O. Hernández-Lerma. Adaptive Markov Control Processes. Springer-Verlag, 1989.
- Hernandez-Lerma and Lasserre [1996] O. Hernandez-Lerma and J. B. Lasserre. Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, 1996.
- Kara and Yüksel [2021] A. D. Kara and S. Yüksel. Convergence of finite memory Q-learning for POMDPs and near optimality of learned policies under filter stability. arXiv preprint arXiv:2103.12158, 2021.
- Kara et al. [2019] A. D. Kara, N. Saldi, and S. Yüksel. Weak Feller property of non-linear filters. Systems & Control Letters, 134:104512, 2019.
- Krishnamurthy [2016] V. Krishnamurthy. Partially observed Markov decision processes: from filtering to controlled sensing. Cambridge University Press, 2016.
- Lovejoy [1991] W.S. Lovejoy. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research, 28:47–66, 1991.
- Mao et al. [2020] W. Mao, K. Zhang, E. Miehling, and T. Başar. Information state embedding in partially observable cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2004.01098, 2020.
- McDonald and Yüksel [2019] C. McDonald and S. Yüksel. Observability and filter stability for partially observed Markov processes. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 1623–1628. IEEE, 2019.
- McDonald and Yüksel [2020] C. McDonald and S. Yüksel. Exponential filter stability via Dobrushin’s coefficient. Electronic Communications in Probability, 25, 2020.
- McDonald and Yüksel [2022] C. McDonald and S. Yüksel. Robustness to incorrect priors and controlled filter stability in partially observed stochastic control. SIAM Journal on Control and Optimization, to appear (also arXiv:2110.03720), 2022.
- Parthasarathy [1967] K.R. Parthasarathy. Probability Measures on Metric Spaces. AMS Bookstore, 1967.
- Pineau et al. [2006] J. Pineau, G. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, 27:335–380, 2006.
- Porta et al. [2006] J. M. Porta, N. Vlassis, M. T. J. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. Journal of Machine Learning Research, 7(Nov):2329–2367, 2006.
- Rhenius [1974] D. Rhenius. Incomplete information in Markovian decision models. Ann. Statist., 2:1327–1334, 1974.
- Saldi et al. [2017] N. Saldi, S. Yüksel, and T. Linder. On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Mathematics of Operations Research, 42(4):945–978, 2017.
- Saldi et al. [2018] N. Saldi, T. Linder, and S. Yüksel. Finite Approximations in Discrete-Time Stochastic Control. Birkhäuser, 2018.
- Saldi et al. [2020] N. Saldi, S. Yüksel, and T. Linder. Finite model approximations for partially observed Markov decision processes with discounted cost. IEEE Transactions on Automatic Control, 65, 2020.
- Smith and Simmons [2012] T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. arXiv preprint arXiv:1207.1412, 2012.
- Subramanian and Mahajan [2019] J. Subramanian and A. Mahajan. Approximate information state for partially observed systems. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 1629–1636, 2019.
- Villani [2008] C. Villani. Optimal transport: old and new. Springer, 2008.
- Vlassis and Spaan [2005] N. Vlassis and M. T. J. Spaan. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research, 24:195–220, 2005.
- White [1991] C.C. White. A survey of solution techniques for the partially observed Markov decision process. Annals of Operations Research, 32:215–230, 1991.
- White-III and Scherer [1994] C. C. White-III and W. T. Scherer. Finite-memory suboptimal design for partially observed Markov decision processes. Operations Research, 42(3):439–455, 1994.
- Yu and Bertsekas [2008] H. Yu and D.P. Bertsekas. On near optimality of the set of finite-state controllers for average cost pomdp. Mathematics of Operations Research, 33(1):1–11, 2008.
- Yushkevich [1976] A.A. Yushkevich. Reduction of a controlled Markov model with incomplete data to a problem with complete information in the case of Borel state and control spaces. Theory Prob. Appl., 21:153–158, 1976.
- Zhang et al. [2014] Z. Zhang, D. Hsu, and W. S. Lee. Covering number for efficient heuristic-based POMDP planning. In International Conference on Machine Learning, pages 28–36. PMLR, 2014.
- Zhou et al. [2008] E. Zhou, M. C. Fu, and S. I. Marcus. A density projection approach to dimension reduction for continuous-state POMDPs. In Decision and Control, 2008. CDC 2008. 47th IEEE Conference on, pages 5576–5581, 2008.
- Zhou et al. [2010] E. Zhou, M. C. Fu, and S. I. Marcus. Solving continuous-state POMDPs via density projection. IEEE Transactions on Automatic Control, 55(5):1101 – 1116, 2010.
- Zhou and Hansen [2001] R. Zhou and E.A. Hansen. An improved grid-based approximation algorithm for POMDPs. In Int. J. Conf. Artificial Intelligence, pages 707–714, Aug. 2001.