Uncertainty and filtering of hidden Markov models in discrete time
Abstract
We consider the problem of filtering an unseen Markov chain from noisy observations, in the presence of uncertainty regarding the parameters of the processes involved. Using the theory of nonlinear expectations, we describe the uncertainty in terms of a penalty function, which can be propagated forward in time in the place of the filter. We also investigate a simple control problem in this context.
Keywords: filtering, optimal control, robustness, nonlinear expectation. MSC: 62M20, 60G35, 93E11
1 Introduction
Filtering is a common problem in many applications. The essential concept is that there is an unseen Markov process, which influences the state of some observed process, and our task is to approximate the state of the unseen process using a form of Bayes’ theorem. Many results have been obtained in this direction, most famously the Kalman filter (Kalman [26], Kalman and Bucy [27]), which assumes the underlying processes considered are Gaussian, and gives explicit formulae accordingly. Similarly, under the assumption that the underlying process is a finite-state Markov chain, a general formula to calculate the filter can be obtained (the Wonham filter [37]). These results are well known, in both discrete and continuous time (see Bain and Crisan [3] or Cohen and Elliott [9, Chapter 21] for further general discussion).
In this paper, we shall consider a simple setting in discrete time, where the underlying process is a finite-state Markov chain. Our concern will be to study uncertainty in the dynamics of the underlying processes, in particular its effect on the behaviour of the corresponding filter. That is, we will assume that the observer has only imperfect knowledge of the dynamics of the underlying process and of its relationship with the observation process, and wishes to incorporate this uncertainty in their estimates of the unseen state. We are particularly interested in allowing the level of uncertainty in the filtered state to be endogenous to the filtering problem, arising from the uncertainty in parameter estimates and process dynamics.
We will model this uncertainty in a general manner, using the theory of nonlinear expectations, and shall particularly concern ourselves with a description of uncertainty for which explicit calculations can still be carried out, and which can be motivated by considering statistical estimation of parameters. We then apply this to building a dynamically consistent expectation for random variables based on future states, and to a general control problem with learning under uncertainty.
1.1 Basic filtering
Consider two stochastic processes, $X = \{X_t\}_{t\geq 0}$ and $Y = \{Y_t\}_{t\geq 0}$. Let $\Omega$ be the space of paths of $(X, Y)$ and $\mathbb{P}$ be a probability measure on $\Omega$. We denote by $\{\mathcal{F}_t\}_{t\geq 0}$ the (completed) filtration generated by $X$ and $Y$, and $\{\mathcal{Y}_t\}_{t\geq 0}$ the (completed) filtration generated by $Y$. The key problem of filtering is to determine estimates of $f(X_t)$ given $\mathcal{Y}_t$, that is $\mathbb{E}[f(X_t)\,|\,\mathcal{Y}_t]$, where $f$ is an arbitrary Borel function.
Suppose that $X$ is a Markov chain with (possibly time-dependent) transition matrix $A^\top$ under $\mathbb{P}$ (the transpose here saves notational complexity later). Without loss of generality we can assume that $X$ takes values in the standard basis vectors $\{e_1,\dots,e_N\}$ of $\mathbb{R}^N$ (where $N$ is the number of states of $X$), and so we can write
$$X_{t+1} = A X_t + M_{t+1},$$
where $M_{t+1}$ is a martingale difference term, so $\mathbb{E}[X_{t+1}\,|\,\mathcal{F}_t] = A X_t$.
We suppose the process $Y$ is multivariate real-valued (this assumption can easily be relaxed, to allow $Y$ to take values in an appropriate Polish or Blackwell space; we restrict to the real setting purely for simplicity). The law of $Y$ will be allowed to depend on $X$; in particular, the $\mathbb{P}$-distribution of $Y_{t+1}$ given $\mathcal{F}_t \vee \sigma(X_{t+1})$ (that is, given all past observations of $X$ and $Y$ and the current state of $X$) is
$$\mathbb{P}(Y_{t+1} \in \mathrm{d}y \,|\, \mathcal{F}_t \vee \sigma(X_{t+1})) = c(y; X_{t+1})\,\mu(\mathrm{d}y)$$
for a reference measure $\mu$ on $\mathbb{R}^d$.
For simplicity, we shall assume that $Y_0 \equiv 0$, so no information is revealed about $X_0$ at time $0$. It is convenient to write $C(y)$ for the diagonal matrix with entries $c(y; e_i)$, so that $c(y; X_{t+1}) = \mathbf{1}^\top C(y) X_{t+1}$.
Note that these assumptions, in particular the values of and , depend on the choice of probability measure . Conversely, as our space is the space of paths of , the measure is determined by and .
As we have assumed $X$ takes values in the standard basis in $\mathbb{R}^N$, the expectation $p_t := \mathbb{E}[X_t\,|\,\mathcal{Y}_t]$ determines the entire conditional distribution of $X_t$ given $\mathcal{Y}_t$. In this discrete time context, the filtering problem can be solved in a fairly simple manner: Suppose we have already calculated $p_{t-1} = \mathbb{E}[X_{t-1}\,|\,\mathcal{Y}_{t-1}]$. Then by linearity and the dynamics of $X$, using the fact
$$\mathbb{E}[X_t\,|\,\mathcal{F}_{t-1}] = A X_{t-1},$$
we can calculate
$$\mathbb{E}[X_t\,|\,\mathcal{Y}_{t-1}] = A p_{t-1}.$$
Bayes' theorem then states that, with probability one,
$$\mathbb{P}(X_t = e_i\,|\,\mathcal{Y}_t) \propto c(Y_t; e_i)\,(A p_{t-1})_i,$$
which can be written in a simple matrix form,
$$p_t \propto C(Y_t) A p_{t-1}. \tag{1}$$
As $p_t$ is a probability vector, normalization of the right hand side determines $p_t$ directly. We call $p_t$ the ‘filter state’ at time $t$. Note that, if we assume the density $c$ is positive, $A$ is irreducible and the initial filter state has all entries positive, then $p_t$ will also have all entries positive.
Definition 1.
For future use, if $(A, C)$ denotes the transition and observation matrices described above and $\pi$ is the initial filter state, then we will write $p_t(A, C, \pi)$ for the filter state at time $t$, that is, the solution to (1) (where the observations are implicit).
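To make the recursion (1) concrete, the following minimal sketch (in Python) implements one filter update and runs it on simulated data. The notation mirrors the discussion above; the two-state chain, the Gaussian observation model and all numerical values are illustrative choices of ours, not taken from the text.

```python
import numpy as np

def filter_step(A, c_vals, p):
    """One step of the filter (1): the new filter state is proportional to C(y) A p.

    A      : (N, N) array with E[X_t | X_{t-1}] = A X_{t-1} (columns sum to one)
    c_vals : (N,) array, c_vals[i] = observation density c(y; e_i), the diagonal of C(y)
    p      : (N,) array, the previous filter state
    """
    q = c_vals * (A @ p)      # unnormalized C(y) A p
    return q / q.sum()        # normalize to a probability vector

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = np.array([[0.95, 0.10],
                  [0.05, 0.90]])            # column j is the law of the next state given e_j
    means = np.array([0.0, 1.0])            # illustrative N(mean_i, 1) observation model
    c_diag = lambda y: np.exp(-0.5 * (y - means) ** 2) / np.sqrt(2 * np.pi)

    x, p = 0, np.array([0.5, 0.5])
    for t in range(20):
        x = rng.choice(2, p=A[:, x])        # hidden chain
        y = rng.normal(means[x], 1.0)       # noisy observation of the hidden state
        p = filter_step(A, c_diag(y), p)
    print("filter state:", np.round(p, 3), " true state:", x)
```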
In practice, the key problem with implementing these methods is the requirement that we know the underlying transition matrix and the density . These are generally not known perfectly, but need to be estimated prior to the implementation of the filter. Uncertainty in the choice of these parameters will lead to uncertainty in the estimates of the filtered state, and the aim of this paper is to derive useful representations of that uncertainty.
As variation in the choice of and corresponds to a different choice of measure , we see that using an uncertain collection of generators corresponds naturally to uncertainty regarding . This type of uncertainty, where the probability measure is not known, is commonly referred to as ‘Knightian’ uncertainty (with reference to Knight [29], related ideas are also discussed by Keynes [28]).
Effectively, we wish to consider the propagation of uncertainty in Bayesian updating (as the filter is simply a special case of this). Huber and Ronchetti [24, p. 331] briefly touch on this but (based on earlier work by Kong) argue that this propagation is computationally infeasible. Their approach, however, was based on Choquet integrals, rather than nonlinear expectations. In the coming sections, we shall see how the structure of nonlinear expectations allows us to derive comparatively simple rules for updating.
Remark 1.
While we will present our theory in the context where is a finite state Markov chain, our approach does not depend in any significant way on this assumption. In particular, it would be equally valid mutatis mutandis when we supposed that followed the dynamics of the Kalman filter, and our uncertainty was on the coefficients of the filter. We specialize to the Markov chain case purely for the sake of concreteness.
2 Conditionally Markov Measures
In order to incorporate learning in our nonlinear expectations and filtering, it is useful to extend slightly from the family of measures previously described. In particular, we wish to allow the dynamics to depend on the past observations, while preserving enough Markov structure to enable filtering. We write for the space of probability measures equivalent to a reference measure . The following two classes of probability measures will be of interest.
Definition 2.
Let denote the probability measures under which
-
•
is a Markov chain, that is, for all , is independent of given ,
-
•
is independent of given ,
-
•
both and are time homogeneous, that is, the conditional distributions of and do not depend on .
To extend this slightly, we can allow our processes to depend on the past of the observed process .
Definition 3.
Let denote the probability measures under which
-
•
is a conditional Markov chain, that is, for all , is independent of given and , and
-
•
is independent of given .
We should note that, if we consider a measure in , there is a natural notion of the generators and . In particular, corresponds to those measures under which the generators and are constant, while corresponds to those measures under which the generators and are deterministic functions of time and .
Definition 4.
We shall write for the space in which the generator takes values (this space can be thought of as the product of the space of transition matrices and the space of diagonal matrix-valued functions, where each diagonal element is a probability density on the observation space), and write for the collection of generators associated with , that is, -adapted processes taking values in .
For each , these generators determine the measure on given , and (together with the distribution of ) this determines the measure at all times. It is straightforward to verify that our filtering equations hold for all measures in , with the appropriate modification of the generators.
Definition 5.
For a measure , we shall write for the generator of under , recalling that , and that and are now allowed to depend on . For notational convenience, we shall typically not write the dependence on explicitly.
Similarly, for and a probability vector in , we shall write for the measure with generator and under .
In our setting, our fundamental problem is that we do not know what measure is ‘true’, and so work instead under a family of measures. In general, measure changes can be described as follows.
Proposition 1.
Let be the reference measure (which does not have to be a probability measure), under which is a sequence of iid uniform random variables from the basis vectors and is independent of , with iid distribution . The measure where has generator , where and has Radon–Nikodym derivative (or likelihood)
The above proposition gives a Radon–Nikodym derivative adapted to the full filtration . In practice, it is also useful to consider the corresponding Radon–Nikodym derivative adapted to the observation filtration . As this filtration is generated by the process , it is enough to multiply together the conditional distributions of , leading to the following convenient representation. For notational simplicity, we write
3 Nonlinear Expectations
In this section we introduce the concepts of nonlinear expectations and convex risk measures, and discuss their connection with penalty functions on the space of measures. These objects provide a technical foundation with which to model the presence of uncertainty in a random setting. This theory is explored in some detail in Föllmer and Schied [17]. Other key works which have used or contributed to this theory, in no particular order, are Hansen and Sargent [23] (see also [21, 22] for work related to what we present here), Huber and Ronchetti [24], Peng [31], El Karoui, Peng and Quenez [13], Delbaen, Peng and Rosazza Gianin [10], Duffie and Epstein [12], Rockafellar, Uryasev and Zabarankin [33], Riedel [32] and Epstein and Schneider [14]. We base our terminology on that used in [17] and [10].
We here present, without proof, the key details of this theory as needed for our analysis.
Definition 6.
For a -algebra on , let denote the space of essentially bounded -measurable random variables. A nonlinear expectation on is a mapping
satisfying the assumptions
-
•
Strict Monotonicity: for any , if a.s. then , and if in addition then a.s.,
-
•
Constant triviality: for any constant , ,
-
•
Translation equivariance: for any , , .
A ‘convex’ expectation in addition satisfies
-
•
Convexity: for any , ,
If is a convex expectation, then the operator defined by is called a convex risk measure. A particularly nice class of convex expectations is those which satisfy
-
•
Lower semicontinuity: For a sequence with pointwise (and ), .
The following theorem (which was expressed in the language of risk measures) is due to Föllmer and Schied [16] and Frittelli and Rosazza Gianin [18].
Theorem 1.
Suppose is a lower semicontinuous convex expectation. Then there exists a ‘penalty’ function such that
Provided for some equivalent to , we can restrict our attention to measures in equivalent to without loss of generality.
Remark 2.
This result gives some intuition as to how a convex expectation can model ‘Knightian’ uncertainty. One considers all the possible probability measures on the space, and then selects the maximal expectation among all measures, penalizing each measure depending on how plausible it is considered. As convexity of is a natural requirement of an ‘uncertainty averse’ assessment of outcomes, Theorem 1 shows that this is the only way to construct an ‘expectation’ which penalizes uncertainty, while preserving monotonicity, translation equivariance and constant triviality.
3.1 DR-expectations
From the discussion above, it is apparent that we can focus our attention on calculating the penalty function , rather than the nonlinear expectation directly. This penalty function is meant to encode how ‘unreasonable’ a probability measure is as a model for our outcomes. So far, we have assumed that the penalty did not depend on time or on observations. By relaxing this assumption, we can incorporate learning of which models are ‘good’ in our framework.
In [6], we have considered a framework which links the choice of the penalty function to statistical estimation of a model. The key idea of [6] is to use the negative log-likelihood function for this purpose, where the likelihood is taken against an arbitrary reference measure, and evaluated using the observed data. This directly uses the statistical information from observations in the quantification of uncertainty.
In this paper, we shall make a slight extension of this idea, to explicitly incorporate prior beliefs. In particular, we shall replace the log-likelihood with the log-posterior density, which in turn gives an additional term in the penalty. In order to be precise, we now give a formal definition of the likelihood, which is sufficient for our purposes.
Remark 3.
In what follows, we will be variously wishing to restrict a measure to a -algebra, and to condition a measure on a -algebra. To prevent notational confusion, we shall write for the restriction of to , and for conditioned on .
Definition 7.
Let be a set of models under consideration (for example, a parametric set of distributions). For observations taking values in , we define the likelihood to be a fixed map , measurable with respect to its second argument, such that is a version of the Radon–Nikodym derivative .
Inspired by a ‘Bayesian’ approach, we augment this by the addition of a prior distribution over the class of models. (As we have not specified a reference measure over the models, we have not defined the prior density as a Radon–Nikodym derivative, and cannot integrate it over the class of models. Therefore, we do not require it to ‘integrate to one’, that is, we allow an improper prior. This has no significant impact in what follows, as (2) normalizes away the effect of the reference distribution.) Suppose a (possibly improper) prior with density of the form is given; then we define the posterior relative density
We then define the “-divergence” to be the negative log-likelihood ratio (or log-posterior relative density)
(2) |
Remark 4.
The right hand side of (2) is well defined whether or not a maximum a posteriori estimator exists. (Recall that a MAP estimator is a map which selects, for each observation, a model maximizing the posterior density.) Given a MAP estimator, we would have the simpler representation
Definition 8.
For fixed observations , for an uncertainty aversion parameter and exponent , we define the convex expectation
(3) |
where we adopt the convention for and otherwise.
We call this the “DR-expectation” (with the given parameters), where DR refers either to divergence-robust or data-driven robust. We may omit to write the parameters for notational simplicity.
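As a rough numerical reading of Definition 8 (the display (3) is not legible in this copy, so the precise convention below is an assumption): taking the DR-expectation to be of the form $\sup_\theta \{\mathbb{E}_\theta[\xi \mid \text{data}] - \tfrac{1}{k}\,\alpha(\theta)^{k'}\}$ over a parametric family, with $\alpha$ the divergence of (2), it can be approximated on a parameter grid. The Bernoulli model, the flat prior and the grid are illustrative choices.

```python
import numpy as np

def dr_expectation(cond_expectations, divergence, k=1.0, k_exp=1.0):
    """sup over models of { E_theta[xi | data] - (1/k) * alpha(theta)**k_exp },
    evaluated on a finite grid of candidate models (assumed form of (3))."""
    return np.max(cond_expectations - (1.0 / k) * divergence ** k_exp)

# Illustrative model class: iid Bernoulli(theta) observations; xi is the next observation.
theta = np.linspace(0.01, 0.99, 197)
data = np.array([1, 1, 0, 1, 1, 0, 1, 1])                   # made-up observations

log_lik = data.sum() * np.log(theta) + (len(data) - data.sum()) * np.log(1 - theta)
divergence = -log_lik                                        # flat (improper) prior: alpha = -log-likelihood
divergence -= divergence.min()                               # shift so the divergence has minimal value zero

cond_expectations = theta                                    # E_theta[xi | data] = theta
for k in (0.1, 1.0, 10.0):
    print(k, round(dr_expectation(cond_expectations, divergence, k=k), 4))
# Small k: the divergence is heavily weighted, so the value stays near the MAP model's expectation.
# Large k: the penalty is light, so the value moves towards the largest expectation on the grid.
```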
Remark 5.
In the cases of interest for this paper, we shall assume that is parameterized by some finite-dimensional real value, such that the divergence and conditional expectations given are continuous with respect to this parameterization, and are Borel measurable with respect to . This means that the measure theoretic concerns which arise from our definitions in terms of the Radon–Nikodym derivative and taking the supremum will not cause difficulty, in particular, the DR-expectation defined in (3) is guaranteed to be a Borel measurable function of for every . (This follows from Filippov’s implicit function theorem.)
Remark 6.
In theory, we could now apply the DR-expectation framework to a filtering context as follows: Take a collection of models . For a random variable , and for each measure , compute and . Taking a supremum as in (3), we obtain the DR-expectation. However, this is generally not computationally tractable in this form.
Lemma 1.
Let be a filtration such that is adapted. For -measurable random variables, the choice of horizon in the definition of the penalty function is irrelevant. That is, for -measurable and any ,
where is defined as above, in terms of the restricted measure .
Proof.
By definition, the likelihood is determined by the restriction of to , while the expectation depends only on the restriction of to . As these are the only terms needed to compute the DR-expectation, the result follows. ∎
Remark 7.
The purpose of the nonlinear expectation is to give an ‘upper’ estimate of a random variable, accounting for uncertainty in the underlying probabilities. This is closely related to robust estimation in the sense of Wald [35]. In particular, one can consider the robust estimator given by
which gives a ‘minimax’ estimate of , given the observations and a quadratic loss function. The advantage of the nonlinear expectation approach is that it allows one to construct such an estimate for every random variable/loss function, giving a cost-specific quantification of uncertainty in each case.
We can also see a connection with the theory of $H_\infty$ filtering (see, for example, Grimble and El Sayed [20] or more recently Zhang, Xia and Shi [38] and references therein, or the more general $H_\infty$-control theory in Başar and Bernhard [2]). In this setting, we look for estimates which perform best in the worst situation, where ‘worst’ is usually defined in terms of a perturbation to the input signal or coefficients. In our setting, we focus not on the estimation problem directly, but on the ‘dual’ problem of building an upper expectation, i.e. calculating the ‘worst’ expectation in terms of a class of perturbations to the coefficients (our setting is general enough that perturbation to the signal can also be included, through shifting the coefficients).
Remark 8.
There are also connections between our approach and what is called ‘risk-sensitive filtering’, see for example James, Baras and Elliott [25], Dey and Moore [11], or the review of Boel, James and Petersen [5] and references therein (from an engineering perspective) or Hansen and Sargent [22, 23] (from an economic perspective). In their setting, one uses the nonlinear expectation defined by
for some choice of robustness parameter . This leads to significant simplification, as dynamic consistency and recursivity are guaranteed in every filtration (see Graf [19] and Kupper and Schachermayer [30], and further discussion in Section 5) and the corresponding penalty function is given by the conditional relative entropy
the one-step penalty can be calculated accordingly. The optimization for the nonlinear expectation can be taken over all equivalent measures, so this approach can claim to include ‘nonparametric’ uncertainty, as all measures are considered, rather than purely Markov measures or measures in a parametric family (however, the optimization can equally be taken over conditionally Markov measures, and one will obtain an identical result!).
The difficulty with this approach is that it does not allow for easy incorporation of knowledge of the error of estimation of the generator in the level of robustness – the only parameter available to choose is , which multiplies the relative entropy. A small choice of corresponds to a small penalty, hence a very robust expectation, but this robustness is not directly linked to the estimation of the generators . Therefore the impact of statistical estimation error remains obscure, as is chosen largely exogenously of this error. For this reason, our approach, which directly allows for the penalty to be based on the statistical estimation of the generators, has advantages over this simpler method.
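The risk-sensitive expectation of Remark 8 and its entropic penalty are linked by the classical duality $\frac{1}{\theta}\log \mathbb{E}_P[e^{\theta \xi}] = \sup_Q\{\mathbb{E}_Q[\xi] - \frac{1}{\theta}H(Q\,|\,P)\}$ on a finite space, with the supremum attained by exponential tilting. The sketch below checks this numerically; note that in the discussion above the robustness parameter multiplies the relative entropy, so it plays the role of $1/\theta$ here, and the paper's exact sign conventions are not visible in this copy.

```python
import numpy as np

def entropic_expectation(xi, p, theta):
    """(1/theta) * log E_P[exp(theta * xi)] on a finite probability space."""
    return np.log(p @ np.exp(theta * xi)) / theta

def penalised_supremum(xi, p, theta):
    """sup_Q { E_Q[xi] - (1/theta) * KL(Q || P) }, attained by the exponential tilting of P."""
    q = p * np.exp(theta * xi)
    q /= q.sum()
    kl = np.sum(q * np.log(q / p))
    return q @ xi - kl / theta

rng = np.random.default_rng(1)
xi = rng.normal(size=5)                  # a random variable on a five-point space
p = np.full(5, 0.2)                      # reference measure P
for theta in (0.5, 2.0, 10.0):
    print(theta, round(entropic_expectation(xi, p, theta), 6), round(penalised_supremum(xi, p, theta), 6))
# The two columns agree for each theta; as theta grows the entropy penalty shrinks and the
# value increases towards max(xi), i.e. a more conservative (robust) upper expectation.
```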
4 Recursive penalties
The DR-expectation provides us with an approach to including statistical estimation in our valuations. However, the calculations suggested by Remark 6 are generally intractable in their stated form. In this section, we shall see how the DR-expectation approach, and an approach with a constant penalty , specialize in a filtering setting.
The class of models we shall consider are based on two key questions:
-
1.
“Static or Dynamic generators?” Are the generators and static (through time) and unknown (so can be estimated) or are they not only unknown but dynamically changing (and we depend principally on prior information about their likely behaviour)?
In other words, do our models come from (static generators) or from (dynamic generators)?
-
2.
“Uncertain prior (UP) or DR-expectations?” Do we (i) have a fixed penalty on the ‘reasonableness’ of a model, and then use new observations to update within each model using Bayes’ theorem, or are we (ii) attempting to determine which model is reasonable using our observations, while simultaneously updating our model (with the same observations).
In other words, is our penalty constant (UP framework) or does it change with new observations (DR framework)?
For practical purposes, it is critical that we refine our approach to provide a recursive construction of our nonlinear expectation. In classical filtering, one obtains a recursion for expectations , for Borel functions ; one does not typically consider the expectations of general random variables. In the same way, we will consider the expectations of random variables .
It is clear that we can consider as a nonlinear expectation with the probability space being only the states of . By Theorem 1, it follows that, for each , there exists a -measurable function such that
(4) |
where denotes the probability simplex in , that is, . Our aim is to find a recursion for , for various choices of . Without loss of generality, we will write
where has an appropriate set of arguments, as this gives consistent notation between the DR-expectation and Uncertain Prior settings.
If the generators are assumed to be static but unknown, the following definition will prove useful.
Definition 9.
Let denote an extended penalty function, which encodes the penalty associated with a state and a generator . The penalty is obtained from through the relation
(5) |
The reason for this definition is that, in a static generator framework, it is , not , which will satisfy a recursive equation. We note, however, that the dimension of is larger, and typically much larger, than , which leads to difficulty in implementation.
Our constructions will depend on the following object.
Definition 10.
For each generator and each , we define the random set
By recursion, we extend this to all
The set represents the filter states at time which evolve to at time . Clearly this is only known at time , as it depends on . Similarly, we think of as the set of filter states at time which would evolve to the state at time , given the generator and the observations . This set may be empty, if no such filter states exist. As is a diagonal matrix, if its entries do not vanish (this corresponds to there being no state which yields a zero likelihood of the observed value) then it is invertible. However, the matrix will often not be invertible, so is not generally a singleton.
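In the special case where both $A$ and $C(y)$ are invertible, the set in Definition 10 can be computed directly: inverting the normalized update (1) gives the single candidate $q \propto A^{-1}C(y)^{-1}p$, which lies in the set exactly when it is a valid probability vector. A short sketch (the function names and the numerical check are ours):

```python
import numpy as np

def filter_step(A, c_vals, p):
    q = c_vals * (A @ p)
    return q / q.sum()

def one_step_preimage(A, c_vals, p, tol=1e-10):
    """Filter states q with filter_step(A, c_vals, q) = p, assuming A and C(y) are invertible.
    (If they are not, the preimage may be empty or contain many states.)"""
    if np.any(c_vals <= 0) or abs(np.linalg.det(A)) < tol:
        raise ValueError("A or C(y) is not invertible; the preimage need not be a single point")
    q = np.linalg.solve(A, p / c_vals)       # proportional to A^{-1} C(y)^{-1} p
    if np.any(q < -tol):
        return []                            # no probability vector maps to p under this generator
    q = np.clip(q, 0.0, None)
    return [q / q.sum()]

# quick self-check: invert one filter step
A = np.array([[0.9, 0.2], [0.1, 0.8]])
c_vals = np.array([0.3, 0.7])
q0 = np.array([0.6, 0.4])
p = filter_step(A, c_vals, q0)
print(one_step_preimage(A, c_vals, p))       # recovers q0 (up to floating-point error)
```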
4.1 Filtering with Uncertain Priors
We shall first consider the case where we assume the filtering equations apply, and the only uncertainty in our model is given by our uncertainty over the prior inputs to the filter. In particular, this “prior uncertainty” is not updated given new observations, and is a fixed function of
We can now consider our two cases: with static and dynamic generators.
4.1.1 Static Generators (StaticUP)
With a static generator, our approach is simple.
Definition 11.
In a StaticUP setting, the inputs to the filtering problem are the initial filter state and the generator . We therefore take a penalty where
for some prescribed penalty .
Remark 9.
Inspired by the DR-expectation, a natural choice of penalty function is the negative log-density of a prior distribution for the inputs , shifted to have minimal value zero. Alternatively, taking an empirical Bayesian perspective, could be the log-likelihood from a prior calibration process. In this case, we are directly using our statistical uncertainty in the parameters in our understanding of the filtering problem.
Lemma 2.
If the extended penalty is defined by
with the convention , then writing , we satisfy the dynamic updating equation (4).
The extended penalty is useful, as it can be calculated recursively.
Theorem 2.
The extended penalty satisfies the recursion
Proof.
This is immediate from the definition of . ∎
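Informally, the recursion of Theorem 2 says that in the StaticUP setting the penalty attached to a candidate input (generator and initial state) never changes; only the filter state it is attached to is moved forward by the data. On a finite family of candidate inputs this is immediate to implement; the two candidate models, the Bernoulli-type observation law and the penalties below are illustrative choices, not taken from the text.

```python
import numpy as np

def filter_step(A, c_vals, p):
    q = c_vals * (A @ p)
    return q / q.sum()

class StaticUP:
    """Hold a fixed penalty for each candidate input (A, c, initial state) and propagate each
    candidate's filter state forward; the penalties themselves are never updated."""

    def __init__(self, candidates):
        # candidates: iterable of (A, c_fun, initial_state, penalty)
        self.models = [(A, c, np.asarray(p0, float), float(pen)) for A, c, p0, pen in candidates]

    def update(self, y):
        self.models = [(A, c, filter_step(A, c(y), p), pen) for A, c, p, pen in self.models]

    def penalised_states(self):
        """Pairs (current filter state, penalty attached to it)."""
        return [(p, pen) for _, _, p, pen in self.models]

def make_c(prob_one):                          # P(Y = 1 | state i) = prob_one[i]
    return lambda y: np.where(y == 1, prob_one, 1.0 - prob_one)

up = StaticUP([
    (np.array([[0.9, 0.2], [0.1, 0.8]]), make_c(np.array([0.8, 0.3])), [0.5, 0.5], 0.0),
    (np.array([[0.7, 0.4], [0.3, 0.6]]), make_c(np.array([0.6, 0.4])), [0.5, 0.5], 1.5),
])
for y in [1, 1, 0, 1]:
    up.update(y)
for p, pen in up.penalised_states():
    print(np.round(p, 3), pen)                 # penalties unchanged; only the filter states move
```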
Remark 10.
We emphasize that there is no learning of being done in this framework – the penalty on applied at time is simply propagated forward; our observations do not affect our opinion of its likely value. Furthermore, we are not adjusting our prior penalty to account for the ‘unreasonableness’ of our models as we observe data. In particular, if we assume no knowledge of the initial state (i.e. a zero penalty), then we will have no knowledge of the state at time (unless the observations cause the filter to degenerate).
Example 1.
We take the class of models in where and are perfectly known, and , so is constant (but is unknown). We take , so takes only one of two values. Finally, we assume that
where . Effectively, in this example we are using filtering to determine which of two possible means is the correct mean for our observation sequence. It is worth emphasising that the filter process corresponds to the posterior probabilities, in a Bayesian setting, of the events that our Bernoulli process has parameter or .
It will be useful to note that, from classical Bayesian statistical calculations (one can derive the stated formula using the filtering equations; however, the closed-form solution given here is more easily obtained using alternative methods for Bayesian hypothesis testing, which is effectively what this problem encodes), for a given , one can see that the corresponding value of is determined from the log odds ratio
To write down the StaticUP penalty function, let the (known) dynamics be described by . Consequently, we can write for all . As is known, there is no distinction between and , so
We initialize with a known penalty for all .
In this setting, we can express our penalty in terms of the log-odds, for the sake of notational simplicity given the closed-form solution to the filtering problem, and hence can explicitly calculate , which contains only single points. In particular, the time- penalty is given by a shift of the initial penalty:
Remark 11.
This example demonstrates the following behaviour:
-
•
If the initial penalty is zero, then the penalty at time is also zero – there is no learning of which state we are in.
-
•
There is no variation in the curvature of the penalty (and so no change in our ‘uncertainty’), we simply shift the penalty around, corresponding to our changing posterior probabilities.
-
•
The update of is done purely using the tools of Bayesian statistics, rather than having any direct incorporation of our uncertainty.
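The behaviour described in Remark 11 can be seen numerically in the setting of Example 1. Since the exact parameter values in the example are not legible here, we take two illustrative Bernoulli parameters: the hidden state is constant, the filter state is the posterior over the two hypotheses, and the StaticUP penalty at time $t$ is simply the initial penalty translated along the log-odds axis by a data-driven shift.

```python
import numpy as np

p1, p2 = 0.7, 0.4                          # the two candidate Bernoulli parameters (illustrative)
rng = np.random.default_rng(2)
y = rng.binomial(1, p1, size=50)           # data generated under the first hypothesis

# log-odds of the filter (posterior) state: ell_t = ell_0 + sum of log-likelihood ratios
shift = np.sum(y * np.log(p1 / p2) + (1 - y) * np.log((1 - p1) / (1 - p2)))

# StaticUP penalty over the time-t filter state, in log-odds coordinates:
# R_t(ell) = R_0(ell - shift), a pure translation of the initial penalty.
R0 = lambda ell: 0.5 * ell ** 2            # an illustrative initial penalty on the initial log-odds
Rt = lambda ell: R0(ell - shift)

print("log-odds shift from the data:", round(shift, 3))
print("R_t at a few points:", [round(Rt(e), 2) for e in np.linspace(-5, 15, 5)])
# With R0 identically zero (no prior knowledge) we would have R_t identically zero: no learning
# occurs, and the curvature of the penalty (our 'uncertainty') never changes, matching Remark 11.
```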
4.1.2 Dynamic generators (DynamicUP)
If we model the generator as fixed and unknown, calculation of suffers from a curse of dimensionality. We also need to assume that is constant through time, which may be a dubious assumption in practice.
In this case, we can obtain a practical model by allowing to vary independently at each point in time. Superficially, this significantly worsens the curse of dimensionality, as we no longer take a fixed , but regard it as a process through time. The advantage of this is that we can then use dynamic programming to calculate the penalty .
In order to include our knowledge of the generator, we will write for the generator applicable at time . Recall that this is a process taking values in , and we write for the space of such processes adapted to the observation filtration. For , we define using the natural analogue of Definition 10 incorporating time dependence.
Definition 12.
In the DynamicUP setting, for an initial penalty on the initial hidden state, , and a penalty on the time- generator, , our total penalty is given by
Theorem 3.
We can define a dynamic penalty
(6) |
which satisfies (4). Furthermore, the penalty satisfies the recursion
(7) |
Proof.
Remark 12.
This formulation of our problem allows us to use dynamic programming to solve our problem forward in time. In the setting of Example 1, as the dynamics are perfectly known, there is no distinction between the dynamic and static cases.
Remark 13.
The dynamic formulation adds penalties together, so if the same penalty on is used in the static and dynamic settings, then the dynamic setting will typically have a higher penalty than the static setting. Practically, this effect is lessened by the minimization in (6), and the fact that the filter has good ergodic properties (as discussed by van Handel [34]). This ergodicity implies that, at time , the filter will not significantly depend on the generator for , and so the minimization will render the penalty at irrelevant.
A continuous-time version of this setting is considered in [1].
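A grid-based sketch of the forward recursion (7): discretize the filter state (here by its first coordinate, for a two-state chain) and take a finite family of candidate generators, so that the new penalty at a state is the cheapest way of reaching it, namely the previous penalty of a source state plus the one-step penalty of the generator used. The grid, the candidate generators and the penalty values are all illustrative choices of ours.

```python
import numpy as np

def filter_step(A, c_vals, p):
    q = c_vals * (A @ p)
    return q / q.sum()

grid = np.linspace(0.01, 0.99, 99)                       # filter states, indexed by first coordinate
states = [np.array([g, 1 - g]) for g in grid]

def make_c(prob_one):
    return lambda y: np.where(y == 1, prob_one, 1.0 - prob_one)

generators = [                                           # (A, c, one-step penalty gamma)
    (np.array([[0.9, 0.2], [0.1, 0.8]]), make_c(np.array([0.8, 0.3])), 0.0),
    (np.array([[0.8, 0.3], [0.2, 0.7]]), make_c(np.array([0.7, 0.4])), 1.0),
]

def dynamic_up_step(R_prev, y):
    """One step of recursion (7): R_t(p) = min over pairs (q, G) with q evolving to p under G
    of R_{t-1}(q) + gamma(G); unreachable grid states keep an infinite penalty."""
    R_new = np.full(len(grid), np.inf)
    for A, c, gamma in generators:
        for j, q in enumerate(states):
            p = filter_step(A, c(y), q)
            i = np.argmin(np.abs(grid - p[0]))           # nearest grid point to the image state
            R_new[i] = min(R_new[i], R_prev[j] + gamma)
    return R_new

R = np.abs(grid - 0.5) * 4.0                             # illustrative initial penalty on p_0
for y in [1, 1, 0, 1, 1]:
    R = dynamic_up_step(R, y)
print("penalty is smallest near p[0] =", round(grid[np.argmin(R)], 2))
```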
4.2 Filtering with DR-expectations
In the above, we have regarded the prior as uncertain, and used this to penalize over models. We did not use the data to modify our penalty function, simply to revise within each model. The DR-expectation gives us an alternative approach, in which the data guides our model choice more directly. In what follows, we will apply the DR-expectation in our filtering context, and observe that it gives a slightly different recursion for the penalty function. Again, we can consider models where the generator is regarded as static or as dynamic.
4.2.1 Static generators (StaticDR)
For a fixed , we have already written the likelihood function (Proposition 2). As we are working in the observation filtration, this gives the natural decomposition of the likelihood for our calculation. This leads us to a simple formulation of the penalty .
Lemma 3.
Suppose the initial filter state and static generator are distributed according to the prior density . Then the -divergence of the measure (i.e. normalized log-posterior density of the parameters ) given observations can be written
where is a sequence of normalizing values, independent of and , such that has minimal value zero, and is the solution to the filtering equations with the given generator and initial filter state .
Proof.
The divergence is given by the negative posterior log-density of , shifted to have minimal value zero. From Bayes’ rule, this is simply the sum of the negative prior log-density and the negative log-likelihood. The likelihood is stated in Proposition 2, and gives us the term . The prior then contributes the term , as desired. ∎
Remark 14.
In this framework, we have not assumed we can propagate our uncertainty using the filtering equations, as in an uncertain prior setting – we are instead simply evaluating the DR-expectation conditional on our observations. Nevertheless, the filtering equations naturally appear through their presence in the likelihood function, and so will still form part of our dynamic penalty.
Theorem 4.
Consider the family of models with a fixed, but unknown, generator . The DR-expectation can then be calculated as in (4), writing , where
Furthermore, has recursive representation
where and is a constant to ensure .
Proof.
As we are looking at a function of the current hidden state, we are interested in the minimal penalty associated with each (as any larger penalty will be ignored by the supremum when calculating the DR-expectation). Given our definition of , as in the StaticUP case (Lemma 2), we observe that satisfies (4) by rearrangement. Using the definition of and the recursive nature of the filter, by the usual dynamic programming arguments it is easy to see that we can calculate the penalty recursively,
with the initial value given by the prior penalty . ∎
Remark 15.
Comparing the results of Theorem 2 with Theorem 4, we see that the key distinction is the presence of the log-likelihood term . This term implies that observations of will affect our quantification of uncertainty, rather than purely updating each model. We also note that the filtering equations have arisen naturally as a part of the penalty (through their appearance in the likelihood function adapted to ), rather than us assuming that the filtering equations are the ‘correct’ method for updating in the presence of uncertainty.
Example 2.
In the setting of Example 1, recall that is constant, so we know . One can calculate the penalty either directly, or through solving the stated recursion using the dynamics of . The resulting penalty is given by first calculating from through
and then
where is chosen to ensure . From this, we can see that the likelihood will modify our uncertainty directly, rather than us simply propagating each model via Bayes’ rule. A consequence of this is that, if we start with extreme uncertainty (), then our observations will teach us what models are reasonable, thereby reducing our uncertainty (i.e. we will find for when ).
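Continuing the numerical illustration of Example 1 (with the same made-up Bernoulli parameters), the StaticDR penalty adds the negative log of the marginal likelihood of the observations, viewed as a function of the initial state that the current filter state implies. The data therefore reshapes the penalty rather than merely translating it: even starting from a zero initial penalty, filter states inconsistent with the observations become penalized, which is the learning effect noted at the end of Example 2.

```python
import numpy as np

p1, p2 = 0.7, 0.4                                   # illustrative Bernoulli parameters
rng = np.random.default_rng(2)
y = rng.binomial(1, p1, size=50)

k, n = y.sum(), len(y)
logL1 = k * np.log(p1) + (n - k) * np.log(1 - p1)   # log-likelihood under hypothesis 1
logL2 = k * np.log(p2) + (n - k) * np.log(1 - p2)   # log-likelihood under hypothesis 2
shift = logL1 - logL2                               # shift of the log-odds produced by the data

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def R_T(ell_T, R0=lambda e: 0.0):
    """StaticDR penalty of the time-T filter state with log-odds ell_T: prior penalty of the
    implied initial state plus the negative log marginal likelihood of the data given it
    (up to the additive normalization making the minimal value zero)."""
    ell_0 = ell_T - shift
    pi = sigmoid(ell_0)                             # implied prior probability of hypothesis 1
    return R0(ell_0) - np.logaddexp(np.log(pi) + logL1, np.log(1 - pi) + logL2)

ells = np.linspace(-5, 25, 7)
vals = np.array([R_T(e) for e in ells])
print(np.round(vals - vals.min(), 3))
# Even with a zero initial penalty the time-T penalty is non-trivial: states inconsistent with
# the data receive a strictly positive penalty, in contrast with the StaticUP case of Remark 11.
```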
4.2.2 Dynamic generators (DynamicDR)
As in the uncertain priors case, it is often practically inconvenient to calculate a recursion for , given the high dimension of . We can avoid this by allowing to vary through time. In order to place this within the DR expectation framework, we consider the following models.
Definition 13.
Suppose is a sequence, selected according to distributions with conditional density
for a known function. Note that this structure implies .
In addition, suppose that is selected from a distribution with density on , independently of .
This assumption allows us to decouple the choice of generators at different times, leading to a significant reduction in complexity. The prior penalty, represented by is assumed to be known, potentially from calibration of the model.
Theorem 5.
Proof.
Using the same logic as in Lemma 3, and using Lemma 1 to restrict our horizon to , we observe that the divergence, for a model with generator is given by
where is the corresponding solution to the filtering equations, and is a normalizing constant.
As in the static generator case, we can then reparameterize in terms of the terminal value of the filter, and notice that we are only interested in the minimal penalty for a given . Comparing with (4), we have
The recursion then follows by the usual dynamic programming arguments. ∎
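The DynamicDR recursion differs from the DynamicUP sketch given after Remark 13 only by an extra likelihood term: each candidate transition of the filter is additionally charged the negative log of the one-step predictive density of the new observation, and the result is shifted to have minimal value zero (cf. the normalizing constant in the proof of Theorem 5). A compact sketch, using the same illustrative discretization as before:

```python
import numpy as np

def filter_update(A, c_vals, q):
    """Return the new filter state and the one-step predictive likelihood 1^T C(y) A q."""
    u = c_vals * (A @ q)
    return u / u.sum(), u.sum()

def dynamic_dr_step(R_prev, y, grid, states, generators):
    """R_t(p) = min over (q, G) of { R_{t-1}(q) + gamma(G) - log(predictive likelihood) },
    shifted so that the minimal value is zero."""
    R_new = np.full(len(grid), np.inf)
    for A, c, gamma in generators:
        for j, q in enumerate(states):
            p, lik = filter_update(A, c(y), q)
            i = np.argmin(np.abs(grid - p[0]))
            R_new[i] = min(R_new[i], R_prev[j] + gamma - np.log(lik))
    return R_new - R_new[np.isfinite(R_new)].min()

# It can be driven with the same grid, states and generators as the DynamicUP sketch above:
#     R = np.zeros(len(grid))
#     for y in observations:
#         R = dynamic_dr_step(R, y, grid, states, generators)
```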
Remark 16.
We expect that there will be less difference between the dynamic uncertain prior and dynamic DR-expectation settings than between the static uncertain prior and static DR-expectation settings. This is because there is only limited learning possible in the dynamic DR-expectation, as the underlying generator is independently given at every time, so the DR-expectation has only one value with which to infer its behaviour. This increases the relative importance of the prior term , which describes our understanding of typical values of the generator. In practice, the key distinction between the dynamic DR-expectation and uncertain prior models appears to be when the initial penalty is near zero – in this case, the DR-expectation regularizes the initial state quickly, while the uncertain prior model may remain near zero indefinitely.
Example 3.
In the setting of Example 1, as the dynamics are perfectly known, there is again no difference between the dynamic and static generator DR-expectation cases.
Remark 17.
In these sections, we have considered the case where the generator is either dynamic or static. Of course, further variations can be had, by allowing the generator to have some parts static and others dynamic, or to be dynamic but with penalty determined by an unknown static parameter. These perturbations give this approach a high degree of flexibility for practical modelling.
5 Expectations of the future
The nonlinear expectations considered above do not consider how the future will evolve. In particular, we have focussed our attention on calculating , that is, on calculating the expectation of functions of the current hidden state. In other words, we can consider our nonlinear expectation as a mapping
If we wish to calculate expectations of future states, then we may wish to consider doing so in a filtration-consistent manner. This is of particular importance when considering optimal control problems.
Definition 14.
For a fixed horizon , suppose that for each we have a mapping . We say that is a -consistent convex expectation if it satisfies the following assumptions, analogous to those above,
-
•
Strict Monotonicity: for any , if a.s. then a.s. and if in addition then a.s.
-
•
Constant triviality: for , .
-
•
Translation equivariance: for any , , .
-
•
Convexity: for any , ,
-
•
Lower semicontinuity: For a sequence with pointwise, pointwise for every .
and the additional assumptions
-
•
-consistency: for any , any ,
-
•
Relevance: for any , any , .
The assumption of -consistency is sometimes simply called recursivity, time consistency or dynamic consistency (and is closely related to the validity of the dynamic programming principle), however, it is important to note that this depends on the choice of filtration. In our context, consistency with the observation filtration is natural, as this describes the information available for us to make decisions.
Remark 18.
Definition 14 is equivalent to considering a lower semicontinuous convex expectation, as in Definition 6 and assuming that for any and any , there exists a random variable such that for all . In this case, one can define and verify that the definition given is satisfied (see Föllmer and Schied [17]).
Much work has been done on the construction of dynamic nonlinear expectations (see for example Epstein and Schneider [14], Duffie and Epstein [12], El Karoui, Peng and Quenez [13], Cohen and Elliott [7], and references therein). In particular, there have been close relations drawn between these operators and the theory of BSDEs (for a setting covering the discrete-time examples we consider here, see [7, 8]).
Remark 19.
The importance of -consistency is twofold: First, it guarantees that, when using a nonlinear expectation to construct the value function for a control problem, an optimal policy will be consistent in the sense that (assuming an optimal policy exists) a policy which is optimal at time zero will remain optimal in the future. Secondly, -consistency allows the nonlinear expectation to be calculated recursively, working backwards from a terminal time. This leads to a considerable simplification numerically, as it avoids a curse of dimensionality in intertemporal control problems.
Remark 20.
One issue in our setting is that our lack of knowledge does not simply line up with the arrow of time – we are unaware of events which occurred in the past, as well as those which are in the future. This leads to delicacies in questions of dynamic consistency. Conventionally, this has often been considered in a setting of ‘partially observed control’, and these issues are resolved by taking the filter state to play the role of a state variable, and solving the corresponding ‘fully observed control problem’ with as underlying. In our context, we do not know the value of , instead we have the (even higher dimensional) penalty function (or worse, ) as a state variable (taking values in the space of functions on ).
In this section, we will outline how this perspective can provide a dynamically consistent extension of our expectations, and how enforcing dynamic consistency will modify our perception of risk. For this and the remainder of the paper, we will focus our attention on the dynamic-generator DR-expectation framework. The corresponding theory in the dynamic-generator uncertain-prior setting can be obtained by simply removing the relevant likelihood term whenever it appears.
5.1 Asynchronous expectations
We will focus our attention on constructing a dynamically consistent nonlinear expectation for random variables in , given observations up to times . Given our earlier analysis, we already have a map
We therefore need only to construct a -consistent family of maps
which we can extend by composition with to be defined on our space of interest .
As we are in discrete time, we can construct a -consistent family through recursion. Therefore, the key question is how we construct a nonlinear expectation over one step. The definition of the DR-expectation can be applied to generate these one-step expectations in a natural way.
Recall that, as is generated by , any -measurable function is simply a function of so we can write
(8) |
For any conditionally Markov measure , if has generator , it follows that
Lemma 4.
For a dynamic DR-expectation, the one-step expectation (i.e. for -measurable ) can be written
where is as in (8).
Proof.
Our DR-expectation can be written
As is -measurable, rather than -measurable, we cannot exploit Lemma 1 fully, as the generator at is still relevant when calculating our -conditional expectation. As we are in a dynamic-generator DR-expectation setting, for a measure with generator and initial hidden distribution , we have
Our expectation is independent of for , and is independent of for given . Therefore, without loss of generality we can minimize this over possible values of and, expressing in terms of the current filter state, we obtain
where is as constructed in Theorem 5. Combining the conditional expectation and the penalty, this gives the desired representation. ∎
Remark 21.
There is a surprising form of double-counting of the penalty here. For notational simplicity, let’s assume does not depend on . If we consider , then we have included a penalty for the proposed model at , that is,
where is the penalty associated with the filter state at time , which comes from the generators .
When we calculate , we then add on the penalty , which again penalizes unreasonable values of and the generator . This ‘double counting’ of the penalty corresponds to us including both our ‘expected’ uncertainty at time , and also our ‘uncertainty at about our uncertainty at ’.
Remark 22.
In the case where we considered a DynamicUP model, the equations would be identical, with the corresponding choice of . This is because the DR and uncertain prior models differ only through the incorporation of learning through the log-likelihood term, which is not a consideration when it comes to evaluating our future expectations.
Remark 23.
If we take a StaticDR model, then the equations vary as one might expect:
Similarly for a StaticUP model. However, one should be careful in this setting, as the recursively-defined nonlinear expectation will consider models which allow the generator to vary through time, even though this does not form part of the original static generator framework.
Given this recursion, we can now define the nonlinear expectation at every time.
Definition 15.
The dynamically consistent expectation of (for either the static or dynamic generator cases, and either the uncertain prior or DR-expectation models), is given by the recursion
with . As is trivial, we identify
Note that is -measurable for all .
Remark 24.
As we have defined using recursion, and our DR-expectation (and uncertain prior expectation) are convex, it is easy to verify that the map is a -consistent convex expectation.
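A sketch of how the recursion in Definition 15 can be evaluated for a two-state chain with Bernoulli observations (so that the integral over the next observation reduces to a two-term sum). The assumed reading of one step, consistent with Lemma 4 and Remark 21, is a supremum over candidate filter states $p$ and one-step generators $G$ of the predictive expectation of the next value, less the accumulated penalty $R_t(p)$ and the one-step penalty $\gamma(G)$; since the displays are not legible in this copy, treat both the form and the names below as illustrative.

```python
import numpy as np

def filter_step(A, c_vals, p):
    u = c_vals * (A @ p)
    return u / u.sum()

def one_step_expectation(xi_next, R_t, states, generators, y_values=(0, 1)):
    """E_t(xi) = sup over (p, G) of { sum_y xi_next(y, next filter state) * P(y | p, G)
                                      - R_t(p) - gamma(G) }   (assumed one-step form)."""
    best = -np.inf
    for A, c, gamma in generators:
        for j, p in enumerate(states):
            value = 0.0
            for y in y_values:
                c_vals = c(y)
                prob_y = c_vals @ (A @ p)                 # predictive probability of observing y
                value += xi_next(y, filter_step(A, c_vals, p)) * prob_y
            best = max(best, value - R_t[j] - gamma)
    return best

# The dynamically consistent expectation of Definition 15 is then obtained by applying this
# one-step map backwards from the horizon: E_T = xi and E_t = one step applied to E_{t+1},
# with the penalties R_t themselves propagated forward as in Section 4.
```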
5.2 Recall of BSDE theory
While it is useful to give a recursive definition of our nonlinear expectation, a better understanding of its dynamics is of practical importance. In what follows we will, for the dynamic generator case, consider the corresponding BSDE theory, assuming that can take only finitely many values, as in [7]. We will now present the key results of [7], in a simplified setting.
In what follows, we suppose that takes values, which we associate with the standard basis vectors in . For simplicity, we write for the vector in with all components .
Definition 16.
Write for a probability measure such that is an iid sequence, uniformly distributed over the states, and for the -martingale difference process . As in [7], has the property that any -adapted -martingale can be represented by for some (and is unique up to addition of a multiple of ).
Remark 25.
The construction of in fact also shows that, if is written , then for every (up to addition of a multiple of ).
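The representation in Definition 16 and Remark 25 can be checked directly on a single step: under the reference measure the martingale difference is the new basis-vector value minus the uniform mean, and any function of the next value, centred at its mean, equals $Z M_{t+1}$ with $Z$ unique only up to adding a multiple of the vector of ones. A small numerical check (the dimension and values are arbitrary):

```python
import numpy as np

N = 5
basis = np.eye(N)                                    # the N possible values, as basis vectors
ones = np.ones(N)
v = np.random.default_rng(3).normal(size=N)          # xi takes the value v[i] on the i-th outcome

# Under the uniform reference measure, E[xi] = v.mean(), and xi - E[xi] = Z @ M
# with M = e_i - ones / N; the choice Z = v works.
Z = v
for i in range(N):
    M = basis[i] - ones / N
    assert np.isclose(v[i] - v.mean(), Z @ M)

# Adding any multiple of `ones` to Z changes nothing, since ones @ M = 0.
Z2 = Z + 7.3 * ones
assert all(np.isclose(v[i] - v.mean(), Z2 @ (basis[i] - ones / N)) for i in range(N))
print("one-step martingale representation verified")
```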
We can then define a BSDE (Backward Stochastic Difference Equation) with solution :
(9) |
where is a finite deterministic terminal time, a -adapted map , and a given -valued -measurable terminal condition. For simplicity, we henceforth omit the argument of and .
The general existence and uniqueness result for BSDEs in this context is as follows:
Theorem 6.
Suppose is such that the following two assumptions hold:
-
(i)
For any , if for some , then , -a.s. for all .
-
(ii)
For any , for all , for -almost all , the map
is a bijection .
Then for any terminal condition essentially bounded, -measurable, and with values in , the BSDE (9) has a -adapted solution . Moreover, this solution is unique up to indistinguishability for and indistinguishability up to addition of multiples of for .
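Concretely, solving a BSDE of the form (9) backwards amounts to, at each step, reading $Z_t$ off the martingale part of $Y_{t+1}$ under the reference measure and then solving the scalar equation $Y_t = \mathbb{E}_{P_0}[Y_{t+1}\mid \mathcal{F}_t] + F(t, Y_t, Z_t)$ for $Y_t$. The display (9) is not legible here, so the sign and placement of the driver are assumptions; the sketch below treats a single backward step for an iid uniform finite-valued process and a contracting driver.

```python
import numpy as np

def bsde_backward_step(y_next, driver, t):
    """One backward step under the assumed convention
       Y_t = E_P0[Y_{t+1} | F_t] + F(t, Y_t, Z_t),   Z_t M_{t+1} = Y_{t+1} - E_P0[Y_{t+1} | F_t].

    y_next : (N,) array of the values of Y_{t+1} over the N equally likely next observations
    driver : function F(t, y, z); assumed a contraction in y so fixed-point iteration converges
    """
    mean = y_next.mean()                 # E_P0[Y_{t+1} | F_t] under the uniform reference measure
    Z = y_next                           # martingale representation (unique up to adding c * ones)
    Y = mean
    for _ in range(200):                 # solve the scalar fixed-point equation y = mean + F(t, y, Z)
        Y = mean + driver(t, Y, Z)
    return Y, Z

# Illustration: linear driver F(t, y, z) = -0.1 * y and a terminal payoff of the final observation.
N, T = 3, 5
y_next = np.array([0.0, 1.0, 4.0])       # terminal values over the N possible last observations
for t in reversed(range(T)):
    Y, Z = bsde_backward_step(y_next, lambda t, y, z: -0.1 * y, t)
    y_next = np.full(N, Y)               # with this driver, Y_t does not depend on the path
print("time-0 value:", round(Y, 4))
```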
In this setting, we also have a comparison theorem:
Theorem 7.
Consider two discrete time BSDEs as in (9), corresponding to coefficients and terminal values . Suppose the conditions of Theorem 6 are satisfied for both equations, let and be the associated solutions. Suppose the following conditions hold:
-
(i)
-a.s.
-
(ii)
-a.s., for all times and every and ,
-
(iii)
-a.s., for all , satisfies
-
(iv)
-a.s., for all and all , is an increasing function.
It is then true that -a.s. A driver satisfying (iii) and (iv) will be called ‘balanced’.
Finally, we also know that all dynamically consistent nonlinear expectations can be represented through BSDEs:
Theorem 8.
The following two statements are equivalent.
-
(i)
is a -consistent, dynamically translation invariant, nonlinear expectation
-
(ii)
There exists a driver which is balanced, independent of , and satisfies the normalisation condition , such that, for all , is the solution to a BSDE with terminal condition and driver .
Furthermore, these two statements are related by the equation
5.3 BSDEs for forward expectations
By applying the above general theory, we can easily see that our nonlinear expectation has a representation as the solution to a particular BSDE.
Theorem 9.
In the dynamic generator setting, writing , the dynamically consistent expectation satisfies the BSDE
where
Proof.
As is -measurable, by the Doob–Dynkin lemma there exists a -measurable function such that (we omit to write as an argument). We write for the vector containing each of the values of this function. From the definition of , as in the proof of the martingale representation theorem in [7], it follows that,
We can then calculate
The answer follows by rearrangement. ∎
We can now observe that, if we treat the function as a state variable, we obtain a ‘Markovian’ BSDE.
Theorem 10.
Suppose and are independent of . Then the solution to the above BSDE is a functional of , that is, .
Proof.
This argument follows in the usual manner – we observe that has recursive dynamics (and furthermore, under the reference measure, is Markovian) and that the terminal value of our BSDE is a function of only through . Consequently, we can use backward induction to construct the solution to the BSDE as a function of (by evolving forward one step in time, then solving the BSDE backward one step), and the result follows. ∎
6 A control problem with uncertain filtering
In this final section, we will consider the solution of a simple control problem under uncertainty, using the formal structures previously developed. We shall focus our attention on a DR-expectation with dynamic generator, however similar arguments can be used in each of the other settings considered. In some ways, this approach is similar to those considered by Bielecki, Chen and Cialenco [4], where the DR-expectation is replaced by an approximate confidence interval. (Taking in our analysis would give a very similar problem to the one they consider.)
Suppose a controller selects a control from a set , which we assume is a countable union of compact metrizable sets (this assumption is purely to enable us to use appropriate measurable selection theorems). Controls are required to be -predictable (i.e. is -measurable), and we write for the space of such controls.
A control has an impact on the generator of , through modifying the penalty function , which describes the ‘reasonable’ models for the transition matrix and the distribution of observations . In particular, for a given we will now have a penalty , which we assume is continuous in for every . This allows a controller to modify what are ‘reasonable’ values of the generator, even though the generator may not be fully known. We write for the corresponding -consistent expectation and for the corresponding dynamic penalty. The expectation then encodes both the change in the hidden dynamics in the future due to future controls and the change in our agent’s understanding of the present hidden state (as represented via ) due to her past controls.
The controller wishes to minimize an expected cost
Here is a terminal cost, which may depend on the hidden state , and is a running cost, which will depend on the control used at time . We assume and are continuous in (almost surely). We do not allow to depend on , as this would potentially lead to paradoxes (as the agent could learn information about the hidden state by observing their running costs). We think of the cost as being paid at time , depending on the choice of control (which will affect the generator at time ). For notational simplicity, we will omit to write as an argument when unnecessary.
For a given control process , we define the remaining cost
and hence the value function
Remark 26.
We define our expected cost using the -consistent expectation , rather than the (inconsistent) DR-expectation, as this leads to time-consistency in the choice of controls.
Remark 27.
We can see that the calculation of the value function is a ‘minimax’ problem, in that minimizes the cost, which we evaluate using a maximum over a set of models. However, given the potential for learning, the requirement for time consistency, and the uncertainties involved, it is not clear that one can write explicitly in terms of a single minimization and maximization of a given function.
Remark 28.
As the filter-state penalty is a general function depending on the control, and only takes finitely many states, it is not generally possible to express the effect (on ) of a control through a change of measure relative to some reference dynamics. In particular, we face the problem that controls for times will have an impact on the terminal cost , through their impact on the uncertainty , so, unlike in a traditional control problem, is not independent of given . For this reason, even though we model the impact of a control through its effect on the generator, we cannot give a fully ‘weak’ formulation of our control problem, and are restricted to a ‘Markovian’ setting, where we shall exploit dynamic programming.
Theorem 11.
The value function satisfies a dynamic programming principle, in particular, if an optimal control exists, then for every ,
(and similarly if we only assume an -optimal control exists for every ).
Proof.
For any control , using the recursivity of we have
The stated equality (which is the dynamic programming relation) then follows from a standard pasting argument (along with a measurable selection result to ensure optimal or -optimal controls exist, see for example [9, Appendix 10]). ∎
Theorem 12.
The value function of the control problem satisfies the recursion
where
for a normalizing constant (which may depend on ) to ensure has minimal value zero, and terminal value
Proof.
We proceed by recursion, and assume that an optimal policy is always attainable (this can be relaxed, with an increase of notational complexity). We suppose that the value function is given by , for some function , where denotes the space of functions from to with minimal value zero. From our assumptions, this is clearly true at time , and has the stated form.
From the perspective of time , suppose the current penalty state is given. Then, for any proposed control (which is assumed to be -measurable), when is observed at time , the time- penalty will be given by . For each choice of and each conditionally Markov measure (with generator ), the remaining cost faced at time will be
The conditional expectation of can be written
The DR-expected cost of using control , as seen from time , fixing the state of the penalty at , is then given by taking a conditional expectation and adding the running cost term, which yields
Finally, optimizing this cost gives the one-step minimal cost in terms of a function of and , as required. (To be rigorous, this can be done using a measurable selection argument, as in [9, Appendix 10].) The result follows by recursion, as our problem is known to satisfy the dynamic programming principle. ∎
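A schematic, fully discretized evaluation of the recursion in Theorem 12 for the last step before the horizon: finite control and generator sets (with control-dependent one-step penalties), Bernoulli observations, a grid of candidate filter states carrying the penalty, and a DynamicDR-style forward update of that penalty. Since several displays above are not legible, the precise way the running cost, penalties and the next-period value combine here is an assumed reading, and all numerical ingredients are illustrative.

```python
import numpy as np

def filter_update(A, c_vals, q):
    u = c_vals * (A @ q)
    return u / u.sum(), u.sum()

grid = np.linspace(0.01, 0.99, 49)
states = [np.array([g, 1 - g]) for g in grid]
controls = [0, 1]
y_values = [0, 1]

def make_c(prob_one):
    return lambda y: np.where(y == 1, prob_one, 1.0 - prob_one)

def generators(u):
    """Candidate generators and control-dependent one-step penalties gamma_u(G)."""
    return [(np.array([[0.9, 0.2], [0.1, 0.8]]), make_c(np.array([0.8, 0.3])), 0.2 * u),
            (np.array([[0.8, 0.3], [0.2, 0.7]]), make_c(np.array([0.7, 0.4])), 1.0 - 0.5 * u)]

terminal_cost = np.array([0.0, 5.0])              # cost if the hidden state at the horizon is e_i
running_cost = lambda u: 0.3 * u

def V_terminal(R):
    """DR-expected terminal cost given the penalty R over the grid of candidate filter states."""
    return max(states[j] @ terminal_cost - R[j] for j in range(len(grid)))

def penalty_update(R, u, y):
    """DynamicDR-style forward update of the penalty under control u on observing y."""
    R_new = np.full(len(grid), np.inf)
    for A, c, gamma in generators(u):
        for j, q in enumerate(states):
            p, lik = filter_update(A, c(y), q)
            i = np.argmin(np.abs(grid - p[0]))
            R_new[i] = min(R_new[i], R[j] + gamma - np.log(lik))
    return R_new - R_new[np.isfinite(R_new)].min()

def V_one_step(R):
    """One backward Bellman step: minimize over controls the running cost plus the
    DR-expectation (sup over filter states and generators, net of penalties) of the
    next-period value evaluated at the updated penalty."""
    best = np.inf
    for u in controls:
        v_next = {y: V_terminal(penalty_update(R, u, y)) for y in y_values}
        worst = -np.inf
        for A, c, gamma in generators(u):
            for j, q in enumerate(states):
                expected = sum((c(y) @ (A @ q)) * v_next[y] for y in y_values)
                worst = max(worst, expected - R[j] - gamma)
        best = min(best, running_cost(u) + worst)
    return best

R0 = np.abs(grid - 0.5) * 3.0                     # illustrative initial penalty state
print("value one step before the horizon:", round(V_one_step(R0), 3))
```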
Corollary 1.
A control is optimal if and only if it achieves the infimum in the formula for above.
Remark 29.
If we assume that the terminal cost depends only on (and not on ), and the running cost does not depend on , then one can observe a Markov property for the control problem, that is, is conditionally independent of given . The corresponding optimal controls can then also be taken to depend only on .
Remark 30.
We can write (using for the partial difference operator and omitting dependence on )
Rearranging, we obtain the following Bellman–Isaacs-type equation for
with terminal value
References
- [1] Andrew Allan and Samuel N. Cohen. Parameter uncertainty in the Kalman–Bucy filter. to appear.
- [2] Tamer Başar and Pierre Bernhard. $H^\infty$-Optimal Control and Related Minimax Design Problems: A dynamic game approach. Birkhäuser, 1991.
- [3] Alan Bain and Dan Crisan. Fundamentals of Stochastic Filtering. Springer, 2009.
- [4] Tomasz R. Bielecki, Tao Chen, and Igor Cialenco. Recursive construction of confidence regions. https://arxiv.org/abs/1605.08010.
- [5] Rene K. Boel, Matthew R. James, and Ian R. Petersen. Robustness and risk-sensitive filtering. IEEE Transactions on Automatic Control, 47(3):451–461, 2002.
- [6] Samuel N. Cohen. Data-driven nonlinear expectations for statistical uncertainty in decisions. Electronic Journal of Statistics, 11(1):1858–1889, 2017.
- [7] Samuel N. Cohen and Robert J. Elliott. A general theory of finite state backward stochastic difference equations. Stochastic Processes and their Applications, 120(4):442–466, 2010.
- [8] Samuel N. Cohen and Robert J. Elliott. Backward stochastic difference equations and nearly-time-consistent nonlinear expectations. SIAM Journal on Control & Optimization, 49(1):125–139, 2011.
- [9] Samuel N. Cohen and Robert J. Elliott. Stochastic Calculus and Applications. Birkhäuser, 2nd ed. edition, 2015.
- [10] Freddy Delbaen, Shige Peng, and Emanuela Rosazza Gianin. Representation of the penalty term of dynamic concave utilities. Finance and Stochastics, 14(3):449–472, 2010.
- [11] Subhrakanti Dey and John B. Moore. Risk-sensitive filtering and smoothing for hidden Markov models. Systems and Control Letters, 25:361–366, 1995.
- [12] Darrell Duffie and Larry G. Epstein. Asset pricing with stochastic differential utility. The Review of Financial Studies, 5(3):411–436, 1992.
- [13] Nicole El Karoui, Shige Peng, and Marie-Claire Quenez. Backward stochastic differential equations in finance. Mathematical Finance, 7(1):1–71, January 1997.
- [14] Larry G. Epstein and Martin Schneider. Recursive multiple-priors. Journal of Economic Theory, 113:1–31, 2003.
- [15] Ronald Fagin and Joseph Halpern. A new approach to updating beliefs. In Proceedings of the Sixth Conference Annual Conference on Uncertainty in Artificial Intelligence (UAI-90), pages 317–325, Corvallis, Oregon, 1990. AUAI Press.
- [16] Hans Föllmer and Alexander Schied. Convex measures of risk and trading constraints. Finance and Stochastics, 6:429–447, 2002.
- [17] Hans Föllmer and Alexander Schied. Stochastic Finance: An introduction in discrete time. Studies in Mathematics 27. de Gruyter, Berlin-New York, 2002.
- [18] Marco Frittelli and Emanuela Rosazza Gianin. Putting order in risk measures. Journal of Banking & Finance, 26(7):1473–1486, 2002.
- [19] Siegfried Graf. A Radon–Nikodym theorem for capacities. Journal für die reine und angewandte Mathematik, 320:192–214, 1980.
- [20] Michael J. Grimble and Ahmed El Sayed. Solution of the optimal linear filtering problem for discrete-time systems. IEEE Transactions on Acoustics Speech and Signal Processing, 38(7), 1990.
- [21] Lars Peter Hansen and Thomas J. Sargent. Robust estimation and control under commitment. Journal of Economic Theory, 124:258–301, 2005.
- [22] Lars Peter Hansen and Thomas J. Sargent. Recursive robust estimation and control without commitment. Journal of Economic Theory, 136(1):1–27, 2007.
- [23] Lars Peter Hansen and Thomas J. Sargent. Robustness. Princeton University Press, 2008.
- [24] Peter J. Huber and Elvezio M. Ronchetti. Robust Statistics. Wiley, 2nd edition, 2009.
- [25] Matthew R. James, John S. Baras, and Robert J. Elliott. Risk-sensitive control and dynamic games for partially observed discrete-time nonlinear systems. IEEE Transactions on Automatic Control, 39(4), 1994.
- [26] R.E. Kalman. A new approach to linear filtering and prediction problems. J. Basic Eng. ASME, 82:33–45, 1960.
- [27] R.E. Kalman and R.S. Bucy. New results in linear filtering and prediction theory. J. Basic Eng. ASME, 83:95–108, 1961.
- [28] John Maynard Keynes. A Treatise on Probability. Macmillan, 1921.
- [29] Frank Knight. Risk, Uncertainty and Profit. Houghton Mifflin, 1921.
- [30] Michael Kupper and Walter Schachermayer. Representation results for law invariant time consistent functions. Mathematics and Financial Economics, 2(3):189–210, September 2009.
- [31] Shige Peng. Nonlinear expectations and stochastic calculus under uncertainty. arXiv:1002.4546, 2010.
- [32] Frank Riedel. Dynamic coherent risk measures. Stochastic Processes and their Applications, 112(2):185–200, 2004.
- [33] R. Tyrrell Rockafellar, Stan Uryasev, and Michael Zabarankin. Generalized deviations in risk analysis. Finance and Stochastics, 10:51–74, 2006.
- [34] Ramon van Handel. Discrete time nonlinear filters with informative observations are stable. Electronic Communications in Probability, 13:53, 2008.
- [35] Abraham Wald. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, 46(2):265–280, 1945.
- [36] Peter Walley. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, 1991.
- [37] W.N. Wonham. Some applications of stochastic differential equations to optimal nonlinear filtering. SIAM J. Control, 2:347–369, 1965.
- [38] Jinhui Zhang, Yuanqing Xia, and Peng Shi. Parameter-dependent robust filtering for uncertain discrete-time systems. Automatica, 45:560–565, 2009.