
Uncertainty and filtering of hidden Markov models in discrete time

Samuel N. Cohen (Research supported by the Oxford–Man Institute for Quantitative Finance and the Oxford–Nie Financial Data Laboratory. Thanks to Ramon van Handel, Michael Monoyios, Sergey Nadtochiy, Andrew Allan, Gonçalo Simões and Robert Elliott for useful conversations.)
samuel.cohen@maths.ox.ac.uk
Mathematical Institute, University of Oxford
(September 24, 2025)
Abstract

We consider the problem of filtering an unseen Markov chain from noisy observations, in the presence of uncertainty regarding the parameters of the processes involved. Using the theory of nonlinear expectations, we describe the uncertainty in terms of a penalty function, which can be propagated forward in time in place of the filter. We also investigate a simple control problem in this context.

Keywords: Filtering, Optimal control, Robustness, Nonlinear expectation. MSC: 62M20, 60G35, 93E11

1 Introduction

Filtering is a common problem in many applications. The essential concept is that there is an unseen Markov process, which influences the state of some observed process, and our task is to approximate the state of the unseen process using a form of Bayes’ theorem. Many results have been obtained in this direction, most famously the Kalman filter (Kalman [26], Kalman and Bucy [27]), which assumes the underlying processes considered are Gaussian, and gives explicit formulae accordingly. Similarly, under the assumption that the underlying process is a finite-state Markov chain, a general formula to calculate the filter can be obtained (the Wonham filter [37]). These results are well known, in both discrete and continuous time (see Bain and Crisan [3] or Cohen and Elliott [9, Chapter 21] for further general discussion).

In this paper, we shall consider a simple setting in discrete time, where the underlying process is a finite-state Markov chain. Our concern will be to study uncertainty in the dynamics of the underlying processes, in particular its effect on the behaviour of the corresponding filter. That is, we will assume that the observer has only imperfect knowledge of the dynamics of the underlying process and of its relationship with the observation process, and wishes to incorporate this uncertainty in their estimates of the unseen state. We are particularly interested in allowing the level of uncertainty in the filtered state to be endogenous to the filtering problem, arising from the uncertainty in parameter estimates and process dynamics.

We will model this uncertainty in a general manner, using the theory of nonlinear expectations, and shall particularly concern ourselves with a description of uncertainty for which explicit calculations can still be carried out, and which can be motivated by considering statistical estimation of parameters. We then apply this to building a dynamically consistent expectation for random variables based on future states, and to a general control problem with learning under uncertainty.

1.1 Basic filtering

Consider two stochastic processes, X=\{X_{t}\}_{t\geq 0} and Y=\{Y_{t}\}_{t\geq 0}. Let \Omega be the space of paths of (X,Y) and \mathbb{P} be a probability measure on \Omega. We denote by \{\mathcal{F}_{t}\}_{t\geq 0} the (completed) filtration generated by X and Y, and by \mathcal{Y}=\{\mathcal{Y}_{t}\}_{t\geq 0} the (completed) filtration generated by Y. The key problem of filtering is to determine estimates of \phi(X_{t}) given \mathcal{Y}_{t}, that is, \mathbb{E}_{\mathbb{P}}[\phi(X_{t})|\mathcal{Y}_{t}], where \phi is an arbitrary Borel function.

Suppose that X is a Markov chain with (possibly time-dependent) transition matrix A_{t}^{\top} under \mathbb{P} (the transpose here saves notational complexity later). Without loss of generality we can assume that X takes values in the standard basis vectors \{e_{i}\}_{i=1}^{N} of \mathbb{R}^{N} (where N is the number of states of X), and so we can write

X_{t}=A_{t}X_{t-1}+M_{t}

where \mathbb{E}_{\mathbb{P}}[M_{t+1}|\mathcal{F}_{t}]=0, so \mathbb{E}_{\mathbb{P}}[X_{t}|\mathcal{F}_{t-1}]=A_{t}X_{t-1}.

We suppose the process Y is multivariate real-valued (this assumption can easily be relaxed, to allow Y to take values in an appropriate Polish or Blackwell space; we restrict to the real setting purely for simplicity). The law of Y will be allowed to depend on X, in particular, the \mathbb{P}-distribution of Y_{t} given \{X_{s}\}_{s\leq t}\cup\{Y_{s}\}_{s<t} (that is, given all past observations of X and Y and the current state of X) is

Y_{t}\sim c(y;t,X_{t})d\mu(y)

for \mu a reference measure on (\mathbb{R}^{d},\mathcal{B}(\mathbb{R}^{d})).

For simplicity, we shall assume that Y_{0}\equiv 0, so no information is revealed about X_{0} at time 0. It is convenient to write C_{t}(y)=C(y;t) for the diagonal matrix with entries c(y;t,e_{i}), so that

C_{t}(y)X_{t}=c(y;t,X_{t})X_{t}.

Note that these assumptions, in particular the values of A and c, depend on the choice of probability measure \mathbb{P}. Conversely, as our space \Omega is the space of paths of (X,Y), the measure \mathbb{P} is determined by A and c.

As we have assumed X_{t} takes values in the standard basis in \mathbb{R}^{N}, the expectation \mathbb{E}_{\mathbb{P}}[X_{t}|\mathcal{Y}_{t}] determines the entire conditional distribution of X_{t} given \mathcal{Y}_{t}. In this discrete time context, the filtering problem can be solved in a fairly simple manner: Suppose we have already calculated p_{t-1}:=\mathbb{E}_{\mathbb{P}}[X_{t-1}|\mathcal{Y}_{t-1}]. Then by linearity and the dynamics of X, using the fact

\mathbb{E}_{\mathbb{P}}[M_{t}|\mathcal{Y}_{t-1}]=\mathbb{E}_{\mathbb{P}}[\mathbb{E}_{\mathbb{P}}[M_{t}|\mathcal{F}_{t-1}]|\mathcal{Y}_{t-1}]=0,

we can calculate

\mathbb{E}_{\mathbb{P}}[X_{t}|\mathcal{Y}_{t-1}]=\mathbb{E}_{\mathbb{P}}[A_{t}X_{t-1}+M_{t}|\mathcal{Y}_{t-1}]=A_{t}p_{t-1}.

Bayes’ theorem then states that, with probability one,

\mathbb{P}(X_{t}=e_{i}|\mathcal{Y}_{t})=\mathbb{P}(X_{t}=e_{i}|\{Y_{s}\}_{s<t},Y_{t})\propto c(Y_{t};t,e_{i})\mathbb{P}(X_{t}=e_{i}|\mathcal{Y}_{t-1}),

which can be written in a simple matrix form,

p_{t}\propto C_{t}(Y_{t})A_{t}p_{t-1}. (1)

As p_{t} is a probability vector, normalization of the right hand side determines p_{t} directly. We call p_{t} the ‘filter state’ at time t. Note that, if we assume the density c is positive, A_{t} is irreducible and p_{t-1} has all entries positive, then p_{t} will also have all entries positive.
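To make the recursion (1) concrete, the following is a minimal sketch of the update in Python; the names (`filter_step`, `obs_density`) and the Gaussian observation density are illustrative assumptions, not part of the model above.

```python
import numpy as np

def filter_step(p_prev, y, A, obs_density):
    """One step of the recursion (1): p_t proportional to C_t(y) A_t p_{t-1}.

    p_prev      : filter state p_{t-1}, a probability vector of length N
    y           : current observation Y_t
    A           : transition matrix A_t (so that E[X_t | X_{t-1}] = A_t X_{t-1})
    obs_density : function (y, i) -> c(y; t, e_i), the observation density in state i
    """
    predicted = A @ p_prev                                    # A_t p_{t-1}
    likelihood = np.array([obs_density(y, i) for i in range(len(p_prev))])
    unnormalised = likelihood * predicted                     # C_t(y) A_t p_{t-1}
    return unnormalised / unnormalised.sum()                  # normalise to a probability vector

# Illustrative two-state example with Gaussian observation noise (assumed, for demonstration).
A = np.array([[0.9, 0.2],
              [0.1, 0.8]])             # columns sum to one, so A p is the predicted law
means = np.array([-1.0, 1.0])
obs_density = lambda y, i: np.exp(-0.5 * (y - means[i]) ** 2) / np.sqrt(2 * np.pi)

p = np.array([0.5, 0.5])
for y in [0.3, 1.1, -0.2]:              # a short, made-up observation sequence
    p = filter_step(p, y, A, obs_density)
print(p)                                # current filter state p_t
```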

Definition 1.

For future use, if \mathfrak{A}=(A,C(\cdot)) denotes the A and C matrices described above and p_{0} is the initial filter state, then we will write p_{t}^{\mathfrak{A},p_{0}} for the filter state at time t, that is, the solution to (1) (where the observations Y are implicit).

In practice, the key problem with implementing these methods is the requirement that we know the underlying transition matrix A^{\top} and the density c. These are generally not known perfectly, but need to be estimated prior to the implementation of the filter. Uncertainty in the choice of these parameters will lead to uncertainty in the estimates of the filtered state, and the aim of this paper is to derive useful representations of that uncertainty.

As variation in the choice of A and c corresponds to a different choice of measure \mathbb{P}, we see that using an uncertain collection of generators corresponds naturally to uncertainty regarding \mathbb{P}. This type of uncertainty, where the probability measure is not known, is commonly referred to as ‘Knightian’ uncertainty (with reference to Knight [29]; related ideas are also discussed by Keynes [28]).

Effectively, we wish to consider the propagation of uncertainty in Bayesian updating (as the filter is simply a special case of this). Huber and Ronchetti [24, p331] briefly touch on this, but argue (based on earlier work by Kong) that this propagation is computationally infeasible. Their approach, however, was based on Choquet integrals, rather than nonlinear expectations. In the coming sections, we shall see how the structure of nonlinear expectations allows us to derive comparatively simple rules for updating.

Remark 1.

While we will present our theory in the context where X is a finite-state Markov chain, our approach does not depend in any significant way on this assumption. In particular, it would be equally valid, mutatis mutandis, if we supposed that X followed the dynamics of the Kalman filter, and our uncertainty was over the coefficients of the filter. We specialize to the Markov chain case purely for the sake of concreteness.

2 Conditionally Markov Measures

In order to incorporate learning in our nonlinear expectations and filtering, it is useful to extend slightly from the family of measures previously described. In particular, we wish to allow the dynamics to depend on the past observations, while preserving enough Markov structure to enable filtering. We write \mathcal{M}_{1} for the space of probability measures equivalent to a reference measure \mathbb{P}. The following two classes of probability measures will be of interest.

Definition 2.

Let \mathcal{M}_{M}\subset\mathcal{M}_{1} denote the probability measures under which

  • X is a Markov chain, that is, for all t, X_{t+1} is independent of \mathcal{F}_{t} given X_{t},

  • \{Y_{s}\}_{s\geq t+1} is independent of \mathcal{F}_{t} given X_{t+1},

  • both X and Y are time homogeneous, that is, the conditional distributions of X_{t+1}|X_{t} and Y_{t}|X_{t} do not depend on t.

To extend this slightly, we can allow our processes to depend on the past of the observed process Y.

Definition 3.

Let \mathcal{M}_{M|\mathcal{Y}}\subset\mathcal{M}_{1} denote the probability measures under which

  • X is a conditional Markov chain, that is, for all t, X_{t+1} is independent of \mathcal{F}_{t} given X_{t} and \{Y_{s}\}_{s\leq t}, and

  • \{Y_{s}\}_{s\geq t+1} is independent of \mathcal{F}_{t} given \{X_{t+1}\}\cup\{Y_{s}\}_{s\leq t}.

We should note that, if we consider a measure in \mathcal{M}_{M|\mathcal{Y}}, there is a natural notion of the generators A and C. In particular, \mathcal{M}_{M} corresponds to those measures under which the generators A and C are constant, while \mathcal{M}_{M|\mathcal{Y}} corresponds to those measures under which the generators A and C are deterministic functions of time and \{Y_{s}\}_{s\leq t}.

Definition 4.

We shall write \mathbb{A} for the space in which the generator takes values (this space can be thought of as the product of the space of transition matrices and the space of diagonal matrix-valued functions, where each diagonal element is a probability density on \mathbb{R}^{d}), and write \mathcal{A}_{\mathcal{Y}} for the collection of generators associated with \mathcal{M}_{M|\mathcal{Y}}, that is, \mathcal{Y}-adapted processes taking values in \mathbb{A}.

For each t, these generators determine the measure on \mathcal{F}_{t} given \mathcal{F}_{t-1}, and (together with the distribution of X_{0}) this determines the measure at all times. It is straightforward to verify that our filtering equations hold for all measures in \mathcal{M}_{M|\mathcal{Y}}, with the appropriate modification of the generators.

Definition 5.

For a measure \mathbb{Q}\in\mathcal{M}_{M|\mathcal{Y}}, we shall write \mathfrak{A}^{\mathbb{Q}}=\big(A^{\mathbb{Q}},C^{\mathbb{Q}}(\cdot)\big) for the generator of (X,Y) under \mathbb{Q}, recalling that C^{\mathbb{Q}}_{t}(y)=\mathrm{diag}(\{c^{\mathbb{Q}}_{t}(y;e_{i})\}_{i=1}^{N}), and that A^{\mathbb{Q}}_{t} and C^{\mathbb{Q}}_{t} are now allowed to depend on \{Y_{s}\}_{s<t}. For notational convenience, we shall typically not write the dependence on \{Y_{s}\}_{s<t} explicitly.

Similarly, for \mathfrak{A}\in\mathcal{A}_{\mathcal{Y}} and p_{0} a probability vector in \mathbb{R}^{N}, we shall write \mathbb{Q}^{\mathfrak{A},p_{0}} for the measure with generator \mathfrak{A} and initial distribution X_{0}\sim p_{0}.

In our setting, the fundamental problem is that we do not know which measure is ‘true’, and so we work instead under a family of measures. In general, measure changes can be described as follows.

Proposition 1.

Let \bar{\mathbb{P}} be the reference measure (which does not have to be a probability measure), under which X is a sequence of iid uniform random variables from the basis vectors \{e_{1},\dots,e_{N}\}\subset\mathbb{R}^{N} and \{Y_{t}\}_{t\geq 0} is independent of X, with iid distribution Y_{t}\sim d\mu. The measure \mathbb{Q}\in\mathcal{M}_{M|\mathcal{Y}} under which X has generator \{A^{\mathbb{Q}}_{t}\}_{t\geq 0}, Y_{t}\sim c^{\mathbb{Q}}_{t}(y,X_{t})d\mu(y) and X_{0}\sim p^{\mathbb{Q}}_{0} has Radon–Nikodym derivative (or likelihood)

\frac{d\mathbb{Q}}{d\bar{\mathbb{P}}}\Big|_{\mathcal{F}_{T}}=(X_{0}^{\top}p_{0}^{\mathbb{Q}})N\prod_{t=1}^{T}\Big(\big(X_{t}^{\top}A_{t-1}^{\mathbb{Q}}X_{t-1}\big)c^{\mathbb{Q}}_{t}(Y_{t};X_{t})\Big).

The above proposition gives a Radon–Nikodym derivative adapted to the full filtration \{\mathcal{F}_{t}\}_{t\geq 0}. In practice, it is also useful to consider the corresponding Radon–Nikodym derivative adapted to the observation filtration \{\mathcal{Y}_{t}\}_{t\geq 0}. As this filtration is generated by the process Y, it is enough to multiply together the conditional distributions of Y_{t}|\mathcal{Y}_{t-1}, leading to the following convenient representation. For notational simplicity, we write

c_{t}(y;p):=\sum_{i}p_{i}c_{t}(y;e_{i}).
Proposition 2.

For \mathbb{Q} as in Proposition 1, the Radon–Nikodym derivative restricted to \mathcal{Y}_{T} is given by

\frac{d\mathbb{Q}}{d\bar{\mathbb{P}}}\Big|_{\mathcal{Y}_{T}}=\prod_{t=1}^{T}c^{\mathbb{Q}}_{t}(Y_{t};A^{\mathbb{Q}}p_{t-1}^{\mathfrak{A}^{\mathbb{Q}},p^{\mathbb{Q}}_{0}})

where we recall that p_{t}^{\mathfrak{A}^{\mathbb{Q}},p^{\mathbb{Q}}_{0}} is the solution to the filtering problem under the measure \mathbb{Q}\in\mathcal{M}_{M|\mathcal{Y}}, as determined by (1) (and so includes further dependence on \{Y_{s}\}_{s<t}).
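As a complement to Proposition 2, the following sketch (continuing the hypothetical Python setup used after (1)) accumulates the observation-filtration log-likelihood \sum_{t}\log c_{t}(Y_{t};A p_{t-1}) alongside the filter recursion; the function names are again illustrative.

```python
import numpy as np

def filter_with_loglik(p0, ys, A, obs_density):
    """Run the filter (1) and accumulate the log of the Radon-Nikodym derivative
    restricted to the observation filtration: sum_t log c_t(Y_t; A p_{t-1})."""
    p, loglik = p0.copy(), 0.0
    N = len(p0)
    for y in ys:
        predicted = A @ p                                      # A_t p_{t-1}
        c = np.array([obs_density(y, i) for i in range(N)])    # c(y; t, e_i)
        mixture = c @ predicted                                # c_t(y; A p_{t-1}) = sum_i c_i (A p)_i
        loglik += np.log(mixture)
        p = c * predicted / mixture                            # normalised filter state p_t
    return p, loglik
```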

3 Nonlinear Expectations

In this section we introduce the concepts of nonlinear expectations and convex risk measures, and discuss their connection with penalty functions on the space of measures. These objects provide a technical foundation with which to model the presence of uncertainty in a random setting. This theory is explored in some detail in Föllmer and Schied [17]. Other key works which have used or contributed to this theory, in no particular order, are Hansen and Sargent [23] (see also [21, 22] for work related to what we present here), Huber and Ronchetti [24], Peng [31], El Karoui, Peng and Quenez [13], Delbaen, Peng and Rosazza Gianin [10], Duffie and Epstein [12], Rockafellar, Uryasev and Zabarankin [33], Riedel [32] and Epstein and Schneider [14]. We base our terminology on that used in [17] and [10].

We here present, without proof, the key details of this theory as needed for our analysis.

Definition 6.

For a \sigma-algebra \mathcal{G} on \Omega, let L^{\infty}(\mathcal{G}) denote the space of essentially bounded \mathcal{G}-measurable random variables. A nonlinear expectation on L^{\infty}(\mathcal{G}) is a mapping

\mathcal{E}:L^{\infty}(\mathcal{G})\to\mathbb{R}

satisfying the assumptions

  • Strict Monotonicity: for any \xi_{1},\xi_{2}\in L^{\infty}(\mathcal{G}), if \xi_{1}\geq\xi_{2} a.s. then \mathcal{E}(\xi_{1})\geq\mathcal{E}(\xi_{2}), and if in addition \mathcal{E}(\xi_{1})=\mathcal{E}(\xi_{2}) then \xi_{1}=\xi_{2} a.s.,

  • Constant triviality: for any constant k, \mathcal{E}(k)=k,

  • Translation equivariance: for any k\in\mathbb{R}, \xi\in L^{\infty}(\mathcal{G}), \mathcal{E}(\xi+k)=\mathcal{E}(\xi)+k.

A ‘convex’ expectation in addition satisfies

  • Convexity: for any \lambda\in[0,1], \xi_{1},\xi_{2}\in L^{\infty}(\mathcal{G}),

    \mathcal{E}(\lambda\xi_{1}+(1-\lambda)\xi_{2})\leq\lambda\mathcal{E}(\xi_{1})+(1-\lambda)\mathcal{E}(\xi_{2}).

If \mathcal{E} is a convex expectation, then the operator defined by \rho(\xi)=\mathcal{E}(-\xi) is called a convex risk measure. A particularly nice class of convex expectations is those which satisfy

  • Lower semicontinuity: For a sequence \{\xi_{n}\}_{n\in\mathbb{N}} with \xi_{n}\uparrow\xi pointwise (and \xi\in L^{\infty}(\mathcal{G})), \mathcal{E}(\xi_{n})\uparrow\mathcal{E}(\xi).

The following theorem (which was expressed in the language of risk measures) is due to Föllmer and Schied [16] and Frittelli and Rosazza Gianin [18].

Theorem 1.

Suppose \mathcal{E} is a lower semicontinuous convex expectation. Then there exists a ‘penalty’ function \mathcal{R}:\mathcal{M}_{1}\to[0,\infty] such that

\mathcal{E}(\xi)=\sup_{\mathbb{Q}\in\mathcal{M}_{1}}\big\{\mathbb{E}_{\mathbb{Q}}[\xi]-\mathcal{R}(\mathbb{Q})\big\}.

Provided \mathcal{R}(\mathbb{Q})<\infty for some \mathbb{Q} equivalent to \mathbb{P}, we can restrict our attention to measures in \mathcal{M}_{1} equivalent to \mathbb{P} without loss of generality.

Remark 2.

This result gives some intuition as to how a convex expectation can model ‘Knightian’ uncertainty. One considers all the possible probability measures on the space, and then selects the maximal expectation among all measures, penalizing each measure depending on how plausible it is considered. As convexity of \mathcal{E} is a natural requirement of an ‘uncertainty averse’ assessment of outcomes, Theorem 1 shows that this is the only way to construct an ‘expectation’ \mathcal{E} which penalizes uncertainty, while preserving monotonicity, translation equivariance and constant triviality.

3.1 DR-expectations

From the discussion above, it is apparent that we can focus our attention on calculating the penalty function \mathcal{R}, rather than the nonlinear expectation directly. This penalty function is meant to encode how ‘unreasonable’ a probability measure \mathbb{Q} is as a model for our outcomes. So far, we have assumed that the penalty did not depend on time or on observations. By relaxing this assumption, we can incorporate learning of which models are ‘good’ in our framework.

In [6], we have considered a framework which links the choice of the penalty function to statistical estimation of a model. The key idea of [6] is to use the negative log-likelihood function for this purpose, where the likelihood is taken against an arbitrary reference measure, and evaluated using the observed data. This directly uses the statistical information from observations in the quantification of uncertainty.

In this paper, we shall make a slight extension of this idea, to explicitly incorporate prior beliefs. In particular, we shall replace the log-likelihood with the log-posterior density, which in turn gives an additional term in the penalty. In order to be precise, we now give a formal definition of the likelihood, which is sufficient for our purposes.

Remark 3.

In what follows, we will variously wish to restrict a measure \mathbb{Q} to a \sigma-algebra, and to condition a measure on a \sigma-algebra. To prevent notational confusion, we shall write \mathbb{Q}\|_{\mathcal{F}} for the restriction of \mathbb{Q} to \mathcal{F}, and \mathbb{Q}|_{\mathcal{F}} for \mathbb{Q} conditioned on \mathcal{F}.

Definition 7.

Let \mathcal{Q}\subseteq\mathcal{M}_{1} be a set of models under consideration (for example, a parametric set of distributions). For observations \mathbf{y} taking values in \mathbb{R}^{N}, we define the likelihood to be a fixed map L^{\mathrm{obs}}:\mathcal{Q}\times\mathbb{R}^{N}\to\mathbb{R}, measurable with respect to its second argument, such that \omega\mapsto L^{\mathrm{obs}}(\mathbb{Q}|\mathbf{y}(\omega)) is a version of the Radon–Nikodym derivative d\mathbb{Q}\|_{\sigma(\mathbf{y})}/d\bar{\mathbb{P}}\|_{\sigma(\mathbf{y})}.

Inspired by a ‘Bayesian’ approach, we augment this by the addition of a prior distribution over \mathcal{Q}. Suppose a (possibly improper) prior with density of the form \exp(-\gamma(\mathbb{Q})) is given. (As we have not specified a reference measure over \mathcal{Q}, we have not defined the prior density as a Radon–Nikodym derivative, and cannot integrate it over the class of models. Therefore, we do not require it to ‘integrate to 1’, that is, we allow an improper prior. This has no significant impact in what follows, as (2) normalizes away the effect of the reference distribution.) We then define the posterior relative density

L(\mathbb{Q}|\mathbf{y})=L^{\mathrm{obs}}(\mathbb{Q}|\mathbf{y})\exp(-\gamma(\mathbb{Q})).

We then define the “\mathcal{Q}|\mathbf{y}-divergence” to be the negative log-likelihood ratio (or log-posterior relative density)

\alpha_{\mathcal{Q}|\mathbf{y}}(\mathbb{Q}):=-\log\big(L(\mathbb{Q}|\mathbf{y})\big)+\sup_{\tilde{\mathbb{Q}}\in\mathcal{Q}}\Big\{\log\big(L(\tilde{\mathbb{Q}}|\mathbf{y})\big)\Big\}. (2)
Remark 4.

The right hand side of (2) is well defined whether or not a maximum a posteriori estimator exists (recall that a \mathcal{Q}-MAP, or maximum a posteriori estimator, is a map \mathbf{y}\to\hat{\mathbb{Q}}\in\mathcal{Q} such that L(\hat{\mathbb{Q}}|\mathbf{y})\geq L(\mathbb{Q}|\mathbf{y}) for all \mathbb{Q}\in\mathcal{Q}). Given a \mathcal{Q}-MAP \hat{\mathbb{Q}}, we would have the simpler representation

\alpha_{\mathcal{Q}|\mathbf{y}}(\mathbb{Q}):=-\log\Big(\frac{L(\mathbb{Q}|\mathbf{y})}{L(\hat{\mathbb{Q}}|\mathbf{y})}\Big).
Definition 8.

For fixed observations \mathbf{y}_{t}=(Y_{1},Y_{2},\dots,Y_{t}), for an uncertainty aversion parameter k>0 and exponent k^{\prime}\in[1,\infty], we define the convex expectation

\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}^{k,k^{\prime}}(\xi):=\sup_{\mathbb{Q}\in\mathcal{Q}}\Big\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathbf{y}_{t}]-\Big(\frac{1}{k}\alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q})\Big)^{k^{\prime}}\Big\}, (3)

where we adopt the convention x^{\infty}=0 for x\in[0,1] and +\infty otherwise.

We call \mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}^{k,k^{\prime}} the “DR-expectation” (with parameters k,k^{\prime}), where DR refers either to divergence-robust or data-driven robust. We may omit to write k,k^{\prime} for notational simplicity.
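As a toy illustration of (3), suppose \mathcal{Q} consists of iid Bernoulli(\theta) models for the observations, with a flat prior (\gamma\equiv 0) and \xi=Y_{t+1}, so that \mathbb{E}_{\mathbb{Q}}[\xi|\mathbf{y}_{t}]=\theta. The sketch below evaluates the DR-expectation by a grid search over \theta; all numerical choices are assumptions made purely for illustration.

```python
import numpy as np

def dr_expectation_bernoulli(ys, k=1.0, k_prime=2.0, grid_size=1001):
    """Evaluate (3) for the family Q = {Bernoulli(theta)} with flat prior and xi = Y_{t+1}."""
    ys = np.asarray(ys, dtype=float)
    thetas = np.linspace(1e-6, 1 - 1e-6, grid_size)
    # log-likelihood of the observed sequence under each candidate theta
    loglik = ys.sum() * np.log(thetas) + (len(ys) - ys.sum()) * np.log(1 - thetas)
    alpha = loglik.max() - loglik                  # divergence (2), shifted to have minimum zero
    penalty = (alpha / k) ** k_prime               # (alpha / k)^{k'}
    return np.max(thetas - penalty)                # sup over theta of E_theta[xi | y] - penalty

print(dr_expectation_bernoulli([1, 0, 1, 1, 0, 1], k=1.0, k_prime=2.0))
```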

Remark 5.

In the cases of interest for this paper, we shall assume that \mathcal{Q} is parameterized by some finite-dimensional real value, such that the divergence and conditional expectations given \mathbf{y}_{t} are continuous with respect to this parameterization, and are Borel measurable with respect to \mathbf{y}_{t}. This means that the measure theoretic concerns which arise from our definitions in terms of the Radon–Nikodym derivative and taking the supremum will not cause difficulty; in particular, the DR-expectation defined in (3) is guaranteed to be a Borel measurable function of \mathbf{y}_{t} for every \xi. (This follows from Filippov’s implicit function theorem.)

Remark 6.

In theory, we could now apply the DR-expectation framework to a filtering context as follows: take a collection of models \mathcal{Q}\subseteq\mathcal{M}_{M|\mathcal{Y}}. For a random variable \xi, and for each measure \mathbb{Q}\in\mathcal{Q}, compute \mathbb{E}_{\mathbb{Q}}[\xi|\mathbf{y}_{t}] and \alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q}). Taking a supremum as in (3), we obtain the DR-expectation. However, this is generally not computationally tractable in this form.

Lemma 1.

Let \{\mathcal{F}_{t}\}_{t\geq 0} be a filtration such that Y is adapted. For \mathcal{F}_{t}-measurable random variables, the choice of horizon T\geq t in the definition of the penalty function \alpha is irrelevant. That is, for \mathcal{F}_{t}-measurable \xi and any s\geq t,

\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}(\xi)=\sup_{\mathbb{Q}\in\mathcal{Q}}\Big\{\mathbb{E}_{\mathbb{Q}}[\xi|\mathbf{y}_{t}]-\Big(\frac{1}{k}\alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q}\|_{\mathcal{F}_{s}})\Big)^{k^{\prime}}\Big\},

where \alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q}\|_{\mathcal{F}_{s}}) is defined as above, in terms of the restricted measure \mathbb{Q}\|_{\mathcal{F}_{s}}.

Proof.

By definition, the likelihood is determined by the restriction of \mathbb{Q} to \mathcal{Y}_{t}\subset\mathcal{F}_{t}\subset\mathcal{F}_{s}, while the expectation depends only on the restriction of \mathbb{Q} to \mathcal{F}_{t}\subset\mathcal{F}_{s}. As these are the only terms needed to compute the DR-expectation, the result follows. ∎

Remark 7.

The purpose of the nonlinear expectation is to give an ‘upper’ estimate of a random variable, accounting for uncertainty in the underlying probabilities. This is closely related to robust estimation in the sense of Wald [35]. In particular, one can consider the robust estimator given by

\mathrm{arginf}_{\hat{\xi}\in\mathbb{R}^{N}}\mathcal{E}(\|\xi-\hat{\xi}\|^{2}|\mathcal{Y}_{t}),

which gives a ‘minimax’ estimate of \xi, given the observations \mathcal{Y}_{t} and a quadratic loss function. The advantage of the nonlinear expectation approach is that it allows one to construct such an estimate for every random variable/loss function, giving a cost-specific quantification of uncertainty in each case.

We can also see a connection with the theory of H^{\infty} filtering (see, for example, Grimble and El Sayed [20], or more recently Zhang, Xia and Shi [38] and references therein, or the more general H^{\infty}-control theory in Başar and Bernhard [2]). In this setting, we look for estimates which perform best in the worst case, where ‘worst’ is usually defined in terms of a perturbation to the input signal or coefficients. In our setting, we focus not on the estimation problem directly, but on the ‘dual’ problem of building an upper expectation, i.e. calculating the ‘worst’ expectation in terms of a class of perturbations to the coefficients (our setting is general enough that perturbation to the signal can also be included, through shifting the coefficients).

Remark 8.

There are also connections between our approach and what is called ‘risk-sensitive filtering’, see for example James, Baras and Elliott [25], Dey and Moore [11], or the review of Boel, James and Petersen [5] and references therein (from an engineering perspective) or Hansen and Sargent [22, 23] (from an economic perspective). In their setting, one uses the nonlinear expectation defined by

\mathcal{E}(\xi|\mathcal{Y}_{t})=-k\log\mathbb{E}_{\mathbb{P}}\big[\exp(-\xi/k)\big|\mathcal{Y}_{t}\big],

for some choice of robustness parameter 1/k>0. This leads to significant simplification, as dynamic consistency and recursivity are guaranteed in every filtration (see Graf [19] and Kupper and Schachermayer [30], and further discussion in Section 5) and the corresponding penalty function is given by the conditional relative entropy

\mathcal{R}_{t}(\mathbb{Q})=k\mathbb{E}_{\mathbb{Q}}[\log(d\mathbb{Q}/d\mathbb{P})|\mathcal{Y}_{t}],

the one-step penalty can be calculated accordingly. The optimization for the nonlinear expectation can be taken over \mathcal{M}_{1}, so this approach has a claim to be including ‘nonparametric’ uncertainty, as all measures are considered, rather than purely Markov measures or measures in a parametric family (however, the optimization can be taken over conditionally Markov measures, and one will obtain an identical result!).

The difficulty with this approach is that it does not allow for easy incorporation of knowledge of the error of estimation of the generator \mathfrak{A} in the level of robustness – the only parameter available to choose is k, which multiplies the relative entropy. A small choice of k corresponds to a small penalty, hence a very robust expectation, but this robustness is not directly linked to the estimation of the generators \mathfrak{A}. Therefore the impact of statistical estimation error remains obscure, as k is chosen largely independently of this error. For this reason, our approach, which directly allows for the penalty to be based on the statistical estimation of the generators, has advantages over this simpler method.
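For comparison with the DR-expectation sketch above, the risk-sensitive (entropic) expectation \mathcal{E}(\xi|\mathcal{Y}_{t})=-k\log\mathbb{E}_{\mathbb{P}}[\exp(-\xi/k)|\mathcal{Y}_{t}] is straightforward to evaluate once a filter state is available; the following minimal sketch, with an assumed filter state and payoff, illustrates how the single parameter k controls the degree of robustness.

```python
import numpy as np

def entropic_expectation(xi_values, probs, k):
    """Risk-sensitive expectation -k log E[exp(-xi/k)] under the (filtered) law `probs`.

    xi_values : payoff xi(e_i) in each hidden state
    probs     : conditional probabilities of the states, e.g. the filter state p_t
    k         : robustness parameter; large k recovers the ordinary expectation
    """
    return -k * np.log(np.sum(probs * np.exp(-xi_values / k)))

xi = np.array([0.0, 1.0])        # assumed payoff in the two states
p = np.array([0.3, 0.7])         # assumed filter state p_t
for k in [0.1, 1.0, 10.0, 100.0]:
    print(k, entropic_expectation(xi, p, k))
```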

4 Recursive penalties

The DR-expectation provides us with an approach to including statistical estimation in our valuations. However, the calculations suggested by Remark 6 are generally intractable in their stated form. In this section, we shall see how the DR-expectation approach, and an approach with a constant penalty \mathcal{R}, specialize in a filtering setting.

The class of models we shall consider are based on two key questions:

  1.

    “Static or Dynamic generators?” Are the generators A and C static (through time) and unknown (so can be estimated) or are they not only unknown but dynamically changing (and we depend principally on prior information about their likely behaviour)?

    In other words, do our models come from \mathcal{M}_{M} (static generators) or from \mathcal{M}_{M|\mathcal{Y}} (dynamic generators)?

  2.

    “Uncertain prior (UP) or DR-expectations?” Do we (i) have a fixed penalty on the ‘reasonableness’ of a model, and then use new observations to update within each model using Bayes’ theorem, or are we (ii) attempting to determine which model is reasonable using our observations, while simultaneously updating our model (with the same observations).

    In other words, is our penalty \mathcal{R} constant (UP framework) or does it change with new observations (DR framework)?

For practical purposes, it is critical that we refine our approach to provide a recursive construction of our nonlinear expectation. In classical filtering, one obtains a recursion for expectations \mathbb{E}[\phi(X_{t})|\mathcal{Y}_{t}], for Borel functions \phi; one does not typically consider the expectations of general random variables. In the same way, we will consider the expectations of random variables \phi(X_{t}).

It is clear that we can consider \mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}(\phi(X_{t})) as a nonlinear expectation with the probability space being only the states of X_{t}. By Theorem 1, it follows that, for each t, there exists a \mathcal{Y}_{t}\otimes\mathcal{B}(\mathbb{R})-measurable function \kappa_{t} such that

\begin{split}\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}(\phi(X_{t}))&:=\sup_{\mathbb{Q}\in\mathcal{Q}}\Big\{\mathbb{E}_{\mathbb{Q}}[\phi(X_{t})|\mathbf{y}_{t}]-\mathcal{R}_{t}(\mathbb{Q})\Big\}\\&=\sup_{q\in S_{N}^{+}}\Big\{\sum_{i}q_{i}\phi(e_{i})-\Big(\frac{1}{k}\kappa_{t}(\omega,q)\Big)^{k^{\prime}}\Big\},\end{split} (4)

where S_{N}^{+} denotes the probability simplex in \mathbb{R}^{N}, that is, S_{N}^{+}=\{x\in\mathbb{R}^{N}:\sum_{i}x_{i}=1,\ x_{i}\geq 0\ \forall i\}. Our aim is to find a recursion for \kappa_{t}, for various choices of \mathcal{R}. Without loss of generality, we will write

\mathcal{R}_{t}(\mathbb{Q})=\Big(\frac{1}{k}\alpha_{(...)}(\mathbb{Q})\Big)^{k^{\prime}},

where \alpha has an appropriate set of arguments, as this gives consistent notation between the DR-expectation and Uncertain Prior settings.
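To indicate how the representation (4) can be used numerically, the sketch below evaluates \mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}(\phi(X_{t})) from a penalty \kappa_{t} tabulated on a grid over the simplex (here N=2, so the simplex is parameterized by its first coordinate); the particular penalty and function \phi are assumptions made purely for illustration.

```python
import numpy as np

def nonlinear_expectation(phi_values, kappa, k=1.0, k_prime=2.0, grid_size=501):
    """Evaluate (4): sup over q in the simplex of sum_i q_i phi(e_i) - (kappa(q)/k)^{k'}.

    phi_values : array of phi(e_i), here for N = 2 states
    kappa      : function of q1 = q[0] in [0, 1], giving the penalty kappa_t(q)
    """
    q1 = np.linspace(0.0, 1.0, grid_size)
    inner = q1 * phi_values[0] + (1 - q1) * phi_values[1]      # sum_i q_i phi(e_i)
    penalty = (np.vectorize(kappa)(q1) / k) ** k_prime
    return np.max(inner - penalty)

# Illustrative penalty: quadratic around an assumed filter estimate p_t = (0.7, 0.3).
kappa_t = lambda q1: 5.0 * (q1 - 0.7) ** 2
print(nonlinear_expectation(np.array([1.0, 0.0]), kappa_t, k=1.0, k_prime=2.0))
```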

If the generators are assumed to be static but unknown, the following definition will prove useful.

Definition 9.

Let K_{t}:\Omega\times S_{N}^{+}\times\mathbb{A}\to\mathbb{R} denote an extended penalty function, which encodes the penalty associated with a state X_{t}|\mathcal{Y}_{t}\sim p_{t}\in S_{N}^{+} and a generator \mathfrak{A}\in\mathbb{A}. The penalty \kappa_{t}:\Omega\times S_{N}^{+}\to\mathbb{R} is obtained from K_{t} through the relation

\kappa_{t}(\omega,p)=\inf_{\mathfrak{A}\in\mathbb{A}}K_{t}(\omega,p,\mathfrak{A}). (5)

The reason for this definition is that, in a static generator framework, it is K, not \kappa, which will satisfy a recursive equation. We note, however, that the dimension of (p,\mathfrak{A}) is larger, and typically much larger, than N-1=\mathrm{dim}\,S_{N}^{+}, which leads to difficulty in implementation.

Our constructions will depend on the following object.

Definition 10.

For each generator \mathfrak{A}=(A,C)\in\mathbb{A} and each p\in S_{N}^{+}, we define the random set

(p)_{t}^{\overleftarrow{\mathfrak{A}}}=\big\{p_{t-1}\in S_{N}^{+}:C(Y_{t})Ap_{t-1}\propto p\big\}.

By recursion, we extend this to all s<t:

(p)_{t}^{(\overleftarrow{\mathfrak{A}},s-1)}=\big\{p_{s-1}\in S_{N}^{+}:C(Y_{s})Ap_{s-1}\propto p_{s}\in(p)_{t}^{(\overleftarrow{\mathfrak{A}},s)}\big\}.

The set (p)_{t}^{\overleftarrow{\mathfrak{A}}} represents the filter states at time t-1 which evolve to p at time t. Clearly this is only known at time t, as it depends on Y_{t}. Similarly, we think of (p)_{t}^{(\overleftarrow{\mathfrak{A}},s)} as the set of filter states at time s which would evolve to the state p at time t, given the generator \mathfrak{A} and the observations \{Y_{r}\}_{r=s+1}^{t}. This set may be empty, if no such filter states exist. As C is a diagonal matrix, if its entries do not vanish (this corresponds to there being no state which yields a zero likelihood of the observed value) then it is invertible. However, the matrix A will often not be an invertible matrix, so (p)_{t}^{(\overleftarrow{\mathfrak{A}},s)} is not generally a singleton.
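When the transition matrix A happens to be invertible and C(Y_{t}) has nonvanishing entries, the one-step set (p)_{t}^{\overleftarrow{\mathfrak{A}}} contains at most one point, which the following sketch computes; for non-invertible A one would instead have to search over the simplex. This is an illustrative, assumption-laden helper rather than a construction from the text.

```python
import numpy as np

def one_step_preimage(p, y, A, obs_density, tol=1e-10):
    """Return the unique p_{t-1} with C(y) A p_{t-1} proportional to p, if it lies in the simplex.

    Assumes A is invertible and all observation densities c(y; e_i) are positive.
    """
    c = np.array([obs_density(y, i) for i in range(len(p))])
    candidate = np.linalg.solve(A, p / c)          # solve C(y) A q proportional to p, up to scaling
    if np.all(candidate >= -tol) and candidate.sum() > tol:
        return candidate / candidate.sum()         # normalise back onto the simplex
    return None                                    # the preimage set is empty
```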

4.1 Filtering with Uncertain Priors

We shall first consider the case where we assume the filtering equations apply, and the only uncertainty in our model is given by our uncertainty over the prior inputs to the filter. In particular, this “prior uncertainty” is not updated given new observations, and \mathcal{R} is a fixed function of \mathbb{Q}.

We can now consider our two cases: with static and dynamic generators.

4.1.1 Static Generators (StaticUP)

With a static generator, our approach is simple.

Definition 11.

In a StaticUP setting, the inputs to the filtering problem are the initial filter state p_{0} and the generator \mathfrak{A}. We therefore take a penalty \mathcal{R}(\mathbb{Q})=(k^{-1}\alpha(\mathbb{Q}))^{k^{\prime}} where

\alpha(\mathbb{Q})=\gamma(p_{0}^{\mathbb{Q}},\mathfrak{A}^{\mathbb{Q}})

for some prescribed penalty \gamma.

Remark 9.

Inspired by the DR-expectation, a natural choice of penalty function \gamma is the negative log-density of a prior distribution for the inputs (p_{0},\mathfrak{A}), shifted to have minimal value zero. Alternatively, taking an empirical Bayesian perspective, \gamma could be the log-likelihood from a prior calibration process. In this case, we are directly using our statistical uncertainty in the parameters in our understanding of the filtering problem.

Lemma 2.

If the extended penalty is defined by

K_{t}(p_{t},\mathfrak{A})=\inf_{p_{0}\in(p_{t})^{(\overleftarrow{\mathfrak{A}},0)}}\Big\{\gamma(p_{0},\mathfrak{A})\Big\},

with the convention \inf\emptyset=\infty, then writing \kappa_{t}(p)=\inf_{\mathfrak{A}\in\mathbb{A}}K_{t}(p,\mathfrak{A}), we satisfy the dynamic updating equation (4).

Proof.

From (4), we see that it is enough to guarantee that

\begin{split}\kappa_{t}(q)&=\inf_{\mathbb{Q}\in\mathcal{M}_{M}}\Big\{\alpha(\mathbb{Q}):X_{t}|\mathcal{Y}_{t}\sim q\text{ under }\mathbb{Q}\Big\}\\&=\inf_{\mathfrak{A}\in\mathbb{A},\,p_{0}\in S_{N}^{+}}\Big\{\alpha(\mathbb{Q}^{\mathfrak{A},p_{0}}):X_{t}|\mathcal{Y}_{t}\sim q\text{ under }\mathbb{Q}^{\mathfrak{A},p_{0}}\Big\}\\&=\inf_{\mathfrak{A}\in\mathbb{A}}\inf_{p_{0}\in(q)_{t}^{(\overleftarrow{\mathfrak{A}},0)}}\Big\{\gamma(p_{0},\mathfrak{A})\Big\}=\inf_{\mathfrak{A}\in\mathbb{A}}K_{t}(q,\mathfrak{A}).\end{split}

The result then follows from (5). ∎

The extended penalty K is useful, as it can be calculated recursively.

Theorem 2.

The extended penalty K satisfies the recursion

K_{t}(p,\mathfrak{A})=\inf_{p_{t-1}\in(p)_{t}^{\overleftarrow{\mathfrak{A}}}}\Big\{K_{t-1}(p_{t-1},\mathfrak{A})\Big\}.
Proof.

This is immediate from the definition of (p)_{t}^{\overleftarrow{\mathfrak{A}}}. ∎

Remark 10.

We emphasize that there is no learning of \mathfrak{A} being done in this framework – the penalty on \mathfrak{A} applied at time 0 is simply propagated forward; our observations do not affect our opinion of its likely value. Furthermore, we are not adjusting our prior penalty to account for the ‘unreasonableness’ of our models as we observe data. In particular, if we assume no knowledge of the initial state (i.e. a zero penalty), then we will have no knowledge of the state at time t (unless the observations cause the filter to degenerate).

Example 1.

We take the class of models in \mathcal{M}_{M} where A and C are perfectly known, and A=I, so X_{t}=X_{0} is constant (but X_{0} is unknown). We take N=2, so X takes only one of two values. Finally, we assume that

Y_{t}|(X_{t}=e_{1})\sim\mathrm{Bernoulli}(a),\qquad Y_{t}|(X_{t}=e_{2})\sim\mathrm{Bernoulli}(b),

where a,b\in(0,1). Effectively, in this example we are using filtering to determine which of two possible means is the correct mean for our observation sequence. It is worth emphasising that the filter process p corresponds to the posterior probabilities, in a Bayesian setting, of the events that our Bernoulli process has parameter a or b.

It will be useful to note that, from classical Bayesian statistical calculations (one can derive the stated formula using the filtering equations, for the vector p_{t}=(p_{t}^{1},p_{t}^{2})^{\top}; however, the closed-form solution given here is more easily obtained using alternative methods for Bayesian hypothesis testing, which is effectively what this problem encodes), for a given p_{0}, one can see that the corresponding value of p_{t} is determined from the log odds ratio (writing \bar{Y}=t^{-1}\sum_{s\leq t}Y_{s} for the empirical mean of the observations)

\log\Big(\frac{p_{t}^{1}}{p_{t}^{2}}\Big)=\log\Big(\frac{p_{0}^{1}}{p_{0}^{2}}\Big)+t\bar{Y}\log\Big(\frac{a}{b}\Big)+t(1-\bar{Y})\log\Big(\frac{1-a}{1-b}\Big).

To write down the StaticUP penalty function, let the (known) dynamics be described by \mathfrak{A}^{*}. Consequently, we can write K(p,\mathfrak{A})=\infty for all \mathfrak{A}\neq\mathfrak{A}^{*}. As \mathfrak{A}^{*} is known, there is no distinction between K and \kappa, so

\kappa_{t}(p)=K_{t}(p,\mathfrak{A}^{*})=\inf_{p_{0}\in(p)_{t}^{(\overleftarrow{\mathfrak{A}^{*}},0)}}\Big\{\gamma(p_{0},\mathfrak{A}^{*})\Big\}.

We initialize with a known penalty \gamma(p,\mathfrak{A}^{*})=\kappa_{0}(p) for all p\in S_{N}^{+}.

In this setting, we can express our penalty in terms of the log-odds, for the sake of notational simplicity given the closed-form solution to the filtering problem, and hence can explicitly calculate (p_{t})^{(\overleftarrow{\mathfrak{A}^{*}},0)}, which here contains only a single point. In particular, the time-t penalty is given by a shift of the initial penalty:

\kappa_{t}\bigg(\log\Big(\frac{p_{t}^{1}}{p_{t}^{2}}\Big)\bigg)=\kappa_{0}\bigg(\log\Big(\frac{p_{t}^{1}}{p_{t}^{2}}\Big)-t\bar{Y}\log\Big(\frac{a}{b}\Big)-t(1-\bar{Y})\log\Big(\frac{1-a}{1-b}\Big)\bigg).
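A minimal numerical sketch of this shift, under the Bernoulli model of Example 1 and with an illustrative quadratic initial penalty expressed in log-odds coordinates (the penalty and the observation sequence are assumptions):

```python
import numpy as np

def static_up_penalty(log_odds_t, ys, a, b, kappa0):
    """Example 1 (StaticUP): the time-t penalty is the initial penalty evaluated at the
    log-odds shifted back by the accumulated log-likelihood ratio of the observations."""
    ys = np.asarray(ys, dtype=float)
    shift = ys.sum() * np.log(a / b) + (len(ys) - ys.sum()) * np.log((1 - a) / (1 - b))
    return kappa0(log_odds_t - shift)

kappa0 = lambda z: 0.5 * z ** 2           # assumed initial penalty on the prior log-odds
ys = [1, 1, 0, 1, 0, 1, 1]                # made-up observation sequence
print(static_up_penalty(0.0, ys, a=0.7, b=0.4, kappa0=kappa0))
```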
Remark 11.

This example demonstrates the following behaviour:

  • If the initial penalty is zero, then the penalty at time t is also zero – there is no learning of which state we are in.

  • There is no variation in the curvature of the penalty (and so no change in our ‘uncertainty’); we simply shift the penalty around, corresponding to our changing posterior probabilities.

  • The update of \kappa is done purely using the tools of Bayesian statistics, rather than having any direct incorporation of our uncertainty.

Remark 12.

We point out that this is, effectively, the model of uncertainty proposed by Walley [36] (see in particular Section 5.3, although he there takes a model where the unknown parameter is Beta distributed). See also Fagin and Halpern [15].

4.1.2 Dynamic generators (DynamicUP)

If we model the generator \mathfrak{A} as fixed and unknown, calculation of K_{t}(p,\mathfrak{A}) suffers from a curse of dimensionality. We also need to assume that \mathfrak{A} is constant through time, which may be a dubious assumption in practice.

In this case, we can obtain a practical model by allowing \mathfrak{A} to vary independently at each point in time. Superficially, this significantly worsens the curse of dimensionality, as we no longer take a fixed \mathfrak{A}, but regard it as a process through time. The advantage of this is that we can then use dynamic programming to calculate the penalty \kappa_{t}(p_{t}).

In order to include our knowledge of the generator, we will write \mathfrak{A}_{t} for the generator applicable at time t. Recall that this is a process taking values in \mathbb{A}, and we write \mathcal{A}_{\mathcal{Y}} for the space of such processes adapted to the observation filtration. For \mathfrak{A}\in\mathcal{A}_{\mathcal{Y}}, we define (p)_{t}^{(\overleftarrow{\mathfrak{A}},0)} using the natural analogue of Definition 10 incorporating time dependence.

Definition 12.

In the DynamicUP setting, for an initial penalty on the initial hidden state, \kappa_{0}(p_{0}), and a penalty on the time-t generator, \gamma_{t}(\mathfrak{A}_{t}), our total penalty is given by

\alpha(\mathbb{Q})=\kappa_{0}(p_{0}^{\mathbb{Q}})+\sum_{s=1}^{\infty}\gamma_{s}(\mathfrak{A}_{s}^{\mathbb{Q}}).
Theorem 3.

We can define a dynamic penalty

\kappa_{t}(p_{t})=\inf_{\mathfrak{A}\in\mathcal{A}_{\mathcal{Y}}}\bigg\{\inf_{p_{0}\in(p_{t})^{(\{\overleftarrow{\mathfrak{A}}_{t}\},0)}}\Big\{\kappa_{0}(p_{0})+\sum_{s=1}^{t}\gamma_{s}(\mathfrak{A}_{s})\Big\}\bigg\}, (6)

which satisfies (4). Furthermore, the penalty satisfies the recursion

\kappa_{t}(p_{t})=\inf_{\mathfrak{A}_{t}\in\mathbb{A}}\bigg\{\inf_{p_{t-1}\in(p_{t})^{\overleftarrow{\mathfrak{A}}_{t}}}\Big\{\kappa_{t-1}(p_{t-1})+\gamma_{t}(\mathfrak{A}_{t})\Big\}\bigg\}. (7)
Proof.

From Lemma 1 it is clear that, with our definition of \kappa_{t}, (4) can be obtained as a reparameterization of the nonlinear expectation. The recursion (7) follows by the definition of (p)_{t}^{\overleftarrow{\mathfrak{A}}_{t}} and standard dynamic programming arguments, as in the StaticUP case. ∎

This formulation of our problem allows us to use dynamic programming to propagate the penalty forward in time. In the setting of Example 1, as the dynamics \mathfrak{A} are perfectly known, there is no distinction between the dynamic and static cases.
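The recursion (7) lends itself to an approximate forward dynamic programming scheme: discretize the simplex and the set of candidate generators, push each grid point forward through the filter update, and keep the smallest accumulated penalty landing near each grid point. The following is a rough sketch of this idea for the two-state case; the grids, the penalty \gamma_{t} and the nearest-point projection are numerical assumptions rather than part of the theory.

```python
import numpy as np

def dynamic_up_step(kappa_prev, y, grid, candidate_As, gamma_t, obs_density):
    """One step of recursion (7) on a grid over the 2-state simplex (parameterized by p1).

    kappa_prev   : array of kappa_{t-1} at each grid point p1
    candidate_As : list of candidate transition matrices A_t (the uncertain generator)
    gamma_t      : function A -> penalty on choosing generator A at time t
    """
    kappa_next = np.full_like(kappa_prev, np.inf, dtype=float)
    c = np.array([obs_density(y, i) for i in range(2)])         # observation densities c(y; e_i)
    for j, p1 in enumerate(grid):
        p_prev = np.array([p1, 1 - p1])
        for A in candidate_As:
            p_next = c * (A @ p_prev)
            p_next = p_next / p_next.sum()                      # filter update (1)
            idx = np.argmin(np.abs(grid - p_next[0]))           # project onto nearest grid point
            cost = kappa_prev[j] + gamma_t(A)
            kappa_next[idx] = min(kappa_next[idx], cost)
    return kappa_next
```

Iterating this map over the observations, starting from \kappa_{0}, yields an approximation of the DynamicUP penalty on the chosen grid.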

Remark 13.

The dynamic formulation adds penalties together, so if the same penalty on \mathfrak{A} is used in the static and dynamic settings, then the dynamic setting will typically have a higher penalty than the static setting. Practically, this effect is lessened by the minimization in (6), and the fact that the filter has good ergodic properties (as discussed by van Handel [34]). This ergodicity implies that, at time t, the filter will not significantly depend on the generator \mathfrak{A}_{s} for s\ll t, and so the minimization will render the penalty at s irrelevant.

A continuous-time version of this setting is considered in [1].

4.2 Filtering with DR-expectations

In the above, we have regarded the prior as uncertain, and used this to penalize over models. We did not use the data to modify our penalty function, but simply to revise within each model. The DR-expectation approach gives us an alternative, in which the data guides our model choice more directly. In what follows, we will apply the DR-expectation in our filtering context, and observe that it gives a slightly different recursion for the penalty function. Again, we can consider models where \mathfrak{A} is regarded as static or as dynamic.

4.2.1 Static generators (StaticDR)

For a fixed \mathfrak{A}, we have already written the likelihood function (Proposition 2). As we are working in the observation filtration, this gives the natural decomposition of the likelihood for our calculation. This leads us to a simple formulation of the penalty \alpha_{\mathcal{Q}|\mathbf{y}_{t}}.

Lemma 3.

Suppose the initial filter state p_{0} and static generator \mathfrak{A} are distributed according to the prior density \exp(-\gamma(p_{0},\mathfrak{A})). Then the \mathcal{Q}|\mathbf{y}_{t}-divergence of the measure \mathbb{Q}^{(p_{0},\mathfrak{A})} (i.e. normalized log-posterior density of the parameters (p_{0},\mathfrak{A})) given observations \mathbf{y}_{t}=(Y_{1},\dots,Y_{t}) can be written

\alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q}^{(p_{0},\mathfrak{A})})=\gamma(p_{0},\mathfrak{A})-\sum_{s\leq t}\log\Big(c_{\mathfrak{A}}(Y_{s};A^{\mathfrak{A}}p_{s-1}^{\mathfrak{A},p_{0}})\Big)+m_{t},

where m is a sequence of normalizing values, independent of \mathfrak{A} and p_{0}, such that \alpha_{\mathcal{Q}|\mathbf{y}_{t}} has minimal value zero, and p_{s}^{\mathfrak{A},p_{0}} is the solution to the filtering equations with the given generator \mathfrak{A} and initial filter state p_{0}.

Proof.

The divergence is given by the negative posterior log-density of (p_{0},\mathfrak{A}), shifted to have minimal value zero. From Bayes’ rule, this is simply the sum of the negative prior log-density and the negative log-likelihood. The likelihood is stated in Proposition 2, and gives us the term \sum_{s\leq t}\log\big(c_{\mathfrak{A}}(Y_{s};A^{\mathfrak{A}}_{s}p_{s-1}^{\mathfrak{A},p_{0}})\big). The prior then contributes the term \gamma(p_{0},\mathfrak{A}), as desired. ∎

Remark 14.

In this framework, we have not assumed we can propagate our uncertainty using the filtering equations, as in an uncertain prior setting – we are instead simply evaluating the DR-expectation conditional on our observations. Nevertheless, the filtering equations naturally appear through their presence in the likelihood function, and so will still form part of our dynamic penalty.

Theorem 4.

Consider the family of models with a fixed, but unknown, generator \mathfrak{A}. The DR-expectation can then be calculated as in (4), writing \kappa_{t}(p)=\inf_{\mathfrak{A}\in\mathbb{A}}K_{t}(p,\mathfrak{A}), where

K_{t}(p,\mathfrak{A}):=\inf_{p_{0}\in(p)_{t}^{(\overleftarrow{\mathfrak{A}},0)}}\Big\{\gamma(p_{0},\mathfrak{A})-\sum_{s\leq t}\log\Big(c_{\mathfrak{A}}(Y_{s};A^{\mathfrak{A}}_{s}p_{s-1}^{\mathfrak{A},p_{0}})\Big)+m_{t}\Big\}.

Furthermore, K has recursive representation

K_{t}(p,\mathfrak{A})=\inf_{p_{t-1}\in(p)_{t}^{\overleftarrow{\mathfrak{A}}}}\Big\{K_{t-1}(p_{t-1},\mathfrak{A})-\log\Big(c_{\mathfrak{A}}(Y_{t};A^{\mathfrak{A}}p_{t-1})\Big)\Big\}+m_{t},

where K_{0}(p_{0},\mathfrak{A})=\gamma(p_{0},\mathfrak{A}) and m_{t} is a constant to ensure \inf_{p_{t},\mathfrak{A}}K_{t}(p_{t},\mathfrak{A})=0.

Proof.

As we are looking at a function of the current hidden state, we are interested in the minimal penalty associated with each p_{t} (as any larger penalty will be ignored by the supremum when calculating the DR-expectation). Given our definition of K, as in the StaticUP case (Lemma 2), we observe that \kappa_{t}(p)=\inf_{\mathfrak{A}\in\mathbb{A}}\{K_{t}(p,\mathfrak{A})\} satisfies (4) by rearrangement. Using the definition of (p)_{t}^{\overleftarrow{\mathfrak{A}}} and the recursive nature of the filter, by the usual dynamic programming arguments it is easy to see that we can calculate the penalty recursively,

K_{t}(p,\mathfrak{A})=\inf_{p_{t-1}\in(p)_{t}^{\overleftarrow{\mathfrak{A}}}}\Big\{K_{t-1}(p_{t-1},\mathfrak{A})-\log\Big(c_{\mathfrak{A}}(Y_{t};A^{\mathfrak{A}}p_{t-1})\Big)\Big\}+m_{t}

with the initial value given by the prior penalty K_{0}(p_{0},\mathfrak{A})=\gamma(p_{0},\mathfrak{A}). ∎

Remark 15.

Comparing the results of Theorem 2 with Theorem 4, we see that the key distinction is the presence of the log-likelihood term \log\big(c_{\mathfrak{A}}(Y_{s};A^{\mathfrak{A}}_{s}p_{s-1}^{\mathfrak{A}})\big). This term implies that observations of Y will affect our quantification of uncertainty, rather than purely updating each model. We also note that the filtering equations have arisen naturally as a part of the penalty (through their appearance in the likelihood function adapted to \mathcal{Y}), rather than us assuming that the filtering equations are the ‘correct’ method for updating in the presence of uncertainty.

Example 2.

In the setting of Example 1, recall that X is constant, so we know \mathfrak{A}. One can calculate the penalty either directly, or through solving the stated recursion using the dynamics of p. The resulting penalty is given by first calculating p_{0} from p_{t} through

\log\Big(\frac{p_{0}^{1}}{p_{0}^{2}}\Big)=\log\Big(\frac{p_{t}^{1}}{p_{t}^{2}}\Big)-t\bar{Y}\log\Big(\frac{a}{b}\Big)-t(1-\bar{Y})\log\Big(\frac{1-a}{1-b}\Big)

and then

\kappa_{t}(p_{t})=\kappa_{0}(p_{0})-\log\Big(p_{0}^{1}a^{t\bar{Y}}(1-a)^{t(1-\bar{Y})}+p_{0}^{2}b^{t\bar{Y}}(1-b)^{t(1-\bar{Y})}\Big)-m_{t},

where m_{t} is chosen to ensure \inf_{p}\kappa_{t}(p)=0. From this, we can see that the likelihood will modify our uncertainty directly, rather than us simply propagating each model via Bayes’ rule. A consequence of this is that, if we start with extreme uncertainty (\kappa_{0}\equiv 0), then our observations will teach us what models are reasonable, thereby reducing our uncertainty (i.e. we will find \kappa_{t}(p)>0 for p\in(0,1) when t>0).
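A sketch of this calculation, parameterized directly by a candidate value of p_{0}^{1} (which is in one-to-one correspondence with p_{t} via the log-odds shift above); the initial penalty \kappa_{0} and the observations are assumptions for illustration:

```python
import numpy as np

def static_dr_penalty(p0_grid, ys, a, b, kappa0):
    """Example 2 (StaticDR): kappa_0(p_0) minus the log-likelihood of the observations,
    normalized so the minimum over the candidate p_0 values is zero."""
    ys = np.asarray(ys, dtype=float)
    ones, zeros = ys.sum(), len(ys) - ys.sum()
    loglik = np.log(p0_grid * a**ones * (1 - a)**zeros
                    + (1 - p0_grid) * b**ones * (1 - b)**zeros)
    alpha = kappa0(p0_grid) - loglik
    return alpha - alpha.min()            # subtract m_t so the minimal penalty is zero

p0_grid = np.linspace(1e-3, 1 - 1e-3, 999)
kappa0 = lambda p: np.zeros_like(p)       # extreme initial uncertainty: kappa_0 = 0
penalty = static_dr_penalty(p0_grid, [1, 1, 0, 1, 1], a=0.7, b=0.4, kappa0=kappa0)
print(penalty.min(), penalty.max())       # the observations now discriminate between models
```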

4.2.2 Dynamic generators (DynamicDR)

As in the uncertain priors case, it is often practically inconvenient to calculate a recursion for K(p,\mathfrak{A}), given the high dimension of \mathfrak{A}. We can avoid this by allowing \mathfrak{A} to vary through time. In order to place this within the DR-expectation framework, we consider the following models.

Definition 13.

Suppose \{\mathfrak{A}_{t}\}_{t>0} is a sequence, selected according to distributions with conditional density

\mathfrak{A}_{t}|\{\mathfrak{A}_{s},Y_{s}\}_{s<t}\sim\exp(-\gamma(\mathfrak{A}_{t};\{Y_{s}\}_{s<t})),

for \gamma a known function. Note that this structure implies \{\mathfrak{A}_{t}\}_{t>0}\in\mathcal{A}_{\mathcal{Y}}.

In addition, suppose that p_{0} is selected from a distribution with density \exp(-\pi(p_{0})) on S_{N}^{+}, independently of \{\mathfrak{A}_{t}\}_{t>0}.

This assumption allows us to decouple the choice of generators at different times, leading to a significant reduction in complexity. The prior penalty, represented by (\gamma,\pi), is assumed to be known, potentially from calibration of the model.

Theorem 5.

Consider models where p_{0} and \{\mathfrak{A}_{t}\}_{t>0} satisfy Definition 13. In a DR-expectation setting, a dynamic penalty \kappa satisfying (4) can be obtained from the recursion

\begin{split}\kappa_{t}(p_{t})=\inf_{\mathfrak{A}_{t}}\bigg\{\inf_{p_{t-1}\in(p_{t})^{\overleftarrow{\mathfrak{A}}_{t}}}\Big\{\kappa_{t-1}(p_{t-1})&+\gamma_{t}(\mathfrak{A}_{t};\{Y_{s}\}_{s<t})\\&-\log\Big(c_{\mathfrak{A}}(Y_{t};A^{\mathfrak{A}}_{t}p_{t-1})\Big)\Big\}\bigg\}-m_{t},\end{split}

with initial value \kappa_{0}(p)=\pi(p), where m_{t} is chosen to ensure \inf_{p\in S_{N}^{+}}\kappa_{t}(p)=0 for all t.

Proof.

Using the same logic as in Lemma 3, and using Lemma 1 to restrict our horizon to t, we observe that the divergence, for a model \mathbb{Q} with generator \mathfrak{A}=\{A_{s},C_{s}\}_{s>0}, is given by

\alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q})=\pi(p_{0})+\sum_{0<s\leq t}\bigg(-\log\Big(c_{\mathfrak{A}}(Y_{s};A_{s}p_{s-1}^{\mathfrak{A},p_{0}})\Big)+\gamma_{s}(\mathfrak{A}_{s};\{Y_{r}\}_{r<s})\bigg)-m_{t},

where p_{s}^{\mathfrak{A},p_{0}} is the corresponding solution to the filtering equations, and m_{t} is a normalizing constant.

As in the static generator case, we can then reparameterize in terms of the terminal value of the filter, and notice that we are only interested in the minimal penalty for a given p_{t}. Comparing with (4), we have

\begin{split}\kappa_{t}(p)=\inf_{\mathfrak{A}\in\mathbb{A}}\bigg\{&\inf_{p_{0}\in(p)_{t}^{(\overleftarrow{\mathfrak{A}},0)}}\bigg\{\pi(p_{0})\\&+\sum_{0<s\leq t}\bigg(-\log\Big(c_{\mathfrak{A}}(Y_{s};A^{\mathfrak{A}}_{s}p_{s-1}^{\mathfrak{A},p_{0}})\Big)+\gamma_{s}(\mathfrak{A}_{s};\{Y_{r}\}_{r<s})\bigg)\bigg\}\bigg\}-m_{t}.\end{split}

The recursion then follows by the usual dynamic programming arguments. ∎

Remark 16.

We expect that there will be less difference between the dynamic uncertain prior and dynamic DR-expectation settings than between the static uncertain prior and static DR-expectation settings. This is because there is only limited learning possible in the dynamic DR-expectation, as the underlying generator is independently given at every time, so the DR-expectation has only one observation with which to infer its behaviour. This increases the relative importance of the prior term \gamma, which describes our understanding of typical values of the generator. In practice, the key distinction between the dynamic DR-expectation and uncertain prior models appears to be when the initial penalty is near zero – in this case, the DR-expectation regularizes the initial state quickly, while under the uncertain prior model the penalty may remain near zero indefinitely.

Example 3.

In the setting of Example 1, as the dynamics are perfectly known, there is again no difference between the dynamic and static generator DR-expectation cases.

Remark 17.

In these sections, we have considered the case where the generator is either dynamic or static. Of course, further variations can be had, by allowing the generator to have some parts static and others dynamic, or to be dynamic but with penalty determined by an unknown static parameter. These perturbations give this approach a high degree of flexibility for practical modelling.

5 Expectations of the future

The nonlinear expectations constructed above do not consider how the future will evolve. In particular, we have focussed our attention on calculating \mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}(\phi(X_{t})), that is, on calculating the expectation of functions of the current hidden state. In other words, we can consider our nonlinear expectation as a mapping

\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}:L^{\infty}(\sigma(X_{t})\otimes\mathcal{Y}_{t})\to L^{\infty}(\mathcal{Y}_{t}).

If we wish to calculate expectations of future states, then we may wish to consider doing so in a filtration-consistent manner. This is of particular importance when considering optimal control problems.

Definition 14.

For a fixed horizon T>0T>0, suppose that for each t<Tt<T we have a mapping (|𝒴t):L(𝒴T)L(𝒴t)\mathcal{E}(\cdot|\mathcal{Y}_{t}):L^{\infty}(\mathcal{Y}_{T})\to L^{\infty}(\mathcal{Y}_{t}). We say that \mathcal{E} is a 𝒴\mathcal{Y}-consistent convex expectation if (|𝒴t)\mathcal{E}(\cdot|\mathcal{Y}_{t}) satisfies the following assumptions, analogous to those above,

  • Strict Monotonicity: for any ξ1,ξ2L(𝒴T)\xi_{1},\xi_{2}\in L^{\infty}(\mathcal{Y}_{T}), if ξ1ξ2\xi_{1}\geq\xi_{2} a.s. then (ξ1|𝒴t)(ξ2|𝒴t)\mathcal{E}(\xi_{1}|\mathcal{Y}_{t})\geq\mathcal{E}(\xi_{2}|\mathcal{Y}_{t}) a.s. and if in addition (ξ1|𝒴t)=(ξ2|𝒴t)\mathcal{E}(\xi_{1}|\mathcal{Y}_{t})=\mathcal{E}(\xi_{2}|\mathcal{Y}_{t}) then ξ1=ξ2\xi_{1}=\xi_{2} a.s.

  • Constant triviality: for KL(𝒴t)K\in L^{\infty}(\mathcal{Y}_{t}), (K|𝒴t)=K\mathcal{E}(K|\mathcal{Y}_{t})=K.

  • Translation equivariance: for any KL(𝒴t)K\in L^{\infty}(\mathcal{Y}_{t}), ξL(𝒴T)\xi\in L^{\infty}(\mathcal{Y}_{T}), (ξ+K|𝒴t)=(ξ|𝒴t)+K\mathcal{E}(\xi+K|\mathcal{Y}_{t})=\mathcal{E}(\xi|\mathcal{Y}_{t})+K.

  • Convexity: for any λ[0,1]\lambda\in[0,1], ξ1,ξ2L(𝒴T)\xi_{1},\xi_{2}\in L^{\infty}(\mathcal{Y}_{T}),

    (λξ1+(1λ)ξ2|𝒴t)λ(ξ1|𝒴t)+(1λ)(ξ2|𝒴t)\mathcal{E}(\lambda\xi_{1}+(1-\lambda)\xi_{2}|\mathcal{Y}_{t})\leq\lambda\mathcal{E}(\xi_{1}|\mathcal{Y}_{t})+(1-\lambda)\mathcal{E}(\xi_{2}|\mathcal{Y}_{t})
  • Lower semicontinuity: For a sequence {ξn}nL(𝒴T)\{\xi_{n}\}_{n\in\mathbb{N}}\subset L^{\infty}(\mathcal{Y}_{T}) with ξnξL(𝒴T)\xi_{n}\uparrow\xi\in L^{\infty}(\mathcal{Y}_{T}) pointwise, (ξn|𝒴t)(ξ|𝒴t)\mathcal{E}(\xi_{n}|\mathcal{Y}_{t})\uparrow\mathcal{E}(\xi|\mathcal{Y}_{t}) pointwise for every t<Tt<T.

and the additional assumptions

  • {𝒴t}t0\{\mathcal{Y}_{t}\}_{t\geq 0}-consistency: for any s<t<Ts<t<T, any ξL(𝒴T)\xi\in L^{\infty}(\mathcal{Y}_{T}),

    (ξ|𝒴s)=((ξ|𝒴t)|𝒴s).\mathcal{E}(\xi|\mathcal{Y}_{s})=\mathcal{E}(\mathcal{E}(\xi|\mathcal{Y}_{t})|\mathcal{Y}_{s}).
  • Relevance: for any t<Tt<T, any A𝒴tA\in\mathcal{Y}_{t}, (IAξ|𝒴t)=IA(ξ|𝒴t)\mathcal{E}(I_{A}\xi|\mathcal{Y}_{t})=I_{A}\mathcal{E}(\xi|\mathcal{Y}_{t}).

The assumption of 𝒴\mathcal{Y}-consistency is sometimes simply called recursivity, time consistency or dynamic consistency (and is closely related to the validity of the dynamic programming principle); however, it is important to note that this depends on the choice of filtration. In our context, consistency with the observation filtration 𝒴\mathcal{Y} is natural, as this describes the information available for us to make decisions.

Remark 18.

Definition 14 is equivalent to considering a lower semicontinuous convex expectation, as in Definition 6 and assuming that for any ξL(𝒴T)\xi\in L^{\infty}(\mathcal{Y}_{T}) and any t<Tt<T, there exists a random variable ξt\xi_{t} such that (IAξ)=(IAξt)\mathcal{E}(I_{A}\xi)=\mathcal{E}(I_{A}\xi_{t}) for all A𝒴tA\in\mathcal{Y}_{t}. In this case, one can define (ξ|𝒴t)=ξt\mathcal{E}(\xi|\mathcal{Y}_{t})=\xi_{t} and verify that the definition given is satisfied (see Föllmer and Schied [17]).

Much work has been done on the construction of dynamic nonlinear expectations (see for example Epstein and Schneider [14], Duffie and Epstein [12], El Karoui, Peng and Quenez [13], Cohen and Elliott [7], and references therein). In particular, there have been close relations drawn between these operators and the theory of BSDEs (for a setting covering the discrete-time examples we consider here, see [7, 8]).

Remark 19.

The importance of 𝒴\mathcal{Y}-consistency is twofold: First, it guarantees that, when using a nonlinear expectation to construct the value function for a control problem, an optimal policy will be consistent in the sense that (assuming an optimal policy exists) a policy which is optimal at time zero will remain optimal in the future. Secondly, {𝒴t}t0\{\mathcal{Y}_{t}\}_{t\geq 0}-consistency allows the nonlinear expectation to be calculated recursively, working backwards from a terminal time. This leads to a considerable simplification numerically, as it avoids a curse of dimensionality in intertemporal control problems.

Remark 20.

One issue in our setting is that our lack of knowledge does not simply line up with the arrow of time – we are unaware of events which occurred in the past, as well as those which are in the future. This leads to delicacies in questions of dynamic consistency. Conventionally, this has often been considered in a setting of ‘partially observed control’, and these issues are resolved by taking the filter state ptp_{t} to play the role of a state variable, and solving the corresponding ‘fully observed control problem’ with ptp_{t} as underlying. In our context, we do not know the value of ptp_{t}; instead we have the (even higher dimensional) penalty function κt\kappa_{t} (or worse, KtK_{t}) as a state variable (taking values in the space of functions on SN+S_{N}^{+}).

In this section, we will outline how this perspective can provide a dynamically consistent extension of our expectations, and how enforcing dynamic consistency will modify our perception of risk. For this and the remainder of the paper, we will focus our attention on the dynamic-generator DR-expectation framework. The corresponding theory in the dynamic-generator uncertain-prior setting can be obtained by simply removing the relevant likelihood term whenever it appears.

5.1 Asynchronous expectations

We will focus our attention on constructing a dynamically consistent nonlinear expectation for random variables in L(σ(XT)𝒴T)L^{\infty}(\sigma(X_{T})\otimes\mathcal{Y}_{T}), given observations up to times t<Tt<T. Given our earlier analysis, we already have a map

𝒬|𝐲T:L(σ(XT)𝒴T)L(𝒴T).\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{T}}:L^{\infty}(\sigma(X_{T})\otimes\mathcal{Y}_{T})\to L^{\infty}(\mathcal{Y}_{T}).

We therefore need only to construct a 𝒴\mathcal{Y}-consistent family of maps

(|𝒴t):L(𝒴T)L(𝒴t),\overleftarrow{\mathcal{E}}\hskip-1.99997pt(\cdot|\mathcal{Y}_{t}):L^{\infty}(\mathcal{Y}_{T})\to L^{\infty}(\mathcal{Y}_{t}),

which we can extend by composition with 𝒬|𝐲T\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{T}} to be defined on our space of interest L(σ(XT)𝒴T)L^{\infty}(\sigma(X_{T})\otimes\mathcal{Y}_{T}).

As we are in discrete time, we can construct a 𝒴\mathcal{Y}-consistent family through recursion. Therefore, the key question is how we construct a nonlinear expectation over one step. The definition of the DR-expectation can be applied to generate these one-step expectations in a natural way.

Recall that, as 𝒴\mathcal{Y} is generated by YY, any 𝒴t+1\mathcal{Y}_{t+1}-measurable function ξ\xi is simply a function of {Ys}st+1\{Y_{s}\}_{s\leq t+1} so we can write

ξ(ω)=ξ^(Yt+1,{Ys}st).\xi(\omega)=\hat{\xi}(Y_{t+1},\{Y_{s}\}_{s\leq t}). (8)

For any conditionally Markov measure QQ, if QQ has generator {𝔄t}t>0\{\mathfrak{A}_{t}\}_{t>0}, it follows that

𝔼[ξ|𝒴t]=dξ^(y,{Ys}st)c𝔄t+1(dy;At𝔄pt).\mathbb{E}_{\mathbb{Q}}[\xi|\mathcal{Y}_{t}]=\int_{\mathbb{R}^{d}}\hat{\xi}(y,\{Y_{s}\}_{s\leq t})c_{\mathfrak{A}_{t+1}}(dy;A^{\mathfrak{A}}_{t}p_{t}).
Lemma 4.

For a dynamic DR-expectation, the one-step expectation (i.e. for 𝒴t+1\mathcal{Y}_{t+1}-measurable ξt+1\xi_{t+1}) can be written

(ξt+1|𝒴t)\displaystyle\mathcal{E}(\xi_{t+1}|\mathcal{Y}_{t}) :=𝒬|𝐲t(ξt+1)\displaystyle:=\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}(\xi_{t+1})
=supptSN+,𝔄t𝒜{dξ^t+1(y,{Ys}st)c𝔄t+1(dy;At𝔄pt)\displaystyle=\sup_{p_{t}\in S_{N}^{+},\mathfrak{A}_{t}\in\mathcal{A}}\Big{\{}\int_{\mathbb{R}^{d}}\hat{\xi}_{t+1}(y,\{Y_{s}\}_{s\leq t})c_{\mathfrak{A}_{t+1}}(dy;A^{\mathfrak{A}}_{t}p_{t})
(1k(κt(pt)+γt(𝔄t;{Ys}s<t)))k},\displaystyle\qquad\qquad-\Big{(}\frac{1}{k}\big{(}\kappa_{t}(p_{t})+\gamma_{t}(\mathfrak{A}_{t};\{Y_{s}\}_{s<t})\big{)}\Big{)}^{k^{\prime}}\Big{\}},

where ξ^\hat{\xi} is as in (8).

Proof.

Our DR-expectation can be written

𝒬|𝐲t(ξt+1)=sup{E[ξt+1|𝐲](1kα𝒬|𝐲t())k}.\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}(\xi_{t+1})=\sup_{\mathbb{Q}}\Big{\{}E_{\mathbb{Q}}[\xi_{t+1}|\mathbf{y}]-\Big{(}\frac{1}{k}\alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q})\Big{)}^{k^{\prime}}\Big{\}}.

As ξt+1\xi_{t+1} is 𝒴t+1\mathcal{Y}_{t+1}-measurable, rather than 𝒴t\mathcal{Y}_{t}-measurable, we cannot exploit Lemma 1 fully, as the generator at t+1t+1 is still relevant when calculating our 𝒴t\mathcal{Y}_{t}-conditional expectation. As we are in a dynamic-generator DR-expectation setting, for a measure \mathbb{Q} with generator {𝔄s}0<sT\{\mathfrak{A}_{s}\}_{0<s\leq T} and initial hidden distribution p0p_{0}, we have

α𝒬|𝐲t()=π(p0)0<stlog(c𝔄(Ys;Asps1𝔄,p0))+0<sTγs(𝔄s;{Yr}r<s)mt,\alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q})=\pi(p_{0})-\sum_{0<s\leq t}\log\Big{(}c_{\mathfrak{A}}(Y_{s};A_{s}p_{s-1}^{\mathfrak{A},p_{0}})\Big{)}+\sum_{0<s\leq T}\gamma_{s}(\mathfrak{A}_{s};\{Y_{r}\}_{r<s})-m_{t},

Our expectation 𝔼[ξt+1|𝒴t]\mathbb{E}_{\mathbb{Q}}[\xi_{t+1}|\mathcal{Y}_{t}] is independent of 𝔄s\mathfrak{A}_{s} for s>t+1s>t+1, and is independent of 𝔄s\mathfrak{A}_{s} for sts\leq t given ptp_{t}. Therefore, without loss of generality we can minimize this over possible values of {𝔄s}st+1\{\mathfrak{A}_{s}\}_{s\neq t+1} and, expressing in terms of the current filter state, we obtain

inf{𝔄s}st+1{infp0(p)t({𝔄s}s<t,0){α𝒬|𝐲t()}}=κt(pt)+γt+1(𝔄t+1),\inf_{\{\mathfrak{A}_{s}\}_{s\neq t+1}}\Big{\{}\inf_{p_{0}\in(p)_{t}^{(\{\overleftarrow{\mathfrak{A}}_{s}\}_{s<t},0)}}\big{\{}\alpha_{\mathcal{Q}|\mathbf{y}_{t}}(\mathbb{Q})\big{\}}\Big{\}}=\kappa_{t}(p_{t})+\gamma_{t+1}(\mathfrak{A}_{t+1}),

where κt\kappa_{t} is as constructed in Theorem 5. Combining the conditional expectation and the penalty, this gives the desired representation. ∎
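As a small, self-contained illustration of the one-step formula in Lemma 4, the following Python sketch evaluates the supremum over a discretized simplex and two candidate generators, for a two-state chain with binary observations; all numerical values (the grid, the generators, the penalties and k,k^{\prime}) are illustrative assumptions:

import numpy as np

grid = np.linspace(0.0, 1.0, 101)                        # filter states p_t = (q, 1-q)
kappa_t = 2.0 * np.abs(grid - 0.7)                       # a toy current penalty, with infimum zero
A_candidates = [np.array([[0.9, 0.2], [0.1, 0.8]]),
                np.array([[0.8, 0.3], [0.2, 0.7]])]
C = np.array([[0.7, 0.2], [0.3, 0.8]])                   # C[y, state]: observation probabilities
gamma = [0.0, 0.5]                                       # toy prior penalty for each candidate generator
k, k_prime = 1.0, 1.0

def one_step(xi_hat):
    # xi_hat[y]: value of the Y_{t+1}-measurable variable when Y_{t+1} takes its y-th value
    best = -np.inf
    for A, g in zip(A_candidates, gamma):
        for j, q in enumerate(grid):
            pred = A @ np.array([q, 1.0 - q])            # one-step prediction of the hidden state
            expect = sum(xi_hat[y] * (C[y] @ pred) for y in range(2))
            penalty = ((kappa_t[j] + g) / k) ** k_prime
            best = max(best, expect - penalty)
    return best

print(one_step(np.array([0.0, 1.0])))                    # e.g. xi_{t+1} = indicator of the second observation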

Remark 21.

There is a surprising form of double-counting of the penalty here. For notational simplicity, let us assume ϕ\phi does not depend on YY. If we consider ξt+1=(ϕ(Xt+1)|𝒴t+1)\xi_{t+1}=\mathcal{E}(\phi(X_{t+1})|\mathcal{Y}_{t+1}), then we have included a penalty for the proposed model at t+1t+1, that is,

ξt+1=(ϕ(Xt+1)|𝒴t+1)=suppSN+{ipiϕ(ei)(1kκt+1(p))k}\xi_{t+1}=\mathcal{E}(\phi(X_{t+1})|\mathcal{Y}_{t+1})=\sup_{p\in S_{N}^{+}}\Big{\{}\sum_{i}p^{i}\phi(e_{i})-\Big{(}\frac{1}{k}\kappa_{t+1}(p)\Big{)}^{k^{\prime}}\Big{\}}

where κt+1(p)\kappa_{t+1}(p) is the penalty associated with the filter state at time t+1t+1, which comes from the generators (p0,{𝔄s}st+1)(p_{0},\{\mathfrak{A}_{s}\}_{s\leq t+1}).

When we calculate (ξt+1|𝒴t)\mathcal{E}(\xi_{t+1}|\mathcal{Y}_{t}), we then add on the penalty κt(pt)+γt+1(𝔄t+1)\kappa_{t}(p_{t})+\gamma_{t+1}(\mathfrak{A}_{t+1}), which again penalizes unreasonable values of ptp_{t} and the generator 𝔄t+1\mathfrak{A}_{t+1}. This ‘double counting’ of the penalty corresponds to us including both our ‘expected’ uncertainty at time t+1t+1, and also our ‘uncertainty at tt about our uncertainty at t+1t+1’.

Remark 22.

In the case where we considered a DynamicUP model, the equations would be identical, with the corresponding choice of γ\gamma. This is because the DR and uncertain prior models differ only through the incorporation of learning through the log-likelihood term, which is not a consideration when it comes to evaluating our future expectations.

Remark 23.

If we take a StaticDR model, then the equations vary as one might expect:

(ξt+1|𝒴t)=supptSN+,𝔄𝒜{dξ^(y,{Ys}st)c𝔄t+1(dy;At𝔄pt)(1kKt(pt,𝔄))k}.\mathcal{E}(\xi_{t+1}|\mathcal{Y}_{t})=\sup_{p_{t}\in S_{N}^{+},\mathfrak{A}\in\mathcal{A}}\Big{\{}\int_{\mathbb{R}^{d}}\hat{\xi}(y,\{Y_{s}\}_{s\leq t})c_{\mathfrak{A}_{t+1}}(dy;A^{\mathfrak{A}}_{t}p_{t})-\Big{(}\frac{1}{k}K_{t}(p_{t},\mathfrak{A})\Big{)}^{k^{\prime}}\Big{\}}.

Similarly for a StaticUP model. However, one should be careful in this setting, as the recursively-defined nonlinear expectation will consider models which allow the generator 𝔄\mathfrak{A} to vary through time, even though this does not form part of the original static generator framework.

Given this recursion, we can now define the nonlinear expectation at every time.

Definition 15.

The dynamically consistent expectation of ϕ(XT,{Yt}tT)\phi(X_{T},\{Y_{t}\}_{t\leq T}) (for either the static or dynamic generator cases, and either the uncertain prior or DR-expectation models), is given by the recursion

(ϕ(XT,{Yt}tT)|𝒴t):=ξt=(ξt+1|𝒴t)\overleftarrow{\mathcal{E}}\hskip-1.99997pt(\phi(X_{T},\{Y_{t}\}_{t\leq T})|\mathcal{Y}_{t}):=\xi_{t}=\mathcal{E}(\xi_{t+1}|\mathcal{Y}_{t})

with ξT=𝒬|𝐲T(ϕ(XT,{Yt}tT))\xi_{T}=\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{T}}(\phi(X_{T},\{Y_{t}\}_{t\leq T})). As 𝒴0\mathcal{Y}_{0} is trivial, we identify

(ϕ(XT,{Yt}tT))=(ϕ(XT,{Yt}tT)|𝒴0).\overleftarrow{\mathcal{E}}\hskip-1.99997pt(\phi(X_{T},\{Y_{t}\}_{t\leq T}))=\overleftarrow{\mathcal{E}}\hskip-1.99997pt(\phi(X_{T},\{Y_{t}\}_{t\leq T})|\mathcal{Y}_{0}).

Note that ξt\xi_{t} is 𝒴t\mathcal{Y}_{t}-measurable for all t<Tt<T.

Remark 24.

As we have defined ξ\xi using recursion, and as both the DR-expectation and the uncertain prior expectation are convex, it is easy to verify that the map ϕ(XT,{Yt}tT)ξt\phi(X_{T},\{Y_{t}\}_{t\leq T})\mapsto\xi_{t} is a 𝒴\mathcal{Y}-consistent convex expectation.

5.2 Recall of BSDE theory

While it is useful to give a recursive definition of our nonlinear expectation, a better understanding of its dynamics is of practical importance. In what follows we will, for the dynamic generator case, consider the corresponding BSDE theory, assuming, as in [7], that YtY_{t} can take only finitely many values; we now present the key results of [7] in a simplified setting.

In what follows, we suppose that YY takes dd values, which we associate with the standard basis vectors in d\mathbb{R}^{d}. For simplicity, we write 𝟏\mathbf{1} for the vector in d\mathbb{R}^{d} with all components 11.

Definition 16.

Write ¯\bar{\mathbb{P}} for a probability measure such that {Yt}t0\{Y_{t}\}_{t\geq 0} is an iid sequence, uniformly distributed over the dd states, and MM for the ¯\bar{\mathbb{P}}-martingale difference process Ytd1𝟏Y_{t}-d^{-1}\mathbf{1}. As in [7], MM has the property that any 𝒴\mathcal{Y}-adapted ¯\bar{\mathbb{P}}-martingale LL can be represented by Lt=L0+0s<tZsMs+1L_{t}=L_{0}+\sum_{0\leq s<t}Z_{s}M_{s+1} for some ZZ (and ZZ is unique up to addition of a multiple of 𝟏\mathbf{1}).

Remark 25.

The construction of ZZ in fact also shows that, if Lt+1L_{t+1} is written Lt+1=L~(Y1,,Yt,Yt+1)L_{t+1}=\tilde{L}(Y_{1},...,Y_{t},Y_{t+1}), then eiZt=L~(Y1,,Yt,ei)e_{i}^{\top}Z_{t}=\tilde{L}(Y_{1},...,Y_{t},e_{i}) for every ii (up to addition of a multiple of 𝟏\mathbf{1}).
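For instance, when d=2, taking Z_{t} to be the vector of values of L_{t+1} over the two possible outcomes of Y_{t+1}, a direct check gives

Z_{t}M_{t+1}=\big{(}\tilde{L}(Y_{1},...,Y_{t},e_{1}),\,\tilde{L}(Y_{1},...,Y_{t},e_{2})\big{)}\Big{(}Y_{t+1}-\tfrac{1}{2}\mathbf{1}\Big{)}=L_{t+1}-\mathbb{E}_{\bar{\mathbb{P}}}[L_{t+1}|\mathcal{Y}_{t}],

since Y_{t+1} is uniformly distributed and independent of \mathcal{Y}_{t} under \bar{\mathbb{P}}.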

We can then define a BSDE (Backward Stochastic Difference Equation) with solution (ξ,Z)(\xi,Z):

ξt(ω)tu<Tf(ω,u,ξu(ω),Zu(ω))+tu<TZu(ω)Mu+1(ω)=ξT(ω),\xi_{t}(\omega)-\sum_{t\leq u<T}f(\omega,u,\xi_{u}(\omega),Z_{u}(\omega))+\sum_{t\leq u<T}Z_{u}(\omega)M_{u+1}(\omega)=\xi_{T}(\omega), (9)

where TT is a finite deterministic terminal time, ff a 𝒴\mathcal{Y}-adapted map f:Ω×{0,,T}××df:\Omega\times\{0,...,T\}\times\mathbb{R}\times\mathbb{R}^{d}\rightarrow\mathbb{R}, and ξT\xi_{T} a given \mathbb{R}-valued 𝒴T\mathcal{Y}_{T}-measurable terminal condition. For simplicity, we henceforth omit the ω\omega argument of ξ,Z\xi,Z and MM.

The general existence and uniqueness result for BSDEs in this context is as follows:

Theorem 6.

Suppose ff is such that the following two assumptions hold:

  1. (i)

    For any ξ\xi, if Z1=Z2+k𝟏Z^{1}=Z^{2}+k\mathbf{1} for some kk, then f(ω,t,ξt,Zt1)=f(ω,t,ξt,Zt2)f(\omega,t,\xi_{t},Z^{1}_{t})=f(\omega,t,\xi_{t},Z^{2}_{t}), ¯\bar{\mathbb{P}}-a.s. for all tt.

  2. (ii)

    For any zdz\in\mathbb{R}^{d}, for all tt, for ¯\bar{\mathbb{P}}-almost all ω\omega, the map

    ξξf(ω,t,ξ,z)\xi\mapsto\xi-f(\omega,t,\xi,z)

    is a bijection \mathbb{R}\rightarrow\mathbb{R}.

Then for any terminal condition ξT\xi_{T} essentially bounded, 𝒴T\mathcal{Y}_{T}-measurable, and with values in \mathbb{R}, the BSDE (9) has a 𝒴\mathcal{Y}-adapted solution (ξ,Z)(\xi,Z). Moreover, this solution is unique up to indistinguishability for ξ\xi and indistinguishability up to addition of multiples of 𝟏\mathbf{1} for ZZ.
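To make the backward construction concrete, the following Python sketch (with purely hypothetical driver and terminal condition) solves (9) numerically when the driver does not depend on \xi: represent \xi_{t} as a function of the path (Y_{1},...,Y_{t}), read Z_{t} off the values of \xi_{t+1} at each possible next observation, and set \xi_{t}=\mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]+f(t,Z_{t}):

import itertools
import numpy as np

d, T = 2, 3

def f(t, z):
    # a hypothetical driver: independent of xi, normalised so f(t, 0) = 0, and
    # unchanged when a multiple of the all-ones vector is added to z
    return 0.1 * (np.max(z) - np.mean(z))

def terminal(path):
    # a hypothetical terminal condition xi_T, as a function of the path (Y_1,...,Y_T)
    return float(sum(path))

xi = {path: terminal(path) for path in itertools.product(range(d), repeat=T)}
for t in range(T - 1, -1, -1):
    xi_prev = {}
    for path in itertools.product(range(d), repeat=t):
        z = np.array([xi[path + (j,)] for j in range(d)])  # e_i' Z_t = value of xi_{t+1} on {Y_{t+1} = e_i}
        xi_prev[path] = z.mean() + f(t, z)                 # xi_t = E_Pbar[xi_{t+1} | Y_1..Y_t] + f(t, Z_t)
    xi = xi_prev
print(xi[()])                                              # the value at time zero

Taking the mean of z implements the conditional expectation under \bar{\mathbb{P}}, since the observations are iid uniform under that measure.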

In this setting, we also have a comparison theorem:

Theorem 7.

Consider two discrete time BSDEs as in (9), corresponding to coefficients f1,f2f^{1},f^{2} and terminal values ξT1,ξT2\xi^{1}_{T},\xi^{2}_{T}. Suppose the conditions of Theorem 6 are satisfied for both equations, and let (ξ1,Z1)(\xi^{1},Z^{1}) and (ξ2,Z2)(\xi^{2},Z^{2}) be the associated solutions. Suppose the following conditions hold:

  1. (i)

    ξT1ξT2\xi^{1}_{T}\geq\xi^{2}_{T} ¯\bar{\mathbb{P}}-a.s.

  2. (ii)

    ¯\bar{\mathbb{P}}-a.s., for all times tt and every ξ\xi\in\mathbb{R} and zdz\in\mathbb{R}^{d},

    f1(ω,t,ξ,z)f2(ω,t,ξ,z).f^{1}(\omega,t,\xi,z)\geq f^{2}(\omega,t,\xi,z).
  3. (iii)

    ¯\bar{\mathbb{P}}-a.s., for all tt, f1f^{1} satisfies

    f1(ω,t,ξt2,Zt1)f1(ω,t,ξt2,Zt2)minj𝕁t{(Zt1Zt2)(ejd1𝟏)}.f^{1}(\omega,t,\xi_{t}^{2},Z_{t}^{1})-f^{1}(\omega,t,\xi_{t}^{2},Z_{t}^{2})\geq\min_{j\in\mathbb{J}_{t}}\{(Z^{1}_{t}-Z^{2}_{t})(e_{j}-d^{-1}\mathbf{1})\}.
  4. (iv)

    ¯\bar{\mathbb{P}}-a.s., for all tt and all zdz\in\mathbb{R}^{d}, ξξf1(ω,t,ξ,z)\xi\mapsto\xi-f^{1}(\omega,t,\xi,z) is an increasing function.

It is then true that ξ1ξ2\xi^{1}\geq\xi^{2} ¯\bar{\mathbb{P}}-a.s. A driver f1f^{1} satisfying (iii) and (iv) will be called ‘balanced’.

Finally, we also know that all dynamically consistent nonlinear expectations can be represented through BSDEs:

Theorem 8.

The following two statements are equivalent.

  1. (i)

    (|𝒴t)\overleftarrow{\mathcal{E}}\hskip-1.99997pt(\cdot|\mathcal{Y}_{t}) is a 𝒴t\mathcal{Y}_{t}-consistent, dynamically translation invariant, nonlinear expectation

  2. (ii)

    There exists a driver ff which is balanced, independent of ξ\xi, and satisfies the normalisation condition f(ω,t,ξt,0)=0f(\omega,t,\xi_{t},0)=0, such that, for all ξT\xi_{T}, ξt=(ξT|𝒴t)\xi_{t}=\overleftarrow{\mathcal{E}}\hskip-1.99997pt(\xi_{T}|\mathcal{Y}_{t}) is the solution to a BSDE with terminal condition ξT\xi_{T} and driver ff.

Furthermore, these two statements are related by the equation

f(ω,t,ξ,z)=(zMt+1|𝒴t).f(\omega,t,\xi,z)=\overleftarrow{\mathcal{E}}\hskip-1.99997pt(zM_{t+1}|\mathcal{Y}_{t}).

5.3 BSDEs for forward expectations

By applying the above general theory, we can easily see that our nonlinear expectation has a representation as the solution to a particular BSDE.

Theorem 9.

In the dynamic generator setting, writing (ϕ(XT,{Yt}tT)|𝒴t)=ξt\overleftarrow{\mathcal{E}}\hskip-1.99997pt(\phi(X_{T},\{Y_{t}\}_{t\leq T})|\mathcal{Y}_{t})=\xi_{t}, the dynamically consistent expectation satisfies the BSDE

ξt+1=ξtf(Zt;κt)+ZtMt+1\xi_{t+1}=\xi_{t}-f(Z_{t};\kappa_{t})+Z_{t}M_{t+1}

where

f(Zt;κt)=supp,𝔄{i(Zi(c𝔄(ei;A𝔄p)d1))(κt(p)+γt+1(𝔄)k)k}.f(Z_{t};\kappa_{t})=\sup_{p,\mathfrak{A}}\Big{\{}\sum_{i}\Big{(}Z^{i}\big{(}c_{\mathfrak{A}}(e_{i};A^{\mathfrak{A}}p)-d^{-1}\big{)}\Big{)}-\Big{(}\frac{\kappa_{t}(p)+\gamma_{t+1}(\mathfrak{A})}{k}\Big{)}^{k^{\prime}}\Big{\}}.
Proof.

As ξt+1\xi_{t+1} is 𝒴t+1\mathcal{Y}_{t+1}-measurable, by the Doob–Dynkin lemma there exists a 𝒴t\mathcal{Y}_{t}-measurable function ξ^t+1\hat{\xi}_{t+1} such that ξt+1=ξ^t+1(Yt+1)\xi_{t+1}=\hat{\xi}_{t+1}(Y_{t+1}) (we omit to write {Ys}st\{Y_{s}\}_{s\leq t} as an argument). We write ZtZ_{t} for the vector containing each of the values of this function. From the definition of MM, as in the proof of the martingale representation theorem in [7], it follows that,

ξt+1𝔼¯[ξt+1|𝒴t]=ZtMt+1.\xi_{t+1}-\mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]=Z_{t}M_{t+1}.

We can then calculate

ξt𝔼¯[ξt+1|𝒴t]=(ξt+1|𝒴t)𝔼¯[ξt+1|𝒴t]=supp,𝔄{yξ^t+1(y)c𝔄(y;A𝔄p)(κt(p)+γt+1(𝔄)k)k}𝔼¯[ξt+1|𝒴t]=supp,𝔄{yξ^t+1(y)(c𝔄(y;A𝔄p)d1)(κt(p)+γt+1(𝔄)k)k}=f(Zt;κt).\begin{split}&\xi_{t}-\mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]\\ &=\mathcal{E}(\xi_{t+1}|\mathcal{Y}_{t})-\mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]\\ &=\sup_{p,\mathfrak{A}}\Big{\{}\sum_{y}\hat{\xi}_{t+1}(y)c_{\mathfrak{A}}(y;A^{\mathfrak{A}}p)-\Big{(}\frac{\kappa_{t}(p)+\gamma_{t+1}(\mathfrak{A})}{k}\Big{)}^{k^{\prime}}\Big{\}}-\mathbb{E}_{\bar{\mathbb{P}}}[\xi_{t+1}|\mathcal{Y}_{t}]\\ &=\sup_{p,\mathfrak{A}}\Big{\{}\sum_{y}\hat{\xi}_{t+1}(y)\big{(}c_{\mathfrak{A}}(y;A^{\mathfrak{A}}p)-d^{-1}\big{)}-\Big{(}\frac{\kappa_{t}(p)+\gamma_{t+1}(\mathfrak{A})}{k}\Big{)}^{k^{\prime}}\Big{\}}\\ &=f(Z_{t};\kappa_{t}).\end{split}

The answer follows by rearrangement. ∎

We can now observe that, if we treat the function κt:SN+\kappa_{t}:S_{N}^{+}\to\mathbb{R} as a state variable, we obtain a ‘Markovian’ BSDE.

Theorem 10.

Suppose ϕ\phi and γ\gamma are independent of {Yt}tT\{Y_{t}\}_{t\leq T}. Then the solution to the above BSDE is a functional of κt\kappa_{t}, that is, ξt(ω)=Ξt(κt(ω,))\xi_{t}(\omega)=\Xi_{t}(\kappa_{t}(\omega,\cdot)).

Proof.

This argument follows in the usual manner – we observe that κ\kappa has recursive dynamics (and furthermore, under the reference measure, κ\kappa is Markovian) and that the terminal value of our BSDE is a function of ω\omega only through κT\kappa_{T}. Consequently, we can use backward induction to construct the solution to the BSDE as a function of κt\kappa_{t} (by evolving κ\kappa forward one step in time, then solving the BSDE backward one step), and the result follows. ∎
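To make the backward induction explicit (this restates Lemma 4 together with the theorem above, rather than adding anything new), under the stated assumptions the function \Xi satisfies

\Xi_{t}(\kappa)=\sup_{p\in S_{N}^{+},\,\mathfrak{A}\in\mathbb{A}}\Big{\{}\int_{\mathbb{R}^{d}}\Xi_{t+1}\big{(}\kappa_{t+1}(\cdot|y,\kappa)\big{)}\,c_{\mathfrak{A}}(dy;A^{\mathfrak{A}}p)-\Big{(}\frac{\kappa(p)+\gamma_{t+1}(\mathfrak{A})}{k}\Big{)}^{k^{\prime}}\Big{\}},\qquad\Xi_{T}(\kappa)=\sup_{p\in S_{N}^{+}}\Big{\{}\sum_{i}p^{i}\phi(e_{i})-\Big{(}\frac{\kappa(p)}{k}\Big{)}^{k^{\prime}}\Big{\}},

where \kappa_{t+1}(\cdot|y,\kappa) denotes the penalty obtained from \kappa by one step of the recursion for \kappa_{t}, with observation y.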

6 A control problem with uncertain filtering

In this final section, we will consider the solution of a simple control problem under uncertainty, using the formal structures previously developed. We shall focus our attention on a DR-expectation with dynamic generator; however, similar arguments can be used in each of the other settings considered. In some ways, this approach is similar to that considered by Bielecki, Chen and Cialenco [4], where the DR-expectation is replaced by an approximate confidence interval. (Taking k=k=k^{\prime}=\infty in our analysis would give a very similar problem to the one they consider.)

Suppose a controller selects a control uu from a set UU, which we assume is a countable union of compact metrizable sets (this assumption is purely to enable us to use appropriate measurable selection theorems). Controls are required to be 𝒴\mathcal{Y}-predictable (i.e. utu_{t} is 𝒴t1\mathcal{Y}_{t-1}-measurable), and we write 𝒰\mathcal{U} for the space of such controls.

A control has an impact on the generator of X,YX,Y, through modifying the penalty function γ\gamma, which describes the ‘reasonable’ models for the transition matrix AA and the distribution of observations cc. In particular, for a given uu we will now have a penalty γt(𝔄t;ut)\gamma_{t}(\mathfrak{A}_{t};u_{t}), which we assume is continuous in utu_{t} for every 𝔄t𝔸\mathfrak{A}_{t}\in\mathbb{A}. This allows a controller to modify what are ‘reasonable’ values of the generator, even though the generator may not be fully known. We write u\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u} for the corresponding 𝒴\mathcal{Y}-consistent expectation and κu\kappa^{u} for the corresponding dynamic penalty. The expectation u\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u} then encodes both the change in the hidden dynamics in the future due to future controls and the change in our agent’s understanding of the present hidden state (as represented via κu\kappa^{u}) due to her past controls.

The controller wishes to minimize an expected cost

u((XT,{Ys}sT)+t<T𝔏t({Ys}s<t,ut+1)).\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}\Big{(}\mathfrak{C}(X_{T},\{Y_{s}\}_{s\leq T})+\sum_{t<T}\mathfrak{L}_{t}(\{Y_{s}\}_{s<t},u_{t+1})\Big{)}.

Here \mathfrak{C} is a terminal cost, which may depend on the hidden state XTX_{T}, and 𝔏\mathfrak{L} is a running cost, which will depend on the control ut+1u_{t+1} used at time tt. We assume \mathfrak{C} and 𝔏\mathfrak{L} are continuous in uu (almost surely). We do not allow 𝔏t\mathfrak{L}_{t} to depend on XtX_{t}, as this would potentially lead to paradoxes (as the agent could learn information about the hidden state by observing their running costs). We think of the cost 𝔏t\mathfrak{L}_{t} as being paid at time tt, depending on the choice of control ut+1u_{t+1} (which will affect the generator at time t+1t+1). For notational simplicity, we will omit to write YY as an argument when unnecessary.

For a given control process uu, we define the remaining cost

J(ω,t,u)=u((XT)+ts<T𝔏s(us+1)|𝒴t)J(\omega,t,u)=\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}\Big{(}\mathfrak{C}(X_{T})+\sum_{t\leq s<T}\mathfrak{L}_{s}(u_{s+1})\Big{|}\mathcal{Y}_{t}\Big{)}

and hence the value function

V(ω,t)=infu𝒰u((XT)+ts<T𝔏s(us+1)|𝒴t).V(\omega,t)=\inf_{u\in\mathcal{U}}\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}\Big{(}\mathfrak{C}(X_{T})+\sum_{t\leq s<T}\mathfrak{L}_{s}(u_{s+1})\Big{|}\mathcal{Y}_{t}\Big{)}.
Remark 26.

We define our expected cost using the 𝒴\mathcal{Y}-consistent expectation u\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}, rather than the (inconsistent) DR-expectation 𝒬|𝐲t\mathcal{E}_{\mathcal{Q}|\mathbf{y}_{t}}, as this leads to time-consistency in the choice of controls.

Remark 27.

We can see that the calculation of the value function is a ‘minimax’ problem, in that VV minimizes the cost, which we evaluate using a maximum over a set of models. However, given the potential for learning, the requirement for time consistency, and the uncertainties involved, it is not clear that one can write VV explicitly in terms of a single minimization and maximization of a given function.

Remark 28.

As the filter-state penalty κ\kappa is a general function depending on the control, and YY only takes finitely many states, it is not generally possible to express the effect (on κ\kappa) of a control through a change of measure relative to some reference dynamics. In particular, we face the problem that controls usu_{s} for times s<Ts<T will have an impact on the terminal cost VT=u((XT)|𝒴T)V_{T}=\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}(\mathfrak{C}(X_{T})|\mathcal{Y}_{T}), through their impact on the uncertainty κTu\kappa^{u}_{T}, so, unlike in a traditional control problem, VTV_{T} is not independent of uu given 𝒴T\mathcal{Y}_{T}. For this reason, even though we model the impact of a control through its effect on the generator, we cannot give a fully ‘weak’ formulation of our control problem, and are restricted to a ‘Markovian’ setting, where we shall exploit dynamic programming.

Theorem 11.

The value function satisfies a dynamic programming principle; in particular, if an optimal control uu^{*} exists, then for every tTt\leq T,

Vt1\displaystyle V_{t-1} =u(Vt|𝒴t1)+𝔏t1(ut)\displaystyle=\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u^{*}}(V_{t}|\mathcal{Y}_{t-1})+\mathfrak{L}_{t-1}(u_{t}^{*})

(and similarly if we only assume an ϵ\epsilon-optimal control exists for every ϵ>0\epsilon>0).

Proof.

For any control uu, using the recursivity of \overleftarrow{\mathcal{E}}\hskip-1.99997pt we have

J(ω,t1,u)=u((XT)+t1s<T𝔏s(us+1)|𝒴t1)=u((XT)+ts<T𝔏s(us+1)|𝒴t1)+𝔏t1(ut)=u(J(ω,t,u)|𝒴t1)+𝔏t1(ut).\begin{split}J(\omega,t-1,u)&=\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}\Big{(}\mathfrak{C}(X_{T})+\sum_{t-1\leq s<T}\mathfrak{L}_{s}(u_{s+1})\Big{|}\mathcal{Y}_{t-1}\Big{)}\\ &=\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}\Big{(}\mathfrak{C}(X_{T})+\sum_{t\leq s<T}\mathfrak{L}_{s}(u_{s+1})\Big{|}\mathcal{Y}_{t-1}\Big{)}+\mathfrak{L}_{t-1}(u_{t})\\ &=\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}\big{(}J(\omega,t,u)\big{|}\mathcal{Y}_{t-1}\big{)}+\mathfrak{L}_{t-1}(u_{t}).\end{split}

The stated equality (which is the dynamic programming relation) then follows from a standard pasting argument (along with a measurable selection result to ensure optimal or ϵ\epsilon-optimal controls exist, see for example [9, Appendix 10]). ∎

Theorem 12.

The value function of the control problem satisfies the recursion

Vt1=Ξt1(κt1,{Ys}st1)=infuU{𝔏t1(u)+suppSN+,𝔄𝔸{dΞt(κtu(|y,κt1),y,{Ys}st1)c𝔄(dy;A𝔄p)(k1κt1(p))k}}\begin{split}&V_{t-1}=\Xi_{t-1}(\kappa_{t-1},\{Y_{s}\}_{s\leq t-1})\\ &=\inf_{u\in U}\bigg{\{}\mathfrak{L}_{t-1}(u)+\sup_{p\in S_{N}^{+},\mathfrak{A}\in\mathbb{A}}\Big{\{}\int_{\mathbb{R}^{d}}\Xi_{t}\big{(}\kappa^{u}_{t}(\cdot|y,\kappa_{t-1}),y,\{Y_{s}\}_{s\leq t-1}\big{)}c_{\mathfrak{A}}(dy;A^{\mathfrak{A}}p)\\ &\qquad\qquad\qquad\qquad-\big{(}k^{-1}\kappa_{t-1}(p)\big{)}^{k^{\prime}}\Big{\}}\bigg{\}}\end{split}

where

κtu(|y,κt1)=inf𝔄t{infpt1()𝔄t{κt1(pt1)+γt(𝔄t;{Ys}s<t,ut)log(c𝔄(y;At𝔄pt1,ut))}}mt\begin{split}\kappa_{t}^{u}(\cdot|y,\kappa_{t-1})=\inf_{\mathfrak{A}_{t}}\bigg{\{}\inf_{p_{t-1}\in(\cdot)^{\overleftarrow{\mathfrak{A}}_{t}}}\Big{\{}&\kappa_{t-1}(p_{t-1})+\gamma_{t}(\mathfrak{A}_{t};\{Y_{s}\}_{s<t},u_{t})\\ &-\log\Big{(}c_{\mathfrak{A}}(y;A^{\mathfrak{A}}_{t}p_{t-1},u_{t})\Big{)}\Big{\}}\bigg{\}}-m_{t}\end{split}

for mtm_{t} a normalizing constant (which may depend on utu_{t}) to ensure κtu\kappa_{t}^{u} has minimal value zero, and terminal value

ΞT(κT,{Ys}sT):=suppSN+{i(ei,{Ys}sT)pi(k1κT(p))k}.\Xi_{T}(\kappa_{T},\{Y_{s}\}_{s\leq T}):=\sup_{p\in S_{N}^{+}}\Big{\{}\sum_{i}\mathfrak{C}(e_{i},\{Y_{s}\}_{s\leq T})p_{i}-\big{(}k^{-1}\kappa_{T}(p)\big{)}^{k^{\prime}}\Big{\}}.
Proof.

We proceed by recursion, and assume that an optimal policy is always attainable (this can be relaxed, with an increase of notational complexity). We suppose that the value function is given by Vt=Ξt(κtu,{Ys}st)V_{t}=\Xi_{t}(\kappa_{t}^{u^{*}},\{Y_{s}\}_{s\leq t}), for some function Ξt:𝔎×t\Xi_{t}:\mathfrak{K}\times\mathbb{R}^{t}\to\mathbb{R}, where 𝔎\mathfrak{K} denotes the space of functions from SN+S_{N}^{+} to \mathbb{R} with minimal value zero. From our assumptions, this is clearly true at time TT, and Ξ\Xi has the stated form.

From the perspective of time T1T-1, suppose the current penalty state κT1\kappa_{T-1} is given. Then, for any proposed control uTu_{T} (which is assumed to be 𝒴T1\mathcal{Y}_{T-1}-measurable), when YTY_{T} is observed at time TT, the time-TT penalty will be given by κTu(|YT,κT1)\kappa_{T}^{u}(\cdot|Y_{T},\kappa_{T-1}). For each choice of κT1\kappa_{T-1} and each conditionally Markov measure \mathbb{Q} (with generator 𝔄\mathfrak{A}), the remaining cost faced at time TT will be

ΞT(κTu(|YT,κT1),YT,{Ys}sT1).\Xi_{T}\big{(}\kappa^{u}_{T}(\cdot|Y_{T},\kappa_{T-1}),Y_{T},\{Y_{s}\}_{s\leq T-1}\big{)}.

The conditional expectation of ΞT\Xi_{T} can be written

𝔼[ΞT(κTu)|𝒴T1]=dΞT(κTu(|y,κT1),y,{Ys}sT1)c𝔄(dy;A𝔄pT1).\begin{split}\mathbb{E}_{\mathbb{Q}}[\Xi_{T}(\kappa^{u}_{T})|\mathcal{Y}_{T-1}]&=\int_{\mathbb{R}^{d}}\Xi_{T}(\kappa^{u}_{T}(\cdot|y,\kappa_{T-1}),y,\{Y_{s}\}_{s\leq T-1})c_{\mathfrak{A}}(dy;A^{\mathfrak{A}}p_{T-1}).\end{split}

The DR-expected cost of using control uTu_{T}, as seen from time T1T-1, fixing the state of the penalty at T1T-1, is then given by taking a conditional expectation and adding the running cost term, which yields

𝔏T1(uT)+u(ΞT(κTu,{Ys}sT)|𝒴T1)=𝔏T1(uT)+suppSN+,𝔄𝔸{dΞT(κTu(|y,κT1))c𝔄(dy;AT1𝔄p)(k1κT1(p))k}.\begin{split}&\mathfrak{L}_{T-1}(u_{T})+\overleftarrow{\mathcal{E}}\hskip-1.99997pt_{u}\big{(}\Xi_{T}(\kappa^{u}_{T},\{Y_{s}\}_{s\leq T})\big{|}\mathcal{Y}_{T-1}\big{)}\\ &=\mathfrak{L}_{T-1}(u_{T})\\ &\qquad+\sup_{p\in S_{N}^{+},\mathfrak{A}\in\mathbb{A}}\Big{\{}\int_{\mathbb{R}^{d}}\Xi_{T}\big{(}\kappa^{u}_{T}(\cdot|y,\kappa_{T-1})\big{)}c_{\mathfrak{A}}(dy;A^{\mathfrak{A}}_{T-1}p)-\big{(}k^{-1}\kappa_{T-1}(p)\big{)}^{k^{\prime}}\Big{\}}.\end{split}

Finally, optimizing this cost gives the one-step minimal cost in terms of a function of {Ys}sT1\{Y_{s}\}_{s\leq T-1} and κT1\kappa_{T-1}, as required. (To be rigorous, this can be done using a measurable selection argument, as in [9, Appendix 10].) The result follows by recursion, as our problem is known to satisfy the dynamic programming principle. ∎
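As a concrete, deliberately tiny illustration of this recursion, the following self-contained Python sketch evaluates a single backward step (T=1) for a two-state chain with binary observations and two available controls; the costs, candidate generators and penalties are purely illustrative assumptions, not taken from the text:

import numpy as np

grid = np.linspace(0.0, 1.0, 101)                 # filter states p = (q, 1-q)
kappa0 = 2.0 * np.abs(grid - 0.5)                 # initial penalty, with infimum zero
C_term = np.array([1.0, 0.0])                     # terminal cost: C(e_1) = 1, C(e_2) = 0
k, k_prime = 1.0, 1.0
controls = [0, 1]
A_candidates = [np.array([[0.9, 0.2], [0.1, 0.8]]),
                np.array([[0.6, 0.4], [0.4, 0.6]])]
C_obs = np.array([[0.7, 0.2], [0.3, 0.8]])        # C_obs[y, state]: observation probabilities

def running_cost(u):                              # running cost L_0(u)
    return 0.1 * u

def gamma(i_A, u):                                # the control shifts which generator is 'reasonable'
    return 0.5 * abs(i_A - u)

def Xi_T(kappa):                                  # terminal value: worst-case expected terminal cost
    vals = C_term[0] * grid + C_term[1] * (1.0 - grid)
    return np.max(vals - (kappa / k) ** k_prime)

def kappa_next(kappa, y, u):                      # controlled penalty update kappa_1^u(.|y, kappa_0)
    new = np.full(grid.size, np.inf)
    for i_A, A in enumerate(A_candidates):
        for j, q in enumerate(grid):
            pred = A @ np.array([q, 1.0 - q])
            lik = C_obs[y] @ pred
            p_new = C_obs[y] * pred / lik
            idx = np.argmin(np.abs(grid - p_new[0]))
            new[idx] = min(new[idx], kappa[j] + gamma(i_A, u) - np.log(lik))
    return new - new.min()

best = np.inf
for u in controls:
    xi_next = [Xi_T(kappa_next(kappa0, y, u)) for y in range(2)]  # remaining cost after observing y
    worst = -np.inf                                                # sup over (p, A) in the recursion
    for A in A_candidates:
        for j, q in enumerate(grid):
            pred = A @ np.array([q, 1.0 - q])
            expect = sum(xi_next[y] * (C_obs[y] @ pred) for y in range(2))
            worst = max(worst, expect - (kappa0[j] / k) ** k_prime)
    best = min(best, running_cost(u) + worst)
print(best)                                        # the value V_0 for this toy problem

As in the earlier sketch of the penalty recursion, the update kappa_next discretizes the infimum over previous filter states by scattering onto the nearest grid point.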

Corollary 1.

A control is optimal if and only if it achieves the infimum in the formula for VtV_{t} above.

Remark 29.

If we assume that the terminal cost depends only on XTX_{T} (and not on YY), and the running cost does not depend on YY, then one can observe a Markov property of the control problem, that is, VsV_{s} is conditionally independent of YY given κs\kappa_{s}. The corresponding optimal controls can then also be taken to depend only on κs\kappa_{s}.

Remark 30.

We can write (using \partial for the partial difference operator and omitting dependence on YY)

tΞt(κ)=Ξt(κ)Ξt1(κ),κΞt(κ)=Ξt(κ+κ)Ξt(κ)\partial_{t}\Xi_{t}(\kappa)=\Xi_{t}(\kappa)-\Xi_{t-1}(\kappa),\qquad\partial_{\kappa}\Xi_{t}(\kappa^{\prime})=\Xi_{t}(\kappa+\kappa^{\prime})-\Xi_{t}(\kappa)

Rearranging, we obtain the following Bellman–Isaacs-type equation for Ξ\Xi

tΞt(κ)=infuUsuppSN+,𝔄𝔸{dκΞt(κtu(|y,κ)κ)c𝔄(dy;A𝔄p)+𝔏t1(u)(k1κ(p))k},\begin{split}\partial_{t}\Xi_{t}(\kappa)=-\inf_{u\in U}\sup_{p\in S_{N}^{+},\mathfrak{A}\in\mathbb{A}}\Big{\{}&\int_{\mathbb{R}^{d}}\partial_{\kappa}\Xi_{t}\big{(}\kappa^{u}_{t}(\cdot|y,\kappa)-\kappa\big{)}c_{\mathfrak{A}}(dy;A^{\mathfrak{A}}p)\\ &+\mathfrak{L}_{t-1}(u)-\big{(}k^{-1}\kappa(p)\big{)}^{k^{\prime}}\Big{\}},\end{split}

with terminal value ΞT(κ)=suppSN+{i(ei)pi(k1κ(p))k}.\Xi_{T}(\kappa)=\sup_{p\in S_{N}^{+}}\big{\{}\sum_{i}\mathfrak{C}(e_{i})p_{i}-\big{(}k^{-1}\kappa(p)\big{)}^{k^{\prime}}\big{\}}.

References

  • [1] Andrew Allan and Samuel N. Cohen. Parameter uncertainty in the Kalman–Bucy filter. to appear.
  • [2] Tamer Başar and Pierre Bernhard. HH^{\infty}-Optimal Control and Related Minimax Design Problems, A dynamic game approach. Birkhäuser, 1991.
  • [3] Alan Bain and Dan Crisan. Fundamentals of Stochastic Filtering. Springer, 2009.
  • [4] Tomasz R. Bielecki, Tao Chen, and Igor Cialenco. Recursive construction of confidence regions. https://arxiv.org/abs/1605.08010.
  • [5] Rene K. Boel, Matthew R. James, and Ian R. Petersen. Robustness and risk-sensitive filtering. IEEE Transactions on Automatic Control, 47(3):451–461, 2002.
  • [6] Samuel N. Cohen. Data-driven nonlinear expectations for statistical uncertainty in decisions. Electronic Journal of Statistics, 11(1):1858–1889, 2017.
  • [7] Samuel N. Cohen and Robert J. Elliott. A general theory of finite state backward stochastic difference equations. Stochastic Processes and their Applications, 120(4):442–466, 2010.
  • [8] Samuel N. Cohen and Robert J. Elliott. Backward stochastic difference equations and nearly-time-consistent nonlinear expectations. SIAM Journal on Control & Optimization, 49(1):125–139, 2011.
  • [9] Samuel N. Cohen and Robert J. Elliott. Stochastic Calculus and Applications. Birkhäuser, 2nd edition, 2015.
  • [10] Freddy Delbaen, Shige Peng, and Emanuela Rosazza Gianin. Representation of the penalty term of dynamic concave utilities. Finance and Stochastics, 14(3):449–472, 2010.
  • [11] Subhrakanti Dey and John B. Moore. Risk-sensitive filtering and smoothing for hidden Markov models. Systems and Control Letters, 25:361–366, 1995.
  • [12] Darrell Duffie and Larry G. Epstein. Asset pricing with stochastic differential utility. The Review of Financial Studies, 5(3):411–436, 1992.
  • [13] Nicole El Karoui, Shige Peng, and M.C. Quenez. Backward stochastic differential equations in finance. Mathematical Finance, 7(1):1–71, January 1997.
  • [14] Larry G. Epstein and Martin Schneider. Recursive multiple-priors. Journal of Economic Theory, 113:1–31, 2003.
  • [15] Ronald Fagin and Joseph Halpern. A new approach to updating beliefs. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-90), pages 317–325, Corvallis, Oregon, 1990. AUAI Press.
  • [16] Hans Föllmer and Alexander Schied. Convex measures of risk and trading constraints. Finance and Stochastics, 6:429–447, 2002.
  • [17] Hans Föllmer and Alexander Schied. Stochastic Finance: An introduction in discrete time. Studies in Mathematics 27. de Gruyter, Berlin-New York, 2002.
  • [18] Marco Frittelli and Emanuela Rosazza Gianin. Putting order in risk measures. Journal of Banking & Finance, 26(7):1473–1486, 2002.
  • [19] Siegfried Graf. A Radon–Nikodym theorem for capacities. Journal für die reine und angewandte Mathematik, 320:192–214, 1980.
  • [20] Michael J. Grimble and Ahmed El Sayed. Solution of the H{H}_{\infty} optimal linear filtering problem for discrete-time systems. IEEE Transactions on Acoustics Speech and Signal Processing, 38(7), 1990.
  • [21] Lars Peter Hansen and Thomas J. Sargent. Robust estimation and control under commitment. Journal of Economic Theory, 124:258–301, 2005.
  • [22] Lars Peter Hansen and Thomas J. Sargent. Recursive robust estimation and control without commitment. Journal of Economic Theory, 136(1):1–27, 2007.
  • [23] Lars Peter Hansen and Thomas J. Sargent. Robustness. Princeton University Press, 2008.
  • [24] Peter J. Huber and Elvezio M. Roncetti. Robust Statistics. Wiley, 2nd edition, 2009.
  • [25] Matthew R. James, John S. Baras, and Robert J. Elliott. Risk-sensitive control and dynamic games for partially observed discrete-time nonlinear systems. IEEE Transactions on Automatic Control, 39(4), 1994.
  • [26] R.E. Kalman. A new approach to linear filtering and prediction problems. J. Basic Eng. ASME, 82:33–45, 1960.
  • [27] R.E. Kalman and R.S. Bucy. New results in linear filtering and prediction theory. J. Basic Eng. ASME, 83:95–108, 1961.
  • [28] John Maynard Keynes. A treatise on Probability. 1921.
  • [29] Frank Knight. Risk, Uncertainty and Profit. Houghton Mifflin, 1921.
  • [30] Michael Kupper and Walter Schachermayer. Representation results for law invariant time consistent functions. Mathematics and Financial Economics, 2(3):189–210, September 2009.
  • [31] Shige Peng. Nonlinear expectations and stochastic calculus under uncertainty. arXiv:1002.4546v1, 2010.
  • [32] Frank Riedel. Dynamic coherent risk measures. Stochastic Processes and their Applications, 112(2):185–200, 2004.
  • [33] R. Tyrrell Rockafellar, Stan Uryasev, and Michael Zabarankin. Generalized deviations in risk analysis. Finance and Stochastics, 10:51–74, 2006.
  • [34] Ramon van Handel. Discrete time nonlinear filters with informative observations are stable. Electronic Communications in Probability, 13:53, 2008.
  • [35] Abraham Wald. Statistical decision functions which minimize the maximum risk. Annals of Mathematics, 46(2):265–280, 1945.
  • [36] Peter Walley. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, 1991.
  • [37] W.N. Wonham. Some applications of stochastic differential equations to optimal nonlinear filtering. SIAM J. Control, 2:347–369, 1965.
  • [38] Jinhui Zhang, Yuanqing Xia, and Peng Shi. Parameter-dependent robust HH_{\infty} filtering for uncertain discrete-time systems. Automatica, 45:560–565, 2009.