
Risk-Averse Decision Making Under Uncertainty

Mohamadreza Ahmadi, Ugo Rosolia, Michel D. Ingham, Richard M. Murray, and Aaron D. Ames M. Ahmadi, U. Rosolia, R. Murray, and A. Ames are with Control and Dynamical Systems (CDS) at the California Institute of Technology, 1200 E. California Blvd., MC 104-44, Pasadena, CA 91125, e-mail: ({mrahmadi,urosolia,murray,ames}@caltech.edu). M. Ingham is with NASA Jet Propulsion Laboratory, 4800 Oak Grove Dr, Pasadena, CA 91109, e-mail: (michel.d.ingham@jpl.nasa.gov).
Abstract

A large class of decision making under uncertainty problems can be described via Markov decision processes (MDPs) or partially observable MDPs (POMDPs), with applications in artificial intelligence and operations research, among others. Traditionally, policy synthesis techniques are proposed such that a total expected cost or reward is minimized or maximized. However, optimality in the total expected cost sense is only reasonable if the system's behavior over a large number of runs is of interest, which has limited the use of such policies in practical mission-critical scenarios, wherein large deviations from the expected behavior may lead to mission failure. In this paper, we consider the problem of designing policies for MDPs and POMDPs with objectives and constraints in terms of dynamic coherent risk measures, which we refer to as the constrained risk-averse problem. Our contributions are fourfold:

  • (i)

    For MDPs, we reformulate the problem into an inf-sup problem via the Lagrangian framework. Under the assumption that the risk objectives and constraints can be represented by a Markov risk transition mapping, we propose an optimization-based method to synthesize Markovian policies;

  • (ii)

    For MDPs, we demonstrate that the formulated optimization problems are in the form of difference convex programs (DCPs) and can be solved by the disciplined convex-concave programming (DCCP) framework. We show that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints;

  • (iii)

    For POMDPs, we show that, if the coherent risk measures can be defined as a Markov risk transition mapping, an infinite-dimensional optimization can be used to design Markovian belief-based policies;

  • (iv)

    For POMDPs with stochastic finite-state controllers (FSCs), we show that the latter optimization simplifies to a (finite-dimensional) DCP and can be solved by the DCCP framework. We incorporate these DCPs in a policy iteration algorithm to design risk-averse FSCs for POMDPs.

We demonstrate the efficacy of the proposed method with numerical experiments involving conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) risk measures.

Index Terms:
Markov Processes, Stochastic systems, Uncertain systems.

I Introduction

Autonomous systems are being increasingly deployed in real-world settings. Hence, the associated risk that stems from unknown and unforeseen circumstances is correspondingly on the rise. This calls for autonomous systems that can make appropriately conservative decisions when faced with uncertainty in their environment and behavior. Mathematically speaking, risk can be quantified in numerous ways, such as chance constraints [50] and distributional robustness [51]. However, applications in autonomy and robotics require more “nuanced assessments of risk” [32]. Artzner et al. [10] characterized a set of natural properties that are desirable for a risk measure, called a coherent risk measure, and coherent risk measures have obtained widespread acceptance in finance and operations research, among other fields.

A popular model for representing sequential decision making under uncertainty is the Markov decision process (MDP) [37]. MDPs with coherent risk objectives were studied in [47, 46], where the authors proposed a sampling-based algorithm for finding saddle point solutions using policy gradient methods. However, [47] requires the risk envelope appearing in the dual representation of the coherent risk measure to be known in an explicit canonical convex programming formulation. While this may be the case for CVaR, mean-semi-deviation, and spectral risk measures [42], such an explicit form is not known for general coherent risk measures, such as EVaR. Furthermore, it is not clear whether the saddle point solutions are a lower bound or an upper bound to the optimal value. Also, policy-gradient based methods require calculating the gradient of the coherent risk measure, which is not available in explicit form in general. For the CVaR measure, MDPs with risk constraints and total expected costs were studied in [36, 16] and locally optimal solutions were found via policy gradients as well. However, this method also leads to saddle point solutions (which cannot be shown to be upper or lower bounds of the optimal value) and cannot be applied to general coherent risk measures. In addition, because the objective and the constraints are in terms of different coherent risk measures, the authors assume there exists a policy that satisfies the CVaR constraint (feasibility assumption), which may not be the case in general. Following in the footsteps of [35], a promising approach based on approximate value iteration was proposed for MDPs with CVaR objectives in [17]. A policy iteration algorithm for finding policies that minimize total coherent risk measures for MDPs was studied in [41], where a computational non-smooth Newton method was also proposed.

When the states of the agent and/or the environment are not directly observable, a partially observable MDP (POMDP) can be used to study decision making under uncertainty introduced by the partial state observability [26, 2]. POMDPs with coherent risk measure objectives were studied in [22, 23]. Despite the elegance of the theory, no computational method was proposed to design policies for general coherent risk measures. In [3], we proposed a method for finding finite-state controllers for POMDPs with objectives defined in terms of coherent risk measures, which takes advantage of convex optimization techniques. However, the method can only be used if the risk transition mapping is affine in the policy.

Summary of Contributions: In this paper, we consider MDPs and POMDPs with both objectives and constraints in terms of coherent risk measures. Our contributions are fourfold:

  • (i)

    For MDPs, we use the Lagrangian framework to reformulate the problem into an inf-sup problem. For Markov risk transition mappings, we propose an optimization-based method to design Markovian policies that lower-bound the constrained risk-averse problem;

  • (ii)

    For MDPs, we show that the optimization problems take the special form of DCPs and can be solved by the DCCP method. We also demonstrate that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints;

  • (iii)

    For POMDPs, we demonstrate that, if the coherent risk measures can be defined as a Markov risk transition mapping, an infinite-dimensional optimization can be used to design Markovian belief-based policies, which in theory requires infinite memory to synthesize (in accordance with classical POMDP complexity results);

  • (iv)

    For POMDPs with stochastic finite-state controllers (FSCs), we show that the latter optimization converts to a (finite-dimensional) DCP and can be solved by the DCCP framework. We incorporate these DCPs in a policy iteration algorithm to design risk-averse FSCs for POMDPs.

We assess the efficacy of the proposed method with numerical experiments involving conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) risk measures.

Preliminary results on risk-averse MDPs were presented in [4]. This paper, in addition to providing detailed proofs and new numerical analysis in the MDP case, generalizes [4] to partially observable systems (POMDPs) with dynamic coherent risk objectives and constraints.

The rest of the paper is organized as follows. In the next section, we briefly review some notions used in the paper. In Section III, we formulate the problem under study. In Section IV, we present the optimization-based method for designing risk-averse policies for MDPs. In Section V, we describe a policy iteration method for designing finite-memory controllers for risk-averse POMDPs. In Section VI, we illustrate the proposed methodology via numerical experiments. Finally, in Section VII, we conclude the paper and give directions for future research.

Notation: We denote by $\mathbb{R}^{n}$ the $n$-dimensional Euclidean space and by $\mathbb{N}_{\geq 0}$ the set of non-negative integers. Throughout the paper, we use bold font to denote a vector and $(\cdot)^{\top}$ for its transpose, e.g., $\boldsymbol{a}=(a_{1},\ldots,a_{n})^{\top}$, with $n\in\{1,2,\ldots\}$. For a vector $\boldsymbol{a}$, we use $\boldsymbol{a}\succeq(\preceq)\boldsymbol{0}$ to denote element-wise non-negativity (non-positivity) and $\boldsymbol{a}\equiv\boldsymbol{0}$ to indicate that all elements of $\boldsymbol{a}$ are zero. For two vectors $\boldsymbol{a},\boldsymbol{b}\in\mathbb{R}^{n}$, we denote their inner product by $\langle\boldsymbol{a},\boldsymbol{b}\rangle$, i.e., $\langle\boldsymbol{a},\boldsymbol{b}\rangle=\boldsymbol{a}^{\top}\boldsymbol{b}$. For a finite set $\mathcal{A}$, we denote its power set by $2^{\mathcal{A}}$, i.e., the set of all subsets of $\mathcal{A}$. For a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and a constant $p\in[1,\infty)$, $\mathcal{L}_{p}(\Omega,\mathcal{F},\mathbb{P})$ denotes the vector space of real-valued random variables $c$ for which $\mathbb{E}|c|^{p}<\infty$.

II Preliminaries

In this section, we briefly review some notions and definitions used throughout the paper.

II-A Markov Decision Processes

An MDP is a tuple $\mathcal{M}=(\mathcal{S},Act,T,\kappa_{0})$ consisting of a set of states $\mathcal{S}=\{s_{1},\dots,s_{|\mathcal{S}|}\}$ of the autonomous agent(s) and world model, actions $Act=\{\alpha_{1},\dots,\alpha_{|Act|}\}$ available to the agent, a transition function $T(s_{j}|s_{i},\alpha)$, and an initial distribution $\kappa_{0}$ over the states.

This paper considers finite Markov decision processes, where $\mathcal{S}$ and $Act$ are finite sets. For each action, the probability of making a transition from state $s_{i}\in\mathcal{S}$ to state $s_{j}\in\mathcal{S}$ under action $\alpha\in Act$ is given by $T(s_{j}|s_{i},\alpha)$. The probabilistic components of an MDP must satisfy the following:

\begin{cases}\sum_{s\in\mathcal{S}}T(s|s_{i},\alpha)=1,&\forall s_{i}\in\mathcal{S},\ \forall\alpha\in Act,\\ \sum_{s\in\mathcal{S}}\kappa_{0}(s)=1.\end{cases}
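
As a concrete illustration of these conditions, the following minimal Python sketch checks them for a candidate model. The data layout (a dictionary `T` mapping each action to an $|\mathcal{S}|\times|\mathcal{S}|$ row-stochastic matrix and a NumPy vector `kappa0`) is an assumption made purely for illustration.

```python
import numpy as np

def is_valid_mdp(T, kappa0, tol=1e-9):
    """Check the stochasticity conditions above (illustrative helper).

    Assumed layout: T is a dict mapping each action alpha to an |S| x |S|
    matrix with T[alpha][i, j] = T(s_j | s_i, alpha); kappa0 is the initial
    distribution over states.
    """
    # Every row of every transition matrix must sum to one.
    rows_ok = all(np.allclose(M.sum(axis=1), 1.0, atol=tol) for M in T.values())
    # The initial distribution must be a probability vector.
    init_ok = np.isclose(kappa0.sum(), 1.0, atol=tol) and np.all(kappa0 >= -tol)
    return rows_ok and init_ok
```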

II-B Partially Observable MDPs

A POMDP is a tuple $\mathcal{PM}=(\mathcal{M},\mathcal{O},O)$ consisting of an MDP $\mathcal{M}$, observations $\mathcal{O}=\{o_{1},\dots,o_{|\mathcal{O}|}\}$, and an observation model $O(o\mid s)$. We consider finite POMDPs, where $\mathcal{O}$ is a finite set. For each state $s_{i}$, an observation $o\in\mathcal{O}$ is generated independently with probability $O(o|s_{i})$, which satisfies

\sum_{o\in\mathcal{O}}O(o|s)=1,\quad\forall s\in\mathcal{S}.

In POMDPs, the states $s\in\mathcal{S}$ are not directly observable. The belief $b\in\Delta(\mathcal{S})$, i.e., the probability of being in each state, with $\Delta(\mathcal{S})$ being the set of probability distributions over $\mathcal{S}$, can be computed using Bayes' law as follows:

b_{0}(s)=\frac{\kappa_{0}(s)O(o_{0}\mid s)}{\sum_{s^{\prime}\in\mathcal{S}}\kappa_{0}(s^{\prime})O(o_{0}\mid s^{\prime})}, (1)
b_{t}(s)=\frac{O(o_{t}\mid s)\sum_{s^{\prime}\in\mathcal{S}}T(s\mid s^{\prime},\alpha_{t})b_{t-1}(s^{\prime})}{\sum_{s\in\mathcal{S}}O(o_{t}\mid s)\sum_{s^{\prime}\in\mathcal{S}}T(s\mid s^{\prime},\alpha_{t})b_{t-1}(s^{\prime})}, (2)

for all $t\geq 1$.
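
A minimal NumPy sketch of this recursion is given below, assuming (for illustration only) the same array layout as before, i.e., `T[alpha][i, j] = T(s_j | s_i, alpha)` and `O[j, o] = O(o | s_j)`.

```python
import numpy as np

def initial_belief(kappa0, o0, O):
    """Eq. (1): condition the initial distribution on the first observation."""
    unnormalized = kappa0 * O[:, o0]
    return unnormalized / unnormalized.sum()

def belief_update(b_prev, alpha, o, T, O):
    """One step of the Bayes filter in Eq. (2) (illustrative sketch).

    Assumed layout: T[alpha][i, j] = T(s_j | s_i, alpha), O[j, o] = O(o | s_j),
    b_prev is the belief b_{t-1}, and alpha, o are the indices of the executed
    action and the received observation.
    """
    predicted = T[alpha].T @ b_prev        # sum_{s'} T(s | s', alpha) b_{t-1}(s')
    unnormalized = O[:, o] * predicted     # weight by the observation likelihood
    return unnormalized / unnormalized.sum()
```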

II-C Finite State Control of POMDPs

It is well established that designing optimal policies for POMDPs based on the (continuous) belief states requires uncountably infinite memory or internal states [15, 31]. This paper focuses on a particular class of POMDP controllers, namely, FSCs.

A stochastic finite state controller (FSC) for $\mathcal{PM}$ is given by the tuple $\mathcal{G}=(G,\omega,\kappa)$, where $G=\{g_{1},g_{2},\dots,g_{|G|}\}$ is a finite set of internal states (I-states) and $\omega:G\times\mathcal{O}\to\Delta(G\times Act)$ is a function of the internal FSC state $g_{k}$ and observation $o$, such that $\omega(g_{k},o)$ is a probability distribution over $G\times Act$. The next internal state and action pair $(g_{l},\alpha)$ is chosen by independent sampling of $\omega(g_{k},o)$. By abuse of notation, $\omega(g_{l},\alpha|g_{k},o)$ denotes the probability of transitioning to internal FSC state $g_{l}$ and taking action $\alpha$ when the current internal state is $g_{k}$ and observation $o$ is received. Finally, $\kappa:\Delta(\mathcal{S})\to\Delta(G)$ chooses the starting internal FSC state $g_{0}$ by independent sampling of $\kappa(\kappa_{0})$, given the initial distribution $\kappa_{0}$ of $\mathcal{PM}$; $\kappa(g|\kappa_{0})$ denotes the probability of starting the FSC in internal state $g$ when the initial POMDP distribution is $\kappa_{0}$.

II-D Coherent Risk Measures

Consider a probability space $(\Omega,\mathcal{F},\mathbb{P})$, a filtration $\mathcal{F}_{0}\subset\cdots\subset\mathcal{F}_{N}\subset\mathcal{F}$, and an adapted sequence of random variables (stage-wise costs) $c_{t}$, $t=0,\ldots,N$, where $N\in\mathbb{N}_{\geq 0}\cup\{\infty\}$. For $t=0,\ldots,N$, we further define the spaces $\mathcal{C}_{t}=\mathcal{L}_{p}(\Omega,\mathcal{F}_{t},\mathbb{P})$, $p\in[1,\infty)$, $\mathcal{C}_{t:N}=\mathcal{C}_{t}\times\cdots\times\mathcal{C}_{N}$, and $\mathcal{C}=\mathcal{C}_{0}\times\mathcal{C}_{1}\times\cdots$. We assume that the sequence $\boldsymbol{c}\in\mathcal{C}$ is almost surely bounded (with exceptions having probability zero), i.e., $\max_{t}\operatorname{ess\,sup}|c_{t}(\omega)|<\infty$.

In order to describe how one can evaluate the risk of the sub-sequence $c_{t},\ldots,c_{N}$ from the perspective of stage $t$, we require the following definitions.

Definition 1 (Conditional Risk Measure).

A mapping $\rho_{t:N}:\mathcal{C}_{t:N}\to\mathcal{C}_{t}$, where $0\leq t\leq N$, is called a conditional risk measure if it has the following monotonicity property:

\rho_{t:N}(\boldsymbol{c})\leq\rho_{t:N}(\boldsymbol{c}^{\prime}),\quad\forall\boldsymbol{c},\boldsymbol{c}^{\prime}\in\mathcal{C}_{t:N}\ \text{such that}\ \boldsymbol{c}\preceq\boldsymbol{c}^{\prime}.
Definition 2 (Dynamic Risk Measure).

A dynamic risk measure is a sequence of conditional risk measures $\rho_{t:N}:\mathcal{C}_{t:N}\to\mathcal{C}_{t}$, $t=0,\ldots,N$.

One fundamental property of dynamic risk measures is their consistency over time [41, Definition 3]. That is, if $c$ will be as good as $c^{\prime}$ from the perspective of some future time $\theta$, and they are identical between time $\tau$ and $\theta$, then $c$ should not be worse than $c^{\prime}$ from the perspective of time $\tau$.

In this paper, we focus on time-consistent, coherent risk measures, which satisfy the four mathematical properties defined below [42, p. 298].

Definition 3 (Coherent Risk Measure).

We call the one-step conditional risk measures $\rho_{t}:\mathcal{C}_{t+1}\to\mathcal{C}_{t}$, $t=1,\ldots,N-1$, coherent risk measures if they satisfy the following conditions:

  • Convexity: $\rho_{t}(\lambda c+(1-\lambda)c^{\prime})\leq\lambda\rho_{t}(c)+(1-\lambda)\rho_{t}(c^{\prime})$, for all $\lambda\in(0,1)$ and all $c,c^{\prime}\in\mathcal{C}_{t+1}$;

  • Monotonicity: If $c\leq c^{\prime}$ then $\rho_{t}(c)\leq\rho_{t}(c^{\prime})$ for all $c,c^{\prime}\in\mathcal{C}_{t+1}$;

  • Translational Invariance: $\rho_{t}(c+c^{\prime})=c+\rho_{t}(c^{\prime})$ for all $c\in\mathcal{C}_{t}$ and $c^{\prime}\in\mathcal{C}_{t+1}$;

  • Positive Homogeneity: $\rho_{t}(\beta c)=\beta\rho_{t}(c)$ for all $c\in\mathcal{C}_{t+1}$ and $\beta\geq 0$.

We are interested in discounted infinite-horizon problems. Let $\gamma\in(0,1)$ be a given discount factor. For $t=0,1,\ldots$, we define the functional

\rho^{\gamma}_{0,t}(c_{0},\ldots,c_{t})=\rho_{0,t}(c_{0},\gamma c_{1},\ldots,\gamma^{t}c_{t})=\rho_{0}\Big(c_{0}+\rho_{1}\big(\gamma c_{1}+\rho_{2}(\gamma^{2}c_{2}+\cdots+\rho_{t-1}(\gamma^{t-1}c_{t-1}+\rho_{t}(\gamma^{t}c_{t}))\cdots)\big)\Big). (3)

Finally, we have the total discounted risk functional $\rho^{\gamma}:\mathcal{C}\to\mathbb{R}$ defined as

\rho^{\gamma}(\boldsymbol{c})=\lim_{t\to\infty}\rho^{\gamma}_{0,t}(c_{0},\ldots,c_{t}). (4)

From [41, Theorem 3], we have that $\rho^{\gamma}$ is convex, monotone, and positive homogeneous.

II-E Examples of Coherent Risk Measures

Next, we briefly review three examples of coherent risk measures that will be used in this paper.

Total Conditional Expectation: The simplest risk measure is the total conditional expectation, given by

\rho_{t}(c_{t+1})=\mathbb{E}\left[c_{t+1}\mid\mathcal{F}_{t}\right]. (5)

It is easy to see that the total conditional expectation satisfies the properties of a coherent risk measure as outlined in Definition 3. Unfortunately, the total conditional expectation is agnostic to realization fluctuations of the random variable $c$ and is only concerned with the mean value of $c$ over a large number of realizations. Thus, it is a risk-neutral measure of performance.

Conditional Value-at-Risk: Let $c\in\mathcal{C}$ be a random variable. For a given confidence level $\varepsilon\in(0,1)$, value-at-risk ($\mathrm{VaR}_{\varepsilon}$) denotes the $(1-\varepsilon)$-quantile value of the random variable $c\in\mathcal{C}$. Unfortunately, working with VaR for non-normal random variables is numerically unstable and optimizing models involving VaR is intractable in high dimensions [39].

In contrast, CVaR overcomes the shortcomings of VaR. CVaR with confidence level $\varepsilon\in(0,1)$, denoted $\mathrm{CVaR}_{\varepsilon}$, measures the expected loss in the $(1-\varepsilon)$-tail given that the particular threshold $\mathrm{VaR}_{\varepsilon}$ has been crossed, i.e., $\mathrm{CVaR}_{\varepsilon}(c)=\mathbb{E}\left[c\mid c\geq\mathrm{VaR}_{\varepsilon}(c)\right]$. An optimization formulation for CVaR was proposed in [39]. That is, $\mathrm{CVaR}_{\varepsilon}$ is given by

\rho_{t}(c_{t+1})=\mathrm{CVaR}_{\varepsilon}(c_{t+1}):=\inf_{\zeta\in\mathbb{R}}\left(\zeta+\frac{1}{\varepsilon}\mathbb{E}\left[(c_{t+1}-\zeta)_{+}\mid\mathcal{F}_{t}\right]\right), (6)

where $(\cdot)_{+}=\max\{\cdot,0\}$. A value of $\varepsilon\to 1$ corresponds to the risk-neutral case, i.e., $\mathrm{CVaR}_{1}(c)=\mathbb{E}(c)$; whereas a value of $\varepsilon\to 0$ corresponds to the most risk-averse case, i.e., $\mathrm{CVaR}_{0}(c)=\mathrm{VaR}_{0}(c)=\operatorname{ess\,sup}(c)$ [38]. Figure 1 illustrates these notions for an example random variable $c$ with distribution $p(c)$.

Figure 1: Comparison of the mean, VaR, and CVaR for a given confidence level $\varepsilon\in(0,1)$. The axes denote the values of the stochastic variable $c$ and its probability density function $p(c)$. The shaded area denotes the $\varepsilon$ fraction of the area under $p(c)$. The expected cost $\mathbb{E}(c)$ is much smaller than the worst-case cost. VaR gives the value of $c$ at the $(1-\varepsilon)$-tail of the distribution, but it ignores the values of $c$ with probability below $\varepsilon$. CVaR is the average of the worst-case values of $c$ in the $(1-\varepsilon)$-tail of the distribution.

Entropic Value-at-Risk: Unfortunately, CVaR ignores the losses below the VaR threshold. EVaR is the tightest upper bound in the sense of the Chernoff inequality for VaR and CVaR, and its dual representation is associated with the relative entropy. In fact, it was shown in [8] that $\mathrm{EVaR}_{\varepsilon}$ and $\mathrm{CVaR}_{\varepsilon}$ are equal only if there are no losses ($c\to-\infty$) below the $\mathrm{VaR}_{\varepsilon}$ threshold. In addition, EVaR is a strictly monotone risk measure, whereas CVaR is only monotone [7]. $\mathrm{EVaR}_{\varepsilon}$ is given by

\rho_{t}(c_{t+1})=\inf_{\zeta>0}\left(\log\left(\frac{\mathbb{E}[e^{\zeta c_{t+1}}\mid\mathcal{F}_{t}]}{\varepsilon}\right)/\zeta\right). (7)

Similar to $\mathrm{CVaR}_{\varepsilon}$, for $\mathrm{EVaR}_{\varepsilon}$ a value of $\varepsilon\to 1$ corresponds to the risk-neutral case, whereas $\varepsilon\to 0$ corresponds to the most risk-averse case. In fact, it was demonstrated in [6, Proposition 3.2] that $\lim_{\varepsilon\to 0}\mathrm{EVaR}_{\varepsilon}(c)=\operatorname{ess\,sup}(c)$.
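
The variational forms (6) and (7) can be evaluated numerically by a one-dimensional minimization over $\zeta$. The following Python sketch does so for a discrete cost distribution; the bracketing intervals for $\zeta$ are illustrative choices, not prescribed by the formulas.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cvar(c, p, eps):
    """CVaR_eps of a discrete cost via the variational form (6).

    c: vector of cost realizations, p: corresponding pmf, eps: confidence level.
    """
    objective = lambda zeta: zeta + np.dot(p, np.maximum(c - zeta, 0.0)) / eps
    return minimize_scalar(objective, bounds=(c.min(), c.max()), method="bounded").fun

def evar(c, p, eps, zeta_max=10.0):
    """EVaR_eps of a discrete cost via (7); zeta > 0 is the dual variable.

    zeta_max is an illustrative upper bound on the search interval.
    """
    objective = lambda zeta: np.log(np.dot(p, np.exp(zeta * c)) / eps) / zeta
    return minimize_scalar(objective, bounds=(1e-8, zeta_max), method="bounded").fun

# Example: a cost that is 10 with probability 0.05 and 1 otherwise.
c = np.array([1.0, 10.0]); p = np.array([0.95, 0.05])
print(cvar(c, p, 0.1), evar(c, p, 0.1))   # both exceed the mean E[c] = 1.45
```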

III Problem Formulation

Over the past two decades, coherent and dynamic risk measures have been developed and used in microeconomics and mathematical finance [49]. Generally speaking, risk-averse decision making is concerned with the behavior of agents, e.g., consumers and investors, who, when exposed to uncertainty, attempt to lower that uncertainty. Such agents may avoid situations with unknown payoffs in favor of situations with payoffs that are more predictable.

The core idea in risk-averse planning is to replace the conventional risk-neutral conditional expectation of the cumulative cost objective with more general coherent risk measures. In path planning scenarios in particular, our numerical experiments show that considering coherent risk measures leads to significantly more robustness to environment uncertainty and to collisions that lead to mission failure.

In addition to risk-aversion in the total cost, an agent is often subject to constraints, e.g., fuel, communication, or energy budgets [27]. These constraints can also represent mission objectives, e.g., exploring an area or reaching a goal.

Consider a stationary controlled Markov process $\{q_{t}\}$, $t=0,1,\ldots$ (an MDP or a POMDP) with initial probability distribution $\kappa_{0}$, wherein policies, transition probabilities, and cost functions do not depend explicitly on time. Each policy $\pi=\{\pi_{t}\}_{t=0}^{\infty}$ leads to cost sequences $\boldsymbol{c}_{t}=c(q_{t},\alpha_{t})$, $t=0,1,\ldots$, and $\boldsymbol{d}_{t}^{i}=d^{i}(q_{t},\alpha_{t})$, $t=0,1,\ldots$, $i=1,2,\ldots,n_{c}$. We define the dynamic risk of evaluating the $\gamma$-discounted cost of a policy $\pi$ as

J_{\gamma}(\kappa_{0},\pi)=\rho^{\gamma}\big(c(q_{0},\alpha_{0}),c(q_{1},\alpha_{1}),\ldots\big), (8)

and the $\gamma$-discounted dynamic risk constraints of executing policy $\pi$ as

D_{\gamma}^{i}(\kappa_{0},\pi)=\rho^{\gamma}\left(d^{i}(q_{0},\alpha_{0}),d^{i}(q_{1},\alpha_{1}),\ldots\right)\leq\beta^{i},\quad i=1,2,\ldots,n_{c}, (9)

where $\rho^{\gamma}$ is defined in equation (4), $q_{0}\sim\kappa_{0}$, and $\beta^{i}>0$, $i=1,2,\ldots,n_{c}$, are given constants. We assume that $c(\cdot,\cdot)$ and $d^{i}(\cdot,\cdot)$, $i=1,2,\ldots,n_{c}$, are non-negative and upper-bounded. For a discount factor $\gamma\in(0,1)$, an initial condition $\kappa_{0}$, and a policy $\pi$, we infer from [41, Theorem 3] that both $J_{\gamma}(\kappa_{0},\pi)$ and $D_{\gamma}^{i}(\kappa_{0},\pi)$ are well-defined (bounded), since $c$ and $d^{i}$ are bounded.

In this work, we are interested in addressing the following problem:

Problem 1.

For a controlled Markov decision process (an MDP or a POMDP), a discount factor $\gamma\in(0,1)$, a total risk functional $J_{\gamma}(\kappa_{0},\pi)$ as in equation (8), and total cost constraints (9), where $\{\rho_{t}\}_{t=0}^{\infty}$ are coherent risk measures, compute

\pi^{*}\in\operatorname*{argmin}_{\pi}\ J_{\gamma}(\kappa_{0},\pi)
\text{subject to}\quad\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)\preceq\boldsymbol{\beta}. (10)

We call a controlled Markov process with the “nested” objective (8) and constraints (9) a constrained risk-averse Markov process.

For MDPs, [17, 33] show that such coherent risk measure objectives can account for modeling errors and parametric uncertainties. We can also interpret Problem 1 as designing policies that minimize the accrued costs in a risk-averse sense while ensuring that the system constraints, e.g., fuel constraints, are not violated even in rare but costly scenarios.

Note that in Problem 1 both the objective function and the constraints are in general non-differentiable and non-convex in the policy $\pi$ (with the exception of the total expected cost as the coherent risk measure $\rho^{\gamma}$ [9]). Therefore, finding optimal policies may in general be hopeless. Instead, we find sub-optimal policies by taking advantage of a Lagrangian formulation and then using an optimization form of Bellman's equations.

Next, we show that the constrained risk-averse problem is equivalent to a non-constrained inf-sup risk-averse problem thanks to the Lagrangian method.

Proposition 1.

Let $J_{\gamma}(\kappa_{0})$ be the value of Problem 1 for a given initial distribution $\kappa_{0}$ and discount factor $\gamma$. Then, (i) the value function satisfies

J_{\gamma}(\kappa_{0})=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi,\boldsymbol{\lambda}), (11)

where

L_{\gamma}(\pi,\boldsymbol{\lambda})=J_{\gamma}(\kappa_{0},\pi)+\langle\boldsymbol{\lambda},\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)-\boldsymbol{\beta}\rangle, (12)

is the Lagrangian function.
(ii) Furthermore, a policy $\pi^{*}$ is optimal for Problem 1 if and only if $J_{\gamma}(\kappa_{0})=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi^{*},\boldsymbol{\lambda})$.

Proof.

(i) If Problem 1 is not feasible for some $\pi$, then $\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi,\boldsymbol{\lambda})=\infty$. Indeed, if the $i$th constraint is not satisfied, i.e., $D_{\gamma}^{i}>\beta^{i}$, we can drive the supremum to infinity by letting $\lambda^{i}\to\infty$ while keeping the remaining $\lambda^{j}$'s constant or zero. If Problem 1 is feasible for $\pi$, then the supremum is achieved by setting $\boldsymbol{\lambda}=\boldsymbol{0}$. Hence, $L_{\gamma}(\pi,\boldsymbol{\lambda})=J_{\gamma}(\kappa_{0},\pi)$ and

\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi,\boldsymbol{\lambda})=\inf_{\pi:\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)\preceq\boldsymbol{\beta}}J_{\gamma}(\kappa_{0},\pi),

which implies (i).
(ii) If $\pi^{*}$ is optimal, then, from (11), we have

J_{\gamma}(\kappa_{0})=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi^{*},\boldsymbol{\lambda}).

Conversely, if $J_{\gamma}(\kappa_{0})=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi^{\prime},\boldsymbol{\lambda})$ for some $\pi^{\prime}$, then from (11) we have $\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi,\boldsymbol{\lambda})=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi^{\prime},\boldsymbol{\lambda})$. Hence, $\pi^{\prime}$ is an optimal policy. ∎

IV Constrained Risk-Averse MDPs

At any time $t$, the value of $\rho_{t}$ is $\mathcal{F}_{t}$-measurable and is allowed to depend on the entire history of the process $\{s_{0},s_{1},\ldots\}$; hence, we cannot expect to obtain a Markov optimal policy [34, 11]. In order to obtain Markov policies, we need the following property [41].

Definition 4 (Markov Risk Measure).

Let $m,n\in[1,\infty)$ such that $1/m+1/n=1$ and $\mathcal{P}=\big\{p\in\mathcal{L}_{n}(\mathcal{S},2^{\mathcal{S}},\mathbb{P})\mid\sum_{s^{\prime}\in\mathcal{S}}p(s^{\prime})\mathbb{P}(s^{\prime})=1,\ p\geq 0\big\}$. A one-step conditional risk measure $\rho_{t}:\mathcal{C}_{t+1}\to\mathcal{C}_{t}$ is a Markov risk measure with respect to the controlled Markov process $\{s_{t}\}$, $t=0,1,\ldots$, if there exists a risk transition mapping $\sigma_{t}:\mathcal{L}_{m}(\mathcal{S},2^{\mathcal{S}},\mathbb{P})\times\mathcal{S}\times\mathcal{P}\to\mathbb{R}$ such that for all $v\in\mathcal{L}_{m}(\mathcal{S},2^{\mathcal{S}},\mathbb{P})$ and $\alpha_{t}\in\pi(s_{t})$, we have

\rho_{t}(v(s_{t+1}))=\sigma_{t}(v(s_{t+1}),s_{t},p(s_{t+1}|s_{t},\alpha_{t})), (13)

where $p:\mathcal{S}\times Act\to\mathcal{P}$ is called the controlled kernel.

In fact, if $\rho_{t}$ is a coherent risk measure, $\sigma_{t}$ also satisfies the properties of a coherent risk measure (Definition 3). In this paper, since we are concerned with MDPs, the controlled kernel is simply the transition function $T$.

Assumption 1.

The one-step coherent risk measure $\rho_{t}$ is a Markov risk measure.

The simplest risk transition mapping arises in the conditional expectation case $\rho_{t}(v(s_{t+1}))=\mathbb{E}\{v(s_{t+1})\mid s_{t},\alpha_{t}\}$, i.e.,

\sigma\left\{v(s_{t+1}),s_{t},p(s_{t+1}|s_{t},\alpha_{t})\right\}=\mathbb{E}\{v(s_{t+1})\mid s_{t},\alpha_{t}\}=\sum_{s_{t+1}\in\mathcal{S}}v(s_{t+1})T(s_{t+1}\mid s_{t},\alpha_{t}). (14)

Note that in the total discounted expectation case $\sigma$ is a linear function of $v$, rather than a convex function, which is the case for a general coherent risk measure. For example, for the CVaR risk measure, the Markov risk transition mapping is given by

\sigma\{v(s_{t+1}),s_{t},p(s_{t+1}|s_{t},\alpha_{t})\}=\inf_{\zeta\in\mathbb{R}}\left\{\zeta+\frac{1}{\varepsilon}\sum_{s_{t+1}\in\mathcal{S}}\left(v(s_{t+1})-\zeta\right)_{+}T(s_{t+1}\mid s_{t},\alpha_{t})\right\},

where $(\cdot)_{+}=\max\{\cdot,0\}$, which is a convex function of $v$.

If $\sigma$ is a coherent, Markov risk measure, then Markov policies are sufficient to ensure optimality [41].

In the next result, we show that we can find a lower bound on the solution to Problem 1 by solving an optimization problem.

Theorem 1.

Consider an MDP $\mathcal{M}$ with the nested risk objective (8), constraints (9), and discount factor $\gamma\in(0,1)$. Let Assumption 1 hold and let $\rho_{t}$, $t=0,1,\ldots$, be coherent risk measures as described in Definition 3. Then, the solution $(\boldsymbol{V}^{*}_{\gamma},\boldsymbol{\lambda}^{*})$ to the following optimization problem (Bellman's equation)

\sup_{\boldsymbol{V}_{\gamma},\boldsymbol{\lambda}\succeq\boldsymbol{0}}\ \langle\boldsymbol{\kappa}_{0},\boldsymbol{V}_{\gamma}\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle
\text{subject to}
V_{\gamma}(s)\leq c(s,\alpha)+\langle\boldsymbol{\lambda},\boldsymbol{d}(s,\alpha)\rangle+\gamma\sigma\{V_{\gamma}(s^{\prime}),s,p(s^{\prime}|s,\alpha)\},\quad\forall s\in\mathcal{S},\ \forall\alpha\in Act, (15)

satisfies

J_{\gamma}(\kappa_{0})\geq\langle\boldsymbol{\kappa}_{0},\boldsymbol{V}^{*}_{\gamma}\rangle-\langle\boldsymbol{\lambda}^{*},\boldsymbol{\beta}\rangle. (16)
Proof.

From Proposition 1, we know that (11) holds. Hence, we have

J_{\gamma}(\kappa_{0})=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(J_{\gamma}(\kappa_{0},\pi)+\langle\boldsymbol{\lambda},\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)-\boldsymbol{\beta}\rangle\right)
=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(J_{\gamma}(\kappa_{0},\pi)+\langle\boldsymbol{\lambda},\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)
=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\rho^{\gamma}(c)+\langle\boldsymbol{\lambda},\rho^{\gamma}(d)\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)
=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\rho^{\gamma}(c)+\rho^{\gamma}(\langle\boldsymbol{\lambda},d\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)
\geq\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\rho^{\gamma}(c+\langle\boldsymbol{\lambda},d\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)
\geq\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\inf_{\pi}\left(\rho^{\gamma}(c+\langle\boldsymbol{\lambda},d\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right), (17)

where in the fourth line we used the positive homogeneity of $\rho^{\gamma}$, in the fifth line its sub-additivity, and in the last line the max-min inequality. Since $\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle$ does not depend on $\pi$, to evaluate the inner infimum it suffices to solve

\inf_{\pi}\rho^{\gamma}(\tilde{c}),

where $\tilde{c}=c+\langle\boldsymbol{\lambda},d\rangle$. The value of this optimization can be obtained by solving the following Bellman equation [41, Theorem 4]:

V_{\gamma}(s)=\inf_{\alpha\in Act}\Big(\tilde{c}(s,\alpha)+\gamma\sigma\{V_{\gamma}(s^{\prime}),s,p(s^{\prime}|s,\alpha)\}\Big).

Next, we show that the solution to the above Bellman equation can alternatively be obtained by solving the convex optimization

\sup_{V_{\gamma}}\ \langle\kappa_{0},V_{\gamma}\rangle
\text{subject to}
V_{\gamma}(s)\leq\tilde{c}(s,\alpha)+\gamma\sigma\{V_{\gamma}(s^{\prime}),s,p(s^{\prime}|s,\alpha)\},\quad\forall s,\alpha. (18)

Define

\mathfrak{D}_{\pi}v:=\tilde{c}(s,\pi(s))+\gamma\sigma\{v(s^{\prime}),s,p(s^{\prime}|s,\pi(s))\},\quad\forall s\in\mathcal{S},

and $\mathfrak{D}v:=\min_{\alpha\in Act}\left(\tilde{c}(s,\alpha)+\gamma\sigma\{v(s^{\prime}),s,p(s^{\prime}|s,\alpha)\}\right)$ for all $s\in\mathcal{S}$. From [41, Lemma 1], we infer that $\mathfrak{D}_{\pi}$ and $\mathfrak{D}$ are non-decreasing; i.e., for $v\leq w$, we have $\mathfrak{D}_{\pi}v\leq\mathfrak{D}_{\pi}w$ and $\mathfrak{D}v\leq\mathfrak{D}w$. Therefore, if $V_{\gamma}\leq\mathfrak{D}_{\pi}V_{\gamma}$, then $\mathfrak{D}_{\pi}V_{\gamma}\leq\mathfrak{D}_{\pi}(\mathfrak{D}_{\pi}V_{\gamma})$. By repeated application of $\mathfrak{D}_{\pi}$, we obtain

V_{\gamma}\leq\mathfrak{D}_{\pi}V_{\gamma}\leq\mathfrak{D}_{\pi}^{2}V_{\gamma}\leq\cdots\leq\mathfrak{D}_{\pi}^{\infty}V_{\gamma}=V^{*}_{\gamma}.

Any feasible solution to (18) must satisfy $V_{\gamma}\leq\mathfrak{D}_{\pi}V_{\gamma}$ and hence must satisfy $V_{\gamma}\leq V^{*}_{\gamma}$. Thus, given that all entries of $\kappa_{0}$ are positive, $V^{*}_{\gamma}$ is the optimal solution to (18). Substituting (18) back into the last inequality in (17) yields the result. ∎

Once the values of $\boldsymbol{\lambda}^{*}$ and $\boldsymbol{V}^{*}_{\gamma}$ are found by solving optimization problem (15), we can find the policy as

\pi^{*}(s)\in\operatorname*{argmin}_{\alpha\in Act}\Big(c(s,\alpha)+\langle\boldsymbol{\lambda}^{*},\boldsymbol{d}(s,\alpha)\rangle+\gamma\sigma\{V^{*}_{\gamma}(s^{\prime}),s,p(s^{\prime}|s,\alpha)\}\Big). (19)
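
As a sketch of how (19) can be used once (15) has been solved, the snippet below extracts a greedy Markovian policy. The data structures (cost arrays `c[s, a]` and `d[i, s, a]`, transition matrices `T[a]`, and a user-supplied risk transition mapping `sigma`) are illustrative assumptions, not part of the formulation above.

```python
import numpy as np

def extract_policy(V, lam, c, d, T, gamma, sigma):
    """Greedy policy extraction following Eq. (19) (illustrative sketch).

    Assumptions: V is the optimal value vector, lam the optimal Lagrange
    multipliers, c[s, a] and d[i, s, a] the cost arrays, T[a][s, s'] the
    transition probabilities, and sigma(V, s, p) the Markov risk transition
    mapping of the chosen coherent risk measure.
    """
    n_states, n_actions = c.shape
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [c[s, a] + lam @ d[:, s, a] + gamma * sigma(V, s, T[a][s, :])
             for a in range(n_actions)]
        policy[s] = int(np.argmin(q))
    return policy

# Risk-neutral instance of sigma, cf. Eq. (14): the conditional expectation.
sigma_expectation = lambda V, s, p: float(p @ V)
```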

One interesting observation is that if the coherent risk measure $\rho_{t}$ is the total discounted expectation, then Theorem 1 is consistent with the classical result of [9] on constrained MDPs.

Corollary 1.

Let the assumptions of Theorem 1 hold and let $\rho_{t}(\cdot)=\mathbb{E}(\cdot|s_{t},\alpha_{t})$, $t=1,2,\ldots$. Then the solution $(\boldsymbol{V}^{*}_{\gamma},\boldsymbol{\lambda}^{*})$ to optimization (15) satisfies

J_{\gamma}(\kappa_{0})=\langle\boldsymbol{\kappa}_{0},\boldsymbol{V}^{*}_{\gamma}\rangle-\langle\boldsymbol{\lambda}^{*},\boldsymbol{\beta}\rangle.

Furthermore, with $\rho_{t}(\cdot)=\mathbb{E}(\cdot|s_{t},\alpha_{t})$, $t=1,2,\ldots$, optimization (15) becomes a linear program.

Proof.

From the derivation in (17), we observe that the two inequalities follow from (a) the sub-additivity of $\rho^{\gamma}$ and (b) the max-min inequality. Next, we show that in the case of total expectation both of these properties hold with equality.
(a) Sub-additivity of $\rho^{\gamma}$: for total expectation, we have

\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}c_{t}+\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle=\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(c_{t}+\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle).

Thus, equality holds.
(b) Max-min inequality: in the case $\rho^{\gamma}_{\kappa_{0}}(\cdot)=\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(\cdot)$, both the objective function and the constraints are linear in the decision variables $\pi$ and $\boldsymbol{\lambda}$. Therefore, the inf-sup expression in (17) reads as

\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\rho^{\gamma}(\boldsymbol{c}+\langle\boldsymbol{\lambda},\boldsymbol{d}\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(c_{t}+\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right). (20)

The expression inside the parentheses above is convex (linear) in $\pi$ ($\mathbb{E}_{\kappa_{0}}^{\pi}$ is linear in the policy) and concave (linear) in $\boldsymbol{\lambda}$. Hence, from the Minimax Theorem [20], the following equality holds:

\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(c_{t}+\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\inf_{\pi}\left(\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(c_{t}+\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right).

Furthermore, from (14), we see that $\sigma$ is linear in $v$ for total expectation. Therefore, the constraint in (15) is linear in $V_{\gamma}$ and $\boldsymbol{\lambda}$. Since $\langle\boldsymbol{\kappa}_{0},\boldsymbol{V}_{\gamma}\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle$ is also linear in $V_{\gamma}$ and $\boldsymbol{\lambda}$, optimization (15) becomes a linear program in the case of the total expectation coherent risk measure. ∎
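
To make Corollary 1 concrete, the following CVXPY sketch sets up the linear program obtained from (15) when $\sigma$ is the conditional expectation (14). The input layout (a dict `T` of transition matrices, cost arrays `c` and `d`, a budget vector `beta`, and an initial distribution `kappa0`) is assumed for illustration.

```python
import cvxpy as cp
import numpy as np

def constrained_mdp_lp(T, c, d, beta, kappa0, gamma):
    """Linear program of Corollary 1 (total-expectation case), a sketch.

    Assumptions: T[a][s, s'] = T(s' | s, a), c[s, a] and d[i, s, a] are
    bounded non-negative costs, and beta is the constraint budget vector.
    """
    n_states, n_actions = c.shape
    n_constraints = d.shape[0]
    V = cp.Variable(n_states)
    lam = cp.Variable(n_constraints, nonneg=True)
    constraints = []
    for s in range(n_states):
        for a in range(n_actions):
            # V(s) <= c(s,a) + <lam, d(s,a)> + gamma * E[V(s') | s, a]
            constraints.append(
                V[s] <= c[s, a] + lam @ d[:, s, a] + gamma * (T[a][s, :] @ V)
            )
    problem = cp.Problem(cp.Maximize(kappa0 @ V - lam @ beta), constraints)
    problem.solve()
    return V.value, lam.value
```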

In [4], we presented a method based on difference convex programs to solve (15) when $\rho^{\gamma}$ is an arbitrary coherent risk measure, and we described the specific structure of the optimization problem for conditional expectation, CVaR, and EVaR. In fact, it was shown that (15) can be written in the standard DCP format

\inf_{\boldsymbol{V}_{\gamma},\boldsymbol{\lambda}\succeq\boldsymbol{0}}\ f_{0}(\boldsymbol{\lambda})-g_{0}(\boldsymbol{V}_{\gamma})
\text{subject to}
f_{1}(V_{\gamma})-g_{1}(\boldsymbol{\lambda})-g_{2}(V_{\gamma})\leq 0,\quad\forall s,\alpha. (21)

Optimization problem (21) is a standard DCP [25]. DCPs arise in many applications, such as feature selection in machine learning [29] and inverse covariance estimation in statistics [48]. Although DCPs can be solved globally [25], e.g., using branch and bound algorithms [28], a locally optimal solution can be obtained more efficiently using techniques from nonlinear optimization [13]. In particular, in this work, we use a variant of the convex-concave procedure [30, 43], wherein the concave terms are replaced by a convex upper bound and the resulting problem is solved. In fact, the disciplined convex-concave programming (DCCP) [43] technique linearizes DCP problems into a (disciplined) convex program (carried out automatically via the DCCP Python package [43]), which is then converted into an equivalent cone program by replacing each function with its graph implementation. The cone program can then be solved readily by available convex programming solvers, such as CVXPY [18].
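
As an illustrative sketch of this pipeline for the CVaR case, the inner infimum over $\zeta$ in (6) can be lifted into per-state-action decision variables and optimization (15) can be posed and handed to the DCCP extension of CVXPY as follows. The data layout matches the earlier hypothetical examples; this is a sketch under those assumptions, not a definitive implementation of the solver used in our experiments.

```python
import cvxpy as cp
import dccp  # registers the 'dccp' solve method on CVXPY problems
import numpy as np

def risk_averse_bellman_dccp(T, c, d, beta, kappa0, gamma, eps):
    """Sketch of (15) with a CVaR risk transition mapping, in DCP form (21).

    Assumptions: T[a][s, s'] = T(s' | s, a), c[s, a] and d[i, s, a] are
    bounded costs, eps is the CVaR confidence level; zeta[s, a] lifts the
    inner infimum of (6) into decision variables.
    """
    n_states, n_actions = c.shape
    n_constraints = d.shape[0]
    V = cp.Variable(n_states)
    lam = cp.Variable(n_constraints, nonneg=True)
    zeta = cp.Variable((n_states, n_actions))
    constraints = []
    for s in range(n_states):
        for a in range(n_actions):
            tail = cp.sum(cp.multiply(T[a][s, :], cp.pos(V - zeta[s, a])))
            rhs = c[s, a] + lam @ d[:, s, a] + gamma * (zeta[s, a] + tail / eps)
            constraints.append(V[s] <= rhs)   # affine <= convex: handled by DCCP
    problem = cp.Problem(cp.Maximize(kappa0 @ V - lam @ beta), constraints)
    problem.solve(method="dccp")              # convex-concave procedure of [43]
    return V.value, lam.value
```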

We end this section by pointing out that solving (15) using the DCCP method only finds (local) saddle points of optimization problem (15). Nevertheless, every saddle point of (15) satisfies (16) (from Theorem 1). In fact, every saddle point is a lower bound on the optimal value of Problem 1.

V Constrained Risk-Averse POMDPs

Next, we show that, in the case of POMDPs, we can find a lower bound on the solution to Problem 1 by solving an infinite-dimensional optimization problem. Note that a POMDP is equivalent to a belief MDP $\{b_{t}\}$, $t=1,2,\ldots$, where $b_{t}$ is defined in (2).

Theorem 2.

Consider a POMDP $\mathcal{PM}$ with the nested risk objective (8) and constraints (9) with $\gamma\in(0,1)$. Let Assumption 1 hold, let $\rho_{t}$, $t=0,1,\ldots$, be coherent risk measures, and suppose $c(\cdot,\cdot)$ and $\{d^{i}(\cdot,\cdot)\}_{i=1}^{n_{c}}$ are non-negative and upper-bounded. Then, the solution $(\boldsymbol{\lambda}^{*},\boldsymbol{V}^{*}_{\gamma})$ to the following Bellman equation

\sup_{\boldsymbol{V}_{\gamma},\boldsymbol{\lambda}\succeq\boldsymbol{0}}\ \langle\boldsymbol{b}_{0},\boldsymbol{V}_{\gamma}\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle
\text{subject to}
V_{\gamma}(b)\leq c(b,\alpha)+\langle\boldsymbol{\lambda},\boldsymbol{d}(b,\alpha)\rangle+\gamma\sigma\{V_{\gamma}(b^{\prime}),b,p(b^{\prime}|b,\alpha)\},\quad\forall b\in\Delta(\mathcal{S}),\ \forall\alpha\in Act, (22)

where $c(b,\alpha)=\sum_{s\in\mathcal{S}}c(s,\alpha)b(s)$ and $d(b,\alpha)=\sum_{s\in\mathcal{S}}d(s,\alpha)b(s)$, satisfies

J_{\gamma}(b_{0})\geq\langle\boldsymbol{b}_{0},\boldsymbol{V}^{*}_{\gamma}\rangle-\langle\boldsymbol{\lambda}^{*},\boldsymbol{\beta}\rangle. (23)
Proof.

Note that a POMDP can be represented as an MDP over the belief states (2) with initial distribution (1). Hence, a POMDP is a controlled Markov process with states $b\in\Delta(\mathcal{S})$, where the controlled belief transition probability is described as

p(b^{\prime}\mid b,\alpha)=\sum_{o\in\mathcal{O}}p(b^{\prime}\mid b,o,\alpha)\,p(o\mid b,\alpha)=\sum_{o\in\mathcal{O}}\delta\left(b^{\prime}-\frac{O(o\mid s,\alpha)\sum_{s^{\prime}\in\mathcal{S}}T(s\mid s^{\prime},\alpha)b(s^{\prime})}{\sum_{s\in\mathcal{S}}O(o\mid s,\alpha)\sum_{s^{\prime}\in\mathcal{S}}T(s\mid s^{\prime},\alpha)b(s^{\prime})}\right)\times\sum_{s\in\mathcal{S}}O(o\mid s,\alpha)\sum_{s^{\prime\prime}\in\mathcal{S}}T(s\mid s^{\prime\prime},\alpha)b(s^{\prime\prime}),

with

\delta(a)=\begin{cases}1&a=0,\\ 0&\text{otherwise}.\end{cases}

The rest of the proof follows the same footsteps as the proof of Theorem 1, applied to the belief MDP with $p(b^{\prime}|b,\alpha)$ as defined above. ∎

Unfortunately, since $b\in\Delta(\mathcal{S})$ and hence $V_{\gamma}:\Delta(\mathcal{S})\to\mathbb{R}$, optimization (22) is infinite-dimensional and cannot be solved efficiently.

If the one-step coherent risk measure $\rho_{t}$ is the total discounted expectation, we can show that optimization problem (22) simplifies to an infinite-dimensional linear program and equality holds in (23). This can be proved following the same lines as the proof of Corollary 1, but for the belief MDP. Hence, Theorem 2 also provides an optimization-based solution to the constrained POMDP problem.

V-A Risk-Averse FSC Synthesis via Policy Iteration

In order to synthesize risk-averse FSCs, we employ a policy iteration algorithm. Policy iteration incrementally improves a controller by alternating between two steps: Policy Evaluation (computing value functions by fixing the policy) and Policy Improvement (computing the policy by fixing the value functions), until convergence to a satisfactory policy [12]. For a risk-averse POMDP, policy evaluation can be carried out by solving (22). However, as mentioned earlier, (22) is difficult to use directly, since it must be solved at each (continuous) belief state in the belief space, which is uncountably infinite.

In the following, we show that if, instead of considering policies with infinite memory, we search over finite-memory policies, then we can find suboptimal solutions to Problem 1 that lower-bound $J_{\gamma}(\kappa_{0})$. To this end, we consider stochastic but finite-memory controllers as described in Section II-C.

Closing the loop around a POMDP with an FSC $\mathcal{G}$ induces a Markov chain. The global Markov chain $\mathcal{MC}^{\mathcal{PM},\mathcal{G}}_{\mathcal{S}\times G}$ (or simply $\mathcal{MC}$, when the FSC and the POMDP are clear from the context) has executions $\{[s_{0},g_{0}],[s_{1},g_{1}],\dots\}$, $[s_{t},g_{t}]\in\mathcal{S}\times G$. The probability of the initial global state $[s_{0},g_{0}]$ is

\iota_{init}\left(\left[s_{0},g_{0}\right]\right)=\kappa_{0}(s_{0})\kappa(g_{0}|\kappa_{0}).

The state transition probability, $T^{\mathcal{M}}$, is given by

T^{\mathcal{M}}\left(\left[s_{t+1},g_{t+1}\right]\mid\left[s_{t},g_{t}\right]\right)=\sum_{o\in\mathcal{O}}\sum_{\alpha\in Act}O(o|s_{t})\,\omega(g_{t+1},\alpha|g_{t},o)\,T(s_{t+1}|s_{t},\alpha).

V-B Risk Value Function Computation

Under an FSC, the POMDP is transformed into a Markov chain $\mathcal{M}^{\mathcal{PM}\times\mathcal{G}}_{\mathcal{S}\times\mathcal{G}}$ with design probability distributions $\omega$ and $\kappa$. The closed-loop Markov chain $\mathcal{M}^{\mathcal{PM}\times\mathcal{G}}_{\mathcal{S}\times\mathcal{G}}$ is a controlled Markov process with $\{q_{t}\}=\{[s_{t},g_{t}]\}$, $t=1,2,\ldots$. In this setting, the total risk functional (8) becomes a function of $\iota_{init}$ and the FSC $\mathcal{G}$, i.e.,

J_{\gamma}(\iota_{init},\mathcal{G})=\rho^{\gamma}\big(c([s_{0},g_{0}],\alpha_{0}),c([s_{1},g_{1}],\alpha_{1}),\ldots\big),\quad s_{0}\sim\kappa_{0},\ g_{0}\sim\kappa, (24)

where the $\alpha_{t}$'s and $g_{t}$'s are drawn from the probability distribution $\omega(g_{t+1},\alpha_{t}\mid g_{t},o_{t})$. The constraint functionals $D_{\gamma}^{i}(\iota_{init},\mathcal{G})$, $i=1,2,\ldots,n_{c}$, are defined similarly.

Let $J_{\gamma}(\boldsymbol{\iota}_{init})$ be the value of Problem 1 under an FSC $\mathcal{G}$. Then, it is evident that $J_{\gamma}(\boldsymbol{b}_{0})\geq J_{\gamma}(\boldsymbol{\iota}_{init})$, since FSCs restrict the search space of the policy $\pi$. That is, FSCs can only be as good as the (infinite-dimensional) belief-based policy $\pi(b)$ as $|G|\to\infty$ (infinite memory).

Risk Value Function Optimization: For POMDPs controlled by stochastic finite state controllers, the dynamic program is developed in the global state space $\mathcal{S}\times G$. From Theorem 1, we see that for a given FSC $\mathcal{G}$ and POMDP $\mathcal{PM}$, the value function $V_{\gamma,\mathcal{M}}([s,g])$ can be computed by solving the following finite-dimensional optimization

\sup_{\boldsymbol{V}_{\gamma,\mathcal{M}},\boldsymbol{\lambda}\succeq\boldsymbol{0}}\ \langle\boldsymbol{\iota}_{init},\boldsymbol{V}_{\gamma,\mathcal{M}}\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle
\text{subject to}
V_{\gamma,\mathcal{M}}([s,g])\leq\sum_{\alpha\in Act}p(\alpha\mid g)\tilde{c}([s,g],\alpha)+\gamma\sigma\Big\{V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}]),[s,g],T^{\mathcal{M}}\left([s^{\prime},g^{\prime}]\mid[s,g]\right)\Big\},\quad\forall s\in\mathcal{S},\ \forall g\in G, (25)

where $p(\alpha\mid g)=\sum_{g^{\prime}\in\mathcal{G},o\in\mathcal{O}}\omega(g^{\prime},\alpha\mid g,o)O(o|g^{\prime})$ and $\tilde{c}([s,g],\alpha)=c([s,g],\alpha)+\langle\boldsymbol{\lambda},\boldsymbol{d}([s,g],\alpha)\rangle$. Then, the solution $(\boldsymbol{V}^{*}_{\gamma,\mathcal{M}},\boldsymbol{\lambda}^{*})$ satisfies

J_{\gamma}(\boldsymbol{\iota}_{init})\geq\langle\boldsymbol{\iota}_{init},\boldsymbol{V}^{*}_{\gamma,\mathcal{M}}\rangle-\langle\boldsymbol{\lambda}^{*},\boldsymbol{\beta}\rangle. (26)

Note that since $\rho^{\gamma}$ is a coherent, Markov risk measure (Assumption 1), $v\mapsto\sigma(v,\cdot,\cdot)$ is convex (because $\sigma$ is also a coherent risk measure). In fact, optimization problem (25) is a DCP in the form of (21), where we replace $V_{\gamma}$ with $V_{\gamma,\mathcal{M}}$ and set $f_{0}(\boldsymbol{\lambda})=\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle$, $g_{0}(\boldsymbol{V}_{\gamma,\mathcal{M}})=\langle\boldsymbol{\iota}_{init},\boldsymbol{V}_{\gamma,\mathcal{M}}\rangle$, $f_{1}(V_{\gamma,\mathcal{M}})=V_{\gamma,\mathcal{M}}$, $g_{1}(\boldsymbol{\lambda})=\sum_{\alpha\in Act}p(\alpha\mid g)\tilde{c}([s,g],\alpha)$, and $g_{2}(V_{\gamma,\mathcal{M}})=\gamma\sigma(V_{\gamma,\mathcal{M}},\cdot,\cdot)$.

The above optimization is in standard DCP form because $f_{0}$ and $g_{1}$ are convex (linear) functions of $\boldsymbol{\lambda}$, and $g_{0}$, $f_{1}$, and $g_{2}$ are convex functions of $V_{\gamma,\mathcal{M}}$.

Solving the DCP (25) gives a set of value functions $V_{\gamma,\mathcal{M}}$. In the next section, we discuss how to use the solutions of this DCP in our proposed policy iteration algorithm to sequentially improve the FSC parameters $\omega$.

V-C I-States Improvement

Let $\vec{V}_{\gamma,\mathcal{M}}(g)\in\mathbb{R}^{|\mathcal{S}|}$ denote the vector obtained by stacking $V_{\gamma,\mathcal{M}}([s,g])$ over $s$. We say that an I-state $g$ is improved if the tunable FSC parameters associated with that I-state can be adjusted so that $\vec{V}^{*}_{\gamma,\mathcal{M}}(g)$ increases.

To begin with, we compute the initial I-state by finding the best-valued I-state for the given initial belief, i.e., $\kappa(g_{init})=1$, where

g_{init}=\underset{g\in G}{\operatorname{argmax}}\ \left\langle\boldsymbol{\iota}_{init},\vec{V}_{\gamma,\mathcal{M}}(g)\right\rangle.

After this initialization, we search for FSC parameters $\omega$ that result in an improvement.

I-state Improvement Optimization: Given the value functions $V_{\gamma,\mathcal{M}}([s,g])$ for all $s\in\mathcal{S}$ and $g\in G$ and the Lagrangian parameters $\boldsymbol{\lambda}$, for every I-state $g$ we can find FSC parameters $\omega$ that result in an improvement by solving the following optimization

\underset{\epsilon>0,\ \omega(g^{\prime},\alpha|g,o)}{\max}\ \ \epsilon
\text{subject to}
\text{Improvement Constraint:}
V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\text{r.h.s. of (25)},\quad\forall s\in\mathcal{S},
\text{Probability Constraints:}
\sum_{(g^{\prime},\alpha)\in G\times Act}\omega(g^{\prime},\alpha\mid g,o)=1,\quad\forall o\in\mathcal{O},
\omega(g^{\prime},\alpha\mid g,o)\geq 0,\quad\forall g^{\prime}\in G,\ \alpha\in Act,\ o\in\mathcal{O}. (27)

Note that the above optimization searches for $\omega$ values that improve the I-state value vector $\vec{V}^{*}_{\gamma,\mathcal{M}}(g)$ by maximizing the auxiliary decision variable $\epsilon$.

Optimization problem (27) is in general non-convex. This can be inferred from the fact that, although the first term on the r.h.s. of (25) is linear in $\omega$, the convexity or concavity of the $\sigma$ term in $\omega$ is not clear for a general coherent risk measure. Fortunately, we can prove the following result.

Proposition 2.

Let $\boldsymbol{V}_{\gamma,\mathcal{M}}$ and $\boldsymbol{\lambda}$ be given. Then, the I-State Improvement Optimization (27) is a linear program for the conditional expectation and CVaR risk measures. Furthermore, (27) is a convex optimization for the EVaR risk measure.

Proof.

We present the different forms of the Improvement Constraint in (27) for the different risk measures. Note that the rest of the constraints and the cost function are linear in the decision variables $\epsilon$ and $\omega$. The Improvement Constraint in (27) is linear in $\epsilon$; however, its convexity or concavity in $\omega$ depends on the risk measure under consideration. Recall from the previous section that in the Policy Evaluation step the quantities $\boldsymbol{V}_{\gamma,\mathcal{M}}$ and $\boldsymbol{\lambda}\succeq\boldsymbol{0}$ (for the conditional expectation, CVaR, and EVaR measures) and $\zeta$ (for the CVaR and EVaR measures) are calculated and are therefore fixed here.

For conditional expectation, the Improvement Constraint alters to

Vγ,([s,g])+ϵαActp(αg)c~([s,g],α)+γs𝒮,g𝒢Vγ,([s,g])T([s,g]|[s,g]),s𝒮,gG.V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha\in Act}p(\alpha\mid g)\tilde{c}([s,g],\alpha)\\ +\gamma\sum_{s^{\prime}\in\mathcal{S},g^{\prime}\in\mathcal{G}}V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])T^{\mathcal{M}}\left([s^{\prime},g^{\prime}]\left|[s,g]\right.\right),\\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall s\in\mathcal{S},\leavevmode\nobreak\ \forall g\in G. (28)

Substituting the expression for TT^{\mathcal{M}}, i.e.,

T([s,g]|[s,g])=o𝒪αActO(o|s)ω(g,α|g,o)T(s|s,α),T^{\mathcal{M}}\left([s^{\prime},g^{\prime}]\left|[s,g]\right.\right)=\sum_{o\in\mathcal{O}}\sum_{\alpha\in Act}O(o|s)\omega(g^{\prime},\alpha|g,o)T(s^{\prime}|s,\alpha),

and for $p(\alpha\mid g)$, i.e.,

p(\alpha\mid g)=\sum_{g^{\prime}\in G,\,o\in\mathcal{O}}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime}),

we obtain

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\gamma\sum_{s^{\prime},g^{\prime},o,\alpha}V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])\,O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha),\quad\forall s\in\mathcal{S},\ \forall g\in G. (29)

The above expression is linear in $\omega$ as well as in $\epsilon$. Hence, the I-State Improvement Optimization becomes a linear program for the conditional expectation risk measure.
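To make this linearity concrete, the coefficient multiplying each entry $\omega(g^{\prime},\alpha\mid g,o)$ in the right-hand side of (29) can be assembled offline. The following is a minimal NumPy sketch with hypothetical array shapes and random data (not the paper's setup); it follows (29) verbatim, including the $O(o|g^{\prime})$ factor, which indexes correctly here only because the toy example has fewer I-states than states.

```python
import numpy as np

# Hypothetical problem data (shapes are assumptions): O[o, s] = O(o|s),
# T[s2, s, a] = T(s'|s, alpha), V[s2, g2] = V_{gamma,M}([s', g']),
# c_tilde[s, g, a] = Lagrangian stage cost c~([s, g], alpha).
nS, nG, nA, nO = 4, 2, 3, 4
rng = np.random.default_rng(0)
O = rng.dirichlet(np.ones(nO), size=nS).T                         # (nO, nS)
T = rng.dirichlet(np.ones(nS), size=(nS, nA)).transpose(2, 0, 1)  # (nS, nS, nA)
V = rng.random((nS, nG))
c_tilde = rng.random((nS, nG, nA))
gamma = 0.95

def improvement_coeffs(s, g):
    """Coefficient of omega(g', a | g, o) in the r.h.s. of (29) at state [s, g];
    the constraint then reads V[s, g] + eps <= sum(coef * omega)."""
    coef = np.zeros((nG, nA, nO))
    for g2 in range(nG):
        for a in range(nA):
            for o in range(nO):
                cost_term = O[o, g2] * c_tilde[s, g, a]      # O(o|g') c~([s,g],a)
                future = gamma * O[o, s] * (V[:, g2] @ T[:, s, a])
                coef[g2, a, o] = cost_term + future
    return coef
```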

Based on a similar construction, for the CVaR measure the Improvement Constraint becomes

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\gamma\left\{\zeta+\frac{1}{\varepsilon}\sum_{g^{\prime},s^{\prime}}\left(V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])-\zeta\right)_{+}T^{\mathcal{M}}([s^{\prime},g^{\prime}]\mid[s,g])\right\},\quad\forall s\in\mathcal{S},\ \forall g\in G. (30)

After substituting the expression for $T^{\mathcal{M}}$, we obtain

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\gamma\left\{\zeta+\frac{1}{\varepsilon}\sum_{g^{\prime},s^{\prime},o,\alpha}\left(V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])-\zeta\right)_{+}O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha)\right\},\quad\forall s\in\mathcal{S},\ \forall g\in G. (31)
\underset{\epsilon>0,\ \omega(g^{\prime},\alpha|g,o)}{\max}\ \ \langle\boldsymbol{\iota}_{init},V_{\gamma,\mathcal{M}}\rangle-\langle\lambda,\beta\rangle+\epsilon
subject to
Improvement Constraint:
V_{\gamma,\mathcal{M}}([s,g])+\epsilon-\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)-\gamma\left\{\zeta+\frac{1}{\varepsilon}\sum_{g^{\prime},s^{\prime},o,\alpha}\left(V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])-\zeta\right)_{+}O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha)\right\}\leq 0,\quad\forall s\in\mathcal{S},\ \forall g\in G, (32a)
Probability Constraints:
\sum_{(g^{\prime},\alpha)\in G\times Act}\omega(g^{\prime},\alpha\mid g,o)=1,\quad\forall o\in\mathcal{O},
\omega(g^{\prime},\alpha\mid g,o)\geq 0,\quad\forall g^{\prime}\in G,\ \alpha\in Act,\ o\in\mathcal{O}. (32b)

 

Furthermore, for fixed $V_{\gamma,\mathcal{M}}$, $\lambda$, and $\zeta$, the above inequality is linear in $\omega$ and $\epsilon$. Hence, (31) becomes a linear constraint, rendering (V-C) a linear program (maximizing a linear objective subject to linear constraints), i.e., optimization problem (32).
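For a single I-state $g$, the linear program (32) can be written down almost verbatim in CVXPY. The following is a minimal sketch with hypothetical data (the array shapes, random values, and fixed $\zeta$ are assumptions for illustration, not the authors' implementation); it only illustrates the structure asserted in Proposition 2.

```python
import cvxpy as cp
import numpy as np

# Hypothetical fixed data for one I-state g (shapes/values are assumptions).
nS, nG, nA, nO = 4, 2, 3, 4
rng = np.random.default_rng(1)
O = rng.dirichlet(np.ones(nO), size=nS).T                         # O[o, s]
T = rng.dirichlet(np.ones(nS), size=(nS, nA)).transpose(2, 0, 1)  # T[s', s, a]
V = rng.random((nS, nG))            # fixed value function V_{gamma,M}([s, g])
c_tilde = rng.random((nS, nG, nA))  # fixed Lagrangian stage cost
gamma, eps_cvar, zeta, g = 0.95, 0.2, float(np.quantile(V, 0.8)), 0

omega = cp.Variable((nO, nG * nA), nonneg=True)  # omega[o, g'*nA + a]
epsilon = cp.Variable(nonneg=True)
constraints = [cp.sum(omega[o, :]) == 1 for o in range(nO)]

excess = np.maximum(V - zeta, 0.0)               # (V([s', g']) - zeta)_+
for s in range(nS):
    rhs = gamma * zeta
    for o in range(nO):
        for g2 in range(nG):
            for a in range(nA):
                cost = O[o, g2] * c_tilde[s, g, a]
                tail = O[o, s] * float(excess[:, g2] @ T[:, s, a])
                rhs = rhs + (cost + gamma * tail / eps_cvar) * omega[o, g2 * nA + a]
    constraints.append(V[s, g] + epsilon <= rhs)  # constraint (31), affine in omega

prob = cp.Problem(cp.Maximize(epsilon), constraints)
prob.solve()  # a linear program, as stated in Proposition 2
```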

For the EVaR measure, the Improvement Constraint is given by

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\gamma\left\{\frac{1}{\zeta}\log\!\left(\frac{\sum_{g^{\prime},s^{\prime}}e^{\zeta V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])}\,T^{\mathcal{M}}([s^{\prime},g^{\prime}]\mid[s,g])}{\varepsilon}\right)\right\},\quad\forall s\in\mathcal{S},\ \forall g\in G. (33)

Substituting the expression for $T^{\mathcal{M}}$, i.e.,

T^{\mathcal{M}}\left([s^{\prime},g^{\prime}]\mid[s,g]\right)=\sum_{o\in\mathcal{O}}\sum_{\alpha\in Act}O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha),

we obtain

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\frac{\gamma}{\zeta}\log\!\left(\frac{\sum_{g^{\prime},s^{\prime},o,\alpha}e^{\zeta V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])}\,O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha)}{\varepsilon}\right),\quad\forall s\in\mathcal{S},\ \forall g\in G, (34)
\underset{\epsilon>0,\ \omega(g^{\prime},\alpha|g,o)}{\max}\ \ \langle\boldsymbol{\iota}_{init},V_{\gamma,\mathcal{M}}\rangle-\langle\lambda,\beta\rangle+\epsilon
subject to
Improvement Constraint:
V_{\gamma,\mathcal{M}}([s,g])+\epsilon-\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)-\gamma\left\{\frac{1}{\zeta}\log\!\left(\frac{\sum_{g^{\prime},s^{\prime},o,\alpha}e^{\zeta V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])}\,O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha)}{\varepsilon}\right)\right\}\leq 0,\quad\forall s\in\mathcal{S},\ \forall g\in G, (35a)
Probability Constraints:
\sum_{(g^{\prime},\alpha)\in G\times Act}\omega(g^{\prime},\alpha\mid g,o)=1,\quad\forall o\in\mathcal{O},
\omega(g^{\prime},\alpha\mid g,o)\geq 0,\quad\forall g^{\prime}\in G,\ \alpha\in Act,\ o\in\mathcal{O}. (35b)

 

In the above inequality, the first term on the right-hand side is linear in $\omega$, and the second term on the right-hand side (the logarithm term) is concave in $\omega$ (convex when moved to the left-hand side, since $-\log(x)$ is convex in $x$). Therefore, (34) becomes a convex constraint, rendering (V-C) a convex optimization problem (maximizing a linear objective subject to linear and convex constraints) for the EVaR measure. That is, the I-State Improvement Optimization takes the convex optimization form of (35). ∎
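Analogously, a minimal CVXPY sketch of the EVaR case (35) for one I-state, under the same hypothetical data conventions as the CVaR sketch above, uses `cp.log` applied to an affine expression of $\omega$, which CVXPY recognizes as concave, so the problem is accepted as convex; this is an illustration of Proposition 2, not the authors' implementation.

```python
import cvxpy as cp
import numpy as np

# Hypothetical fixed data for one I-state g (shapes/values are assumptions).
nS, nG, nA, nO = 4, 2, 3, 4
rng = np.random.default_rng(2)
O = rng.dirichlet(np.ones(nO), size=nS).T                         # O[o, s]
T = rng.dirichlet(np.ones(nS), size=(nS, nA)).transpose(2, 0, 1)  # T[s', s, a]
V = rng.random((nS, nG))
c_tilde = rng.random((nS, nG, nA))
gamma, eps_evar, zeta, g = 0.95, 0.2, 1.0, 0

omega = cp.Variable((nO, nG * nA), nonneg=True)  # omega[o, g'*nA + a]
epsilon = cp.Variable(nonneg=True)
constraints = [cp.sum(omega[o, :]) == 1 for o in range(nO)]

expV = np.exp(zeta * V)                          # e^{zeta V([s', g'])}, fixed
for s in range(nS):
    cost = 0
    arg = 0                                      # affine argument of the log
    for o in range(nO):
        for g2 in range(nG):
            for a in range(nA):
                w = omega[o, g2 * nA + a]
                cost = cost + O[o, g2] * c_tilde[s, g, a] * w
                arg = arg + float(expV[:, g2] @ T[:, s, a]) * O[o, s] * w
    # Constraint (34): the log of an affine expression is concave in omega.
    constraints.append(V[s, g] + epsilon
                       <= cost + (gamma / zeta) * cp.log(arg / eps_evar))

prob = cp.Problem(cp.Maximize(epsilon), constraints)
prob.solve()  # a convex program, as stated in Proposition 2
```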

If no improvement is achieved by optimization (V-C), i.e., $\epsilon=0$, for a fixed number of internal states $|G|$, we can increase $|G|$ by one, following the bounded policy iteration method proposed in [3, Section V.B].

V-D Policy Iteration Algorithm

Algorithm 1 outlines the main steps of the proposed policy iteration method for constrained risk-averse FSC synthesis. The algorithm has two distinct parts. First, for fixed FSC parameters $\omega$, policy evaluation is carried out, in which $V_{\gamma,\mathcal{M}}([s,g])$ and $\lambda$ are computed using DCP (V-B) (Steps 2, 10, and 18). Second, after evaluating the current value functions and Lagrange multipliers, an improvement is carried out, either by changing the parameters of existing I-states via optimization (V-C) or, if no new parameters can improve any I-state, by adding a fixed number of I-states to escape the local minimum (Steps 14-17), based on the method proposed in [3, Section V.B]. A schematic sketch of this loop is given after the listing.

Algorithm 1 Policy Iteration For Synthesizing Constrained Risk-Averse FSC
Require: (a) An initial feasible FSC, $\mathcal{G}$. (b) Maximum size of the FSC, $N_{max}$. (c) Number of I-states to add, $N_{new}\leq N_{max}$.
1: $improved \leftarrow True$
2: Compute the value vectors $\vec{V}_{\gamma,\mathcal{M}}$ and Lagrange multipliers $\boldsymbol{\lambda}$ based on DCP (V-B).
3: while $|G|\leq N_{max}$ and $improved=True$ do
4:   $improved \leftarrow False$
5:   for all I-states $g\in G$ do
6:     Solve the I-State Improvement Optimization (V-C).
7:     if the I-State Improvement Optimization results in $\epsilon>0$ then
8:       Replace the parameters $\omega$ for I-state $g$.
9:       $improved \leftarrow True$
10:      Compute the value vectors $\vec{V}_{\gamma,\mathcal{M}}$ and Lagrange multipliers $\boldsymbol{\lambda}$ based on optimization (V-B).
11:  if $improved=False$ and $|G|<N_{max}$ then
12:    $n_{added} \leftarrow 0$
13:    $N^{\prime}_{new} \leftarrow \min(N_{new},\,N_{max}-|G|)$
14:    Try to add $N^{\prime}_{new}$ I-state(s) to $\mathcal{G}$.
15:    $n_{added} \leftarrow$ actual number of I-states added in the previous step.
16:    if $n_{added}>0$ then
17:      $improved \leftarrow True$
18:      Compute the value vectors $\vec{V}_{\gamma,\mathcal{M}}$ and Lagrange multipliers $\boldsymbol{\lambda}$ based on optimization (V-B).
Output: $\mathcal{G}$
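The following Python skeleton mirrors the structure of Algorithm 1. It is a schematic sketch, not the authors' implementation; the helpers `evaluate` (solves DCP (V-B)), `improve_istate` (solves (V-C)), and `add_istates` are hypothetical interfaces assumed for illustration.

```python
# Schematic policy iteration loop for constrained risk-averse FSC synthesis.
# All helper callables below are hypothetical placeholders:
#   evaluate(fsc)                 -> (value vectors V, Lagrange multipliers lam)
#   improve_istate(fsc, g, V, lam)-> (epsilon, new omega parameters for I-state g)
#   add_istates(fsc, k)           -> number of I-states actually added
def policy_iteration(fsc, evaluate, improve_istate, add_istates, n_max, n_new):
    improved = True
    V, lam = evaluate(fsc)                      # Step 2: solve DCP (V-B)
    while fsc.num_istates <= n_max and improved:
        improved = False
        for g in range(fsc.num_istates):        # Steps 5-10
            eps, params = improve_istate(fsc, g, V, lam)
            if eps > 0:
                fsc.set_params(g, params)       # replace omega for I-state g
                improved = True
                V, lam = evaluate(fsc)
        if not improved and fsc.num_istates < n_max:   # Steps 11-18
            k = min(n_new, n_max - fsc.num_istates)
            if add_istates(fsc, k) > 0:         # escape the local optimum
                improved = True
                V, lam = evaluate(fsc)
    return fsc
```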

VI Numerical Experiments

In this section, we evaluate the proposed methodology with numerical experiments. In addition to the traditional total expectation, we consider two other coherent risk measures, namely, CVaR and EVaR. All experiments were carried out on a MacBook Pro with a 2.8 GHz Quad-Core Intel Core i5 and 16 GB of RAM. The resulting linear programs and DCPs were solved using CVXPY [18] with the DCCP [43] add-on in Python.
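The DCCP add-on extends CVXPY to problems whose objective and constraints are differences of convex functions and solves them via the convex-concave procedure. The snippet below is a toy example (unrelated to the paper's DCPs) shown only to illustrate the calling convention relied on throughout the experiments.

```python
import cvxpy as cp
import dccp  # registers the "dccp" solve method with CVXPY

# Toy DCCP: maximize the distance between two points confined to the unit box.
# Maximizing a convex function is non-convex but is a valid disciplined
# convex-concave program, handled by the convex-concave procedure.
x = cp.Variable(2)
y = cp.Variable(2)
prob = cp.Problem(cp.Maximize(cp.norm(x - y, 2)),
                  [0 <= x, x <= 1, 0 <= y, y <= 1])
prob.solve(method="dccp")
print(x.value, y.value)  # approximately opposite corners of the box
```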

VI-A Rover MDP Example Set Up

Figure 2: Grid world illustration for the rover navigation example. Blue cells denote the obstacles and the yellow cell denotes the goal.

An agent (e.g., a rover) must autonomously navigate a 2-dimensional terrain map (e.g., the Mars surface) represented by an $M\times N$ grid with $0.25MN$ obstacles. The state space is given by $\mathcal{S}=\{s_{i}\mid i=x+My,\ x\in\{1,\dots,M\},\ y\in\{1,\dots,N\}\}$. The action set available to the robot is $Act=\{E, W, N, S, NE, NW, SE, SW\}$. The state transition probabilities for various cell types are shown for action $E$ in Figure 2, i.e., the agent moves to the cell implied by the action with probability $0.7$, but can also move to any adjacent cell with total probability $0.3$. Partial observability arises because the rover cannot determine the obstacle cell locations directly from measurements. The observation space is $\mathcal{O}=\{o_{i}\mid i=x+My,\ x\in\{1,\dots,M\},\ y\in\{1,\dots,N\}\}$. Once adjacent to an obstacle, the rover identifies the actual obstacle position (dark green) with probability $0.6$, and otherwise observes a distribution over the nearby cells (light green).

Hitting an obstacle incurs an immediate cost of $10$, while the goal grid region has zero immediate cost. Any other grid cell has a cost of $2$ to represent fuel consumption. The discount factor is set to $\gamma=0.95$.
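A minimal sketch of this grid-world MDP is given below. The obstacle layout is random, indexing is 0-based, and the way the residual $0.3$ probability is split among adjacent cells (uniformly) is an assumption made here for illustration.

```python
import numpy as np

M, N, gamma = 10, 10, 0.95
ACTIONS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1),
           "NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1)}
rng = np.random.default_rng(0)
obstacles = set(map(tuple, rng.integers(0, [M, N], size=(M * N // 4, 2))))
goal = (M - 1, N - 1)

def idx(x, y):                      # state index s = x + M*y (0-based here)
    return x + M * y

def clip(x, y):                     # stay inside the grid when a move leaves it
    return min(max(x, 0), M - 1), min(max(y, 0), N - 1)

nS = M * N
T = np.zeros((nS, len(ACTIONS), nS))          # T[s, a, s'] = T(s' | s, a)
for x in range(M):
    for y in range(N):
        s = idx(x, y)
        neighbors = [clip(x + dx, y + dy) for dx, dy in ACTIONS.values()]
        for a, (dx, dy) in enumerate(ACTIONS.values()):
            T[s, a, idx(*clip(x + dx, y + dy))] += 0.7
            for nx, ny in neighbors:          # assumed: uniform 0.3 over neighbors
                T[s, a, idx(nx, ny)] += 0.3 / len(neighbors)

cost = np.full(nS, 2.0)                            # fuel cost everywhere
cost[[idx(x, y) for (x, y) in obstacles]] = 10.0   # hitting an obstacle
cost[idx(*goal)] = 0.0                             # goal cell is free
assert np.allclose(T.sum(axis=2), 1.0)             # rows are distributions
```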

The objective is to compute a safe path that is fuel efficient, i.e., to solve Problem 1. To this end, we consider total expectation, CVaR, and EVaR as the coherent risk measures.

Once a policy is calculated, as a robustness test inspired by [17], we include a set of single-cell obstacles that are each perturbed in a random direction to one of the neighboring grid cells with probability $0.3$, representing uncertainty in the terrain map. For each risk measure, we run $100$ Monte Carlo simulations with the calculated policies and record the failure rate, i.e., the fraction of runs in which a collision occurred.
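A sketch of this Monte Carlo robustness test is given below. The interfaces (`policy`, `T`, a set of obstacle state indices) are hypothetical, and the way an obstacle is shifted to a neighbor is simplified to a state-index shift for illustration.

```python
import numpy as np

def failure_rate(policy, T, obstacle_states, s0, horizon=200, runs=100, seed=0):
    """Fraction of rollouts that hit a (randomly perturbed) obstacle.
    Assumed interfaces: policy[s] is an action index, T[s, a, :] a
    next-state distribution, obstacle_states a set of state indices."""
    rng = np.random.default_rng(seed)
    nS = T.shape[0]
    failures = 0
    for _ in range(runs):
        # Perturb each single-cell obstacle with probability 0.3 (simplified:
        # shift its state index by +-1; a grid-aware shift would be analogous).
        perturbed = {s + rng.choice([-1, 1]) if rng.random() < 0.3 else s
                     for s in obstacle_states}
        s = s0
        for _ in range(horizon):
            s = rng.choice(nS, p=T[s, policy[s], :])
            if s in perturbed:
                failures += 1
                break
    return failures / runs
```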

VI-B MDP Results

To evaluate the technique discussed in Section IV, we assume that there is no partial observation. In our experiments, we consider four grid-world sizes of $10\times 10$, $15\times 15$, $20\times 20$, and $30\times 30$, corresponding to $100$, $225$, $400$, and $900$ states, respectively. For each grid-world, we randomly allocate 25% of the cells to obstacles, including 3, 6, 9, and 12 uncertain (single-cell) obstacles for the $10\times 10$, $15\times 15$, $20\times 20$, and $30\times 30$ grids, respectively. In each case, we solve DCP (1) (a linear program in the case of total expectation) with $|\mathcal{S}||Act|=MN\times 8=8MN$ constraints and $MN+2$ variables (the risk value functions $V_{\gamma}$, the Lagrangian coefficient $\lambda$, and $\zeta$ for CVaR and EVaR). In these experiments, we set $\varepsilon=0.2$ for the CVaR and EVaR coherent risk measures to represent risk-averse policies. The fuel budget (constraint bound $\beta$) was set to 50, 10, 200, and 600 for the $10\times 10$, $15\times 15$, $20\times 20$, and $30\times 30$ grid-worlds, respectively. The initial condition was chosen as $\kappa_{0}(s_{M})=M-1$, i.e., the agent starts at the second left-most cell at the bottom.

A summary of our numerical experiments is provided in Table I. Note that the computed values of Problem 1 satisfy $\mathbb{E}(c)\leq\mathrm{CVaR}_{\varepsilon}(c)\leq\mathrm{EVaR}_{\varepsilon}(c)$, which is consistent with the fact that EVaR is a more conservative coherent risk measure than CVaR [6].
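This ordering can also be checked empirically from cost samples. The snippet below uses synthetic data (an assumption for illustration, not the paper's results) and the standard sample-based estimators of CVaR and EVaR; the one-dimensional search bounds for the EVaR infimum are an assumption as well.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def empirical_cvar(costs, eps):
    """CVaR_eps of the empirical cost distribution: mean of the worst
    eps-fraction of the samples (costs are losses; larger is worse)."""
    k = int(np.ceil(eps * len(costs)))
    return float(np.sort(costs)[-k:].mean())

def empirical_evar(costs, eps):
    """EVaR_eps(c) = inf_{z>0} (1/z) log(E[exp(z c)] / eps), evaluated on the
    empirical distribution by a bounded one-dimensional search over z."""
    def objective(z):
        return (np.log(np.mean(np.exp(z * costs))) - np.log(eps)) / z
    return float(minimize_scalar(objective, bounds=(1e-6, 1.0),
                                 method="bounded").fun)

eps = 0.2
costs = np.random.default_rng(0).gamma(shape=2.0, scale=5.0, size=10_000)
print(costs.mean(), empirical_cvar(costs, eps), empirical_evar(costs, eps))
# The three printed values are nondecreasing: E <= CVaR_eps <= EVaR_eps.
```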

$(M\times N)_{\rho_{t}}$ | $J_{\gamma}(\kappa_{0})$ | Total Time [s] | # U.O. | F.R.
$(10\times 10)_{\mathbb{E}}$ | 9.12 | 0.8 | 3 | 11%
$(15\times 15)_{\mathbb{E}}$ | 12.53 | 0.9 | 6 | 23%
$(20\times 20)_{\mathbb{E}}$ | 19.93 | 1.7 | 9 | 33%
$(30\times 30)_{\mathbb{E}}$ | 27.30 | 2.4 | 12 | 41%
$(10\times 10)_{\text{CVaR}_{0.7}}$ | $\geq 12.04$ | 5.8 | 3 | 8%
$(15\times 15)_{\text{CVaR}_{0.7}}$ | $\geq 14.83$ | 9.3 | 6 | 18%
$(20\times 20)_{\text{CVaR}_{0.7}}$ | $\geq 20.19$ | 10.34 | 9 | 19%
$(30\times 30)_{\text{CVaR}_{0.7}}$ | $\geq 34.95$ | 14.2 | 12 | 32%
$(10\times 10)_{\text{CVaR}_{0.2}}$ | $\geq 14.45$ | 6.2 | 3 | 3%
$(15\times 15)_{\text{CVaR}_{0.2}}$ | $\geq 17.82$ | 9.0 | 6 | 5%
$(20\times 20)_{\text{CVaR}_{0.2}}$ | $\geq 25.63$ | 11.1 | 9 | 13%
$(30\times 30)_{\text{CVaR}_{0.2}}$ | $\geq 44.83$ | 15.25 | 12 | 22%
$(10\times 10)_{\text{EVaR}_{0.7}}$ | $\geq 14.53$ | 4.8 | 3 | 4%
$(15\times 15)_{\text{EVaR}_{0.7}}$ | $\geq 16.36$ | 8.8 | 6 | 11%
$(20\times 20)_{\text{EVaR}_{0.7}}$ | $\geq 29.89$ | 10.5 | 9 | 15%
$(30\times 30)_{\text{EVaR}_{0.7}}$ | $\geq 54.13$ | 14.99 | 12 | 12%
$(10\times 10)_{\text{EVaR}_{0.2}}$ | $\geq 18.03$ | 5.8 | 3 | 1%
$(15\times 15)_{\text{EVaR}_{0.2}}$ | $\geq 21.10$ | 8.7 | 6 | 3%
$(20\times 20)_{\text{EVaR}_{0.2}}$ | $\geq 24.08$ | 10.2 | 9 | 7%
$(30\times 30)_{\text{EVaR}_{0.2}}$ | $\geq 63.04$ | 14.25 | 12 | 10%
TABLE I: Comparison between the total expectation, CVaR, and EVaR coherent risk measures. $(M\times N)_{\rho_{t}}$ denotes an experiment with a grid-world of size $M\times N$ and one-step coherent risk measure $\rho_{t}$. $J_{\gamma}(\kappa_{0})$ is the value of the constrained risk-averse problem (Problem 1). Total Time denotes the time in seconds taken by the CVXPY solver to solve the associated linear programs or DCPs. # U.O. denotes the number of single-cell uncertain obstacles used for the robustness test. F.R. denotes the failure rate out of 100 Monte Carlo simulations with the computed policy.
Figure 3: Results for the MDP example with the total expectation (left), CVaR (middle), and EVaR (right) coherent risk measures. The goal is located at the yellow cell. Notice the 9 single-cell obstacles used for the robustness test.

For the total expectation risk measure, the calculations took significantly less time, since they amount to solving a set of linear programs. For CVaR and EVaR, a set of DCPs was solved. The CVaR calculation was the most computationally involved. This observation is consistent with [7], where it was discussed that EVaR calculations are much more efficient than CVaR calculations. Note that these calculations can be carried out offline for policy synthesis, and the resulting policy can then be applied for risk-averse robot path planning.

The table also reports the failure rate of each risk measure. In this case, EVaR outperformed both CVaR and total expectation in terms of robustness, which is consistent with the fact that EVaR is more conservative. In addition, these results imply that, although the discounted total expectation is a measure of performance over a large number of Monte Carlo simulations, it may not be practical for mission-critical decision making under uncertainty. CVaR, and especially EVaR, appear to be more reliable metrics for performance in planning under uncertainty.

To illustrate the computed policies, Figure 3 depicts the results obtained from solving DCP (1) for a $30\times 30$ grid-world. The arrows on the grid depict the (sub)optimal actions and the heat map indicates the value of Problem 1 at each grid state. Note that the values for EVaR are greater than those for CVaR, and the values for CVaR are greater than those for total expectation. This is in accordance with the theory that $\mathbb{E}(c)\leq\mathrm{CVaR}_{\varepsilon}(c)\leq\mathrm{EVaR}_{\varepsilon}(c)$ [6]. In addition, by inspecting the computed actions in obstacle-dense areas of the grid-world (for example, the middle-right area), we infer that the actions in the more risk-averse cases (especially for EVaR) have a higher tendency to steer the agent away from the obstacles, given the diagonal transition uncertainty depicted in Figure 2; whereas, for total expectation, the actions are merely concerned with reaching the goal.
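A figure of this kind can be produced with a short matplotlib script; the sketch below uses placeholder random values and actions (assumptions for illustration), which would in practice be replaced by the arrays returned from the solved DCP.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data (assumptions): `values` holds the risk value at each cell,
# `policy` the index of one of the eight moves chosen at each cell.
M, N = 30, 30
rng = np.random.default_rng(0)
values = rng.random((N, M)) * 60
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, 1), (1, -1), (-1, -1)]
policy = rng.integers(0, len(MOVES), size=(N, M))

U = np.array([[MOVES[a][0] for a in row] for row in policy])  # x-components
V = np.array([[MOVES[a][1] for a in row] for row in policy])  # y-components
X, Y = np.meshgrid(np.arange(M), np.arange(N))

plt.imshow(values, origin="lower", cmap="viridis")   # heat map of the values
plt.colorbar(label="value of Problem 1")
plt.quiver(X, Y, U, V, color="white", scale=40)      # (sub)optimal actions
plt.title("Computed policy and risk value function (illustrative data)")
plt.savefig("policy_heatmap.png", dpi=150)
```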

VI-C POMDP Results

In our experiments, we consider two grid-world sizes, $10\times 10$ and $20\times 20$, corresponding to $100$ and $400$ states, respectively. For each grid-world, we allocate 25% of the cells to obstacles, including 8 and 16 uncertain (single-cell) obstacles for the $10\times 10$ and $20\times 20$ grids, respectively. In each case, we run Algorithm 1 for risk-averse FSC synthesis with $N_{max}=6$ and a maximum of 100 iterations.

In these experiments, we set the confidence level $\varepsilon=0.15$ for the CVaR and EVaR coherent risk measures. The fuel budget (constraint bound $\beta$) was set to 50 and 200 for the $10\times 10$ and $20\times 20$ grid-worlds, respectively. The initial condition was chosen as $\kappa_{0}(s_{M})=1$, i.e., the agent starts at the right-most cell at the bottom.

A summary of our numerical experiments is provided in Table II. Note that the computed values of Problem 1 satisfy $\mathbb{E}(c)\leq\mathrm{CVaR}_{\varepsilon}(c)\leq\mathrm{EVaR}_{\varepsilon}(c)$ [6].

$(M\times N)_{\rho_{t}}$ | $J_{\gamma}(\iota_{init})$ | AIT [s] | # U.O. | F.R.
$(10\times 10)_{\mathbb{E}}$ | 10.53 | 0.2 | 3 | 15%
$(20\times 20)_{\mathbb{E}}$ | 19.98 | 0.3 | 9 | 37%
$(10\times 10)_{\text{CVaR}_{0.7}}$ | $\geq 11.02$ | 2.9 | 3 | 9%
$(20\times 20)_{\text{CVaR}_{0.7}}$ | $\geq 20.19$ | 7.5 | 9 | 22%
$(10\times 10)_{\text{CVaR}_{0.2}}$ | $\geq 16.53$ | 3.1 | 3 | 4%
$(20\times 20)_{\text{CVaR}_{0.2}}$ | $\geq 24.92$ | 7.6 | 9 | 16%
$(10\times 10)_{\text{EVaR}_{0.7}}$ | $\geq 15.02$ | 3.3 | 3 | 5%
$(20\times 20)_{\text{EVaR}_{0.7}}$ | $\geq 23.42$ | 9.9 | 9 | 11%
$(10\times 10)_{\text{EVaR}_{0.2}}$ | $\geq 19.62$ | 3.9 | 3 | 2%
$(20\times 20)_{\text{EVaR}_{0.2}}$ | $\geq 29.36$ | 9.7 | 9 | 6%
TABLE II: Comparison between the total expectation, CVaR, and EVaR coherent risk measures. $(M\times N)_{\rho_{t}}$ denotes an experiment with a grid-world of size $M\times N$ and one-step coherent risk measure $\rho_{t}$. $J_{\gamma}(\iota_{init})$ is the value of the constrained risk-averse POMDP problem (Problem 1). AIT denotes the average time spent on each iteration of Algorithm 1. # U.O. denotes the number of single-cell uncertain obstacles used for the robustness test. F.R. denotes the failure rate out of 100 Monte Carlo simulations with the computed policy.

For the total expectation risk measure, the calculations took significantly less time, since they amount to solving a set of linear programs. For CVaR and EVaR, a set of DCPs was solved in the Risk Value Function Computation step. In the I-State Improvement step, a set of linear programs was solved for CVaR and a set of convex optimizations for EVaR. Hence, the EVaR calculation was the most computationally involved in this case. Note that these calculations can be carried out offline for policy synthesis, and the resulting policy can then be applied for risk-averse robot path planning.

The table also reports the failure rate of each risk measure. In this case, EVaR outperformed both CVaR and total expectation in terms of robustness, consistent with the fact that EVaR is more conservative. In addition, these results suggest that, although the discounted total expectation is a measure of performance over a large number of Monte Carlo simulations, it may not be practical for real-world planning under uncertainty. CVaR, and especially EVaR, appear to be more reliable metrics for performance in planning under uncertainty.

Figure 4: The evolution of the lower bound and the number of I-states with respect to the number of iterations of Algorithm 1 for the $20\times 20$ grid-world and the EVaR coherent risk measure.
Figure 5: Results for the POMDP example with the total expectation (left), CVaR (middle), and EVaR (right) coherent risk measures. The goal is located at the yellow cell. Notice the 9 single-cell obstacles used for the robustness test.

To illustrate the computed policies, Figure 5 depicts the results obtained by running Algorithm 1 for a $20\times 20$ grid-world. The arrows on the grid depict the (sub)optimal actions and the heat map indicates the value of Problem 1 at each grid state. Note that the values for EVaR are greater than those for CVaR, and the values for CVaR are greater than those for total expectation. This is in accordance with the theory that $\mathbb{E}(c)\leq\mathrm{CVaR}_{\varepsilon}(c)\leq\mathrm{EVaR}_{\varepsilon}(c)$ [6].

Moreover, for the $20\times 20$ grid-world with the EVaR coherent risk measure, Figure 4 depicts the evolution of the number of FSC I-states $|G|$ and of the lower bound on the optimal value of Problem 1, $J_{\gamma}(\iota_{init})$, with respect to the iteration number of Algorithm 1. We can see that, as the number of I-states increases, the lower bound improves.

VII Conclusions and Future Research

We proposed an optimization-based method for designing policies for MDPs and POMDPs with coherent risk measure objectives and constraints. We showed that the corresponding value function optimizations take the form of DCPs. In the case of POMDPs, we proposed a policy iteration method for finding sub-optimal FSCs that lower-bound the value of the constrained risk-averse problem, and we demonstrated that, depending on the coherent risk measure of interest, the policy search can be carried out via a linear program or a convex optimization. Numerical experiments were provided to show the efficacy of our approach. In particular, we showed that considering coherent risk measures leads to significantly lower collision rates in Monte Carlo simulations of navigation problems.

In this work, we focused on discounted infinite-horizon risk-averse problems. Future work will explore other cost criteria [14]. The interested reader is referred to our preliminary results on total-cost risk-averse MDPs [1], wherein Bellman equations for the risk-averse stochastic shortest path problem are derived. Expanding on the latter work, we will also explore high-level mission specifications in terms of temporal logic formulas for risk-averse MDPs and POMDPs [5, 40]. Another avenue for further research concerns receding-horizon motion planning under uncertainty with coherent risk constraints [19, 24], with particular application to robot exploration in unstructured subterranean environments [21] (see also works on receding-horizon path planning where the coherent risk measure appears in the total cost [45, 44] rather than in the collision avoidance constraint).

Acknowledgment

M. Ahmadi acknowledges stimulating discussions with Dr. Masahiro Ono at NASA Jet Propulsion Laboratory and Prof. Marco Pavone at Nvidia Research-Stanford University.

References

  • [1] M. Ahmadi, A. Dixit, J. W. Burdick, and A. D. Ames. Risk-averse stochastic shortest path planning. arXiv preprint arXiv:2103.14727, 2021.
  • [2] M. Ahmadi, N. Jansen, B. Wu, and U. Topcu. Control theory meets POMDPs: A hybrid systems approach. IEEE Transactions on Automatic Control, 2020.
  • [3] M. Ahmadi, M. Ono, M. D. Ingham, R. M. Murray, and A. D. Ames. Risk-averse planning under uncertainty. In 2020 American Control Conference (ACC), pages 3305–3312. IEEE, 2020.
  • [4] M. Ahmadi, U. Rosolia, M. Ingham, R. Murray, and A. Ames. Constrained risk-averse Markov decision processes. In 35th AAAI Conference on Artificial Intelligence, 2021.
  • [5] M. Ahmadi, R. Sharan, and J. W. Burdick. Stochastic finite state control of POMDPs with LTL specifications. arXiv preprint arXiv:2001.07679, 2020.
  • [6] A. Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123, 2012.
  • [7] A. Ahmadi-Javid and M. Fallah-Tafti. Portfolio optimization with entropic value-at-risk. European Journal of Operational Research, 279(1):225–241, 2019.
  • [8] A. Ahmadi-Javid and A. Pichler. An analytical study of norms and Banach spaces induced by the entropic value-at-risk. Mathematics and Financial Economics, 11(4):527–550, 2017.
  • [9] E. Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • [10] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical finance, 9(3):203–228, 1999.
  • [11] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
  • [12] Dimitri P Bertsekas. Dynamic programming and stochastic control. Number 10. Academic Press, 1976.
  • [13] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
  • [14] S. Carpin, Y. Chow, and M. Pavone. Risk aversion in finite Markov decision processes using total cost criteria and average value at risk. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 335–342. IEEE, 2016.
  • [15] Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. In AAAI, pages 1023–1028, 1994.
  • [16] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509–3517, 2014.
  • [17] Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
  • [18] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
  • [19] A. Dixit, M. Ahmadi, and J. W. Burdick. Risk-sensitive motion planning using entropic value-at-risk. In European Control Conference, 2021.
  • [20] D. Du and P. M. Pardalos. Minimax and applications, volume 4. Springer Science & Business Media, 2013.
  • [21] D. D. Fan, K. Otsu, Y. Kubo, A. Dixit, J. Burdick, and A. Agha-Mohammadi. STEP: Stochastic traversability evaluation and planning for safe off-road navigation. arXiv preprint arXiv:2103.02828, 2021.
  • [22] J. Fan and A. Ruszczyński. Process-based risk measures and risk-averse control of discrete-time systems. Mathematical Programming, pages 1–28, 2018.
  • [23] J. Fan and A. Ruszczyński. Risk measurement and risk-averse control of partially observable discrete-time Markov systems. Mathematical Methods of Operations Research, 88(2):161–184, 2018.
  • [24] A. Hakobyan, Gyeong C. Kim, and I. Yang. Risk-aware motion planning and control using CVaR-constrained optimization. IEEE Robotics and Automation Letters, 4(4):3924–3931, 2019.
  • [25] R. Horst and N. V. Thoai. DC programming: overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999.
  • [26] V. Krishnamurthy. Partially observed Markov decision processes. Cambridge University Press, 2016.
  • [27] V. Krishnamurthy and S. Bhatt. Sequential Detection of Market Shocks With Risk-Averse CVaR Social Sensors. IEEE Journal of Selected Topics in Signal Processing, 10(6):1061–1072, 2016.
  • [28] E. L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations research, 14(4):699–719, 1966.
  • [29] H. A. Le Thi, H. M. Le, T. P. Dinh, et al. A dc programming approach for feature selection in support vector machines learning. Advances in Data Analysis and Classification, 2(3):259–278, 2008.
  • [30] T. Lipp and S. Boyd. Variations and extension of the convex–concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
  • [31] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1):5 – 34, 2003.
  • [32] A. Majumdar and M. Pavone. How should a robot assess risk? towards an axiomatic theory of risk in robotics. In Robotics Research, pages 75–84. Springer, 2020.
  • [33] T. Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233–241, 2012.
  • [34] Jonathan Theodor Ott. A Markov decision model for a surveillance application and risk-sensitive Markov decision processes. 2010.
  • [35] G. C. Pflug and A. Pichler. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41(2):682–699, 2016.
  • [36] L. Prashanth. Policy gradients for CVaR-constrained MDPs. In International Conference on Algorithmic Learning Theory, pages 155–169. Springer, 2014.
  • [37] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.
  • [38] R Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of banking & finance, 26(7):1443–1471, 2002.
  • [39] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 2000.
  • [40] U. Rosolia, M. Ahmadi, R. M. Murray, and A. D. Ames. Time-optimal navigation in uncertain environments with high-level specifications. arXiv preprint arXiv:2103.01476, 2021.
  • [41] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.
  • [42] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2014.
  • [43] X. Shen, S. Diamond, Y. Gu, and S. Boyd. Disciplined convex-concave programming. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1009–1014. IEEE, 2016.
  • [44] S. Singh, Y. Chow, A. Majumdar, and M. Pavone. A framework for time-consistent, risk-sensitive model predictive control: Theory and algorithms. IEEE Transactions on Automatic Control, 2018.
  • [45] P. Sopasakis, D. Herceg, A. Bemporad, and P. Patrinos. Risk-averse model predictive control. Automatica, 100:281–288, 2019.
  • [46] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, pages 1468–1476, 2015.
  • [47] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62(7):3323–3338, 2016.
  • [48] J. Thai, T. Hunter, A. K. Akametalu, C. J. Tomlin, and A. M. Bayen. Inverse covariance estimation from data with missing values using the concave-convex procedure. In 53rd IEEE Conference on Decision and Control, pages 5736–5742. IEEE, 2014.
  • [49] D. Vose. Risk analysis: a quantitative guide. John Wiley & Sons, 2008.
  • [50] A. Wang, A. M. Jasour, and B. Williams. Non-Gaussian chance-constrained trajectory planning for autonomous vehicles under agent uncertainty. IEEE Robotics and Automation Letters, 2020.
  • [51] Huan Xu and Shie Mannor. Distributionally robust markov decision processes. In Advances in Neural Information Processing Systems, volume 23, pages 2505–2513, 2010.