
Risk-Averse Decision Making Under Uncertainty

Mohamadreza Ahmadi, Ugo Rosolia, Michel D. Ingham, Richard M. Murray, and Aaron D. Ames M. Ahmadi, U. Rosolia, R. Murray, and A. Ames are with Control and Dynamical Systems (CDS) at the California Institute of Technology, 1200 E. California Blvd., MC 104-44, Pasadena, CA 91125, e-mail: ({mrahmadi,urosolia,murray,ames}@caltech.edu). M. Ingham is with NASA Jet Propulsion Laboratory, 4800 Oak Grove Dr, Pasadena, CA 91109, e-mail: (michel.d.ingham@jpl.nasa.gov).
Abstract

A large class of decision making under uncertainty problems can be described via Markov decision processes (MDPs) or partially observable MDPs (POMDPs), with applications in artificial intelligence and operations research, among others. Traditionally, policy synthesis techniques are proposed such that a total expected cost or reward is minimized or maximized. However, optimality in the total expected cost sense is only reasonable if the system's behavior over a large number of runs is of interest, which has limited the use of such policies in practical mission-critical scenarios, wherein large deviations from the expected behavior may lead to mission failure. In this paper, we consider the problem of designing policies for MDPs and POMDPs with objectives and constraints in terms of dynamic coherent risk measures, which we refer to as the constrained risk-averse problem. Our contributions are fourfold:

  • (i)

    For MDPs, we reformulate the problem into an inf-sup problem via the Lagrangian framework. Under the assumption that the risk objectives and constraints can be represented by a Markov risk transition mapping, we propose an optimization-based method to synthesize Markovian policies;

  • (ii)

    For MDPs, we demonstrate that the formulated optimization problems are in the form of difference convex programs (DCPs) and can be solved by the disciplined convex-concave programming (DCCP) framework. We show that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints;

  • (iii)

    For POMDPs, we show that, if the coherent risk measures can be defined as a Markov risk transition mapping, an infinite-dimensional optimization can be used to design Markovian belief-based policies;

  • (iv)

    For POMDPs with stochastic finite-state controllers (FSCs), we show that the latter optimization simplifies to a (finite-dimensional) DCP and can be solved by the DCCP framework. We incorporate these DCPs in a policy iteration algorithm to design risk-averse FSCs for POMDPs.

We demonstrate the efficacy of the proposed method with numerical experiments involving conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) risk measures.

Index Terms:
Markov Processes, Stochastic systems, Uncertain systems.

I Introduction

Autonomous systems are being increasingly deployed in real-world settings. Hence, the associated risk that stems from unknown and unforeseen circumstances is correspondingly on the rise. This calls for autonomous systems that can make appropriately conservative decisions when faced with uncertainty in their environment and behavior. Mathematically speaking, risk can be quantified in numerous ways, such as chance constraints [50] and distributional robustness [51]. However, applications in autonomy and robotics require more “nuanced assessments of risk” [32]. Artzner et al. [10] characterized a set of natural properties that are desirable for a risk measure, called a coherent risk measure, and coherent risk measures have obtained widespread acceptance in finance and operations research, among other fields.

A popular model for representing sequential decision making under uncertainty is the Markov decision process (MDP) [37]. MDPs with coherent risk objectives were studied in [47, 46], where the authors proposed a sampling-based algorithm for finding saddle point solutions using policy gradient methods. However, [47] requires the risk envelope appearing in the dual representation of the coherent risk measure to be known in an explicit canonical convex programming formulation. While this may be the case for CVaR, mean-semi-deviation, and spectral risk measures [42], such an explicit form is not known for general coherent risk measures, such as EVaR. Furthermore, it is not clear whether the saddle point solutions are a lower bound or an upper bound to the optimal value. Also, policy-gradient based methods require calculating the gradient of the coherent risk measure, which is not available in explicit form in general. For the CVaR measure, MDPs with risk constraints and total expected costs were studied in [36, 16] and locally optimal solutions were found via policy gradients as well. However, this method also leads to saddle point solutions (which cannot be shown to be upper or lower bounds of the optimal value) and cannot be applied to general coherent risk measures. In addition, because the objective and the constraints are in terms of different coherent risk measures, the authors assume there exists a policy that satisfies the CVaR constraint (feasibility assumption), which may not be the case in general. Following in the footsteps of [35], a promising approach based on approximate value iteration was proposed for MDPs with CVaR objectives in [17]. A policy iteration algorithm for finding policies that minimize total coherent risk measures for MDPs was studied in [41], where a computational non-smooth Newton method was also proposed.

When the states of the agent and/or the environment are not directly observable, a partially observable MDP (POMDP) can be used to study decision making under uncertainty introduced by the partial state observability [26, 2]. POMDPs with coherent risk measure objectives were studied in [22, 23]. Despite the elegance of the theory, no computational method was proposed to design policies for general coherent risk measures. In [3], we proposed a method for finding finite-state controllers for POMDPs with objectives defined in terms of coherent risk measures, which takes advantage of convex optimization techniques. However, the method can only be used if the risk transition mapping is affine in the policy.

Summary of Contributions: In this paper, we consider MDPs and POMDPs with both objectives and constraints in terms of coherent risk measures. Our contributions are fourfold:

  • (i)

    For MDPs, we use the Lagrangian framework to reformulate the problem into an inf-sup problem. For Markov risk transition mappings, we propose an optimization-based method to design Markovian policies that lower-bound the constrained risk-averse problem;

  • (ii)

    For MDPs, we show that the optimization problems take the special form of DCPs and can be solved by the DCCP method. We also demonstrate that these results generalize linear programs for constrained MDPs with total discounted expected costs and constraints;

  • (iii)

    For POMDPs, we demonstrate that, if the coherent risk measures can be defined as a Markov risk transition mapping, an infinite-dimensional optimization can be used to design Markovian belief-based policies, which in theory requires infinite memory to synthesize (in accordance with classical POMDP complexity results);

  • (iv)

    For POMDPs with stochastic finite-state controllers (FSCs), we show that the latter optimization converts to a (finite-dimensional) DCP and can be solved by the DCCP framework. We incorporate these DCPs in a policy iteration algorithm to design risk-averse FSCs for POMDPs.

We assess the efficacy of the proposed method with numerical experiments involving conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) risk measures.

Preliminary results on risk-averse MDPs were presented in [4]. This paper, in addition to providing detailed proofs and new numerical analysis in the MDP case, generalizes [4] to partially observable systems (POMDPs) with dynamic coherent risk objectives and constraints.

The rest of the paper is organized as follows. In the next section, we briefly review some notions used in the paper. In Section III, we formulate the problem under study. In Section IV, we present the optimization-based method for designing risk-averse policies for MDPs. In Section V, we describe a policy iteration method for designing finite-memory controllers for risk-averse POMDPs. In Section VI, we illustrate the proposed methodology via numerical experiments. Finally, in Section VII, we conclude the paper and give directions for future research.

Notation: We denote by $\mathbb{R}^{n}$ the $n$-dimensional Euclidean space and by $\mathbb{N}_{\geq 0}$ the set of non-negative integers. Throughout the paper, we use bold font to denote a vector and $(\cdot)^{\top}$ for its transpose, e.g., $\boldsymbol{a}=(a_{1},\ldots,a_{n})^{\top}$, with $n\in\{1,2,\ldots\}$. For a vector $\boldsymbol{a}$, we use $\boldsymbol{a}\succeq(\preceq)\boldsymbol{0}$ to denote element-wise non-negativity (non-positivity) and $\boldsymbol{a}\equiv\boldsymbol{0}$ to indicate that all elements of $\boldsymbol{a}$ are zero. For two vectors $\boldsymbol{a},\boldsymbol{b}\in\mathbb{R}^{n}$, we denote their inner product by $\langle\boldsymbol{a},\boldsymbol{b}\rangle$, i.e., $\langle\boldsymbol{a},\boldsymbol{b}\rangle=\boldsymbol{a}^{\top}\boldsymbol{b}$. For a finite set $\mathcal{A}$, we denote its power set by $2^{\mathcal{A}}$, i.e., the set of all subsets of $\mathcal{A}$. For a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and a constant $p\in[1,\infty)$, $\mathcal{L}_{p}(\Omega,\mathcal{F},\mathbb{P})$ denotes the vector space of real-valued random variables $c$ for which $\mathbb{E}|c|^{p}<\infty$.

II Preliminaries

In this section, we briefly review some notions and definitions used throughout the paper.

II-A Markov Decision Processes

An MDP is a tuple $\mathcal{M}=(\mathcal{S},Act,T,\kappa_{0})$ consisting of a set of states $\mathcal{S}=\{s_{1},\dots,s_{|\mathcal{S}|}\}$ of the autonomous agent(s) and world model, actions $Act=\{\alpha_{1},\dots,\alpha_{|Act|}\}$ available to the agent, a transition function $T(s_{j}|s_{i},\alpha)$, and an initial distribution $\kappa_{0}$ over the states.

This paper considers finite Markov decision processes, where $\mathcal{S}$ and $Act$ are finite sets. For each action, the probability of making a transition from state $s_{i}\in\mathcal{S}$ to state $s_{j}\in\mathcal{S}$ under action $\alpha\in Act$ is given by $T(s_{j}|s_{i},\alpha)$. The probabilistic components of an MDP must satisfy the following:

\begin{cases}\sum_{s\in\mathcal{S}}T(s|s_{i},\alpha)=1,&\forall s_{i}\in\mathcal{S},\ \forall\alpha\in Act,\\ \sum_{s\in\mathcal{S}}\kappa_{0}(s)=1.\end{cases}
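
As a concrete illustration of these conditions, the following minimal Python sketch checks them for a candidate model. The data layout (a dictionary `T` mapping each action to an $|\mathcal{S}|\times|\mathcal{S}|$ row-stochastic matrix and a NumPy vector `kappa0`) is an assumption made purely for illustration.

```python
import numpy as np

def is_valid_mdp(T, kappa0, tol=1e-9):
    """Check the stochasticity conditions above (illustrative helper).

    Assumed layout: T is a dict mapping each action alpha to an |S| x |S|
    matrix with T[alpha][i, j] = T(s_j | s_i, alpha); kappa0 is the initial
    distribution over states.
    """
    # Every row of every transition matrix must sum to one.
    rows_ok = all(np.allclose(M.sum(axis=1), 1.0, atol=tol) for M in T.values())
    # The initial distribution must be a probability vector.
    init_ok = np.isclose(kappa0.sum(), 1.0, atol=tol) and np.all(kappa0 >= -tol)
    return rows_ok and init_ok
```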

II-B Partially Observable MDPs

A POMDP is a tuple $\mathcal{PM}=(\mathcal{M},\mathcal{O},O)$ consisting of an MDP $\mathcal{M}$, observations $\mathcal{O}=\{o_{1},\dots,o_{|\mathcal{O}|}\}$, and an observation model $O(o\mid s)$. We consider finite POMDPs, where $\mathcal{O}$ is a finite set. For each state $s_{i}$, an observation $o\in\mathcal{O}$ is generated independently with probability $O(o|s_{i})$, which satisfies

\sum_{o\in\mathcal{O}}O(o|s)=1,\quad\forall s\in\mathcal{S}.

In POMDPs, the states $s\in\mathcal{S}$ are not directly observable. The belief $b\in\Delta(\mathcal{S})$, i.e., the probability of being in each state, with $\Delta(\mathcal{S})$ being the set of probability distributions over $\mathcal{S}$, can be computed using Bayes' law as follows:

b_{0}(s)=\frac{\kappa_{0}(s)O(o_{0}\mid s)}{\sum_{s^{\prime}\in\mathcal{S}}\kappa_{0}(s^{\prime})O(o_{0}\mid s^{\prime})}, (1)
b_{t}(s)=\frac{O(o_{t}\mid s)\sum_{s^{\prime}\in\mathcal{S}}T(s\mid s^{\prime},\alpha_{t})b_{t-1}(s^{\prime})}{\sum_{s\in\mathcal{S}}O(o_{t}\mid s)\sum_{s^{\prime}\in\mathcal{S}}T(s\mid s^{\prime},\alpha_{t})b_{t-1}(s^{\prime})}, (2)

for all $t\geq 1$.
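
A minimal NumPy sketch of this recursion is given below, assuming (for illustration only) the same array layout as before, i.e., `T[alpha][i, j] = T(s_j | s_i, alpha)` and `O[j, o] = O(o | s_j)`.

```python
import numpy as np

def initial_belief(kappa0, o0, O):
    """Eq. (1): condition the initial distribution on the first observation."""
    unnormalized = kappa0 * O[:, o0]
    return unnormalized / unnormalized.sum()

def belief_update(b_prev, alpha, o, T, O):
    """One step of the Bayes filter in Eq. (2) (illustrative sketch).

    Assumed layout: T[alpha][i, j] = T(s_j | s_i, alpha), O[j, o] = O(o | s_j),
    b_prev is the belief b_{t-1}, and alpha, o are the indices of the executed
    action and the received observation.
    """
    predicted = T[alpha].T @ b_prev        # sum_{s'} T(s | s', alpha) b_{t-1}(s')
    unnormalized = O[:, o] * predicted     # weight by the observation likelihood
    return unnormalized / unnormalized.sum()
```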

II-C Finite State Control of POMDPs

It is well established that designing optimal policies for POMDPs based on the (continuous) belief states requires uncountably infinite memory or internal states [15, 31]. This paper focuses on a particular class of POMDP controllers, namely, FSCs.

A stochastic finite state controller (FSC) for $\mathcal{PM}$ is given by the tuple $\mathcal{G}=(G,\omega,\kappa)$, where $G=\{g_{1},g_{2},\dots,g_{|G|}\}$ is a finite set of internal states (I-states) and $\omega:G\times\mathcal{O}\to\Delta(G\times Act)$ is a function of the internal FSC state $g_{k}$ and observation $o$, such that $\omega(g_{k},o)$ is a probability distribution over $G\times Act$. The next internal state and action pair $(g_{l},\alpha)$ is chosen by independent sampling of $\omega(g_{k},o)$. By abuse of notation, $\omega(g_{l},\alpha|g_{k},o)$ denotes the probability of transitioning to internal FSC state $g_{l}$ and taking action $\alpha$ when the current internal state is $g_{k}$ and observation $o$ is received. Finally, $\kappa:\Delta(\mathcal{S})\to\Delta(G)$ chooses the starting internal FSC state $g_{0}$ by independent sampling of $\kappa(\kappa_{0})$, given the initial distribution $\kappa_{0}$ of $\mathcal{PM}$; $\kappa(g|\kappa_{0})$ denotes the probability of starting the FSC in internal state $g$ when the initial POMDP distribution is $\kappa_{0}$.

II-D Coherent Risk Measures

Consider a probability space $(\Omega,\mathcal{F},\mathbb{P})$, a filtration $\mathcal{F}_{0}\subset\cdots\subset\mathcal{F}_{N}\subset\mathcal{F}$, and an adapted sequence of random variables (stage-wise costs) $c_{t}$, $t=0,\ldots,N$, where $N\in\mathbb{N}_{\geq 0}\cup\{\infty\}$. For $t=0,\ldots,N$, we further define the spaces $\mathcal{C}_{t}=\mathcal{L}_{p}(\Omega,\mathcal{F}_{t},\mathbb{P})$, $p\in[1,\infty)$, $\mathcal{C}_{t:N}=\mathcal{C}_{t}\times\cdots\times\mathcal{C}_{N}$, and $\mathcal{C}=\mathcal{C}_{0}\times\mathcal{C}_{1}\times\cdots$. We assume that the sequence $\boldsymbol{c}\in\mathcal{C}$ is almost surely bounded (with exceptions having probability zero), i.e., $\max_{t}\operatorname{ess\,sup}|c_{t}(\omega)|<\infty$.

In order to describe how one can evaluate the risk of the sub-sequence $c_{t},\ldots,c_{N}$ from the perspective of stage $t$, we require the following definitions.

Definition 1 (Conditional Risk Measure).

A mapping $\rho_{t:N}:\mathcal{C}_{t:N}\to\mathcal{C}_{t}$, where $0\leq t\leq N$, is called a conditional risk measure if it has the following monotonicity property:

\rho_{t:N}(\boldsymbol{c})\leq\rho_{t:N}(\boldsymbol{c}^{\prime}),\quad\forall\boldsymbol{c},\boldsymbol{c}^{\prime}\in\mathcal{C}_{t:N}\ \text{such that}\ \boldsymbol{c}\preceq\boldsymbol{c}^{\prime}.
Definition 2 (Dynamic Risk Measure).

A dynamic risk measure is a sequence of conditional risk measures $\rho_{t:N}:\mathcal{C}_{t:N}\to\mathcal{C}_{t}$, $t=0,\ldots,N$.

One fundamental property of dynamic risk measures is their consistency over time [41, Definition 3]. That is, if $c$ will be as good as $c^{\prime}$ from the perspective of some future time $\theta$, and they are identical between time $\tau$ and $\theta$, then $c$ should not be worse than $c^{\prime}$ from the perspective of time $\tau$.

In this paper, we focus on time-consistent, coherent risk measures, which satisfy the four mathematical properties defined below [42, p. 298].

Definition 3 (Coherent Risk Measure).

We call the one-step conditional risk measures $\rho_{t}:\mathcal{C}_{t+1}\to\mathcal{C}_{t}$, $t=1,\ldots,N-1$, coherent risk measures if they satisfy the following conditions:

  • Convexity: $\rho_{t}(\lambda c+(1-\lambda)c^{\prime})\leq\lambda\rho_{t}(c)+(1-\lambda)\rho_{t}(c^{\prime})$, for all $\lambda\in(0,1)$ and all $c,c^{\prime}\in\mathcal{C}_{t+1}$;

  • Monotonicity: If $c\leq c^{\prime}$ then $\rho_{t}(c)\leq\rho_{t}(c^{\prime})$ for all $c,c^{\prime}\in\mathcal{C}_{t+1}$;

  • Translational Invariance: $\rho_{t}(c+c^{\prime})=c+\rho_{t}(c^{\prime})$ for all $c\in\mathcal{C}_{t}$ and $c^{\prime}\in\mathcal{C}_{t+1}$;

  • Positive Homogeneity: $\rho_{t}(\beta c)=\beta\rho_{t}(c)$ for all $c\in\mathcal{C}_{t+1}$ and $\beta\geq 0$.

We are interested in discounted infinite-horizon problems. Let $\gamma\in(0,1)$ be a given discount factor. For $t=0,1,\ldots$, we define the functional

\rho^{\gamma}_{0,t}(c_{0},\ldots,c_{t})=\rho_{0,t}(c_{0},\gamma c_{1},\ldots,\gamma^{t}c_{t})=\rho_{0}\Big(c_{0}+\rho_{1}\big(\gamma c_{1}+\rho_{2}(\gamma^{2}c_{2}+\cdots+\rho_{t-1}(\gamma^{t-1}c_{t-1}+\rho_{t}(\gamma^{t}c_{t}))\cdots)\big)\Big). (3)

Finally, we have the total discounted risk functional $\rho^{\gamma}:\mathcal{C}\to\mathbb{R}$ defined as

\rho^{\gamma}(\boldsymbol{c})=\lim_{t\to\infty}\rho^{\gamma}_{0,t}(c_{0},\ldots,c_{t}). (4)

From [41, Theorem 3], we have that $\rho^{\gamma}$ is convex, monotone, and positive homogeneous.

II-E Examples of Coherent Risk Measures

Next, we briefly review three examples of coherent risk measures that will be used in this paper.

Total Conditional Expectation: The simplest risk measure is the total conditional expectation, given by

\rho_{t}(c_{t+1})=\mathbb{E}\left[c_{t+1}\mid\mathcal{F}_{t}\right]. (5)

It is easy to see that the total conditional expectation satisfies the properties of a coherent risk measure as outlined in Definition 3. Unfortunately, the total conditional expectation is agnostic to realization fluctuations of the random variable $c$ and is only concerned with the mean value of $c$ over a large number of realizations. Thus, it is a risk-neutral measure of performance.

Conditional Value-at-Risk: Let $c\in\mathcal{C}$ be a random variable. For a given confidence level $\varepsilon\in(0,1)$, value-at-risk ($\mathrm{VaR}_{\varepsilon}$) denotes the $(1-\varepsilon)$-quantile value of the random variable $c\in\mathcal{C}$. Unfortunately, working with VaR for non-normal random variables is numerically unstable and optimizing models involving VaR is intractable in high dimensions [39].

In contrast, CVaR overcomes the shortcomings of VaR. CVaR with confidence level $\varepsilon\in(0,1)$, denoted $\mathrm{CVaR}_{\varepsilon}$, measures the expected loss in the $(1-\varepsilon)$-tail given that the particular threshold $\mathrm{VaR}_{\varepsilon}$ has been crossed, i.e., $\mathrm{CVaR}_{\varepsilon}(c)=\mathbb{E}\left[c\mid c\geq\mathrm{VaR}_{\varepsilon}(c)\right]$. An optimization formulation for CVaR was proposed in [39]. That is, $\mathrm{CVaR}_{\varepsilon}$ is given by

\rho_{t}(c_{t+1})=\mathrm{CVaR}_{\varepsilon}(c_{t+1}):=\inf_{\zeta\in\mathbb{R}}\left(\zeta+\frac{1}{\varepsilon}\mathbb{E}\left[(c_{t+1}-\zeta)_{+}\mid\mathcal{F}_{t}\right]\right), (6)

where $(\cdot)_{+}=\max\{\cdot,0\}$. A value of $\varepsilon\to 1$ corresponds to the risk-neutral case, i.e., $\mathrm{CVaR}_{1}(c)=\mathbb{E}(c)$; whereas a value of $\varepsilon\to 0$ corresponds to the most risk-averse case, i.e., $\mathrm{CVaR}_{0}(c)=\mathrm{VaR}_{0}(c)=\operatorname{ess\,sup}(c)$ [38]. Figure 1 illustrates these notions for an example random variable $c$ with distribution $p(c)$.

Figure 1: Comparison of the mean, VaR, and CVaR for a given confidence level $\varepsilon\in(0,1)$. The axes denote the values of the stochastic variable $c$ and its probability density function $p(c)$. The shaded area denotes the $\varepsilon$ fraction of the area under $p(c)$. The expected cost $\mathbb{E}(c)$ is much smaller than the worst-case cost. VaR gives the value of $c$ at the $(1-\varepsilon)$-tail of the distribution, but it ignores the values of $c$ with probability below $\varepsilon$. CVaR is the average of the worst-case values of $c$ in the $(1-\varepsilon)$-tail of the distribution.

Entropic Value-at-Risk: Unfortunately, CVaR ignores the losses below the VaR threshold. EVaR is the tightest upper bound in the sense of the Chernoff inequality for VaR and CVaR, and its dual representation is associated with the relative entropy. In fact, it was shown in [8] that $\mathrm{EVaR}_{\varepsilon}$ and $\mathrm{CVaR}_{\varepsilon}$ are equal only if there are no losses ($c\to-\infty$) below the $\mathrm{VaR}_{\varepsilon}$ threshold. In addition, EVaR is a strictly monotone risk measure, whereas CVaR is only monotone [7]. $\mathrm{EVaR}_{\varepsilon}$ is given by

\rho_{t}(c_{t+1})=\inf_{\zeta>0}\left(\log\left(\frac{\mathbb{E}[e^{\zeta c_{t+1}}\mid\mathcal{F}_{t}]}{\varepsilon}\right)/\zeta\right). (7)

Similar to $\mathrm{CVaR}_{\varepsilon}$, for $\mathrm{EVaR}_{\varepsilon}$ a value of $\varepsilon\to 1$ corresponds to the risk-neutral case, whereas $\varepsilon\to 0$ corresponds to the most risk-averse case. In fact, it was demonstrated in [6, Proposition 3.2] that $\lim_{\varepsilon\to 0}\mathrm{EVaR}_{\varepsilon}(c)=\operatorname{ess\,sup}(c)$.
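
The variational forms (6) and (7) can be evaluated numerically by a one-dimensional minimization over $\zeta$. The following Python sketch does so for a discrete cost distribution; the bracketing intervals for $\zeta$ are illustrative choices, not prescribed by the formulas.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def cvar(c, p, eps):
    """CVaR_eps of a discrete cost via the variational form (6).

    c: vector of cost realizations, p: corresponding pmf, eps: confidence level.
    """
    objective = lambda zeta: zeta + np.dot(p, np.maximum(c - zeta, 0.0)) / eps
    return minimize_scalar(objective, bounds=(c.min(), c.max()), method="bounded").fun

def evar(c, p, eps, zeta_max=10.0):
    """EVaR_eps of a discrete cost via (7); zeta > 0 is the dual variable.

    zeta_max is an illustrative upper bound on the search interval.
    """
    objective = lambda zeta: np.log(np.dot(p, np.exp(zeta * c)) / eps) / zeta
    return minimize_scalar(objective, bounds=(1e-8, zeta_max), method="bounded").fun

# Example: a cost that is 10 with probability 0.05 and 1 otherwise.
c = np.array([1.0, 10.0]); p = np.array([0.95, 0.05])
print(cvar(c, p, 0.1), evar(c, p, 0.1))   # both exceed the mean E[c] = 1.45
```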

III Problem Formulation

Over the past two decades, coherent and dynamic risk measures have been developed and used in microeconomics and mathematical finance [49]. Generally speaking, risk-averse decision making is concerned with the behavior of agents, e.g., consumers and investors, who, when exposed to uncertainty, attempt to lower that uncertainty. Such agents may avoid situations with unknown payoffs in favor of situations with payoffs that are more predictable.

The core idea in risk-averse planning is to replace the conventional risk-neutral conditional expectation of the cumulative cost objective with more general coherent risk measures. In path planning scenarios in particular, our numerical experiments show that considering coherent risk measures leads to significantly more robustness to environment uncertainty and to collisions that lead to mission failure.

In addition to risk-aversion in the total cost, an agent is often subject to constraints, e.g., fuel, communication, or energy budgets [27]. These constraints can also represent mission objectives, e.g., exploring an area or reaching a goal.

Consider a stationary controlled Markov process $\{q_{t}\}$, $t=0,1,\ldots$ (an MDP or a POMDP) with initial probability distribution $\kappa_{0}$, wherein policies, transition probabilities, and cost functions do not depend explicitly on time. Each policy $\pi=\{\pi_{t}\}_{t=0}^{\infty}$ leads to cost sequences $\boldsymbol{c}_{t}=c(q_{t},\alpha_{t})$, $t=0,1,\ldots$, and $\boldsymbol{d}_{t}^{i}=d^{i}(q_{t},\alpha_{t})$, $t=0,1,\ldots$, $i=1,2,\ldots,n_{c}$. We define the dynamic risk of evaluating the $\gamma$-discounted cost of a policy $\pi$ as

J_{\gamma}(\kappa_{0},\pi)=\rho^{\gamma}\big(c(q_{0},\alpha_{0}),c(q_{1},\alpha_{1}),\ldots\big), (8)

and the $\gamma$-discounted dynamic risk constraints of executing policy $\pi$ as

D_{\gamma}^{i}(\kappa_{0},\pi)=\rho^{\gamma}\left(d^{i}(q_{0},\alpha_{0}),d^{i}(q_{1},\alpha_{1}),\ldots\right)\leq\beta^{i},\quad i=1,2,\ldots,n_{c}, (9)

where $\rho^{\gamma}$ is defined in equation (4), $q_{0}\sim\kappa_{0}$, and $\beta^{i}>0$, $i=1,2,\ldots,n_{c}$, are given constants. We assume that $c(\cdot,\cdot)$ and $d^{i}(\cdot,\cdot)$, $i=1,2,\ldots,n_{c}$, are non-negative and upper-bounded. For a discount factor $\gamma\in(0,1)$, an initial condition $\kappa_{0}$, and a policy $\pi$, we infer from [41, Theorem 3] that both $J_{\gamma}(\kappa_{0},\pi)$ and $D_{\gamma}^{i}(\kappa_{0},\pi)$ are well-defined (bounded), since $c$ and $d^{i}$ are bounded.

In this work, we are interested in addressing the following problem:

Problem 1.

For a controlled Markov decision process (an MDP or a POMDP), a discount factor $\gamma\in(0,1)$, a total risk functional $J_{\gamma}(\kappa_{0},\pi)$ as in equation (8), and total cost constraints (9), where $\{\rho_{t}\}_{t=0}^{\infty}$ are coherent risk measures, compute

\pi^{*}\in\operatorname*{argmin}_{\pi}\ J_{\gamma}(\kappa_{0},\pi)
\text{subject to}\quad\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)\preceq\boldsymbol{\beta}. (10)

We call a controlled Markov process with the “nested” objective (8) and constraints (9) a constrained risk-averse Markov process.

For MDPs, [17, 33] show that such coherent risk measure objectives can account for modeling errors and parametric uncertainties. We can also interpret Problem 1 as designing policies that minimize the accrued costs in a risk-averse sense while ensuring that the system constraints, e.g., fuel constraints, are not violated even in rare but costly scenarios.

Note that in Problem 1 both the objective function and the constraints are in general non-differentiable and non-convex in the policy $\pi$ (with the exception of the total expected cost as the coherent risk measure $\rho^{\gamma}$ [9]). Therefore, finding optimal policies may in general be hopeless. Instead, we find sub-optimal policies by taking advantage of a Lagrangian formulation and then using an optimization form of Bellman's equations.

Next, we show that the constrained risk-averse problem is equivalent to a non-constrained inf-sup risk-averse problem thanks to the Lagrangian method.

Proposition 1.

Let $J_{\gamma}(\kappa_{0})$ be the value of Problem 1 for a given initial distribution $\kappa_{0}$ and discount factor $\gamma$. Then, (i) the value function satisfies

J_{\gamma}(\kappa_{0})=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi,\boldsymbol{\lambda}), (11)

where

L_{\gamma}(\pi,\boldsymbol{\lambda})=J_{\gamma}(\kappa_{0},\pi)+\langle\boldsymbol{\lambda},\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)-\boldsymbol{\beta}\rangle, (12)

is the Lagrangian function.
(ii) Furthermore, a policy $\pi^{*}$ is optimal for Problem 1 if and only if $J_{\gamma}(\kappa_{0})=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi^{*},\boldsymbol{\lambda})$.

Proof.

(i) If Problem 1 is not feasible for some $\pi$, then $\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi,\boldsymbol{\lambda})=\infty$. Indeed, if the $i$th constraint is not satisfied, i.e., $D_{\gamma}^{i}>\beta^{i}$, we can drive the supremum to infinity by letting $\lambda^{i}\to\infty$ while keeping the remaining $\lambda^{j}$'s constant or zero. If Problem 1 is feasible for $\pi$, then the supremum is achieved by setting $\boldsymbol{\lambda}=\boldsymbol{0}$. Hence, $L_{\gamma}(\pi,\boldsymbol{\lambda})=J_{\gamma}(\kappa_{0},\pi)$ and

\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi,\boldsymbol{\lambda})=\inf_{\pi:\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)\preceq\boldsymbol{\beta}}J_{\gamma}(\kappa_{0},\pi),

which implies (i).
(ii) If $\pi^{*}$ is optimal, then, from (11), we have

J_{\gamma}(\kappa_{0})=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi^{*},\boldsymbol{\lambda}).

Conversely, if $J_{\gamma}(\kappa_{0})=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi^{\prime},\boldsymbol{\lambda})$ for some $\pi^{\prime}$, then from (11) we have $\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi,\boldsymbol{\lambda})=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}L_{\gamma}(\pi^{\prime},\boldsymbol{\lambda})$. Hence, $\pi^{\prime}$ is an optimal policy. ∎

IV Constrained Risk-Averse MDPs

At any time $t$, the value of $\rho_{t}$ is $\mathcal{F}_{t}$-measurable and is allowed to depend on the entire history of the process $\{s_{0},s_{1},\ldots\}$; hence, we cannot expect to obtain a Markov optimal policy [34, 11]. In order to obtain Markov policies, we need the following property [41].

Definition 4 (Markov Risk Measure).

Let $m,n\in[1,\infty)$ such that $1/m+1/n=1$ and $\mathcal{P}=\big\{p\in\mathcal{L}_{n}(\mathcal{S},2^{\mathcal{S}},\mathbb{P})\mid\sum_{s^{\prime}\in\mathcal{S}}p(s^{\prime})\mathbb{P}(s^{\prime})=1,\ p\geq 0\big\}$. A one-step conditional risk measure $\rho_{t}:\mathcal{C}_{t+1}\to\mathcal{C}_{t}$ is a Markov risk measure with respect to the controlled Markov process $\{s_{t}\}$, $t=0,1,\ldots$, if there exists a risk transition mapping $\sigma_{t}:\mathcal{L}_{m}(\mathcal{S},2^{\mathcal{S}},\mathbb{P})\times\mathcal{S}\times\mathcal{P}\to\mathbb{R}$ such that for all $v\in\mathcal{L}_{m}(\mathcal{S},2^{\mathcal{S}},\mathbb{P})$ and $\alpha_{t}\in\pi(s_{t})$, we have

\rho_{t}(v(s_{t+1}))=\sigma_{t}(v(s_{t+1}),s_{t},p(s_{t+1}|s_{t},\alpha_{t})), (13)

where $p:\mathcal{S}\times Act\to\mathcal{P}$ is called the controlled kernel.

In fact, if $\rho_{t}$ is a coherent risk measure, $\sigma_{t}$ also satisfies the properties of a coherent risk measure (Definition 3). In this paper, since we are concerned with MDPs, the controlled kernel is simply the transition function $T$.

Assumption 1.

The one-step coherent risk measure $\rho_{t}$ is a Markov risk measure.

The simplest risk transition mapping arises in the conditional expectation case $\rho_{t}(v(s_{t+1}))=\mathbb{E}\{v(s_{t+1})\mid s_{t},\alpha_{t}\}$, i.e.,

\sigma\left\{v(s_{t+1}),s_{t},p(s_{t+1}|s_{t},\alpha_{t})\right\}=\mathbb{E}\{v(s_{t+1})\mid s_{t},\alpha_{t}\}=\sum_{s_{t+1}\in\mathcal{S}}v(s_{t+1})T(s_{t+1}\mid s_{t},\alpha_{t}). (14)

Note that in the total discounted expectation case $\sigma$ is a linear function of $v$, rather than a convex function, which is the case for a general coherent risk measure. For example, for the CVaR risk measure, the Markov risk transition mapping is given by

\sigma\{v(s_{t+1}),s_{t},p(s_{t+1}|s_{t},\alpha_{t})\}=\inf_{\zeta\in\mathbb{R}}\left\{\zeta+\frac{1}{\varepsilon}\sum_{s_{t+1}\in\mathcal{S}}\left(v(s_{t+1})-\zeta\right)_{+}T(s_{t+1}\mid s_{t},\alpha_{t})\right\},

where $(\cdot)_{+}=\max\{\cdot,0\}$, which is a convex function of $v$.

If $\sigma$ is a coherent, Markov risk measure, then Markov policies are sufficient to ensure optimality [41].

In the next result, we show that we can find a lower bound on the solution to Problem 1 by solving an optimization problem.

Theorem 1.

Consider an MDP $\mathcal{M}$ with the nested risk objective (8), constraints (9), and discount factor $\gamma\in(0,1)$. Let Assumption 1 hold and let $\rho_{t}$, $t=0,1,\ldots$, be coherent risk measures as described in Definition 3. Then, the solution $(\boldsymbol{V}^{*}_{\gamma},\boldsymbol{\lambda}^{*})$ to the following optimization problem (Bellman's equation)

\sup_{\boldsymbol{V}_{\gamma},\boldsymbol{\lambda}\succeq\boldsymbol{0}}\ \langle\boldsymbol{\kappa}_{0},\boldsymbol{V}_{\gamma}\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle
\text{subject to}
V_{\gamma}(s)\leq c(s,\alpha)+\langle\boldsymbol{\lambda},\boldsymbol{d}(s,\alpha)\rangle+\gamma\sigma\{V_{\gamma}(s^{\prime}),s,p(s^{\prime}|s,\alpha)\},\quad\forall s\in\mathcal{S},\ \forall\alpha\in Act, (15)

satisfies

J_{\gamma}(\kappa_{0})\geq\langle\boldsymbol{\kappa}_{0},\boldsymbol{V}^{*}_{\gamma}\rangle-\langle\boldsymbol{\lambda}^{*},\boldsymbol{\beta}\rangle. (16)
Proof.

From Proposition 1, we know that (11) holds. Hence, we have

J_{\gamma}(\kappa_{0})=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(J_{\gamma}(\kappa_{0},\pi)+\langle\boldsymbol{\lambda},\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)-\boldsymbol{\beta}\rangle\right)
=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(J_{\gamma}(\kappa_{0},\pi)+\langle\boldsymbol{\lambda},\boldsymbol{D}_{\gamma}(\kappa_{0},\pi)\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)
=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\rho^{\gamma}(c)+\langle\boldsymbol{\lambda},\rho^{\gamma}(d)\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)
=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\rho^{\gamma}(c)+\rho^{\gamma}(\langle\boldsymbol{\lambda},d\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)
\geq\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\rho^{\gamma}(c+\langle\boldsymbol{\lambda},d\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)
\geq\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\inf_{\pi}\left(\rho^{\gamma}(c+\langle\boldsymbol{\lambda},d\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right), (17)

where in the fourth line we used the positive homogeneity of $\rho^{\gamma}$, in the fifth line its sub-additivity, and in the last line the max-min inequality. Since $\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle$ does not depend on $\pi$, to evaluate the inner infimum it suffices to solve

\inf_{\pi}\rho^{\gamma}(\tilde{c}),

where $\tilde{c}=c+\langle\boldsymbol{\lambda},d\rangle$. The value of this optimization can be obtained by solving the following Bellman equation [41, Theorem 4]:

V_{\gamma}(s)=\inf_{\alpha\in Act}\Big(\tilde{c}(s,\alpha)+\gamma\sigma\{V_{\gamma}(s^{\prime}),s,p(s^{\prime}|s,\alpha)\}\Big).

Next, we show that the solution to the above Bellman equation can alternatively be obtained by solving the convex optimization

\sup_{V_{\gamma}}\ \langle\kappa_{0},V_{\gamma}\rangle
\text{subject to}
V_{\gamma}(s)\leq\tilde{c}(s,\alpha)+\gamma\sigma\{V_{\gamma}(s^{\prime}),s,p(s^{\prime}|s,\alpha)\},\quad\forall s,\alpha. (18)

Define

\mathfrak{D}_{\pi}v:=\tilde{c}(s,\pi(s))+\gamma\sigma\{v(s^{\prime}),s,p(s^{\prime}|s,\pi(s))\},\quad\forall s\in\mathcal{S},

and $\mathfrak{D}v:=\min_{\alpha\in Act}\left(\tilde{c}(s,\alpha)+\gamma\sigma\{v(s^{\prime}),s,p(s^{\prime}|s,\alpha)\}\right)$ for all $s\in\mathcal{S}$. From [41, Lemma 1], we infer that $\mathfrak{D}_{\pi}$ and $\mathfrak{D}$ are non-decreasing; i.e., for $v\leq w$, we have $\mathfrak{D}_{\pi}v\leq\mathfrak{D}_{\pi}w$ and $\mathfrak{D}v\leq\mathfrak{D}w$. Therefore, if $V_{\gamma}\leq\mathfrak{D}_{\pi}V_{\gamma}$, then $\mathfrak{D}_{\pi}V_{\gamma}\leq\mathfrak{D}_{\pi}(\mathfrak{D}_{\pi}V_{\gamma})$. By repeated application of $\mathfrak{D}_{\pi}$, we obtain

V_{\gamma}\leq\mathfrak{D}_{\pi}V_{\gamma}\leq\mathfrak{D}_{\pi}^{2}V_{\gamma}\leq\cdots\leq\mathfrak{D}_{\pi}^{\infty}V_{\gamma}=V^{*}_{\gamma}.

Any feasible solution to (18) must satisfy $V_{\gamma}\leq\mathfrak{D}_{\pi}V_{\gamma}$ and hence must satisfy $V_{\gamma}\leq V^{*}_{\gamma}$. Thus, given that all entries of $\kappa_{0}$ are positive, $V^{*}_{\gamma}$ is the optimal solution to (18). Substituting (18) back into the last inequality in (17) yields the result. ∎

Once the values of $\boldsymbol{\lambda}^{*}$ and $\boldsymbol{V}^{*}_{\gamma}$ are found by solving optimization problem (15), we can find the policy as

\pi^{*}(s)\in\operatorname*{argmin}_{\alpha\in Act}\Big(c(s,\alpha)+\langle\boldsymbol{\lambda}^{*},\boldsymbol{d}(s,\alpha)\rangle+\gamma\sigma\{V^{*}_{\gamma}(s^{\prime}),s,p(s^{\prime}|s,\alpha)\}\Big). (19)
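
As a sketch of how (19) can be used once (15) has been solved, the snippet below extracts a greedy Markovian policy. The data structures (cost arrays `c[s, a]` and `d[i, s, a]`, transition matrices `T[a]`, and a user-supplied risk transition mapping `sigma`) are illustrative assumptions, not part of the formulation above.

```python
import numpy as np

def extract_policy(V, lam, c, d, T, gamma, sigma):
    """Greedy policy extraction following Eq. (19) (illustrative sketch).

    Assumptions: V is the optimal value vector, lam the optimal Lagrange
    multipliers, c[s, a] and d[i, s, a] the cost arrays, T[a][s, s'] the
    transition probabilities, and sigma(V, s, p) the Markov risk transition
    mapping of the chosen coherent risk measure.
    """
    n_states, n_actions = c.shape
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [c[s, a] + lam @ d[:, s, a] + gamma * sigma(V, s, T[a][s, :])
             for a in range(n_actions)]
        policy[s] = int(np.argmin(q))
    return policy

# Risk-neutral instance of sigma, cf. Eq. (14): the conditional expectation.
sigma_expectation = lambda V, s, p: float(p @ V)
```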

One interesting observation is that if the coherent risk measure $\rho_{t}$ is the total discounted expectation, then Theorem 1 is consistent with the classical result of [9] on constrained MDPs.

Corollary 1.

Let the assumptions of Theorem 1 hold and let $\rho_{t}(\cdot)=\mathbb{E}(\cdot|s_{t},\alpha_{t})$, $t=1,2,\ldots$. Then the solution $(\boldsymbol{V}^{*}_{\gamma},\boldsymbol{\lambda}^{*})$ to optimization (15) satisfies

J_{\gamma}(\kappa_{0})=\langle\boldsymbol{\kappa}_{0},\boldsymbol{V}^{*}_{\gamma}\rangle-\langle\boldsymbol{\lambda}^{*},\boldsymbol{\beta}\rangle.

Furthermore, with $\rho_{t}(\cdot)=\mathbb{E}(\cdot|s_{t},\alpha_{t})$, $t=1,2,\ldots$, optimization (15) becomes a linear program.

Proof.

From the derivation in (17), we observe that the two inequalities follow from (a) the sub-additivity of $\rho^{\gamma}$ and (b) the max-min inequality. Next, we show that in the case of total expectation both of these properties hold with equality.
(a) Sub-additivity of $\rho^{\gamma}$: for total expectation, we have

\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}c_{t}+\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle=\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(c_{t}+\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle).

Thus, equality holds.
(b) Max-min inequality: in the case $\rho^{\gamma}_{\kappa_{0}}(\cdot)=\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(\cdot)$, both the objective function and the constraints are linear in the decision variables $\pi$ and $\boldsymbol{\lambda}$. Therefore, the inf-sup expression in (17) reads as

\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\rho^{\gamma}(\boldsymbol{c}+\langle\boldsymbol{\lambda},\boldsymbol{d}\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)=\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(c_{t}+\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right). (20)

The expression inside the parentheses above is convex (linear) in $\pi$ ($\mathbb{E}_{\kappa_{0}}^{\pi}$ is linear in the policy) and concave (linear) in $\boldsymbol{\lambda}$. Hence, from the Minimax Theorem [20], the following equality holds:

\inf_{\pi}\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\left(\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(c_{t}+\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right)=\sup_{\boldsymbol{\lambda}\succeq\boldsymbol{0}}\inf_{\pi}\left(\sum_{t}\mathbb{E}_{\kappa_{0}}^{\pi}\gamma^{t}(c_{t}+\langle\boldsymbol{\lambda},\boldsymbol{d}_{t}\rangle)-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle\right).

Furthermore, from (14), we see that $\sigma$ is linear in $v$ for total expectation. Therefore, the constraint in (15) is linear in $V_{\gamma}$ and $\boldsymbol{\lambda}$. Since $\langle\boldsymbol{\kappa}_{0},\boldsymbol{V}_{\gamma}\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle$ is also linear in $V_{\gamma}$ and $\boldsymbol{\lambda}$, optimization (15) becomes a linear program in the case of the total expectation coherent risk measure. ∎
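
To make Corollary 1 concrete, the following CVXPY sketch sets up the linear program obtained from (15) when $\sigma$ is the conditional expectation (14). The input layout (a dict `T` of transition matrices, cost arrays `c` and `d`, a budget vector `beta`, and an initial distribution `kappa0`) is assumed for illustration.

```python
import cvxpy as cp
import numpy as np

def constrained_mdp_lp(T, c, d, beta, kappa0, gamma):
    """Linear program of Corollary 1 (total-expectation case), a sketch.

    Assumptions: T[a][s, s'] = T(s' | s, a), c[s, a] and d[i, s, a] are
    bounded non-negative costs, and beta is the constraint budget vector.
    """
    n_states, n_actions = c.shape
    n_constraints = d.shape[0]
    V = cp.Variable(n_states)
    lam = cp.Variable(n_constraints, nonneg=True)
    constraints = []
    for s in range(n_states):
        for a in range(n_actions):
            # V(s) <= c(s,a) + <lam, d(s,a)> + gamma * E[V(s') | s, a]
            constraints.append(
                V[s] <= c[s, a] + lam @ d[:, s, a] + gamma * (T[a][s, :] @ V)
            )
    problem = cp.Problem(cp.Maximize(kappa0 @ V - lam @ beta), constraints)
    problem.solve()
    return V.value, lam.value
```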

In [4], we presented a method based on difference convex programs to solve (15) when $\rho^{\gamma}$ is an arbitrary coherent risk measure, and we described the specific structure of the optimization problem for conditional expectation, CVaR, and EVaR. In fact, it was shown that (15) can be written in the standard DCP format

\inf_{\boldsymbol{V}_{\gamma},\boldsymbol{\lambda}\succeq\boldsymbol{0}}\ f_{0}(\boldsymbol{\lambda})-g_{0}(\boldsymbol{V}_{\gamma})
\text{subject to}
f_{1}(V_{\gamma})-g_{1}(\boldsymbol{\lambda})-g_{2}(V_{\gamma})\leq 0,\quad\forall s,\alpha. (21)

Optimization problem (21) is a standard DCP [25]. DCPs arise in many applications, such as feature selection in machine learning [29] and inverse covariance estimation in statistics [48]. Although DCPs can be solved globally [25], e.g., using branch and bound algorithms [28], a locally optimal solution can be obtained more efficiently using techniques from nonlinear optimization [13]. In particular, in this work, we use a variant of the convex-concave procedure [30, 43], wherein the concave terms are replaced by a convex upper bound and the resulting problem is solved. In fact, the disciplined convex-concave programming (DCCP) [43] technique linearizes DCP problems into a (disciplined) convex program (carried out automatically via the DCCP Python package [43]), which is then converted into an equivalent cone program by replacing each function with its graph implementation. The cone program can then be solved readily by available convex programming solvers, such as CVXPY [18].
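
As an illustrative sketch of this pipeline for the CVaR case, the inner infimum over $\zeta$ in (6) can be lifted into per-state-action decision variables and optimization (15) can be posed and handed to the DCCP extension of CVXPY as follows. The data layout matches the earlier hypothetical examples; this is a sketch under those assumptions, not a definitive implementation of the solver used in our experiments.

```python
import cvxpy as cp
import dccp  # registers the 'dccp' solve method on CVXPY problems
import numpy as np

def risk_averse_bellman_dccp(T, c, d, beta, kappa0, gamma, eps):
    """Sketch of (15) with a CVaR risk transition mapping, in DCP form (21).

    Assumptions: T[a][s, s'] = T(s' | s, a), c[s, a] and d[i, s, a] are
    bounded costs, eps is the CVaR confidence level; zeta[s, a] lifts the
    inner infimum of (6) into decision variables.
    """
    n_states, n_actions = c.shape
    n_constraints = d.shape[0]
    V = cp.Variable(n_states)
    lam = cp.Variable(n_constraints, nonneg=True)
    zeta = cp.Variable((n_states, n_actions))
    constraints = []
    for s in range(n_states):
        for a in range(n_actions):
            tail = cp.sum(cp.multiply(T[a][s, :], cp.pos(V - zeta[s, a])))
            rhs = c[s, a] + lam @ d[:, s, a] + gamma * (zeta[s, a] + tail / eps)
            constraints.append(V[s] <= rhs)   # affine <= convex: handled by DCCP
    problem = cp.Problem(cp.Maximize(kappa0 @ V - lam @ beta), constraints)
    problem.solve(method="dccp")              # convex-concave procedure of [43]
    return V.value, lam.value
```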

We end this section by pointing out that solving (15) using the DCCP method only finds (local) saddle points of optimization problem (15). Nevertheless, every saddle point of (15) satisfies (16) (from Theorem 1). In fact, every saddle point is a lower bound on the optimal value of Problem 1.

V Constrained Risk-Averse POMDPs

Next, we show that, in the case of POMDPs, we can find a lower bound on the solution to Problem 1 by solving an infinite-dimensional optimization problem. Note that a POMDP is equivalent to a belief MDP $\{b_{t}\}$, $t=1,2,\ldots$, where $b_{t}$ is defined in (2).

Theorem 2.

Consider a POMDP $\mathcal{PM}$ with the nested risk objective (8) and constraints (9) with $\gamma\in(0,1)$. Let Assumption 1 hold, let $\rho_{t}$, $t=0,1,\ldots$, be coherent risk measures, and suppose $c(\cdot,\cdot)$ and $\{d^{i}(\cdot,\cdot)\}_{i=1}^{n_{c}}$ are non-negative and upper-bounded. Then, the solution $(\boldsymbol{\lambda}^{*},\boldsymbol{V}^{*}_{\gamma})$ to the following Bellman equation

\sup_{\boldsymbol{V}_{\gamma},\boldsymbol{\lambda}\succeq\boldsymbol{0}}\ \langle\boldsymbol{b}_{0},\boldsymbol{V}_{\gamma}\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle
\text{subject to}
V_{\gamma}(b)\leq c(b,\alpha)+\langle\boldsymbol{\lambda},\boldsymbol{d}(b,\alpha)\rangle+\gamma\sigma\{V_{\gamma}(b^{\prime}),b,p(b^{\prime}|b,\alpha)\},\quad\forall b\in\Delta(\mathcal{S}),\ \forall\alpha\in Act, (22)

where $c(b,\alpha)=\sum_{s\in\mathcal{S}}c(s,\alpha)b(s)$ and $d(b,\alpha)=\sum_{s\in\mathcal{S}}d(s,\alpha)b(s)$, satisfies

J_{\gamma}(b_{0})\geq\langle\boldsymbol{b}_{0},\boldsymbol{V}^{*}_{\gamma}\rangle-\langle\boldsymbol{\lambda}^{*},\boldsymbol{\beta}\rangle. (23)
Proof.

Note that a POMDP can be represented as an MDP over the belief states (2) with initial distribution (1). Hence, a POMDP is a controlled Markov process with states $b\in\Delta(\mathcal{S})$, where the controlled belief transition probability is described as

p(b^{\prime}\mid b,\alpha)=\sum_{o\in\mathcal{O}}p(b^{\prime}\mid b,o,\alpha)\,p(o\mid b,\alpha)=\sum_{o\in\mathcal{O}}\delta\left(b^{\prime}-\frac{O(o\mid s,\alpha)\sum_{s^{\prime}\in\mathcal{S}}T(s\mid s^{\prime},\alpha)b(s^{\prime})}{\sum_{s\in\mathcal{S}}O(o\mid s,\alpha)\sum_{s^{\prime}\in\mathcal{S}}T(s\mid s^{\prime},\alpha)b(s^{\prime})}\right)\times\sum_{s\in\mathcal{S}}O(o\mid s,\alpha)\sum_{s^{\prime\prime}\in\mathcal{S}}T(s\mid s^{\prime\prime},\alpha)b(s^{\prime\prime}),

with

\delta(a)=\begin{cases}1&a=0,\\ 0&\text{otherwise}.\end{cases}

The rest of the proof follows the same footsteps as the proof of Theorem 1, applied to the belief MDP with $p(b^{\prime}|b,\alpha)$ as defined above. ∎

Unfortunately, since $b\in\Delta(\mathcal{S})$ and hence $V_{\gamma}:\Delta(\mathcal{S})\to\mathbb{R}$, optimization (22) is infinite-dimensional and cannot be solved efficiently.

If the one-step coherent risk measure $\rho_{t}$ is the total discounted expectation, we can show that optimization problem (22) simplifies to an infinite-dimensional linear program and equality holds in (23). This can be proved following the same lines as the proof of Corollary 1, but for the belief MDP. Hence, Theorem 2 also provides an optimization-based solution to the constrained POMDP problem.

V-A Risk-Averse FSC Synthesis via Policy Iteration

In order to synthesize risk-averse FSCs, we employ a policy iteration algorithm. Policy iteration incrementally improves a controller by alternating between two steps: Policy Evaluation (computing value functions by fixing the policy) and Policy Improvement (computing the policy by fixing the value functions), until convergence to a satisfactory policy [12]. For a risk-averse POMDP, policy evaluation can be carried out by solving (22). However, as mentioned earlier, (22) is difficult to use directly, since it must be solved at each (continuous) belief state in the belief space, which is uncountably infinite.

In the following, we show that if, instead of considering policies with infinite memory, we search over finite-memory policies, then we can find suboptimal solutions to Problem 1 that lower-bound $J_{\gamma}(\kappa_{0})$. To this end, we consider stochastic but finite-memory controllers as described in Section II-C.

Closing the loop around a POMDP with an FSC $\mathcal{G}$ induces a Markov chain. The global Markov chain $\mathcal{MC}^{\mathcal{PM},\mathcal{G}}_{\mathcal{S}\times G}$ (or simply $\mathcal{MC}$, when the FSC and the POMDP are clear from the context) has executions $\{[s_{0},g_{0}],[s_{1},g_{1}],\dots\}$, $[s_{t},g_{t}]\in\mathcal{S}\times G$. The probability of the initial global state $[s_{0},g_{0}]$ is

\iota_{init}\left(\left[s_{0},g_{0}\right]\right)=\kappa_{0}(s_{0})\kappa(g_{0}|\kappa_{0}).

The state transition probability, $T^{\mathcal{M}}$, is given by

T^{\mathcal{M}}\left(\left[s_{t+1},g_{t+1}\right]\mid\left[s_{t},g_{t}\right]\right)=\sum_{o\in\mathcal{O}}\sum_{\alpha\in Act}O(o|s_{t})\,\omega(g_{t+1},\alpha|g_{t},o)\,T(s_{t+1}|s_{t},\alpha).

V-B Risk Value Function Computation

Under an FSC, the POMDP is transformed into a Markov chain $\mathcal{M}^{\mathcal{PM}\times\mathcal{G}}_{\mathcal{S}\times\mathcal{G}}$ with design probability distributions $\omega$ and $\kappa$. The closed-loop Markov chain $\mathcal{M}^{\mathcal{PM}\times\mathcal{G}}_{\mathcal{S}\times\mathcal{G}}$ is a controlled Markov process with $\{q_{t}\}=\{[s_{t},g_{t}]\}$, $t=1,2,\ldots$. In this setting, the total risk functional (8) becomes a function of $\iota_{init}$ and the FSC $\mathcal{G}$, i.e.,

J_{\gamma}(\iota_{init},\mathcal{G})=\rho^{\gamma}\big(c([s_{0},g_{0}],\alpha_{0}),c([s_{1},g_{1}],\alpha_{1}),\ldots\big),\quad s_{0}\sim\kappa_{0},\ g_{0}\sim\kappa, (24)

where the $\alpha_{t}$'s and $g_{t}$'s are drawn from the probability distribution $\omega(g_{t+1},\alpha_{t}\mid g_{t},o_{t})$. The constraint functionals $D_{\gamma}^{i}(\iota_{init},\mathcal{G})$, $i=1,2,\ldots,n_{c}$, are defined similarly.

Let $J_{\gamma}(\boldsymbol{\iota}_{init})$ be the value of Problem 1 under an FSC $\mathcal{G}$. Then, it is evident that $J_{\gamma}(\boldsymbol{b}_{0})\geq J_{\gamma}(\boldsymbol{\iota}_{init})$, since FSCs restrict the search space of the policy $\pi$. That is, FSCs can only be as good as the (infinite-dimensional) belief-based policy $\pi(b)$ as $|G|\to\infty$ (infinite memory).

Risk Value Function Optimization: For POMDPs controlled by stochastic finite state controllers, the dynamic program is developed in the global state space $\mathcal{S}\times G$. From Theorem 1, we see that for a given FSC $\mathcal{G}$ and POMDP $\mathcal{PM}$, the value function $V_{\gamma,\mathcal{M}}([s,g])$ can be computed by solving the following finite-dimensional optimization

\sup_{\boldsymbol{V}_{\gamma,\mathcal{M}},\boldsymbol{\lambda}\succeq\boldsymbol{0}}\ \langle\boldsymbol{\iota}_{init},\boldsymbol{V}_{\gamma,\mathcal{M}}\rangle-\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle
\text{subject to}
V_{\gamma,\mathcal{M}}([s,g])\leq\sum_{\alpha\in Act}p(\alpha\mid g)\tilde{c}([s,g],\alpha)+\gamma\sigma\Big\{V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}]),[s,g],T^{\mathcal{M}}\left([s^{\prime},g^{\prime}]\mid[s,g]\right)\Big\},\quad\forall s\in\mathcal{S},\ \forall g\in G, (25)

where $p(\alpha\mid g)=\sum_{g^{\prime}\in\mathcal{G},o\in\mathcal{O}}\omega(g^{\prime},\alpha\mid g,o)O(o|g^{\prime})$ and $\tilde{c}([s,g],\alpha)=c([s,g],\alpha)+\langle\boldsymbol{\lambda},\boldsymbol{d}([s,g],\alpha)\rangle$. Then, the solution $(\boldsymbol{V}^{*}_{\gamma,\mathcal{M}},\boldsymbol{\lambda}^{*})$ satisfies

J_{\gamma}(\boldsymbol{\iota}_{init})\geq\langle\boldsymbol{\iota}_{init},\boldsymbol{V}^{*}_{\gamma,\mathcal{M}}\rangle-\langle\boldsymbol{\lambda}^{*},\boldsymbol{\beta}\rangle. (26)

Note that since $\rho^{\gamma}$ is a coherent, Markov risk measure (Assumption 1), $v\mapsto\sigma(v,\cdot,\cdot)$ is convex (because $\sigma$ is also a coherent risk measure). In fact, optimization problem (25) is a DCP in the form of (21), where we replace $V_{\gamma}$ with $V_{\gamma,\mathcal{M}}$ and set $f_{0}(\boldsymbol{\lambda})=\langle\boldsymbol{\lambda},\boldsymbol{\beta}\rangle$, $g_{0}(\boldsymbol{V}_{\gamma,\mathcal{M}})=\langle\boldsymbol{\iota}_{init},\boldsymbol{V}_{\gamma,\mathcal{M}}\rangle$, $f_{1}(V_{\gamma,\mathcal{M}})=V_{\gamma,\mathcal{M}}$, $g_{1}(\boldsymbol{\lambda})=\sum_{\alpha\in Act}p(\alpha\mid g)\tilde{c}([s,g],\alpha)$, and $g_{2}(V_{\gamma,\mathcal{M}})=\gamma\sigma(V_{\gamma,\mathcal{M}},\cdot,\cdot)$.

The above optimization is in standard DCP form because $f_{0}$ and $g_{1}$ are convex (linear) functions of $\boldsymbol{\lambda}$, and $g_{0}$, $f_{1}$, and $g_{2}$ are convex functions of $V_{\gamma,\mathcal{M}}$.

Solving the DCP (25) gives a set of value functions $V_{\gamma,\mathcal{M}}$. In the next section, we discuss how to use the solutions of this DCP in our proposed policy iteration algorithm to sequentially improve the FSC parameters $\omega$.

V-C I-States Improvement

Let $\vec{V}_{\gamma,\mathcal{M}}(g)\in\mathbb{R}^{|\mathcal{S}|}$ denote the vector obtained by stacking $V_{\gamma,\mathcal{M}}([s,g])$ over $s$. We say that an I-state $g$ is improved if the tunable FSC parameters associated with that I-state can be adjusted so that $\vec{V}^{*}_{\gamma,\mathcal{M}}(g)$ increases.

To begin with, we compute the initial I-state by finding the best-valued I-state for the given initial belief, i.e., $\kappa(g_{init})=1$, where

g_{init}=\underset{g\in G}{\operatorname{argmax}}\ \left\langle\boldsymbol{\iota}_{init},\vec{V}_{\gamma,\mathcal{M}}(g)\right\rangle.

After this initialization, we search for FSC parameters $\omega$ that result in an improvement.

I-state Improvement Optimization: Given the value functions $V_{\gamma,\mathcal{M}}([s,g])$ for all $s\in\mathcal{S}$ and $g\in G$ and the Lagrangian parameters $\boldsymbol{\lambda}$, for every I-state $g$ we can find FSC parameters $\omega$ that result in an improvement by solving the following optimization

\underset{\epsilon>0,\ \omega(g^{\prime},\alpha|g,o)}{\max}\ \ \epsilon
\text{subject to}
\text{Improvement Constraint:}
V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\text{r.h.s. of (25)},\quad\forall s\in\mathcal{S},
\text{Probability Constraints:}
\sum_{(g^{\prime},\alpha)\in G\times Act}\omega(g^{\prime},\alpha\mid g,o)=1,\quad\forall o\in\mathcal{O},
\omega(g^{\prime},\alpha\mid g,o)\geq 0,\quad\forall g^{\prime}\in G,\ \alpha\in Act,\ o\in\mathcal{O}. (27)

Note that the above optimization searches for $\omega$ values that improve the I-state value vector $\vec{V}^{*}_{\gamma,\mathcal{M}}(g)$ by maximizing the auxiliary decision variable $\epsilon$.

Optimization problem (27) is in general non-convex. This can be inferred from the fact that, although the first term on the r.h.s. of (25) is linear in $\omega$, the convexity or concavity of the $\sigma$ term in $\omega$ is not clear for a general coherent risk measure. Fortunately, we can prove the following result.

Proposition 2.

Let $\boldsymbol{V}_{\gamma,\mathcal{M}}$ and $\boldsymbol{\lambda}$ be given. Then, the I-State Improvement Optimization (27) is a linear program for the conditional expectation and CVaR risk measures. Furthermore, (27) is a convex optimization for the EVaR risk measure.

Proof.

We present the different forms of the Improvement Constraint in (27) for the different risk measures. Note that the rest of the constraints and the cost function are linear in the decision variables $\epsilon$ and $\omega$. The Improvement Constraint in (27) is linear in $\epsilon$; however, its convexity or concavity in $\omega$ depends on the risk measure under consideration. Recall from the previous section that in the Policy Evaluation step the quantities $\boldsymbol{V}_{\gamma,\mathcal{M}}$ and $\boldsymbol{\lambda}\succeq\boldsymbol{0}$ (for the conditional expectation, CVaR, and EVaR measures) and $\zeta$ (for the CVaR and EVaR measures) are calculated and are therefore fixed here.

For conditional expectation, the Improvement Constraint alters to

Vγ,([s,g])+ϵαActp(αg)c~([s,g],α)+γs𝒮,g𝒢Vγ,([s,g])T([s,g]|[s,g]),s𝒮,gG.V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha\in Act}p(\alpha\mid g)\tilde{c}([s,g],\alpha)\\ +\gamma\sum_{s^{\prime}\in\mathcal{S},g^{\prime}\in\mathcal{G}}V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])T^{\mathcal{M}}\left([s^{\prime},g^{\prime}]\left|[s,g]\right.\right),\\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall s\in\mathcal{S},\leavevmode\nobreak\ \forall g\in G. (28)

Substituting the expression for TT^{\mathcal{M}}, i.e.,

T([s,g]|[s,g])=o𝒪αActO(o|s)ω(g,α|g,o)T(s|s,α),T^{\mathcal{M}}\left([s^{\prime},g^{\prime}]\left|[s,g]\right.\right)=\sum_{o\in\mathcal{O}}\sum_{\alpha\in Act}O(o|s)\omega(g^{\prime},\alpha|g,o)T(s^{\prime}|s,\alpha),

and for $p(\alpha\mid g)$, i.e.,

p(\alpha\mid g)=\sum_{g^{\prime}\in G,\,o\in\mathcal{O}}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime}),

we obtain

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\gamma\sum_{s^{\prime},g^{\prime},o,\alpha}V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])\,O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha),\quad\forall s\in\mathcal{S},\ \forall g\in G. (29)

The above expression is linear in $\omega$ as well as in $\epsilon$. Hence, the I-State Improvement Optimization becomes a linear program for the conditional expectation risk measure.
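To make this linearity concrete, the coefficient multiplying each entry $\omega(g^{\prime},\alpha\mid g,o)$ in the right-hand side of (29) can be assembled offline. The following is a minimal NumPy sketch with hypothetical array shapes and random data (not the paper's setup); it follows (29) verbatim, including the $O(o|g^{\prime})$ factor, which indexes correctly here only because the toy example has fewer I-states than states.

```python
import numpy as np

# Hypothetical problem data (shapes are assumptions): O[o, s] = O(o|s),
# T[s2, s, a] = T(s'|s, alpha), V[s2, g2] = V_{gamma,M}([s', g']),
# c_tilde[s, g, a] = Lagrangian stage cost c~([s, g], alpha).
nS, nG, nA, nO = 4, 2, 3, 4
rng = np.random.default_rng(0)
O = rng.dirichlet(np.ones(nO), size=nS).T                         # (nO, nS)
T = rng.dirichlet(np.ones(nS), size=(nS, nA)).transpose(2, 0, 1)  # (nS, nS, nA)
V = rng.random((nS, nG))
c_tilde = rng.random((nS, nG, nA))
gamma = 0.95

def improvement_coeffs(s, g):
    """Coefficient of omega(g', a | g, o) in the r.h.s. of (29) at state [s, g];
    the constraint then reads V[s, g] + eps <= sum(coef * omega)."""
    coef = np.zeros((nG, nA, nO))
    for g2 in range(nG):
        for a in range(nA):
            for o in range(nO):
                cost_term = O[o, g2] * c_tilde[s, g, a]      # O(o|g') c~([s,g],a)
                future = gamma * O[o, s] * (V[:, g2] @ T[:, s, a])
                coef[g2, a, o] = cost_term + future
    return coef
```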

Based on a similar construction, for the CVaR measure the Improvement Constraint becomes

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\gamma\left\{\zeta+\frac{1}{\varepsilon}\sum_{g^{\prime},s^{\prime}}\left(V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])-\zeta\right)_{+}T^{\mathcal{M}}([s^{\prime},g^{\prime}]\mid[s,g])\right\},\quad\forall s\in\mathcal{S},\ \forall g\in G. (30)

After substituting the expression for $T^{\mathcal{M}}$, we obtain

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\gamma\left\{\zeta+\frac{1}{\varepsilon}\sum_{g^{\prime},s^{\prime},o,\alpha}\left(V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])-\zeta\right)_{+}O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha)\right\},\quad\forall s\in\mathcal{S},\ \forall g\in G. (31)
\underset{\epsilon>0,\ \omega(g^{\prime},\alpha|g,o)}{\max}\ \ \langle\boldsymbol{\iota}_{init},V_{\gamma,\mathcal{M}}\rangle-\langle\lambda,\beta\rangle+\epsilon
subject to
Improvement Constraint:
V_{\gamma,\mathcal{M}}([s,g])+\epsilon-\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)-\gamma\left\{\zeta+\frac{1}{\varepsilon}\sum_{g^{\prime},s^{\prime},o,\alpha}\left(V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])-\zeta\right)_{+}O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha)\right\}\leq 0,\quad\forall s\in\mathcal{S},\ \forall g\in G, (32a)
Probability Constraints:
\sum_{(g^{\prime},\alpha)\in G\times Act}\omega(g^{\prime},\alpha\mid g,o)=1,\quad\forall o\in\mathcal{O},
\omega(g^{\prime},\alpha\mid g,o)\geq 0,\quad\forall g^{\prime}\in G,\ \alpha\in Act,\ o\in\mathcal{O}. (32b)

 

Furthermore, for fixed $V_{\gamma,\mathcal{M}}$, $\lambda$, and $\zeta$, the above inequality is linear in $\omega$ and $\epsilon$. Hence, (31) becomes a linear constraint, rendering (V-C) a linear program (maximizing a linear objective subject to linear constraints), i.e., optimization problem (32).
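For a single I-state $g$, the linear program (32) can be written down almost verbatim in CVXPY. The following is a minimal sketch with hypothetical data (the array shapes, random values, and fixed $\zeta$ are assumptions for illustration, not the authors' implementation); it only illustrates the structure asserted in Proposition 2.

```python
import cvxpy as cp
import numpy as np

# Hypothetical fixed data for one I-state g (shapes/values are assumptions).
nS, nG, nA, nO = 4, 2, 3, 4
rng = np.random.default_rng(1)
O = rng.dirichlet(np.ones(nO), size=nS).T                         # O[o, s]
T = rng.dirichlet(np.ones(nS), size=(nS, nA)).transpose(2, 0, 1)  # T[s', s, a]
V = rng.random((nS, nG))            # fixed value function V_{gamma,M}([s, g])
c_tilde = rng.random((nS, nG, nA))  # fixed Lagrangian stage cost
gamma, eps_cvar, zeta, g = 0.95, 0.2, float(np.quantile(V, 0.8)), 0

omega = cp.Variable((nO, nG * nA), nonneg=True)  # omega[o, g'*nA + a]
epsilon = cp.Variable(nonneg=True)
constraints = [cp.sum(omega[o, :]) == 1 for o in range(nO)]

excess = np.maximum(V - zeta, 0.0)               # (V([s', g']) - zeta)_+
for s in range(nS):
    rhs = gamma * zeta
    for o in range(nO):
        for g2 in range(nG):
            for a in range(nA):
                cost = O[o, g2] * c_tilde[s, g, a]
                tail = O[o, s] * float(excess[:, g2] @ T[:, s, a])
                rhs = rhs + (cost + gamma * tail / eps_cvar) * omega[o, g2 * nA + a]
    constraints.append(V[s, g] + epsilon <= rhs)  # constraint (31), affine in omega

prob = cp.Problem(cp.Maximize(epsilon), constraints)
prob.solve()  # a linear program, as stated in Proposition 2
```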

For the EVaR measure, the Improvement Constraint is given by

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\gamma\left\{\frac{1}{\zeta}\log\!\left(\frac{\sum_{g^{\prime},s^{\prime}}e^{\zeta V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])}\,T^{\mathcal{M}}([s^{\prime},g^{\prime}]\mid[s,g])}{\varepsilon}\right)\right\},\quad\forall s\in\mathcal{S},\ \forall g\in G. (33)

Substituting the expression for $T^{\mathcal{M}}$, i.e.,

T^{\mathcal{M}}\left([s^{\prime},g^{\prime}]\mid[s,g]\right)=\sum_{o\in\mathcal{O}}\sum_{\alpha\in Act}O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha),

we obtain

V_{\gamma,\mathcal{M}}([s,g])+\epsilon\leq\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)+\frac{\gamma}{\zeta}\log\!\left(\frac{\sum_{g^{\prime},s^{\prime},o,\alpha}e^{\zeta V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])}\,O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha)}{\varepsilon}\right),\quad\forall s\in\mathcal{S},\ \forall g\in G, (34)
\underset{\epsilon>0,\ \omega(g^{\prime},\alpha|g,o)}{\max}\ \ \langle\boldsymbol{\iota}_{init},V_{\gamma,\mathcal{M}}\rangle-\langle\lambda,\beta\rangle+\epsilon
subject to
Improvement Constraint:
V_{\gamma,\mathcal{M}}([s,g])+\epsilon-\sum_{\alpha,g^{\prime},o}\omega(g^{\prime},\alpha\mid g,o)\,O(o|g^{\prime})\,\tilde{c}([s,g],\alpha)-\gamma\left\{\frac{1}{\zeta}\log\!\left(\frac{\sum_{g^{\prime},s^{\prime},o,\alpha}e^{\zeta V_{\gamma,\mathcal{M}}([s^{\prime},g^{\prime}])}\,O(o|s)\,\omega(g^{\prime},\alpha|g,o)\,T(s^{\prime}|s,\alpha)}{\varepsilon}\right)\right\}\leq 0,\quad\forall s\in\mathcal{S},\ \forall g\in G, (35a)
Probability Constraints:
\sum_{(g^{\prime},\alpha)\in G\times Act}\omega(g^{\prime},\alpha\mid g,o)=1,\quad\forall o\in\mathcal{O},
\omega(g^{\prime},\alpha\mid g,o)\geq 0,\quad\forall g^{\prime}\in G,\ \alpha\in Act,\ o\in\mathcal{O}. (35b)

 

In the above inequality, the first term on the right-hand side is linear in $\omega$, and the second term on the right-hand side (the logarithm term) is concave in $\omega$ (convex when moved to the left-hand side, since $-\log(x)$ is convex in $x$). Therefore, (34) becomes a convex constraint, rendering (V-C) a convex optimization problem (maximizing a linear objective subject to linear and convex constraints) for the EVaR measure. That is, the I-State Improvement Optimization takes the convex optimization form of (35). ∎
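Analogously, a minimal CVXPY sketch of the EVaR case (35) for one I-state, under the same hypothetical data conventions as the CVaR sketch above, uses `cp.log` applied to an affine expression of $\omega$, which CVXPY recognizes as concave, so the problem is accepted as convex; this is an illustration of Proposition 2, not the authors' implementation.

```python
import cvxpy as cp
import numpy as np

# Hypothetical fixed data for one I-state g (shapes/values are assumptions).
nS, nG, nA, nO = 4, 2, 3, 4
rng = np.random.default_rng(2)
O = rng.dirichlet(np.ones(nO), size=nS).T                         # O[o, s]
T = rng.dirichlet(np.ones(nS), size=(nS, nA)).transpose(2, 0, 1)  # T[s', s, a]
V = rng.random((nS, nG))
c_tilde = rng.random((nS, nG, nA))
gamma, eps_evar, zeta, g = 0.95, 0.2, 1.0, 0

omega = cp.Variable((nO, nG * nA), nonneg=True)  # omega[o, g'*nA + a]
epsilon = cp.Variable(nonneg=True)
constraints = [cp.sum(omega[o, :]) == 1 for o in range(nO)]

expV = np.exp(zeta * V)                          # e^{zeta V([s', g'])}, fixed
for s in range(nS):
    cost = 0
    arg = 0                                      # affine argument of the log
    for o in range(nO):
        for g2 in range(nG):
            for a in range(nA):
                w = omega[o, g2 * nA + a]
                cost = cost + O[o, g2] * c_tilde[s, g, a] * w
                arg = arg + float(expV[:, g2] @ T[:, s, a]) * O[o, s] * w
    # Constraint (34): the log of an affine expression is concave in omega.
    constraints.append(V[s, g] + epsilon
                       <= cost + (gamma / zeta) * cp.log(arg / eps_evar))

prob = cp.Problem(cp.Maximize(epsilon), constraints)
prob.solve()  # a convex program, as stated in Proposition 2
```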

If no improvement is achieved by optimization (V-C), i.e., $\epsilon=0$, for a fixed number of internal states $|G|$, we can increase $|G|$ by one, following the bounded policy iteration method proposed in [3, Section V.B].

V-D Policy Iteration Algorithm

Algorithm 1 outlines the main steps of the proposed policy iteration method for constrained risk-averse FSC synthesis. The algorithm has two distinct parts. First, for fixed FSC parameters $\omega$, policy evaluation is carried out, in which $V_{\gamma,\mathcal{M}}([s,g])$ and $\lambda$ are computed using DCP (V-B) (Steps 2, 10, and 18). Second, after evaluating the current value functions and Lagrange multipliers, an improvement is carried out, either by changing the parameters of existing I-states via optimization (V-C) or, if no new parameters can improve any I-state, by adding a fixed number of I-states to escape the local minimum (Steps 14-17), based on the method proposed in [3, Section V.B]. A schematic sketch of this loop is given after the listing.

Algorithm 1 Policy Iteration For Synthesizing Constrained Risk-Averse FSC
Require: (a) An initial feasible FSC, $\mathcal{G}$. (b) Maximum size of the FSC, $N_{max}$. (c) Number of I-states to add, $N_{new}\leq N_{max}$.
1: $improved \leftarrow True$
2: Compute the value vectors $\vec{V}_{\gamma,\mathcal{M}}$ and Lagrange multipliers $\boldsymbol{\lambda}$ based on DCP (V-B).
3: while $|G|\leq N_{max}$ and $improved=True$ do
4:   $improved \leftarrow False$
5:   for all I-states $g\in G$ do
6:     Solve the I-State Improvement Optimization (V-C).
7:     if the I-State Improvement Optimization results in $\epsilon>0$ then
8:       Replace the parameters $\omega$ for I-state $g$.
9:       $improved \leftarrow True$
10:      Compute the value vectors $\vec{V}_{\gamma,\mathcal{M}}$ and Lagrange multipliers $\boldsymbol{\lambda}$ based on optimization (V-B).
11:  if $improved=False$ and $|G|<N_{max}$ then
12:    $n_{added} \leftarrow 0$
13:    $N^{\prime}_{new} \leftarrow \min(N_{new},\,N_{max}-|G|)$
14:    Try to add $N^{\prime}_{new}$ I-state(s) to $\mathcal{G}$.
15:    $n_{added} \leftarrow$ actual number of I-states added in the previous step.
16:    if $n_{added}>0$ then
17:      $improved \leftarrow True$
18:      Compute the value vectors $\vec{V}_{\gamma,\mathcal{M}}$ and Lagrange multipliers $\boldsymbol{\lambda}$ based on optimization (V-B).
Output: $\mathcal{G}$
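The following Python skeleton mirrors the structure of Algorithm 1. It is a schematic sketch, not the authors' implementation; the helpers `evaluate` (solves DCP (V-B)), `improve_istate` (solves (V-C)), and `add_istates` are hypothetical interfaces assumed for illustration.

```python
# Schematic policy iteration loop for constrained risk-averse FSC synthesis.
# All helper callables below are hypothetical placeholders:
#   evaluate(fsc)                 -> (value vectors V, Lagrange multipliers lam)
#   improve_istate(fsc, g, V, lam)-> (epsilon, new omega parameters for I-state g)
#   add_istates(fsc, k)           -> number of I-states actually added
def policy_iteration(fsc, evaluate, improve_istate, add_istates, n_max, n_new):
    improved = True
    V, lam = evaluate(fsc)                      # Step 2: solve DCP (V-B)
    while fsc.num_istates <= n_max and improved:
        improved = False
        for g in range(fsc.num_istates):        # Steps 5-10
            eps, params = improve_istate(fsc, g, V, lam)
            if eps > 0:
                fsc.set_params(g, params)       # replace omega for I-state g
                improved = True
                V, lam = evaluate(fsc)
        if not improved and fsc.num_istates < n_max:   # Steps 11-18
            k = min(n_new, n_max - fsc.num_istates)
            if add_istates(fsc, k) > 0:         # escape the local optimum
                improved = True
                V, lam = evaluate(fsc)
    return fsc
```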

VI Numerical Experiments

In this section, we evaluate the proposed methodology with numerical experiments. In addition to the traditional total expectation, we consider two other coherent risk measures, namely, CVaR and EVaR. All experiments were carried out on a MacBook Pro with a 2.8 GHz Quad-Core Intel Core i5 and 16 GB of RAM. The resulting linear programs and DCPs were solved using CVXPY [18] with the DCCP [43] add-on in Python.
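The DCCP add-on extends CVXPY to problems whose objective and constraints are differences of convex functions and solves them via the convex-concave procedure. The snippet below is a toy example (unrelated to the paper's DCPs) shown only to illustrate the calling convention relied on throughout the experiments.

```python
import cvxpy as cp
import dccp  # registers the "dccp" solve method with CVXPY

# Toy DCCP: maximize the distance between two points confined to the unit box.
# Maximizing a convex function is non-convex but is a valid disciplined
# convex-concave program, handled by the convex-concave procedure.
x = cp.Variable(2)
y = cp.Variable(2)
prob = cp.Problem(cp.Maximize(cp.norm(x - y, 2)),
                  [0 <= x, x <= 1, 0 <= y, y <= 1])
prob.solve(method="dccp")
print(x.value, y.value)  # approximately opposite corners of the box
```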

VI-A Rover MDP Example Set Up

Figure 2: Grid world illustration for the rover navigation example. Blue cells denote the obstacles and the yellow cell denotes the goal.

An agent (e.g., a rover) must autonomously navigate a 2-dimensional terrain map (e.g., the Mars surface) represented by an $M\times N$ grid with $0.25MN$ obstacles. The state space is given by $\mathcal{S}=\{s_{i}\mid i=x+My,\ x\in\{1,\dots,M\},\ y\in\{1,\dots,N\}\}$. The action set available to the robot is $Act=\{E, W, N, S, NE, NW, SE, SW\}$. The state transition probabilities for various cell types are shown for action $E$ in Figure 2, i.e., the agent moves to the cell implied by the action with probability $0.7$, but can also move to any adjacent cell with total probability $0.3$. Partial observability arises because the rover cannot determine the obstacle cell locations directly from measurements. The observation space is $\mathcal{O}=\{o_{i}\mid i=x+My,\ x\in\{1,\dots,M\},\ y\in\{1,\dots,N\}\}$. Once adjacent to an obstacle, the rover identifies the actual obstacle position (dark green) with probability $0.6$, and otherwise observes a distribution over the nearby cells (light green).

Hitting an obstacle incurs an immediate cost of $10$, while the goal grid region has zero immediate cost. Any other grid cell has a cost of $2$ to represent fuel consumption. The discount factor is set to $\gamma=0.95$.
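A minimal sketch of this grid-world MDP is given below. The obstacle layout is random, indexing is 0-based, and the way the residual $0.3$ probability is split among adjacent cells (uniformly) is an assumption made here for illustration.

```python
import numpy as np

M, N, gamma = 10, 10, 0.95
ACTIONS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1),
           "NE": (1, 1), "NW": (-1, 1), "SE": (1, -1), "SW": (-1, -1)}
rng = np.random.default_rng(0)
obstacles = set(map(tuple, rng.integers(0, [M, N], size=(M * N // 4, 2))))
goal = (M - 1, N - 1)

def idx(x, y):                      # state index s = x + M*y (0-based here)
    return x + M * y

def clip(x, y):                     # stay inside the grid when a move leaves it
    return min(max(x, 0), M - 1), min(max(y, 0), N - 1)

nS = M * N
T = np.zeros((nS, len(ACTIONS), nS))          # T[s, a, s'] = T(s' | s, a)
for x in range(M):
    for y in range(N):
        s = idx(x, y)
        neighbors = [clip(x + dx, y + dy) for dx, dy in ACTIONS.values()]
        for a, (dx, dy) in enumerate(ACTIONS.values()):
            T[s, a, idx(*clip(x + dx, y + dy))] += 0.7
            for nx, ny in neighbors:          # assumed: uniform 0.3 over neighbors
                T[s, a, idx(nx, ny)] += 0.3 / len(neighbors)

cost = np.full(nS, 2.0)                            # fuel cost everywhere
cost[[idx(x, y) for (x, y) in obstacles]] = 10.0   # hitting an obstacle
cost[idx(*goal)] = 0.0                             # goal cell is free
assert np.allclose(T.sum(axis=2), 1.0)             # rows are distributions
```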

The objective is to compute a safe path that is fuel efficient, i.e., to solve Problem 1. To this end, we consider total expectation, CVaR, and EVaR as the coherent risk measures.

Once a policy is calculated, as a robustness test inspired by [17], we include a set of single-cell obstacles that are each perturbed in a random direction to one of the neighboring grid cells with probability $0.3$, representing uncertainty in the terrain map. For each risk measure, we run $100$ Monte Carlo simulations with the calculated policies and record the failure rate, i.e., the fraction of runs in which a collision occurred.
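A sketch of this Monte Carlo robustness test is given below. The interfaces (`policy`, `T`, a set of obstacle state indices) are hypothetical, and the way an obstacle is shifted to a neighbor is simplified to a state-index shift for illustration.

```python
import numpy as np

def failure_rate(policy, T, obstacle_states, s0, horizon=200, runs=100, seed=0):
    """Fraction of rollouts that hit a (randomly perturbed) obstacle.
    Assumed interfaces: policy[s] is an action index, T[s, a, :] a
    next-state distribution, obstacle_states a set of state indices."""
    rng = np.random.default_rng(seed)
    nS = T.shape[0]
    failures = 0
    for _ in range(runs):
        # Perturb each single-cell obstacle with probability 0.3 (simplified:
        # shift its state index by +-1; a grid-aware shift would be analogous).
        perturbed = {s + rng.choice([-1, 1]) if rng.random() < 0.3 else s
                     for s in obstacle_states}
        s = s0
        for _ in range(horizon):
            s = rng.choice(nS, p=T[s, policy[s], :])
            if s in perturbed:
                failures += 1
                break
    return failures / runs
```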

VI-B MDP Results

To evaluate the technique discussed in Section IV, we assume that there is no partial observation. In our experiments, we consider four grid-world sizes of $10\times 10$, $15\times 15$, $20\times 20$, and $30\times 30$, corresponding to $100$, $225$, $400$, and $900$ states, respectively. For each grid-world, we randomly allocate 25% of the cells to obstacles, including 3, 6, 9, and 12 uncertain (single-cell) obstacles for the $10\times 10$, $15\times 15$, $20\times 20$, and $30\times 30$ grids, respectively. In each case, we solve DCP (1) (a linear program in the case of total expectation) with $|\mathcal{S}||Act|=MN\times 8=8MN$ constraints and $MN+2$ variables (the risk value functions $V_{\gamma}$, the Lagrangian coefficient $\lambda$, and $\zeta$ for CVaR and EVaR). In these experiments, we set $\varepsilon=0.2$ for the CVaR and EVaR coherent risk measures to represent risk-averse policies. The fuel budget (constraint bound $\beta$) was set to 50, 10, 200, and 600 for the $10\times 10$, $15\times 15$, $20\times 20$, and $30\times 30$ grid-worlds, respectively. The initial condition was chosen as $\kappa_{0}(s_{M})=M-1$, i.e., the agent starts at the second left-most cell at the bottom.

A summary of our numerical experiments is provided in Table I. Note that the computed values of Problem 1 satisfy $\mathbb{E}(c)\leq\mathrm{CVaR}_{\varepsilon}(c)\leq\mathrm{EVaR}_{\varepsilon}(c)$, which is consistent with the fact that EVaR is a more conservative coherent risk measure than CVaR [6].
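This ordering can also be checked empirically from cost samples. The snippet below uses synthetic data (an assumption for illustration, not the paper's results) and the standard sample-based estimators of CVaR and EVaR; the one-dimensional search bounds for the EVaR infimum are an assumption as well.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def empirical_cvar(costs, eps):
    """CVaR_eps of the empirical cost distribution: mean of the worst
    eps-fraction of the samples (costs are losses; larger is worse)."""
    k = int(np.ceil(eps * len(costs)))
    return float(np.sort(costs)[-k:].mean())

def empirical_evar(costs, eps):
    """EVaR_eps(c) = inf_{z>0} (1/z) log(E[exp(z c)] / eps), evaluated on the
    empirical distribution by a bounded one-dimensional search over z."""
    def objective(z):
        return (np.log(np.mean(np.exp(z * costs))) - np.log(eps)) / z
    return float(minimize_scalar(objective, bounds=(1e-6, 1.0),
                                 method="bounded").fun)

eps = 0.2
costs = np.random.default_rng(0).gamma(shape=2.0, scale=5.0, size=10_000)
print(costs.mean(), empirical_cvar(costs, eps), empirical_evar(costs, eps))
# The three printed values are nondecreasing: E <= CVaR_eps <= EVaR_eps.
```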

$(M\times N)_{\rho_{t}}$ | $J_{\gamma}(\kappa_{0})$ | Total Time [s] | # U.O. | F.R.
$(10\times 10)_{\mathbb{E}}$ | 9.12 | 0.8 | 3 | 11%
$(15\times 15)_{\mathbb{E}}$ | 12.53 | 0.9 | 6 | 23%
$(20\times 20)_{\mathbb{E}}$ | 19.93 | 1.7 | 9 | 33%
$(30\times 30)_{\mathbb{E}}$ | 27.30 | 2.4 | 12 | 41%
$(10\times 10)_{\text{CVaR}_{0.7}}$ | $\geq 12.04$ | 5.8 | 3 | 8%
$(15\times 15)_{\text{CVaR}_{0.7}}$ | $\geq 14.83$ | 9.3 | 6 | 18%
$(20\times 20)_{\text{CVaR}_{0.7}}$ | $\geq 20.19$ | 10.34 | 9 | 19%
$(30\times 30)_{\text{CVaR}_{0.7}}$ | $\geq 34.95$ | 14.2 | 12 | 32%
$(10\times 10)_{\text{CVaR}_{0.2}}$ | $\geq 14.45$ | 6.2 | 3 | 3%
$(15\times 15)_{\text{CVaR}_{0.2}}$ | $\geq 17.82$ | 9.0 | 6 | 5%
$(20\times 20)_{\text{CVaR}_{0.2}}$ | $\geq 25.63$ | 11.1 | 9 | 13%
$(30\times 30)_{\text{CVaR}_{0.2}}$ | $\geq 44.83$ | 15.25 | 12 | 22%
$(10\times 10)_{\text{EVaR}_{0.7}}$ | $\geq 14.53$ | 4.8 | 3 | 4%
$(15\times 15)_{\text{EVaR}_{0.7}}$ | $\geq 16.36$ | 8.8 | 6 | 11%
$(20\times 20)_{\text{EVaR}_{0.7}}$ | $\geq 29.89$ | 10.5 | 9 | 15%
$(30\times 30)_{\text{EVaR}_{0.7}}$ | $\geq 54.13$ | 14.99 | 12 | 12%
$(10\times 10)_{\text{EVaR}_{0.2}}$ | $\geq 18.03$ | 5.8 | 3 | 1%
$(15\times 15)_{\text{EVaR}_{0.2}}$ | $\geq 21.10$ | 8.7 | 6 | 3%
$(20\times 20)_{\text{EVaR}_{0.2}}$ | $\geq 24.08$ | 10.2 | 9 | 7%
$(30\times 30)_{\text{EVaR}_{0.2}}$ | $\geq 63.04$ | 14.25 | 12 | 10%
TABLE I: Comparison between the total expectation, CVaR, and EVaR coherent risk measures. $(M\times N)_{\rho_{t}}$ denotes an experiment with a grid-world of size $M\times N$ and one-step coherent risk measure $\rho_{t}$. $J_{\gamma}(\kappa_{0})$ is the value of the constrained risk-averse problem (Problem 1). Total Time denotes the time in seconds taken by the CVXPY solver to solve the associated linear programs or DCPs. # U.O. denotes the number of single-cell uncertain obstacles used for the robustness test. F.R. denotes the failure rate out of 100 Monte Carlo simulations with the computed policy.
Figure 3: Results for the MDP example with the total expectation (left), CVaR (middle), and EVaR (right) coherent risk measures. The goal is located at the yellow cell. Notice the 9 single-cell obstacles used for the robustness test.

For the total expectation risk measure, the calculations took significantly less time, since they amount to solving a set of linear programs. For CVaR and EVaR, a set of DCPs was solved. The CVaR calculation was the most computationally involved. This observation is consistent with [7], where it was discussed that EVaR calculations are much more efficient than CVaR calculations. Note that these calculations can be carried out offline for policy synthesis, and the resulting policy can then be applied for risk-averse robot path planning.

The table also reports the failure rate of each risk measure. In this case, EVaR outperformed both CVaR and total expectation in terms of robustness, which is consistent with the fact that EVaR is more conservative. In addition, these results imply that, although the discounted total expectation is a measure of performance over a large number of Monte Carlo simulations, it may not be practical for mission-critical decision making under uncertainty. CVaR, and especially EVaR, appear to be more reliable metrics for performance in planning under uncertainty.

To illustrate the computed policies, Figure 3 depicts the results obtained from solving DCP (1) for a $30\times 30$ grid-world. The arrows on the grid depict the (sub)optimal actions and the heat map indicates the value of Problem 1 at each grid state. Note that the values for EVaR are greater than those for CVaR, and the values for CVaR are greater than those for total expectation. This is in accordance with the theory that $\mathbb{E}(c)\leq\mathrm{CVaR}_{\varepsilon}(c)\leq\mathrm{EVaR}_{\varepsilon}(c)$ [6]. In addition, by inspecting the computed actions in obstacle-dense areas of the grid-world (for example, the middle-right area), we infer that the actions in the more risk-averse cases (especially for EVaR) have a higher tendency to steer the agent away from the obstacles, given the diagonal transition uncertainty depicted in Figure 2; whereas, for total expectation, the actions are merely concerned with reaching the goal.
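A figure of this kind can be produced with a short matplotlib script; the sketch below uses placeholder random values and actions (assumptions for illustration), which would in practice be replaced by the arrays returned from the solved DCP.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data (assumptions): `values` holds the risk value at each cell,
# `policy` the index of one of the eight moves chosen at each cell.
M, N = 30, 30
rng = np.random.default_rng(0)
values = rng.random((N, M)) * 60
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, 1), (-1, 1), (1, -1), (-1, -1)]
policy = rng.integers(0, len(MOVES), size=(N, M))

U = np.array([[MOVES[a][0] for a in row] for row in policy])  # x-components
V = np.array([[MOVES[a][1] for a in row] for row in policy])  # y-components
X, Y = np.meshgrid(np.arange(M), np.arange(N))

plt.imshow(values, origin="lower", cmap="viridis")   # heat map of the values
plt.colorbar(label="value of Problem 1")
plt.quiver(X, Y, U, V, color="white", scale=40)      # (sub)optimal actions
plt.title("Computed policy and risk value function (illustrative data)")
plt.savefig("policy_heatmap.png", dpi=150)
```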

VI-C POMDP Results

In our experiments, we consider two grid-world sizes, $10\times 10$ and $20\times 20$, corresponding to $100$ and $400$ states, respectively. For each grid-world, we allocate 25% of the cells to obstacles, including 8 and 16 uncertain (single-cell) obstacles for the $10\times 10$ and $20\times 20$ grids, respectively. In each case, we run Algorithm 1 for risk-averse FSC synthesis with $N_{max}=6$ and a maximum of 100 iterations.

In these experiments, we set the confidence level $\varepsilon=0.15$ for the CVaR and EVaR coherent risk measures. The fuel budget (constraint bound $\beta$) was set to 50 and 200 for the $10\times 10$ and $20\times 20$ grid-worlds, respectively. The initial condition was chosen as $\kappa_{0}(s_{M})=1$, i.e., the agent starts at the right-most cell at the bottom.

A summary of our numerical experiments is provided in Table II. Note that the computed values of Problem 1 satisfy $\mathbb{E}(c)\leq\mathrm{CVaR}_{\varepsilon}(c)\leq\mathrm{EVaR}_{\varepsilon}(c)$ [6].

$(M\times N)_{\rho_{t}}$ | $J_{\gamma}(\iota_{init})$ | AIT [s] | # U.O. | F.R.
$(10\times 10)_{\mathbb{E}}$ | 10.53 | 0.2 | 3 | 15%
$(20\times 20)_{\mathbb{E}}$ | 19.98 | 0.3 | 9 | 37%
$(10\times 10)_{\text{CVaR}_{0.7}}$ | $\geq 11.02$ | 2.9 | 3 | 9%
$(20\times 20)_{\text{CVaR}_{0.7}}$ | $\geq 20.19$ | 7.5 | 9 | 22%
$(10\times 10)_{\text{CVaR}_{0.2}}$ | $\geq 16.53$ | 3.1 | 3 | 4%
$(20\times 20)_{\text{CVaR}_{0.2}}$ | $\geq 24.92$ | 7.6 | 9 | 16%
$(10\times 10)_{\text{EVaR}_{0.7}}$ | $\geq 15.02$ | 3.3 | 3 | 5%
$(20\times 20)_{\text{EVaR}_{0.7}}$ | $\geq 23.42$ | 9.9 | 9 | 11%
$(10\times 10)_{\text{EVaR}_{0.2}}$ | $\geq 19.62$ | 3.9 | 3 | 2%
$(20\times 20)_{\text{EVaR}_{0.2}}$ | $\geq 29.36$ | 9.7 | 9 | 6%
TABLE II: Comparison between the total expectation, CVaR, and EVaR coherent risk measures. $(M\times N)_{\rho_{t}}$ denotes an experiment with a grid-world of size $M\times N$ and one-step coherent risk measure $\rho_{t}$. $J_{\gamma}(\iota_{init})$ is the value of the constrained risk-averse POMDP problem (Problem 1). AIT denotes the average time spent on each iteration of Algorithm 1. # U.O. denotes the number of single-cell uncertain obstacles used for the robustness test. F.R. denotes the failure rate out of 100 Monte Carlo simulations with the computed policy.

For the total expectation risk measure, the calculations took significantly less time, since they amount to solving a set of linear programs. For CVaR and EVaR, a set of DCPs was solved in the Risk Value Function Computation step. In the I-State Improvement step, a set of linear programs was solved for CVaR and a set of convex optimizations for EVaR. Hence, the EVaR calculation was the most computationally involved in this case. Note that these calculations can be carried out offline for policy synthesis, and the resulting policy can then be applied for risk-averse robot path planning.

The table also reports the failure rate of each risk measure. In this case, EVaR outperformed both CVaR and total expectation in terms of robustness, consistent with the fact that EVaR is more conservative. In addition, these results suggest that, although the discounted total expectation is a measure of performance over a large number of Monte Carlo simulations, it may not be practical for real-world planning under uncertainty. CVaR, and especially EVaR, appear to be more reliable metrics for performance in planning under uncertainty.

Figure 4: The evolution of the lower bound and the number of I-states with respect to the number of iterations of Algorithm 1 for the $20\times 20$ grid-world and the EVaR coherent risk measure.
Figure 5: Results for the POMDP example with the total expectation (left), CVaR (middle), and EVaR (right) coherent risk measures. The goal is located at the yellow cell. Notice the 9 single-cell obstacles used for the robustness test.

To illustrate the computed policies, Figure 5 depicts the results obtained by running Algorithm 1 for a $20\times 20$ grid-world. The arrows on the grid depict the (sub)optimal actions and the heat map indicates the value of Problem 1 at each grid state. Note that the values for EVaR are greater than those for CVaR, and the values for CVaR are greater than those for total expectation. This is in accordance with the theory that $\mathbb{E}(c)\leq\mathrm{CVaR}_{\varepsilon}(c)\leq\mathrm{EVaR}_{\varepsilon}(c)$ [6].

Moreover, for the $20\times 20$ grid-world with the EVaR coherent risk measure, Figure 4 depicts the evolution of the number of FSC I-states $|G|$ and of the lower bound on the optimal value of Problem 1, $J_{\gamma}(\iota_{init})$, with respect to the iteration number of Algorithm 1. We can see that, as the number of I-states increases, the lower bound improves.

VII Conclusions and Future Research

We proposed an optimization-based method for designing policies for MDPs and POMDPs with coherent risk measure objectives and constraints. We showed that the corresponding value function optimizations take the form of DCPs. In the case of POMDPs, we proposed a policy iteration method for finding sub-optimal FSCs that lower-bound the value of the constrained risk-averse problem, and we demonstrated that, depending on the coherent risk measure of interest, the policy search can be carried out via a linear program or a convex optimization. Numerical experiments were provided to show the efficacy of our approach. In particular, we showed that considering coherent risk measures leads to significantly lower collision rates in Monte Carlo simulations of navigation problems.

In this work, we focused on discounted infinite-horizon risk-averse problems. Future work will explore other cost criteria [14]. The interested reader is referred to our preliminary results on total-cost risk-averse MDPs [1], wherein Bellman equations for the risk-averse stochastic shortest path problem are derived. Expanding on the latter work, we will also explore high-level mission specifications in terms of temporal logic formulas for risk-averse MDPs and POMDPs [5, 40]. Another avenue for further research concerns receding-horizon motion planning under uncertainty with coherent risk constraints [19, 24], with particular application to robot exploration in unstructured subterranean environments [21] (see also works on receding-horizon path planning where the coherent risk measure appears in the total cost [45, 44] rather than in the collision avoidance constraint).

Acknowledgment

M. Ahmadi acknowledges stimulating discussions with Dr. Masahiro Ono at NASA Jet Propulsion Laboratory and Prof. Marco Pavone at Nvidia Research-Stanford University.

References

  • [1] M. Ahmadi, A. Dixit, J. W. Burdick, and A. D. Ames. Risk-averse stochastic shortest path planning. arXiv preprint arXiv:2103.14727, 2021.
  • [2] M. Ahmadi, N. Jansen, B. Wu, and U. Topcu. Control theory meets POMDPs: A hybrid systems approach. IEEE Transactions on Automatic Control, 2020.
  • [3] M. Ahmadi, M. Ono, M. D. Ingham, R. M. Murray, and A. D. Ames. Risk-averse planning under uncertainty. In 2020 American Control Conference (ACC), pages 3305–3312. IEEE, 2020.
  • [4] M. Ahmadi, U. Rosolia, M. Ingham, R. Murray, and A. Ames. Constrained risk-averse Markov decision processes. In 35th AAAI Conference on Artificial Intelligence, 2021.
  • [5] M. Ahmadi, R. Sharan, and J. W. Burdick. Stochastic finite state control of POMDPs with LTL specifications. arXiv preprint arXiv:2001.07679, 2020.
  • [6] A. Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105–1123, 2012.
  • [7] A. Ahmadi-Javid and M. Fallah-Tafti. Portfolio optimization with entropic value-at-risk. European Journal of Operational Research, 279(1):225–241, 2019.
  • [8] A. Ahmadi-Javid and A. Pichler. An analytical study of norms and Banach spaces induced by the entropic value-at-risk. Mathematics and Financial Economics, 11(4):527–550, 2017.
  • [9] E. Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • [10] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical finance, 9(3):203–228, 1999.
  • [11] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361–379, 2011.
  • [12] Dimitri P Bertsekas. Dynamic programming and stochastic control. Number 10. Academic Press, 1976.
  • [13] D.P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
  • [14] S. Carpin, Y. Chow, and M. Pavone. Risk aversion in finite Markov decision processes using total cost criteria and average value at risk. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 335–342. IEEE, 2016.
  • [15] Anthony R. Cassandra, Leslie Pack Kaelbling, and Michael L. Littman. Acting optimally in partially observable stochastic domains. In AAAI, pages 1023–1028, 1994.
  • [16] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509–3517, 2014.
  • [17] Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522–1530, 2015.
  • [18] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1–5, 2016.
  • [19] A. Dixit, M. Ahmadi, and J. W. Burdick. Risk-sensitive motion planning using entropic value-at-risk. In European Control Conference, 2021.
  • [20] D. Du and P. M. Pardalos. Minimax and applications, volume 4. Springer Science & Business Media, 2013.
  • [21] D. D. Fan, K. Otsu, Y. Kubo, A. Dixit, J. Burdick, and A. Agha-Mohammadi. STEP: Stochastic traversability evaluation and planning for safe off-road navigation. arXiv preprint arXiv:2103.02828, 2021.
  • [22] J. Fan and A. Ruszczyński. Process-based risk measures and risk-averse control of discrete-time systems. Mathematical Programming, pages 1–28, 2018.
  • [23] J. Fan and A. Ruszczyński. Risk measurement and risk-averse control of partially observable discrete-time Markov systems. Mathematical Methods of Operations Research, 88(2):161–184, 2018.
  • [24] A. Hakobyan, Gyeong C. Kim, and I. Yang. Risk-aware motion planning and control using CVaR-constrained optimization. IEEE Robotics and Automation Letters, 4(4):3924–3931, 2019.
  • [25] R. Horst and N. V. Thoai. DC programming: overview. Journal of Optimization Theory and Applications, 103(1):1–43, 1999.
  • [26] V. Krishnamurthy. Partially observed Markov decision processes. Cambridge University Press, 2016.
  • [27] V. Krishnamurthy and S. Bhatt. Sequential Detection of Market Shocks With Risk-Averse CVaR Social Sensors. IEEE Journal of Selected Topics in Signal Processing, 10(6):1061–1072, 2016.
  • [28] E. L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations research, 14(4):699–719, 1966.
  • [29] H. A. Le Thi, H. M. Le, T. P. Dinh, et al. A dc programming approach for feature selection in support vector machines learning. Advances in Data Analysis and Classification, 2(3):259–278, 2008.
  • [30] T. Lipp and S. Boyd. Variations and extension of the convex–concave procedure. Optimization and Engineering, 17(2):263–287, 2016.
  • [31] O. Madani, S. Hanks, and A. Condon. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence, 147(1):5 – 34, 2003.
  • [32] A. Majumdar and M. Pavone. How should a robot assess risk? towards an axiomatic theory of risk in robotics. In Robotics Research, pages 75–84. Springer, 2020.
  • [33] T. Osogami. Robustness and risk-sensitivity in Markov decision processes. In Advances in Neural Information Processing Systems, pages 233–241, 2012.
  • [34] Jonathan Theodor Ott. A Markov decision model for a surveillance application and risk-sensitive Markov decision processes. 2010.
  • [35] G. C. Pflug and A. Pichler. Time-consistent decisions and temporal decomposition of coherent risk functionals. Mathematics of Operations Research, 41(2):682–699, 2016.
  • [36] L. Prashanth. Policy gradients for CVaR-constrained MDPs. In International Conference on Algorithmic Learning Theory, pages 155–169. Springer, 2014.
  • [37] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.
  • [38] R Tyrrell Rockafellar and Stanislav Uryasev. Conditional value-at-risk for general loss distributions. Journal of banking & finance, 26(7):1443–1471, 2002.
  • [39] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of risk, 2:21–42, 2000.
  • [40] U. Rosolia, M. Ahmadi, R. M. Murray, and A. D. Ames. Time-optimal navigation in uncertain environments with high-level specifications. arXiv preprint arXiv:2103.01476, 2021.
  • [41] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235–261, 2010.
  • [42] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2014.
  • [43] X. Shen, S. Diamond, Y. Gu, and S. Boyd. Disciplined convex-concave programming. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1009–1014. IEEE, 2016.
  • [44] S. Singh, Y. Chow, A. Majumdar, and M. Pavone. A framework for time-consistent, risk-sensitive model predictive control: Theory and algorithms. IEEE Transactions on Automatic Control, 2018.
  • [45] P. Sopasakis, D. Herceg, A. Bemporad, and P. Patrinos. Risk-averse model predictive control. Automatica, 100:281–288, 2019.
  • [46] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Policy gradient for coherent risk measures. In Advances in Neural Information Processing Systems, pages 1468–1476, 2015.
  • [47] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62(7):3323–3338, 2016.
  • [48] J. Thai, T. Hunter, A. K. Akametalu, C. J. Tomlin, and A. M. Bayen. Inverse covariance estimation from data with missing values using the concave-convex procedure. In 53rd IEEE Conference on Decision and Control, pages 5736–5742. IEEE, 2014.
  • [49] D. Vose. Risk analysis: a quantitative guide. John Wiley & Sons, 2008.
  • [50] A. Wang, A. M. Jasour, and B. Williams. Non-Gaussian chance-constrained trajectory planning for autonomous vehicles under agent uncertainty. IEEE Robotics and Automation Letters, 2020.
  • [51] Huan Xu and Shie Mannor. Distributionally robust markov decision processes. In Advances in Neural Information Processing Systems, volume 23, pages 2505–2513, 2010.