
A Bayesian Composite Risk Approach for Stochastic Optimal Control and Markov Decision Processes

Wentao Ma Email: mwtmwt7@stu.xjtu.edu.cn School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi, P. R. China Center for Optimization Technique and Quantitative Finance, Xi’an International Academy for Mathematics and Mathematical Technology, Xi’an, P. R. China Zhiping Chen Email: zchen@mail.xjtu.edu.cn School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an, Shaanxi, P. R. China Center for Optimization Technique and Quantitative Finance, Xi’an International Academy for Mathematics and Mathematical Technology, Xi’an, P. R. China Huifu Xu Corresponding author. Email: h.xu@cuhk.edu.hk Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong
Abstract

Inspired by Shapiro et al. [63], we consider a stochastic optimal control (SOC) and Markov decision process (MDP) setting where the risks arising from epistemic and aleatoric uncertainties are assessed using Bayesian composite risk (BCR) measures (Qian et al. [52]). The time dependence of the risk measures allows us to capture the decision maker’s (DM’s) dynamic risk preferences appropriately as increasing information about both uncertainties is obtained. This makes the new BCR-SOC/MDP model more flexible than conventional risk-averse SOC/MDP models. Unlike [63], where the control/action at each episode is based on the current state alone, the new model allows the control to depend on the probability distribution of the epistemic uncertainty, reflecting the fact that in many practical instances the cumulative information about epistemic uncertainty affects the DM’s belief about the future aleatoric uncertainty and hence the DM’s action [67]. The new modeling paradigm subsumes several existing SOC/MDP models, including distributionally robust SOC/MDP models and Bayes-adaptive MDP models, and generates so-called preference robust SOC/MDP models. Moreover, we derive conditions under which the BCR-SOC/MDP model is well defined, demonstrate that finite-horizon BCR-SOC/MDP models can be solved using dynamic programming techniques, and extend the discussion to the infinite-horizon case. Using Bellman equations, we show that, under standard conditions, the optimal values and optimal actions converge asymptotically as the episodic variable goes to infinity. Finally, we carry out numerical tests on a finite-horizon spread betting problem and an inventory control problem, and show the effectiveness of the proposed model and numerical schemes.

Keywords: Stochastic optimal control, Markov decision process, Bayesian learning, Bayesian composite risk measure

1 Introduction

Stochastic optimal control (SOC) and Markov decision processes (MDPs) are fundamental models for sequential decision-making in which the outcomes are partly random and partly under the control of a decision maker (DM). In recent years, the MDP has become even more popular because of its close relationship to reinforcement learning. In a SOC/MDP, the environment is characterized by a finite or infinite set of states, and the DM’s actions influence the transitions between these states. Each action taken in a given state determines the next state probabilistically, thereby capturing the dynamic and stochastic nature of the environment and the long-term consequences of decisions. SOC/MDP models have been applied extensively in diverse fields such as inventory control [51], hydro-thermal planning [22], and economics and behavioral ecology [5], among others [18].

A critical component of a SOC/MDP model’s validity lies in its ability to characterize uncertainty effectively. The uncertainties addressed by such models can arise both from epistemic uncertainty, that is, uncertainty stemming from limited data availability that affects the estimation of model parameters, and from the inherent randomness of the environment, referred to as aleatoric uncertainty [69]; both are inevitable in the real world. Aleatoric uncertainty pertains to the variability in the outcome of an experiment arising from inherently random effects. Consequently, even the most accurate model of such a process can only offer probabilities for different outcomes rather than exact answers. Epistemic uncertainty, on the other hand, arises from a lack of knowledge about the correct model; it reflects the DM’s ignorance of the epistemic state rather than the inherent randomness of the system. Unlike uncertainty caused by inherent randomness, epistemic uncertainty can, in principle, be reduced by acquiring additional information. We review below existing SOC/MDP models that address these two types of uncertainty.

The primary goal of SOC/MDP is to develop a policy that on average minimizes a specified cost function under uncertainty, typically by mapping states to actions. However, this risk-neutral SOC/MDP neglects critical low-probability events with potentially severe consequences. To address this limitation, risk-averse SOC/MDP has been proposed to minimize not only the expected costs but also the variability and associated risks of outcomes. These models typically handle aleatoric uncertainty using a fixed distribution. Early work in this area adopted the expected utility framework, in which a convex utility function is used to achieve risk aversion [29]. Other studies have explored static risk-averse SOC/MDP, where a risk measure is applied to the total accumulated cost, e.g., mean-variance [65, 42] or value-at-risk (VaR) [19]. However, these measures are not coherent and fail to satisfy properties consistent with rational decision-making. Various approaches have emerged [9, 68] within the realm of coherent risk measures. Nonetheless, SOC/MDPs employing a static risk-averse formulation cannot guarantee time-consistent optimal policies: a currently optimal policy may lose its optimality when new realizations emerge. As highlighted in [57], utilizing nested compositions of risk transition mappings at each episode facilitates time-consistent decision-making through dynamic programming solutions. Notable research has further explored dynamic formulations over finite and infinite horizons, including average value-at-risk (AVaR) [3], quantile-based risk measures [33], and entropic risk measures (ERM) [50, 26].

Although SOC/MDP and their risk-averse variants provide valuable frameworks for decision-making under aleatoric uncertainty, a common practical challenge is the neglect of epistemic uncertainty in model parameters. In addition, the inherent randomness of the environment complicates the task of maintaining the robustness of learned policies against small perturbations in real-world applications ([41]). Distributionally robust optimization (DRO) has emerged as a powerful approach that takes epistemic uncertainty into consideration by constructing an ambiguity set of possible distributions of the environment. Ambiguity sets in DRO can generally be categorized into two main types: moment-based and distance-based. In moment-based DRO [14, 77, 81], the DM has certain information about the moments of the random variable’s distribution. In distance-based DRO, the DM has a known reference distribution and considers a region of uncertainty around it, typically defined by a probability metric such as the Kullback-Leibler divergence [49] or the Wasserstein distance [20]. This has led to the development of distributionally robust SOC/MDP, as introduced in [44, 31]. Extensive discussions of distributionally robust SOC/MDP can be found in studies such as [43, 76, 82, 60, 84], which usually rely on robust Bellman equations. However, it is essential to note that an inappropriately constructed ambiguity set may lead to an overly conservative treatment of epistemic uncertainty, which will generate a policy that performs poorly under realistic scenarios that are far more likely than the worst case [79].

The existing SOC/MDP models discussed above highlight an important issue: there is a large middle ground between an optimistic belief in a fixed environment and a pessimistic focus on worst-case scenarios when dealing with epistemic uncertainty. Further, since both risk-averse SOC/MDP and distributionally robust SOC/MDP operate within the domain of continuous decision-making, the dynamics of certain potential parameters may vary between interaction episodes and remain elusive. When faced with unfamiliar scenarios, the DM is compelled to learn continuously through a single episode of interaction, striving to deduce the underlying model parameters while simultaneously aiming to optimize the objective function (see [6, 32]). Although relying on point estimates of potential parameters can yield feasible strategies within the current risk-averse SOC/MDP framework, and taking worst-case scenarios of these parameters into consideration can produce feasible strategies in the distributionally robust SOC/MDP framework, the absence of adaptive learning between episodes often results in insufficiently robust or overly conservative strategies. This issue highlights a common shortcoming that is inadequately addressed in conventional SOC/MDP models, which assume no prior knowledge and depend solely on learning through repetitive interactions within the same system. These challenges can be addressed effectively through a Bayesian approach to MDP, known as the Bayes-adaptive MDP [67]. In Bayes-adaptive MDPs, the DM utilizes prior knowledge expressed as a belief distribution over a range of potential environments, continually updating this belief using Bayes’ rule during engagement with the system [16]. By incorporating Bayesian posterior information instead of concentrating solely on extreme scenarios, the Bayes-adaptive MDP model offers a more balanced and practical approach than distributionally robust SOC/MDP in problems where uncertainty plays a critical role.
Building on the Bayes-adaptive MDP, a significant body of work [11, 45] enhances our understanding and application of Bayesian approaches in MDP, contributing to a broader landscape of intelligent decision-making systems. In this framework, the posterior distribution is treated as an augmented state, complementing the original physical state, and is updated at each episode based on the observed reward and transition, see [55]. In essence, this Bayesian approach not only enhances a deeper understanding of possible scenarios but also enables a more adaptive and flexible optimization policy.

In conclusion, the risk-averse SOC/MDP model captures low-probability, high-impact events effectively under aleatoric uncertainty, while the Bayes-adaptive MDP model adapts to and delineates epistemic uncertainty robustly through a learning process. These models complement each other, enhancing the DM’s management of risk. Therefore, exploring the integration of these models into a risk-averse Bayes-adaptive SOC/MDP framework is essential to capture and manage risks effectively, particularly those involving cognitive uncertainties and the integration of observational data and prior knowledge. However, existing literature on this topic remains sparse, with most studies focusing primarily on optimizing risk measures over MDPs with finite and discrete states, such as [64]. In addition, previous studies have predominantly addressed static risk, overlooking the dynamic evolution of risk over time, which is crucial for comprehensive and timely risk assessment [53, 9]. Indeed, in the absence of a nested risk function structure, prior studies such as [46, 30] have demonstrated that optimizing a static risk function can lead to time-inconsistent behavior. Hence, investigating dynamic risk becomes paramount, especially in the context of risk-averse Bayes-adaptive SOC/MDPs with a continuous state space. In a more recent development, a Bayesian risk MDP (BR-MDP) framework is proposed in [38, 73] to dynamically solve a risk-neutral Bayes-adaptive MDP, whereby some specific coherent risk measures are used to quantify the risk of the expected value of the cost function associated with epistemic uncertainty. To reduce the computational challenges of BR-MDP, Shapiro et al. [63] proposed a suboptimal episodic Bayesian SOC framework, which resembles conventional SOC/MDPs in that it avoids incorporating the posterior distribution as an augmented state; they demonstrate convergence of the optimal values analogous to that of conventional SOC/MDP models.
In this paper, we follow this strand of research by adopting general law invariant risk measures to quantify the risks associated with both epistemic and aleatoric uncertainties and allow the explicit dependence of the optimal policy at each episode on the posterior distribution. The new model provides a more general adaptive framework for risk management of dynamic decision-making.

Building on these insights, it becomes evident that a unified framework integrating risk-averse SOC/MDP, distributionally robust SOC/MDP, and Bayes-adaptive MDPs may help to address the limitations of existing models. Such a framework not only captures the dynamic interplay between epistemic and aleatoric uncertainties but also integrates these uncertainties into objective functions tailored to a wide range of risk preferences. Conventional approaches, while handling specific aspects of uncertainty effectively, often fail to achieve this level of flexibility and comprehensiveness. To fully leverage the Bayesian approach in modeling decision-making under distributional uncertainty, given its suitability for such problems, we adopt a perspective akin to composite risk optimization (CRO). This approach, introduced by [52], has proven effective for static (single-stage) optimization. By extending this approach to dynamic SOC/MDP settings, our research aims to analyze the temporal evolution of risk and its impact on state transitions and decisions, offering a refined strategy for timely, adaptive risk management policies. This perspective offers a robust foundation for unifying the existing SOC/MDP models within a coherent and dynamic framework. The main contributions of this paper can be summarized as follows.

  • Modeling. We propose an adaptive SOC/MDP model in which the risks arising from epistemic and aleatoric uncertainties are assessed by Bayesian composite risk (BCR) measures. Unlike the episodic Bayesian SOC model [63], where the control or action at each episode is based on the current state alone and ignores any future revelation of the random process, the BCR-SOC/MDP model allows the action to depend explicitly on the probability distribution of the epistemic uncertainty [67], which reflects the fact that, in many practical situations, accumulated information about epistemic uncertainty can influence the DM’s belief about future aleatoric uncertainty, thereby impacting the DM’s actions. The new modeling paradigm subsumes several existing SOC/MDP models including distributionally robust SOC/MDP models, risk-averse SOC/MDP models, and Bayes-adaptive MDP models, and generates so-called preference robust SOC/MDP models. Note that it is well known that, in the risk-neutral setting, SOC and MDP models are equivalent [60]; the equivalence also holds in the risk-averse setting. We use SOC/MDP instead of merely SOC or MDP as both are used in the literature [38, 57].

  • Analysis. We derive conditions under which the BCR-SOC/MDP model is well defined and demonstrate how the BCR-SOC/MDP model can be solved in both the finite- and infinite-horizon cases. Moreover, we demonstrate, through detailed case studies, the adaptability of the BCR-SOC/MDP model, illustrating its ability to generalize existing models and to be applied to new SOC/MDP contexts. By employing a composite risk measure framework, we investigate the fundamental properties of the BCR-SOC/MDP model over both finite and infinite horizons. Further, using Bellman equations, we demonstrate asymptotic convergence of the optimal values as the episodic variable goes to infinity under some standard conditions, and provide quantitative evaluations for specific settings. Notably, we show that, as information or data accumulate, the optimal value and optimal policy derived from our infinite-horizon BCR-SOC/MDP model converge to their respective true counterparts under the true SOC/MDP model. To our knowledge, such theoretical guarantees are not available for existing risk-averse or distributionally robust SOC/MDP models. We further explore the tractability and probabilistic guarantees of specific models, such as the VaR-Expectation SOC/MDP and the AVaR-AVaR SOC/MDP.

  • Applications. We examine the performance of the proposed BCR-SOC/MDP models and computational schemes by applying them to two conventional SOC/MDP problems: a finite-horizon spread betting problem and an infinite-horizon inventory control problem. In the spread betting problem, the BCR-SOC/MDP model is more robust and has lower variability than the standard risk-averse and distributionally robust SOC/MDP models, particularly when the number of historical records of market movement is small. In the inventory control problem, the BCR-SOC/MDP framework exhibits convergence of the optimal values and optimal policies to their true counterparts as information on aleatoric uncertainty accumulates, demonstrating its adaptability and dynamic learning capabilities. The preliminary numerical results show the efficiency and effectiveness of the BCR-SOC/MDP model in tackling complex decision-making problems with evolving uncertainties in real-world applications.

The rest of the paper is structured as follows. In Section 2, we revisit some basic notions and results on SOC/MDP and risk measures which are needed throughout the paper. In Section 3, we introduce the BCR-SOC/MDP model and demonstrate how it subsumes some existing SOC/MDP models. In Section 4, we briefly discuss the dynamic programming formulation of the finite-horizon BCR-SOC/MDP. In Section 5, we establish the existence, uniqueness, and convergence of a stationary optimal policy for the infinite-horizon BCR-SOC/MDP. In Section 6, we present a computationally efficient sample average approximation (SAA) algorithm for two special cases. Finally, in Section 7, we report numerical test results for the BCR-SOC/MDP model, displaying its superior performance and applicability.

2 Preliminaries

In this section, we revisit some basic notions and results in SOC/MDP and risk measurement that are needed throughout the paper.

2.1 Stochastic optimal control/Markov decision process

By convention (see e.g. [51]), we express a discounted MDP by a 5-tuple $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{C},\gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state space of the system and the action space respectively, and $\mathcal{P}$ represents the transition probability matrix, with $\mathcal{P}(s_{t+1}|s_{t},a_{t})$ signifying the probability of transition from state $s_{t}$ to state $s_{t+1}$ under action $a_{t}$. Earlier research has shown that the state transition equation in the SOC model, coupled with the random variable $\xi_{t}$, determines the transition probability matrix in the MDP model at episode $t$. In particular, the state transition process in the MDP model can be described by the equation $s_{t+1}=g_{t}(s_{t},a_{t},\xi_{t})$, where $g_{t}:\mathcal{S}\times\mathcal{A}\times\Xi\to\mathcal{S}$ is a state transition function (see e.g. [38, 60]). Here, $\xi_{t}$ is a random vector mapping from probability space $(\Omega,\mathcal{F},\mathbb{P})$ with support set $\Xi$. This relationship can be established by setting $\mathcal{P}(s_{t+1}=s^{\prime}\mid s_{t},a_{t})=P(\xi_{t}=\xi^{\prime})$, where $\xi^{\prime}\in\Xi$ is a realization of $\xi_{t}$ such that $g_{t}(s_{t},a_{t},\xi^{\prime})=s^{\prime}$. Indeed, it is well known that, in the risk-neutral setting, SOC and MDP models are equivalent ([60]). The cost function $\mathcal{C}_{t}(s_{t},a_{t},\xi_{t})$ quantifies the immediate cost incurred at episode $t$, and the parameter $\gamma\in(0,1]$ represents the discount factor of the cost function. A deterministic Markovian policy $\pi$ maps each state in $\mathcal{S}$ to an action in $\mathcal{A}$. The objective of a standard SOC/MDP is to find an optimal policy $\pi^{*}$ that minimizes the expected cumulative cost over all initial states $s_{1}\in\mathcal{S}$, formulated as

$$\pi^{*}(s_{1})\in\arg\min_{\pi}\mathbb{E}^{\pi}\left[\sum_{t=1}^{T}\gamma^{t}\mathcal{C}_{t}\left(s_{t},a_{t},\xi_{t}\right)\right], \tag{2.1}$$

where $\mathbb{E}^{\pi}$ denotes the mathematical expectation with respect to (w.r.t. for short) the joint probability distribution of $(\xi_{1},\cdots,\xi_{T})$ for each fixed policy $\pi$, and $T\in\mathbb{N}^{+}\cup\{\infty\}$ represents the time horizon. In the case that $T=\infty$, (2.1) becomes a SOC/MDP with infinite time horizon; in that case, the discount factor $\gamma$ is restricted to taking values over $(0,1)$.
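To make (2.1) concrete, the following is a minimal backward-induction sketch on a tiny finite MDP. The transition matrices, costs, and horizon are hypothetical and chosen only for illustration; costs are discounted per step, which differs from the $\gamma^{t}$ convention in (2.1) only by a constant factor.

```python
import numpy as np

# Hypothetical finite MDP: 2 states, 2 actions, horizon T = 3.
# P[a][s, s'] = transition probability; C[a][s] = expected immediate cost.
P = {0: np.array([[0.9, 0.1], [0.2, 0.8]]),
     1: np.array([[0.5, 0.5], [0.6, 0.4]])}
C = {0: np.array([1.0, 2.0]), 1: np.array([1.5, 0.5])}
gamma, T = 0.95, 3

# Backward induction: V_t(s) = min_a { C(s,a) + gamma * sum_{s'} P(s'|s,a) V_{t+1}(s') }.
V = np.zeros(2)                      # terminal value V_{T+1} = 0
policy = []
for t in range(T, 0, -1):
    Q = np.stack([C[a] + gamma * P[a] @ V for a in (0, 1)])  # Q[a, s]: action values
    policy.insert(0, Q.argmin(axis=0))                       # greedy action per state
    V = Q.min(axis=0)                                        # optimal cost-to-go

print(V)        # optimal expected discounted cost from each initial state
print(policy)   # optimal deterministic Markovian policy per episode
```

Because the objective is an expectation, the per-state minimization decouples and a deterministic Markovian policy attains the optimum, which is exactly what the backward recursion exploits.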

2.2 Topology of weak convergence and risk measures

Consider a probability space $(\Omega,\mathcal{F},\mathbb{P})$ and an $\mathcal{F}$-measurable function $X:\Omega\to\mathbb{R}$, where $X$ represents random losses. Let $P=\mathbb{P}\circ X^{-1}$ denote the probability measure/distribution on $\mathbb{R}$ induced by $X$ (throughout the paper, we use the terminologies probability measure and probability distribution interchangeably), let $F_{P}$ denote the cumulative distribution function (cdf) of $X$, and let $F_{P}^{\leftarrow}$ be the left quantile function, that is, $F_{P}^{\leftarrow}(t)=\inf\{x:F_{P}(x)\geq t\}$ for $t\in[0,1]$. Let $p\in[0,\infty)$ and let $L^{p}$ denote the space of random variables defined over $(\Omega,\mathcal{F},\mathbb{P})$ with finite $p$-th order moments, i.e., $\mathbb{E}_{\mathbb{P}}[|X|^{p}]:=\int_{\Omega}|X(\omega)|^{p}\mathbb{P}(d\omega)<\infty$. Let $\mathscr{P}(\mathbb{R})$ denote the space of all probability measures over $(\mathbb{R},\mathcal{B})$ and

$$\mathcal{M}^{p}_{1}=\left\{Q\in\mathscr{P}(\mathbb{R}):\int_{\mathbb{R}}|x|^{p}Q(dx)<\infty\right\}. \tag{2.2}$$

In the case when $p=0$, $\mathcal{M}_{1}^{0}=\mathscr{P}(\mathbb{R})$, and in the case when $p\to\infty$, $\mathcal{M}_{1}^{p}$ reduces to the set of all probability measures $Q$ with $Q([-1,1])>0$ and $Q(\mathbb{R}\backslash[-1,1])=0$. Note also that $X\in L^{p}$ iff $Q=\mathbb{P}\circ X^{-1}\in\mathcal{M}_{1}^{p}$ in that $\int_{\mathbb{R}}|x|^{p}Q(dx)=\int_{\Omega}|X(\omega)|^{p}\mathbb{P}(d\omega)$. Define $\mathfrak{C}^{p}_{1}$ as the linear space of all continuous functions $h:\mathbb{R}\to\mathbb{R}$ for which there exists a positive constant $c$ such that

$$|h(x)|\leq c(|x|^{p}+1),\quad\forall x\in\mathbb{R}. \tag{2.3}$$

In the case when $p=\infty$, $h(x)$ is bounded by $c$ over $[-1,1]$. By [85, Proposition 2.1], the Fortet-Mourier metric $\mathsf{dl}_{FM}$ (see definition in the appendix) metricizes the $\psi$-weak topology on $\mathcal{M}^{p}_{1}$ for $\psi(x)=c(|x|^{p}+1)$, denoted by $\tau_{|\cdot|^{p}}$, which is the coarsest topology on $\mathcal{M}_{1}^{p}$ for which the mapping $g_{h}:\mathcal{M}_{1}^{p}\to\mathbb{R}$ defined by

$$g_{h}(Q):=\int_{\mathbb{R}}h(x)Q(dx),\quad h\in\mathfrak{C}_{1}^{p}$$

is continuous. A sequence $\{P_{k}\}\subset\mathcal{M}_{1}^{p}$ is said to converge $|\cdot|^{p}$-weakly to $P\in\mathcal{M}_{1}^{p}$, written $P_{k}\xrightarrow{|\cdot|^{p}}P$, if it converges with respect to (w.r.t.) $\tau_{|\cdot|^{p}}$. In the case when $p=0$, this reduces to the usual topology of weak convergence. Recall that a function $\rho:L^{p}\to\mathbb{R}$ is called a monetary risk measure if it satisfies: (a) for any $X,Y\in L^{p}$, $X(\omega)\geq Y(\omega)$ for all $\omega\in\Omega$ implies $\rho(X)\geq\rho(Y)$; and (b) $\rho(X+c)=\rho(X)+c$ for any $X\in L^{p}$ and real number $c\in\mathbb{R}$. A monetary risk measure $\rho$ is said to be convex if it also satisfies: (c) for any $X,Y\in L^{p}$ and $\lambda\in[0,1]$, $\rho(\lambda X+(1-\lambda)Y)\leq\lambda\rho(X)+(1-\lambda)\rho(Y)$. A convex risk measure $\rho$ is said to be coherent if it further satisfies: (d) for any $X\in L^{p}$ and any $\lambda\geq 0$, $\rho(\lambda X)=\lambda\rho(X)$. A risk measure $\rho$ is law invariant if $\rho(X)=\rho(Y)$ for any random variables $X,Y\in L^{p}$ with the same probability distribution.
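As a quick sanity check of conditions (a) and (b), the left-quantile functional $F_{P}^{\leftarrow}(1-\alpha)$ of an empirical distribution satisfies both on a finite sample. A small sketch with invented data:

```python
import numpy as np

# Empirical illustration of the monetary risk measure axioms (a), (b)
# for the left-quantile functional at level 1 - alpha; data are invented.
rng = np.random.default_rng(2)
X = rng.normal(size=1000)

def left_quantile(sample, t):
    """F^{<-}(t) = inf{x : F(x) >= t} for the equally weighted empirical cdf."""
    xs = np.sort(np.asarray(sample))
    k = max(int(np.ceil(t * len(xs))) - 1, 0)   # smallest index with cdf >= t
    return xs[k]

rho = lambda sample, alpha: left_quantile(sample, 1 - alpha)

alpha, c = 0.05, 3.7
# (b) translation invariance: rho(X + c) = rho(X) + c (sorting commutes with shifts)
print(np.isclose(rho(X + c, alpha), rho(X, alpha) + c))   # True
# (a) monotonicity: Y(w) >= X(w) for all w implies rho(Y) >= rho(X)
Y = X + np.abs(rng.normal(size=1000))                     # Y >= X pointwise
print(rho(Y, alpha) >= rho(X, alpha))                     # True
```

Convexity (c), by contrast, fails for this functional in general, which is why it is monetary but not coherent.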

Note that in [61], $F^{\leftarrow}_{P}(1-\alpha)$ is called value-at-risk (VaR) (in some references, VaR is defined as the quantile function at $\alpha$, see e.g. [54, 1]; here we follow the definition in [61]), which is a monetary risk measure satisfying positive homogeneity but not convexity. By convention, we denote it by $\text{VaR}^{\alpha}_{P}(X)$ for a random variable $X$, with $\alpha\in[0,1]$. Let $\sigma:[0,1]\to\mathbb{R}$ be a non-negative, non-decreasing function with the normalization property $\int_{0}^{1}\sigma(t)dt=1$, and let $\mathfrak{S}$ denote the set of all such $\sigma$. The spectral risk measure of $X$ with $\sigma\in\mathfrak{S}$ is defined as

$$\text{(SRM)}\quad M_{\sigma}(X):=\int_{0}^{1}F^{\leftarrow}_{P}(t)\sigma(t)dt=\int_{0}^{1}\text{VaR}^{1-t}_{P}(X)\sigma(t)dt, \tag{2.4}$$

where $\sigma$ is called a risk spectrum representing the DM’s risk preference. Obviously, $\sigma$ plays the role of a weighting function and $M_{\sigma}(X)$ is the weighted average loss of $X$. Moreover, by changing variables in the integral (2.4) (setting $x=F^{\leftarrow}_{P}(t)$), we obtain

$$M_{\sigma}(X)=\int_{-\infty}^{\infty}x\sigma(F_{P}(x))dF_{P}(x), \tag{2.5}$$

in which case $\sigma$ may be interpreted as a distortion function of probability. Thus the SRM is also known as a distortion risk measure ([25]), although the latter has a slightly different representation. The SRM was introduced by Acerbi [1], who showed that $M_{\sigma}(X)$ is a law invariant coherent risk measure. The following example lists a few well-known spectral risk measures in the literature.

Example 1.

By choosing some specific risk spectra, we can recover a number of well-known law invariant coherent risk measures.

  • (i)

    Average Value-at-Risk (AVaR). (AVaR is also known as conditional value-at-risk (CVaR), see [54]; here we follow the definition and terminology from [61]. The relationship between the two is $\text{AVaR}^{\alpha}_{P}(X)=\text{CVaR}^{1-\alpha}_{P}(X)$.) Let $\sigma_{\alpha}(\tau):=\frac{1}{\alpha}\mathbf{1}_{[1-\alpha,1]}(\tau)$, where $\mathbf{1}_{[1-\alpha,1]}$ is the indicator function and $\alpha\in(0,1)$. Then

    $$M_{\sigma_{\alpha}}(X)=\frac{1}{\alpha}\int_{1-\alpha}^{1}F^{\leftarrow}_{P}(t)dt=:\text{AVaR}^{\alpha}_{P}(X). \tag{2.6}$$
  • (ii)

    Wang’s proportional hazards transform or power distortion [78]. Let $\sigma_{\nu}(\tau)=\nu(1-\tau)^{\nu-1}$, where $\nu\in(0,1]$ is a parameter. Then $M_{\sigma_{\nu}}(X)=\int_{0}^{\infty}(1-F_{P}(t))^{\nu}dt$.

  • (iii)

    Gini’s measure [78]. Let $\sigma_{s}(\tau)=(1-s)+2s\tau$, where $s\in(0,1)$. Then $M_{\sigma_{s}}(X)=\mathbb{E}_{P}[X]+s\mathbb{E}_{P}(|X-X^{\prime}|)=:\text{Gini}_{s}(X)$, where $X^{\prime}$ is an independent copy of $X$.

  • (iv)

    Convex combination of expected value and AVaR. Let

    $$\sigma_{\lambda}(\tau)=\lambda\mathbf{1}_{[0,1]}(\tau)+(1-\lambda)\frac{1}{\alpha}\mathbf{1}_{[1-\alpha,1]}(\tau)$$

    for some $\lambda\in[0,1]$, where $\mathbf{1}_{[a,b]}$ is the indicator function over the interval $[a,b]$. Then

    $$M_{\sigma_{\lambda}}(X)=\lambda\mathbb{E}_{P}[X]+(1-\lambda)\text{AVaR}_{P}^{\alpha}(X). \tag{2.7}$$
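The spectral representation (2.4) can be checked numerically against the tail-average form (2.6) and the convex combination (2.7): for an equally weighted empirical distribution, midpoint quadrature of the quantile integral reproduces both exactly when $\alpha$ is a multiple of $1/n$. A small sketch with simulated losses (all numbers are illustrative):

```python
import numpy as np

# Empirical check of Example 1 (i) and (iv): evaluate the spectral integral
# (2.4) by midpoint quadrature and compare with (2.6) and (2.7).
rng = np.random.default_rng(0)
xs = np.sort(rng.normal(size=10))        # 10 equally weighted loss atoms
alpha, lam = 0.3, 0.4                    # alpha = k/n keeps AVaR exact

def spectral_risk(sample, sigma, grid=10_000):
    """M_sigma = integral over [0,1] of F^{<-}(t) * sigma(t) dt, midpoint rule."""
    s = np.sort(np.asarray(sample)); n = len(s)
    ts = (np.arange(grid) + 0.5) / grid
    idx = np.clip(np.ceil(ts * n).astype(int) - 1, 0, n - 1)  # left quantile index
    return float((s[idx] * sigma(ts)).mean())

sigma_avar = lambda t: (t >= 1 - alpha) / alpha                # Example 1(i)
sigma_mix  = lambda t: lam + (1 - lam) * sigma_avar(t)         # Example 1(iv)

avar = xs[-3:].mean()                    # AVaR^{0.3}: mean of the 3 largest losses
print(spectral_risk(xs, sigma_avar))     # matches avar
print(spectral_risk(xs, sigma_mix))      # matches lam * mean + (1 - lam) * avar
```

With a grid that is a multiple of $n$, each block of grid points maps to a single atom, so the quadrature is exact rather than approximate for these piecewise-constant spectra.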

It may be helpful to note that any law invariant risk measure defined over a non-atomic $L^{p}$ space can be represented as a risk functional over the space of the probability distributions of the random variables in the space. Specifically, there exists a unique functional $\varrho:\mathcal{M}_{1}^{p}\to\mathbb{R}$ such that

$$\rho(X)=\varrho(P):=\rho(F_{P}^{-1}(U)), \tag{2.8}$$

where $U$ is a random variable uniformly distributed over $[0,1]$. In the case when $\rho$ is a coherent risk measure and $X,Y\in L^{1}$, we have

$$|\rho(X)-\rho(Y)|=|\varrho(P)-\varrho(Q)|\leq L\,\mathsf{dl}_{K}(P,Q), \tag{2.9}$$

where $L$ is the Lipschitz modulus of $\rho$, $Q$ denotes the probability distribution of $Y$, and $\mathsf{dl}_{K}$ denotes the Kantorovich metric (see the definition in the appendix); see the proof of [72, Corollary 4.8]. Moreover, any monetary risk measure $\rho:L^{\infty}\to\mathbb{R}$ is Lipschitz continuous with respect to the Wasserstein distance $\mathsf{dl}_{W}^{\infty}$, that is,

$$|\rho(X)-\rho(Y)|\leq\mathsf{dl}_{W}^{\infty}(P,Q)=\sup_{0<u<1}|F_{P}^{\leftarrow}(u)-F_{Q}^{\leftarrow}(u)|=\|X-Y\|_{\infty}, \tag{2.10}$$

see e.g. [74, Lemma 2.1]. The continuity of a risk measure of a random function w.r.t. the parameters of the function is slightly more complicated. The next lemma addresses this.

Lemma 1.

Let $X\in L^{p}$ be a random variable with support $\mathcal{X}$ and $z\in\mathcal{Z}\subset\mathbb{R}^{n}$ be a decision vector. Let $f:\mathcal{Z}\times\mathcal{X}\to\mathbb{R}$ be a continuous function such that

$$|f(z,x)|\leq\kappa(|x|^{\upsilon}+1),\quad\forall z\in\mathcal{Z},\;x\in\mathcal{X} \tag{2.11}$$

for some positive constants $\kappa,\upsilon$. Let $P_{X}\in\mathcal{M}_{1}^{\upsilon p}$ denote the probability distribution of $X$ over $\mathbb{R}$, and for any fixed $z$, let $\nu_{z,P_{X}}=P_{X}\circ f^{-1}(z,\cdot)\in\mathcal{M}_{1}^{p}$ be the probability measure on $\mathbb{R}$ induced by $f(z,X)$. Let $F_{\nu_{z,P_{X}}}(x):=\nu_{z,P_{X}}((-\infty,x])$ for any $x\in\mathbb{R}$. Then the following assertions hold.

  • (i)

    $\nu_{z,P_{X}}$ is continuous in $(z,P_{X})$ with respect to topology $\tau_{\mathbb{R}}\times\tau_{\mathbb{R}}\times\tau_{|\cdot|^{\upsilon p}}$ and $\tau^{p}$.

  • (ii)

    Let

    $$M_{\sigma}(f(z,X)):=\int_{0}^{1}F_{\nu_{z,P_{X}}}^{\leftarrow}(x)\sigma(x)dx \tag{2.12}$$

    be a spectral risk measure of $f(z,X)$, where $\sigma$ is a risk spectrum. Let $\mathfrak{A}$ be a set of risk spectra. Suppose that there exists a number $q>1$ such that $\mathfrak{A}\subset\mathcal{L}^{q}[0,1]$ and $\mathfrak{A}$ is a compact set, and that there is a positive constant $p$ with $\frac{1}{p}+\frac{1}{q}=1$ such that $P_{X}\in\mathcal{M}_{1}^{\upsilon p}$. Then

$$\left|\sup_{\sigma\in\mathfrak{A}}M_{\sigma}(f(z^{\prime},X^{\prime}))-\sup_{\sigma\in\mathfrak{A}}M_{\sigma}(f(z,X))\right|\leq\mathsf{dl}_{W}^{p}(\nu_{z^{\prime},P_{X^{\prime}}},\nu_{z,P_{X}})\sup_{\sigma\in\mathfrak{A}}\left[\int_{0}^{1}\sigma(x)^{q}dx\right]^{\frac{1}{q}}, \tag{2.13}$$

where $\mathsf{dl}_{W}^{p}$ is the Wasserstein distance of order $p$. In the special case that $\mathfrak{A}$ is a singleton, $M_{\sigma}(f(z,X))$ is continuous in $(z,P_{X})$ with respect to topology $\tau_{\mathbb{R}}\times\tau_{\mathbb{R}}\times\tau_{|\cdot|^{\upsilon p}}$ and $\tau_{\mathbb{R}}$.

  • (iii)

    For any law invariant continuous risk measure ρ\rho, ρ(f(z,X))\rho(f(z,X)) is continuous in (z,PX)(z,P_{X}) with respect to topology τ×τ×τ||υp\tau_{\mathbb{R}}\times\tau_{\mathbb{R}}\times\tau_{|\cdot|^{\upsilon p}} and τ\tau_{\mathbb{R}}.

  • (iv)

    In the case that 𝒳{\cal X} is compact, ρ(f(z,X))\rho(f(z,X)) is continuous in (z,PX)(z,P_{X}) for any law invariant monetary risk measure ρ\rho with respect to topology τ×τ×τ||0\tau_{\mathbb{R}}\times\tau_{\mathbb{R}}\times\tau_{|\cdot|^{0}} and τ\tau_{\mathbb{R}}, where τ||0\tau_{|\cdot|^{0}} denotes the weak topology.

Proof. Part (i) follows directly from [10, Lemma 2.74]. Part (ii) follows from [71, Proposition 5.1]. Part (iii) follows from Part (i). Part (iv) follows from the fact that ff is bounded. \Box
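The spectral risk measure in (2.12) can be made concrete numerically. The sketch below (a minimal illustration, not part of the paper's formal development; all names are illustrative) approximates M_σ for an empirical distribution by a midpoint rule on the quantile integral, using the AVaR spectrum σ(u) = β⁻¹1{u > 1−β} as a cross-check against the direct tail-average definition.

```python
import numpy as np

def spectral_risk(losses, sigma, n_grid=20_000):
    """Approximate M_sigma = integral_0^1 F^{-1}(u) sigma(u) du for the
    empirical distribution of `losses` via a midpoint rule on (0, 1)."""
    u = (np.arange(n_grid) + 0.5) / n_grid
    quantiles = np.quantile(losses, u)      # empirical quantile function
    return float(np.mean(quantiles * sigma(u)))

# AVaR at level beta is the spectral risk measure with the (nondecreasing,
# normalized) risk spectrum sigma(u) = 1{u > 1 - beta} / beta.
beta = 0.1
avar_spectrum = lambda u: (u > 1 - beta) / beta

rng = np.random.default_rng(0)
losses = rng.standard_normal(200_000)       # losses; larger values are worse
avar_via_spectrum = spectral_risk(losses, avar_spectrum)

# Cross-check against the direct tail-average definition of AVaR.
var_level = np.quantile(losses, 1 - beta)
avar_direct = float(losses[losses >= var_level].mean())
```

The two estimates agree up to discretization and sampling error; for standard normal losses both are close to the closed-form value φ(z₀.₉)/0.1 ≈ 1.755.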

2.3 Bayesian composite risk measure

Consider a random variable X:(Ω,,P)X:(\Omega,{\cal F},P)\to\mathbb{R} whose true distribution belongs to a family of parametric distributions, denoted by {Pθ}\{P_{\theta}\}, where θΘd\theta\in\Theta\subseteq\mathbb{R}^{d} is a vector of parameters. The true probability distribution is associated with an unknown underlying parameter θc{{\theta}^{c}}, i.e., Pc=PθcP^{c}=P_{{\theta}^{c}} for some θcΘd{\theta}^{c}\in\Theta\subseteq\mathbb{R}^{d}. To ease the exposition, we consider the case that d=1d=1. All of our results in the forthcoming discussions are applicable to the case when d1d\geq 1. Since the information on θc\theta^{c} is incomplete, it might be sensible to describe θ\theta as a random variable with prior distribution μ\mu from a modeler’s perspective. This distribution may be obtained from partially available information and/or subjective judgment. For each realization of θ\theta, PθP_{\theta} gives rise to a probability distribution of XX. In practice, we may not be able to obtain a closed form of PθP_{\theta}, rather we may obtain samples of XX, denoted by 𝑿N={X1,,XN}\boldsymbol{X}^{N}=\{X^{1},\cdots,X^{N}\} and use them to estimate θc\theta^{c} by maximum likelihood. Alternatively, we may use Bayes’ formula to obtain a posterior distribution of θ\theta, that is,

μ(θ|𝑿N)=p(𝑿N|θ)μ(θ)Θp(𝑿N|θ)μ(θ)𝑑θ,\mu(\theta|\boldsymbol{X}^{N})=\frac{p(\boldsymbol{X}^{N}|\theta)\mu(\theta)}{\int_{\Theta}p(\boldsymbol{X}^{N}|\theta)\mu(\theta)d\theta}, (2.14)

where p(|θ)p(\cdot|\theta) is the probability density function (pdf) of XX associated with PθP_{\theta}, p(𝑿N|θ)=i=1Np(Xi|θ)p(\boldsymbol{X}^{N}|\theta)=\prod_{i=1}^{N}p\left(X^{i}|\theta\right) is the likelihood function and Θp(𝑿N|θ)μ(θ)𝑑θ\int_{\Theta}p(\boldsymbol{X}^{N}|\theta)\mu(\theta)d\theta is the normalizing factor. As more samples are gathered, μ(θ|𝑿N)\mu(\theta|\boldsymbol{X}^{N}) converges to the Dirac distribution of θ\theta at θc\theta^{c} and subsequently the BCR converges to ρPθc\rho_{P_{\theta^{c}}}. In this paper, we focus on the latter approach and apply it within the framework of SOC/MDP.
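The update (2.14) can be carried out directly on a discretized parameter grid. A minimal sketch (assuming, purely for illustration, a normal likelihood with known unit variance and an N(0,1) prior; all names are illustrative) computes the posterior by normalizing likelihood × prior and compares its mean with the conjugate closed form:

```python
import numpy as np

def normal_pdf(x, mean, sd):
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def grid_posterior(theta_grid, prior, data, sd=1.0):
    """Discretized Bayes update (2.14): posterior proportional to likelihood x prior."""
    likelihood = np.ones_like(theta_grid)
    for x in data:                          # p(X^N | theta) = prod_i p(X^i | theta)
        likelihood = likelihood * normal_pdf(x, theta_grid, sd)
    posterior = likelihood * prior
    return posterior / posterior.sum()      # normalizing factor

theta_grid = np.linspace(-3.0, 3.0, 1201)
prior = normal_pdf(theta_grid, 0.0, 1.0)    # N(0, 1) prior, discretized
prior = prior / prior.sum()

rng = np.random.default_rng(1)
data = rng.normal(0.5, 1.0, size=50)        # samples from P_{theta^c}, theta^c = 0.5

posterior = grid_posterior(theta_grid, prior, data)
posterior_mean = float(theta_grid @ posterior)

# Conjugate closed form for an N(0,1) prior and unit observation noise:
# posterior mean = n * xbar / (n + 1).
closed_form = len(data) * data.mean() / (len(data) + 1)
```

As the sample size grows, the posterior concentrates around θᶜ, which is the mechanism behind the convergence to the Dirac distribution mentioned above.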

To quantify the risk of XX (which represents losses), we may adopt a law invariant risk measure, that is, ρ(X)\rho(X). This notation does not capture the probability distribution of XX. However, if we use the risk functional ϱ\varrho defined in (2.8), then equivalently we can write ρ(X)\rho(X) as ϱ(Pθ)\varrho(P_{\theta}). Since θ\theta is a random variable, ϱ(Pθ)\varrho(P_{\theta}) is a random function of θ\theta. There are two ways to quantify the risk of ϱ(Pθ)\varrho(P_{\theta}). One is to consider the mean value of ϱ(Pθ)\varrho(P_{\theta}) with respect to the posterior distribution of θ\theta, that is, 𝔼μ(θ|𝑿N)(ϱ(Pθ))\mathbb{E}_{\mu(\theta|\boldsymbol{X}^{N})}(\varrho(P_{\theta})). The other is to adopt a law invariant risk measure ρ^\hat{\rho} to quantify the risk of ϱ(Pθ)\varrho(P_{\theta}), that is, ρ^(ϱ(Pθ))\hat{\rho}(\varrho(P_{\theta})). By using the equivalent risk functional of ρ^\hat{\rho}, written ϱ^\hat{\varrho}, we can write ρ^(ϱ(Pθ))\hat{\rho}(\varrho(P_{\theta})) as ϱ^(η)\hat{\varrho}(\eta), where η\eta denotes the probability distribution of ϱ(Pθ)\varrho(P_{\theta}). Unfortunately, this notation is cumbersome. To ease the exposition, we write ρPθ(X)\rho_{P_{\theta}}(X) for ρ(X)\rho(X) and ρμρPθ(X)\rho_{\mu}\circ\rho_{P_{\theta}}(X) for ϱ^(η)\hat{\varrho}(\eta) even though this notation is not consistent with our previous discussions. We call ρμρPθ()\rho_{\mu}\circ\rho_{P_{\theta}}(\cdot) a Bayesian composite risk (BCR) measure with ρPθ\rho_{P_{\theta}} being the inner risk measure conditional on θ\theta and ρμ\rho_{\mu} being the outer risk measure. This composite framework was proposed by Qian et al. [52]. In the particular case that ρPθ\rho_{P_{\theta}} is a spectral risk measure parameterized by θ\theta and ρμ\rho_{\mu} is the mathematical expectation, the composite risk measure recovers the average randomized spectral risk measure introduced by Li et al. [37]. The next proposition gives conditions under which the BCR is well defined.
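When θ has finite support, the composition ρ_μ ∘ ρ_{P_θ} can be computed directly by evaluating the inner risk for each θ and then aggregating with the outer risk measure. A minimal sketch (illustrative choices, not from the paper: inner AVaR under P_θ = N(θ, 1) estimated by Monte Carlo, and two outer risk measures for comparison):

```python
import numpy as np

def avar(losses, beta):
    """Average Value-at-Risk: expected loss in the worst beta-tail."""
    var = np.quantile(losses, 1 - beta)
    return float(losses[losses >= var].mean())

thetas = np.array([0.0, 1.0])      # finite support of theta
mu = np.array([0.7, 0.3])          # posterior weights mu(theta)

# Inner risk measure: AVaR_{P_theta} of X ~ N(theta, 1), by Monte Carlo.
rng = np.random.default_rng(2)
beta = 0.05
inner = np.array([avar(rng.normal(th, 1.0, 200_000), beta) for th in thetas])

# Two choices of the outer risk measure rho_mu:
bcr_mean = float(mu @ inner)       # expectation over theta
bcr_worst = float(inner.max())     # worst case over the support of mu
```

By construction bcr_worst ≥ bcr_mean; the gap reflects how conservatively the outer risk measure treats the epistemic uncertainty in θ.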

Proposition 1.

Consider the setting of Lemma 1. Assume: (a) the inequality (2.11) holds, (b) p(|θ)p(\cdot|\theta) is continuous in θ\theta and (c) ρPθ\rho_{P_{\theta}} is any law invariant continuous risk measure. Then ρμρPθ(f(z,X))\rho_{\mu}\circ\rho_{P_{\theta}}({f}(z,X)) is well defined.

3 The BCR-SOC/MDP model

In this section, we move on from the conventional SOC/MDP model (2.1) to propose a BCR-SOC/MDP model by adopting BCR to quantify the risk of loss during each episode and discuss its relationship with some existing SOC/MDP models in the literature. Specifically we consider

minπ\displaystyle\min_{\pi} 𝒥Tγ(π,s1,μ1):=ρμ1ρPθ1[𝒞1(s1,a1,ξ1)++γρμT1ρPθT1[𝒞T1(sT1,aT1,ξT1)+γ𝒞T(sT)]]\displaystyle\mathcal{J}_{T}^{\gamma}(\pi,s_{1},\mu_{1}):=\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}_{1}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\rho_{\mu_{T-1}}\circ\rho_{P_{\theta_{T-1}}}\left[\mathcal{C}_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}_{T}\left(s_{T}\right)\right]\right]
s.t. st+1=gt(st,at,ξt),at=πt(st,μt),t=1,,T1,\displaystyle s_{t+1}=g_{t}\left(s_{t},a_{t},\xi_{t}\right),\ a_{t}=\pi_{t}(s_{t},\mu_{t}),\ t=1,\cdots,T-1, (3.1b)
μt+1(θ)=p(ξt|θ)μt(θ)Θp(ξt|θ)μt(θ)𝑑θ,t=1,,T1,\displaystyle\mu_{t+1}(\theta)=\frac{p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)}{\int_{\Theta}p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)d\theta},t=1,\cdots,T-1,

where θt\theta_{t}, t=1,,T1t=1,\cdots,T-1, is a random variable with probability density function μt\mu_{t} at episode tt, the cost function 𝒞t(st,at,ξt)\mathcal{C}_{t}(s_{t},a_{t},\xi_{t}) is continuous with respect to ξt\xi_{t} and πt:𝒮×𝒟1p𝒜\pi_{t}:\mathcal{S}\times\mathcal{D}_{1}^{p}\rightarrow\mathcal{A} is the policy for action at episode tt, where 𝒟1p\mathcal{D}_{1}^{p} denotes the set of all density functions whose corresponding probability distributions are in 1p{\cal M}_{1}^{p}. Compared to (2.1), the BCR-SOC/MDP model has several new features.

  • (i)

    For t=1,,T1t=1,\cdots,T-1, the true probability distribution of ξt\xi_{t} is unknown, but it is known that it belongs to a family of parametric distributions, written as PθP_{\theta}, where θ\theta is a random variable with prior distribution μ1(θ)\mu_{1}(\theta), see [63]. In the objective function of the BCR-SOC/MDP model (3.1), we write θt\theta_{t} for PθP_{\theta} at episode tt to indicate that θt\theta_{t} corresponds to the density function μt\mu_{t}. In the initial episode, the epistemic uncertainty is based on prior data and/or subjective judgment, which leads to a prior distribution μ:=μ1\mu:=\mu_{1} over Θ\Theta. As the stochastic process 𝝃t:={ξ1,ξ2,,ξt}\bm{\xi}^{t}:=\{\xi_{1},\xi_{2},\cdots,\xi_{t}\} is observed, the posterior distribution μt+1(θ):=μ(θ|𝝃t)\mu_{t+1}(\theta):=\mu(\theta|\bm{\xi}^{t}) is updated with the Bayes’ formula:

    μt+1(θ)=p(𝝃t|θ)μ1(θ)Θp(𝝃t|θ)μ1(θ)𝑑θ=p(ξt|θ)μt(θ)Θp(ξt|θ)μt(θ)𝑑θ.\displaystyle\mu_{t+1}(\theta)=\frac{p(\bm{\xi}^{t}|\theta)\mu_{1}(\theta)}{\int_{\Theta}p(\bm{\xi}^{t}|\theta)\mu_{1}(\theta)d\theta}=\frac{p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)}{\int_{\Theta}p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)d\theta}. (3.2)

    Moreover, instead of using the mean values as in [63], we propose to use law invariant risk measures ρPθ\rho_{P_{\theta}} and ρμt\rho_{\mu_{t}} to quantify the risks arising from aleatoric uncertainty and epistemic uncertainty respectively during each episode. This is intended to provide a complement to [63] which can be used effectively to deal with the cases where some DMs are risk-averse rather than risk neutral. The time dependence of the risk measures allows us to capture the DM’s dynamic risk preferences as more information about both uncertainties is obtained. This makes the new BCR-SOC/MDP model more flexible than the conventional SOC/MDP model (2.1).

  • (ii)

    Unlike [63] where πt\pi_{t} is purely based on the physical state sts_{t}, the policy πt\pi_{t} in the BCR-SOC/MDP model depends not only on sts_{t} but also on the belief μt\mu_{t}, as considered in [38]. This is motivated by the fact that μt\mu_{t} signifies how θt\theta_{t} takes values over Θ\Theta and subsequently affects the probability distribution PθtP_{\theta_{t}} of ξt\xi_{t}. If we interpret ξt\xi_{t} as the underlying uncertainty that affects the loss function 𝒞t{\cal C}_{t}, then μt\mu_{t} may be regarded as the DM’s belief about the prospect of the future market, such as a bull market or bear market, at state tt. Such belief is accumulated over the decision-making process over the past t1t-1 episodes and affects the DM’s decision at episode tt. Nevertheless, the above augmented pair of states might bring some computational difficulties as envisaged by Shapiro et al. [63], where the authors propose a so-called episodic Bayesian stochastic optimal control framework to avoid explicit inclusion of μt\mu_{t} as an augmented state. In the forthcoming discussions, we will demonstrate how to overcome the resulting computational difficulties.

  • (iii)

    The SOC/MDP model serves as an effective tool for demonstrating various foundational concepts because, on the one hand, it can be expressed naturally as a type of stochastic optimal control model [51, 60, 63], and on the other hand, it can be formulated within the multi-stage stochastic programming (MSP) framework [15, 17, 59] by considering xt=(xts,xta)x_{t}=(x_{t}^{s},x_{t}^{a}) with xts=stx_{t}^{s}=s_{t} and xta=atx_{t}^{a}=a_{t} as an augmented decision variable. Specifically, the BCR-SOC/MDP model can be reformulated as a risk-averse MSP model in the sense of [48] as follows:

    minx\displaystyle\min_{x} ρξ1[𝒞1(x1,ξ1)++γρξT1|𝝃T2[𝒞T1(xT1,ξT1)+γ𝒞T(xT)]]\displaystyle\rho_{\xi_{1}}\left[\mathcal{C}_{1}\left(x_{1},\xi_{1}\right)+\cdots+\gamma\rho_{\xi_{T-1}|\bm{\xi}^{T-2}}\left[\mathcal{C}_{T-1}\left(x_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}_{T}\left(x_{T}\right)\right]\right] (3.3a)
    s.t. xt+1s=gt(xt,ξt),xt+1a𝒜t,t=1,,T1,\displaystyle x_{t+1}^{s}=g_{t}\left(x_{t},\xi_{t}\right),x_{t+1}^{a}\in\mathcal{A}_{t},t=1,\cdots,T-1, (3.3b)

    where ρξt+1|𝝃t:=ρμt+1ρPθt+1\rho_{\xi_{t+1}|\bm{\xi}^{t}}:=\rho_{\mu_{t+1}}\circ\rho_{P_{\theta_{t+1}}} with μt\mu_{t} being updated according to (3.1b) and the stochastic process 𝝃t={ξ1,ξ2,,ξt}\bm{\xi}^{t}=\{\xi_{1},\xi_{2},\cdots,\xi_{t}\}. For a detailed example, we refer the readers to [61], which illustrates how the classical inventory model can be adapted to fit the above framework.
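The second equality in (3.2) — updating recursively with the latest observation yields the same posterior as conditioning on the whole history 𝝃ᵗ at once — can be checked numerically. A minimal sketch, assuming (purely for illustration) an exponential likelihood and a flat prior on a parameter grid:

```python
import numpy as np

def exp_pdf(x, theta):
    """Exponential density p(x | theta) = theta * exp(-theta * x)."""
    return theta * np.exp(-theta * x)

theta_grid = np.linspace(0.05, 5.0, 500)
prior = np.full_like(theta_grid, 1.0 / theta_grid.size)   # flat prior on the grid

rng = np.random.default_rng(3)
xs = rng.exponential(scale=1 / 2.0, size=30)   # observations, theta^c = 2

# One-step recursive updates, folding in xi_t at episode t as in (3.2).
mu = prior.copy()
for x in xs:
    mu = mu * exp_pdf(x, theta_grid)
    mu = mu / mu.sum()

# Batch update conditioning on the whole history at once.
likelihood = np.prod(exp_pdf(xs[:, None], theta_grid[None, :]), axis=0)
mu_batch = likelihood * prior
mu_batch = mu_batch / mu_batch.sum()
```

The two posteriors coincide up to floating-point round-off, which is why the belief μ_t can be carried forward as a state variable and refreshed one observation at a time.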

To ensure that the BCR-SOC/MDP model is well defined, we make the following assumption.

Assumption 1.

(a) For t=1,,Tt=1,\cdots,T, the cost function 𝒞t(st,at,ξt):𝒮×𝒜×Ξ{\cal C}_{t}(s_{t},a_{t},\xi_{t}):\mathcal{S}\times\mathcal{A}\times\Xi\to\mathbb{R} is continuous, and there exist positive constants κ𝒞,υ\kappa_{\cal C},\upsilon such that

|𝒞t(st,at,ξt)|κ𝒞(|ξt|υ+1),st,at,ξt,t=1,,T1\displaystyle|{\cal C}_{t}(s_{t},a_{t},\xi_{t})|\leq\kappa_{\cal C}(|\xi_{t}|^{\upsilon}+1),\forall s_{t},a_{t},\xi_{t},\ t=1,\cdots,T-1 (3.4)

holds, and

|𝒞T(gT1(sT1,aT1,ξT1))|κ𝒞(|ξT1|υ+1),sT1,aT1,ξT1.\displaystyle|\mathcal{C}_{T}\left(g_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)\right)|\leq\kappa_{\cal C}(|\xi_{T-1}|^{\upsilon}+1),\forall s_{T-1},a_{T-1},\xi_{T-1}. (3.5)

(b) For t=1,,T1t=1,\cdots,T-1, ρPθt:Lυ\rho_{P_{\theta_{t}}}:L^{\upsilon}\to\mathbb{R} is a continuous law invariant monetary risk measure such that ρPθt(|ξt|υ)<,θtΘ\rho_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})<\infty,\forall\theta_{t}\in\Theta, and

ρμtρPθt(|ξt|υ)κρ(|ξt1|υ+1),\displaystyle\quad\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})\leq\kappa_{\rho}(|\xi_{t-1}|^{\upsilon}+1), (3.6)

where κρ\kappa_{\rho} is a positive constant depending on tt. (c) The random variables ξ1,,ξT\xi_{1},\cdots,\xi_{T} are independent and identically distributed (i.i.d.) with true unknown distribution PθcP_{\theta^{c}} which belongs to a parametric family of distributions {PθθΘ}\{P_{\theta}\mid\theta\in\Theta\}, where Pθ1υpP_{\theta}\in{\cal M}_{1}^{\upsilon p} for all θΘ\theta\in\Theta. (d) The density function p(ξ|θ)p(\xi|\theta) is continuous in θ\theta over Θ\Theta\subset\mathbb{R}. (e) The action space 𝒜\mathcal{A} is a compact and convex set.

The following comments about how the specified conditions in the assumption can be satisfied might be helpful. Inequalities (3.4)-(3.5) in Assumption 1 (a) specify the growth conditions of the cost functions in terms of ξt\xi_{t}. This type of condition is widely used in stochastic programming, see e.g., [23] and references therein, but it is a new consideration for SOC/MDP models. In the literature of SOC/MDP models, the cost functions are often assumed to be bounded. Here we relax the boundedness condition to cover the case where the dependence of 𝒞t{\cal C}_{t} on ξt\xi_{t} is non-linear and the support set of ξt\xi_{t} can be unbounded. Condition (b) is closely related to (a). In the case where υ=0\upsilon=0 in (3.5), condition (a) subsumes the case that 𝒞t{\cal C}_{t} is bounded. By setting κρ=1\kappa_{\rho}=1, condition (b) will also accommodate this case. We will return to this shortly to show how condition (b) can be satisfied under some specific circumstances. Assumption 1 (c) is the set-up of the BCR-SOC/MDP model. The independence of ξ1,,ξT\xi_{1},\cdots,\xi_{T} is usually assumed in SOC/MDP models and holds automatically in many scenarios. Assumption 1 (d) is also satisfied by many probability distributions and is widely used in the literature, see e.g., [63].

Inequality (3.6) in Assumption 1 (b) poses a bound on |ξt|υ|\xi_{t}|^{\upsilon} under the BCR. Since μt{\mu_{t}} depends on ξt1\xi_{t-1}, the upper bound of the BCR depends on ξt1\xi_{t-1}. Inequality (3.6) requires the bound to be controllable by |ξt1|υ|\xi_{t-1}|^{\upsilon}. This kind of condition is less intuitive, so we use an example to explain how the condition can be satisfied in some important cases.

Example 2.

Inequality (3.6) holds under the following specific cases.

  • (a)

    The support set of ξt\xi_{t}, Ξ\Xi, is bounded. In this case, ρμtρPθt(|ξt|υ)\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon}) is bounded by a positive constant CC.

  • (b)

    Both ξt\xi_{t} and θt\theta_{t} follow a normal distribution with ξt𝒩(θt,σ2)\xi_{t}\sim\mathcal{N}(\theta_{t},\sigma^{2}) and θt𝒩(mt,dt2)\theta_{t}\sim\mathcal{N}(m_{t},d_{t}^{2}). Let μ0(θ):=𝒩(m0,d02)\mu_{0}(\theta):=\mathcal{N}(m_{0},d_{0}^{2}). Then

    mt=λtmt1+(1λt)ξt1,λt=dt12dt12+σ2,dt2=dt12+σ2.m_{t}=\lambda_{t}m_{t-1}+(1-\lambda_{t})\xi_{t-1},\quad\lambda_{t}=\frac{d_{t-1}^{2}}{d_{t-1}^{2}+\sigma^{2}},\quad d_{t}^{-2}=d_{t-1}^{-2}+\sigma^{-2}. (3.7)

    Since ξt=θt+σZ\xi_{t}=\theta_{t}+\sigma Z, then

    |ξt|υ=|θt+σZ|υ2υ1(|θt|υ+|σZ|υ),\displaystyle|\xi_{t}|^{\upsilon}=|\theta_{t}+\sigma Z|^{\upsilon}\leq 2^{\upsilon-1}\left(|\theta_{t}|^{\upsilon}+|\sigma Z|^{\upsilon}\right), (3.8)

    where Z𝒩(0,1)Z\sim\mathcal{N}(0,1) and the inequality follows from the convexity of ||υ|\cdot|^{\upsilon} for υ1\upsilon\geq 1 (the crc_{r}-inequality). By applying the inner risk measure ρPθt\rho_{P_{\theta_{t}}} to both sides of the inequality, we obtain

    ρPθt(|ξt|υ)2υ1(|θt|υ+συρ(|Z|υ))=C1|θt|υ+C2,\rho_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})\leq 2^{\upsilon-1}\left(|\theta_{t}|^{\upsilon}+\sigma^{\upsilon}\rho(|Z|^{\upsilon})\right)=C_{1}|\theta_{t}|^{\upsilon}+C_{2}, (3.9)

    where C1=2υ1C_{1}=2^{\upsilon-1} and C2=C1συρ(|Z|υ)C_{2}=C_{1}\sigma^{\upsilon}\rho(|Z|^{\upsilon}). Since θt𝒩(mt,dt2)\theta_{t}\sim\mathcal{N}(m_{t},d_{t}^{2}), then

    |θt|υ2υ1(|mt|υ+dtυ|Z|υ).\displaystyle|\theta_{t}|^{\upsilon}\leq 2^{\upsilon-1}\left(|m_{t}|^{\upsilon}+d_{t}^{\upsilon}|Z|^{\upsilon}\right). (3.10)

    By applying the outer risk measure ρμt\rho_{\mu_{t}} to both sides of the inequality above, we obtain

    ρμt(|θt|υ)2υ1(|mt|υ+dtυρ(|Z|υ)).\rho_{\mu_{t}}(|\theta_{t}|^{\upsilon})\leq 2^{\upsilon-1}\left(|m_{t}|^{\upsilon}+d_{t}^{\upsilon}\rho(|Z|^{\upsilon})\right). (3.11)

    A combination of (3.9)-(3.11) yields

    ρμtρPθt(|ξt|υ)C3|mt|υ+C4dtυ+C2,\displaystyle\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})\leq C_{3}|m_{t}|^{\upsilon}+C_{4}d_{t}^{\upsilon}+C_{2}, (3.12)

    where C3=C12υ1C_{3}=C_{1}\cdot 2^{\upsilon-1}, C4=C12υ1ρ(|Z|υ)C_{4}=C_{1}2^{\upsilon-1}\rho(|Z|^{\upsilon}). By applying (3.7) to the right-hand side of (3.12), we obtain

    ρμtρPθt(|ξt|υ)K|ξt1|υ+C,\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})\leq K|\xi_{t-1}|^{\upsilon}+C,

    where KK and CC are constants defined as K=C32υ1K=C_{3}\cdot 2^{\upsilon-1}, C=Kmt1υ+C4dtυ+C2C=Km_{t-1}^{\upsilon}+C_{4}d_{t}^{\upsilon}+C_{2} depending on tt.

  • (c)

    ξtExponential(θt)\xi_{t}\sim\text{Exponential}(\theta_{t}) and θtGamma(mt,dt)\theta_{t}\sim\text{Gamma}(m_{t},d_{t}). Let μ0(θ)=Gamma(m0,d0)\mu_{0}(\theta)=\text{Gamma}(m_{0},d_{0}). Then the posterior distribution μt(θ)=Gamma(mt,dt)\mu_{t}(\theta)=\text{Gamma}(m_{t},d_{t}) can be updated based on ξt1\xi_{t-1} as follows:

    mt=mt1+1,dt=dt1+ξt1.\displaystyle m_{t}=m_{t-1}+1,\quad d_{t}=d_{t-1}+\xi_{t-1}. (3.13)

    We consider BCR with both the inner and outer risk measures being mathematical expectation. Since ξtExponential(θt)\xi_{t}\sim\text{Exponential}(\theta_{t}), then

    𝔼Pθt(|ξt|υ)=0ξtυθteθtξt𝑑ξt=Γ(υ+1)θtυ\mathbb{E}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})=\int_{0}^{\infty}\xi_{t}^{\upsilon}\theta_{t}e^{-\theta_{t}\xi_{t}}d\xi_{t}=\frac{\Gamma(\upsilon+1)}{\theta_{t}^{\upsilon}}

    and subsequently

    ρμtρPθt(|ξt|υ)=𝔼μt𝔼Pθt(|ξt|υ)=Γ(υ+1)𝔼μt(1θtυ).\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})=\mathbb{E}_{\mu_{t}}\mathbb{E}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})=\Gamma(\upsilon+1)\mathbb{E}_{\mu_{t}}\left(\frac{1}{\theta_{t}^{\upsilon}}\right).

    Moreover, since μt=Gamma(mt,dt)\mu_{t}=\text{Gamma}(m_{t},d_{t}), then we have

    𝔼μt(1θtυ)=dtmtΓ(mt)0θtmtυ1edtθt𝑑θt=dtυΓ(mtυ)Γ(mt)\mathbb{E}_{\mu_{t}}\left(\frac{1}{\theta_{t}^{\upsilon}}\right)=\frac{d_{t}^{m_{t}}}{\Gamma(m_{t})}\int_{0}^{\infty}\theta_{t}^{m_{t}-\upsilon-1}e^{-d_{t}\theta_{t}}d\theta_{t}=\frac{d_{t}^{\upsilon}\Gamma(m_{t}-\upsilon)}{\Gamma(m_{t})}

    for mt>υm_{t}>\upsilon. By exploiting (3.13), we obtain

    dtυ=(dt1+ξt1)υ2υ1(dt1υ+|ξt1|υ).d_{t}^{\upsilon}=(d_{t-1}+\xi_{t-1})^{\upsilon}\leq 2^{\upsilon-1}\left(d_{t-1}^{\upsilon}+|\xi_{t-1}|^{\upsilon}\right).

    Let

    K=Γ(υ+1)Γ(mtυ)Γ(mt)2υ1,C=Kdt1υ.K=\Gamma(\upsilon+1)\frac{\Gamma(m_{t}-\upsilon)}{\Gamma(m_{t})}2^{\upsilon-1},\quad C=Kd_{t-1}^{\upsilon}.

    Then

    ρμtρPθt(|ξt|υ)=𝔼μt𝔼Pθt(|ξt|υ)K|ξt1|υ+C.\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})=\mathbb{E}_{\mu_{t}}\mathbb{E}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})\leq K|\xi_{t-1}|^{\upsilon}+C.

    Next, we consider the case where the inner risk measure is ρPθt=AVaRPθtβ\rho_{P_{\theta_{t}}}=\text{AVaR}^{\beta}_{P_{\theta_{t}}}. Since AVaRPθtβ(|ξt|υ)1β𝔼Pθt(|ξt|υ)\text{AVaR}^{\beta}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})\leq\frac{1}{\beta}\mathbb{E}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon}), the result follows in a similar manner to the case where ρPθt=𝔼Pθt\rho_{P_{\theta_{t}}}=\mathbb{E}_{P_{\theta_{t}}}. Furthermore, consider the case where the inner risk measure is ρPθt=VaRPθtβ\rho_{P_{\theta_{t}}}=\text{VaR}^{\beta}_{P_{\theta_{t}}}. Since VaRPθtβ(|ξt|υ)AVaRPθtβ(|ξt|υ)\text{VaR}^{\beta}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})\leq\text{AVaR}^{\beta}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon}), the result can be deduced in a similar manner to the ρPθt=AVaRPθtβ(|ξt|υ)\rho_{P_{\theta_{t}}}=\text{AVaR}^{\beta}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon}) case. For the outer risk measures ρμt=VaRμtα\rho_{\mu_{t}}=\text{VaR}^{\alpha}_{\mu_{t}} and ρμt=AVaRμtα\rho_{\mu_{t}}=\text{AVaR}^{\alpha}_{\mu_{t}}, similar results hold.

  • (d)

    ξtPoisson(θt)\xi_{t}\sim\text{Poisson}(\theta_{t}) and θtGamma(mt,dt)\theta_{t}\sim\text{Gamma}(m_{t},d_{t}). By setting the prior distribution μ0(θ)=Gamma(m0,d0)\mu_{0}(\theta)=\text{Gamma}(m_{0},d_{0}), the posterior distribution μt(θ)=Gamma(mt,dt)\mu_{t}(\theta)=\text{Gamma}(m_{t},d_{t}) can be updated based on sample ξt1\xi_{t-1} as follows:

    mt=mt1+ξt1,dt=dt1+1.\displaystyle m_{t}=m_{t-1}+\xi_{t-1},\quad d_{t}=d_{t-1}+1. (3.14)

    Consider the case that both the inner and outer risk measures are mathematical expectations. For ξtPoisson(θt)\xi_{t}\sim\text{Poisson}(\theta_{t}), the expectation is given by:

    𝔼Pθt(|ξt|υ)=𝔼Pθt(ξtυ)=θtυ+terms involving lower powers of θt.\mathbb{E}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})=\mathbb{E}_{P_{\theta_{t}}}(\xi_{t}^{\upsilon})=\theta_{t}^{\upsilon}+\text{terms involving lower powers of }\theta_{t}.

    For simplicity, we can find constants C1C_{1} and C2C_{2} such that 𝔼Pθt(ξtυ)C1θtυ+C2.\mathbb{E}_{P_{\theta_{t}}}(\xi_{t}^{\upsilon})\leq C_{1}\theta_{t}^{\upsilon}+C_{2}.

    Since μt=Gamma(mt,dt)\mu_{t}=\text{Gamma}(m_{t},d_{t}), we have:

    𝔼μt(θtυ)=Γ(mt+υ)Γ(mt)dtυ.\mathbb{E}_{\mu_{t}}(\theta_{t}^{\upsilon})=\frac{\Gamma(m_{t}+\upsilon)}{\Gamma(m_{t})}d_{t}^{-\upsilon}.

    Substituting mt=mt1+ξt1m_{t}=m_{t-1}+\xi_{t-1} and dt=dt1+1d_{t}=d_{t-1}+1 into the equation above, we obtain

    𝔼μt(θtυ)=Γ(mt1+ξt1+υ)Γ(mt1+ξt1)(dt1+1)υ.\mathbb{E}_{\mu_{t}}(\theta_{t}^{\upsilon})=\frac{\Gamma(m_{t-1}+\xi_{t-1}+\upsilon)}{\Gamma(m_{t-1}+\xi_{t-1})}(d_{t-1}+1)^{-\upsilon}.

    Using the properties of the Gamma function, we have

    Γ(mt1+ξt1+υ)Γ(mt1+ξt1)=k=0υ1(mt1+ξt1+k).\frac{\Gamma(m_{t-1}+\xi_{t-1}+\upsilon)}{\Gamma(m_{t-1}+\xi_{t-1})}=\prod_{k=0}^{\upsilon-1}(m_{t-1}+\xi_{t-1}+k).

    Consequently we can establish

    𝔼μt𝔼Pθt(|ξt|υ)C1𝔼μt(θtυ)+C2C1(ξt1+mt1+υ1)υ(dt1+1)υ+C2K|ξt1|υ+C,\mathbb{E}_{\mu_{t}}\mathbb{E}_{P_{\theta_{t}}}(|\xi_{t}|^{\upsilon})\leq C_{1}\mathbb{E}_{\mu_{t}}(\theta_{t}^{\upsilon})+C_{2}\leq C_{1}(\xi_{t-1}+m_{t-1}+\upsilon-1)^{\upsilon}(d_{t-1}+1)^{-\upsilon}+C_{2}\leq K|\xi_{t-1}|^{\upsilon}+C,

    where K=2υ1C1(dt1+1)υK=2^{\upsilon-1}C_{1}(d_{t-1}+1)^{-\upsilon} and C=K(mt1+υ1)υ+C2C=K(m_{t-1}+\upsilon-1)^{\upsilon}+C_{2}, the factor 2υ12^{\upsilon-1} arising from the elementary inequality (a+b)υ2υ1(aυ+bυ)(a+b)^{\upsilon}\leq 2^{\upsilon-1}(a^{\upsilon}+b^{\upsilon}). Similar results can be established when VaR and AVaR are chosen for the inner and outer risk measures respectively. We omit the details.
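Inequality (3.6) in the normal–normal case (b) can also be verified numerically. The sketch below (illustrative, with υ = 2 and both risk measures taken as expectations, in which case the composite risk has the closed form E_{μt} ∘ E_{Pθt}(ξt²) = m_t² + d_t² + σ²) checks the bound K|ξ_{t−1}|² + C over a range of ξ_{t−1}:

```python
import numpy as np

# Normal-normal setting of case (b) with upsilon = 2: xi_t ~ N(theta_t, sigma^2),
# theta_t ~ N(m_t, d_t^2), and both risk measures taken as expectations, so
#   E_{mu_t} o E_{P_theta_t}(xi_t^2) = m_t^2 + d_t^2 + sigma^2.
sigma, m_prev, d_prev = 1.0, 0.3, 0.8

lam = d_prev**2 / (d_prev**2 + sigma**2)          # lambda_t in (3.7)
d_t2 = 1.0 / (1.0 / d_prev**2 + 1.0 / sigma**2)   # d_t^2 from (3.7)

# Constants for (3.6), using m_t^2 <= 2 lam^2 m_{t-1}^2 + 2 (1-lam)^2 xi_{t-1}^2.
K = 2.0 * (1.0 - lam) ** 2
C = 2.0 * lam**2 * m_prev**2 + d_t2 + sigma**2

xi_prev = np.linspace(-10.0, 10.0, 201)
m_t = lam * m_prev + (1.0 - lam) * xi_prev        # posterior mean update (3.7)
bcr = m_t**2 + d_t2 + sigma**2                    # composite risk of xi_t^2
slack = K * xi_prev**2 + C - bcr                  # should be nonnegative
```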

We are now ready to address the well-definedness of the BCR-SOC/MDP problem (3.1).

Proposition 2.

Under Assumption 1, the BCR-SOC/MDP problem (3.1) is well-defined.

Proof. Let

ϑT(sT,aT,ξT,μT)\displaystyle{\vartheta}_{T}(s_{T},a_{T},\xi_{T},\mu_{T}) :=\displaystyle:= 𝒞T(sT),\displaystyle{\cal C}_{T}(s_{T}),
ϑt(st,at,ξt,μt)\displaystyle{\vartheta}_{t}(s_{t},a_{t},\xi_{t},\mu_{t}) :=\displaystyle:= 𝒞t(st,at,ξt)+γρμt+1ρPθt+1(ϑt+1(st+1,at+1,ξt+1,μt+1)),t=1,,T1.\displaystyle\mathcal{C}_{t}\left(s_{t},a_{t},\xi_{t}\right)+\gamma\rho_{\mu_{t+1}}\circ\rho_{P_{\theta_{t+1}}}\left({\vartheta}_{t+1}(s_{t+1},a_{t+1},\xi_{t+1},\mu_{t+1})\right),t=1,\cdots,T-1.

We begin with the case t=T1t=T-1. Since sT=gT1(sT1,aT1,ξT1)s_{T}=g_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right), we have

ϑT1(sT1,aT1,ξT1,μT1)\displaystyle{\vartheta}_{T-1}(s_{T-1},a_{T-1},\xi_{T-1},\mu_{T-1}) =\displaystyle= 𝒞T1(sT1,aT1,ξT1)+γρμTρPθT(ϑT(sT,aT,ξT,μT))\displaystyle\mathcal{C}_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\rho_{\mu_{T}}\circ\rho_{P_{\theta_{T}}}\left({\vartheta}_{T}(s_{T},a_{T},\xi_{T},\mu_{T})\right)
=\displaystyle= 𝒞T1(sT1,aT1,ξT1)+γ𝒞T(gT1(sT1,aT1,ξT1)).\displaystyle\mathcal{C}_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}_{T}\left(g_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)\right).

Under Assumption 1 (a),

|ϑT1(sT1,aT1,ξT1,μT1)|(1+γ)κ𝒞(|ξT1|υ+1).\displaystyle|{\vartheta}_{T-1}(s_{T-1},a_{T-1},\xi_{T-1},\mu_{T-1})|\leq(1+\gamma)\kappa_{\cal C}(|\xi_{T-1}|^{\upsilon}+1). (3.15)

By setting κ:=(1+γ)κ𝒞\kappa:=(1+\gamma)\kappa_{\cal C}, we know by Proposition 1 that ρμT1ρPθT1(ϑT1(sT1,aT1,ξT1,μT1))\rho_{\mu_{T-1}}\circ\rho_{P_{\theta_{T-1}}}\left({\vartheta}_{T-1}(s_{T-1},a_{T-1},\xi_{T-1},\mu_{T-1})\right) is well defined and is continuous in aT1a_{T-1}. Repeating the analysis above, we can show by induction that

|ϑt(st,at,ξt,μt)||𝒞t(st,at,ξt)|+γρμt+1ρPθt+1|ϑt+1(st+1,at+1,ξt+1,μt+1)|κ𝒞1(γκρ)Tt+11γκρ(|ξt|υ+1)+γ1(γκρ)Tt1γκρ\displaystyle\begin{aligned} |{\vartheta}_{t}(s_{t},a_{t},\xi_{t},\mu_{t})|&\leq\left|\mathcal{C}_{t}\left(s_{t},a_{t},\xi_{t}\right)\right|+\gamma\rho_{\mu_{t+1}}\circ\rho_{P_{\theta_{t+1}}}|{\vartheta}_{t+1}(s_{t+1},a_{t+1},\xi_{t+1},\mu_{t+1})|\\ &\leq\kappa_{\mathcal{C}}\frac{1-(\gamma\kappa_{\rho})^{T-t+1}}{1-\gamma\kappa_{\rho}}(|\xi_{t}|^{\upsilon}+1)+\gamma\frac{1-(\gamma\kappa_{\rho})^{T-t}}{1-\gamma\kappa_{\rho}}\end{aligned} (3.16)

and hence the well-definedness of ρμtρPθt(ϑt(st,at,ξt,μt))\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left({\vartheta}_{t}(s_{t},a_{t},\xi_{t},\mu_{t})\right) by virtue of Proposition 1 and continuity in ata_{t} by Lemma 1 for t=T2,,1t=T-2,\cdots,1. \Box

The proposition applies to the case where T<T<\infty, that is, the BCR-SOC/MDP is a finite-horizon problem. In the case of T=T=\infty, we need to assume that 𝒞t\mathcal{C}_{t} is bounded and γ<1\gamma<1. The condition is widely used in the literature of infinite horizon SOC/MDP. In the context of Assumption 1, it corresponds to υ=0\upsilon=0 and κρ=1\kappa_{\rho}=1. Consequently, ϑ1(s1,a1,ξ1,μ1)\vartheta_{1}(s_{1},a_{1},\xi_{1},\mu_{1}) is bounded by a constant 2κ𝒞+γ1γ\frac{2\kappa_{\mathcal{C}}+\gamma}{1-\gamma} for all (s1,a1,ξ1,μ1)𝒮×𝒜×Ξ×𝒟10(s_{1},a_{1},\xi_{1},\mu_{1})\in\mathcal{S}\times\mathcal{A}\times\Xi\times\mathcal{D}_{1}^{0}, which leads to the well-definedness of BCR-SOC/MDP problem (3.1) with T=T=\infty.
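The backward recursion used in the proof can be traced on a toy instance. The sketch below (an illustrative setup, not from the paper: two episodes, no actions, Θ = {0.3, 0.7}, ξ_t ~ Bernoulli(θ), cost C_t = ξ_t, and both risk measures taken as expectations) evaluates the nested objective backwards, updating the belief inside the recursion, and cross-checks against brute-force enumeration of scenario paths:

```python
import numpy as np

thetas = np.array([0.3, 0.7])   # candidate Bernoulli parameters
mu1 = np.array([0.5, 0.5])      # uniform prior belief
gamma = 0.9

def likelihood(xi):
    return thetas if xi == 1 else 1.0 - thetas

def bayes_update(mu, xi):
    post = likelihood(xi) * mu
    return post / post.sum()

# Backward recursion with both risk measures chosen as expectations and
# cost C_t = xi_t: v2 is the stage-2 composite risk E_mu o E_{P_theta}[xi_2].
def v2(mu):
    return float(mu @ thetas)

def v1(mu):
    total = 0.0
    for xi1 in (0, 1):
        p_xi1 = float(mu @ likelihood(xi1))      # predictive probability of xi1
        total += p_xi1 * (xi1 + gamma * v2(bayes_update(mu, xi1)))
    return total

nested = v1(mu1)

# Brute force: enumerate all scenario paths under the predictive law.
brute = 0.0
for xi1 in (0, 1):
    p1 = float(mu1 @ likelihood(xi1))
    mu2 = bayes_update(mu1, xi1)
    for xi2 in (0, 1):
        p2 = float(mu2 @ likelihood(xi2))
        brute += p1 * p2 * (xi1 + gamma * xi2)
```

With both risk measures set to expectations the two values agree by the tower property; with risk-averse inner or outer choices the nested evaluation is the only correct one, since the belief update enters inside the recursion.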

3.1 BCR-SOC/MDP vs conventional SOC/MDP models

The proposed BCR-SOC/MDP model subsumes several important SOC/MDP models in the literature. Here we list some of them.

Example 3 (Conventional risk-averse SOC/MDPs).

Conventional risk-averse SOC/MDP models (see e.g. [2, 47, 33]) assume that the transition probability matrix 𝒫\mathcal{P} or the distribution PP for ξ\xi is known. These models can be regarded as a special case of the BCR-SOC/MDP model by setting μ1(θ)=δθc\mu_{1}(\theta)=\delta_{\theta^{c}}, where δθc\delta_{\theta^{c}} is the Dirac probability measure at θc\theta^{c}. Consequently the posterior distribution μt(θ)\mu_{t}(\theta) remains δθc\delta_{\theta^{c}} for all t=1,,Tt=1,\cdots,T, because of the Bayes’ updating mechanism δθcp(ξ|θ)δθcΘp(ξ|θ)δθc𝑑θ\delta_{\theta^{c}}\equiv\frac{p\left(\xi|\theta\right)\delta_{\theta^{c}}}{\int_{\Theta}p\left(\xi|\theta\right)\delta_{\theta^{c}}d\theta}. The resulting objective function of problem (3.1) can be reformulated as:

minπρPθc[𝒞1(s1,a1,ξ1)++γρPθc[𝒞T1(sT1,aT1,ξT1)+γ𝒞T(sT)]].\min_{\pi}\rho_{P_{\theta^{c}}}\left[\mathcal{C}_{1}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\rho_{P_{\theta^{c}}}\left[\mathcal{C}_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}_{T}\left(s_{T}\right)\right]\right]. (3.17)

Moreover, by adopting the inner risk measure ρPθc\rho_{P_{\theta^{c}}} as expectation, VaR and AVaR, respectively, the BCR-SOC/MDP model recovers the relevant risk-averse SOC/MDP models in the literature.

Example 4 (Distributionally robust SOC/MDPs).

In the distributionally robust SOC/MDP models (see e.g.  [60, 76]), the DM uses partially available information to construct an ambiguity set of transition probability matrices to mitigate the risk arising from incomplete information about the true transition probability matrix 𝒫(st+1=s|st,at)\mathcal{P}\left(s_{t+1}=s|s_{t},a_{t}\right) for st𝒮s_{t}\in\mathcal{S} and at𝒜a_{t}\in\mathcal{A}. The optimal decision is based on the worst expected cost calculated with the ambiguity set of transition matrices. The distributionally robust SOC/MDP models can be recast as BCR-SOC/MDP models.

  • The ambiguity set of transition probabilities can equivalently be described by the ambiguity set of the distributions of the random parameters within the BCR-SOC/MDP framework. For instance, if we set the inner risk measure ρPθ\rho_{P_{\theta}} to be the mathematical expectation 𝔼Pθ\mathbb{E}_{P_{\theta}} and the outer risk measure to be the essential supremum (VaRμtα\text{VaR}_{\mu_{t}}^{\alpha} with α=0\alpha=0), then the objective function of problem (3.1) becomes

    minπVaRμ1α𝔼Pθ1[𝒞1(s1,a1,ξ1)++γVaRμT1α𝔼PθT1[𝒞T1(sT1,aT1,ξT1)+γ𝒞T(sT)]]\min_{\pi}\text{VaR}_{\mu_{1}}^{\alpha}\circ\mathbb{E}_{P_{\theta_{1}}}\left[\mathcal{C}_{1}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\text{VaR}_{\mu_{T-1}}^{\alpha}\circ\mathbb{E}_{P_{\theta_{T-1}}}\left[\mathcal{C}_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}_{T}\left(s_{T}\right)\right]\right]

    or equivalently

    minπmaxθ1Θ1𝔼Pθ1[𝒞1(s1,a1,ξ1)++γmaxθT1ΘT1𝔼PθT1[𝒞T1(sT1,aT1,ξT1)+γ𝒞T(sT)]],\min_{\pi}\max_{\theta_{1}\in\Theta_{1}}\mathbb{E}_{P_{\theta_{1}}}\left[\mathcal{C}_{1}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\max_{\theta_{T-1}\in\Theta_{T-1}}\mathbb{E}_{P_{\theta_{T-1}}}\left[\mathcal{C}_{T-1}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}_{T}\left(s_{T}\right)\right]\right], (3.18)

    where Θt\Theta_{t} is the support set of μt\mu_{t} for t=1,,T1t=1,\cdots,T-1. As demonstrated in [40], this reformulation is in alignment with distributionally robust SOC/MDP models if we treat {Pθt,θtΘt}\{P_{\theta_{t}},\theta_{t}\in\Theta_{t}\} as an ambiguity set of PθcP_{\theta^{c}}. If the inner risk measure is set using a general risk measure instead of the mathematical expectation, the BCR-SOC/MDP model is equivalent to the distributionally robust risk-averse SOC/MDP model proposed in [56]. Moreover, by choosing α(0,1)\alpha\in(0,1) for outer risk measure VaRμtα\text{VaR}_{\mu_{t}}^{\alpha}, the BCR-SOC/MDP model effectively transforms into a less conservative chance-constrained SOC/MDP form, as proposed in [13]. This BCR-SOC/MDP model can thus be viewed as a distributionally robust SOC/MDP with respect to the Bayesian ambiguity set defined in [24]. Further elaboration on this equivalence will be provided in Section 6.1.

  • In view of the robust representation of coherent risk measures, employing a coherent risk measure as the outer risk measure ρμt\rho_{\mu_{t}} gives our BCR-SOC/MDP framework (3.17) an interpretation aligned with distributionally robust SOC/MDP models concerning the corresponding ambiguity sets. To be consistent with the setting of our article, we refer to [60] for the detailed equivalence between the distributionally robust model and the risk-averse model under a parametric family of probability distributions.
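The reduction to the worst-case form (3.18) can be illustrated with a discrete belief. A minimal sketch (illustrative choices: C(ξ) = ξ with ξ ~ Exponential(θ), so the inner expectation is 1/θ in closed form) showing that the outer VaR at α = 0 returns the worst-case expected cost over the support of μ_t:

```python
import numpy as np

thetas = np.array([0.2, 0.5, 0.9])   # support of mu_t (exponential rates)
mu = np.array([0.6, 0.3, 0.1])       # belief over Theta, all weights positive

# Inner risk: E_{P_theta}[C(xi)] with C(xi) = xi and xi ~ Exponential(theta),
# which equals 1/theta in closed form.
inner = 1.0 / thetas

def var_alpha(values, probs, alpha):
    """VaR_alpha of a discrete random variable:
    the smallest v with P(V <= v) >= 1 - alpha."""
    order = np.argsort(values)
    cum = np.cumsum(probs[order])
    idx = np.searchsorted(cum, 1.0 - alpha)
    return float(values[order][min(idx, len(values) - 1)])  # clamp for round-off

bcr_outer_var0 = var_alpha(inner, mu, alpha=0.0)   # essential supremum under mu
worst_case = float(inner.max())                    # max_theta E_{P_theta}[C]
```

Choosing α ∈ (0, 1) instead discards the most pessimistic tail of the belief, which is the less conservative chance-constrained variant mentioned above.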

Example 5 (Bayes-adaptive MDP).

Like the concept behind BCR-SOC/MDP, the state space of Bayes-adaptive MDPs consists of the physical state space 𝒮\mathcal{S} and the posterior distribution space 𝒟1p\mathcal{D}_{1}^{p}, as introduced in [55]. Specifically, as demonstrated in [8], the objective function of the episodic Bayesian SOC in [63] can be formulated as the following Bayes-adaptive form:

minπ𝔼P^θ1[𝒞(s1,a1,ξ1)++γ𝔼P^θT1[𝒞(sT1,aT1,ξT1)+γ𝒞(sT)]].\min_{\pi}\mathbb{E}_{\hat{P}_{{\theta}_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\mathbb{E}_{\hat{P}_{{\theta}_{T-1}}}\left[\mathcal{C}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}\left(s_{T}\right)\right]\right]. (3.19)

Here, P^θt\hat{P}_{{\theta}_{t}}, the predictive distribution whose density is given by p^t(ξ):=Θp(ξ|θ)μt(θ)𝑑θ\hat{p}_{t}(\xi):=\int_{\Theta}p(\xi|\theta)\mu_{t}(\theta)d\theta, represents the posterior estimate of the underlying distribution Pθc{P_{\theta^{c}}}. By setting both the inner risk measure ρPθt\rho_{P_{\theta_{t}}} and the outer risk measure ρμt\rho_{\mu_{t}} as the expectation operators 𝔼Pθt\mathbb{E}_{P_{\theta_{t}}} and 𝔼μt\mathbb{E}_{{\mu_{t}}} respectively, the objective function of the BCR-SOC/MDP model reduces to

minπ𝔼μ1𝔼Pθ1[𝒞(s1,a1,ξ1)++γ𝔼μT1𝔼PθT1[𝒞(sT1,aT1,ξT1)+γ𝒞(sT)]],\min_{\pi}\mathbb{E}_{{\mu_{1}}}\circ\mathbb{E}_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\mathbb{E}_{{\mu_{T-1}}}\circ\mathbb{E}_{P_{\theta_{T-1}}}\left[\mathcal{C}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}\left(s_{T}\right)\right]\right], (3.20)

which is equivalent to that of the Bayes-adaptive MDP; this follows easily by exchanging the order of integration. Furthermore, the objective function of the BR-MDP in [38]

minπρμ1𝔼Pθ1(𝒞(s1,a1,ξ1)++γρμT1𝔼PθT1[𝒞(sT1,aT1,ξT1)+γ𝒞(sT)])\min_{\pi}\rho_{{\mu_{1}}}\circ\mathbb{E}_{P_{\theta_{1}}}\left(\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\rho_{{\mu_{T-1}}}\circ\mathbb{E}_{P_{\theta_{T-1}}}\left[\mathcal{C}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}\left(s_{T}\right)\right]\right) (3.21)

can be derived from the BCR-SOC/MDP model by setting only the inner risk measure ρPθt\rho_{P_{\theta_{t}}} to the expectation operator 𝔼Pθt\mathbb{E}_{P_{\theta_{t}}}.

3.2 BCR-SOC/MDP with randomized VaR and SRM – a preference robust viewpoint

A BCR can be understood as a risk measure of a random risk measure. In decision analytics, PθP_{\theta} is the DM’s belief about the uncertainty of the state of nature and ρPθ(X)\rho_{P_{\theta}}(X) is a measure of the risk of that uncertainty. The risk measure captures the DM’s risk preference, and the uncertainty about θ\theta makes ρPθ(X)\rho_{P_{\theta}}(X) a random variable. In this subsection, we use two examples to compare empirical distributions with ambiguity sets and to highlight the importance and practical relevance of the BCR model in real-world problems.

In a data-driven environment, a popular approach is to use samples to derive an estimation of θ\theta and subsequently an approximation of PθcP_{\theta^{c}}. However, the approach may not work well when the sample size is small. The next example illustrates this.

Example 6.

Consider an example of predicting the price movement of a stock. The stock price movement follows a probability distribution PθP_{\theta}, where θ\theta denotes the probability of an upward movement (stock price increase). When θ(0,0.5)\theta\in(0,0.5), the probability that the stock price will increase is less than 0.5, indicating a generally downward market trend. Let us consider the case where the stock price movement is neutral, i.e., θc=0.5\theta^{c}=0.5, meaning the probability of an upward movement is 50%. Suppose that we observe the stock price movement over five trading days, resulting in 2 days of price increases (up days) and 3 days of price decreases (down days). From these observations, we may estimate θ\theta as 0.40.4, suggesting a higher probability of a downward movement than under the true θc=0.5\theta^{c}=0.5. This estimate deviates significantly from the true θc=0.5\theta^{c}=0.5 because the sample size is small. Indeed, by using the Binomial distribution, we have

P0.4(𝝃5={2 up days and 3 down days}):=P(𝝃5={2 up days and 3 down days}|θ=0.4)=0.34\displaystyle P_{0.4}(\boldsymbol{\xi}^{5}=\{\text{2 up days and 3 down days}\}):=P(\boldsymbol{\xi}^{5}=\{\text{2 up days and 3 down days}\}|\theta=0.4)=0.34
\displaystyle\approx P0.5(𝝃5={2 up days and 3 down days}):=P(𝝃5={2 up days and 3 down days}|θ=0.5)=0.31,\displaystyle P_{0.5}(\boldsymbol{\xi}^{5}=\{\text{2 up days and 3 down days}\}):=P(\boldsymbol{\xi}^{5}=\{\text{2 up days and 3 down days}\}|\theta=0.5)=0.31,

which means that the probability of observing 2 up days and 3 down days in a market with a neutral price movement is very close to that in a market where θ=0.4\theta=0.4. To address this issue, we can use the convex combination 0.4ρP0.4(X)+0.6ρP0.5(X)0.4\rho_{P_{0.4}}(X)+0.6\rho_{P_{0.5}}(X), instead of ρP0.4(X)\rho_{P_{0.4}}(X), to evaluate the risk of a loss XX in the market. This concept is used by Li et al. [37, Example 1] in the context of randomized spectral risk measures.

It might be tempting to adopt a robust approach by considering the worst-case probability distribution of PθP_{\theta} from an ambiguity set such as {Pθ|θ[0.3,0.6]}\{P_{\theta}|\theta\in[0.3,0.6]\}. In that case, we have

P0.4(𝝃5={2 up days and 3 down days}):=P(𝝃5={2 up days and 3 down days}|θ=0.4)=0.34\displaystyle P_{0.4}(\boldsymbol{\xi}^{5}=\{\text{2 up days and 3 down days}\}):=P(\boldsymbol{\xi}^{5}=\{\text{2 up days and 3 down days}\}|\theta=0.4)=0.34
>\displaystyle> P0.6(𝝃5={2 up days and 3 down days}):=P(𝝃5={2 up days and 3 down days}|θ=0.6)=0.23,\displaystyle P_{0.6}(\boldsymbol{\xi}^{5}=\{\text{2 up days and 3 down days}\}):=P(\boldsymbol{\xi}^{5}=\{\text{2 up days and 3 down days}\}|\theta=0.6)=0.23,

which means that the worst-case probability distribution is too conservative to use. This is because, in the robust argument, θ\theta is implicitly treated as equally likely to take any value in the range [0.3,0.6][0.3,0.6]. This simple calculation shows that adopting a distributional estimate of θ\theta, which weights the θ\theta values according to their effects on the distribution of ξ\xi, might be more appropriate in the absence of complete information about θ\theta based on a small sample of ξ\xi.

We can cope with the above issue from a Bayesian perspective: before observing any data, we assume that no specific information about the true parameter θc\theta^{c} is known by the DM, and thus the DM can consider θc\theta^{c} to be a realization of a random variable θ\theta following a uniform distribution over the interval [0, 1], i.e., μ0(θ)=1\mu_{0}(\theta)=1 for all θ[0,1]\theta\in[0,1]. After observing data with 2 up days and 3 down days, we easily derive using Bayes’ formula that the posterior distribution μ(θ|𝛏5={2 up days and 3 down days})\mu(\theta|\boldsymbol{\xi}^{5}=\text{\{2 up days and 3 down days\}}) follows a Beta(3,4) distribution, as illustrated in Figure 1.

Figure 1: Posterior distribution of θ\theta based on observations with 2 up days and 3 down days

As depicted, in such cases, employing a distribution μ\mu to randomize θ\theta allows us to capture the variability in the environment more effectively. From a modeling perspective, evaluating ρPθ(X)\rho_{P_{\theta}}(X) under this framework allows us to quantify the DM’s risk preference with respect to the environment through ρμρPθ(X)\rho_{\mu}\circ\rho_{P_{\theta}}(X). Therefore, utilizing μ\mu as the probability distribution of θ\theta is more reasonable.
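As a quick check of the numbers in Example 6, the following standard-library sketch computes the binomial likelihoods and the parameters of the Beta posterior obtained from a uniform prior after observing 2 up days and 3 down days (the exact values 0.3456 and 0.3125 are what the text rounds to 0.34 and 0.31):

```python
from math import comb

def binom_pmf(k, n, theta):
    """P(k successes in n trials | success probability theta)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# Likelihood of 2 up days and 3 down days under different values of theta
lik_04 = binom_pmf(2, 5, 0.4)   # 0.3456
lik_05 = binom_pmf(2, 5, 0.5)   # 0.3125
lik_06 = binom_pmf(2, 5, 0.6)   # 0.2304

# Bayes update: a uniform prior is Beta(1, 1); after 2 ups and 3 downs the
# posterior is Beta(1 + 2, 1 + 3) = Beta(3, 4), as in Figure 1.
post_alpha, post_beta = 1 + 2, 1 + 3
post_mean = post_alpha / (post_alpha + post_beta)   # 3/7, pulled toward 0.5
```

Note that the posterior mean 3/7 sits between the point estimate 0.4 and the neutral value 0.5, reflecting the residual uncertainty that a point estimate discards.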

It is possible to express BCR-SOC/MDP as a preference robust SOC/MDP under some specific circumstances. To see this, we consider a BCR where ρPθ=VaRPθα\rho_{P_{\theta}}=\text{VaR}_{P_{\theta}}^{\alpha} for some α(0,1)\alpha\in(0,1) and ρμ=𝔼μ\rho_{\mu}=\mathbb{E}_{\mu}. The resulting BCR is

𝔼μVaRPθα(X)=ΘVaRPθα(X)μ(θ)𝑑θ.\mathbb{E}_{\mu}\circ\text{VaR}_{P_{\theta}}^{\alpha}(X)=\int_{\Theta}\text{VaR}_{P_{\theta}}^{\alpha}(X)\mu(\theta)d\theta.

We will show shortly that the BCR may be expressed as an SRM. Assume, for fixed α\alpha, that there exists a bijective function f:Θ(0,1)f:\Theta\to(0,1) such that VaRPθα(X)=VaRPθcf(θ)(X)\text{VaR}_{P_{\theta}}^{\alpha}(X)=\text{VaR}_{P_{\theta^{c}}}^{f(\theta)}(X), where PθcP_{\theta^{c}} is the true probability distribution of XX. For fixed μ\mu and ff, let

t=1f(θ) and σ(t)=μ(f1(1t))|ddtf1(1t)|.t=1-f(\theta)\text{ and }\sigma(t)=\mu(f^{-1}(1-t))\cdot\left|\frac{d}{dt}f^{-1}(1-t)\right|. (3.22)

Then

𝔼μVaRPθα(X)\displaystyle\mathbb{E}_{\mu}\circ\text{VaR}_{P_{\theta}}^{\alpha}(X) =ΘVaRPθα(X)μ(θ)𝑑θ=ΘVaRPθcf(θ)(X)μ(θ)𝑑θ\displaystyle=\int_{\Theta}\text{VaR}_{P_{\theta}}^{\alpha}(X)\mu(\theta)d\theta=\int_{\Theta}\text{VaR}_{P_{\theta^{c}}}^{f(\theta)}(X)\mu(\theta)d\theta
=01VaRPθc1t(X)σ(t)𝑑t=01FPθc(t)σ(t)𝑑t.\displaystyle=\int_{0}^{1}\text{VaR}_{P_{\theta^{c}}}^{1-t}(X)\sigma(t)dt=\int_{0}^{1}F^{\leftarrow}_{P_{\theta^{c}}}(t)\sigma(t)dt.

The right-hand side of the equation above is a spectral risk measure, which can be viewed as a weighted average of the quantile function FPθc(t)F^{\leftarrow}_{P_{\theta^{c}}}(t) with weighting function (risk spectrum) σ\sigma. The next example explains how it may work.

Example 7.

Consider a random variable XX which follows an exponential distribution parameterized by θ\theta. It is easy to derive that

VaRPθα(X)=lnαθ.\displaystyle\text{VaR}_{P_{\theta}}^{\alpha}(X)=\frac{-\ln\alpha}{\theta}.

Suppose that the true probability distribution of XX is PθcP_{\theta^{c}}. By definition, the SRM of XX is

Mσ(X)=01VaRPθc1t(X)σ(t)𝑑t=01ln(1t)θcσ(t)𝑑t.\displaystyle M_{\sigma}(X)=\int_{0}^{1}\text{VaR}_{P_{\theta^{c}}}^{1-t}(X)\sigma(t)dt=\int_{0}^{1}\frac{-\ln(1-t)}{\theta^{c}}\sigma(t)dt.

On the other hand, the BCR of XX is

𝔼μVaRPθα(X)=Θlnαθμ(θ)𝑑θ.\displaystyle\mathbb{E}_{\mu}\circ\text{VaR}_{P_{\theta}}^{\alpha}(X)=\int_{\Theta}\frac{-\ln\alpha}{\theta}\mu(\theta)d\theta.

By setting f(θ)=αθcθandt=1f(θ)=1αθcθf(\theta)=\alpha^{\frac{\theta^{c}}{\theta}}\quad\text{and}\quad t=1-f(\theta)=1-\alpha^{\frac{\theta^{c}}{\theta}}, we demonstrate that VaRPθα(X)=VaRPθc1t(X).\text{VaR}_{P_{\theta}}^{\alpha}(X)=\text{VaR}_{P_{\theta^{c}}}^{1-t}(X). Thus, by letting risk spectrum

σ(t):=μ(θclnαln(1t))θcln1/α(1t)(ln(1t))2,\displaystyle\sigma(t):=\mu\left(\frac{\theta^{c}\ln\alpha}{\ln(1-t)}\right)\frac{\theta^{c}\ln 1/\alpha}{(1-t)(\ln(1-t))^{2}},

we obtain 𝔼μVaRPθα(X)=Mσ(X)\mathbb{E}_{\mu}\circ\text{VaR}_{P_{\theta}}^{\alpha}(X)=M_{\sigma}(X).
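The identity 𝔼μ∘VaRᵅ = Mσ in Example 7 can be verified numerically. The sketch below is illustrative only: it assumes θᶜ = 1, α = 0.05 and takes μ to be the uniform density on [1, 2], in which case both integrals equal ln(1/α)·ln 2 in closed form.

```python
import math

# Illustrative constants (our own assumptions, not from the paper)
theta_c, alpha = 1.0, 0.05
lo, hi = 1.0, 2.0                       # support of the assumed uniform prior mu
n = 100_000                             # midpoint-rule resolution

def mu(theta):                          # density of Uniform[1, 2]
    return 1.0 if lo <= theta <= hi else 0.0

# LHS: E_mu VaR^alpha_{P_theta}(X) = int (-ln alpha / theta) mu(theta) dtheta
h = (hi - lo) / n
lhs = sum(-math.log(alpha) / (lo + (i + 0.5) * h) * h for i in range(n))

# RHS: M_sigma(X) = int VaR^{1-t}_{P_theta_c}(X) sigma(t) dt with the risk
# spectrum sigma from Example 7; sigma vanishes outside the t-interval below.
def sigma(t):
    u = math.log(1.0 - t)
    return mu(theta_c * math.log(alpha) / u) * \
        theta_c * math.log(1.0 / alpha) / ((1.0 - t) * u * u)

t_lo = 1.0 - alpha ** (theta_c / hi)    # image of theta = hi under t = 1 - f(theta)
t_hi = 1.0 - alpha ** (theta_c / lo)    # image of theta = lo
ht = (t_hi - t_lo) / n
rhs = sum(-math.log(1.0 - (t_lo + (i + 0.5) * ht)) / theta_c
          * sigma(t_lo + (i + 0.5) * ht) * ht for i in range(n))
# Both lhs and rhs should be close to ln(1/alpha) * ln(2)
```

The change of variables t = 1 − α^{θᶜ/θ} maps the prior's support [1, 2] onto the t-interval above, which is why the two Riemann sums agree.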

Generalizing from this, we consider a generic law-invariant coherent risk measure ρ\rho. We know from [61] that ρ\rho can be represented as

ρ(X)=supζ𝔇𝔼ζX,\rho(X)=\sup_{\zeta\in\mathfrak{D}}\mathbb{E}_{\zeta}X,

where 𝔇\mathfrak{D} represents a set of probability density functions.

By adopting a coherent risk measure for the outer risk measure ρμ\rho_{\mu}, we have

ρμVaRPθα(X):=supμ~𝔇ρμΘVaRPθα(X)μ~(θ)𝑑θ=supσ𝔄ρμ01VaRPθc1t(X)σ(t)𝑑tsupσ𝔄ρμMσ(X),\rho_{\mu}\circ\text{VaR}_{P_{\theta}}^{\alpha}(X):=\sup_{\tilde{\mu}\in\mathfrak{D}_{\rho_{\mu}}}\int_{\Theta}\text{VaR}_{P_{\theta}}^{\alpha}(X)\tilde{\mu}(\theta)d\theta=\sup_{\sigma\in\mathfrak{A}_{\rho_{\mu}}}\int_{0}^{1}\text{VaR}_{P_{\theta^{c}}}^{1-t}(X)\sigma(t)dt\equiv\sup_{\sigma\in\mathfrak{A}_{\rho_{\mu}}}M_{\sigma}(X), (3.23)

where 𝔇ρμ\mathfrak{D}_{\rho_{\mu}} is the domain of the Fenchel conjugate of ρμ\rho_{\mu} (see [61, Theorem 6.5]), and 𝔄ρμ\mathfrak{A}_{\rho_{\mu}} is the set of risk spectra corresponding to 𝔇ρμ\mathfrak{D}_{\rho_{\mu}} under the one-to-one correspondence between μ\mu and σ\sigma specified in (3.22). The second term in equation (3.23) can be viewed as a distributionally robust formulation of ρμVaRPθα(X)\rho_{\mu}\circ\text{VaR}_{P_{\theta}}^{\alpha}(X), while the third term is a preference robust formulation in terms of SRMs. The robustness in the former is concerned with the ambiguity of the environmental risk represented by μ~\tilde{\mu}, whereas the robustness in the latter is concerned with the ambiguity of the DM’s preference (represented by σ\sigma) associated with the ambiguity in the environmental risk.

Example 8 (Preference robust SOC/MDP).

Let the inner risk measure ρPθ\rho_{P_{\theta}} be VaRPθβ\text{VaR}_{P_{\theta}}^{\beta} and the outer risk measure ρμt\rho_{\mu_{t}} be a coherent risk measure. In this case, the objective function of the BCR-SOC/MDP model (3.1) becomes

minπρμ1VaRPθ1β(𝒞(s1,a1,ξ1)++γρμT1VaRPθT1β[𝒞(sT1,aT1,ξT1)+γ𝒞(sT)]).\min_{\pi}\rho_{\mu_{1}}\circ\text{VaR}_{P_{\theta_{1}}}^{\beta}\left(\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\rho_{\mu_{T-1}}\circ\text{VaR}_{P_{\theta_{T-1}}}^{\beta}\left[\mathcal{C}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}\left(s_{T}\right)\right]\right).

By virtue of (3.23), we can recast the problem as a preference robust SOC/MDP problem:

minπsupσ1𝔄ρμ1Mσ1[𝒞(s1,a1,ξ1)++γsupσT1𝔄ρμT1MσT1[𝒞(sT1,aT1,ξT1)+γ𝒞(sT)]],\min_{\pi}\sup_{\sigma_{1}\in\mathfrak{A}_{\rho_{\mu_{1}}}}M_{\sigma_{1}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\sup_{\sigma_{T-1}\in\mathfrak{A}_{\rho_{\mu_{T-1}}}}M_{\sigma_{T-1}}\left[\mathcal{C}\left(s_{T-1},a_{T-1},\xi_{T-1}\right)+\gamma\mathcal{C}\left(s_{T}\right)\right]\right],

where 𝔄ρμt\mathfrak{A}_{\rho_{\mu_{t}}}, 1tT11\leq t\leq T-1, is a set of risk spectra representing the uncertainty of the DM’s risk attitude at episode tt. By Bayes’ rule, the updates of μt\mu_{t} constitute a continuous process of learning and correction. Therefore, in the corresponding preference robust SOC/MDP model, the DM’s ambiguity set of risk attitudes 𝔄ρμt\mathfrak{A}_{\rho_{\mu_{t}}} also evolves continually as the sample data accumulate.

A similar correspondence between BCR-SOC/MDP and preference robust SOC/MDP can be established when the inner risk measure ρPθ\rho_{P_{\theta}} is AVaRPθα\text{AVaR}_{P_{\theta}}^{\alpha} and the outer risk measure ρμ\rho_{\mu} is a coherent risk measure. In that case, we can derive a representation akin to Kusuoka’s representation [36], with the Kusuoka ambiguity set being induced by μ\mu. This leads to another preference robust SOC/MDP model in which the Kusuoka ambiguity set represents the DM’s preference uncertainty.

In summary, the examples discussed above display the breadth of the BCR-SOC/MDP model’s scope and its suitability for adaptive, self-learning systems in risk control. Moreover, by judiciously selecting risk measures and configuring posterior distributions, we can align BCR-SOC/MDP models with preference robust SOC/MDP models.

4 Finite-horizon BCR-SOC/MDP

We now turn to discussing numerical methods for solving (3.1). As with the conventional SOC/MDP models, we need to separate the discussions depending on whether T<T<\infty or T=T=\infty. We begin with the former in this section. The finite-horizon BCR-SOC/MDP may be regarded as an extension of the risk-neutral MDP [38]. The key step is to derive dynamic recursive equations. In the literature on SOC/MDP and multistage stochastic programming, this relies heavily on the interchangeability and decomposability of the objective function [60]. However, the BCR-SOC/MDP does not require these conditions. The next proposition states this.

Proposition 3.

Let

VT(sT,μT)\displaystyle V_{T}^{*}\left(s_{T},\mu_{T}\right) =\displaystyle= 𝒞T(sT),(sT,μT)𝒮×𝒟1p,\displaystyle\mathcal{C}_{T}\left(s_{T}\right),\forall\left(s_{T},\mu_{T}\right)\in\mathcal{S}\times\mathcal{D}_{1}^{p}, (4.1a)
Vt(st,μt)\displaystyle V_{t}^{*}\left(s_{t},\mu_{t}\right) =\displaystyle= minat𝒜ρμtρPθt[𝒞t(st,at,ξt)+γVt+1(st+1,μt+1)],\displaystyle\min_{a_{t}\in\mathcal{A}}\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}_{t}\left(s_{t},a_{t},\xi_{t}\right)+\gamma V_{t+1}^{*}\left(s_{t+1},\mu_{t+1}\right)\right], (4.1b)
(st,μt)𝒮×𝒟1p,t=1,,T1,\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad\forall(s_{t},\mu_{t})\in\mathcal{S}\times\mathcal{D}_{1}^{p},\ t=1,\cdots,T-1,

where sts_{t} and μt\mu_{t} satisfy equations (3.1b) and (3.1c), respectively. Let 𝒥Tγ,(s1,μ1)\mathcal{J}_{T}^{\gamma,*}(s_{1},\mu_{1}) denote the optimal value of problem (3.1). Under Assumption 1,

𝒥Tγ,(s1,μ1)=V1(s1,μ1).\displaystyle\mathcal{J}_{T}^{\gamma,*}(s_{1},\mu_{1})=V_{1}^{*}(s_{1},\mu_{1}). (4.2)

Moreover, the optimal policy π={π1,,πT1}\pi^{*}=\{\pi_{1}^{*},\cdots,\pi_{T-1}^{*}\} corresponding to 𝒥Tγ,(s1,μ1)\mathcal{J}_{T}^{\gamma,*}(s_{1},\mu_{1}) exists and can be determined recursively by:

πt(st,μt)argmina𝒜ρμtρPθt[𝒞t(st,at,ξt)+γVt+1(st+1,μt+1)].\pi_{t}^{*}\left(s_{t},\mu_{t}\right)\in\arg\min_{a\in\mathcal{A}}\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}_{t}\left(s_{t},a_{t},\xi_{t}\right)+\gamma V_{t+1}^{*}\left(s_{t+1},\mu_{t+1}\right)\right]. (4.3)

By analogy with the analysis in [57], Proposition 3 is evident if we treat the state space 𝒮\mathcal{S} and the posterior distribution space 𝒟1p\mathcal{D}_{1}^{p} together as an augmented state space, see [38] for the latter. In alignment with the discussions in [12, 59], the BCR-SOC/MDP formulation (3.1) emphasizes the critical aspect of time consistency within the nested risk function framework. Thus, for the proposed BCR-SOC/MDP model with a finite horizon, we can recursively solve it using a dynamic programming solution procedure such as that outlined in Algorithm 1. We will discuss the details of solving the Bellman equations (4.1a)-(4.1b) in Section 6.

Algorithm 1 Dynamic Programming Method for Finite-horizon BCR-SOC/MDP (3.1).
1:  Initialization: VT(sT,μT)=𝒞T(sT)V_{T}^{*}\left(s_{T},\mu_{T}\right)=\mathcal{C}_{T}\left(s_{T}\right), for all (sT,μT)𝒮×𝒟1p\left(s_{T},\mu_{T}\right)\in\mathcal{S}\times\mathcal{D}_{1}^{p}
2:  for t=T1,T2,,1t=T-1,T-2,\cdots,1 do
3:     for all (st,μt)𝒮×𝒟1p\left(s_{t},\mu_{t}\right)\in\mathcal{S}\times\mathcal{D}_{1}^{p} do
4:        Solve problem
mina𝒜\displaystyle\min_{a\in\mathcal{A}} ρμtρPθt[𝒞t(st,at,ξt)+γVt+1(s,μ)]\displaystyle\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}_{t}\left(s_{t},a_{t},\xi_{t}\right)+\gamma V_{t+1}^{*}\left(s^{\prime},\mu^{\prime}\right)\right] (4.4a)
s.t. s=gt(st,at,ξt),μ=p(ξt|θ)μt(θ)Θp(ξt|θ)μt(θ)𝑑θ\displaystyle s^{\prime}=g_{t}(s_{t},a_{t},\xi_{t}),\ \mu^{\prime}=\frac{p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)}{\int_{\Theta}p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)d\theta} (4.4b)
to obtain an optimal policy πt(st,μt)\pi_{t}^{*}\left(s_{t},\mu_{t}\right).
5:        Setting Vt(st,μt):=ρμtρPθt[𝒞t(st,πt(st,μt),ξt)+γVt+1(s,μ)]V_{t}^{*}\left(s_{t},\mu_{t}\right):=\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}_{t}\left(s_{t},\pi_{t}^{*}\left(s_{t},\mu_{t}\right),\xi_{t}\right)+\gamma V_{t+1}^{*}\left(s^{\prime},\mu^{\prime}\right)\right].
6:     end for
7:  end for
8:  Setting 𝒥Tγ,(s,μ):=V1(s,μ)\mathcal{J}_{T}^{\gamma,*}(s,\mu):=V_{1}^{*}\left(s,\mu\right) for all (s,μ)𝒮×𝒟1p(s,\mu)\in\mathcal{S}\times\mathcal{D}_{1}^{p}.
9:  return  Optimal policy π={π1,π2,,πT1}\pi^{*}=\{\pi^{*}_{1},\pi^{*}_{2},\cdots,\pi^{*}_{T-1}\} and the optimal value function 𝒥Tγ,\mathcal{J}_{T}^{\gamma,*}.
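To illustrate the backward recursion in Algorithm 1 in the simplest discrete setting, the following sketch solves a toy instance in which both the inner and outer risk measures are expectations (so the recursion reduces to the Bayes-adaptive form); all problem data here are illustrative assumptions, not taken from the paper.

```python
from functools import lru_cache

# Toy instance: physical states S = {0, 1}, actions A = {0, 1}, noise
# xi ~ Bernoulli(theta) with theta in {0.3, 0.7}, discrete prior mu on theta.
THETAS = (0.3, 0.7)
GAMMA, T = 0.9, 3

def cost(s, a, xi):                 # stage cost C_t (terminal cost taken as 0)
    return (s - xi) ** 2 + 0.1 * a

def g(s, a, xi):                    # state dynamics s' = g(s, a, xi)
    return min(1, max(0, s + a - xi))

def update(mu, xi):                 # Bayes update of the discrete posterior
    lik = [th if xi == 1 else 1.0 - th for th in THETAS]
    z = sum(l * m for l, m in zip(lik, mu))
    return tuple(l * m / z for l, m in zip(lik, mu))

@lru_cache(maxsize=None)
def V(t, s, mu):                    # backward recursion (4.1a)-(4.1b)
    if t == T:
        return 0.0
    return min(
        sum(m * sum(p * (cost(s, a, xi)
                         + GAMMA * V(t + 1, g(s, a, xi), update(mu, xi)))
                    for xi, p in ((1, th), (0, 1.0 - th)))   # inner E_{P_theta}
            for th, m in zip(THETAS, mu))                    # outer E_mu
        for a in (0, 1))

v = V(1, 0, (0.5, 0.5))             # optimal value from s1 = 0, uniform prior
```

The augmented state is the pair (s, μ); since μ is a finite-dimensional vector here, memoization over (t, s, μ) is exact, which is precisely the discrete case in which Algorithm 1 is directly implementable.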

A key step in Algorithm 1 is to solve problem (4.4) for all (st,μt)𝒮×𝒟1p\left(s_{t},\mu_{t}\right)\in\mathcal{S}\times\mathcal{D}_{1}^{p}. This is implementable in practice only when 𝒮×𝒟1p\mathcal{S}\times\mathcal{D}_{1}^{p} is a discrete set with finitely many elements. Moreover, it is often difficult to obtain a closed form of the objective function when ξ\xi is continuously distributed. All this means that the algorithm is conceptual as it stands. We will return to these issues in Section 6. Another issue is the convexity of problem (4.4). We need to ensure that it is a convex program so that πt(st,μt)\pi_{t}^{*}\left(s_{t},\mu_{t}\right) is an optimal solution. It is sufficient to ensure that Vt+1(gt(st,at,ξt),μ)V_{t+1}^{*}(g_{t}(s_{t},a_{t},\xi_{t}),\mu^{\prime}) is convex in ata_{t}. We address this below. The following assumption is required.

Assumption 2.

For all ξtΞ\xi_{t}\in\Xi, 𝒞t(st,at,ξt)\mathcal{C}_{t}(s_{t},a_{t},\xi_{t}) and gt(st,at,ξt)g_{t}(s_{t},a_{t},\xi_{t}) are jointly convex with respect to (st,at)𝒮×𝒜(s_{t},a_{t})\in\mathcal{S}\times\mathcal{A}.

By comparison with the assumptions in the existing literature (e.g., [22, 63]), Assumption 2 is mild. In these works and in many practical applications, it is typically assumed that gtg_{t} is affine. Note that the convexity of gtg_{t}, together with the monotonicity of the risk measures, ensures that problem (4.1b) is convex, which facilitates its numerical solution. The next proposition states the convexity of problem (4.1b).

Proposition 4.

Let Assumptions 1 and 2 hold. If ρμt()\rho_{\mu_{t}}(\cdot) and ρPθt()\rho_{P_{\theta_{t}}}(\cdot) are convex risk measures for each episode t=1,,Tt=1,\cdots,T, then the following assertions hold.

  • (i)

    Problem (4.1b) is a convex program.

  • (ii)

    If, in addition, 𝒞t(st,at,ξt)\mathcal{C}_{t}(s_{t},a_{t},\xi_{t}) and gt(st,at,ξt)g_{t}(s_{t},a_{t},\xi_{t}) are Lipschitz continuous in st𝒮s_{t}\in\mathcal{S} uniformly for all ata_{t} and ξt\xi_{t} with moduli L𝒞L_{\mathcal{C}} and LgL_{g}, respectively, then Vt(st,μt)V^{*}_{t}(s_{t},\mu_{t}) is Lipschitz continuous in sts_{t}, i.e.,

    |Vt(st2,μt)Vt(st1,μt)|(L𝒞+γLVt+1Lg)st1st2,st1,st2𝒮,\displaystyle|V_{t}^{*}(s_{t}^{2},\mu_{t})-V_{t}^{*}(s_{t}^{1},\mu_{t})|\leq(L_{\mathcal{C}}+\gamma L_{V_{t+1}}L_{g})\left\|s_{t}^{1}-s_{t}^{2}\right\|,\forall s_{t}^{1},s_{t}^{2}\in{\cal S}, (4.5)

    for t=1,,T1t=1,\cdots,T-1, where LVt=i=0Tt(γLg)iL𝒞L_{V_{t}}=\sum_{i=0}^{T-t}(\gamma L_{g})^{i}L_{\mathcal{C}}.

  • (iii)

    If, further, 𝒞t(st,at,ξt)\mathcal{C}_{t}(s_{t},a_{t},\xi_{t}) and gt(st,at,ξt)g_{t}(s_{t},a_{t},\xi_{t}) are Lipschitz continuous in at𝒜a_{t}\in\mathcal{A} uniformly for all sts_{t} and ξt\xi_{t}, then 𝒞t(st,at,ξt)+γVt+1(st+1,μt+1){\cal C}_{t}(s_{t},a_{t},\xi_{t})+\gamma V^{*}_{t+1}(s_{t+1},\mu_{t+1}) is Lipschitz continuous in ata_{t}, i.e.,

    |𝒞t(st,at1,ξt)+γVt+1(gt(st,at1,ξt),μ)𝒞t(st,at2,ξt)γVt+1(gt(st,at2,ξt),μ)|\displaystyle|\mathcal{C}_{t}\left(s_{t},a_{t}^{1},\xi_{t}\right)+\gamma V^{*}_{t+1}\left(g_{t}(s_{t},a_{t}^{1},\xi_{t}),\mu^{\prime}\right)-\mathcal{C}_{t}\left(s_{t},a_{t}^{2},\xi_{t}\right)-\gamma V^{*}_{t+1}\left(g_{t}(s_{t},a_{t}^{2},\xi_{t}),\mu^{\prime}\right)| (4.6)
    \displaystyle\leq (L𝒞+γLVt+1Lg)at1at2,at1,at2𝒜,\displaystyle(L_{\mathcal{C}}+\gamma L_{V_{t+1}}L_{g})\left\|a_{t}^{1}-a_{t}^{2}\right\|,\forall a_{t}^{1},a_{t}^{2}\in{\cal A},

    for t=1,,T1t=1,\cdots,T-1.

Proof. Part (i). Due to Assumption 1 (e), it suffices to show that ρμtρPθt[𝒞t(st,at,ξt)+γVt+1(st+1,μt+1)]\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}_{t}(s_{t},a_{t},\xi_{t})+\gamma V^{*}_{t+1}\left(s_{t+1},\mu_{t+1}\right)\right] is jointly convex in (st,at)(s_{t},a_{t}). Since ρμt()\rho_{\mu_{t}}(\cdot) and ρPθt()\rho_{P_{\theta_{t}}}(\cdot) are convex risk measures, the BCR is convex and monotone; hence, under Assumption 2, the composite function is jointly convex provided that Vt+1(gt(st,at,ξt),μt+1)V^{*}_{t+1}\left(g_{t}(s_{t},a_{t},\xi_{t}),\mu_{t+1}\right) is convex in (st,at)(s_{t},a_{t}). In what follows, we use backward induction to derive the convexity of VtV^{*}_{t} in sts_{t}. Observe that VT(sT,μT)=𝒞T(sT),(sT,μT)𝒮×𝒟1pV_{T}^{*}\left(s_{T},\mu_{T}\right)=\mathcal{C}_{T}\left(s_{T}\right),\forall\left(s_{T},\mu_{T}\right)\in\mathcal{S}\times\mathcal{D}_{1}^{p}. Thus, the conclusion holds at episode TT. Assuming that the convexity holds for episodes from t+1t+1 to TT, we demonstrate the convexity at episode tt. Since Vt+1V^{*}_{t+1} is convex in st+1s_{t+1}, the function ρμtρPθt[𝒞t(st,at,ξt)+γVt+1(st+1,μt+1)]\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}_{t}\left(s_{t},a_{t},\xi_{t}\right)+\gamma V_{t+1}^{*}\left(s_{t+1},\mu_{t+1}\right)\right] is jointly convex in (st,at)(s_{t},a_{t}), and the partial minimization over at𝒜a_{t}\in\mathcal{A} in (4.1b) preserves the convexity of VtV_{t}^{*} in sts_{t}.

Part (ii). We also use backward induction to establish the Lipschitz continuity of Vt(st,μt)V^{*}_{t}(s_{t},\mu_{t}) in sts_{t}. At episode TT, the Lipschitz continuity is evident since VT(sT,μT)=𝒞T(sT),(sT,μT)𝒮×𝒟1pV_{T}^{*}\left(s_{T},\mu_{T}\right)=\mathcal{C}_{T}\left(s_{T}\right),\forall\left(s_{T},\mu_{T}\right)\in\mathcal{S}\times\mathcal{D}_{1}^{p}. Assuming that the Lipschitz continuity holds from episodes t+1t+1 to TT, we demonstrate that it also holds at episode tt. Under the induction hypothesis, Vt+1V^{*}_{t+1} is Lipschitz continuous in st+1s_{t+1} with modulus LVt+1L_{V_{t+1}}. For any st1,st2𝒮{s}^{1}_{t},s^{2}_{t}\in\mathcal{S},

|ρμtρPθt[𝒞t(st1,at,ξt)+γVt+1(gt(st1,at,ξt),μ)]ρμtρPθt[𝒞t(st2,at,ξt)+γVt+1(gt(st2,at,ξt),μ)]|\displaystyle\left|\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}_{t}\left(s_{t}^{1},{a}_{t},\xi_{t}\right)+\gamma V^{*}_{t+1}\left(g_{t}(s_{t}^{1},a_{t},\xi_{t}),\mu^{\prime}\right)\right]-\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}_{t}\left(s_{t}^{2},{a}_{t},\xi_{t}\right)+\gamma V^{*}_{t+1}\left(g_{t}(s_{t}^{2},a_{t},\xi_{t}),\mu^{\prime}\right)\right]\right|
\displaystyle\leq ρμtρPθt|𝒞t(st1,at,ξt)+γVt+1(gt(st1,at,ξt),μ)𝒞t(st2,at,ξt)γVt+1(gt(st2,at,ξt),μ)|\displaystyle\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left|\mathcal{C}_{t}\left(s_{t}^{1},{a}_{t},\xi_{t}\right)+\gamma V^{*}_{t+1}\left(g_{t}(s_{t}^{1},a_{t},\xi_{t}),\mu^{\prime}\right)-\mathcal{C}_{t}\left(s^{2}_{t},{a}_{t},\xi_{t}\right)-\gamma V^{*}_{t+1}\left(g_{t}(s_{t}^{2},a_{t},\xi_{t}),\mu^{\prime}\right)\right|
\displaystyle\leq ρμtρPθt(L𝒞st1st2+γLVt+1gt(st1,at,ξt)gt(st2,at,ξt))\displaystyle\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left(L_{\mathcal{C}}\left\|s_{t}^{1}-s_{t}^{2}\right\|+\gamma L_{V_{t+1}}\left\|g_{t}(s_{t}^{1},a_{t},\xi_{t})-g_{t}(s_{t}^{2},a_{t},\xi_{t})\right\|\right)
\displaystyle\leq (L𝒞+γLVt+1Lg)st1st2.\displaystyle(L_{\mathcal{C}}+\gamma L_{V_{t+1}}L_{g})\left\|s_{t}^{1}-s_{t}^{2}\right\|.

Thus

Vt(st1,μt)Vt(st2,μt)\displaystyle V_{t}^{*}(s_{t}^{1},\mu_{t})-V_{t}^{*}(s_{t}^{2},\mu_{t})
\displaystyle\leq supat𝒜|ρμtρPθt[𝒞t(st1,at,ξt)+γVt+1(gt(st1,at,ξt),μ)]ρμtρPθt[𝒞t(st2,at,ξt)+γVt+1(gt(st2,at,ξt),μ)]|\sup_{a_{t}\in\mathcal{A}}\left|\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}[\mathcal{C}_{t}\left(s_{t}^{1},a_{t},\xi_{t}\right)+\gamma V^{*}_{t+1}\left(g_{t}(s_{t}^{1},{a}_{t},\xi_{t}),\mu^{\prime}\right)]-\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}[\mathcal{C}_{t}\left(s^{2}_{t},{a}_{t},\xi_{t}\right)+\gamma V^{*}_{t+1}\left(g_{t}(s_{t}^{2},a_{t},\xi_{t}),\mu^{\prime}\right)]\right|
\displaystyle\leq (L𝒞+γLVt+1Lg)st1st2.\displaystyle(L_{\mathcal{C}}+\gamma L_{V_{t+1}}L_{g})\left\|s_{t}^{1}-s_{t}^{2}\right\|.

A similar conclusion can be derived by swapping the positions of Vt(st1,μt)V_{t}^{*}(s_{t}^{1},\mu_{t}) and Vt(st2,μt)V_{t}^{*}(s_{t}^{2},\mu_{t}). Therefore, we have

|Vt(st2,μt)Vt(st1,μt)|(L𝒞+γLVt+1Lg)st1st2.|V_{t}^{*}(s_{t}^{2},\mu_{t})-V_{t}^{*}(s_{t}^{1},\mu_{t})|\leq(L_{\mathcal{C}}+\gamma L_{V_{t+1}}L_{g})\left\|s_{t}^{1}-s_{t}^{2}\right\|.

This deduction confirms by mathematical induction that the Lipschitz continuity property holds for all episodes.

Part (iii). For any at1,at2𝒜{a}^{1}_{t},a^{2}_{t}\in\mathcal{A}, we have:

|𝒞t(st,at1,ξt)+γVt+1(gt(st,at1,ξt),μ)𝒞t(st,at2,ξt)+γVt+1(gt(st,at2,ξt),μ)|\displaystyle|\mathcal{C}_{t}\left(s_{t},a_{t}^{1},\xi_{t}\right)+\gamma V^{*}_{t+1}\left(g_{t}(s_{t},a_{t}^{1},\xi_{t}),\mu^{\prime}\right)-\mathcal{C}_{t}\left(s_{t},a_{t}^{2},\xi_{t}\right)+\gamma V^{*}_{t+1}\left(g_{t}(s_{t},a_{t}^{2},\xi_{t}),\mu^{\prime}\right)|
\displaystyle\leq L𝒞at1at2+γLVt+1gt(st,at1,ξt)gt(st,at2,ξt)\displaystyle L_{\mathcal{C}}\left\|a_{t}^{1}-a_{t}^{2}\right\|+\gamma L_{V_{t+1}}||g_{t}(s_{t},a_{t}^{1},\xi_{t})-g_{t}(s_{t},a_{t}^{2},\xi_{t})||
\displaystyle\leq (L𝒞+γLVt+1Lg)at1at2\displaystyle(L_{\mathcal{C}}+\gamma L_{V_{t+1}}L_{g})\left\|a_{t}^{1}-a_{t}^{2}\right\|

for t=1,,T1t=1,\cdots,T-1. \Box
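The Lipschitz moduli appearing in Proposition 4 can be sanity-checked numerically: the closed form L_{V_t} = Σ_{i=0}^{T−t}(γL_g)^i L_𝒞 satisfies the one-step recursion L_{V_t} = L_𝒞 + γL_g L_{V_{t+1}} with L_{V_T} = L_𝒞 that is implicit in the proof (the constants below are arbitrary illustrative choices):

```python
# Illustrative constants; any positive values would do.
L_C, L_g, gamma, T = 2.0, 1.5, 0.9, 6

def closed_form(t):
    """L_{V_t} = sum_{i=0}^{T-t} (gamma * L_g)^i * L_C from Proposition 4(ii)."""
    return sum((gamma * L_g) ** i * L_C for i in range(T - t + 1))

# Backward recursion implicit in the induction step of the proof
L = {T: L_C}
for t in range(T - 1, 0, -1):
    L[t] = L_C + gamma * L_g * L[t + 1]
```

Note that when γL_g < 1 the moduli stay bounded as T grows, which anticipates the contraction argument used for the infinite-horizon case in the next section.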

5 Infinite-horizon BCR-SOC/MDP

In this section, we move on to discuss the BCR-SOC/MDP problem (3.1) with infinite-horizon, i.e., T=T=\infty. We start with the following assumption.

Assumption 3.

Consider problem (3.1). (a) The transition dynamics and cost functions are time-invariant, i.e., gt(s,a,ξ)g(s,a,ξ)g_{t}(s,a,\xi)\equiv g(s,a,\xi) and 𝒞t(s,a,ξ)𝒞(s,a,ξ)\mathcal{C}_{t}(s,a,\xi)\equiv\mathcal{C}(s,a,\xi) for all tt. (b) For all (s,a,ξ)𝒮×𝒜×Ξ(s,a,\xi)\in\mathcal{S}\times\mathcal{A}\times\Xi, the absolute value of the cost function |𝒞(s,a,ξ)||\mathcal{C}(s,a,\xi)| is bounded by a constant 𝒞¯\bar{\mathcal{C}}. (c) The discount factor γ(0,1)\gamma\in(0,1).

Under Assumption 3, we can formulate the BCR-SOC/MDP model (3.1) as follows:

minπ\displaystyle\min_{\pi} 𝒥γ(π,s1,μ1):=ρμ1ρPθ1[𝒞(s1,a1,ξ1)++γρμtρPθt[𝒞(st,at,ξt)+]]\displaystyle\mathcal{J}_{\infty}^{\gamma}(\pi,s_{1},\mu_{1}):=\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\cdots+\gamma\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[\mathcal{C}\left(s_{t},a_{t},\xi_{t}\right)+\cdots\right]\right] (5.1a)
s.t. st+1=g(st,at,ξt),at=π(st,μt),t=1,2,;\displaystyle s_{t+1}=g\left(s_{t},a_{t},\xi_{t}\right),\ a_{t}=\pi(s_{t},\mu_{t}),\ t=1,2,\cdots; (5.1b)
μt+1(θ)=p(ξt|θ)μt(θ)Θp(ξt|θ)μt(θ)𝑑θ,t=1,2,.\displaystyle\mu_{t+1}(\theta)=\frac{p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)}{\int_{\Theta}p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)d\theta},\ t=1,2,\cdots.

We use 𝒥γ,\mathcal{J}_{\infty}^{\gamma,*} to denote the optimal value of problem (5.1). Given the boundedness of the cost function 𝒞(s,a,ξ)\mathcal{C}(s,a,\xi), the value function 𝒥γ,\mathcal{J}_{\infty}^{\gamma,*} is also bounded, i.e.,

𝒥γ,:=sup(s,μ)𝒮×𝒟10|𝒥γ,(s,μ)|𝒞¯1γ.||\mathcal{J}_{\infty}^{\gamma,*}||_{\infty}:=\sup_{(s,\mu)\in\mathcal{S}\times\mathcal{D}^{0}_{1}}|\mathcal{J}_{\infty}^{\gamma,*}(s,\mu)|\leq\frac{\bar{\mathcal{C}}}{1-\gamma}.

5.1 Bellman equation and optimality

It is challenging to solve the infinite-horizon BCR-SOC/MDP problem (5.1). A standard approach is to use the Bellman equation to develop an iterative scheme analogous to those used in the existing SOC/MDP models (e.g., [22, 63, 82]). We begin by defining the Bellman equation of (5.1).

Definition 1 (Deterministic Bellman operator).

Let 𝔅(𝒮,𝒟10)\mathfrak{B}(\mathcal{S},\mathcal{D}_{1}^{0}) denote the space of real-valued bounded measurable functions on 𝒮×𝒟10\mathcal{S}\times\mathcal{D}_{1}^{0}. For any value function V𝔅(𝒮,𝒟10)V\in\mathfrak{B}(\mathcal{S},\mathcal{D}_{1}^{0}), define the deterministic operator 𝒯:𝔅(𝒮,𝒟10)𝔅(𝒮,𝒟10)\mathcal{T}:\mathfrak{B}(\mathcal{S},\mathcal{D}_{1}^{0})\rightarrow\mathfrak{B}(\mathcal{S},\mathcal{D}_{1}^{0}) as follows:

(𝒯V)(s,μ)=mina𝒜ρμρPθ(𝒞(s,a,ξ)+γV(s,μ)) for alls,μ(\mathcal{T}V)(s,\mu)=\min_{a\in\mathcal{A}}\rho_{\mu}\circ\rho_{P_{\theta}}\left(\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)\text{ for all}\;\;s,\mu (5.2)

and operator 𝒯π:𝔅(𝒮,𝒟10)𝔅(𝒮,𝒟10)\mathcal{T}^{\pi}:\mathfrak{B}(\mathcal{S},\mathcal{D}_{1}^{0})\rightarrow\mathfrak{B}(\mathcal{S},\mathcal{D}_{1}^{0}) for a given π\pi:

(𝒯πV)(s,μ)=ρμρPθ(𝒞(s,π(s,μ),ξ)+γV(s,μ)).\displaystyle\left(\mathcal{T}^{\pi}V\right)(s,\mu)=\rho_{\mu}\circ\rho_{P_{\theta}}\left(\mathcal{C}(s,\pi(s,\mu),\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right). (5.3)

The dynamic programming formulation integrated into the Bellman operator ensures that risk factors are effectively incorporated into the decision-making process. This approach accounts systematically for both epistemic and aleatoric uncertainties, which are crucial when determining the optimal policy at each decision point. In the Bellman equation presented above, the random variable ξ\xi serves as a general representation of the stochastic input in each decision episode. This simplification helps to maintain the recursive structure of the model without the need to explicitly track time indices for the random variables.
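As a minimal illustration of the operator 𝒯, the following sketch iterates (5.2) in a drastically simplified setting: a finite state space, a posterior degenerate at a single known θ, and expectation for both risk measures, so that 𝒯 reduces to the classical Bellman operator and value iteration converges by γ-contraction. All problem data are illustrative assumptions.

```python
GAMMA = 0.9
S, A = (0, 1), (0, 1)
THETA = 0.3                         # assumed known parameter: xi ~ Bernoulli(THETA)

def cost(s, a, xi):
    return abs(s - xi) + 0.2 * a

def g(s, a, xi):
    return min(1, max(0, s + a - xi))

def bellman(V):
    """(T V)(s) = min_a E_xi [ C(s, a, xi) + gamma * V(g(s, a, xi)) ]."""
    return {s: min(sum(p * (cost(s, a, xi) + GAMMA * V[g(s, a, xi)])
                       for xi, p in ((1, THETA), (0, 1.0 - THETA)))
                   for a in A)
            for s in S}

V = {s: 0.0 for s in S}
for _ in range(500):                # fixed-point iteration V <- T V
    V = bellman(V)
residual = max(abs(bellman(V)[s] - V[s]) for s in S)   # ~0 at the fixed point
```

In the full model the iterate would be a function of the augmented state (s, μ) and the expectations would be replaced by the chosen ρ_μ and ρ_{P_θ}; the contraction argument supporting this iteration is formalized in Lemma 3 below.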

Model (5.1) and the Bellman equation (5.2) assume deterministic policies, but the rationale behind this set-up has not yet been explained fully. The next lemma states that optimizing over randomized policies is essentially equivalent to optimizing over deterministic ones. The latter will simplify the decision problem and improve computational efficiency.

Lemma 2.

The deterministic Bellman operator is equivalent to the random Bellman operator, i.e.,

(𝒯V)(s,μ)\displaystyle(\mathcal{T}V)(s,\mu) =mina𝒜ρμρPθ(𝒞(s,a,ξ)+γV(s,μ))\displaystyle=\min_{a\in\mathcal{A}}\rho_{\mu}\circ\rho_{P_{\theta}}\left(\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)
=infπΠR𝔼aπ(s,μ)[ρμρPθ(𝒞(s,a,ξ)+γV(s,μ))],\displaystyle=\inf_{\pi\in\Pi^{R}}\mathbb{E}_{a\sim\pi(s,\mu)}\left[\rho_{\mu}\circ\rho_{P_{\theta}}\left(\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)\right],

where ΠR\Pi^{R} denotes the set of all randomized policies.

Proof. Denote an arbitrary randomized policy π\pi by π(s,μ,a)=P(π(s,μ)=a)\pi(s,\mu,a)=P(\pi(s,\mu)=a). For any value function VV, we have

infa𝒜ρμρPθ[𝒞(s,a,ξ)+γV(s,μ)]𝒜ρμρPθ[𝒞(s,a,ξ)+γV(s,μ)]π(s,μ,da).\inf_{a\in\mathcal{A}}\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]\leq\int_{\mathcal{A}}\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]\pi(s,\mu,da).

Hence,

(𝒯V)(s,μ)=mina𝒜ρμρPθ[𝒞(s,a,ξ)+γV(s,μ)]infπΠR𝔼aπ(s,μ)[ρμρPθ[𝒞(s,a,ξ)+γV(s,μ)]].(\mathcal{T}V)(s,\mu)=\min_{a\in\mathcal{A}}\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]\leq\inf_{\pi\in\Pi^{R}}\mathbb{E}_{a\sim\pi(s,\mu)}\left[\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]\right].

The reverse is straightforward, as a deterministic policy is a special case of a randomized policy. \Box
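The inequality at the heart of this proof can be checked numerically. The sketch below uses purely hypothetical composite-risk values $Q(a)$, standing in for $\rho_{\mu}\circ\rho_{P_{\theta}}[\mathcal{C}(s,a,\xi)+\gamma V(s^{\prime},\mu^{\prime})]$ at a single state, to confirm that no randomized mixture of actions can improve on the best deterministic action:

```python
import random

# A numerical sanity check of Lemma 2 in a single state, with hypothetical
# composite-risk values Q[a] standing in for rho_mu . rho_{P_theta}[C + gamma V]:
# no randomized mixture of actions can beat the best deterministic action.
Q = {"a1": 4.0, "a2": 2.5, "a3": 3.1}

det_value = min(Q.values())          # deterministic Bellman operator

random.seed(0)
best_mixed = float("inf")
for _ in range(1000):
    w = [random.random() for _ in Q]
    total = sum(w)
    mixed = sum(wi / total * q for wi, q in zip(w, Q.values()))  # E_{a~pi} Q(a)
    best_mixed = min(best_mixed, mixed)

assert det_value <= best_mixed       # the infimum over mixtures is min(Q)
print(det_value)  # -> 2.5
```

The randomized value is a convex combination of the deterministic ones, so its infimum is attained by a Dirac policy on the minimizing action, exactly as in the proof.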

The proof leverages an arbitrary randomized policy $\pi$ to demonstrate that the deterministic Bellman operator indeed coincides with the randomized Bellman operator, thereby justifying our focus on deterministic policies. This restriction, however, does not preclude the extension of our analysis to randomized policies. In what follows, we demonstrate that the optimal value function $\mathcal{J}_{\infty}^{\gamma,*}$ of BCR-SOC/MDP problem (5.1) is the unique fixed point of the operator $\mathcal{T}$. Specifically,

𝒯𝒥γ,(s,μ)=𝒥γ,(s,μ),(s,μ)𝒮×𝒟10.\displaystyle\mathcal{T}\mathcal{J}_{\infty}^{\gamma,*}(s,\mu)=\mathcal{J}_{\infty}^{\gamma,*}(s,\mu),\forall(s,\mu)\in{\cal S}\times{\cal D}_{1}^{0}. (5.4)

To this end, we derive an intermediate result to ensure that the Bellman operators defined in Definition 1 are monotonic and contractive.

Lemma 3 (Monotonicity and contraction).

Let 𝒯\mathcal{T} and 𝒯π\mathcal{T}^{\pi} be defined as in (5.2) and (5.3). The following assertions hold.

  • (i)

    Both 𝒯\mathcal{T} and 𝒯π\mathcal{T}^{\pi} are monotonic, i.e., VVV\leq V^{\prime} implies that 𝒯V𝒯V\mathcal{T}V\leq\mathcal{T}V^{\prime} and 𝒯πV𝒯πV\mathcal{T}^{\pi}V\leq\mathcal{T}^{\pi}V^{\prime}.

  • (ii)

    For any measurable functions fi:Θf_{i}:\Theta\to\mathbb{R} and gi:Ξg_{i}:\Xi\to\mathbb{R}, i=1,2i=1,2,

    |ρμ(f1(θ))ρμ(f2(θ))|f1(θ)f2(θ)θ,\displaystyle|\rho_{\mu}(f_{1}(\theta))-\rho_{\mu}(f_{2}(\theta))|\leq\|f_{1}(\theta)-f_{2}(\theta)\|_{\theta,\infty} (5.5)

    and

    |ρPθ(g1(ξ))ρPθ(g2(ξ))|g1(ξ)g2(ξ)ξ,,\displaystyle|\rho_{P_{\theta}}(g_{1}(\xi))-\rho_{P_{\theta}}(g_{2}(\xi))|\leq\|g_{1}(\xi)-g_{2}(\xi)\|_{\xi,\infty}, (5.6)

    where θ,\|\cdot\|_{\theta,\infty} is the sup-norm under probability measure μ\mu and ξ,\|\cdot\|_{\xi,\infty} is the sup-norm under probability measure PθP_{\theta}.

  • (iii)

    The operators $\mathcal{T}$ and $\mathcal{T}^{\pi}$ are $\gamma$-contractive with respect to the $\|\cdot\|_{\infty}$ norm. That is, for any bounded value functions $V,V^{\prime}\in\mathfrak{B}(\mathcal{S},\mathcal{D}_{1}^{0})$, we have

    𝒯V𝒯VγVV and 𝒯πV𝒯πVγVV.\left\|\mathcal{T}V-\mathcal{T}V^{\prime}\right\|_{\infty}\leq\gamma\left\|V-V^{\prime}\right\|_{\infty}\text{ and }\left\|\mathcal{T}^{\pi}V-\mathcal{T}^{\pi}V^{\prime}\right\|_{\infty}\leq\gamma\left\|V-V^{\prime}\right\|_{\infty}.

Proof. Part (i). By definition

(𝒯πV)(s,μ)=ρμρPθ[𝒞(s,π(s,μ),ξ)+γV(s,μ)].\left(\mathcal{T}^{\pi}V\right)(s,\mu)=\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,\pi(s,\mu),\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right].

If two value functions VV and VV^{\prime} satisfy V(s,μ)V(s,μ)V\left(s^{\prime},\mu^{\prime}\right)\leq V^{\prime}\left(s^{\prime},\mu^{\prime}\right) for all s𝒮s^{\prime}\in\mathcal{S} and μ𝒟10\mu^{\prime}\in\mathcal{D}_{1}^{0}, then we can deduce (𝒯πV)(s,μ)(𝒯πV)(s,μ)\left(\mathcal{T}^{\pi}V\right)(s,\mu)\leq\left(\mathcal{T}^{\pi}V^{\prime}\right)(s,\mu) given the monotonicity of the composite risk measure. The same argument applies to 𝒯\mathcal{T}.

Part (ii). The inequalities are known as non-expansive properties of a risk measure satisfying translation invariance, see (2.10).

Part (iii). For any ε>0\varepsilon>0, there exists a deterministic policy π\pi, such that for any (s,μ)𝒮×𝒟10(s,\mu)\in\mathcal{S}\times\mathcal{D}_{1}^{0},

(𝒯V)(s,μ)ρμρPθ[𝒞(s,π(s,μ),ξ)+γV(s,μ)]ε.(\mathcal{T}V^{\prime})(s,\mu)\geq\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,\pi(s,\mu),\xi)+\gamma V^{\prime}\left(s^{\prime},\mu^{\prime}\right)\right]-\varepsilon.

Consequently, we have

(𝒯V)(s,μ)(𝒯V)(s,μ)\displaystyle(\mathcal{T}V)(s,\mu)-(\mathcal{T}V^{\prime})(s,\mu)
\displaystyle\leq ρμρPθ[𝒞(s,π(s,μ),ξ)+γV(s,μ)]ρμρPθ[𝒞(s,π(s,μ),ξ)+γV(s,μ)]+ε\displaystyle\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,\pi(s,\mu),\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]-\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,\pi(s,\mu),\xi)+\gamma V^{\prime}\left(s^{\prime},\mu^{\prime}\right)\right]+\varepsilon
\displaystyle\leq γ(ρμρPθ[V(s,μ)]ρμρPθ[V(s,μ)])+ε\displaystyle\gamma\left(\rho_{\mu}\circ\rho_{P_{\theta}}\left[V\left(s^{\prime},\mu^{\prime}\right)\right]-\rho_{\mu}\circ\rho_{P_{\theta}}\left[V^{\prime}\left(s^{\prime},\mu^{\prime}\right)\right]\right)+\varepsilon
\displaystyle\leq γρPθ[V(s,μ)]ρPθ[V(s,μ)]θ,+ε\displaystyle\gamma\left\|\rho_{P_{\theta}}\left[V\left(s^{\prime},\mu^{\prime}\right)\right]-\rho_{P_{\theta}}\left[V^{\prime}\left(s^{\prime},\mu^{\prime}\right)\right]\right\|_{\theta,\infty}+\varepsilon
\displaystyle\leq γV(s,μ)V(s,μ)ξ,θ,+ε\displaystyle\gamma\left\|\left\|V\left(s^{\prime},\mu^{\prime}\right)-V^{\prime}\left(s^{\prime},\mu^{\prime}\right)\right\|_{\xi,\infty}\right\|_{\theta,\infty}+\varepsilon
\displaystyle\leq γVV+ε.\displaystyle\gamma\|V-V^{\prime}\|_{\infty}+\varepsilon.

Since an arbitrarily small ε\varepsilon can be chosen, this implies that (𝒯V)(𝒯V)γVV\left\|(\mathcal{T}V)-(\mathcal{T}V^{\prime})\right\|_{\infty}\leq\gamma\|V-V^{\prime}\|_{\infty}. Switching the positions of VV and VV^{\prime}, we obtain the desired result for 𝒯\mathcal{T}. The proof of 𝒯π\mathcal{T}^{\pi} is analogous to that for 𝒯\mathcal{T}. \Box
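Both properties of Lemma 3 can be illustrated on a toy example. In the sketch below, the worst case over a finite scenario set plays the role of the composite risk measure $\rho_{\mu}\circ\rho_{P_{\theta}}$ (it is monotone and translation invariant, which is all the proof uses); all problem data are randomly generated and hypothetical:

```python
import numpy as np

# Toy verification of Lemma 3 with hypothetical random data. The worst case
# over a finite scenario set stands in for the composite risk measure
# rho_mu . rho_{P_theta}: it is monotone and translation invariant.
rng = np.random.default_rng(1)
nS, nA, nW, gamma = 5, 3, 4, 0.9
C = rng.uniform(0.0, 1.0, (nS, nA, nW))      # cost C(s, a, xi)
nxt = rng.integers(0, nS, (nS, nA, nW))      # successor state s'(s, a, xi)

def bellman(V):
    # (T V)(s) = min_a max_xi [ C(s, a, xi) + gamma * V(s') ]
    return np.array([(C[s] + gamma * V[nxt[s]]).max(axis=1).min()
                     for s in range(nS)])

# (i) Monotonicity: raising V pointwise cannot lower T V.
Vlo = rng.normal(size=nS)
Vhi = Vlo + rng.uniform(0.0, 1.0, nS)
assert np.all(bellman(Vlo) <= bellman(Vhi) + 1e-12)

# (iii) gamma-contraction in the sup-norm.
V1, V2 = rng.normal(size=nS), rng.normal(size=nS)
lhs = np.abs(bellman(V1) - bellman(V2)).max()
assert lhs <= gamma * np.abs(V1 - V2).max() + 1e-12
```

Any other monotone, translation-invariant composite risk measure could be substituted for the inner `max` without affecting the two assertions.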

We are now ready to present the main result of this section, which states the existence of a unique solution to Bellman equation (5.4) and shows that this solution coincides with the optimal value of the infinite-horizon BCR-SOC/MDP (5.1).

Theorem 1.

Consider Bellman equation (5.4). The following assertions hold.

  • (i)

    Equation (5.4) has a unique solution $V^{*}$. Moreover, for any value function $V\in\mathfrak{B}(\mathcal{S},\mathcal{D}_{1}^{0})$, if $V\geq\mathcal{T}V$, then $V\geq V^{*}$, and, if $V\leq\mathcal{T}V$, then $V\leq V^{*}$.

  • (ii)

    The optimal value function of problem (5.1), $\mathcal{J}_{\infty}^{\gamma,*}$, coincides with the unique solution $V^{*}$ of (5.4), i.e.,

    𝒥γ,(s,μ)V(s,μ),(s,μ)𝒮×𝒟10.\mathcal{J}_{\infty}^{\gamma,*}(s,\mu)\equiv V^{*}(s,\mu),\ \forall(s,\mu)\in\mathcal{S}\times\mathcal{D}_{1}^{0}.

Proof. Part (i). Since $\mathcal{T}$ is contractive, the existence and uniqueness of the fixed point $V^{*}$ follow directly from the Banach fixed-point theorem. We prove the remaining two assertions below. Consider any policy $\pi$ such that

V𝒯πV.V\geq\mathcal{T}^{\pi}V. (5.7)

Such a policy π\pi always exists since we can choose π\pi such that 𝒯πV=𝒯V\mathcal{T}^{\pi}V=\mathcal{T}V. By applying 𝒯π\mathcal{T}^{\pi} iteratively to both sides of (5.7) and invoking the monotonicity property from Lemma 3, we deduce that V(𝒯π)nVV\geq\left(\mathcal{T}^{\pi}\right)^{n}V for all n+n\in\mathbb{N}^{+}. Here, (𝒯π)nV\left(\mathcal{T}^{\pi}\right)^{n}V represents the accumulated risk value over a finite horizon with the stationary policy π\pi and the terminal value function VV. Specifically,

(𝒯π)nV\displaystyle\left(\mathcal{T}^{\pi}\right)^{n}V
=ρμ1ρPθ1[𝒞(s1,a1,ξ1)+γρμ2ρPθ2[𝒞(s2,a2,ξ2)++γρμnρPθn[𝒞(sn,an,ξn)+γV(sn+1,μn+1)]]]\displaystyle=\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\gamma\rho_{\mu_{2}}\circ\rho_{P_{\theta_{2}}}\left[\mathcal{C}\left(s_{2},a_{2},\xi_{2}\right)+\cdots+\gamma\rho_{\mu_{n}}\circ\rho_{P_{\theta_{n}}}\left[\mathcal{C}\left(s_{n},a_{n},\xi_{n}\right)+\gamma V\left(s_{n+1},\mu_{n+1}\right)\right]\right]\right]
ρμ1ρPθ1[𝒞(s1,a1,ξ1)+γρμ2ρPθ2[𝒞(s2,a2,ξ2)++γρμnρPθn[𝒞(sn,an,ξn)]]].\displaystyle\geq\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\gamma\rho_{\mu_{2}}\circ\rho_{P_{\theta_{2}}}\left[\mathcal{C}\left(s_{2},a_{2},\xi_{2}\right)+\cdots+\gamma\rho_{\mu_{n}}\circ\rho_{P_{\theta_{n}}}\left[\mathcal{C}\left(s_{n},a_{n},\xi_{n}\right)\right]\right]\right].

As $n$ tends to infinity, the sequence on the right-hand side converges to the value function under the policy $\pi$. Since $V^{*}$ is defined as the minimum over all such policies, it follows that $V\geq V^{*}$.

Next, suppose $V\leq\mathcal{T}V$ and consider a finite-horizon BCR-SOC/MDP with the terminal risk value at the last episode defined as $V\left(s_{t+1},\mu_{t+1}\right)$. It follows that:

ρμtρPθt[𝒞(st,at,ξt)+γV(st+1,μt+1)](𝒯V)(st,μt)V(st,μt).\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[{\cal C}\left(s_{t},a_{t},\xi_{t}\right)+\gamma V\left(s_{t+1},\mu_{t+1}\right)\right]\geq(\mathcal{T}V)\left(s_{t},\mu_{t}\right)\geq V\left(s_{t},\mu_{t}\right).

Therefore, we can deduce

ρμ1ρPθ1[𝒞(s1,a1,ξ1)+γρμ2ρPθ2[𝒞(s2,a2,ξ2)++γV(st+1,μt+1)]]\displaystyle\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\gamma\rho_{\mu_{2}}\circ\rho_{P_{\theta_{2}}}\left[\mathcal{C}\left(s_{2},a_{2},\xi_{2}\right)+\cdots+\gamma V\left(s_{t+1},\mu_{t+1}\right)\right]\right]
\displaystyle\geq ρμ1ρPθ1[𝒞(s1,a1,ξ1)+γρμ2ρPθ2[𝒞(s2,a2,ξ2)++γV(st,μt)]].\displaystyle\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\gamma\rho_{\mu_{2}}\circ\rho_{P_{\theta_{2}}}\left[\mathcal{C}\left(s_{2},a_{2},\xi_{2}\right)+\cdots+\gamma V\left(s_{t},\mu_{t}\right)\right]\right].

By setting s1=ss_{1}=s and μ1=μ\mu_{1}=\mu, we have

ρμ1ρPθ1[𝒞(s1,a1,ξ1)+γρμ2ρPθ2[𝒞(s2,a2,ξ2)++γV(st+1,μt+1)]]V(s,μ).\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\gamma\rho_{\mu_{2}}\circ\rho_{P_{\theta_{2}}}\left[\mathcal{C}\left(s_{2},a_{2},\xi_{2}\right)+\cdots+\gamma V\left(s_{t+1},\mu_{t+1}\right)\right]\right]\geq V(s,\mu).

Let VmaxV_{\max} be an upper bound on |V(s,μ)|,s𝒮,μ𝒟10|V(s,\mu)|,\forall s\in\mathcal{S},\mu\in\mathcal{D}_{1}^{0}, we have

ρμ1ρPθ1[𝒞(s1,a1,ξ1)+γρμ2ρPθ2[𝒞(s2,a2,ξ2)++γρμtρPθt𝒞(st,at,ξt)]]V(s,μ)Vmaxγt.\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\gamma\rho_{\mu_{2}}\circ\rho_{P_{\theta_{2}}}\left[\mathcal{C}\left(s_{2},a_{2},\xi_{2}\right)+\cdots+\gamma\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\mathcal{C}\left(s_{t},a_{t},\xi_{t}\right)\right]\right]\geq V(s,\mu)-V_{\max}\gamma^{t}.

As $t\rightarrow\infty$, the left-hand side of the inequality converges to the value function under the policy $\pi$. The infimum over all policies $\pi$ is therefore bounded from below by $V(s,\mu)$, leading to the conclusion that $V^{*}\geq V$.

Part (ii). To begin, by the definition of $\mathcal{J}_{T}^{\gamma}(\pi,s,\mu)$, for any horizon $T$, state $s$, posterior $\mu$ and policy $\pi$, we have

𝒥Tγ,(s,μ)𝒥Tγ(π,s,μ)𝒥γ(π,s,μ).\mathcal{J}_{T}^{\gamma,*}(s,\mu)\leq\mathcal{J}_{T}^{\gamma}(\pi,s,\mu)\leq\mathcal{J}_{\infty}^{\gamma}(\pi,s,\mu).

Therefore,

𝒥Tγ,(s,μ)𝒥γ,(s,μ).\mathcal{J}_{T}^{\gamma,*}(s,\mu)\leq\mathcal{J}_{\infty}^{\gamma,*}(s,\mu).

Given the monotonicity of the Bellman operator $\mathcal{T}$, $\mathcal{J}_{T}^{\gamma,*}(s,\mu)$ forms a nondecreasing sequence in $T$; since $\mathcal{J}_{T}^{\gamma,*}$ is generated by repeated application of the contractive operator $\mathcal{T}$, it follows that $\lim_{T\to\infty}\mathcal{J}_{T}^{\gamma,*}=V^{*}$.

Hence, to complete the proof, it only remains to show that V𝒥γ,V^{*}\geq\mathcal{J}_{\infty}^{\gamma,*}. Consider any policy π\pi such that V(𝒯π)VV^{*}\geq(\mathcal{T}^{\pi})V^{*}. By iterating this inequality, for all n1n\geq 1 and (s,μ)𝒮×𝒟10(s,\mu)\in\mathcal{S}\times\mathcal{D}_{1}^{0}, we obtain:

V(s,μ)ρμ1ρPθ1[𝒞(s1,a1,ξ1)+γρμ2ρPθ2[𝒞(s2,a2,ξ2)++γρμnρPθn[𝒞(sn,an,ξn)]]].\displaystyle V^{*}(s,\mu)\geq\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[\mathcal{C}\left(s_{1},a_{1},\xi_{1}\right)+\gamma\rho_{\mu_{2}}\circ\rho_{P_{\theta_{2}}}\left[\mathcal{C}\left(s_{2},a_{2},\xi_{2}\right)+\cdots+\gamma\rho_{\mu_{n}}\circ\rho_{P_{\theta_{n}}}\left[\mathcal{C}\left(s_{n},a_{n},\xi_{n}\right)\right]\right]\right].

Letting nn\to\infty, it follows that for all (s,μ)𝒮×𝒟10(s,\mu)\in\mathcal{S}\times\mathcal{D}_{1}^{0},

V(s,μ)𝒥γ(π,s,μ)𝒥γ,(s,μ).V^{*}(s,\mu)\geq\mathcal{J}_{\infty}^{\gamma}(\pi,s,\mu)\geq\mathcal{J}_{\infty}^{\gamma,*}(s,\mu).

Therefore, combining this result with Part (i) of Theorem 1, we conclude that 𝒥γ,V\mathcal{J}_{\infty}^{\gamma,*}\equiv V^{*} satisfies the Bellman equation 𝒥γ,=𝒯𝒥γ,\mathcal{J}_{\infty}^{\gamma,*}=\mathcal{T}\mathcal{J}_{\infty}^{\gamma,*}. \Box
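Theorem 1 underpins a value-iteration scheme: starting from any bounded $V$ and applying $\mathcal{T}$ repeatedly converges to $V^{*}$ at the geometric rate $\gamma^{n}$. A minimal sketch with hypothetical random data, using the worst case over a finite scenario set as a simple stand-in for the composite risk measure:

```python
import numpy as np

# Value-iteration sketch behind Theorem 1, with hypothetical random data and
# the worst case over a finite scenario set standing in for the composite
# risk measure: iterating T from V0 = 0 converges to the unique fixed point.
rng = np.random.default_rng(2)
nS, nA, nW, gamma = 4, 2, 3, 0.8
C = rng.uniform(0.0, 1.0, (nS, nA, nW))
nxt = rng.integers(0, nS, (nS, nA, nW))

def bellman(V):
    # (T V)(s) = min_a max_xi [ C(s, a, xi) + gamma * V(s') ]
    return np.array([(C[s] + gamma * V[nxt[s]]).max(axis=1).min()
                     for s in range(nS)])

V = np.zeros(nS)
for _ in range(200):          # 200 iterations: error ~ gamma^200, negligible
    V = bellman(V)
Vstar = V

residual = np.abs(bellman(Vstar) - Vstar).max()
assert residual < 1e-10       # Banach fixed point reached numerically
```

By the contraction property, the iterates satisfy $\|V_{n}-V^{*}\|_{\infty}\leq\gamma^{n}\|V_{0}-V^{*}\|_{\infty}$, so the residual above is driven below machine precision long before 200 iterations.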

5.2 Asymptotic convergence

Let VV^{*} be defined as in Theorem 1 and PθcP_{\theta^{c}} be defined in Section 2. Define

V(s,δθc)=mina𝒜ρPθc[𝒞(s,a,ξ)+γV(s,δθc)]V^{*}(s,\delta_{\theta^{c}})=\min_{a\in\mathcal{A}}\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\delta_{\theta^{c}}\right)\right] (5.8)

and

π(s,δθc)argmina𝒜ρPθc[𝒞(s,a,ξ)+γV(s,δθc)].\pi^{*}(s,\delta_{\theta^{c}})\in\arg\min_{a\in\mathcal{A}}\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\delta_{\theta^{c}}\right)\right]. (5.9)

For t=1,2,t=1,2,\cdots, let μt\mu_{t} be defined as in (5.1c). Define

V(s,μt)=mina𝒜ρμtρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]V^{*}(s,\mu_{t})=\min_{a\in\mathcal{A}}\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right] (5.10)

and

π(s,μt)argmina𝒜ρμtρPθ[𝒞(s,a,ξ)+γV(s,μt+1)].\pi^{*}(s,\mu_{t})\in\arg\min_{a\in\mathcal{A}}\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]. (5.11)

In this section, we investigate the convergence of $V^{*}(s,\mu_{t})$ to $V^{*}(s,\delta_{\theta^{c}})$ and of $\pi^{*}(s,\mu_{t})$ to $\pi^{*}(s,\delta_{\theta^{c}})$ as $t\to\infty$. The rationale behind the convergence analysis is that in the infinite-horizon Markov decision process, the DM and the environment interact continuously, allowing the DM to gradually acquire an infinite amount of information about the true environment (represented by the true probability distribution of $\xi$). As information or data accumulate, the DM's understanding of the environment improves continuously, diminishing the uncertainty about $\theta$. This raises the question of whether the optimal value $V^{*}(s,\mu_{t})$ and the optimal policy $\pi^{*}(s,\mu_{t})$, obtained from solving the BCR-SOC/MDP model (5.1), converge to their respective true counterparts $V^{*}(s,\delta_{\theta^{c}})$ and $\pi^{*}(s,\delta_{\theta^{c}})$ as the data set expands. The first step is to establish the weak convergence of $\mu_{t}$ to $\delta_{\theta^{c}}$. To this end, we need the following assumption.

Assumption 4.

Let mtm_{t} and vtv_{t} denote the posterior mean and variance with respect to μt\mu_{t}. As tt tends to infinity, vt0v_{t}\to 0 and mtθcm_{t}\to\theta^{c} almost surely (a.s.).

This assumption indicates that as the sequence $\bm{\xi}^{t}$ expands with the increase of $t$, more observations of $\xi$ are gathered. Consequently, the Bayesian posterior distribution will increasingly concentrate around $\theta^{c}$, eventually reducing to a Dirac measure centered at $\theta^{c}$. The following example illustrates that this assumption is reasonable.

Example 9.

Consider the inventory control problem, as proposed in [38]. Based on experience and relevant expertise, a warehouse manager believes that the customer demand follows a Poisson distribution with parameter θc\theta^{c}. The prior distribution μ0\mu_{0} is modeled as a Gamma(a,b)\operatorname{Gamma}(a,b) distribution, which, because of its conjugate nature with respect to the Poisson distribution, ensures that the posterior distribution μt\mu_{t} of θ\theta remains within the Gamma distribution family with updated parameters (at,bt)\left(a_{t},b_{t}\right). Here, at=a+tξ¯a_{t}=a+t\bar{\xi}, ξ¯=i=1tξit\bar{\xi}=\frac{\sum_{i=1}^{t}\xi_{i}}{t} and bt=b+tb_{t}=b+t. Therefore, the posterior mean and variance can easily be calculated as

mt=a+tξ¯b+t and vt=a+tξ¯(b+t)2.m_{t}=\frac{a+t\bar{\xi}}{b+t}\quad\text{ and }\quad v_{t}=\frac{a+t\bar{\xi}}{(b+t)^{2}}.

By the law of large numbers, ξ¯θc\bar{\xi}\rightarrow\theta^{c} a.s. under the true distribution PθcP_{\theta^{c}}. From this, it follows immediately that vt0v_{t}\rightarrow 0 and mtθcm_{t}\rightarrow\theta^{c} almost surely, thereby satisfying Assumption 4. Further, since the convergence rate of ξ¯\bar{\xi} is O(t1/2)O(t^{-1/2}), we have

limttmax{(mtθc)2,vt}=θc,\lim_{t\to\infty}{t}\max\{(m_{t}-\theta^{c})^{2},v_{t}\}=\theta^{c},

which verifies the conditions required in Theorem 3 presented below.
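The posterior update of Example 9 is easy to simulate. The sketch below uses the hypothetical values $\theta^{c}=3$ and prior $\operatorname{Gamma}(1,1)$ to show the posterior mean approaching $\theta^{c}$ and the posterior variance vanishing, in line with Assumption 4:

```python
import numpy as np

# Simulation of Example 9 with hypothetical values theta_c = 3 and prior
# Gamma(a, b) = Gamma(1, 1): the conjugate Gamma posterior has mean
# m_t = (a + t*xi_bar)/(b + t) -> theta_c and variance v_t -> 0.
rng = np.random.default_rng(3)
theta_c, a, b = 3.0, 1.0, 1.0
xi = rng.poisson(theta_c, size=200_000)   # Poisson(theta_c) observations

m_t = v_t = None
for t in (100, 10_000, 200_000):
    a_t, b_t = a + xi[:t].sum(), b + t    # conjugate posterior parameters
    m_t, v_t = a_t / b_t, a_t / b_t**2
    print(t, round(float(m_t), 3), round(float(v_t), 6))
```

At the final sample size the posterior variance has shrunk to roughly $\theta^{c}/t$, matching the rate discussed in the example.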

Proposition 5.

Under Assumption 4, for any ε>0\varepsilon>0, we have that

limtθ𝔹(θc,ε)μt(θ)𝑑θ=1,a.s.\lim_{t\to\infty}\int_{\theta\in\mathbb{B}(\theta^{c},\varepsilon)}\mu_{t}(\theta)d\theta=1,\ \text{a.s.} (5.12)

where 𝔹(θc,ε):={θ:|θθc|ε}\mathbb{B}(\theta^{c},\varepsilon):=\{\theta:|\theta-\theta^{c}|\leq\varepsilon\} denotes the neighborhood around θc\theta^{c}.

In [63], this result is called Bayesian consistency; see Assumption 3.2 there. As shown in [27], it follows by applying Markov's inequality to the squared deviation $(\theta-\theta^{c})^{2}$ that

θ𝔹(θc,ε)μt(θ)𝑑θ1vt+(mtθc)2ε2.\int_{\theta\in\mathbb{B}(\theta^{c},\varepsilon)}\mu_{t}(\theta)d\theta\geq 1-\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\varepsilon^{2}}. (5.13)

Therefore, the Bayesian consistency as stated in (5.12) is achieved under Assumption 4.
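The bound (5.13) can be checked by Monte Carlo under the Gamma-Poisson posterior of Example 9, with hypothetical values $\theta^{c}=3$, prior $\operatorname{Gamma}(1,1)$, $t=400$ observations and radius $\varepsilon=0.5$:

```python
import numpy as np

# Monte Carlo check of the bound (5.13): the posterior mass of the ball
# B(theta_c, eps) dominates 1 - (v_t + (m_t - theta_c)^2) / eps^2.
# All numerical values (theta_c, prior, t, eps) are hypothetical.
rng = np.random.default_rng(4)
theta_c, eps, t = 3.0, 0.5, 400
xi = rng.poisson(theta_c, size=t)
a_t, b_t = 1.0 + xi.sum(), 1.0 + t        # conjugate Gamma posterior
m_t, v_t = a_t / b_t, a_t / b_t**2

draws = rng.gamma(shape=a_t, scale=1.0 / b_t, size=100_000)
mass = float(np.mean(np.abs(draws - theta_c) <= eps))   # left side of (5.13)
bound = 1.0 - (v_t + (m_t - theta_c) ** 2) / eps**2     # right side of (5.13)
assert mass >= bound - 1e-3   # small tolerance for Monte Carlo error
```

As $t$ grows both $v_{t}$ and $(m_{t}-\theta^{c})^{2}$ vanish, so the right-hand side tends to $1$ and forces the posterior mass of the ball to $1$, which is exactly (5.12).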

Lemma 4.

Under Assumption 4, μt\mu_{t} converges weakly to the Dirac measure δθc\delta_{\theta^{c}}, i.e.,

limtΘh(θ)μt(θ)𝑑θ=h(θc)\displaystyle\lim_{t\to\infty}\int_{\Theta}h(\theta)\mu_{t}(\theta)d\theta=h({\theta^{c}}) (5.14)

for any bounded and continuous function hh. Moreover, ρμt(h(θ))\rho_{\mu_{t}}(h(\theta)) converges to ρδθc(h(θ))=h(θc)\rho_{\delta_{\theta^{c}}}\left(h({\theta})\right)=h({\theta^{c}}) as tt goes to infinity.

Proof. The result is similar to [62, Lemma 3.2]; here we give a proof for completeness. For any $\epsilon>0$, decompose the integral as follows:

\int_{\Theta}h(\theta)\mu_{t}(\theta)d\theta=\int_{\mathbb{B}(\theta^{c},\varepsilon)}h(\theta)\mu_{t}(\theta)d\theta+\int_{\Theta\setminus\mathbb{B}(\theta^{c},\varepsilon)}h(\theta)\mu_{t}(\theta)d\theta. (5.15)

Since hh is continuous, for any ϵ>0\epsilon>0, there exists an ε0>0\varepsilon_{0}>0 such that for all εε0\varepsilon\leq\varepsilon_{0} and θ𝔹(θc,ε)\theta\in\mathbb{B}(\theta^{c},\varepsilon), |h(θ)h(θc)|<ϵ|h(\theta)-h(\theta^{c})|<\epsilon. Consequently, we have

|𝔹(θc,ε)h(θ)μt(θ)𝑑θ𝔹(θc,ε)h(θc)μt(θ)𝑑θ|<ϵ𝔹(θc,ε)μt(θ)𝑑θ.\left|\int_{\mathbb{B}(\theta^{c},\varepsilon)}h(\theta)\mu_{t}(\theta)d\theta-\int_{\mathbb{B}(\theta^{c},\varepsilon)}h(\theta^{c})\mu_{t}(\theta)d\theta\right|<\epsilon\int_{\mathbb{B}(\theta^{c},\varepsilon)}\mu_{t}(\theta)d\theta.

Since 𝔹(θc,ε)μt(θ)𝑑θ1\int_{\mathbb{B}(\theta^{c},\varepsilon)}\mu_{t}(\theta)d\theta\leq 1, this implies

|𝔹(θc,ε)h(θ)μt(θ)𝑑θ𝔹(θc,ε)h(θc)μt(θ)𝑑θ|<ϵ.\displaystyle\left|\int_{\mathbb{B}(\theta^{c},\varepsilon)}h(\theta)\mu_{t}(\theta)d\theta-\int_{\mathbb{B}(\theta^{c},\varepsilon)}h(\theta^{c})\mu_{t}(\theta)d\theta\right|<\epsilon. (5.16)

On the other hand, since hh is bounded, then there exists a constant M>0M>0 such that |h(θ)|M|h(\theta)|\leq M for all θΘ\theta\in\Theta. Together with (5.12), we have

|Θ𝔹(θc,ε)h(θ)μt(θ)𝑑θΘ𝔹(θc,ε)h(θc)μt(θ)𝑑θ|2MΘ𝔹(θc,ε)μt(θ)𝑑θ0\displaystyle\left|\int_{\Theta\setminus\mathbb{B}(\theta^{c},\varepsilon)}h(\theta)\mu_{t}(\theta)d\theta-\int_{\Theta\setminus\mathbb{B}(\theta^{c},\varepsilon)}h(\theta^{c})\mu_{t}(\theta)d\theta\right|\leq 2M\int_{\Theta\setminus\mathbb{B}(\theta^{c},\varepsilon)}\mu_{t}(\theta)d\theta\to 0 (5.17)

as tt\to\infty. Combining (5.15)-(5.17), we obtain

|Θh(θ)μt(θ)𝑑θh(θc)|\displaystyle\left|\int_{\Theta}h(\theta)\mu_{t}(\theta)d\theta-h(\theta^{c})\right|
\displaystyle\leq |𝔹(θc,ε)h(θ)μt(θ)𝑑θh(θc)𝔹(θc,ε)μt(θ)𝑑θ|+|Θ𝔹(θc,ε)h(θ)μt(θ)𝑑θΘ𝔹(θc,ε)h(θc)μt(θ)𝑑θ|\displaystyle\left|\int_{\mathbb{B}(\theta^{c},\varepsilon)}h(\theta)\mu_{t}(\theta)d\theta-h(\theta^{c})\int_{\mathbb{B}(\theta^{c},\varepsilon)}\mu_{t}(\theta)d\theta\right|+\left|\int_{\Theta\setminus\mathbb{B}(\theta^{c},\varepsilon)}h(\theta)\mu_{t}(\theta)d\theta-\int_{\Theta\setminus\mathbb{B}(\theta^{c},\varepsilon)}h(\theta^{c})\mu_{t}(\theta)d\theta\right|
\displaystyle\leq ϵ+2MΘ𝔹(θc,ε)μt(θ)𝑑θ.\displaystyle\epsilon+2M\int_{\Theta\setminus\mathbb{B}(\theta^{c},\varepsilon)}\mu_{t}(\theta)d\theta.

Since $\epsilon$ can be arbitrarily small and the second term tends to 0 as $t\to\infty$, we establish (5.14) as desired. Moreover, since $\mu_{t}$ converges weakly to $\delta_{\theta^{c}}$ and, by Lemma 1, $\rho_{\mu_{t}}(h(\theta))$ is continuous in $P_{\mu}$, we conclude that $\rho_{\mu_{t}}(h(\theta))$ converges to $h(\theta^{c})$ as $t$ goes to infinity. $\Box$
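Lemma 4 can be illustrated numerically with the Gamma-Poisson posterior of Example 9, using the hypothetical values $\theta^{c}=2$, prior $\operatorname{Gamma}(1,1)$ and the bounded continuous test function $h=\tanh$:

```python
import numpy as np

# Illustration of Lemma 4: for a bounded continuous h, the posterior
# integral of h tends to h(theta_c) as t grows. theta_c = 2 and the
# Gamma(1, 1) prior are hypothetical choices.
rng = np.random.default_rng(5)
theta_c = 2.0
h = np.tanh   # bounded and continuous

val = None
for t in (10, 1_000, 100_000):
    xi = rng.poisson(theta_c, size=t)
    a_t, b_t = 1.0 + xi.sum(), 1.0 + t
    draws = rng.gamma(shape=a_t, scale=1.0 / b_t, size=50_000)
    val = float(np.mean(h(draws)))   # Monte Carlo estimate of the integral
    print(t, round(val, 4))          # approaches tanh(2) as t grows
```

The printed values approach $h(\theta^{c})=\tanh(2)\approx 0.9640$, exactly the weak-convergence statement (5.14) specialized to this posterior.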

5.2.1 Qualitative convergence

We are now ready to address qualitative asymptotic convergence of optimal values and optimal policies defined in (5.10) and (5.11).

Theorem 2.

Let V(s,μt)V^{*}(s,\mu_{t}) and V(s,δθc)V^{*}(s,\delta_{\theta^{c}}) be defined as in (5.10) and (5.8), let π(s,μt)\pi^{*}(s,\mu_{t}) and π(s,δθc)\pi^{*}(s,\delta_{\theta^{c}}) be the associated optimal policies defined as in (5.11) and (5.9). Under Assumptions 1, 3 and 4, the following assertions hold.

  • (i)

    V(s,μt)V^{*}(s,\mu_{t}) converges to V(s,δθc)V^{*}(s,\delta_{\theta^{c}}) uniformly w.r.t. s𝒮s\in{\cal S} as tt\to\infty, i.e., for any ε>0\varepsilon>0, there exists a t(ε)t^{*}(\varepsilon) such that for all tt(ε)t\geq t^{*}(\varepsilon),

    sups𝒮|V(s,μt)V(s,δθc)|ε1γ, a.s.\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})\right|\leq\frac{\varepsilon}{1-\gamma},\text{ a.s.} (5.18)
  • (ii)

    For each fixed s¯𝒮\bar{s}\in{\cal S}, π(s¯,μt)\pi^{*}(\bar{s},\mu_{t}) converges to the set 𝒜(s¯,δθc)\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}}) a.s. as tt\rightarrow\infty. Moreover, if 𝒜(s¯,δθc)={π(s¯,δθc)}\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}})=\left\{\pi^{*}(\bar{s},\delta_{\theta^{c}})\right\} is a singleton, then π(s¯,μt)\pi^{*}(\bar{s},\mu_{t}) converges to π(s¯,δθc)\pi^{*}(\bar{s},\delta_{\theta^{c}}) almost surely.

Proof. Part (i). It suffices to show the uniform convergence of the right-hand side of (5.10). By definition,

sups𝒮|V(s,μt)V(s,δθc)|\displaystyle\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})\right| (5.19)
=\displaystyle= sups𝒮|mina𝒜ρμtρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]mina𝒜ρPθc[𝒞(s,a,ξ)+γV(s,δθc)]|\displaystyle\sup_{s\in\mathcal{S}}\left|\min_{a\in\mathcal{A}}\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\min_{a\in\mathcal{A}}\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\delta_{\theta^{c}}\right)\right]\right|
\displaystyle\leq sup(s,a)𝒮×𝒜|ρμtρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρδθcρPθ[𝒞(s,a,ξ)+γV(s,δθc)]|\displaystyle\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left|\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{\delta_{\theta^{c}}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\delta_{\theta^{c}}\right)\right]\right|
\displaystyle\leq sup(s,a)𝒮×𝒜|ρμtρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρδθcρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]|\displaystyle\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left|\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{\delta_{\theta^{c}}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|
+\displaystyle+ sup(s,a)𝒮×𝒜|ρδθcρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρδθcρPθ[𝒞(s,a,ξ)+γV(s,δθc)]|\displaystyle\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left|\rho_{\delta_{\theta^{c}}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{\delta_{\theta^{c}}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\delta_{\theta^{c}}\right)\right]\right|
\displaystyle\leq ε+γρPθc[sups𝒮|V(s,μt+1)V(s,δθc)|]\displaystyle\varepsilon+\gamma\rho_{P_{\theta^{c}}}\left[\sup_{s\in\mathcal{S}}\left|V^{*}\left(s,\mu_{t+1}\right)-V^{*}\left(s,\delta_{\theta^{c}}\right)\right|\right]
\displaystyle\leq ε+γε+γ2ρPθcρPθc[sups𝒮|V(s,μt+2)V(s,δθc)|]\displaystyle\varepsilon+\gamma\varepsilon+\gamma^{2}\rho_{P_{\theta^{c}}}\circ\rho_{P_{\theta^{c}}}\left[\sup_{s\in\mathcal{S}}\left|V^{*}\left(s,\mu_{t+2}\right)-V^{*}\left(s,\delta_{\theta^{c}}\right)\right|\right]
\displaystyle\leq \sum_{i=1}^{n+1}\gamma^{i-1}\varepsilon+\gamma^{n+1}\rho_{P_{\theta^{c}}}\circ\cdots\circ\rho_{P_{\theta^{c}}}\left[\sup_{s\in\mathcal{S}}\left|V^{*}\left(s,\mu_{t+n+1}\right)-V^{*}\left(s,\delta_{\theta^{c}}\right)\right|\right]
\displaystyle\leq ε1γ.\displaystyle\frac{\varepsilon}{1-\gamma}.

The third inequality follows by defining h(θ):=ρPθ(𝒞(s,a,ξ)+γV(s,μt+1))h(\theta):=\rho_{P_{\theta}}(\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)) and applying Lemma 4 directly. Letting nn\to\infty, we obtain that V(s,μt)V^{*}(s,\mu_{t}) converges uniformly to V(s,δθc)V^{*}(s,\delta_{\theta^{c}}).

Part (ii). The result here is similar to that of [63, Proposition 3.2]; we provide a proof for completeness due to the adoption of BCR in our model. By Part (i),

sup(s,a)𝒮×𝒜|ρμtρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,δθc)]|0,a.s.\displaystyle\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left|\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\delta_{\theta^{c}}\right)\right]\right|\to 0,\ a.s.

Thus, for any ε>0\varepsilon>0, there exists a t(ε)<t^{*}(\varepsilon)<\infty such that for all tt(ε)t\geq t^{*}(\varepsilon),

ρPθc[𝒞(s¯,π(s¯,μt),ξ)+γV(s¯,δθc)]ε/2ρμtρPθ[𝒞(s¯,π(s¯,μt),ξ)+γV(s¯,μt)],ρμtρPθ[𝒞(s¯,π(s¯,δθc),ξ)+γV(s¯,μt)]ε/2ρPθc[𝒞(s¯,π(s¯,δθc),ξ)+γV(s¯,δθc)].\begin{array}[]{r}\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\mu_{t}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right]-\varepsilon/2\leq\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\mu_{t}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\mu_{t}\right)\right],\\ \rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\delta_{\theta^{c}}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\mu_{t}\right)\right]-\varepsilon/2\leq\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\delta_{\theta^{c}}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right].\end{array}

By the definition of π(s¯,μt)\pi^{*}(\bar{s},\mu_{t}),

ρμtρPθ[𝒞(s¯,π(s¯,μt),ξ)+γV(s¯,μt)]ρμtρPθ[𝒞(s¯,π(s¯,δθc),ξ)+γV(s¯,μt)].\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\mu_{t}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\mu_{t}\right)\right]\leq\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\delta_{\theta^{c}}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\mu_{t}\right)\right].

Thus, we can derive

ρPθc[𝒞(s¯,π(s¯,μt),ξ)+γV(s¯,δθc)]ρPθc[𝒞(s¯,π(s¯,δθc),ξ)+γV(s¯,δθc)]+ε.\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\mu_{t}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right]\leq\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\delta_{\theta^{c}}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right]+\varepsilon.

Define the lower level set:

Lε(s¯):={a𝒜:ρPθc[𝒞(s¯,a,ξ)+γV(s¯,δθc)]ρPθc[𝒞(s¯,π(s¯,δθc),ξ)+γV(s¯,δθc)]+ε}.L_{\varepsilon}(\bar{s}):=\{a\in\mathcal{A}:\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},a,\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right]\leq\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\delta_{\theta^{c}}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right]+\varepsilon\}.

Then $\pi^{*}(\bar{s},\mu_{t})\in L_{\varepsilon}(\bar{s})$. By the convexity of the objective function and the compactness and convexity of the action set, we deduce that for any $\bar{s}\in\mathcal{S}$, the set of optimal actions, $\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}})$, is nonempty and bounded. Consider a decreasing sequence $\varepsilon_{k}\to 0$; it is evident that $L_{\varepsilon_{k+1}}(\bar{s})\subseteq L_{\varepsilon_{k}}(\bar{s})$. Thus the intersection $\cap_{k=1}^{\infty}L_{\varepsilon_{k}}(\bar{s})$ is contained in the topological closure of $\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}})$.

Assume for the sake of contradiction that the convergence does not hold. Then we can find a bounded sequence $\{a_{k}\}$ with $a_{k}\in L_{\varepsilon_{k}}(\bar{s})$ such that $\text{dist}\left(a_{k},\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}})\right)\geq\delta>0$ for every $k$. Let $a^{*}$ be an accumulation point of the sequence. Then

ρPθc[𝒞(s¯,a,ξ)+γV(s¯,δθc)]ρPθc[𝒞(s¯,π(s¯,δθc),ξ)+γV(s¯,δθc)]+ε\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},a^{*},\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right]\leq\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\delta_{\theta^{c}}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right]+\varepsilon

for any ε>0\varepsilon>0, and hence

ρPθc[𝒞(s¯,a,ξ)+γV(s¯,δθc)]=ρPθc[𝒞(s¯,π(s¯,δθc),ξ)+γV(s¯,δθc)],\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},a^{*},\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right]=\rho_{P_{\theta^{c}}}\left[\mathcal{C}(\bar{s},\pi^{*}(\bar{s},\delta_{\theta^{c}}),\xi)+\gamma V^{*}\left(\bar{s}^{\prime},\delta_{\theta^{c}}\right)\right],

i.e., a𝒜(s¯,δθc)a^{*}\in\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}}). However, this contradicts the assumption that dist(a,𝒜(s¯,δθc))δ\operatorname{dist}\left(a^{*},\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}})\right)\geq\delta. Consequently, the distance from π(s¯,μt)\pi^{*}(\bar{s},\mu_{t}) to the topological closure of 𝒜(s¯,δθc)\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}}) converges to zero almost surely, and therefore the distance from π(s¯,μt)\pi^{*}(\bar{s},\mu_{t}) to 𝒜(s¯,δθc)\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}}) converges to zero as well. The conclusion is straightforward when 𝒜(s¯,δθc)\mathcal{A}^{*}(\bar{s},\delta_{\theta^{c}}) is a singleton. \Box
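The policy-convergence mechanism of Theorem 2(ii) can be seen in a stripped-down one-step setting. The sketch below uses entirely hypothetical ingredients: a quadratic cost $c(a,\theta)=(a-\theta)^{2}$, true parameter $\theta^{c}=1.5$, a flat prior with known unit observation noise (so the posterior is $N(\bar{\xi},1/t)$), and plain expectation as the BCR:

```python
import numpy as np

# One-step illustration of Theorem 2(ii): as the posterior concentrates,
# the Bayes-optimal action over a grid converges to the action that is
# optimal under the true parameter. All model ingredients are hypothetical.
rng = np.random.default_rng(6)
theta_c = 1.5
actions = np.linspace(0.0, 3.0, 301)

a_star = None
for t in (5, 50, 5_000):
    xi = rng.normal(theta_c, 1.0, size=t)
    m_t, v_t = xi.mean(), 1.0 / t                 # posterior N(m_t, v_t)
    draws = rng.normal(m_t, np.sqrt(v_t), size=20_000)
    obj = np.array([np.mean((a - draws) ** 2) for a in actions])
    a_star = actions[obj.argmin()]                # Bayes-optimal action
    print(t, round(float(a_star), 2))
```

Here the Bayes-optimal action tracks the posterior mean, which converges to $\theta^{c}$; with a nonlinear BCR in place of the expectation, the same convergence holds by Theorem 2(ii), though the limit point is characterized through (5.9) rather than the posterior mean.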

The theorem shows uniform convergence of $V^{*}(s,\mu_{t})$ to $V^{*}(s,\delta_{\theta^{c}})$ and pointwise convergence of the optimal policies, results similar to those in [63, Propositions 3.1-3.2]. We include a proof because conventional SOC/MDPs and the episodic Bayesian SOC model [63] focus on a risk-neutral objective function under fixed environment parameters, whereas the BCR-SOC/MDP model (5.1) introduces additional complexity by incorporating epistemic uncertainty about parameter estimation and its impact on both the policy and the value function. Moreover, since the BCR does not usually preserve linearity, we need to verify that each step of the convergence argument for the Bellman optimal value function and policy carries through. Note also that the asymptotic result stops short of quantifying $t^{*}(\varepsilon)$ for a prescribed precision $\varepsilon$. In the next subsection, we present some quantitative convergence results when the BCR takes a specific form.

5.2.2 Quantitative convergence

From Theorem 2, we can see that the convergence of the optimal values and optimal policies depends heavily on the convergence of $\mu_{t}$ to $\delta_{\theta^{c}}$. Under some special circumstances, convergence of the latter may be quantified by the mean and variance of $\mu_{t}$. This raises the question of whether the convergence of the optimal values and optimal policies can be quantified in terms of the convergence of the mean and variance of $\mu_{t}$. In this subsection, we address this question. To this end, we need the following generic technical condition.

Assumption 5.

Let fs,a(ξ)=𝒞(s,a,ξ)+γV(s,μt+1)f_{s,a}(\xi)=\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right) and ψs,a(θ):=ρPθ[fs,a(ξ)]ρPθc[fs,a(ξ)].\psi_{s,a}(\theta):=\rho_{P_{\theta}}\left[f_{s,a}(\xi)\right]-\rho_{P_{\theta^{c}}}\left[f_{s,a}(\xi)\right]. The inner risk measure ρPθ\rho_{P_{\theta}} is Hölder continuous in θ\theta on 𝔹(θc,ε)\mathbb{B}(\theta^{c},\varepsilon) (Lipschitz when k=1k=1), i.e., there exist positive constants LρL_{\rho} and kk such that

|ψs,a(θ)|Lρ|θθc|k,θ𝔹(θc,ε).\displaystyle|\psi_{s,a}(\theta)|\leq L_{\rho}|\theta-\theta^{c}|^{k},\ \forall\theta\in\mathbb{B}(\theta^{c},\varepsilon). (5.20)

The next proposition states how the assumption may be satisfied.

Proposition 6.

Suppose the total variation (TV) distance [66] between PθP_{\theta} and PθcP_{\theta^{c}} satisfies the following condition:

𝖽𝗅TV(Pθ,Pθc)=12Ξ|p(ξ|θ)p(ξ|θc)|dξLp|θθc|,θ𝔹(θc,ε),\displaystyle{\mathsf{d\kern-0.70007ptl}_{TV}}(P_{\theta},P_{\theta^{c}})=\frac{1}{2}\int_{\Xi}|p(\xi|\theta)-p(\xi|\theta^{c})|d\xi\leq L_{p}|\theta-\theta^{c}|,\;\forall\theta\in\mathbb{B}(\theta^{c},\varepsilon), (5.21)

where LpL_{p} is a constant. Then, there exists a constant LρL_{\rho} such that

|ψs,a(θ)|Lρ|θθc|,θ𝔹(θc,ε),\displaystyle|\psi_{s,a}(\theta)|\leq L_{\rho}|\theta-\theta^{c}|,\ \forall\theta\in\mathbb{B}(\theta^{c},\varepsilon), (5.22)

if one of the following conditions holds: (i) The inner risk measure is a coherent risk measure taking the parametric form:

ρPθ(fs,a(ξ))=infϕΦ𝔼Pθ[Ψ(fs,a(ξ),ϕ)],\rho_{P_{\theta}}(f_{s,a}(\xi))=\inf_{\phi\in\Phi}\mathbb{E}_{P_{\theta}}[\Psi(f_{s,a}(\xi),\phi)], (5.23)

where Φ\Phi is a non-empty closed convex subset of m\mathbb{R}^{m}, and Ψ:×Φ\Psi:\mathbb{R}\times\Phi\to\mathbb{R} is jointly convex with respect to (fs,a(ξ),ϕ)×Φ(f_{s,a}(\xi),\phi)\in\mathbb{R}\times\Phi. For every ϕΦ\phi\in\Phi, the function Ψ(fs,a(ξ),ϕ)\Psi(f_{s,a}(\xi),\phi) is nondecreasing and positively homogeneous in fs,a(ξ)f_{s,a}(\xi). (ii) The diameter MM of Ξ\Xi is finite and the inner risk measure is a robust SRM, i.e., ρPθ(fs,a(ξ))=supσ𝔄Mσ(fs,a(ξ))\rho_{P_{\theta}}(f_{s,a}(\xi))=\sup_{\sigma\in\mathfrak{A}}M_{\sigma}(f_{s,a}(\xi)).

Proof. Part (i). By definition

|ψs,a(θ)|=\displaystyle|\psi_{s,a}(\theta)|= |ρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,μt+1)]|\displaystyle\left|\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|
=\displaystyle= |infϕΦ𝔼Pθ[Ψ(𝒞(s,a,ξ)+γV(s,μt+1),ϕ)]infϕΦ𝔼Pθc[Ψ(𝒞(s,a,ξ)+γV(s,μt+1),ϕ)]|\displaystyle\left|\inf_{\phi\in\Phi}\mathbb{E}_{P_{\theta}}[\Psi(\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right),\phi)]-\inf_{\phi\in\Phi}\mathbb{E}_{P_{\theta^{c}}}[\Psi(\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right),\phi)]\right|
\displaystyle\leq supϕΦ|𝔼Pθ[Ψ(𝒞(s,a,ξ)+γV(s,μt+1),ϕ)]𝔼Pθc[Ψ(𝒞(s,a,ξ)+γV(s,μt+1),ϕ)]|\displaystyle\sup_{\phi\in\Phi}\left|\mathbb{E}_{P_{\theta}}[\Psi(\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right),\phi)]-\mathbb{E}_{P_{\theta^{c}}}[\Psi(\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right),\phi)]\right|
\displaystyle\leq K|p(ξ|θ)p(ξ|θc)|dξ2KLp|θθc|,\displaystyle K\int|p(\xi|\theta)-p(\xi|\theta^{c})|d\xi\leq 2KL_{p}|\theta-\theta^{c}|,

where the second last inequality follows from the assumption on ρPθ\rho_{P_{\theta}}, which guarantees the existence of a constant KK such that |Ψ(𝒞(s,a,ξ)+γV(s,μt+1),ϕ)|K|\Psi(\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right),\phi)|\leq K. By setting Lρ=2KLpL_{\rho}=2KL_{p} and k=1k=1, we can draw the conclusion.

Part (ii). Denote νs,a,Pθ=Pθfs,a1\nu_{s,a,P_{\theta}}=P_{\theta}\circ f_{s,a}^{-1} and νs,a,Pθc=Pθcfs,a1\nu_{s,a,P_{\theta^{c}}}=P_{\theta^{c}}\circ f_{s,a}^{-1}. Then by definition, we have

𝖽𝗅Wp(νs,a,Pθ,νs,a,Pθc)M[𝖽𝗅TV(νs,a,Pθ,νs,a,Pθc)]1pM[𝖽𝗅TV(Pθ,Pθc)]1pLpM|θθc|1p.\mathsf{d\kern-0.70007ptl}_{W}^{p}(\nu_{s,a,P_{\theta}},\nu_{s,a,P_{\theta^{c}}})\leq M[\mathsf{d\kern-0.70007ptl}_{TV}(\nu_{s,a,P_{\theta}},\nu_{s,a,P_{\theta^{c}}})]^{\frac{1}{p}}\leq M[{\mathsf{d\kern-0.70007ptl}_{TV}}(P_{\theta},P_{\theta^{c}})]^{\frac{1}{p}}\leq L_{p}M|\theta-\theta^{c}|^{\frac{1}{p}}.

Thus, by setting Lρ=LpMsupσ𝔄[01σ(τ)q𝑑τ]1qL_{\rho}=L_{p}M\sup_{\sigma\in\mathfrak{A}}\left[\int_{0}^{1}\sigma(\tau)^{q}d\tau\right]^{\frac{1}{q}} and k=1/pk=1/p, we can draw the conclusion using part (ii) of Lemma 1. \Box

The assumption about the TV distance in Proposition 6 is satisfied by a number of important parametric probability distributions, including the exponential, gamma and mixed normal distributions, under some moderate conditions. The condition of part (i) in Proposition 6 is commonly used in the literature, such as in  [22, 61, 33]. It encompasses a variety of risk measures, including AVaR, the ϕ\phi-divergence risk measure, and the L1L_{1}-norm and LL_{\infty}-norm risk measures.
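To give a concrete feel for condition (5.21), the following self-contained sketch numerically checks the linear TV-distance bound for a normal location family N(θ,σ²) with θc=0; the family, the interval of integration and the constant Lp=1/(σ√(2π)) are our own illustrative choices, not taken from the paper.

```python
import math

def normal_pdf(x, mu, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def tv_distance(theta1, theta2, sigma=1.0, lo=-10.0, hi=10.0, n=20001):
    # dl_TV = (1/2) * integral of |p(x|theta1) - p(x|theta2)| dx, via the trapezoid rule
    h = (hi - lo) / (n - 1)
    total = 0.0
    prev = abs(normal_pdf(lo, theta1, sigma) - normal_pdf(lo, theta2, sigma))
    for i in range(1, n):
        cur = abs(normal_pdf(lo + i * h, theta1, sigma) - normal_pdf(lo + i * h, theta2, sigma))
        total += 0.5 * (prev + cur) * h
        prev = cur
    return 0.5 * total

theta_c, sigma = 0.0, 1.0
L_p = 1.0 / (sigma * math.sqrt(2.0 * math.pi))  # slope of the linear TV bound at theta_c
for theta in [0.05, 0.1, 0.25, 0.5]:
    # dl_TV(P_theta, P_theta_c) <= L_p * |theta - theta_c|, as required by (5.21)
    assert tv_distance(theta, theta_c, sigma) <= L_p * abs(theta - theta_c) + 1e-8
```

For this family the TV distance has the closed form 2Φ(|θ−θc|/(2σ))−1, whose derivative at θc is exactly Lp, so the linear bound is tight near θc.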

Theorem 3.

Let Assumptions 4 and 5 hold. If there exist positive constants M1M_{1} and M2M_{2} such that

limttM1max{(mtθc)2,vt}=M2,\lim_{t\to\infty}{t^{M_{1}}}\max\{(m_{t}-\theta^{c})^{2},v_{t}\}={M_{2}}, (5.24)

then the following assertions hold.

  • (i)

    If the outer risk measure is chosen with 𝔼μt\mathbb{E}_{\mu_{t}}, then

    sups𝒮|V(s,μt)V(s,δθc)|Lρ(vt+(mtθc)2)k2,a.s.\displaystyle\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})\right|\leq L_{\rho}\left(v_{t}+\left(m_{t}-\theta^{c}\right)^{2}\right)^{\frac{k}{2}},\ a.s. (5.25)

    for all k2.k\leq 2.

  • (ii)

    If the outer risk measure is chosen with VaRμtα\text{VaR}_{\mu_{t}}^{\alpha}, then

    sups𝒮|V(s,μt)V(s,δθc)|Lρ1γ(vt+(mtθc)2α)k2,a.s.\displaystyle\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})\right|\leq\frac{L_{\rho}}{1-\gamma}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}},\ a.s. (5.26)

    Furthermore, if the outer risk measure is chosen with robust SRM supσ𝔄Mσ\sup_{\sigma\in\mathfrak{A}}M_{\sigma}, then

    sups𝒮|V(s,μt)V(s,δθc)|Lρ(vt+(mtθc)2)k21γsupσ𝔄01(1τ)k2σ(τ)𝑑τ,a.s.\displaystyle\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})\right|\leq\frac{L_{\rho}\left(v_{t}+(m_{t}-\theta^{c})^{2}\right)^{\frac{k}{2}}}{1-\gamma}\sup_{\sigma\in\mathfrak{A}}\int_{0}^{1}(1-\tau)^{-\frac{k}{2}}\sigma(\tau)d\tau,\ a.s. (5.27)
  • (iii)

    If the outer risk measure is chosen with AVaRμtα\text{AVaR}_{\mu_{t}}^{\alpha}, then

    sups𝒮|V(s,μt)V(s,δθc)|\displaystyle\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})\right| \displaystyle\leq LρtM1k/4(1γ)α(vt+(mtθc)2α)k2\displaystyle\frac{L_{\rho}t^{M_{1}k/4}}{(1-\gamma)\alpha}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}} (5.28)
    +𝒞¯(1γ)2tM1/2,a.s.\displaystyle+\frac{\bar{\mathcal{C}}}{(1-\gamma)^{2}t^{M_{1}/2}},\ a.s.

Before providing a proof, it might be helpful to comment on the conditions and results of the theorem. Unlike Theorem 2, Theorem 3 explicitly quantifies the convergence by providing a worst-case upper bound for the optimal value functions, uniform in the state ss, for several choices of the outer risk measure. This is achieved by strengthening Assumption 4 with an additional condition (5.24) on the rate of convergence of vtv_{t} and (mtθc)2(m_{t}-\theta^{c})^{2}, which means that as the episode tt increases, both the posterior variance and the squared bias decrease at a rate of tM1t^{-M_{1}}. As shown in [21], under standard statistical assumptions, both vtv_{t} and (mtθc)2(m_{t}-\theta^{c})^{2} typically converge at a rate of O(1t)O(\frac{1}{t}), corresponding to M1=1M_{1}=1; see Example 9. With the convergence rate of vtv_{t} and (mtθc)2(m_{t}-\theta^{c})^{2}, we can establish the rate of convergence of the optimal value function from Theorem 3 under some specific BCRs.

  • If the outer risk measure is chosen as 𝔼μt\mathbb{E}_{\mu_{t}}, VaRμtα\text{VaR}_{\mu_{t}}^{\alpha}, or a robust SRM, the convergence rate of V(s,μt)V^{*}(s,\mu_{t}) to V(s,δθc)V^{*}(s,\delta_{\theta^{c}}) is O(tkM1/2)O(t^{-\frac{kM_{1}}{2}}).

  • If the outer risk measure is chosen as AVaRμtα\text{AVaR}_{\mu_{t}}^{\alpha}, the convergence rate of V(s,μt)V^{*}(s,\mu_{t}) to V(s,δθc)V^{*}(s,\delta_{\theta^{c}}) is O(tkM1/4)O(t^{-\frac{kM_{1}}{4}}).

Moreover, Part (i) of the theorem provides a supplement to Part (iii). Specifically, when k2k\leq 2, we have the inequality AVaRμtα(ψs,a(θ))1α𝔼μt(ψs,a(θ))\text{AVaR}_{\mu_{t}}^{\alpha}(\psi_{s,a}(\theta))\leq\frac{1}{\alpha}\mathbb{E}_{\mu_{t}}(\psi_{s,a}(\theta)), demonstrating that the convergence rate for AVaRμtα\text{AVaR}_{\mu_{t}}^{\alpha} can be as fast as that of 𝔼μt\mathbb{E}_{\mu_{t}}.
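The comparison between AVaR and the scaled expectation invoked above can be checked empirically: for a nonnegative loss, the average of the worst α-fraction of outcomes never exceeds the overall mean divided by α. The sketch below uses |N(0,1)| draws as a stand-in for |ψs,a(θ)|; the sample model and sizes are our own toy choices.

```python
import random

def avar(sample, alpha):
    # empirical AVaR at level alpha: average of the worst alpha-fraction of the sample
    xs = sorted(sample, reverse=True)
    k = max(1, int(alpha * len(xs)))
    return sum(xs[:k]) / k

random.seed(0)
alpha, n = 0.1, 100_000
sample = [abs(random.gauss(0.0, 1.0)) for _ in range(n)]  # stand-in for |psi_{s,a}(theta)|
mean = sum(sample) / n
# for nonnegative losses, the worst alpha-tail average is at most mean/alpha
assert avar(sample, alpha) <= mean / alpha
```

The inequality is immediate from sum-of-top-k ≤ sum-of-all, which is why the AVaR rate can match the expectation rate up to the factor 1/α.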

Proof. Part (i). By the definitions of vtv_{t} and mtm_{t} in Assumption 4, we have

𝔼μt[(θθc)2]=vt+(mtθc)2.\displaystyle\mathbb{E}_{\mu_{t}}\left[\left(\theta-\theta^{c}\right)^{2}\right]=v_{t}+\left(m_{t}-\theta^{c}\right)^{2}. (5.29)

Since ρμt=𝔼μt\rho_{\mu_{t}}=\mathbb{E}_{\mu_{t}}, then by (5.20), we have

sups𝒮|V(s,μt)V(s,δθc)|sup(s,a)𝒮×𝒜𝔼μt[|ψs,a(θ)|]Lρ𝔼μt[|θθc|k].\displaystyle\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}\left(s,\delta_{\theta^{c}}\right)\right|\leq\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\mathbb{E}_{\mu_{t}}\left[|\psi_{s,a}(\theta)|\right]\leq L_{\rho}\mathbb{E}_{\mu_{t}}[|\theta-\theta^{c}|^{k}]. (5.30)

By Hölder inequality,

𝔼μt[|θθc|k](𝔼μt[(θθc)2])k2(vt+(mtθc)2)k2.\displaystyle\mathbb{E}_{\mu_{t}}[|\theta-\theta^{c}|^{k}]\leq\left(\mathbb{E}_{\mu_{t}}\left[\left(\theta-\theta^{c}\right)^{2}\right]\right)^{\frac{k}{2}}\leq\left(v_{t}+\left(m_{t}-\theta^{c}\right)^{2}\right)^{\frac{k}{2}}. (5.31)

Combining (5.29)–(5.31) yields (5.25).

Part (ii). Under Assumption 5, it follows from (5.13) that

Pμt(|ψs,a(θ)|Lρε)Pμt(|θθc|kε)1vt+(mtθc)2ε2/k.P_{\mu_{t}}\left(|\psi_{s,a}(\theta)|\leq L_{\rho}\varepsilon\right)\geq P_{\mu_{t}}(|\theta-\theta^{c}|^{k}\leq\varepsilon)\geq 1-\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\varepsilon^{2/k}}.

For the specified α\alpha in the definition of VaRμtα\text{VaR}^{\alpha}_{\mu_{t}}, let ε:=(vt+(mtθc)2α)k2\varepsilon:=\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}. Then the inequality above can equivalently be written as

Pμt(|ψs,a(θ)|Lρ(vt+(mtθc)2α)k2)1α.P_{\mu_{t}}\left(|\psi_{s,a}(\theta)|\leq L_{\rho}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}\right)\geq 1-\alpha. (5.32)

Let h(θ)=|ψs,a(θ)|h(\theta)=|\psi_{s,a}(\theta)| and νh\nu_{h} be the probability distribution of h(θ)h(\theta). Then (5.32) implies

Pνh(h(θ)Lρ(vt+(mtθc)2α)k2)1α.P_{\nu_{h}}\left(h(\theta)\leq L_{\rho}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}\right)\geq 1-\alpha. (5.33)

By the definition of VaR, this implies

VaRνhα(h(θ))Lρ(vt+(mtθc)2α)k2.\text{VaR}_{\nu_{h}}^{\alpha}(h(\theta))\leq L_{\rho}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}.

Thus, we obtain that

|VaRμtαρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,μt+1)]|\displaystyle\left|\text{VaR}_{\mu_{t}}^{\alpha}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|
\displaystyle\leq VaRμtα(|ψs,a(θ)|)=VaRνhα(h(θ))Lρ(vt+(mtθc)2α)k2.\displaystyle\text{VaR}_{\mu_{t}}^{\alpha}(|\psi_{s,a}(\theta)|)=\text{VaR}_{\nu_{h}}^{\alpha}(h(\theta))\leq L_{\rho}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}.

Consequently, we can conclude that

sups𝒮|V(s,μt)V(s,δθc)|\displaystyle\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})\right|
\displaystyle\leq sup(s,a)𝒮×𝒜|ρμtρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρδθcρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]|\displaystyle\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left|\rho_{\mu_{t}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{\delta_{\theta^{c}}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|
+\displaystyle+ sup(s,a)𝒮×𝒜|ρδθcρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρδθcρPθ[𝒞(s,a,ξ)+γV(s,δθc)]|\displaystyle\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\left|\rho_{\delta_{\theta^{c}}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{\delta_{\theta^{c}}}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\delta_{\theta^{c}}\right)\right]\right|
\displaystyle\leq Lρ(vt+(mtθc)2α)k2+γρPθc[sups𝒮|V(s,μt+1)V(s,δθc)|]\displaystyle L_{\rho}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}+\gamma\rho_{P_{\theta^{c}}}\left[\sup_{s\in\mathcal{S}}\left|V^{*}\left(s,\mu_{t+1}\right)-V^{*}\left(s,\delta_{\theta^{c}}\right)\right|\right]
\displaystyle\leq Lρ1γ(vt+(mtθc)2α)k2.\displaystyle\frac{L_{\rho}}{1-\gamma}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}.

This completes the proof for VaRμtα\text{VaR}_{\mu_{t}}^{\alpha}. A similar result can be deduced from the definition of supσ𝔄Mσ\sup_{\sigma\in\mathfrak{A}}M_{\sigma}.

Part (iii). Similar to the proof of Part (ii), we have

Pμt(|ψs,a(θ)|Lρε)Pμt(|θθc|kε)1vt+(mtθc)2ε2/k.P_{\mu_{t}}\left(|\psi_{s,a}(\theta)|\leq{L_{\rho}}\varepsilon\right)\geq P_{\mu_{t}}(|\theta-\theta^{c}|^{k}\leq\varepsilon)\geq 1-\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\varepsilon^{2/k}}.

By setting ε=tM1k/4(vt+(mtθc)2α)k2\varepsilon=t^{M_{1}k/4}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}, we can obtain

Pμt(|ψs,a(θ)|LρtM1k/4(vt+(mtθc)2α)k2)1αtM1/21α.P_{\mu_{t}}\left(|\psi_{s,a}(\theta)|\leq{L_{\rho}}{t^{M_{1}k/4}}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}\right)\geq 1-\frac{\alpha}{t^{M_{1}/2}}\geq 1-\alpha. (5.34)

Let

Θε={θ:|ψs,a(θ)|LρtM1k/4(vt+(mtθc)2α)k2}\Theta_{\varepsilon}=\left\{\theta:|\psi_{s,a}(\theta)|\leq{L_{\rho}}{t^{M_{1}k/4}}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}\right\}

and

Θα={θ:ρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]VaRμtαρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]}.\Theta_{\alpha}=\left\{\theta:\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\geq\text{VaR}_{\mu_{t}}^{\alpha}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right\}.

Then, for all s𝒮s\in\mathcal{S}, we have

|AVaRμtαρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,μt+1)]|\displaystyle\left|\text{AVaR}_{\mu_{t}}^{\alpha}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|
=\displaystyle= 1α|Θα(ρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,μt+1)])μt(θ)𝑑θ|\displaystyle\frac{1}{\alpha}\left|\int_{\Theta_{\alpha}}\left(\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right)\mu_{t}(\theta)d\theta\right|
\displaystyle\leq 1αΘαΘε|ρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,μt+1)]|μt(θ)𝑑θ\displaystyle\frac{1}{\alpha}\int_{\Theta_{\alpha}\cap\Theta_{\varepsilon}}\left|\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|\mu_{t}(\theta)d\theta
+\displaystyle+ 1αΘαΘεc|ρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,μt+1)]|μt(θ)𝑑θ\displaystyle\frac{1}{\alpha}\int_{\Theta_{\alpha}\cap\Theta_{\varepsilon}^{c}}\left|\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|\mu_{t}(\theta)d\theta
\displaystyle\leq 1αsupθΘε|ρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,μt+1)]|\displaystyle\frac{1}{\alpha}\sup_{\theta\in\Theta_{\varepsilon}}\left|\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|
+\displaystyle+ 1αΘεc|ρPθ[𝒞(s,a,ξ)+γV(s,μt+1)]ρPθc[𝒞(s,a,ξ)+γV(s,μt+1)]|μt(θ)𝑑θ\displaystyle\frac{1}{\alpha}\int_{\Theta^{c}_{\varepsilon}}\left|\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]-\rho_{P_{\theta^{c}}}\left[\mathcal{C}(s,a,\xi)+\gamma V^{*}\left(s^{\prime},\mu_{t+1}\right)\right]\right|\mu_{t}(\theta)d\theta
\displaystyle\leq 1αsupθΘε|ψs,a(θ)|+𝒞¯(1γ)αΘεcμt(θ)𝑑θ\displaystyle\frac{1}{\alpha}\sup_{\theta\in\Theta_{\varepsilon}}\left|\psi_{s,a}(\theta)\right|+\frac{\bar{\mathcal{C}}}{(1-\gamma)\alpha}\int_{\Theta_{\varepsilon}^{c}}\mu_{t}(\theta)d\theta
\displaystyle\leq LρtM1k/4α(vt+(mtθc)2α)k2+𝒞¯(1γ)tM1/2.\displaystyle\frac{{L_{\rho}}t^{M_{1}k/4}}{\alpha}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}+\frac{\bar{\mathcal{C}}}{(1-\gamma)t^{M_{1}/2}}.

The second last inequality results from the boundedness of the cost function. Combining this with the same decomposition and contraction argument as in Part (ii), we can conclude that

sups𝒮|V(s,μt)V(s,δθc)|LρtM1k/4(1γ)α(vt+(mtθc)2α)k2+𝒞¯(1γ)2tM1/2,\sup_{s\in\mathcal{S}}\left|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})\right|\leq\frac{L_{\rho}t^{M_{1}k/4}}{(1-\gamma)\alpha}\left({\frac{v_{t}+(m_{t}-\theta^{c})^{2}}{\alpha}}\right)^{\frac{k}{2}}+\frac{\bar{\mathcal{C}}}{(1-\gamma)^{2}t^{M_{1}/2}},

which completes the proof for AVaRμtα\text{AVaR}_{\mu_{t}}^{\alpha}. \Box

5.3 Algorithms for value iterations and policy iterations

Having examined the theoretical properties of the proposed infinite-horizon BCR-SOC/MDP problem, we now consider its solution. This subsection introduces two algorithms for computing the unique value function VV^{*} within the infinite-horizon BCR-SOC/MDP framework. Both algorithms build on the dynamic programming equation (𝒯V)(s,μ)=V(s,μ)(\mathcal{T}V^{*})(s,\mu)=V^{*}(s,\mu) and are risk-averse adaptations of the usual value iteration and policy iteration methods for standard SOC/MDPs (see [4, 28] for details).

In the BCR-SOC/MDP model, the introduction of risk-averse characteristics requires adjustments to the conventional value iteration algorithm to accommodate the new dynamic programming equations. The core of these adjustments lies in integrating the impact of risk measures on decision-making during the value function update process. Building on the framework of the conventional value iteration algorithm, we propose a value iteration algorithm tailored for BCR-SOC/MDP, detailed in Algorithm 2, which aims to approximate VV^{*} by V^\hat{V}^{*} and simultaneously identify a corresponding ε\varepsilon-optimal policy πε,\pi^{\varepsilon,*}.

Algorithm 2 Value Iteration Algorithm for Infinite-horizon BCR-SOC/MDP (5.1).
1:  Initialize precision ε\varepsilon and the value functions V0,V1V_{0},V_{1} with V0=0V_{0}=0 and V1ε(1γ)/2γ\left\|V_{1}\right\|_{\infty}\geq\varepsilon(1-\gamma)/2\gamma, set k:=0k:=0;
2:  while Vk+1Vkε(1γ)/2γ\left\|V_{k+1}-V_{k}\right\|_{\infty}\geq\varepsilon(1-\gamma)/2\gamma do
3:     for all (s,μ)𝒮×𝒟10\left(s,\mu\right)\in\mathcal{S}\times\mathcal{D}_{1}^{0} do
4:        Vk+1(s,μ)mina𝒜ρμρPθ[𝒞(s,a,ξ)+γVk(s,μ)],s=g(s,a,ξ),μ(θ)=p(ξ|θ)μ(θ)Θp(ξ|θ)μ(θ)𝑑θ.V_{k+1}(s,\mu)\leftarrow\min_{a\in\mathcal{A}}\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V_{k}\left(s^{\prime},\mu^{\prime}\right)\right],\ s^{\prime}=g\left(s,a,\xi\right),\ \mu^{\prime}(\theta)=\frac{p\left(\xi|\theta\right)\mu(\theta)}{\int_{\Theta}p\left(\xi|\theta\right)\mu(\theta)d\theta}.
5:     end for
6:     Set k:=k+1k:=k+1.
7:  end while
8:  for all (s,μ)𝒮×𝒟10\left(s,\mu\right)\in\mathcal{S}\times\mathcal{D}_{1}^{0} do
9:     Set V^(s,μ):=Vk(s,μ)\hat{V}^{*}(s,\mu):=V_{k}(s,\mu) and determine πε,(s,μ)\pi^{\varepsilon,*}(s,\mu) by
πε,(s,μ)=argmina𝒜ρμρPθ[𝒞(s,a,ξ)+γV^(s,μ)].\displaystyle\pi^{\varepsilon,*}(s,\mu)=\arg\min_{a\in\mathcal{A}}\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma\hat{V}^{*}\left(s^{\prime},\mu^{\prime}\right)\right]. (5.35)
10:  end for
11:  return  The optimal value function V^\hat{V}^{*} and optimal policy πε,\pi^{\varepsilon,*}.

The main component of Algorithm 2 is Steps 2-7, which are designed to obtain an approximate optimal value function of problem (5.1). As in Algorithm 1, this requires the set 𝒮×𝒟10\mathcal{S}\times\mathcal{D}^{0}_{1} to be a finite discrete set. Once an approximate optimal value function V^\hat{V}^{*} is obtained, we solve problem (5.35) to obtain an approximate optimal policy in Step 9. This kind of algorithm is well known in the literature on infinite-horizon MDPs, see [51], and its convergence is well documented. In the next theorem, we state the convergence of the algorithm for self-containedness, accounting for the differences resulting from the BCR structure.
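A minimal sketch of the risk-averse value iteration of Algorithm 2 on a hypothetical two-state, two-action instance is given below. For simplicity the posterior μ is frozen (the Bayesian update of μ′ in Step 4 is omitted), the inner risk measure is the expectation under Pθ, and the outer risk measure is the worst case over the support of μ; all problem data are illustrative assumptions, not from the paper.

```python
# Tiny finite BCR-MDP sketch (hypothetical data).
S, A, XI, THETAS = [0, 1], [0, 1], [0, 1], [0, 1]
gamma, eps = 0.9, 1e-6
mu = [0.5, 0.5]                       # posterior weights on THETAS (frozen)

def p_xi(xi, theta):                  # p(xi | theta): two candidate noise models
    q = 0.3 if theta == 0 else 0.7
    return q if xi == 1 else 1.0 - q

def g(s, a, xi):                      # deterministic dynamics s' = g(s, a, xi)
    return (s + a + xi) % 2

def cost(s, a, xi):                   # bounded stage cost
    return 1.0 * s + 0.5 * a + 0.2 * xi

def inner(s, a, V, theta):            # inner risk: E_{P_theta}[C + gamma * V(s')]
    return sum(p_xi(xi, theta) * (cost(s, a, xi) + gamma * V[g(s, a, xi)]) for xi in XI)

def outer(s, a, V):                   # outer risk: worst case over the support of mu
    return max(inner(s, a, V, th) for th in THETAS if mu[th] > 0)

V = {s: 0.0 for s in S}
while True:                           # value iteration with the stopping rule of Step 2
    V_new = {s: min(outer(s, a, V) for a in A) for s in S}
    gap = max(abs(V_new[s] - V[s]) for s in S)
    V = V_new
    if gap < eps * (1 - gamma) / (2 * gamma):
        break

policy = {s: min(A, key=lambda a: outer(s, a, V)) for s in S}  # cf. (5.35)
```

Because the composite Bellman operator remains a γ-contraction, the loop terminates after roughly log(1/ε)/log(1/γ) sweeps, mirroring the stopping criterion in Step 2.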

Theorem 4.

Let the sequence {Vk}\{V_{k}\} be generated by Algorithm 2. Then

  • (i)

    {Vk}\{V_{k}\} converges linearly to the unique value function VV^{*} at a rate of γ\gamma. Further, the initialization V1=0V_{1}=0 ensures a non-decreasing sequence {Vk}\{V_{k}\}. Conversely, the constant initialization V1𝒞¯/(1γ)V_{1}\equiv\bar{\mathcal{C}}/(1-\gamma) results in a non-increasing sequence {Vk}\{V_{k}\}.

  • (ii)

    The policy πε,\pi^{\varepsilon,*} derived in Algorithm 2 is ε\varepsilon-optimal, i.e.,

    Vπε,Vε.\left\|V^{\pi^{\varepsilon,*}}-V^{*}\right\|_{\infty}\leq\varepsilon.

Proof. Part (i). Leveraging Lemma 3, it is evident that the sequence {Vk}\{V_{k}\} converges linearly to VV^{*}, as demonstrated by:

Vk+1VVkV=𝒯Vk𝒯VVkVγ.\frac{\left\|V_{k+1}-V^{*}\right\|_{\infty}}{\left\|V_{k}-V^{*}\right\|_{\infty}}=\frac{\left\|\mathcal{T}V_{k}-\mathcal{T}V^{*}\right\|_{\infty}}{\left\|V_{k}-V^{*}\right\|_{\infty}}\leq\gamma.

Starting with V1=0V_{1}=0 leads to V2=𝒯V1V1=0V_{2}=\mathcal{T}V_{1}\geq V_{1}=0. The monotonicity of 𝒯\mathcal{T}, as established in Lemma 3, ensures that

Vk+1=𝒯k1V2𝒯k1V1=VkV_{k+1}=\mathcal{T}^{k-1}V_{2}\geq\mathcal{T}^{k-1}V_{1}=V_{k}

for all k1k\geq 1. Conversely, initiating with the constant function V1𝒞¯/(1γ)V_{1}\equiv\bar{\mathcal{C}}/(1-\gamma), where 𝒞¯=sup{𝒞(s,a,ξ):s𝒮,a𝒜,ξΞ}\bar{\mathcal{C}}=\sup\{\mathcal{C}(s,a,\xi):s\in\mathcal{S},a\in\mathcal{A},\xi\in\Xi\}, and using the translation invariance of the risk measures, we obtain V2=𝒯V1𝒞¯+γV1=V1V_{2}=\mathcal{T}V_{1}\leq\bar{\mathcal{C}}+\gamma V_{1}=V_{1}. This reasoning concludes that Vk+1VkV_{k+1}\leq V_{k} for all k1k\geq 1.

Part (ii). We begin by noting that

Vπε,VVπε,Vk+1+Vk+1V.\left\|V^{\pi^{\varepsilon,*}}-V^{*}\right\|_{\infty}\leq\left\|V^{\pi^{\varepsilon,*}}-V_{k+1}\right\|_{\infty}+\left\|V_{k+1}-V^{*}\right\|_{\infty}. (5.36)

We dissect the first term of the right-hand side of (5.36) as follows:

Vπε,Vk+1=\displaystyle\left\|V^{\pi^{\varepsilon,*}}-V_{k+1}\right\|_{\infty}= 𝒯πε,Vπε,Vk+1\displaystyle\left\|\mathcal{T}^{\pi^{\varepsilon,*}}V^{\pi^{\varepsilon,*}}-V_{k+1}\right\|_{\infty}
\displaystyle\leq 𝒯πε,Vπε,𝒯Vk+1+𝒯Vk+1Vk+1\displaystyle\left\|\mathcal{T}^{\pi^{\varepsilon,*}}V^{\pi^{\varepsilon,*}}-\mathcal{T}V_{k+1}\right\|_{\infty}+\left\|\mathcal{T}V_{k+1}-V_{k+1}\right\|_{\infty}
=\displaystyle= 𝒯πε,Vπε,𝒯πε,Vk+1+𝒯Vk+1Vk+1\displaystyle\left\|\mathcal{T}^{\pi^{\varepsilon,*}}V^{\pi^{\varepsilon,*}}-\mathcal{T}^{\pi^{\varepsilon,*}}V_{k+1}\right\|_{\infty}+\left\|\mathcal{T}V_{k+1}-V_{k+1}\right\|_{\infty}
\displaystyle\leq γVπε,Vk+1+γVk+1Vk.\displaystyle\gamma\left\|V^{\pi^{\varepsilon,*}}-V_{k+1}\right\|_{\infty}+\gamma\left\|V_{k+1}-V_{k}\right\|_{\infty}.

The first equality holds because Vπε,V^{\pi^{\varepsilon,*}} is the fixed point of 𝒯πε,\mathcal{T}^{\pi^{\varepsilon,*}}, and the second equality holds since 𝒯Vk+1=𝒯πε,Vk+1\mathcal{T}V_{k+1}=\mathcal{T}^{\pi^{\varepsilon,*}}V_{k+1} by the construction of πε,\pi^{\varepsilon,*} in Algorithm 2. Thus we can infer

Vπε,Vk+1γ1γVk+1Vk.\left\|V^{\pi^{\varepsilon,*}}-V_{k+1}\right\|_{\infty}\leq\frac{\gamma}{1-\gamma}\left\|V_{k+1}-V_{k}\right\|_{\infty}.

Further examining the second term of the right-hand side of (5.36) leads to the following recursive deduction:

Vk+1V\displaystyle\left\|V_{k+1}-V^{*}\right\|_{\infty}\leq Vk+2V+Vk+1Vk+2\displaystyle\left\|V_{k+2}-V^{*}\right\|_{\infty}+\left\|V_{k+1}-V_{k+2}\right\|_{\infty}
≤\displaystyle\leq Vk+2V+γVk+1Vk\displaystyle\left\|V_{k+2}-V^{*}\right\|_{\infty}+\gamma\left\|V_{k+1}-V_{k}\right\|_{\infty}
\displaystyle\leq Vk+3V+Vk+2Vk+3+γVk+1Vk\displaystyle\left\|V_{k+3}-V^{*}\right\|_{\infty}+\left\|V_{k+2}-V_{k+3}\right\|_{\infty}+\gamma\left\|V_{k+1}-V_{k}\right\|_{\infty}
≤\displaystyle\leq Vk+3V+(γ+γ2)Vk+1Vk\displaystyle\left\|V_{k+3}-V^{*}\right\|_{\infty}+(\gamma+\gamma^{2})\left\|V_{k+1}-V_{k}\right\|_{\infty}
\displaystyle\leq \displaystyle\cdots
\displaystyle\leq (i=1γi)Vk+1Vkγ1γVk+1Vk.\displaystyle(\sum_{i=1}^{\infty}\gamma^{i})\left\|V_{k+1}-V_{k}\right\|_{\infty}\leq\frac{\gamma}{1-\gamma}\left\|V_{k+1}-V_{k}\right\|_{\infty}.

Therefore, we conclude from the above two inequalities and (5.36) that whenever

Vk+1Vk<ε(1γ)/2γ,\left\|V_{k+1}-V_{k}\right\|_{\infty}<\varepsilon(1-\gamma)/2\gamma,

it follows that Vπε,Vε\left\|V^{\pi^{\varepsilon,*}}-V^{*}\right\|_{\infty}\leq\varepsilon. \Box

Algorithm 2 iterates on the value function. Once a satisfactory approximate value function is obtained, we solve a single program (5.35) to obtain an approximate optimal policy. It is also possible to carry out policy iterations. Unlike value iteration, policy iteration refines policies directly. This kind of approach is also well known in the MDP literature [57]. Algorithm 3 describes the process.

Algorithm 3 Policy Iteration Algorithm for Infinite-horizon BCR-SOC/MDP (5.1).
1:  Initialize policy π1\pi_{1}, the total number of iterations KK and set k:=1k:=1;
2:  while kKk\leq K do
3:     for all (s,μ)𝒮×𝒟10\left(s,\mu\right)\in\mathcal{S}\times\mathcal{D}_{1}^{0} do
4:        Solve the equation:
Vk(s,μ)=ρμρPθ[𝒞(s,πk(s,μ),ξ)+γVk(s,μ)],s=g(s,πk(s,μ),ξ),μ(θ)=p(ξ|θ)μ(θ)Θp(ξ|θ)μ(θ)𝑑θV_{k}(s,\mu)=\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}\left(s,\pi_{k}(s,\mu),\xi\right)+\gamma V_{k}(s^{\prime},\mu^{\prime})\right],\ s^{\prime}=g\left(s,\pi_{k}(s,\mu),\xi\right),\ \mu^{\prime}(\theta)=\frac{p\left(\xi|\theta\right)\mu(\theta)}{\int_{\Theta}p\left(\xi|\theta\right)\mu(\theta)d\theta} (5.37)
to obtain value function VkV_{k}.
5:        Choose the corresponding policy by
πk+1(s,μ)argmina𝒜ρμρPθ[𝒞(s,a,ξ)+γVk(s,μ)].\pi_{k+1}(s,\mu)\in\arg\min_{a\in\mathcal{A}}\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V_{k}\left(s^{\prime},\mu^{\prime}\right)\right]. (5.38)
6:     end for
7:     Set k:=k+1k:=k+1.
8:  end while
9:  return  Optimal policy πk\pi_{k}.
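A minimal sketch of Algorithm 3 on the same kind of hypothetical two-state, two-action instance is given below; the data, the truncated policy-evaluation loop and the number of sweeps are our own illustrative assumptions. Policy evaluation (5.37) is approximated by iterating the fixed-point map, and (5.38) performs the improvement step.

```python
# A minimal policy iteration sketch for a tiny BCR-MDP (hypothetical data).
S, A, XI, THETAS = [0, 1], [0, 1], [0, 1], [0, 1]
gamma = 0.9

def p_xi(xi, theta):                 # p(xi | theta): two candidate noise models
    q = 0.3 if theta == 0 else 0.7
    return q if xi == 1 else 1.0 - q

def g(s, a, xi):                     # deterministic dynamics s' = g(s, a, xi)
    return (s + a + xi) % 2

def cost(s, a, xi):                  # bounded stage cost
    return 1.0 * s + 0.5 * a + 0.2 * xi

def bcr(s, a, V):                    # outer: worst case over theta; inner: E_{P_theta}
    return max(sum(p_xi(xi, th) * (cost(s, a, xi) + gamma * V[g(s, a, xi)])
                   for xi in XI) for th in THETAS)

def evaluate(pi, n_iter=600):        # policy evaluation: iterate V <- T^pi V, cf. (5.37)
    V = {s: 0.0 for s in S}
    for _ in range(n_iter):
        V = {s: bcr(s, pi[s], V) for s in S}
    return V

pi = {s: 0 for s in S}               # initial policy
history = []
for _ in range(4):                   # policy iteration sweeps
    V = evaluate(pi)
    history.append(dict(V))
    pi = {s: min(A, key=lambda a: bcr(s, a, V)) for s in S}  # improvement, cf. (5.38)
```

On this toy instance the recorded value functions are non-increasing across sweeps, matching the monotonicity established in Theorem 5.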

The next theorem states the convergence of Algorithm 3.

Theorem 5.

The sequence of value functions {Vk}\{V_{k}\} generated by Algorithm 3 is non-increasing and converges linearly to the unique value function VV^{*} at a rate of γ\gamma.

Proof. Equations (5.37)–(5.38) can be expressed concisely as:

Vk=𝒯πkVk, and 𝒯πk+1Vk=𝒯Vk,V_{k}=\mathcal{T}^{\pi_{k}}V_{k},\text{ and }\mathcal{T}^{\pi_{k+1}}V_{k}=\mathcal{T}V_{k},

respectively. Therefore, we can deduce that

𝒯πk+1Vk=𝒯Vk𝒯πkVk=Vk,\mathcal{T}^{\pi_{k+1}}V_{k}=\mathcal{T}V_{k}\leq\mathcal{T}^{\pi_{k}}V_{k}=V_{k},

from which we can obtain

(𝒯πk+1)nVk𝒯VkVk,n=0,1,2,,\left(\mathcal{T}^{\pi_{k+1}}\right)^{n}V_{k}\leq\mathcal{T}V_{k}\leq V_{k},\quad n=0,1,2,\cdots,

because of the monotonicity of the operator 𝒯πk+1\mathcal{T}^{\pi_{k+1}}. As nn\rightarrow\infty, we can conclude that (𝒯πk+1)nVk\left(\mathcal{T}^{\pi_{k+1}}\right)^{n}V_{k} converges to Vk+1V_{k+1} by applying Lemma 3 and Banach fixed-point theorem. Therefore,

Vk+1𝒯VkVk.V_{k+1}\leq\mathcal{T}V_{k}\leq V_{k}. (5.39)

Thus, we have established the non-increasing property of the sequence {Vk}\left\{V_{k}\right\}. Recalling that VkVV_{k}\geq V^{*}, we deduce from (5.39) that

0Vk+1V𝒯VkV=𝒯Vk𝒯V.0\leq V_{k+1}-V^{*}\leq\mathcal{T}V_{k}-V^{*}=\mathcal{T}V_{k}-\mathcal{T}V^{*}.

As a result of the contraction property of 𝒯\mathcal{T}, we can conclude that

Vk+1VγVkV.\left\|V_{k+1}-V^{*}\right\|_{\infty}\leq\gamma\left\|V_{k}-V^{*}\right\|_{\infty}.

Hence, the sequence {Vk}\left\{V_{k}\right\} converges linearly to VV^{*} at a rate of γ\gamma. \Box

Remark 1.

Suppose that 𝒮×𝒟10×𝒜\mathcal{S}\times\mathcal{D}_{1}^{0}\times\mathcal{A} is finite. Then there exist only finitely many policies, and Algorithm 3 must terminate after a finite number of iterations KK because of the non-increasing property of the sequence {Vk}\{V_{k}\}. That is, for all (s,μ)𝒮×𝒟10,\left(s,\mu\right)\in\mathcal{S}\times\mathcal{D}_{1}^{0}, we have πK+1(s,μ)=πK(s,μ)\pi_{K+1}(s,\mu)=\pi_{K}(s,\mu); both πK\pi_{K} and πK+1\pi_{K+1} are optimal policies, and VKV_{K} is the optimal value function.

In summary, Algorithms 2 and 3 offer complementary routes to solving BCR-SOC/MDPs: the former approximates the value function and then extracts a policy, while the latter refines policies directly. Together they provide practical tools for solving Bayes-adaptive risk-averse SOC/MDPs.

6 Sample average approximation of the BCR-SOC/MDPs

In Algorithms 1-3, we outline how to solve BCR-SOC/MDP (3.1) in the finite horizon case and BCR-SOC/MDP (5.1) in the infinite horizon case. A key step in Algorithms 1 and 2 is to solve the BCR minimization problem. When either ξ\xi or θ\theta or both are continuously distributed, these problems are not tractable because either a closed form of the objective function cannot be obtained or calculating the multiple integrals is prohibitively expensive; the same issue exists in Algorithm 3. This motivates us to adopt the sample average approximation (SAA) approach to discretize the probability distributions. We begin by considering the general BCR minimization problem:

mina𝒜\displaystyle\min_{a\in\mathcal{A}} ρμρPθ[𝒞(s,a,ξ)+γV(s,μ)]\displaystyle\rho_{\mu}\circ\rho_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right] (6.1a)
s.t. s=g(s,a,ξ),μ(θ)=p(ξ|θ)μ(θ)Θp(ξ|θ)μ(θ)𝑑θ\displaystyle s^{\prime}=g\left(s,a,\xi\right),\ \mu^{\prime}(\theta)=\frac{p\left(\xi|\theta\right)\mu(\theta)}{\int_{\Theta}p\left(\xi|\theta\right)\mu(\theta)d\theta} (6.1b)

for a predetermined value function V𝔅(𝒮,𝒟1p)V\in\mathfrak{B}(\mathcal{S},\mathcal{D}^{p}_{1}). In this section, we discuss how an appropriate SAA method can be implemented when the inner and outer risk measures take specific forms such as VaR and AVaR.

6.1 The VaR-Expectation case

In Example 4, we explained how BCR-SOC/MDPs can be derived under the distributionally robust SOC/MDP framework by setting the outer risk measure as VaRμα\text{VaR}_{\mu}^{\alpha} with α=0\alpha=0. In this section, we revisit the problem but with a general α(0,1)\alpha\in(0,1):

mina𝒜\displaystyle\min_{a\in\mathcal{A}} VaRμα𝔼Pθ[𝒞(s,a,ξ)+γV(s,μ)]\displaystyle\text{VaR}_{\mu}^{\alpha}\circ\mathbb{E}_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right] (6.2a)
s.t. s=g(s,a,ξ),μ(θ)=p(ξ|θ)μ(θ)Θp(ξ|θ)μ(θ)𝑑θ.\displaystyle s^{\prime}=g\left(s,a,\xi\right),\ \mu^{\prime}(\theta)=\frac{p\left(\xi|\theta\right)\mu(\theta)}{\int_{\Theta}p\left(\xi|\theta\right)\mu(\theta)d\theta}. (6.2b)

In this formulation, the objective is to minimize the α\alpha-quantile of the expected costs 𝔼Pθ[𝒞(s,a,ξ)+γV(s,μ)]\mathbb{E}_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]. This approach is less conservative than a DRO problem

mina𝒜maxθΘ𝔼Pθ[𝒞(s,a,ξ)+γV(s,μ)],\displaystyle\min_{a\in\mathcal{A}}\max_{\theta\in{\Theta}}\mathbb{E}_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right], (6.3)

as demonstrated in [52].

Further, problem (6.2) can be reformulated as a chance-constrained minimization problem:

minκ,a𝒜κ\displaystyle\min_{\kappa,a\in\mathcal{A}}\ \ \kappa (PVEα\text{P}_{\text{VE}}^{\alpha})
s.t.Pμ(𝔼Pθ[𝒞(s,a,ξ)+γV(s,μ)]κ)α.\displaystyle\text{ s.t.}\ \ \ \ P_{\mu}(\mathbb{E}_{P_{\theta}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]\geq\kappa)\leq\alpha.

This reformulation provides a probabilistic interpretation of the α\alpha-quantile constraint, aligning it with Bayesian decision-making. Specifically, by treating 𝔼Pθ[𝒞(s,a,ξ)+γV(s,μ)]κ\mathbb{E}_{P_{\theta}}\left[\mathcal{C}(s,a,\xi)+\gamma V(s^{\prime},\mu^{\prime})\right]-\kappa as a function of θ\theta, the chance constraint in (PVEα\text{P}_{\text{VE}}^{\alpha}) can be understood as Bayesian posterior feasibility, as defined in [24]. Thus, problem (PVEα\text{P}_{\text{VE}}^{\alpha}) can be equivalently reformulated as a DRO problem with a Bayesian ambiguity set, as proposed in [24].

Below, we discuss how to use SAA to discretize (6.2). To this end, let \theta_{i},i=1,2,\cdots,N, be i.i.d. samples drawn from the posterior distribution \mu. Analogous to [39, 52], we can formulate the sample average approximation of (P_{\text{VE}}^{\alpha}) as:

minκ,a𝒜,zi{0,1}\displaystyle\min_{\kappa,a\in\mathcal{A},z_{i}\in\{0,1\}} κ\displaystyle\kappa (PVEN,α\text{P}_{\text{VE}}^{N,\alpha})
s.t. Uzi+𝔼Pθi[𝒞(s,a,ξ)+γV(s,μ)]κ,i=1,2,,N,\displaystyle-Uz_{i}+\mathbb{E}_{P_{\theta_{i}}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]\leq\kappa,\ i=1,2,\cdots,N,
i=1Nzi=αN,\displaystyle\sum_{i=1}^{N}z_{i}=\lfloor\alpha N\rfloor,

where z_{i}, i=1,\cdots,N, are binary auxiliary variables, \lfloor b\rfloor stands for the largest integer not exceeding b, and U is a large positive constant exceeding

maxa𝒜,i=1,,N𝔼Pθi[𝒞(s,a,ξ)+γV(s,μ)]mina𝒜,i=1,,N𝔼Pθi[𝒞(s,a,ξ)+γV(s,μ)].\max_{a\in\mathcal{A},i=1,\cdots,N}\mathbb{E}_{P_{\theta_{i}}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]-\min_{a\in\mathcal{A},i=1,\cdots,N}\mathbb{E}_{P_{\theta_{i}}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right].
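To make the big-M reformulation (P_{\text{VE}}^{N,\alpha}) concrete, the following sketch solves it with `scipy.optimize.milp`, under the illustrative assumption that the expected cost \mathbb{E}_{P_{\theta_{i}}}[\mathcal{C}(s,a,\xi)+\gamma V(s^{\prime},\mu^{\prime})] is affine in a scalar action a, say c_{i}a+d_{i}; the coefficients c_{i}, d_{i}, the action bound, and the big-M constant U are all hypothetical:

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

rng = np.random.default_rng(0)
N, alpha, U = 20, 0.2, 1e3        # posterior samples, risk level, big-M constant

# hypothetical affine expected costs: E_{P_theta_i}[C + gamma*V] ~= c_i*a + d_i
c_i = rng.normal(1.0, 0.3, N)
d_i = rng.normal(5.0, 1.0, N)

n = 2 + N                          # decision vector x = [kappa, a, z_1, ..., z_N]
obj = np.zeros(n)
obj[0] = 1.0                       # minimize kappa

# -U*z_i + c_i*a + d_i <= kappa   <=>   -kappa + c_i*a - U*z_i <= -d_i
A = np.zeros((N, n))
A[:, 0] = -1.0
A[:, 1] = c_i
A[:, 2:] = -U * np.eye(N)
cons = [LinearConstraint(A, -np.inf, -d_i)]

# sum_i z_i = floor(alpha*N): at most floor(alpha*N) scenarios may exceed kappa
k = int(np.floor(alpha * N))
Aeq = np.zeros((1, n))
Aeq[0, 2:] = 1.0
cons.append(LinearConstraint(Aeq, k, k))

integrality = np.zeros(n)
integrality[2:] = 1                # z_i binary
bounds = Bounds([-np.inf, 0.0] + [0.0] * N, [np.inf, 10.0] + [1.0] * N)

res = milp(c=obj, constraints=cons, integrality=integrality, bounds=bounds)
kappa_opt, a_opt = res.x[0], res.x[1]
```

At the optimum, κ equals the SAA estimate of the α-level VaR of the expected cost at the optimal action: the solver relaxes the ⌊αN⌋ largest scenario costs via z_i and κ covers the rest.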

The next proposition states convergence of the problem (PVEN,α\text{P}_{\text{VE}}^{N,\alpha}) to its true counterpart (PVEα\text{P}_{\text{VE}}^{\alpha}) in terms of the optimal value.

Proposition 7.

Assume: (a) 𝔼Pθ[𝒞(s,a,ξ)+γV(s,μ)]\mathbb{E}_{P_{\theta}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right] is globally Lipschitz continuous in aa with modulus LL uniformly for all s𝒮s\in\mathcal{S} and θΘ\theta\in\Theta, (b) the optimal values ϑVEα\vartheta^{\alpha}_{VE} and ϑVEN,α\vartheta^{N,\alpha}_{VE} of problems (PVEα\text{P}_{\text{VE}}^{\alpha}) and (PVEN,α\text{P}_{\text{VE}}^{N,\alpha}) are finite. Let 0<τ<α,ϵ(0,1),η>00<\tau<\alpha,\epsilon\in(0,1),\eta>0, DaD\geq||a|| for all a𝒜a\in\mathcal{A} and

N0=2τ2[log(1ϵ)+nlog2LDη+nlog2τ].N_{0}=\frac{2}{\tau^{2}}\left[\log(\frac{1}{\epsilon})+n\log\lceil\frac{2LD}{\eta}\rceil+n\log\lceil\frac{2}{\tau}\rceil\right]. (6.4)

Then with probability at least 12ϵ1-2\epsilon,

ϑVEα+τηϑVEN,αϑVEατ\vartheta^{\alpha+\tau}_{VE}-\eta\leq\vartheta^{N,\alpha}_{VE}\leq\vartheta^{\alpha-\tau}_{VE} (6.5)

for all NN0N\geq N_{0}.

Proof. By Theorem 3 in [39], we can deduce that ϑVEN,αϑVEατ\vartheta^{N,\alpha}_{VE}\leq\vartheta^{\alpha-\tau}_{VE} with probability at least 1ϵ1-\epsilon for all N12τ2log(1ϵ)N\geq\frac{1}{2\tau^{2}}\log(\frac{1}{\epsilon}). Moreover, following a similar argument to the proof of [39, Theorem 10], we can show that a feasible solution to the problem

minκ,a𝒜,zi{0,1}\displaystyle\min_{\kappa,a\in\mathcal{A},z_{i}\in\{0,1\}} κ\displaystyle\kappa (6.6)
s.t. Uzi+𝔼Pθi[𝒞(s,a,ξ)+γV(s,μ)]+ηκ,i=1,2,,N,\displaystyle-Uz_{i}+\mathbb{E}_{P_{\theta_{i}}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]+\eta\leq\kappa,\ i=1,2,\cdots,N,
i=1Nzi=αN,\displaystyle\sum_{i=1}^{N}z_{i}=\lfloor\alpha N\rfloor,

is also a feasible solution to problem (PVEα+τ)(\text{P}_{\text{VE}}^{\alpha+\tau}) with probability at least 1ϵ1-\epsilon for all NN0N\geq N_{0}. Thus, the optimal value of problem (6.6) can be written as ϑVEN,α+η\vartheta^{N,\alpha}_{VE}+\eta. This means that when the solution to problem (6.6) is feasible for problem (PVEα+τ)(\text{P}_{\text{VE}}^{\alpha+\tau}), it holds that ϑVEN,α+ηϑVEα+τ\vartheta^{N,\alpha}_{VE}+\eta\geq\vartheta^{\alpha+\tau}_{VE}. The conclusion follows as N012τ2log(1ϵ)N_{0}\geq\frac{1}{2\tau^{2}}\log(\frac{1}{\epsilon}). \Box

Note that in (6.4), the sample size N_{0} depends on the logarithms of 1/\epsilon and 1/\eta, which means N_{0} increases very slowly as \epsilon and \eta are driven to 0. Indeed, the numerical experiments in [39] show that the bounds in Proposition 7 are overly conservative, potentially allowing for a smaller N in practice. Note also that when \mathbb{E}_{P_{\theta}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right] does not have a closed form and the dimension of a is large, we need to use a sample average to approximate it. Consequently, problem (\text{P}_{\text{VE}}^{N,\alpha}) can be further approximated by the following MINLP:

minκ,a𝒜,zi{0,1}\displaystyle\min_{\kappa,a\in\mathcal{A},z_{i}\in\{0,1\}} κ\displaystyle\kappa (PVEN,M,α\text{P}_{\text{VE}}^{N,M,\alpha})
s.t. \displaystyle-Uz_{i}+\frac{1}{M}\sum_{j=1}^{M}\left[\mathcal{C}\left(s,a,\xi^{i,j}\right)+\gamma V\left(s^{i,j},\mu^{i,j}\right)\right]\leq\kappa,\ i=1,2,\cdots,N,
i=1Nzi=αN,\displaystyle\sum_{i=1}^{N}z_{i}=\lfloor\alpha N\rfloor,

where (ξi,1,,ξi,M)(\xi^{i,1},\cdots,\xi^{i,M}) are i.i.d. samples generated from PθiP_{\theta_{i}} for i=1,,Ni=1,\cdots,N. Denote the optimal value of problem (PVEN,M,α\text{P}_{\text{VE}}^{N,M,\alpha}) by ϑVEN,M,α\vartheta^{N,M,\alpha}_{VE}. The next proposition states that ϑVEN,M,α\vartheta^{N,M,\alpha}_{VE} lies in a neighborhood of ϑVEN,α\vartheta^{N,\alpha}_{VE} with a high probability when the sample size MM is sufficiently large.

Proposition 8.

Suppose: (a) there exists a measurable function φ:Ξ+\varphi:\Xi\to\mathbb{R}^{+} such that

\left|\mathcal{C}\left(s,a_{1},\xi\right)+\gamma V\left(s_{1}^{\prime},\mu^{\prime}\right)-\mathcal{C}\left(s,a_{2},\xi\right)-\gamma V\left(s_{2}^{\prime},\mu^{\prime}\right)\right|\leq\varphi(\xi)\left\|a_{1}-a_{2}\right\| (6.7)

for all s𝒮s\in\mathcal{S}, a1,a2𝒜a_{1},a_{2}\in\mathcal{A} and ξΞ\xi\in\Xi, (b) the moment generating function of φ(ξ)\varphi(\xi) is finite valued in a neighborhood of 0. Define

M0=8ς2δ2[log(1+Dnυn)+log(1ε)],M_{0}=\frac{8\varsigma^{2}}{\delta^{2}}\left[\log\left(1+\frac{D^{n}}{\upsilon^{n}}\right)+\log(\frac{1}{\varepsilon})\right],

where ς\varsigma and υ\upsilon are positive constants corresponding to the given objective function 𝒞(s,a,ξ)+γV(s,μ)\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right) and distribution PθP_{\theta}. Then with probability at least 12ε1-2\varepsilon, we have

ϑVEN,αδϑVEN,M,αϑVEN,α+δ\vartheta^{N,\alpha}_{VE}-\delta\leq\vartheta^{N,M,\alpha}_{VE}\leq\vartheta^{N,\alpha}_{VE}+\delta (6.8)

for all MM0M\geq M_{0}.

Proof. Proposition 2 in [70] demonstrates that a feasible solution for problem (PVEN,M,α\text{P}_{\text{VE}}^{N,M,\alpha}) is also feasible for the problem:

minκ,a𝒜,zi{0,1}\displaystyle\min_{\kappa,a\in\mathcal{A},z_{i}\in\{0,1\}} κ\displaystyle\kappa (6.9)
s.t. Uzi+𝔼Pθi[𝒞(s,a,ξ)+γV(s,μ)]κ+δ,i=1,2,,N,\displaystyle-Uz_{i}+\mathbb{E}_{P_{\theta_{i}}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]\leq\kappa+\delta,\ i=1,2,\cdots,N,
i=1Nzi=αN,\displaystyle\sum_{i=1}^{N}z_{i}=\lfloor\alpha N\rfloor,

with probability at least 1ε1-\varepsilon for all MM0M\geq M_{0}. Observe that the optimal value of problem (6.9) is ϑVEN,αδ\vartheta^{N,\alpha}_{VE}-\delta. Therefore, if a feasible solution to problem (PVEN,M,α\text{P}_{\text{VE}}^{N,M,\alpha}) is also a feasible solution to problem (6.9)(\ref{VE_feasi2}), then we have ϑVEN,αδϑVEN,M,α\vartheta^{N,\alpha}_{VE}-\delta\leq\vartheta^{N,M,\alpha}_{VE}. Likewise a feasible solution to the following problem

minκ,a𝒜,zi{0,1}\displaystyle\min_{\kappa,a\in\mathcal{A},z_{i}\in\{0,1\}} κ\displaystyle\kappa (6.10)
s.t. Uzi+𝔼Pθi[𝒞(s,a,ξ)+γV(s,μ)]κδ,i=1,2,,N,\displaystyle-Uz_{i}+\mathbb{E}_{P_{\theta_{i}}}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right]\leq\kappa-\delta,\ i=1,2,\cdots,N,
i=1Nzi=αN,\displaystyle\sum_{i=1}^{N}z_{i}=\lfloor\alpha N\rfloor,

is also feasible for problem (PVEN,M,α\text{P}_{\text{VE}}^{N,M,\alpha}) with probability at least 1ε1-\varepsilon when MM0M\geq M_{0}. The optimal value of problem (6.10) is given by ϑVEN,α+δ\vartheta^{N,\alpha}_{VE}+\delta. Therefore, when the feasible solution to problem (6.10) is feasible for problem (PVEN,M,α\text{P}_{\text{VE}}^{N,M,\alpha}), it holds that ϑVEN,α+δϑVEN,M,α\vartheta^{N,\alpha}_{VE}+\delta\geq\vartheta^{N,M,\alpha}_{VE}. \Box

Assumptions (a) and (b) in Proposition 8 are easily satisfied; their justification can be found in part (iii) of Proposition 4. From the numerical perspective, we can conveniently use packages such as BARON to solve moderately sized problems (\text{P}_{\text{VE}}^{N,\alpha}) and (\text{P}_{\text{VE}}^{N,M,\alpha}). By combining Propositions 7 and 8, we are ready to state the main result of this section.

Theorem 6.

Assume the settings and conditions of Propositions 7 and 8 are satisfied. Then with probability at least 12ϵ2ε1-2\epsilon-2\varepsilon, we have

ϑVEN,M,α+τδϑVEαϑVEN,M,ατ+δ+η\vartheta^{N,M,\alpha+\tau}_{VE}-\delta\leq\vartheta^{\alpha}_{VE}\leq\vartheta^{N,M,\alpha-\tau}_{VE}+\delta+\eta

for all NN0N\geq N_{0} and MM0M\geq M_{0}.

6.2 The VaR-AVaR case

We now discuss the case when the outer risk measure and the inner risk measure in the BCR minimization problem are set with VaR and AVaR respectively:

mina𝒜\displaystyle\min_{a\in\mathcal{A}} VaRμαAVaRPθβ[𝒞(s,a,ξ)+γV(s,μ)]\displaystyle\text{VaR}_{\mu}^{\alpha}\circ\text{AVaR}_{P_{\theta}}^{\beta}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right] (6.11)
s.t. s=g(s,a,ξ),μ(θ)=p(ξ|θ)μ(θ)Θp(ξ|θ)μ(θ)𝑑θ.\displaystyle s^{\prime}=g\left(s,a,\xi\right),\ \mu^{\prime}(\theta)=\frac{p\left(\xi|\theta\right)\mu(\theta)}{\int_{\Theta}p\left(\xi|\theta\right)\mu(\theta)d\theta}.

This model minimizes the largest risk value, assessed by \text{AVaR}_{P_{\theta}}^{\beta}, that occurs with probability no more than a predefined threshold α, and it can also be interpreted within the framework of a Bayesian DRO problem. Since it is often difficult to derive a closed-form expression for \text{AVaR}_{P_{\theta}}^{\beta}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right], we may consider the sample average approximation of the problem. However, accurately evaluating the SAA requires a large sample size when α and β are close to 0. Meanwhile, directly substituting \mathbb{E}_{P_{\theta}} with \text{AVaR}_{P_{\theta}}^{\beta} in model (6.11) results in a large mixed-integer program, posing significant computational challenges. To address these issues, we relax the outer risk measure VaR to its tightest convex upper bound, AVaR, resulting in a convex program which can be solved efficiently using a wide range of algorithms. Specifically, under this setting, we propose to solve the following optimization problem within the BCR-SOC/MDP framework:

mina𝒜AVaRμαAVaRPθβ[𝒞(s,a,ξ)+γV(s,μ)]\displaystyle\min_{a\in\mathcal{A}}\text{AVaR}_{\mu}^{\alpha}\circ\text{AVaR}_{P_{\theta}}^{\beta}\left[\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right] (6.12)
s=g(s,a,ξ),μ(θ)=p(ξ|θ)μ(θ)Θp(ξ|θ)μ(θ)𝑑θ.\displaystyle s^{\prime}=g\left(s,a,\xi\right),\ \mu^{\prime}(\theta)=\frac{p\left(\xi|\theta\right)\mu(\theta)}{\int_{\Theta}p\left(\xi|\theta\right)\mu(\theta)d\theta}.

By exploiting Rockafellar-Uryasev’s reformulation of AVaR\mathrm{AVaR} in [54] as an optimization problem, we can rewrite problem (6.12) as

\min_{\begin{subarray}{c}a\in\mathcal{A}\\ u\in\mathbb{R}\end{subarray}}\left\{u+\frac{1}{\alpha}\mathbb{E}_{\mu}\left[\text{AVaR}^{\beta}_{P_{\theta}}\left(\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)-u\right]^{+}\right\} (\text{P}_{\text{CC}}^{\alpha,\beta})

and applying SAA to the latter

\mathbb{E}_{\mu}\left[\text{AVaR}^{\beta}_{P_{\theta}}\left(\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)-u\right]^{+}\approx\frac{1}{N}\sum_{i=1}^{N}\left[\text{AVaR}^{\beta}_{P_{\theta_{i}}}\left(\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)-u\right]^{+}, (6.13)

we can obtain sample average approximation of problem (PCCα,β\text{P}_{\text{CC}}^{\alpha,\beta}):

minu,a𝒜,vN\displaystyle\min_{u\in\mathbb{R},a\in\mathcal{A},v\in\mathbb{R}^{N}} u+1αNeTv\displaystyle u+\frac{1}{\alpha N}e^{T}v (PCCN,α,β\text{P}_{\text{CC}}^{N,\alpha,\beta})
s.t. viAVaRPθiβ(𝒞(s,a,ξ)+γV(s,μ))u,\displaystyle v_{i}\geq\text{AVaR}^{\beta}_{P_{\theta_{i}}}\left(\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)-u,
vi0,i=1,2,,N,\displaystyle v_{i}\geq 0,\ i=1,2,\cdots,N,

where e denotes the vector with all components equal to one. By regarding problem (\text{P}_{\text{CC}}^{N,\alpha,\beta}) as an AVaR-constrained problem, we can use another SAA to approximate the AVaR in the constraints:

minu,a,v,w\displaystyle\min_{u,a,v,w} u+1αNeTv\displaystyle u+\frac{1}{\alpha N}e^{T}v (PCCN,M,α,β\text{P}_{\text{CC}}^{N,M,\alpha,\beta})
s.t. viwi+1βMj=1M[𝒞(s,a,ξi,j)+γV(si,j,μi,j)wi]+u,\displaystyle v_{i}\geq w_{i}+\frac{1}{\beta M}\sum_{j=1}^{M}\left[\mathcal{C}\left(s,a,\xi^{i,j}\right)+\gamma V\left(s^{i,j},\mu^{i,j}\right)-w_{i}\right]^{+}-u,
vi0,i=1,2,,N,\displaystyle v_{i}\geq 0,\ i=1,2,\cdots,N,

where w=(w1,,wN)w=(w_{1},\cdots,w_{N}) denotes an auxiliary variable and (ξi,1,,ξi,M)(\xi^{i,1},\cdots,\xi^{i,M}) are MM i.i.d. samples generated from PθiP_{\theta_{i}} for i=1,,Ni=1,\cdots,N. Because of the convexity as demonstrated in Proposition 4, problem (PCCN,M,α,β\text{P}_{\text{CC}}^{N,M,\alpha,\beta}) is a convex optimization problem with a linear objective function, which can be solved efficiently even for large-scale problems.
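A minimal sketch of (P_{\text{CC}}^{N,M,\alpha,\beta}) as a linear program follows, under the illustrative assumption that each sampled cost \mathcal{C}(s,a,\xi^{i,j})+\gamma V(s^{i,j},\mu^{i,j}) is affine in a scalar action a, say c_{ij}a+d_{ij} (all coefficients hypothetical); the inner positive parts are linearized with auxiliary variables t_{ij}:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(2)
N, M, alpha, beta = 5, 50, 0.2, 0.1
# hypothetical affine per-sample costs: C(s,a,xi_ij) + gamma*V ~= c_ij*a + d_ij
c_ij = rng.normal(1.0, 0.2, (N, M))
d_ij = rng.normal(5.0, 1.0, (N, M))

# decision vector x = [u, a, v_1..v_N, w_1..w_N, t_11..t_NM]
iu, ia = 0, 1
iv = lambda i: 2 + i
iw = lambda i: 2 + N + i
it = lambda i, j: 2 + 2 * N + i * M + j
n = 2 + 2 * N + N * M

obj = np.zeros(n)
obj[iu] = 1.0
for i in range(N):
    obj[iv(i)] = 1.0 / (alpha * N)        # u + (1/(alpha*N)) e^T v

rows, b = [], []
# outer AVaR constraint: w_i + (1/(beta*M)) sum_j t_ij - u - v_i <= 0
for i in range(N):
    r = np.zeros(n)
    r[iw(i)], r[iu], r[iv(i)] = 1.0, -1.0, -1.0
    for j in range(M):
        r[it(i, j)] = 1.0 / (beta * M)
    rows.append(r); b.append(0.0)
# inner positive part: t_ij >= c_ij*a + d_ij - w_i  (and t_ij >= 0 via bounds)
for i in range(N):
    for j in range(M):
        r = np.zeros(n)
        r[ia], r[iw(i)], r[it(i, j)] = c_ij[i, j], -1.0, -1.0
        rows.append(r); b.append(-d_ij[i, j])

bounds = [(None, None), (0.0, 10.0)] + [(0.0, None)] * N \
         + [(None, None)] * N + [(0.0, None)] * (N * M)

res = linprog(obj, A_ub=np.vstack(rows), b_ub=np.array(b), bounds=bounds)
```

With βM and αN integer, the optimal value reproduces the empirical nested AVaR: the average of the βM largest inner samples, then the AVaR over the N posterior scenarios.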

Next, we examine the sample sizes N and M needed to attain a specified precision of the optimal value. Denote the optimal values of (\text{P}_{\text{CC}}^{N,\alpha,\beta}) and (\text{P}_{\text{CC}}^{N,M,\alpha,\beta}) by \vartheta^{N,\alpha,\beta}_{CC} and \vartheta^{N,M,\alpha,\beta}_{CC}, respectively. Regarding N, it has been proven in [61, Corollary 5.19] that under certain conditions the next proposition holds.

Proposition 9.

Suppose that for all s\in\mathcal{S} and a\in\mathcal{A}, \mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right) is Lipschitz continuous with respect to a with Lipschitz constant L. Let \epsilon,\eta>0, and

N0=(O(1)LDη)2[nlogO(1)LDη+log(1ϵ)],N_{0}=\left(\frac{O(1)LD}{\eta}\right)^{2}\left[n\log\frac{O(1)LD}{\eta}+\log(\frac{1}{\epsilon})\right],

where O(1)O(1) is a constant. Then the optimal solution of problem (PCCN,α,β\text{P}_{\text{CC}}^{N,\alpha,\beta}) lies in the η\eta-neighborhood of that of problem (PCCα,β\text{P}_{\text{CC}}^{\alpha,\beta}) with probability 1ϵ1-\epsilon for all N>N0N>N_{0}. Moreover, the optimal value of problem (PCCN,α,β\text{P}_{\text{CC}}^{N,\alpha,\beta}) lies in the LηL\eta-neighborhood of that of problem (PCCα,β\text{P}_{\text{CC}}^{\alpha,\beta}) with probability 1ϵ1-\epsilon for all N>N0N>N_{0}.

As for the selection of MM, we give the following proposition.

Proposition 10.

Suppose: (a) there exists a measurable function φ:Ξ+\varphi:\Xi\to\mathbb{R}^{+} such that

\left|\mathcal{C}\left(s,a_{1},\xi\right)+\gamma V\left(s_{1}^{\prime},\mu^{\prime}\right)-\mathcal{C}\left(s,a_{2},\xi\right)-\gamma V\left(s_{2}^{\prime},\mu^{\prime}\right)\right|\leq\varphi(\xi)\left\|a_{1}-a_{2}\right\|,

for all s𝒮s\in\mathcal{S}, a1,a2𝒜a_{1},a_{2}\in\mathcal{A} and ξΞ\xi\in\Xi; (b) the moment generating function of φ(ξ)\varphi(\xi) is finite in a neighborhood of 0. Let δ>0\delta>0, ε(0,1)\varepsilon\in(0,1),

M0=8ς2δ2[log(1+Dn2𝒞¯υn+1(1γ))+log(1ε)],M_{0}=\frac{8\varsigma^{2}}{\delta^{2}}\left[\log\left(1+\frac{D^{n}2\bar{\mathcal{C}}}{\upsilon^{n+1}(1-\gamma)}\right)+\log(\frac{1}{\varepsilon})\right],

where ς\varsigma and υ\upsilon are constants associated with a given objective function 𝒞(s,a,ξ)+γV(s,μ)\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right) and distribution PθP_{\theta}. Then with probability at least 12ε1-2\varepsilon,

ϑCCN,α,βδϑCCN,M,α,βϑCCN,α,β+δ\vartheta^{N,\alpha,\beta}_{CC}-\delta\leq\vartheta^{N,M,\alpha,\beta}_{CC}\leq\vartheta^{N,\alpha,\beta}_{CC}+\delta (6.14)

for all MM0M\geq M_{0}.

Proof. Proposition 3 in [70] states that with probability at least 1ε1-\varepsilon, a feasible solution to problem (PCCN,M,α,β\text{P}_{\text{CC}}^{N,M,\alpha,\beta}) is also feasible for the problem:

minu,a𝒜,vN\displaystyle\min_{u\in\mathbb{R},a\in\mathcal{A},v\in\mathbb{R}^{N}} u+1αNeTv\displaystyle u+\frac{1}{\alpha N}e^{T}v (6.15)
s.t. viAVaRPθiβ(𝒞(s,a,ξ)+γV(s,μ))uδ,\displaystyle v_{i}\geq\text{AVaR}^{\beta}_{P_{\theta_{i}}}\left(\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)-u-\delta,
vi0,i=1,2,,N\displaystyle v_{i}\geq 0,\ i=1,2,\cdots,N

for all MM0M\geq M_{0}. Note that the optimal value of problem (6.15) equals ϑCCN,α,βδ\vartheta^{N,\alpha,\beta}_{CC}-\delta. Thus, if a feasible solution to problem (PCCN,M,α,β\text{P}_{\text{CC}}^{N,M,\alpha,\beta}) is feasible for problem (6.15), then ϑCCN,α,βδϑCCN,M,α,β\vartheta^{N,\alpha,\beta}_{CC}-\delta\leq\vartheta^{N,M,\alpha,\beta}_{CC}. Likewise a feasible solution to the problem

minu,a𝒜,vN\displaystyle\min_{u\in\mathbb{R},a\in\mathcal{A},v\in\mathbb{R}^{N}} u+1αNeTv\displaystyle u+\frac{1}{\alpha N}e^{T}v (6.16)
s.t. viAVaRPθiβ(𝒞(s,a,ξ)+γV(s,μ))u+δ,\displaystyle v_{i}\geq\text{AVaR}^{\beta}_{P_{\theta_{i}}}\left(\mathcal{C}\left(s,a,\xi\right)+\gamma V\left(s^{\prime},\mu^{\prime}\right)\right)-u+\delta,
vi0,i=1,2,,N\displaystyle v_{i}\geq 0,\ i=1,2,\cdots,N

is also feasible for problem (6.15) with probability at least 1ε1-\varepsilon for all MM0M\geq M_{0}. The optimal value of problem (6.16) is ϑCCN,α,β+δ\vartheta^{N,\alpha,\beta}_{CC}+\delta. Thus, when the feasible solution to problem (6.16) is feasible for problem (6.15), we have ϑCCN,α,β+δϑCCN,M,α,β\vartheta^{N,\alpha,\beta}_{CC}+\delta\geq\vartheta^{N,M,\alpha,\beta}_{CC}. \Box

By combining Propositions 9 and 10, we can generate the second main result of this section.

Theorem 7.

Assume the settings and conditions of Propositions 9 and 10 are satisfied. Then with probability at least 1ϵ2ε1-\epsilon-2\varepsilon,

ϑCCN,M,α,βδLηϑCCα,βϑCCN,M,α,β+δ+Lη\vartheta^{N,M,\alpha,\beta}_{CC}-\delta-L\eta\leq\vartheta^{\alpha,\beta}_{CC}\leq\vartheta^{N,M,\alpha,\beta}_{CC}+\delta+L\eta

for all NN0N\geq N_{0} and MM0M\geq M_{0}.

The sample sizes established in Propositions 9 and 10 could be large. In practice, various strategies may be employed to accelerate the solution process. For example, [83] shows that smoothing the AVaR can improve computational efficiency significantly, potentially speeding up the solution process severalfold.
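To illustrate the smoothing idea (the softplus surrogate below is only one common choice and is not claimed to be the specific scheme of [83]), the positive part in the Rockafellar-Uryasev objective can be replaced by a differentiable upper bound, making the SAA objective amenable to gradient-based solvers:

```python
import numpy as np

def avar_saa(x, beta, w):
    """Rockafellar-Uryasev SAA objective for AVaR_beta evaluated at threshold w."""
    return w + np.mean(np.maximum(x - w, 0.0)) / beta

def avar_smooth(x, beta, w, tau=1e-3):
    """Softplus surrogate: tau*log(1 + exp(z/tau)) -> [z]^+ as tau -> 0."""
    z = (x - w) / tau
    softplus = tau * np.logaddexp(0.0, z)   # numerically stable softplus
    return w + softplus.mean() / beta

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 100_000)           # sampled costs (illustrative)
w = np.quantile(x, 0.9)                     # the R-U minimizer is the VaR threshold
exact = avar_saa(x, 0.1, w)
smooth = avar_smooth(x, 0.1, w, tau=1e-3)
# smooth approximates exact from above, with error at most tau*log(2)/beta
```

The smoothing parameter τ trades approximation error (O(τ)) against the conditioning of the resulting smooth optimization problem.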

7 Numerical results

In this section, we evaluate the performance of proposed BCR-SOC/MDP models and algorithms by applying them to a finite-horizon spread betting problem and an infinite-horizon inventory control problem. All the experiments are carried out on a 64-bit PC with 12 GB of RAM and a 3.20 GHz processor.

We choose the following SOC/MDP models for comparison.

  • BCR-SOC/MDP: for the exact dynamic programming method and value iterations, Algorithm 1 and Algorithm 2, we need to discretize the augmented state (st,μt)(s_{t},\mu_{t}) into a fine grid with a step size of 0.1. We consider two variants in our experiment: the VaR-Expectation BCR-SOC/MDP and the AVaR-AVaR BCR-SOC/MDP, as discussed in Section 6.

  • Risk-averse SOC/MDP (RA-SOC/MDP) presented in [7]: we derive a maximum likelihood estimate (MLE) \hat{\theta} from the observations of ξ and solve the corresponding risk-averse SOC/MDP using the estimated distribution P_{\hat{\theta}}. For comparability with the risk-averse Bayes-adaptive MDP (AVaR-AVaR case), we adopt AVaR as the risk measure.

  • Standard SOC/MDP: for comparability with the risk-averse Bayes-adaptive MDP (VaR-Expectation case), we adopt the expectation operator as the objective function. Likewise, we use the MLE estimator to estimate the true distribution parameter.

  • Bayes-adaptive MDP: here, we focus on the episodic Bayesian approach presented in [63], which can be reformulated as the Expectation-Expectation BCR-SOC/MDP variant as illustrated in Example 4.

  • Distributionally robust SOC/MDP (DR-SOC/MDP) presented in [82, 38]: since prior probabilistic information to construct the ambiguity set might not be available from the dataset directly, we simulate samples of θ\theta from the posterior distribution to construct the ambiguity set, and derive the optimal policy that minimizes the expected cost under the worst-case scenario of θ\theta.

Each method will be used to derive the optimal policy from the same sample set, and the performance of the resulting policy will then be tested on a real system, specifically, a SOC/MDP with the true parameter θc\theta^{c}. Below are detailed descriptions of the problems and the tests.

7.1 Finite horizon spread betting problem

In this subsection, we consider a spread betting problem adapted from Birge et al. [6]. In spread betting markets, market makers (e.g., bookmakers) quote a spread line for the outcome of an uncertain event, and participants (bettors) bet on whether the outcome will exceed or fall below the spread line based on their own judgment and experience. A typical example is in sports betting, where the bookmaker sets a point spread for a specific match or game. Bettors then bet on whether the outcome will be greater (favorite wins by more than the spread) or smaller (favorite does not win by more than the spread) than the set spread line. Birge et al. [6] consider the market maker's decision-making process of setting the spread line such that the total expected profit is maximized over a specified time horizon.

Here, we focus on a risk-averse bettor’s optimal decision-making problem where the bettor aims to minimize the long-term risk of random losses by dynamically adjusting the bet size and learning about the market’s movements over time. To model this dynamic decision-making process, we apply the BCR-SOC/MDP model to the informed bettor’s decision problem in [6]. The setup is specified as follows.

  • (i)

    Market movement ξt\xi_{t}. The spread market is a betting market on a specific event, where the outcome of the event is represented by a continuous random variable Xt:ΩX_{t}:\Omega\to\mathbb{R} with cdf FXtF_{X_{t}} at episode tt. The bookmaker sets a spread line ltl_{t} for each bet at episode tt. The corresponding spread market movement ξt\xi_{t} is a binary random variable, where: ξt=1\xi_{t}=1 represents the event that the outcome of XtX_{t} exceeds ltl_{t} and ξt=1\xi_{t}=-1 otherwise. In practice, the probability P(ξt=1)=1FXt(lt)P(\xi_{t}=1)=1-F_{X_{t}}(l_{t}) is unknown to bettors and must be estimated over time. For simplicity, we assume that the market maker maintains a constant probability for the event ξt=1\xi_{t}=1, denoted as θc\theta^{c}, for all bets. This may be understood as the market maker striving to control the spread line ltl_{t} such that the probability of the event exceeding the spread remains constant over time. The assumption means that the market maker would maintain a fixed probability for market movements by adjusting ltl_{t} according to the distribution of XtX_{t}. In the case that θc=0.5\theta^{c}=0.5, the market maker remains indifferent to the outcome of the bets in each run, avoiding significant exposure to either side of the bet.

  • (ii)

    Wealth sts_{t}. The bettor’s cumulative wealth at episode tt, representing the bettor’s account balance. Wealth is updated at each episode based on the trades and the spread market scenarios.

  • (iii)

    Belief μt(θ)\mu_{t}(\theta). The bettor’s belief (or estimate) of the spread market activity level θc\theta^{c}. The belief about spread market movements is updated based on the spread market feedback after the trade observation, reflecting the evolving understanding of the spread market state over time. A typical prior μ1\mu_{1} for θ\theta here is the Beta prior since the posterior μt\mu_{t} is also a Beta distribution (see [21]).

  • (iv)

    Position Size ata_{t}. At each episode tt, the bettor chooses the position size ata_{t}, which represents the amount of money to bet. The bettor can bet on the event {Xt>lt}\{X_{t}>l_{t}\} (i.e., the outcome exceeds the spread line). The position size is determined by the bettor’s policy πt(st,μt)\pi_{t}(s_{t},\mu_{t}), which depends on the bettor’s current wealth sts_{t} and belief about the market state μt(θ)\mu_{t}(\theta). In this model, the bettor cannot bet more than the bettor’s available wealth, so the stake must satisfy the constraint 0atst0\leq a_{t}\leq s_{t}, ensuring that the bettor does not wager more than they have.

  • (v)

    Loss function 𝒞t{\cal C}_{t} and state transition function gtg_{t}. The loss at episode tt is defined as the negative profit from the trade. Specifically, the loss function 𝒞t{\cal C}_{t} is defined as:

    𝒞t(st,at,ξt)=(1τ𝟏{ξt>0})atξt,{\cal C}_{t}(s_{t},a_{t},\xi_{t})=-(1-\tau\boldsymbol{1}_{\{\xi_{t}>0\}})a_{t}\xi_{t},

    where τ\tau denotes the commission rate (a value between 0 and 1) charged by the market maker, and 𝟏{ξt>0}\boldsymbol{1}_{\{\xi_{t}>0\}} is an indicator function that takes the value of 1 if ξt>0\xi_{t}>0 and 0 otherwise. The net profit for the bettor is calculated as follows. If the bettor wins (i.e., atξt>0a_{t}\xi_{t}>0), the effective profit is reduced by the commission: (1τ)atξt(1-\tau)a_{t}\xi_{t}. If the bettor loses (i.e., atξt<0a_{t}\xi_{t}<0), then the net loss is atξt-a_{t}\xi_{t}, and no commission is applied. The cumulative wealth of the bettor at episode t+1t+1, st+1s_{t+1}, can then be defined as st+1=g(st,at,ξt):=st+(1τ𝟏{ξt>0})atξts_{t+1}=g(s_{t},a_{t},\xi_{t}):=s_{t}+(1-\tau\boldsymbol{1}_{\{\xi_{t}>0\}})a_{t}\xi_{t}.

The above spread betting problem can then be formulated as the following BCR-SOC/MDP:

minπ\displaystyle\min_{\pi}\ ρμ1ρPθ1[(1τ𝟏{ξ1>0})a1ξ1++ρμT1ρPθT1[(1τ𝟏{ξT1>0})aT1ξT1]]\displaystyle\rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[-(1-\tau\mathbf{1}_{\{\xi_{1}>0\}})a_{1}\xi_{1}+\cdots+\rho_{\mu_{T-1}}\circ\rho_{P_{\theta_{T-1}}}\left[-(1-\tau\mathbf{1}_{\{\xi_{T-1}>0\}})a_{T-1}\xi_{T-1}\right]\right]
s.t. \displaystyle s_{t+1}=s_{t}+(1-\tau\mathbf{1}_{\{\xi_{t}>0\}})a_{t}\xi_{t},\ a_{t}=\pi_{t}(s_{t},\mu_{t}),\ 0\leq a_{t}\leq s_{t},\ \mu_{t+1}(\theta)=\frac{p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)}{\int_{\Theta}p\left(\xi_{t}|\theta\right)\mu_{t}(\theta)d\theta},\ t=1,\cdots,T-1.

The well-definedness of this problem can be verified using the expressions of loss function, state transition function as well as the boundedness of ξt\xi_{t}. In the numerical tests, the initial wealth is set with s1=80s_{1}=80. The true parameter is defined as θc=0.6\theta^{c}=0.6, with the support set of the parameter being Θ={0.1,0.3,0.4,0.6,0.7,0.9}\Theta=\{0.1,0.3,0.4,0.6,0.7,0.9\}. The bettor engages in a series of T=8T=8 episodes, selecting the betting size ata_{t} from the action space 𝒜=[0,10]\mathcal{A}=[0,10] at each episode. We set α=0.6\alpha=0.6 and β=0.8\beta=0.8 for the outer and inner risk parameters, respectively. The commission rate τ=0.05\tau=0.05. The dataset for this problem comprises historical market movement records of size NN. To reduce the effect of random sampling, we repeat each test with 100 simulations, each of which employs an independent dataset. Table 1 displays the average CPU time (in seconds) required to determine the optimal betting policy, as well as the mean and variance of the actual performance of the optimal policy across the 100 simulations with different sample sizes N=5,10,50,100N=5,10,50,100. Figure 2 visualizes the histogram of actual performances for the BCR-SOC/MDP with AVaRμα\text{AVaR}_{\mu}^{\alpha}-AVaRPθβ\text{AVaR}_{P_{\theta}}^{\beta}, incorporating different combinations of risk levels: α=0.01,0.5,0.9\alpha=0.01,0.5,0.9 and β=0.01,0.5,0.9\beta=0.01,0.5,0.9.

Table 1: CPU time, mean and variance of actual performances of different approaches for the spread betting problem.

| Approach | time(s) | N=5 mean | N=5 var. | N=10 mean | N=10 var. | N=50 mean | N=50 var. | N=100 mean | N=100 var. |
| Bayes-adaptive MDP (\mathbb{E}_{\mu}-\mathbb{E}_{P_{\theta}}) | 165.11 | -9.94 | 5.79 | -10.48 | 4.37 | -11.51 | 2.59 | -13.61 | 1.32 |
| Standard SOC/MDP | 1.08 | -7.85 | 7.63 | -8.07 | 6.79 | -10.78 | 4.29 | -13.07 | 2.66 |
| BCR-SOC/MDP (\text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}}) | 172.63 | -11.12 | 4.17 | -11.66 | 3.02 | -12.44 | 1.05 | -13.92 | 0.25 |
| RA-SOC/MDP | 1.15 | -7.29 | 6.33 | -7.76 | 5.97 | -10.02 | 3.57 | -12.36 | 2.48 |
| BCR-SOC/MDP (\text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta}) | 224.37 | -10.09 | 4.95 | -10.63 | 3.74 | -11.79 | 1.80 | -13.78 | 1.07 |
| DR-SOC/MDP | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Figure 2: Histogram of performance frequencies over 100 replications for the AVaRμα\text{AVaR}_{\mu}^{\alpha}-AVaRPθβ\text{AVaR}^{\beta}_{P_{\theta}} BCR-SOC/MDP with different α\alpha and β\beta.

The observations presented in Table 1 and Figure 2 lead to the following conclusions.

  • Compared to the standard SOC/MDP and RA-SOC/MDP, the BCR-SOC/MDP model exhibits better robustness in terms of the mean and variance across different BCR measures (e.g., \mathbb{E}_{\mu}-\mathbb{E}_{P_{\theta}}, \text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} and \text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta}). Unlike the methods that depend solely on the point estimate \hat{\theta}, our approach utilizes the distribution μ to effectively capture the uncertainty in θ, significantly enhancing the stability of the optimal policy for both small and large sample sizes. Moreover, the DR-SOC/MDP tends to be overly conservative, as it always considers the worst-case scenario from the set of all possible market uncertainties, resulting in zero variance in actual performance. In particular, even when the probability of winning at \theta^{c} is high, the DM under the DR-SOC/MDP model refrains from betting because of its conservative nature.

  • As the sample size increases, the uncertainty about the model parameter diminishes. Consequently, the performance of all SOC/MDP models improves consistently, with variance decreasing monotonically. This tendency is underpinned by the convergence of both the posterior distribution μ and the MLE \hat{\theta} to the true parameter \theta^{c} as the sample size increases. Eventually, every model converges to the SOC/MDP model with the true parameter \theta^{c} in terms of optimal values and optimal policies.

  • In Bayes-adaptive MDP models (e.g., [38, 55]), the optimal value function and policy are obtained over the augmented pair of the original physical state and the posterior distribution. Utilizing dynamic programming techniques, we discretize the continuous state, resulting in an exceedingly large augmented state space. As a result of this brute-force approach, the computational time for the BCR-SOC/MDP model is considerably higher than that of conventional SOC/MDP formulations, which involve only a single discrete physical state space \mathcal{S}.

  • In Figure 2, the risk levels \alpha and \beta reflect the bettor's risk preferences towards the epistemic and aleatoric uncertainties, respectively. With lower values of \alpha and \beta, the DM tends to adopt a more conservative policy, typically choosing not to bet. Consequently, this conservative stance shifts the distribution of actual performance toward the right. However, the risk preferences characterized by \alpha and \beta diverge in specific aspects. A lower \alpha indicates the bettor's increased concern over losses in extreme environments, signifying a preference for avoiding such outcomes. As \alpha decreases, the optimal decisions under the BCR-SOC/MDP model closely resemble those of the DR-SOC/MDP approach, resulting in more conservative choices. Conversely, a lower \beta directly signifies the bettor's aversion to losses, demonstrating prudence and a preference for minimal potential losses. Thus, as \beta decreases, the bettor focuses more on the losses with respect to the worst-case \xi irrespective of the environmental parameter \theta, steering the BCR-SOC/MDP model towards the robust SOC/MDP model. This invariably leads to more cautious decisions. Moreover, compared to the risk level \alpha, \beta exerts a stronger influence on the conservativeness of the outcomes. This is because, while the outer conservatism ensures that losses retain a degree of randomness even in the worst environment, the inner conservatism compels the bettor to directly consider the most adverse scenario.
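The nested evaluation underlying these observations can be sketched numerically. Below is a minimal Monte Carlo sketch in Python of a \text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} composite risk evaluation: for each posterior draw of \theta we estimate the inner expected loss, then take the empirical VaR of those inner values over \theta. The helper names (`var`, `avar`, `bcr_var_expectation`) are illustrative, and we adopt the convention, consistent with the discussion above, that a smaller risk level focuses on more extreme losses.

```python
import numpy as np

def var(losses, alpha):
    """Empirical VaR at level alpha: the (1 - alpha)-quantile of the loss
    distribution (smaller alpha -> attention to more extreme losses)."""
    return float(np.quantile(losses, 1.0 - alpha))

def avar(losses, alpha):
    """Empirical AVaR/CVaR: average of the worst alpha-fraction of losses."""
    losses = np.asarray(losses, dtype=float)
    v = var(losses, alpha)
    return float(losses[losses >= v].mean())

def bcr_var_expectation(loss_fn, theta_samples, xi_sampler, alpha,
                        n_inner=4000, rng=None):
    """VaR_mu^alpha composed with E_{P_theta}: estimate the inner expected
    loss for each posterior draw of theta, then take the empirical VaR of
    those inner values over theta."""
    if rng is None:
        rng = np.random.default_rng(0)
    inner = np.array([float(loss_fn(xi_sampler(th, n_inner, rng)).mean())
                      for th in theta_samples])
    return var(inner, alpha)
```

Replacing `var` by `avar` in the last line gives the \text{AVaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} variant; the inner expectation can likewise be replaced by an empirical AVaR to obtain the \text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta} measure.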

7.2 Infinite horizon inventory control problem

Having demonstrated the reasonableness and efficiency of the proposed BCR-SOC/MDP model compared with typical SOC/MDP models in the literature, in this subsection we apply the proposed infinite horizon BCR-SOC/MDP model to an inventory control problem with discount factor \gamma=0.8.

Inventory models have numerous real-world applications, including healthcare systems, electric energy storage systems, and more (see [58, 75] and the references therein). In inventory management, the problem typically involves deciding the order quantity required in each episode to fulfill customer demand while minimizing costs. The goal of the inventory manager is to manage the trade-offs between ordering costs, holding costs, and penalty costs associated with stockouts, while dynamically adjusting orders based on inventory levels and the uncertainty of future demand. A typical risk-averse inventory control problem aims to minimize long-term costs by dynamically adjusting order quantities and updating beliefs about future demand.

To model this dynamic decision-making process using the BCR-SOC/MDP framework, we define the following parameters and assumptions:

  • (i)

    Customer demand \xi_{t}. The random variable \xi_{t} represents the demand for goods at episode t. We assume that the demand \xi_{t} follows a Poisson distribution with an unknown parameter \theta^{c} [61].

  • (ii)

    Inventory level s_{t}. The inventory level at episode t, denoted by s_{t}, represents the quantity of goods available in stock at that episode. Inventory is updated at each episode based on the decision made by the inventory manager (the order quantity) and the actual demand observed during that episode.

  • (iii)

    Belief \mu_{t}(\theta). The inventory manager's belief about the demand distribution at episode t, denoted by \mu_{t}(\theta), represents the manager's knowledge of the likelihood of various demand amounts. A typical prior \mu_{1} for \theta here is a Gamma prior, since the resulting posterior \mu_{t} is then also a Gamma distribution (see also [21]).

  • (iv)

    Order quantity a_{t}. The order quantity at episode t, denoted by a_{t}, represents the amount of goods the inventory manager decides to order. This decision a_{t} is determined by the manager's policy \pi_{t}(s_{t},\mu_{t}), which depends on the current inventory level s_{t} and the manager's belief \mu_{t}(\theta) about the demand distribution.

  • (v)

    Cost function \mathcal{C} and state transition function g. The cost \mathcal{C} at episode t is defined as

    \mathcal{C}(s_{t},a_{t},\xi_{t})=oa_{t}+c(s_{t}+a_{t},\xi_{t}),

    where

    c(y,d):=p\max\{d-y,0\}+h\max\{y-d,0\},

    o, p and h are the unit order cost, the unit penalty cost for an unfulfilled unit of demand, and the unit holding cost for excess inventory, respectively. The inventory level s_{t+1} is then given by s_{t+1}=g(s_{t},a_{t},\xi_{t}):=s_{t}+a_{t}-\xi_{t}.
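As a concrete illustration, the stage cost and transition above translate directly into code. The following Python sketch uses the parameter values o=2, p=6, h=4 adopted in the experiments below; a negative inventory level represents backlogged demand.

```python
def stage_cost(s, a, xi, o=2.0, p=6.0, h=4.0):
    """C(s, a, xi) = o*a + c(s + a, xi), with
    c(y, d) = p*max(d - y, 0) + h*max(y - d, 0)."""
    y = s + a  # post-order inventory position
    return o * a + p * max(xi - y, 0.0) + h * max(y - xi, 0.0)

def next_state(s, a, xi):
    """Transition g(s, a, xi) = s + a - xi."""
    return s + a - xi
```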

Under the BCR-SOC/MDP framework, the infinite horizon inventory control problem can be formulated as:

\min_{\pi}\quad \rho_{\mu_{1}}\circ\rho_{P_{\theta_{1}}}\left[oa_{1}+c(s_{1}+a_{1},\xi_{1})+\cdots+\gamma\rho_{\mu_{t}}\circ\rho_{P_{\theta_{t}}}\left[oa_{t}+c(s_{t}+a_{t},\xi_{t})+\cdots\right]\right] (7.2a)
\text{s.t.}\quad s_{t+1}=s_{t}+a_{t}-\xi_{t},\ a_{t}=\pi(s_{t},\mu_{t}),\ a_{t}\geq 0,\ \mu_{t+1}(\theta)=\frac{p(\xi_{t}|\theta)\mu_{t}(\theta)}{\int_{\Theta}p(\xi_{t}|\theta)\mu_{t}(\theta)\,d\theta},\ t=1,2,\cdots. (7.2b)
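When \Theta is a finite set, as in the experiment below, the posterior recursion in (7.2b) reduces to a weighted renormalization over the candidate parameters. A minimal Python sketch with the Poisson demand likelihood (the function names are illustrative):

```python
import math
import numpy as np

def poisson_pmf(k, theta):
    # p(xi = k | theta) for the Poisson demand model
    return math.exp(-theta + k * math.log(theta) - math.lgamma(k + 1))

def bayes_update(mu, thetas, xi):
    """One step of mu_{t+1}(theta) proportional to p(xi_t | theta) mu_t(theta)
    over a finite support Theta."""
    lik = np.array([poisson_pmf(xi, th) for th in thetas])
    post = lik * np.asarray(mu, dtype=float)
    return post / post.sum()
```

Repeated updates with demands drawn from P_{\theta^{c}} concentrate \mu_{t} on \theta^{c}, which is the mechanism behind the adaptive behaviour observed in the experiments.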

The classical stationary inventory control problem corresponding to the true distribution PθcP_{\theta^{c}} can be described in BCR-SOC/MDP form as follows:

\min_{\pi}\quad \rho_{P_{\theta^{c}}}\left[oa_{1}+c(s_{1}+a_{1},\xi_{1})+\cdots+\gamma\rho_{P_{\theta^{c}}}\left[oa_{t}+c(s_{t}+a_{t},\xi_{t})+\cdots\right]\right] (7.3)
\text{s.t.}\quad s_{t+1}=s_{t}+a_{t}-\xi_{t},\ a_{t}=\pi(s_{t},\delta_{\theta^{c}}),\ a_{t}\geq 0,\ t=1,2,\cdots.

As demonstrated in [63], the optimal policy for problem (7.3) is the base-stock policy \pi^{*}(s,\delta_{\theta^{c}})=\max\{s^{*}-s,0\}, where s^{*}=H^{-1}(\kappa), H(\cdot) is the cdf of the random demand, and \kappa:=\frac{p-(1-\gamma)o}{p+h}. Thus, for s\leq s^{*}, the optimal value of (7.3) is given by:

V^{*}(s,\delta_{\theta^{c}})=-os+os^{*}+(1-\gamma)^{-1}\rho_{P_{\theta^{c}}}[\gamma o\xi+c(s^{*},\xi)]. (7.4)

With this analytical expression, we can compute the true optimal value functions with \rho_{P_{\theta^{c}}}=\mathbb{E}_{P_{\theta^{c}}} and \rho_{P_{\theta^{c}}}=\text{AVaR}_{P_{\theta^{c}}}^{\beta}, respectively, as benchmarks for the standard SOC/MDP and the risk-averse SOC/MDP, and use them to demonstrate the convergence of the value functions of the BCR-SOC/MDP models corresponding to \text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} and \text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta}.
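For the risk-neutral case \rho_{P_{\theta^{c}}}=\mathbb{E}_{P_{\theta^{c}}}, the base-stock level and the benchmark value (7.4) can be computed directly. A Python sketch under the Poisson demand assumption, truncating the expectation at a large demand value (the cost parameters are those of the experiment):

```python
import math

def pois_pmf(k, theta):
    return math.exp(-theta + k * math.log(theta) - math.lgamma(k + 1))

def base_stock(theta, o=2.0, p=6.0, h=4.0, gamma=0.8):
    """s* = H^{-1}(kappa), kappa = (p - (1-gamma)*o)/(p + h), where H is
    the Poisson(theta) cdf; returns the smallest k with H(k) >= kappa."""
    kappa = (p - (1.0 - gamma) * o) / (p + h)
    cdf, k = 0.0, 0
    while True:
        cdf += pois_pmf(k, theta)
        if cdf >= kappa:
            return k
        k += 1

def true_value(s, theta, o=2.0, p=6.0, h=4.0, gamma=0.8, kmax=500):
    """Benchmark (7.4) with rho = expectation:
    V*(s) = -o*s + o*s* + (1-gamma)^{-1} E[gamma*o*xi + c(s*, xi)]."""
    ss = base_stock(theta, o, p, h, gamma)
    ev = sum(pois_pmf(k, theta) *
             (gamma * o * k + p * max(k - ss, 0) + h * max(ss - k, 0))
             for k in range(kmax))
    return -o * s + o * ss + ev / (1.0 - gamma)
```

The optimal order for s\leq s^{*} is then the base-stock rule `max(base_stock(theta) - s, 0)`; note from (7.4) that V^{*} decreases linearly in s with slope -o on this range.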

In this inventory control problem, we set \Theta=\{6,8,9,10,11,12,14\}. The unit order cost is o=2, the unit holding cost is h=4, and the penalty for unmet demand is p=6. For the BCR-SOC/MDP model, we set the outer risk parameter \alpha=0.6 and the inner risk parameter \beta=0.4. To facilitate comparison with other MDP models, we examine the performance of the model for different true parameter values \theta^{c}. Figure 3 depicts the performance when the learned policy is deployed in various environments where the demand follows different Poisson distributions. To illustrate the convergence of the optimal value function, we use Algorithm 2 to find the unique optimal value function V^{*} of the Bellman equation corresponding to (7.2a). Then, we generate sample sequences \bm{\xi}^{t} across different episodes t, compute \mu_{t} and obtain the corresponding optimal value V^{*}(s,\mu_{t}). The procedure is replicated 100 times to ensure robustness, and the results are shown in Figures 4 and 5. Specifically, we plot boxplots of the sup-norm of the gap between the estimate V^{*}(\cdot,\mu_{t}) and the true optimal value V^{*}(\cdot,\delta_{\theta^{c}}), calculated as \sup_{s\in\mathcal{S}}|V^{*}(s,\mu_{t})-V^{*}(s,\delta_{\theta^{c}})|. Moreover, we use line charts to show the average and standard deviation of this sup-norm gap across different episodes, providing insights into the robustness and variability of the estimation process. We can make the following observations from Figures 3, 4 and 5.
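The shape of this sup-norm diagnostic (posterior update, belief-mixture value, sup-norm gap against the true-parameter value) can be sketched compactly. Since reproducing Algorithm 2 here would be lengthy, the sketch below substitutes a simple surrogate v(s,\theta)=-2s+\theta^{2} for the Bellman fixed point and feeds a deterministic demand stream at the true mean for reproducibility; both choices are illustrative assumptions, not the paper's procedure.

```python
import math
import numpy as np

thetas = np.array([6.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0])
theta_c = 10.0
states = np.arange(0, 21, dtype=float)

def pois_loglik(k, th):
    return -th + k * math.log(th) - math.lgamma(k + 1)

def v(s, th):
    # Surrogate for V*(s, delta_theta); any function separating the
    # candidate parameters illustrates the diagnostic.
    return -2.0 * s + th ** 2

def sup_gap(mu):
    # sup over s of |V*(s, mu_t) - V*(s, delta_{theta_c})|, with the
    # belief value taken as the mu-mixture of the surrogate values
    v_mix = np.array([float(np.dot(mu, v(s, thetas))) for s in states])
    v_true = np.array([v(s, theta_c) for s in states])
    return float(np.abs(v_mix - v_true).max())

mu = np.full(len(thetas), 1.0 / len(thetas))
gaps = []
for xi in [10] * 50:  # demand stream fixed at the true mean
    loglik = np.array([pois_loglik(xi, th) for th in thetas])
    w = mu * np.exp(loglik - loglik.max())
    mu = w / w.sum()
    gaps.append(sup_gap(mu))
# gaps shrink as the posterior concentrates on theta_c
```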

Figure 3: Actual performance of different approaches with different \theta^{c}.
Figure 4: Convergence of the optimal value function of the \text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} BCR-SOC/MDP model.
Figure 5: Convergence of the optimal value function of the \text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta} BCR-SOC/MDP model.
  • The BCR-SOC/MDP model outperforms the standard SOC/MDP and RA-SOC/MDP models across diverse environments and various BCR measures, such as \mathbb{E}_{\mu}-\mathbb{E}_{P_{\theta}}, \text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} and \text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta}. Moreover, when compared to the risk-neutral Bayes-adaptive SOC/MDP model, BCR-SOC/MDP models incorporating risk-averse preferences achieve a lower optimal cost value in more adversarial scenarios (e.g., when the Poisson parameter is less than 10), emphasizing their robustness and adaptability to a wide range of challenging environments.

  • The mean, median and range of the sup-norm all decrease to 0 as t increases, which means that the optimal value functions of all \text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} and \text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta} BCR-SOC/MDP models display uniform convergence, as envisaged by the theoretical results. From the boxplots in Figures 4 and 5, we observe that the convergence rate of the optimal value functions in the \text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta} case is slower than that of the \text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} BCR-SOC/MDP model, which is consistent with the conclusion of Theorem 3. Moreover, the line charts in Figures 4 and 5 show that, compared to the \text{VaR}_{\mu}^{\alpha}-\mathbb{E}_{P_{\theta}} BCR-SOC/MDP model, the \text{AVaR}_{\mu}^{\alpha}-\text{AVaR}_{P_{\theta}}^{\beta} BCR-SOC/MDP model exhibits smaller variability in the sup-norm gap, reflecting the more robust characteristics of the latter risk-averse attitude with respect to the value function.

  • As the episode t increases, the number of observations in the sample sequence \bm{\xi}^{t} grows, and the optimal value function at state s converges exponentially to the optimal value of the true model as \mu_{t} converges. This manifests the significance and characteristics of Bayes-adaptive MDPs. Specifically, the epistemic uncertainty gradually diminishes as the DM continuously observes/learns about the environment. Consequently, the DM's understanding of the environment becomes more precise, driving the gap between the optimal values to zero; likewise, the variability of the gap reduces to zero. By contrast, for a standard SOC/MDP or risk-averse SOC/MDP, once the model is established, the optimal value function remains fixed and does not change over time. Specifically, once an MLE estimator \hat{\theta} is derived for the standard or risk-averse SOC/MDP, the corresponding optimal value function takes a form similar to (7.4), with \theta^{c} replaced by \hat{\theta}. Consequently, for any episode t, the optimal value functions of the estimated SOC/MDP models remain fixed, which explains why these models do not adaptively learn in response to dynamic changes of the environment.

8 Concluding remarks

In this paper, we introduce a BCR-SOC/MDP model which subsumes a range of existing SOC/MDP models and relates to preference robust SOC/MDP models. In comparison with [63], the new model exhibits adaptive learning capabilities and ensures robustness against the inherent parameter randomness. In the finite-horizon case, we derive some important properties of the BCR-SOC/MDP model. In the infinite-horizon case, we derive some basic properties and demonstrate the model's adaptability by showing that, as the number of episodes increases, the optimal value functions and optimal policies almost surely converge to those of the true underlying problem. We also propose SAA-type algorithms tailored for solving the BCR-SOC/MDP model under specific composite risk measures and examine the robustness and adaptability of the model through numerical experiments. The theoretical analyses and numerical results show that monetary risk measures, in particular value at risk (VaR), robust spectral risk measures (SRM) and average VaR, also exhibit convergence properties, prompting an examination of how to relax the conditions to encompass a broader range of risk measures.

There are a few aspects of this work that deserve more in-depth exploration. First, the relationship between the BCR-SOC/MDP and preference-robust SOC/MDP models. We have indicated in Section 3.2 that a relationship between the two models may be established. It might be interesting to investigate the fine details of the relationship between distributional robustness and preference robustness in the dynamic setting, given that both are fundamentally related to the DM's dynamic risk preferences. Moreover, the BCR-SOC/MDP framework may potentially provide a more convenient avenue for solving multistage randomized preference-robust optimization problems [80]. Second, because of the complex structure of the Bayesian composite risk measure and of the posterior distributions in the augmented state space, this paper does not explore SDDP-type methods for solving the problem as in [62]. It might be worth developing SDDP methods for the proposed model, at least in some specific cases, given the known effectiveness of the SDDP approach. Third, in practical applications of the BCR-SOC/MDP model, particularly in data-driven problems, samples are often drawn from empirical data which might contain noise. In that case, it is relevant to investigate whether the proposed models are stable against perturbations in the uncertainty data in terms of both optimal values and optimal policies. Stability results for conventional SOC/MDP models have been derived by Kern et al. [35], and it would be valuable to explore whether similar results can be established for the BCR-SOC/MDP model. Finally, the numerical tests were carried out on two conventional SOC/MDP problems. It would be much more interesting to find new applications, particularly ones related to online decision making. We leave all these avenues for future research.

Declarations

  • This project is supported by the National Key R&D Program of China (2022YFA1004000, 2022YFA1004001), National Natural Science Foundation of China (11991023) and CUHK start-up grant.

Acknowledgments

We would like to thank Alex Shapiro and Zhiyao Yang for a number of valuable discussions during preparation of this work.

References

  • [1] C. Acerbi, Spectral measures of risk: A coherent representation of subjective risk aversion, Journal of Banking & Finance, 26 (2002), pp. 1505–1518.
  • [2] M. Ahmadi, U. Rosolia, M. D. Ingham, R. M. Murray, and A. D. Ames, Constrained risk-averse Markov decision processes, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, 2021, pp. 11718–11725.
  • [3] N. Bäuerle and J. Ott, Markov decision processes with average-value-at-risk criteria, Mathematical Methods of Operations Research, 74 (2011), pp. 361–379.
  • [4] R. E. Bellman and S. E. Dreyfus, Applied Dynamic Programming, vol. 2050, Princeton University Press, 2015.
  • [5] D. Bertsekas, Dynamic Programming and Optimal Control: Volume I, Athena scientific, 2012.
  • [6] J. R. Birge, Y. Feng, N. B. Keskin, and A. Schultz, Dynamic learning and market making in spread betting markets with informed bettors, Operations Research, 69 (2021), pp. 1746–1766.
  • [7] S. Carpin, Y.-L. Chow, and M. Pavone, Risk aversion in finite Markov decision processes using total cost criteria and average value at risk, in 2016 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2016, pp. 335–342.
  • [8] Z. Chen and W. Ma, A Bayesian approach to data-driven multi-stage stochastic optimization, Journal of Global Optimization, 90 (2024), pp. 401–428.
  • [9] Y. Chow, A. Tamar, S. Mannor, and M. Pavone, Risk-sensitive and robust decision-making: a CVaR optimization approach, Advances in Neural Information Processing Systems, 28 (2015).
  • [10] M. Claus, Advancing Stability Analysis of Mean-risk Stochastic programs: Bilevel and Two-stage Models, PhD thesis, Dissertation, Duisburg, Essen, Universität Duisburg-Essen, 2016.
  • [11] R. Dearden, N. Friedman, and D. Andre, Model-based Bayesian exploration, arXiv preprint arXiv:1301.6690, (2013).
  • [12] E. Delage and D. A. Iancu, Robust multistage decision making, in The operations research revolution, INFORMS, 2015, pp. 20–46.
  • [13] E. Delage and S. Mannor, Percentile optimization for Markov decision processes with parameter uncertainty, Operations Research, 58 (2010), pp. 203–213.
  • [14] E. Delage and Y. Ye, Distributionally robust optimization under moment uncertainty with application to data-driven problems, Operations Research, 58 (2010), pp. 595–612.
  • [15] P. Dommel and A. Pichler, Foundations of Multistage Stochastic Programming, arXiv preprint arXiv:2102.07464, (2021).
  • [16] M. O. Duff, Optimal Learning: Computational procedures for Bayes-adaptive Markov Decision Processes, University of Massachusetts Amherst, 2002.
  • [17] J. Dupačová and K. Sladkỳ, Comparison of multistage stochastic programs with recourse and stochastic dynamic programs with discrete time, ZAMM-Journal of Applied Mathematics and Mechanics, 82 (2002), pp. 753–765.
  • [18] E. A. Feinberg and A. Shwartz, Handbook of Markov Decision Processes: Methods and Applications, vol. 40, Springer Science & Business Media, 2012.
  • [19] J. A. Filar, D. Krass, and K. W. Ross, Percentile performance criteria for limiting average Markov decision processes, IEEE Transactions on Automatic Control, 40 (1995), pp. 2–10.
  • [20] R. Gao and A. Kleywegt, Distributionally robust stochastic optimization with Wasserstein distance, Mathematics of Operations Research, 48 (2023), pp. 603–655.
  • [21] A. Gelman, J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin, Bayesian Data Analysis, Third Edition, Chapman & Hall/CRC Texts in Statistical Science, Taylor & Francis, 2013.
  • [22] V. Guigues, A. Shapiro, and Y. Cheng, Risk-averse stochastic optimal control: an efficiently computable statistical upper bound, Operations Research Letters, 51 (2023), pp. 393–400.
  • [23] S. Guo and H. Xu, Robust spectral risk optimization when the subjective risk aversion is ambiguous: a moment-type approach, Mathematical Programming, 194 (2022), pp. 305–340.
  • [24] V. Gupta, Near-optimal bayesian ambiguity sets for distributionally robust optimization, Management Science, 65 (2019), pp. 4242–4260.
  • [25] H. Gzyl and S. Mayoral, On a relationship between distorted and spectral risk measures, (2008).
  • [26] J. L. Hau, M. Petrik, and M. Ghavamzadeh, Entropic risk optimization in discounted MDPs, in International Conference on Artificial Intelligence and Statistics, PMLR, 2023, pp. 47–76.
  • [27] L. Hong and R. Martin, A review of Bayesian asymptotics in general insurance applications, European Actuarial Journal, 7 (2017), pp. 231–255.
  • [28] R. A. Howard, Dynamic Programming and Markov Processes, John Wiley, 1960.
  • [29] R. A. Howard and J. E. Matheson, Risk-sensitive Markov decision processes, Management Science, 18 (1972), pp. 356–369.
  • [30] D. A. Iancu, M. Petrik, and D. Subramanian, Tight approximations of dynamic risk measures, Mathematics of Operations Research, 40 (2015), pp. 655–682.
  • [31] G. N. Iyengar, Robust dynamic programming, Mathematics of Operations Research, 30 (2005), pp. 257–280.
  • [32] D. Jalota and Y. Ye, Stochastic online fisher markets: Static pricing limits and adaptive enhancements, Operations Research, (2024).
  • [33] D. R. Jiang and W. B. Powell, Risk-averse approximate dynamic programming with quantile-based risk measures, Mathematics of Operations Research, 43 (2018), pp. 554–579.
  • [34] L. V. Kantorovich and S. Rubinshtein, On a space of totally additive functions, Vestnik of the St. Petersburg University: Mathematics, 13 (1958), pp. 52–59.
  • [35] P. Kern, A. Simroth, and H. Zähle, First-order sensitivity of the optimal value in a Markov decision model with respect to deviations in the transition probability function, Mathematical Methods of Operations Research, 92 (2020), pp. 165–197.
  • [36] S. Kusuoka, On law invariant coherent risk measures, Advances in mathematical economics, (2001), pp. 83–95.
  • [37] M. Li, X. Tong, and H. Xu, Randomization of spectral risk measure and distributional robustness, arXiv preprint arXiv:2212.08871, (2022).
  • [38] Y. Lin, Y. Ren, and E. Zhou, Bayesian risk Markov decision processes, Advances in Neural Information Processing Systems, 35 (2022), pp. 17430–17442.
  • [39] J. Luedtke and S. Ahmed, A sample approximation approach for optimization with probabilistic constraints, SIAM Journal on Optimization, 19 (2008), pp. 674–699.
  • [40] W. Ma and Z. Chen, Multi-stage distributionally robust convex stochastic optimization with Bayesian-type ambiguity sets, Mathematical Methods of Operations Research, 100 (2024), pp. 553–600.
  • [41] S. Mannor, D. Simester, P. Sun, and J. N. Tsitsiklis, Bias and variance in value function estimation, in Proceedings of the twenty-first International Conference on Machine learning, 2004, p. 72.
  • [42] S. Mannor and J. Tsitsiklis, Mean-variance optimization in Markov decision processes, arXiv preprint arXiv:1104.5601, (2011).
  • [43] S. Mannor and H. Xu, Data-driven methods for Markov decision problems with parameter uncertainty, in Operations Research & Management Science in the Age of Analytics, INFORMS, 2019, pp. 101–129.
  • [44] A. Nilim and L. El Ghaoui, Robust control of Markov decision processes with uncertain transition matrices, Operations Research, 53 (2005), pp. 780–798.
  • [45] I. Osband, D. Russo, and B. Van Roy, (More) efficient reinforcement learning via posterior sampling, Advances in Neural Information Processing Systems, 26 (2013).
  • [46] T. Osogami and T. Morimura, Time-consistency of optimization problems, in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 26, 2012, pp. 1945–1953.
  • [47] M. Petrik and D. Subramanian, An approximate solution method for large risk-averse Markov decision processes, arXiv preprint arXiv:1210.4901, (2012).
  • [48] G. C. Pflug and A. Pichler, Multistage Stochastic Optimization, vol. 1104, Springer, 2014.
  • [49] A. B. Philpott, V. L. de Matos, and L. Kapelevich, Distributionally robust SDDP, Computational Management Science, 15 (2018), pp. 431–454.
  • [50] A. Pichler and R. Schlotter, Risk-averse optimal control in continuous time by nesting risk measures, Mathematics of Operations Research, 48 (2023), pp. 1657–1678.
  • [51] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 2014.
  • [52] P. Qian, Z. Wang, and Z. Wen, A composite risk measure framework for decision making under uncertainty, Journal of the Operations Research Society of China, 7 (2019), pp. 43–68.
  • [53] M. Rigter, B. Lacerda, and N. Hawes, Risk-averse Bayes-adaptive reinforcement learning, Advances in Neural Information Processing Systems, 34 (2021), pp. 1142–1154.
  • [54] R. T. Rockafellar, S. Uryasev, et al., Optimization of conditional value-at-risk, Journal of risk, 2 (2000), pp. 21–42.
  • [55] S. Ross, J. Pineau, B. Chaib-draa, and P. Kreitmann, A Bayesian approach for learning and planning in partially observable Markov decision processes., Journal of Machine Learning Research, 12 (2011).
  • [56] H. Ruan, Z. Chen, and C. P. Ho, Risk-averse MDPs under reward ambiguity, arXiv preprint arXiv:2301.01045, (2023).
  • [57] A. Ruszczyński, Risk-averse dynamic programming for Markov decision processes, Mathematical Programming, 125 (2010), pp. 235–261.
  • [58] E. Saha and P. K. Ray, Modelling and analysis of inventory management systems in healthcare: A review and reflections, Computers & Industrial Engineering, 137 (2019), p. 106051.
  • [59] A. Shapiro, Tutorial on risk neutral, distributionally robust and risk averse multistage stochastic programming, European Journal of Operational Research, 288 (2021), pp. 1–13.
  • [60] A. Shapiro, Distributionally robust modeling of optimal control, Operations Research Letters, 50 (2022), pp. 561–567.
  • [61] A. Shapiro, D. Dentcheva, and A. Ruszczynski, Lectures on Stochastic Programming: Modeling and Theory, SIAM, 2021.
  • [62] A. Shapiro, E. Zhou, and Y. Lin, Bayesian distributionally robust optimization, SIAM Journal on Optimization, 33 (2023), pp. 1279–1304.
  • [63] A. Shapiro, E. Zhou, Y. Lin, and Y. Wang, Episodic Bayesian optimal control with unknown randomness distributions, arXiv preprint arXiv:2308.08478, (2023).
  • [64] A. Sharma, J. Harrison, M. Tsao, and M. Pavone, Robust and adaptive planning under model uncertainty, in Proceedings of the international conference on automated planning and scheduling, vol. 29, 2019, pp. 410–418.
  • [65] M. Sniedovich, A variance-constrained reservoir control problem, Water Resources Research, 16 (1980), pp. 271–274.
  • [66] H. Strasser, Mathematical Theory of Statistics: Statistical Experiments and Asymptotic Decision Theory, vol. 7, Walter de Gruyter, 1985.
  • [67] M. Strens, A Bayesian framework for reinforcement learning, in ICML, 2000, pp. 943–950.
  • [68] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor, Sequential decision making with coherent risk, IEEE Transactions on Automatic Control, 62 (2016), pp. 3323–3338.
  • [69] H. Wang, L. He, R. Gao, and F. Calmon, Aleatoric and epistemic discrimination: Fundamental limits of fairness interventions, Advances in Neural Information Processing Systems, 36 (2024).
  • [70] W. Wang and S. Ahmed, Sample average approximation of expected value constrained stochastic programs, Operations Research Letters, 36 (2008), pp. 515–519.
  • [71] W. Wang and H. Xu, Robust spectral risk optimization when information on risk spectrum is incomplete, SIAM Journal on Optimization, 30 (2020), pp. 3198–3229.
  • [72] W. Wang, H. Xu, and T. Ma, Quantitative statistical robustness for tail-dependent law invariant risk measures, Quantitative Finance, 21 (2021), pp. 1669–1685.
  • [73] Y. Wang and E. Zhou, Bayesian risk-averse Q-Learning with streaming observations, arXiv preprint arXiv:2305.11300, (2023).
  • [74] S. Weber, Distribution-invariant risk measures, information, and dynamic consistency, Mathematical Finance: An International Journal of Mathematics, Statistics and Financial Economics, 16 (2006), pp. 419–441.
  • [75] T. Weitzel and C. H. Glock, Energy management for stationary electric energy storage systems: A systematic literature review, European Journal of Operational Research, 264 (2018), pp. 582–606.
  • [76] W. Wiesemann, D. Kuhn, and B. Rustem, Robust Markov decision processes, Mathematics of Operations Research, 38 (2013), pp. 153–183.
  • [77] W. Wiesemann, D. Kuhn, and M. Sim, Distributionally robust convex optimization, Operations Research, 62 (2014), pp. 1358–1376.
  • [78] D. Wozabal, Robustifying convex risk measures for linear portfolios: A nonparametric approach, Operations Research, 62 (2014), pp. 1302–1315.
  • [79] D. Wu, H. Zhu, and E. Zhou, A Bayesian risk approach to data-driven stochastic optimization: Formulations and asymptotics, SIAM Journal on Optimization, 28 (2018), pp. 1588–1612.
  • [80] Q. Wu, H. Xu, and H. Zheng, Multistage Robust Average Randomized Spectral Risk Optimization, arXiv preprint arXiv:2409.00892, (2024).
  • [81] H. Xu, Y. Liu, and H. Sun, Distributionally robust optimization with matrix moment constraints: Lagrange duality and cutting plane methods, Mathematical Programming, 169 (2018), pp. 489–529.
  • [82] H. Xu and S. Mannor, Distributionally robust Markov decision processes, Advances in Neural Information Processing Systems, 23 (2010).
  • [83] H. Xu and D. Zhang, Smooth sample average approximation of stationary points in nonsmooth stochastic optimization and applications, Mathematical Programming, 119 (2009), pp. 371–401.
  • [84] I. Yang, Wasserstein distributionally robust stochastic control: A data-driven approach, IEEE Transactions on Automatic Control, 66 (2020), pp. 3863–3870.
  • [85] S. Zhang, H. Xu, and H. Sun, Statistical Robustness of Kernel Learning Estimator with Respect to Data Perturbation, arXiv preprint arXiv:2406.10555, (2024).

9 Appendix

Definition 2 (Fortet-Mourier metric).

Let

\mathcal{F}_{p}(Z):=\left\{h:Z\rightarrow\mathbb{R}:|h({\bm{z}}^{\prime})-h({\bm{z}}^{\prime\prime})|\leq L_{p}({\bm{z}}^{\prime},{\bm{z}}^{\prime\prime})\|{\bm{z}}^{\prime}-{\bm{z}}^{\prime\prime}\|,\ \forall{\bm{z}}^{\prime},{\bm{z}}^{\prime\prime}\in Z\right\} (9.1)

be the set of locally Lipschitz continuous functions of growth order p, where \|\cdot\| denotes some norm on Z, L_{p}({\bm{z}}^{\prime},{\bm{z}}^{\prime\prime}):=\max\{1,\|{\bm{z}}^{\prime}\|,\|{\bm{z}}^{\prime\prime}\|\}^{p-1} for all {\bm{z}}^{\prime},{\bm{z}}^{\prime\prime}\in Z, and p\geq 1 describes the growth of the local Lipschitz constants. The p-th order Fortet-Mourier metric over \mathscr{P}(Z) is defined by

\mathsf{dl}_{\rm FM}(P^{\prime},P^{\prime\prime}):=\sup_{h\in\mathcal{F}_{p}(Z)}\left|\int_{Z}h({\bm{z}})P^{\prime}(d{\bm{z}})-\int_{Z}h({\bm{z}})P^{\prime\prime}(d{\bm{z}})\right|,\ \forall P^{\prime},P^{\prime\prime}\in\mathscr{P}(Z). (9.2)

In the case p=1, it reduces to the Kantorovich metric, and we denote the distance by \mathsf{dl}_{K}.

Definition 3 (Wasserstein distance/metric).

For probability measures P and \tilde{P}, the Wasserstein distance/metric of order r\geq 1 is

\mathsf{dl}_{W}^{r}(P,\tilde{P})=\left(\inf_{\pi}\iint d\left(\xi,\tilde{\xi}\right)^{r}\pi(\mathrm{d}\xi,\mathrm{d}\tilde{\xi})\right)^{\frac{1}{r}}, (9.3)

where the infimum is taken over all probability measures \pi with marginals P and \tilde{P}, i.e.,

P(A)=\pi(A\times\Xi),\quad A\in\mathscr{B}(\Xi),\ \text{and}
\tilde{P}(B)=\pi(\Xi\times B),\quad B\in\mathscr{B}(\Xi). (9.4)

One of the main results concerning the Wasserstein distance is the Kantorovich–Rubinstein theorem ([34]), which establishes that the Kantorovich metric of two probability measures coincides with the Wasserstein distance of order r=1, i.e.,

\mathsf{dl}_{W}^{1}(P,\tilde{P})=\mathsf{dl}_{K}(P,\tilde{P}). (9.5)
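Identity (9.5) can be checked numerically for distributions on the real line. For two equal-size empirical measures in one dimension, the optimal transport plan matches sorted samples, so \mathsf{dl}_{W}^{1} is the mean absolute difference of order statistics; and by the Kantorovich dual form, any 1-Lipschitz test function (here h(z)=z) gives a lower bound. A small illustrative sketch:

```python
import numpy as np

def w1(x, y):
    """Order-1 Wasserstein distance between two equal-size empirical
    measures on the real line: match sorted samples (the 1-D optimal plan)."""
    xs, ys = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert xs.shape == ys.shape
    return float(np.abs(xs - ys).mean())

# Exact example: uniform measures on {0, 1} and {0, 2}; the optimal plan
# moves mass 1/2 from 1 to 2, so dl_W^1 = 1/2.
d_small = w1([0.0, 1.0], [0.0, 2.0])

# Kantorovich dual lower bound: h(z) = z is 1-Lipschitz, so the difference
# of empirical means never exceeds the order-1 Wasserstein distance.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 5000)
y = rng.normal(0.5, 1.0, 5000)
d, lower = w1(x, y), abs(x.mean() - y.mean())
```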