
Distributionally Robust Optimal and Safe Control of Stochastic Systems via Kernel Conditional Mean Embedding [extended version]

Licio Romao, Ashish R. Hota, and Alessandro Abate. L. Romao and A. Abate are with the Department of Computer Science, Oxford University, UK. Email addresses: {licio.romao,aabate}@cs.ox.ac.uk. A. R. Hota is with the Department of Electrical Engineering, Indian Institute of Technology, Kharagpur, India. Email: ahota@ee.iitkgp.ac.in.
Abstract

We present a novel distributionally robust framework for dynamic programming that uses kernel methods to design feedback control policies. Specifically, we leverage kernel mean embedding to map the transition probabilities governing the state evolution into an associated reproducing kernel Hilbert space. Our key idea lies in combining conditional mean embedding with the maximum mean discrepancy distance to construct an ambiguity set, and then designing a robust control policy using techniques from distributionally robust optimization. The main theoretical contribution of this paper is to leverage functional-analytic tools to prove that optimal policies for this infinite-dimensional min-max problem are Markovian and deterministic. Additionally, we discuss approximation schemes based on state and input discretization to make the approach computationally tractable. To validate the theoretical findings, we conduct an experiment on safe control of thermostatically controlled loads (TCLs).

I Introduction

We focus on discrete-time stochastic control problems, where states evolve according to an underlying stochastic transition kernel. The main objective is to design a feedback control policy that either minimizes an objective function or maximizes the probability of satisfying temporal properties, such as safety or reach-avoid specifications [1, 2].

Drawing inspiration from recent advancements in the data-driven control literature [3, 4], which advocate techniques that enable the design of feedback controllers based on available data, we introduce a novel design approach based on dynamic programming that leverages available trajectories of the system. To address sampling errors resulting from finite data, we employ techniques from “distributionally robust” optimization and control [5, 6, 7, 8], which have gained significant attention in the community. These techniques induce robustness in the space of probability measures through the creation of so-called ambiguity sets. In other words, distributionally robust techniques compute the worst-case expected value of a function when the underlying measure belongs to such an ambiguity set. In this paper, we relate these ideas to the class of min-max control problems initially investigated in the seminal work [9] and further explored in [10].

A key feature of our approach is the use of kernel methods [11, 12], specifically conditional mean embedding and the maximum mean discrepancy (MMD) [13]. These methods are employed to embed probability distributions into an associated reproducing kernel Hilbert space (RKHS) and to measure the distance between two probability distributions using the MMD. We create an ambiguity set represented as a ball in the space of probability measures, with the mean embedding providing an estimate of the center of the generated ambiguity set. This allows us to reason about uncertainty and ensure robustness in the decision-making process.

The construction of a min-max control problem through the ambiguity set formulation is relevant especially in data-limited scenarios, where the empirical estimate may not be rich enough to approximate the true conditional mean embedding. While conditional mean embedding and its empirical estimate have been applied in the context of Bayesian inference [14], dynamical systems [15], and more recently reachability analysis [16, 17], we are not aware of any work that leverages this framework for control synthesis using distributionally robust dynamic programming. In fact, distributionally robust optimization (DRO) with ambiguity sets defined via MMD and kernel mean embedding has only recently been studied [12], where the authors established a strong duality result for this class of problems. Notably, kernel DRO problems have not received much attention, with [18] being the sole exception.

Before summarizing our main contributions, we highlight the widespread use of kernel methods, MMD, and kernel mean embedding within the machine learning and system identification literature [19, 13, 20, 11]. For instance, Muandet et al. [11] express the expectation of any function of the subsequent state as an inner product between such a function and the conditional mean embedding. When the transition probability is unknown and we have access to state-input trajectories, the empirical estimate of the conditional mean embedding has been used to approximate the inner product computation [16, 21].

Moreover, distributionally robust dynamic programming has been studied in several recent contributions [22, 23, 24, 25], where ambiguity sets are defined in an exogenous manner, independent of the current state and action. While this is a reasonable assumption when the state evolution is uncertain in a parametric manner (e.g., the state transition being governed by known dynamics affected by a parametric uncertainty or additive disturbance), the more general case of state evolution given by a stochastic transition kernel requires defining the stochastic state evolution and its associated ambiguity set as a function of the current state and chosen action. This class of ambiguity sets is referred to as decision-dependent ambiguity sets and often poses challenges for tractable reformulations [26, 27]. Our main contributions are summarized as follows: (1) We introduce a novel framework that provides a solution to discrete-time stochastic control problems based on available system trajectories. The key objective is to derive control policies that are robust against sampling errors due to finite data, achieved through the construction of ambiguity sets; (2) We simultaneously employ the kernel mean embedding and the MMD distance in the space of probability distributions, enabling the formulation and solution of a distributionally robust dynamic program; (3) We leverage functional-analytic tools and an existing result in the literature to demonstrate that the optimal policies obtained through our formulation can be chosen to be Markovian and deterministic.

II Preliminaries

II-A Reproducing kernel Hilbert spaces (RKHS) and kernel mean embeddings

Let $(\mathcal{X},\mathcal{F}_{X})$ be a measurable space, where $\mathcal{X}$ is an abstract set and $\mathcal{F}_{X}$ represents a $\sigma$-algebra on $\mathcal{X}$ (please refer to Chapter 1 in [28] for more details). Consider a measurable function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$, called a kernel, that satisfies the following properties.

  • Boundedness: $\sup_{x\in\mathcal{X}}|k(x,x)|<\infty$;

  • Symmetry: For any $x,x^{\prime}\in\mathcal{X}$, we have $k(x,x^{\prime})=k(x^{\prime},x)$;

  • Positive semidefiniteness (PSD): For any finite collection of points $(x_{i})_{i=1}^{m}$, where $x_{i}\in\mathcal{X}$ for all $i=1,\ldots,m$, and for any vector $\alpha\in\mathbb{R}^{m}$, we have

    $$\sum_{i,j=1}^{m}\alpha_{i}\alpha_{j}k(x_{i},x_{j})\geq 0.$$

    In other words, the Gram matrix $K$ whose $(i,j)$-th entry is given by $k(x_{i},x_{j})$ is a positive semidefinite matrix for any choice of points $\{x_{i}\}_{i=1}^{m}$.

A kernel function $k$ satisfying the above three properties is called a positive semidefinite kernel.
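As a quick sanity check of these properties, the following minimal sketch (our own illustration, not from the paper) builds the Gram matrix of a Gaussian kernel, which is bounded, symmetric, and PSD, and verifies symmetry and positive semidefiniteness numerically.

```python
import numpy as np

def gaussian_kernel(x, x_prime, gamma=1.0):
    """Gaussian (RBF) kernel k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))

def gram_matrix(points, kernel):
    """Gram matrix K with K[i, j] = k(x_i, x_j)."""
    m = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(m)]
                     for i in range(m)])

rng = np.random.default_rng(0)
points = rng.standard_normal((10, 2))      # ten points in R^2
K = gram_matrix(points, gaussian_kernel)

# Symmetry holds by construction; PSD means all eigenvalues of the
# symmetric Gram matrix are nonnegative (up to numerical round-off).
assert np.allclose(K, K.T)
assert np.linalg.eigvalsh(K).min() >= -1e-10
```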

Two consequences follow from the presence of a positive semidefinite kernel. First, every positive semidefinite kernel is associated with a reproducing kernel Hilbert space (RKHS) $\mathcal{H}_{\mathcal{X}}$, which is defined as

$$\mathcal{H}_{\mathcal{X}}=\overline{\bigcup_{\begin{subarray}{c}I\subset\mathcal{X}\\ I\text{ finite}\end{subarray}}\mathrm{span}\{k(x,\cdot):\mathcal{X}\to\mathbb{R},\,x\in I\}}, \tag{1}$$

that is, the closure of all finite-dimensional subspaces induced by the kernel $k$. The space $\mathcal{H}_{\mathcal{X}}$ is equipped with the inner product $\langle k(\cdot,x_{1}),k(\cdot,x_{2})\rangle_{\mathcal{H}_{\mathcal{X}}}=k(x_{1},x_{2})$. Note that the boundedness, symmetry, and PSD properties above ensure that this inner product is well-defined; hence, $\mathcal{H}_{\mathcal{X}}$ equipped with this inner product is a Hilbert space (see [29] for more details). By definition, for any function $f\in\mathcal{H}_{\mathcal{X}}$ there exist a sequence of integers $\{m_{n}\}_{n\in\mathbb{N}}$ and a sequence of functions $f_{n}(x)=\sum_{i=1}^{m_{n}}\beta_{i}^{n}k(x_{i}^{n},x)$ such that $f(x)=\lim_{n\to\infty}f_{n}(x)$, which then implies that

$$\langle f,k(\cdot,x)\rangle_{\mathcal{H}_{\mathcal{X}}}=f(x),$$

thus justifying the reproducing kernel property of the Hilbert space $\mathcal{H}_{\mathcal{X}}$. Alternatively, by the Riesz representation theorem (Theorem 4.11 in [29]), any Hilbert space with the reproducing kernel property can be identified with the RKHS of a PSD kernel. The interested reader is referred to [13] and references therein for more details. Second, there exists a feature map $\phi:\mathcal{X}\to\mathcal{H}_{\mathcal{X}}$ with the property $k(x_{1},x_{2})=\langle\phi(x_{1}),\phi(x_{2})\rangle_{\mathcal{H}_{\mathcal{X}}}$. The canonical feature map is given by $\phi(x):=k(x,\cdot)$, and we use the notation $\phi(x)(x^{\prime})=k(x,x^{\prime})$.

We now introduce the notion of kernel mean embedding of probability measures [13, 11]. Let $\mathcal{P}(\mathcal{X})$ be the set of probability measures on $\mathcal{X}$, and let $X$ be a random variable defined on $\mathcal{X}$ with distribution $\mathbb{P}$. The kernel mean embedding is a mapping $\Psi:\mathcal{P}(\mathcal{X})\to\mathcal{H}_{\mathcal{X}}$ defined as

$$\Psi(\mathbb{P})(\cdot):=\mathbb{E}_{\mathbb{P}}[\phi(X)]=\int_{\mathcal{X}}k(x,\cdot)\,d\mathbb{P}(x). \tag{2}$$

We have the following result from [13, 11] on the reproducing property of the expectation operator in the RKHS.

Lemma 1 (Lemma 3.1 [11]).

If $\mathbb{E}_{\mathbb{P}}[\sqrt{k(X,X)}]<\infty$, then $\Psi(\mathbb{P})\in\mathcal{H}_{\mathcal{X}}$ and $\mathbb{E}_{\mathbb{P}}[f(X)]=\langle f,\Psi(\mathbb{P})\rangle_{\mathcal{H}_{\mathcal{X}}}$ for all $f\in\mathcal{H}_{\mathcal{X}}$.

Lemma 1 implies that $\Psi(\mathbb{P})$ is indeed an element of the RKHS $\mathcal{H}_{\mathcal{X}}$, and that the expectation of any function of the random variable $X$ can be computed by means of an inner product between the corresponding function and the kernel mean embedding. Let $\{\widehat{x}_{(1)},\ldots,\widehat{x}_{(m)}\}$ be a collection of $m$ independent samples from the distribution $\mathbb{P}$ and consider the random variable $\widehat{\Psi}:\mathcal{X}^{m}\times\mathcal{P}(\mathcal{X})\to\mathcal{H}_{\mathcal{X}}$ given by

$$\widehat{\Psi}(\mathbb{P})(\cdot)=\frac{1}{m}\sum_{i=1}^{m}\phi(\widehat{x}_{(i)})=\frac{1}{m}\sum_{i=1}^{m}k(\widehat{x}_{(i)},\cdot). \tag{3}$$

In other words, $\widehat{\Psi}(\mathbb{P})$ is the mean embedding of the empirical distribution $\widehat{\mathbb{P}}_{m}:=\frac{1}{m}\sum_{i=1}^{m}\delta_{\widehat{x}_{(i)}}$ induced by the samples.
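To illustrate (3) and Lemma 1 together, the sketch below (our own, assuming a scalar Gaussian kernel) evaluates the empirical mean embedding at a point and shows that, for the empirical measure, the inner product of Lemma 1 reduces to a plain sample average.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 1.0
k = lambda x, y: np.exp(-gamma * (x - y) ** 2)   # scalar Gaussian kernel

samples = rng.standard_normal(500)               # m i.i.d. draws from P

def empirical_embedding(x):
    """Evaluate the empirical mean embedding (3) at the point x:
    (1/m) * sum_i k(x_hat_(i), x)."""
    return np.mean(k(samples, x))

# By Lemma 1 applied to the empirical distribution, the inner product of
# any f in the RKHS with the empirical embedding is the sample average of f:
f = lambda x: k(0.3, x) - 0.5 * k(-1.2, x)       # f = k(0.3, .) - 0.5 k(-1.2, .)
estimate_of_Ef = np.mean(f(samples))             # <f, Psi_hat(P)> ~ E_P[f(X)]
```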

II-B Kernel-based ambiguity sets

There are several ways to introduce a metric in the space of probability measures (see [30] for more details). In the context of the RKHS and the kernel mean embedding introduced in the previous section, a common choice is a type of integral probability metric [13] called the maximum mean discrepancy (MMD), defined as

$$\mathtt{MMD}(\mathbb{P},\mathbb{Q})=\sup_{\|f\|_{\mathcal{H}_{\mathcal{X}}}\leq 1}\langle f,\Psi(\mathbb{P})\rangle_{\mathcal{H}_{\mathcal{X}}}-\langle f,\Psi(\mathbb{Q})\rangle_{\mathcal{H}_{\mathcal{X}}}=\|\Psi(\mathbb{P})-\Psi(\mathbb{Q})\|_{\mathcal{H}_{\mathcal{X}}}. \tag{4}$$

In this work, we consider data-driven MMD ambiguity sets induced by observed samples $\{\widehat{x}_{(1)},\ldots,\widehat{x}_{(m)}\}$, defined as

$$\mathcal{M}^{\epsilon}_{m}:=\{\mathbb{P}\in\mathcal{P}(\mathcal{X})\ |\ \mathtt{MMD}(\mathbb{P},\widehat{\mathbb{P}}_{m})\leq\epsilon\}, \tag{5}$$

where $\widehat{\mathbb{P}}_{m}$ is the empirical distribution defined earlier. Thus, $\mathcal{M}^{\epsilon}_{m}$ contains all distributions whose kernel mean embedding is within distance $\epsilon\geq 0$ of the kernel mean embedding of the empirical distribution. The above ambiguity set also enjoys a sharp uniform convergence guarantee of $\mathcal{O}(1/\sqrt{m})$, as shown in [31]. Details are omitted for brevity.
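Between two empirical distributions, the MMD in (4) can be computed in closed form via the kernel trick. The following sketch (ours; the helper name mmd and the Gaussian kernel choice are assumptions) computes it and uses it as a membership test for the ambiguity set (5).

```python
import numpy as np

def mmd(x, y, gamma=1.0):
    """Empirical MMD (4) between the empirical distributions of the sample
    sets x and y, via the kernel trick:
    MMD^2 = mean K(x, x) - 2 * mean K(x, y) + mean K(y, y)."""
    k = lambda a, b: np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
    mmd_sq = k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean()
    return np.sqrt(max(mmd_sq, 0.0))   # clip tiny negative round-off

rng = np.random.default_rng(0)
x_hat = rng.standard_normal(200)       # samples defining P_hat_m
z = rng.standard_normal(200) + 0.1     # samples from a candidate distribution
# The empirical distribution of z belongs to M_m^eps iff this test passes:
print(mmd(z, x_hat) <= 0.1)
```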

II-C RKHS embedding of conditional distributions

We now consider random variables of the form $(Y,X)$ taking values in the space $\mathcal{Y}\times\mathcal{X}$. Let $\mathcal{H}_{\mathcal{Y}}$ be the RKHS of real-valued functions defined on $\mathcal{Y}$ with positive definite kernel $k_{\mathcal{Y}}:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ and feature map $\phi_{\mathcal{Y}}:\mathcal{Y}\to\mathcal{H}_{\mathcal{Y}}$. For any distribution $\mu\in\mathcal{P}(\mathcal{Y}\times\mathcal{X})$, we define the bilinear covariance operator [32] $\mathrm{cov}:\mathcal{H}_{\mathcal{Y}}\times\mathcal{H}_{\mathcal{X}}\to\mathbb{R}$ as

$$\mathrm{cov}(g,f)=\int_{\mathcal{Y}\times\mathcal{X}}g(y)f(x)\,\mu(dy,dx). \tag{6}$$

We define the cross-covariance operator $C_{XY}:\mathcal{H}_{\mathcal{Y}}\to\mathcal{H}_{\mathcal{X}}$ as the unique operator, well-defined due to the Riesz representation theorem, such that, for all $f\in\mathcal{H}_{\mathcal{X}}$ and $g\in\mathcal{H}_{\mathcal{Y}}$, we have

$$\langle f,C_{XY}g\rangle_{\mathcal{H}_{\mathcal{X}}}=\mathrm{cov}(g,f).$$

Similarly, we define the operator $C_{YX}:\mathcal{H}_{\mathcal{X}}\to\mathcal{H}_{\mathcal{Y}}$, and one may notice that $C_{YX}=C_{XY}^{\star}$, where $C_{XY}^{\star}$ is the adjoint of $C_{XY}$. When $\mathcal{Y}=\mathcal{X}$, the analogous operator $C_{YY}$ is called the covariance operator. Of interest to our approach is the conditional mean embedding of conditional distributions.

For two measurable spaces $(\mathcal{X},\mathcal{B}_{\mathcal{X}})$ and $(\mathcal{Y},\mathcal{B}_{\mathcal{Y}})$, where $\mathcal{B}_{\mathcal{X}}$ and $\mathcal{B}_{\mathcal{Y}}$ are the associated $\sigma$-algebras, a stochastic kernel $T:\mathcal{Y}\to\mathcal{P}(\mathcal{X})$ is a measurable mapping from $\mathcal{Y}$ to the set of probability measures equipped with the topology of weak convergence. Please refer to [33], Chapter 7, for a more detailed definition.

Definition 1 (Definition 4.1 [11]).

Given a stochastic kernel $T:\mathcal{Y}\to\mathcal{P}(\mathcal{X})$, its conditional mean embedding is a mapping $\psi:\mathcal{Y}\to\mathcal{H}_{\mathcal{X}}$ such that, for all $f\in\mathcal{H}_{\mathcal{X}}$, the following property holds:

$$\langle\psi(y),f\rangle_{\mathcal{H}_{\mathcal{X}}}=\int_{\mathcal{X}}f(x)\,T(dx\mid y)=\mathbb{E}_{X\sim T(\cdot\mid y)}[f(X)], \tag{7}$$

that is, the inner product of the conditional mean embedding with a function $f\in\mathcal{H}_{\mathcal{X}}$ coincides with the conditional expectation of $f$ under the stochastic kernel $T$.

A result crucial to the success of conditional mean embeddings in machine learning applications is the relation, proved in [32, 34], between the mapping in Definition 1 and the cross-covariance and covariance operators $C_{XY}$ and $C_{YY}$, stated in the sequel.

Proposition 1 (Corollary 3, [34]).

Let $\mathbb{Q}\in\mathcal{P}(\mathcal{Y})$ be a probability measure over $\mathcal{Y}$ and $T:\mathcal{Y}\to\mathcal{P}(\mathcal{X})$ be a stochastic kernel. Define the measure $\mu\in\mathcal{P}(\mathcal{Y}\times\mathcal{X})$ as the unique measure such that $\mu(U\times V)=\int_{U}\mathbb{Q}(dy)T(V\mid y)$ and consider its corresponding covariance operators as in (6). Then we have that

$$\psi(y)(\cdot)=C_{XY}C_{YY}^{-1}k_{\mathcal{Y}}(\cdot,y),$$

where $\psi$ is the conditional mean embedding of Definition 1.

The importance of Proposition 1 lies in the fact that it allows us to estimate the mapping $\psi$ from available data. Indeed, in many applications the joint or conditional distributions involving $\mathcal{Y}$ and $\mathcal{X}$ are not known; rather, we have access to i.i.d. samples $\{(\widehat{y}_{(i)},\widehat{x}_{(i)})\}_{i=1}^{m}$ drawn from the joint distribution $\mu\in\mathcal{P}(\mathcal{Y}\times\mathcal{X})$ induced by an underlying stochastic kernel (namely, the system dynamics, as discussed in the next section).

Remark 1.

The main step of the proof of Proposition 1 consists in showing that, for all $f\in\mathcal{H}_{\mathcal{X}}$, the relation $\langle f,C_{XY}C_{YY}^{-1}k_{\mathcal{Y}}(\cdot,y)\rangle_{\mathcal{H}_{\mathcal{X}}}=\langle f,\psi(y)\rangle_{\mathcal{H}_{\mathcal{X}}}$ holds. Details are omitted for brevity.

Let $K_{\mathcal{Y}}\in\mathbb{R}^{m\times m}$ be the Gram matrix associated with $\{\widehat{y}_{(i)}\}_{i=1}^{m}$, that is, the matrix whose $(i,j)$-th entry is given by $[K_{\mathcal{Y}}]_{ij}=k_{\mathcal{Y}}(\widehat{y}_{(i)},\widehat{y}_{(j)})$. The empirical estimate of the conditional mean embedding $\psi$ was derived in [35] and is stated below.

Theorem 1 (Theorem 4.2 [11]).

An empirical estimate of the conditional mean embedding $\psi:\mathcal{Y}\to\mathcal{H}_{\mathcal{X}}$ is given by

$$\widehat{\psi}_{m}(y)(\cdot)=\sum_{i=1}^{m}\beta_{i}(y)k_{\mathcal{X}}(\widehat{x}_{(i)},\cdot), \tag{8}$$

where $\beta(y)=(K_{\mathcal{Y}}+m\lambda\mathbf{I}_{m})^{-1}k_{\mathbf{y}}(y)\in\mathbb{R}^{m}$, with $k_{\mathbf{y}}(y)=[k_{\mathcal{Y}}(\widehat{y}_{(1)},y)\ \ k_{\mathcal{Y}}(\widehat{y}_{(2)},y)\ \ \ldots\ \ k_{\mathcal{Y}}(\widehat{y}_{(m)},y)]^{\top}\in\mathbb{R}^{m}$, $\mathbf{I}_{m}$ being the identity matrix of dimension $m$ and $\lambda>0$ being a regularization parameter.

The above empirical estimate can also be obtained by solving a regularized regression problem, as established in [36, 37]. The regularization is necessary for the well-posedness of the empirical estimator of the inverse of the operator $C_{YY}$.
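The coefficients $\beta(y)$ of Theorem 1 amount to one regularized linear solve. The sketch below (ours; the names cme_weights and gaussian_gram, the Gaussian kernel, and the toy data are assumptions) computes them and uses Definition 1 to form a plug-in estimate of a conditional expectation.

```python
import numpy as np

def gaussian_gram(a, b, gamma=1.0):
    """Matrix with entries k_Y(a_i, b_j) for scalar inputs."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def cme_weights(y, y_hat, lam=1e-3, gamma=1.0):
    """Coefficients beta(y) = (K_Y + m * lam * I)^{-1} k_y(y) of Theorem 1."""
    m = len(y_hat)
    K_Y = gaussian_gram(y_hat, y_hat, gamma)
    k_y = gaussian_gram(y_hat, np.array([y]), gamma).ravel()
    return np.linalg.solve(K_Y + m * lam * np.eye(m), k_y)

# Plug-in estimate of a conditional expectation via Definition 1 and (8):
#   E[f(X) | Y = y] ~ <psi_hat(y), f> = sum_i beta_i(y) f(x_hat_(i)),
# applying the reproducing property term by term.
rng = np.random.default_rng(0)
y_hat = rng.uniform(0, 1, 100)                                    # conditioning samples
x_hat = np.sin(2 * np.pi * y_hat) + 0.1 * rng.standard_normal(100)
beta = cme_weights(0.5, y_hat)
estimate = beta @ np.square(x_hat)                                # f(x) = x^2
```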

III Kernel Distributionally Robust Optimal Control

We now connect the abstract mathematical framework introduced in the previous sections to optimal control problems. Let $(\mathcal{X},\mathcal{F},\mathbb{P})$ be a probability measure space, where $\mathcal{X}\subset\mathbb{R}^{n}$; we use this measure-theoretic framework to formalize statements about the stochastic process in the sequel. Consider the discrete-time stochastic system given by

$$x_{k+1}\sim T(\cdot\,|\,x_{k},a_{k}),\quad a_{k}\in\mathcal{A}(x_{k}),\quad x_{0}=\bar{x}, \tag{9}$$

where $x_{k}\in\mathcal{X}\subset\mathbb{R}^{n}$ and $a_{k}\in\mathcal{A}(x_{k})\subset\mathbb{R}^{p}$ denote the state and control input at time $k\in\mathbb{N}$, with $\mathcal{A}(x_{k})$ being the set of admissible control inputs at time $k$, and $\bar{x}$ denotes the initial state. Observe that we represent the system dynamics by a stochastic kernel $T:\mathcal{X}\times\mathcal{A}(\mathcal{X})\to\mathcal{P}(\mathcal{X})$ defined over the state-input pair, which describes a probability distribution over the next state. It is standard to assign semantics to the dynamical system in (9) in such a way that it defines a unique probability measure on the space of sequences with values in $\mathcal{X}$, using the Kolmogorov extension theorem. See [38] for more details.

We define a sequence of history-dependent (or non-Markovian) control policies $\pi_{k}:(\mathcal{X}\times\mathcal{A}(\mathcal{X}))^{k}\times\mathcal{X}\to\mathcal{P}(\mathcal{A}(\mathcal{X}))$, each of which maps a sequence of state-input pairs $(x_{0},a_{0},\ldots,x_{k-1},a_{k-1},x_{k})$ into a probability measure supported on the space of admissible inputs $\mathcal{A}(\mathcal{X})$. The collection of admissible control policies over a horizon of length $L$ is given by the set

$$\Pi_{L}=\{(\pi_{0},\pi_{1},\ldots,\pi_{L-1})\mid\pi_{k}(\mathcal{A}(x_{k})\,|\,h)=1,\ \text{for all }k\in\{0,\ldots,L-1\},\ \text{for all }h\in(\mathcal{X}\times\mathcal{A}(\mathcal{X}))^{k}\times\mathcal{X}\},$$

composed of the $L$ control policies with support on the admissible input set associated with the current system state. In case the control policy depends only on the current state, i.e., $\pi_{k}(h)=\pi_{k}(h^{\prime})$ for all $h,h^{\prime}\in(\mathcal{X}\times\mathcal{A}(\mathcal{X}))^{k}\times\mathcal{X}$ that agree on the last entry, we call it a Markovian policy.

Inspired by a growing interest in data-driven control methods [16, 21, 3], our objective is to design an admissible control policy that minimizes a performance index with respect to the generated trajectories of the system dynamics in (9) by relying on an available dataset $\{\widehat{x}_{(i)},\widehat{a}_{(i)},\widehat{x}_{(i)}^{+}\}_{i=1}^{m}$, consisting of state-input-next-state tuples. To account for the sampling error due to the finite dataset, we leverage techniques from distributionally robust optimization [7, 6, 5, 23] by creating a state-dependent ambiguity set around the estimate of the system dynamics, which is obtained using the mathematical framework introduced in Section II.

Formally, and referring to the notation of Section II, let us identify the space $\mathcal{Y}$ with the Cartesian product $\mathcal{X}\times\mathcal{A}(\mathcal{X})$ and $\mathcal{X}$ with the state space of the dynamics in (9). Let $k_{\mathcal{Y}}:(\mathcal{X}\times\mathcal{A}(\mathcal{X}))\times(\mathcal{X}\times\mathcal{A}(\mathcal{X}))\to\mathbb{R}$ and $k_{\mathcal{X}}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be the corresponding kernels that induce, respectively, the RKHSs $\mathcal{H}_{\mathcal{Y}}$ and $\mathcal{H}_{\mathcal{X}}$. Let $T:\mathcal{X}\times\mathcal{A}(\mathcal{X})\to\mathcal{P}(\mathcal{X})$ be the stochastic kernel governing the state transitions, and let $\psi$ and $\widehat{\psi}$ be the conditional mean embeddings as in Definition 1 and Theorem 1, respectively, the latter obtained from the dataset $\{\widehat{x}_{(i)},\widehat{a}_{(i)},\widehat{x}_{(i)}^{+}\}_{i=1}^{m}$.

A key novelty of our approach is the construction of an ambiguity set using the kernel mean embedding introduced in (2):

$$\mathcal{M}^{\epsilon}(x,a)=\{\mathbb{P}\in\mathcal{P}(\mathcal{X})\mid\|\Psi(\mathbb{P})-\psi(x,a)\|_{\mathcal{H}_{\mathcal{X}}}\leq\epsilon\}, \tag{10}$$

which is the collection of probability measures over $\mathcal{X}$ whose mean embedding is $\epsilon$-close to the conditional mean embedding of the system dynamics. Notice the dependence on the current state-action pair through the conditional mean embedding. Similar to the control policy definition, we define a collection of admissible dynamics for a given time horizon $L$. Formally, for a given sequence $(x_{0},a_{0},\ldots,x_{L-1},a_{L-1})$ of state-input pairs of length $L$, we define the set of admissible dynamics as

$$\Gamma_{L}=\left\{(\mu_{0},\mu_{1},\ldots,\mu_{L-1})\mid\mu_{k}\in\mathcal{M}^{\epsilon}(x_{k},a_{k})\ \text{for all }k\in\{0,\ldots,L-1\}\right\}, \tag{11}$$

where $\mathcal{M}^{\epsilon}(x_{k},a_{k})$ denotes the ambiguity set in (10).

Our main goal is to solve a distributionally robust dynamic program using the ingredients presented so far. To this end, let $L\in\mathbb{N}$ be the time horizon, and consider the corresponding set of control policies $\Pi_{L}$ and admissible dynamics $\Gamma_{L}$. For any $\pi\in\Pi_{L}$ and $\mu\in\Gamma_{L}$, and for any initial state $\bar{x}\sim\mu_{0}$, where $\mu_{0}$ is the first entry of the admissible dynamics $\mu$, we denote by $\mathbb{P}^{\pi,\mu}$ the induced measure on the space of sequences in $\mathcal{X}$ of length $L$. For a given stage cost $c:\mathcal{X}\times\mathcal{A}(\mathcal{X})\to\mathbb{R}$, we define the finite-horizon expected cost as

$$V_{L}(\pi,\mu):=\mathbb{E}^{\pi,\mu}\left[\sum_{k=0}^{L-1}c(x_{k},a_{k})\right], \tag{12}$$

where $\mathbb{E}^{\pi,\mu}$ denotes the expectation operator with respect to $\mathbb{P}^{\pi,\mu}$. Our goal is to find a policy $\pi^{\star}\in\Pi_{L}$ that solves the distributionally robust control problem given by

$$\inf_{\pi\in\Pi_{L}}\sup_{\mu\in\Gamma_{L}}V_{L}(\pi,\mu). \tag{13}$$

Problem (13) is an infinite-dimensional min-max problem, since the collections of control policies and admissible dynamics are infinite-dimensional spaces; it has been studied in several related communities [5] under different lenses. This paper departs from related contributions within the control community [21, 16] by showing, under technical conditions and leveraging compactness-related arguments, that there exist optimal Markovian policies solving (13). To this end, we consider the following technical conditions.

Assumption 1.

Let $V\subset\mathcal{X}\times\mathcal{A}(\mathcal{X})\times\mathcal{P}(\mathcal{X})$ be a weakly compact set such that $(x,a,\mu)\in V$ if and only if $\mu\in\mathcal{M}^{\epsilon}(x,a)$ for some $\epsilon>0$. The following conditions hold.

  1. The stage cost function $c(x,a)$ in (12) is lower semicontinuous. Moreover, there exist a constant $C\geq 0$ and a continuous function $w:\mathcal{X}\to[1,\infty)$ such that

    $$|c(x,a)|\leq Cw(x),\quad\forall a\in\mathcal{A}(x),\ x\in\mathcal{X}.$$

  2. For every bounded, continuous function $f:\mathcal{X}\to\mathbb{R}$, the functional $\varphi:V\to\mathbb{R}$ defined as $(x,a,\mu)\mapsto\int_{\mathcal{X}}f(\xi)\mu(d\xi)$ is continuous on $V$.

  3. There exists a constant $C>0$ such that $\varphi(x,a,\mu)\leq Cw(x)$ for all $(x,a,\mu)\in V$.

  4. The set $\mathcal{A}(x)$ is compact for each $x\in\mathcal{X}$. Furthermore, the set-valued mapping $x\mapsto\mathcal{A}(x)$ is upper semicontinuous.

Theorem 2.

Suppose Assumption 1 holds. Then there exists a collection of functions $f_{k}:\mathcal{X}\to\mathbb{R}^{p}$, for $k\in\{0,\ldots,L-1\}$, such that the Markov policy $(f_{0},\ldots,f_{L-1})$ is an optimal solution of the infinite-dimensional problem (13).

Proof.

Our proof strategy is inspired by the analysis in [9]. Without loss of generality, we may assume that the cost function $c:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ is bounded, that is, $w\equiv 1$ (otherwise one could consider a weighted norm using the function $w$ appearing in item 1 of Assumption 1; see [9] for more details). For each bounded function $g:\mathcal{X}\to\mathbb{R}$, we define the functionals $H:\mathcal{H}_{\mathcal{X}}\times\mathcal{X}\times\mathcal{A}\times\mathcal{P}(\mathcal{X})\to\mathbb{R}$ and $H^{\#}:\mathcal{H}_{\mathcal{X}}\times\mathcal{X}\times\mathcal{A}\to\mathbb{R}$, and the mapping $H^{\dagger}:\mathcal{H}_{\mathcal{X}}\to\mathcal{H}_{\mathcal{X}}$, as

$$H(g,x,a,\nu):=c(x,a)+\int_{\mathcal{X}}g(\xi)\nu(d\xi), \tag{14}$$

$$H^{\#}(g,x,a):=\sup_{\nu\in\mathcal{M}^{\epsilon}(x,a)}H(g,x,a,\nu), \tag{15}$$

$$H^{\dagger}(g)(x):=\inf_{a\in\mathcal{A}(x)}H^{\#}(g,x,a)=\inf_{a\in\mathcal{A}(x)}\sup_{\nu\in\mathcal{M}^{\epsilon}(x,a)}\left[c(x,a)+\int_{\mathcal{X}}g(\xi)\nu(d\xi)\right]. \tag{16}$$

Specifically, (16) defines the distributionally robust dynamic programming (DP) operator under the MMD ambiguity set defined in (10), which involves an infinite-dimensional optimization problem. Given any initial function $f\in\mathcal{H}_{\mathcal{X}}$, the value function of the distributionally robust control problem can be defined iteratively as

$$v_{L-1}(x):=f(x),\quad\text{for all }x\in\mathcal{X},$$
$$v_{k}(x):=H^{\dagger}(v_{k+1})(x),\quad\text{for all }x\in\mathcal{X}, \tag{17}$$

for $k=0,\ldots,L-2$. We are now ready to show that problem (13) admits a non-randomized Markov policy that is optimal. To this end, we need to show (1) lower semicontinuity of the functional $H$ and (2) that the inner supremum in the definition of $H^{\dagger}$ is achieved.

To show lower semicontinuity, we follow an approach similar to the proofs of [9, Theorem 3.1] and [23, Theorem 1]. The primary challenge is to show that the functional $H^{\dagger}$ used in (17) preserves the lower semicontinuity of the value function. We show this via induction. Let $v_{k+1}$ be lower semicontinuous. Following arguments identical to [9, Lemma 3.3], it can be shown that $H^{\dagger}(v_{k+1})$ is lower semicontinuous on $\mathcal{X}$, provided one can show that the mapping $(x,a)\mapsto\mathcal{M}^{\epsilon}(x,a)$ is weakly¹ compact and lower semicontinuous (i.e., the condition analogous to [9, Assumption 3.1(g)]).

¹ Since we are dealing with infinite-dimensional spaces, we rely on the topological notion of continuity. A function $f:\mathcal{Y}\to\mathcal{X}$ between two topological spaces $(\mathcal{Y},\tau_{\mathcal{Y}})$ and $(\mathcal{X},\tau_{\mathcal{X}})$ (please refer to [39] for an introduction to these concepts) is continuous if for all $U\in\tau_{\mathcal{X}}$ we have that $f^{-1}(U)\in\tau_{\mathcal{Y}}$. The notions of weak continuity, weak compactness, etc., are used because we equip the infinite-dimensional spaces $\mathcal{P}(\mathcal{X})$ and $\mathcal{H}_{\mathcal{X}}$ with the $\text{weak}^{*}$ topology. We refer the reader to [29], Chapter 4, for more details about these concepts.

To show weak compactness of $(x,a)\mapsto\mathcal{M}^{\epsilon}(x,a)$, let us first study properties of the set

$$\mathcal{C}^{\epsilon}(x,a)=\{f\in\mathcal{H}_{\mathcal{X}}:\|f-\psi(x,a)\|_{\mathcal{H}_{\mathcal{X}}}\leq\epsilon\},$$

where $\psi$ is the conditional mean embedding of Definition 1, using the notation of Section III. Note that $\mathcal{C}^{\epsilon}(x,a)$ is a closed, convex subset of the RKHS $\mathcal{H}_{\mathcal{X}}$; hence, by [29, Theorem 3.7], it is also weakly closed. Since Hilbert spaces are reflexive Banach spaces, Kakutani's theorem (see [29, Theorem 3.17]) implies that the bounded set $\mathcal{C}^{\epsilon}(x,a)$ is weakly compact. Now, observe that

$$\mathcal{M}^{\epsilon}(x,a)=\Psi^{-1}(\mathcal{C}^{\epsilon}(x,a)),$$

where the mapping $\Psi:\mathcal{P}(\mathcal{X})\to\mathcal{H}_{\mathcal{X}}$ defined in (2) is continuous due to the boundedness and continuity of the kernel, and thus weakly continuous. Since $\Psi$ is also surjective,² we have by the open mapping theorem ([29, Theorem 2.6]) that $\mathcal{M}^{\epsilon}(x,a)$ is also weakly compact.

² For any function $f\in\mathcal{H}_{\mathcal{X}}$ there exists a sequence $f_{n}(x)=\sum_{i=1}^{m}\beta_{i}(n)k(x_{i},x)$ such that $\lim_{n\to\infty}f_{n}(x)=f(x)$. Let $\mathbb{P}_{n}=\sum_{i=1}^{m}\beta_{i}(n)\delta_{x_{i}}$ and notice that $\Psi(\lim_{n\to\infty}\mathbb{P}_{n})=f$, thus showing that $\Psi$ is a surjective mapping.

For a fixed $\epsilon>0$, we now show that the mapping $(x,a)\mapsto\mathcal{M}^{\epsilon}(x,a)$ is lower semicontinuous. We define the distance from a distribution $\nu\in\mathcal{P}(\mathcal{X})$ to a convex subset $S$ of $\mathcal{M}^{\epsilon}(x,a)$ as³

$$d(\nu,S):=\inf_{\mu\in S}\|\Psi(\mu)-\Psi(\nu)\|_{\mathcal{H}_{\mathcal{X}}}.$$

³ This distance is only well-defined due to the weak compactness result shown above, which allows us to take convergent subsequences of any sequence achieving the infimum in this definition.

Let $(x,a,\nu)$ be given, with $x\in\mathcal{X}$, $a\in\mathcal{A}(x)$, and $\nu\in\mathcal{M}^{\epsilon}(x,a)$. Thus,

$$\|\Psi(\nu)-\psi(x,a)\|_{\mathcal{H}_{\mathcal{X}}}\leq\epsilon.$$

Consider a sequence $(x_{n},a_{n})_{n\geq 0}$ with $a_{n}\in\mathcal{A}(x_{n})$ for all $n\geq 0$ and $\lim_{n\to\infty}(x_{n},a_{n})=(x,a)$. From [40, Proposition 1.4.7], it follows that the lower semicontinuity of $(x,a)\mapsto\mathcal{M}^{\epsilon}(x,a)$ is equivalent to

$$\nu\in\liminf_{n\to\infty}\mathcal{M}^{\epsilon}(x_{n},a_{n})\iff\lim_{n\to\infty}d(\nu,\mathcal{M}^{\epsilon}(x_{n},a_{n}))=0.$$

To this end, we compute

$$\|\Psi(\nu)-\psi(x_{n},a_{n})\|_{\mathcal{H}_{\mathcal{X}}}\leq\|\Psi(\nu)-\psi(x,a)\|_{\mathcal{H}_{\mathcal{X}}}+\|\psi(x,a)-\psi(x_{n},a_{n})\|_{\mathcal{H}_{\mathcal{X}}}$$
$$\implies\lim_{n\to\infty}\|\Psi(\nu)-\psi(x_{n},a_{n})\|_{\mathcal{H}_{\mathcal{X}}}\leq\epsilon+\lim_{n\to\infty}\|\psi(x,a)-\psi(x_{n},a_{n})\|_{\mathcal{H}_{\mathcal{X}}}$$
$$\implies\lim_{n\to\infty}\|\Psi(\nu)-\psi(x_{n},a_{n})\|_{\mathcal{H}_{\mathcal{X}}}\leq\epsilon$$
$$\implies\lim_{n\to\infty}d(\nu,\mathcal{M}^{\epsilon}(x_{n},a_{n}))=0,$$

from the definition of the ambiguity set. In the second-to-last step, the second term vanishes as $\lim_{n\to\infty}(x_{n},a_{n})=(x,a)$, due to the continuity of the kernel function and the definition of $\psi(x,a)$ in Proposition 1. As a result, $\nu\in\liminf_{n\to\infty}\mathcal{M}^{\epsilon}(x_{n},a_{n})$ and thus $\mathcal{M}^{\epsilon}(x,a)\subseteq\liminf_{n\to\infty}\mathcal{M}^{\epsilon}(x_{n},a_{n})$.

With the above properties of the mapping $(x,a)\mapsto\mathcal{M}^{\epsilon}(x,a)$ in hand, it can be shown, following arguments identical to the proof of [9, Theorem 3.1], that both $H^{\#}(v_{k+1},x,a)$ and $H^{\dagger}(v_{k+1})$ are lower semicontinuous and that there exists a function $f_{k}:\mathcal{X}\to\mathcal{A}(\mathcal{X})$ such that $a_{k}=f_{k}(x_{k})$ is the minimizer of (16) for $g=v_{k+1}$. This concludes the proof.

Remark 2.

Note that the ambiguity sets considered in related works on distributionally robust control [23, 22] do not depend on the current state and action, i.e., the mapping $(x,a)\mapsto\mathcal{M}^{\epsilon}(x,a)$ is independent of $(x,a)$; as a result, properties such as compactness and lower semicontinuity are easily shown. In contrast, the ambiguity set considered here is more general and directly captures the dependence of the transition kernel on the current state-action pair. In the DRO literature, such ambiguity sets are referred to as decision-dependent ambiguity sets [41, 27, 26]; they are relatively challenging to handle and have been less explored in the past.

By showing that there exist deterministic and Markovian policies in the optimal set of (13), Theorem 2 allows us to reduce the search over general history-dependent policies to functions $\pi:\mathcal{X}\to\mathcal{A}(\mathcal{X})$, which at each time instant map a state to an admissible action. However, the result of Theorem 2 does not lead to a computationally tractable formulation for solving problem (13). To this end, we now describe two approximation schemes that enable us to obtain a computational solution to (13). The first approximation is related to the fact that we are aiming at a data-driven solution of (13); hence, the ambiguity set in (10) is replaced by its sample-based counterpart, whose center is defined according to the estimate of the conditional mean embedding mapping introduced in Theorem 1. That is, we consider instead the ambiguity set

$$\widehat{\mathcal{M}}^{\epsilon}_{m}(x,a)=\{\mathbb{P}\in\mathcal{P}(\mathcal{X})\mid\|\Psi(\mathbb{P})-\widehat{\psi}_{m}(x,a)\|_{\mathcal{H}_{\mathcal{X}}}\leq\epsilon\}. \tag{18}$$

Under this approximation, and exploiting the rectangular structure of the admissible dynamics in (11), one may observe that the inner supremum in (13) is computed separately for each time instant; replacing $\mathcal{M}^{\epsilon}$ by $\widehat{\mathcal{M}}^{\epsilon}_{m}$ in the corresponding inner optimization, we obtain, for any $f^{\prime}\in\mathcal{H}_{\mathcal{X}}$,

$$\sup_{\mu_{k}\in\widehat{\mathcal{M}}^{\epsilon}_{m}(x_{k},a_{k})}\mathbb{E}_{X\sim\mu_{k}}[f^{\prime}(X)]=\sup_{\Psi(\mu_{k})\in\widehat{\mathcal{C}}_{k}}\langle f^{\prime},\Psi(\mu_{k})\rangle_{\mathcal{H}_{\mathcal{X}}}, \tag{19}$$

where the equality is a consequence of the reproducing property and

$$\widehat{\mathcal{C}}_{k}:=\{f\in\mathcal{H}_{\mathcal{X}}\mid\|f-\widehat{\psi}_{m}(x_{k},a_{k})\|_{\mathcal{H}_{\mathcal{X}}}\leq\epsilon\}\subseteq\mathcal{H}_{\mathcal{X}}.$$

Following [12], the support function of $\widehat{\mathcal{C}}_{k}$ can then be computed as

$$\sigma_{\widehat{\mathcal{C}}_{k}}(f^{\prime})=\sup_{f\in\widehat{\mathcal{C}}_{k}}\langle f^{\prime},f\rangle_{\mathcal{H}_{\mathcal{X}}}=\sup_{f\in\widehat{\mathcal{C}}_{k}}\langle f^{\prime},f-\widehat{\psi}_{m}(x_{k},a_{k})\rangle_{\mathcal{H}_{\mathcal{X}}}+\langle f^{\prime},\widehat{\psi}_{m}(x_{k},a_{k})\rangle_{\mathcal{H}_{\mathcal{X}}}=\epsilon\|f^{\prime}\|_{\mathcal{H}_{\mathcal{X}}}+\sum_{i=1}^{m}\beta_{i}(x_{k},a_{k})f^{\prime}(\widehat{x}_{(i)}^{+}),$$

where $\beta_{i}(x_{k},a_{k})$ denote the coefficients of the empirical estimate of the conditional mean embedding given in (8), evaluated at the state-action pair at time instant $k\in\{0,\ldots,L-1\}$, and where $\widehat{x}_{(i)}^{+}$ represents the sample collected along the observed trajectories of the dynamics.⁴

⁴ That is, for each $i\in\{1,\ldots,m\}$, $\widehat{x}_{(i)}^{+}$ is a sample from $T(\cdot\mid\widehat{x}_{(i)},\widehat{a}_{(i)})$. Within the machine learning literature, the transition kernel $T$ is usually referred to as a “generative model”.

To compute the norm $\|f^{\prime}\|_{\mathcal{H}_{\mathcal{X}}}$, we solve, similar to Theorem 1, a regression problem given by

$$\operatorname*{minimize}_{\alpha_{i},\,i=1,\ldots,m}\quad\left\|\sum_{i=1}^{m}\alpha_{i}k(\widehat{x}_{(i)}^{\prime},\cdot)-f^{\prime}\right\|_{\mathcal{H}_{\mathcal{X}}}+\lambda\|\alpha\|_{2}^{2}, \tag{20}$$

where $\{\widehat{x}_{(1)}^{\prime},\ldots,\widehat{x}_{(m)}^{\prime}\}$ are arbitrary points in the domain of the function $f^{\prime}$. The solution of this regression problem yields $\|f^{\prime}\|_{\mathcal{H}_{\mathcal{X}}}=\sqrt{\alpha^{\top}K_{\mathcal{X}}^{\prime}\alpha}$, where

$$\alpha=(K_{\mathcal{X}}^{\prime}+\lambda I)^{-1}\begin{bmatrix}f^{\prime}(\widehat{x}_{(1)}^{\prime})&\ldots&f^{\prime}(\widehat{x}_{(m)}^{\prime})\end{bmatrix}^{\top}.$$

Here $K_{\mathcal{X}}^{\prime}$ is the $m\times m$ Gram matrix associated with the available points, namely, the positive semidefinite matrix whose $(i,j)$-th entry is given by $k_{\mathcal{X}}(\widehat{x}_{(i)}^{\prime},\widehat{x}_{(j)}^{\prime})$. Notice that the collection of points used to estimate $\|f^{\prime}\|_{\mathcal{H}_{\mathcal{X}}}$ need not coincide with the points used to estimate the conditional mean embedding in Theorem 1; we indicate this in our notation by adding a prime as a superscript. The same comment applies to the regularizer $\lambda$ appearing in (20) and in the expression for $\beta$ in Theorem 1.
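Both ingredients of the worst-case expectation, the RKHS norm estimate from (20) and the support function above, are simple linear-algebra operations. A minimal sketch (our own helper names rkhs_norm and worst_case_expectation; inputs follow the notation of (8) and (20)):

```python
import numpy as np

def rkhs_norm(f_vals, K, lam=1e-3):
    """Estimate ||f'||_H via the regression (20): alpha solves
    (K' + lam I) alpha = [f'(x'_(1)), ..., f'(x'_(m))]^T, and the norm is
    approximated by sqrt(alpha^T K' alpha)."""
    alpha = np.linalg.solve(K + lam * np.eye(len(f_vals)), f_vals)
    return np.sqrt(max(alpha @ K @ alpha, 0.0))

def worst_case_expectation(beta, f_plus, f_norm, eps):
    """Support function of C_hat_k evaluated at f':
    eps * ||f'||_H + sum_i beta_i(x_k, a_k) f'(x_hat_(i)^+),
    i.e. the value of the inner supremum (19) over the ambiguity set (18)."""
    return eps * f_norm + beta @ f_plus
```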

We now discuss how to solve for optimal control inputs using a value iteration approach. To this end, we introduce our second approximation of the distributionally robust dynamic program (13) by discretizing the action space. At a given state $x\in\mathcal{X}$, we discretize the input space $\mathcal{A}(x)$, denoting the resulting discrete set by $\{a^{(1)},\ldots,a^{(M)}\}$. To define the value function, we modify slightly the set of admissible dynamics $\Gamma_{L}$ in (11) by fixing the initial distribution to a given point $x\in\mathcal{X}$. In other words, we let⁵ $\mu_{0}=\delta_{x}$ for some $x\in\mathcal{X}$. Then, for a given function $f:\mathcal{X}\to\mathbb{R}$, we define recursively the collection of functions $v_{\ell}:\mathcal{X}\to\mathbb{R}$, where $\ell\in\{0,\ldots,L-2\}$, as $v_{L-1}=f$ and

$$v_{\ell}(x)=\min_{a^{(j)}:\,j=1,\ldots,M}\ c(x,a^{(j)})+\sup_{\mu\in\widehat{\mathcal{M}}^{\epsilon}_{m}(x,a^{(j)})}\int_{\mathcal{X}}v_{\ell+1}(\xi)\mu(d\xi). \tag{21}$$

⁵ The symbol $\delta_{x}$, for some $x\in\mathcal{X}$, represents the Dirac measure, i.e., a probability measure that assigns unit mass to the point $x\in\mathcal{X}$.

For a given $(x,a^{(j)})$, the inner supremum is an instance of (19) and can be computed using the techniques discussed in this section.
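To make the recursion concrete, the sketch below (our own illustration, not the authors' implementation) performs (21) on a uniform grid of scalar states, representing each $v_{\ell+1}$ by its grid values; the names grid, beta_fn, and terminal are assumptions, rkhs_norm is from the previous sketch, and the worst-case term is evaluated via the support function $\epsilon\|v_{\ell+1}\|+\sum_{i}\beta_{i}v_{\ell+1}(\widehat{x}_{(i)}^{+})$.

```python
import numpy as np

def robust_value_iteration(grid, actions, cost, beta_fn, x_plus, K, eps, L,
                           lam=1e-3, terminal=None):
    """Sketch of recursion (21) on an ascending state grid with a discretized
    action set. beta_fn(x, a) returns the coefficient vector of Theorem 1,
    x_plus holds the observed next-state samples, and K is their Gram matrix
    (used by rkhs_norm, defined in the previous sketch)."""
    v = np.zeros(len(grid)) if terminal is None else terminal(grid)
    for _ in range(L - 1):
        v_plus = np.interp(x_plus, grid, v)    # v_{l+1} at the sample points
        norm_v = rkhs_norm(v_plus, K, lam)     # ||v_{l+1}||_H via (20)
        v = np.array([min(cost(x, a) + eps * norm_v + beta_fn(x, a) @ v_plus
                          for a in actions)
                      for x in grid])
    return v                                   # approximation of v_0 on the grid
```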

III-A Distributionally robust approach to forward invariance or safe control

While the discussion thus far has focused on the general optimal control problem, this approach can also be leveraged to synthesize control inputs meeting safety specifications. Using the notation of the previous sections, let $L\in\mathbb{N}$ be the time horizon and $S\subset\mathbb{R}^{n}$ be a measurable safe set. For an admissible control policy $\pi\in\Pi_{L}$, admissible dynamics $\mu\in\Gamma_{L}$, and initial state $x$, the probability of the state trajectory being safe is given by (more details can be found in [1])

$$V_{S}(x;\pi,\mu)=\mathbb{P}^{\pi,\mu}_{x_{0}}\{x_{k}\in S,\ \text{for all }k\in\{0,\ldots,L-1\}\}, \tag{22}$$

where $(x_{0},x_{1},\ldots,x_{L-1})$ denotes the solution of (9) under the dynamics $\mu$ and policy $\pi$. Our goal is to solve the problem

$$V^{\star}_{S}(x)=\sup_{\pi\in\Pi_{L}}\inf_{\mu\in\Gamma_{L}}V_{S}(x;\pi,\mu), \tag{23}$$

using the mathematical framework proposed in this paper. In fact, one can show that a result analogous to Theorem 2 holds for (23), namely, there exists an optimal Markovian and deterministic policy. Hence, using approximations similar to those of the previous section, the function $V_{S}^{\star}$ in (23) can be approximated recursively as (see [1] for more details)

$$v_{L-1}(x):=\mathbf{1}_{S}(x),$$
$$v_{\ell}(x):=\max_{a^{(j)}:\,j=1,\ldots,M}\ \inf_{\mu\in\widehat{\mathcal{M}}^{\epsilon}_{m}(x,a^{(j)})}\mathbf{1}_{S}(x)\int_{\mathcal{X}}v_{\ell+1}(\xi)\mu(d\xi), \tag{24}$$

for $\ell\in\{0,\ldots,L-2\}$, where $\mathbf{1}_{S}$ is the indicator function of the safe set $S$. In the numerical example reported in the following section, we compute the safe control inputs by solving the inner problem in a manner identical to that discussed in the previous subsection.
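A sketch of the safety recursion (24), mirroring the value-iteration sketch above (again our own illustration, reusing rkhs_norm). Here we read the inner infimum, by the same Cauchy-Schwarz argument behind the support function, as $\sum_{i}\beta_{i}v_{\ell+1}(\widehat{x}_{(i)}^{+})-\epsilon\|v_{\ell+1}\|_{\mathcal{H}_{\mathcal{X}}}$; this sign flip is our reading of the worst case for a maximization problem, not a formula stated in the paper.

```python
import numpy as np

def safe_value_iteration(grid, actions, safe_lo, safe_hi, beta_fn, x_plus, K,
                         eps, L, lam=1e-3):
    """Sketch of recursion (24): v_{L-1} = 1_S, and at each step the inner
    infimum over the MMD ball is evaluated (assumed form) as
    sum_i beta_i v_{l+1}(x_hat_(i)^+) - eps * ||v_{l+1}||_H."""
    in_S = ((grid >= safe_lo) & (grid <= safe_hi)).astype(float)
    v = in_S.copy()                            # v_{L-1} = indicator of S
    for _ in range(L - 1):
        v_plus = np.interp(x_plus, grid, v)
        norm_v = rkhs_norm(v_plus, K, lam)
        v = in_S * np.array([max(beta_fn(x, a) @ v_plus - eps * norm_v
                                 for a in actions)
                             for x in grid])
        v = np.clip(v, 0.0, 1.0)               # safety probabilities in [0, 1]
    return v
```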

IV Numerical examples

Inspired by the papers [1, 23], we apply our methods to study the safety probability of a thermostatically controlled load with dynamics

$$x_{k+1}=\alpha x_{k}+(1-\alpha)(\theta-\eta RPa_{k})+\omega_{k}, \tag{25}$$

where the state $x_{k}\in\mathbb{R}$ is the temperature, $a_{k}\in\{0,1\}$ is a binary control input representing whether the load is on or off, and $\omega_{k}$ is a stochastic disturbance taking values in the uncertainty space $(\Omega,\mathcal{F},\mathbb{P})$. The parameters of (25) are given by $\alpha=\exp(-h/CR)$, where $R=2\,^{\circ}\mathrm{C}/\mathrm{kW}$, $C=2\,\mathrm{kWh}/^{\circ}\mathrm{C}$, $\theta=32\,^{\circ}\mathrm{C}$, $h=5/60\,\mathrm{hour}$, $P=14\,\mathrm{kW}$, and $\eta=0.7$. Our goal is to keep the temperature within the range $\mathcal{S}=[19\,^{\circ}\mathrm{C},22\,^{\circ}\mathrm{C}]$ for $90$ minutes.

Our goal is to compute a control policy based purely on available sampled trajectories of the model (25), without solving an optimization problem at each iteration. To this end, let $\{\widehat{x}_{(i)},\widehat{a}_{(i)},\widehat{x}_{(i)}^{+}\}_{i=1}^{m}$ be a collection of observed transitions from the model, where the pairs $\{\widehat{x}_{(i)},\widehat{a}_{(i)}\}$ are randomly chosen points in the set $[19,22]\times\{0,1\}$ and $\widehat{x}_{(i)}^{+}$ represents the observed transition from the corresponding state-input pair. We then solve the dynamic programming recursion given in (24) by partitioning the state space uniformly from $18\,^{\circ}\mathrm{C}$ to $23\,^{\circ}\mathrm{C}$ with 35 points, and using $7000$ data points to estimate the conditional kernel mean embedding map (see Theorem 1) and to compute the norm of the value function, as shown in (20). We choose $\lambda=200$ as the regularization parameter, and use the kernel functions $k_{\mathcal{X}}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ and $k_{\mathcal{Y}}:(\mathcal{X}\times\mathcal{U})\times(\mathcal{X}\times\mathcal{U})\to\mathbb{R}$ given by

$$k_{\mathcal{X}}(x,x^{\prime})=e^{-\gamma|x-x^{\prime}|^{2}},$$
$$k_{\mathcal{Y}}((x,a),(x^{\prime},a^{\prime}))=e^{-\gamma|x-x^{\prime}|^{2}}+k_{1}(a,a^{\prime}), \tag{26}$$

with $\gamma=100$, and where $k_{1}(a,a^{\prime})=1+aa^{\prime}+aa^{\prime}\min(a,a^{\prime})-\frac{a+a^{\prime}}{2}\min(a,a^{\prime})^{2}+\frac{1}{3}\min(a,a^{\prime})^{3}$ for the numerical examples. The choice of the kernel $k_{1}$ yielded better results for this problem when compared with a Gaussian kernel. Figure 1 shows the value function obtained for different values of the radius $\epsilon$; we notice a decrease in the returned value function as the size of the ambiguity set increases. The $y$-axis represents the safety probability and the $x$-axis the temperature; notice that the value function is zero outside the safe set $[19\,^{\circ}\mathrm{C},22\,^{\circ}\mathrm{C}]$. We also compare the returned value function with the one obtained using the method proposed in [23] (we refer the reader to that paper for the definition of the parameters $c$ and $b$ shown in the legend).
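For concreteness, the kernels in (26) translate directly into code; a minimal sketch (our own helper names) for scalar temperature states and binary inputs:

```python
import numpy as np

GAMMA = 100.0

def k_X(x, x_prime):
    """State kernel from (26): Gaussian with gamma = 100."""
    return np.exp(-GAMMA * (x - x_prime) ** 2)

def k_1(a, a_prime):
    """Input kernel k_1 from (26) for the binary input a in {0, 1}."""
    m = min(a, a_prime)
    return (1 + a * a_prime + a * a_prime * m
            - (a + a_prime) / 2 * m ** 2 + m ** 3 / 3)

def k_Y(x, a, x_prime, a_prime):
    """State-input kernel from (26): sum of the Gaussian part and k_1."""
    return k_X(x, x_prime) + k_1(a, a_prime)
```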

Figure 1: Solution of the dynamic programming recursion given in (24) for the kernel ambiguity sets with different values of the radius (blue lines), with kernel parameter $\gamma=100$. We have used $7000$ state-action pairs as samples to estimate the conditional mean embedding and the norm of vectors in the RKHS. The regularization parameter $\lambda$ is equal to $200$. The solid red line is the value function using the method proposed in [23] for $b=0.1$ and $c=1.5$ (as defined in [23]).

V Conclusion

We analyzed the problem of distributionally robust (safe) control of stochastic systems where the ambiguity set is defined as the set of distributions whose kernel mean embedding is within a certain distance from the empirical estimate of the conditional kernel mean embedding derived from data. We showed that there exists a non-randomized Markovian policy that is optimal and discussed how to compute the value function by leveraging strong duality associated with kernel DRO problems. Numerical results illustrate the performance of the proposed formulations and the impact of the radius of the ambiguity set. There are several possible directions for future research, including deriving efficient algorithms to perform value iteration without resorting to discretization, representing multistage state evolution using composition of conditional mean embedding operators, and performing a thorough empirical investigation on the impact of dataset size on the performance and computational complexity of the problem.

References

  • [1] A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, 2008.
  • [2] S. Summers and J. Lygeros, “Verification of discrete time stochastic hybrid systems: A stochastic reach-avoid decision problem,” Automatica, vol. 46, no. 12, pp. 1951–1961, 2010.
  • [3] C. De Persis and P. Tesi, “Formulas for data-driven control: stabilization, optimality, and robustness,” IEEE Transactions on Automatic Control, vol. 65, no. 3, pp. 909–924, 2020.
  • [4] J. Berberich, J. Köhler, M. A. Müller, and F. Allgöwer, “Data-driven model predictive control with stability and robustness guarantees,” IEEE Transactions on Automatic Control, vol. 66, no. 4, pp. 1702–1717, 2021.
  • [5] A. Shapiro, “Distributionally robust stochastic programming,” SIAM Journal on Optimization, vol. 27, no. 4, pp. 2258–2275, 2017.
  • [6] P. Mohajerin Esfahani and D. Kuhn, “Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations,” Mathematical Programming, vol. 171, no. 1, pp. 115–166, 2018.
  • [7] A. Hota, A. Cherukuri, and J. Lygeros, “Data-driven chance constrained optimization under Wasserstein ambiguity sets,” in 2019 American Control Conference (ACC), IEEE, 2019.
  • [8] L. Aolaritei, N. Lanzetti, H. Chen, and F. Dörfler, “Distributional uncertainty propagation via optimal transport,” arXiv, 2023.
  • [9] J. González-Trejo, O. Hernández-Lerma, and L. Hoyos-Reyes, “Minimax control of discrete-time stochastic systems,” SIAM Journal on Control and Optimization, vol. 41, no. 5, pp. 1626–1659, 2002.
  • [10] J. Ding, M. Kamgarpour, S. Summers, A. Abate, J. Lygeros, and C. Tomlin, “A stochastic games framework for verification and control of discrete time stochastic hybrid systems,” Automatica, vol. 49, no. 9, pp. 2665–2674, 2013.
  • [11] K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf, “Kernel mean embedding of distributions: A review and beyond,” Foundations and Trends® in Machine Learning, vol. 10, no. 1-2, pp. 1–141, 2017.
  • [12] J. Zhu, W. Jitkrittum, M. Diehl, and B. Schölkopf, “Kernel distributionally robust optimization: Generalized duality theorem and stochastic approximation,” in International Conference on Artificial Intelligence and Statistics, pp. 280–288, PMLR, 2021.
  • [13] A. Smola, A. Gretton, L. Song, and B. Schölkopf, “A Hilbert space embedding for distributions,” in International Conference on Algorithmic Learning Theory, pp. 13–31, Springer, 2007.
  • [14] K. Fukumizu, L. Song, and A. Gretton, “Kernel Bayes’ rule: Bayesian inference with positive definite kernels,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 3753–3783, 2013.
  • [15] B. Boots, A. Gretton, and G. J. Gordon, “Hilbert space embeddings of predictive state representations,” in Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, pp. 92–101, 2013.
  • [16] A. Thorpe and M. Oishi, “Model-free stochastic reachability using kernel distribution embeddings,” IEEE Control Systems Letters, vol. 4, no. 2, pp. 512–517, 2019.
  • [17] A. Thorpe and M. Oishi, “SOCKS: A stochastic optimal control and reachability toolbox using kernel methods,” in 25th ACM International Conference on Hybrid Systems: Computation and Control, ACM, 2022.
  • [18] Y. Chen, J. Kim, and J. Anderson, “Distributionally robust decision making leveraging conditional distributions,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 5652–5659, IEEE, 2022.
  • [19] G. Pillonetto, M. Quang, and A. Chiuso, “A new kernel-based approach for nonlinear system identification,” IEEE Transactions on Automatic Control, vol. 56, no. 12, pp. 2825–2840, 2011.
  • [20] S. Grünewälder, G. Lever, L. Baldassarre, S. Patterson, A. Gretton, and M. Pontil, “Conditional mean embeddings as regressors,” in Proceedings of the 29th International Coference on International Conference on Machine Learning, pp. 1803–1810, 2012.
  • [21] A. Thorpe, K. Ortiz, and M. Oishi, “State-based confidence bounds for data-driven stochastic reachability using Hilbert space embeddings,” Automatica, vol. 138, p. 110146, 2022.
  • [22] I. Yang, “Wasserstein distributionally robust stochastic control: a data-driven approach,” IEEE Transactions on Automatic Control, vol. 66, no. 8, pp. 3863–3870, 2020.
  • [23] I. Yang, “A dynamic game approach to distributionally robust safety specifications for stochastic systems,” Automatica, vol. 94, pp. 94–101, 2018.
  • [24] M. Schuurmans and P. Patrinos, “A general framework for learning-based distributionally robust MPC of Markov jump systems,” IEEE Transactions on Automatic Control, vol. 68, no. 5, pp. 2950–2965, 2023.
  • [25] M. Fochesato and J. Lygeros, “Data-driven distributionally robust bounds for stochastic model predictive control,” in 2022 IEEE 61st Conference on Decision and Control (CDC), pp. 3611–3616, IEEE, 2022.
  • [26] F. Luo and S. Mehrotra, “Distributionally robust optimization with decision dependent ambiguity sets,” Optimization Letters, vol. 14, pp. 2565–2594, 2020.
  • [27] N. Noyan, G. Rudolf, and M. Lejeune, “Distributionally robust optimization under a decision-dependent ambiguity set with applications to machine scheduling and humanitarian logistics,” INFORMS Journal on Computing, vol. 34, no. 2, pp. 729–751, 2022.
  • [28] D. Salamon, Measure and Integration. European Mathematical Society, 2016.
  • [29] H. Brezis, Functional Analysis, Sobolev Spaces and Partial Differential Equations. Springer, 2011.
  • [30] D. Panchenko, Lecture Notes on Probability Theory. 2019.
  • [31] Y. Nemmour, H. Kremer, B. Schölkopf, and J. Zhu, “Maximum mean discrepancy distributionally robust nonlinear chance-constrained optimization with finite-sample guarantee,” in IEEE Conference on Decision and Control (CDC), pp. 5660–5667, 2022.
  • [32] C. Baker, “Joint measures and cross-covariance operators,” Transactions of the American Mathematical Society, vol. 186, pp. 273–289, 1973.
  • [33] D. Bertsekas and S. Shreve, Stochastic Optimal Control: The Discrete-time Case. Athena Scientific, 1978.
  • [34] K. Fukumizu, F. Bach, and M. Jordan, “Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces,” Journal of Machine Learning Research, 2004.
  • [35] L. Song, J. Huang, A. Smola, and K. Fukumizu, “Hilbert space embeddings of conditional distributions with applications to dynamical systems,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 961–968, 2009.
  • [36] S. Grünewälder, G. Lever, L. Baldassarre, M. Pontil, and A. Gretton, “Modelling transition dynamics in MDPs with RKHS embeddings,” in International Conference on Machine Learning (ICML), pp. 1603–1610, 2012.
  • [37] C. Micchelli and M. Pontil, “On learning vector-valued functions,” Neural computation, vol. 17, no. 1, pp. 177–204, 2005.
  • [38] R. Durrett, Probability: Theory and Examples. Cambridge University Press, 2010.
  • [39] J. Munkres, Topology: A First Course. Prentice Hall, 1974.
  • [40] J.-P. Aubin and H. Frankowska, Set-valued analysis. Springer Science & Business Media, 2009.
  • [41] K. Wood, G. Bianchin, and E. Dall’Anese, “Online projected gradient descent for stochastic optimization with decision-dependent distributions,” IEEE Control Systems Letters, vol. 6, pp. 1646–1651, 2021.