
Localized active learning of Gaussian process state space models

Alexandre Capone (alexandre.capone@tum.de), Gerrit Noske (ga24coq@mytum.de), Jonas Umlauft (jonas.umlauft@tum.de), Thomas Beckers (t.beckers@tum.de), Armin Lederer (armin.lederer@tum.de), Sandra Hirche (hirche@tum.de)
Chair of Information-oriented Control, Department of Electrical and Computer Engineering, Technical University of Munich, D-80333 Munich, Germany
Abstract

While most dynamic system exploration techniques aim to achieve a globally accurate model, this is generally unsuited for systems with unbounded state spaces. Furthermore, many applications do not require a globally accurate model, e.g., local stabilization tasks. In this paper, we propose an active learning strategy for Gaussian process state space models that aims to obtain an accurate model on a bounded subset of the state-action space. Our approach aims to maximize the mutual information of the exploration trajectories with respect to a discretization of the region of interest. By employing model predictive control, the proposed technique integrates information collected during exploration and adaptively improves its exploration strategy. To enable computational tractability, we decouple the choice of most informative data points from the model predictive control optimization step. This yields two optimization problems that can be solved in parallel. We apply the proposed method to explore the state space of various dynamical systems and compare our approach to a commonly used entropy-based exploration strategy. In all experiments, our method yields a better model within the region of interest than the entropy-based method.

keywords:
exploration, Bayesian inference, data-driven control, model predictive control

1 Introduction

Autonomous systems often need to operate in complex environments, of which a model is difficult or even impossible to derive from first principles. Learning-based techniques have become a promising paradigm to address these issues (Pillonetto et al., 2014). In particular, Gaussian processes (GPs) have been increasingly employed for system identification and control (Umlauft et al., 2018; Capone and Hirche, 2019; Berkenkamp and Schoellig, 2015; Deisenroth et al., 2015). GPs possess very good generalization properties (Rasmussen and Williams, 2006), which can be leveraged to obtain data-efficient learning-based approaches (Deisenroth et al., 2015; Kamthe and Deisenroth, 2018). By employing a Bayesian framework, GPs provide an automatic trade-off between model smoothness and data fitness. Moreover, GPs provide an explicit estimate of the model uncertainty that is used to derive probabilistic bounds in control settings (Capone and Hirche, 2019; Beckers et al., 2019; Umlauft and Hirche, 2020).

A performance-determining factor of data-driven techniques is the quality of the available data. In settings where data is insufficient to achieve accurate predictions, new data needs to be gathered via exploration (Umlauft and Hirche, 2020). In classical reinforcement learning settings, exploration is often enforced by randomly selecting a control action with a predetermined probability that tends to zero over time (Dayan and Sejnowski, 1996). However, this is generally inefficient, as regions of low uncertainty are potentially revisited in multiple iterations. These issues have been addressed by techniques that choose the most informative exploration trajectories (Alpcan and Shames, 2015; Ay et al., 2008; Burgard et al., 2005; Schreiter et al., 2015). The goal of these methods is to obtain a globally accurate model. While this is reasonable for systems with a bounded state-action space, it is unsuited for systems with an unbounded one, particularly if a non-parametric model is used. This is because a potentially infinite number of points is required to achieve a globally accurate model. Furthermore, in practice a model often only needs to be accurate locally, e.g., for stabilization tasks.

In this paper, we propose a model predictive control-based exploration approach that steers the system towards the most informative points within a bounded subset of the state-action space. By modeling the system with a Gaussian process, we are able to quantify the information inherent in each data point. Our approach chooses actions by approximating the mutual information of the system trajectory with respect to a discretization of the region of interest. This is achieved by first selecting the single most informative data point within the region of interest, then steering the system towards that point using model predictive control. Through this approximation, the solution approach is rendered computationally tractable.

The remainder of this paper is structured as follows. Section 2 describes the general problem. Section 3 discusses how GPs are employed for modeling and exploration. Section 4 presents the LocAL algorithm, followed by numerical simulation examples in Section 5.

2 Problem Statement

We consider the problem of exploring the state and control space of a discrete-time nonlinear system with Markovian dynamics of the form

\[
\bm{x}_{t+1}=\bm{f}(\bm{x}_t,\bm{u}_t)+\bm{g}(\bm{x}_t,\bm{u}_t)+\bm{w}_t:=\bm{f}(\tilde{\bm{x}}_t)+\bm{g}(\tilde{\bm{x}}_t)+\bm{w}_t, \tag{1}
\]

where $t\in\mathbb{N}_0$, and $\bm{x}_t\in\mathcal{X}\subseteq\mathbb{R}^{d_x}$ and $\bm{u}_t\in\mathcal{U}\subseteq\mathbb{R}^{d_u}$ are the system's state vector and control vector at the $t$-th time step, respectively. The system is disturbed by multivariate Gaussian process noise $\bm{w}_t\sim\mathcal{N}(\bm{0},\bm{\Sigma}_w)$ with $\bm{\Sigma}_w=\mathrm{diag}(\sigma_{w,1}^2,\ldots,\sigma_{w,d_x}^2)$, $\sigma_{w,i}^2\in\mathbb{R}_{+,0}$. For simplicity of exposition, we employ the concatenation $\tilde{\bm{x}}_t:=(\bm{x}_t^{\mathrm{T}}\ \bm{u}_t^{\mathrm{T}})^{\mathrm{T}}\in\tilde{\mathcal{X}}$, where $\tilde{\mathcal{X}}:=\mathcal{X}\times\mathcal{U}$. The nonlinear function $\bm{f}:\tilde{\mathcal{X}}\rightarrow\mathcal{X}$ represents the known component of the system dynamics, e.g., a model obtained from first principles, while $\bm{g}:\tilde{\mathcal{X}}\rightarrow\mathcal{X}$ corresponds to the unknown component of the system dynamics.

We aim to obtain an approximation $\widehat{\bm{g}}(\cdot)$ of the function $\bm{g}(\cdot)$ that provides an accurate estimate of $\bm{g}(\cdot)$ on a predefined bounded subset $\tilde{\mathcal{X}}_B\subset\tilde{\mathcal{X}}$ of the augmented state space. This is often required in practice, e.g., for local stabilization tasks.

3 Gaussian Processes

In order to faithfully capture the stochastic behavior of (1), we model the system as a Gaussian process (GP), where we employ measurements of the augmented state vector $\tilde{\bm{x}}_t$ as training inputs and the differences $\bm{x}_{t+1}-\bm{f}(\tilde{\bm{x}}_t)=\bm{g}(\tilde{\bm{x}}_t)+\bm{w}_t$ as training targets.

A GP is a collection of dependent random variables, any finite subset of which is jointly normally distributed (Rasmussen and Williams, 2006). It is specified by a mean function $m:\tilde{\mathcal{X}}\rightarrow\mathbb{R}$ and a positive definite covariance function $k:\tilde{\mathcal{X}}\times\tilde{\mathcal{X}}\rightarrow\mathbb{R}$, also known as kernel. In this paper, we set $m\equiv 0$ without loss of generality, as all prior knowledge is already encoded in $\bm{f}(\cdot)$. The kernel $k(\cdot,\cdot)$ is a similarity measure for evaluations of $\bm{g}(\cdot)$ and encodes function properties such as smoothness and periodicity.

In the case where the state is a scalar, i.e., $d_x=1$, given $n$ training input samples $\tilde{\bm{X}}=\{\tilde{\bm{x}}_1,\ldots,\tilde{\bm{x}}_n\}\subset\tilde{\mathcal{X}}$ and training targets $\bm{y}_{\tilde{\bm{X}}}=(g(\tilde{\bm{x}}_1)+w_1\ \ldots\ g(\tilde{\bm{x}}_n)+w_n)^{\mathrm{T}}$, the posterior mean and variance of the GP yield a one-step transition model. Starting at a point $\tilde{\bm{x}}_t$, the difference between the subsequent state and the known component is normally distributed, i.e.,

\[
\bm{x}_{t+1}-\bm{f}(\tilde{\bm{x}}_t)\sim\mathcal{N}\left(\mu_n(\tilde{\bm{x}}_t),\sigma_n^2(\tilde{\bm{x}}_t)\right), \tag{2}
\]

with mean and variance given by

\[
\mu_n(\tilde{\bm{x}}_t)\coloneqq\bm{k}^{\mathrm{T}}(\tilde{\bm{x}}_t)\left(\bm{K}+\sigma_w^2\bm{I}\right)^{-1}\bm{y}_{\tilde{\bm{X}}},\qquad
\sigma_n^2(\tilde{\bm{x}}_t)\coloneqq k(\tilde{\bm{x}}_t,\tilde{\bm{x}}_t)-\bm{k}^{\mathrm{T}}(\tilde{\bm{x}}_t)\left(\bm{K}+\sigma_w^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_t),
\]

respectively, where $\bm{k}(\cdot)=(k(\tilde{\bm{x}}_1,\cdot)\ \ldots\ k(\tilde{\bm{x}}_n,\cdot))^{\mathrm{T}}$, and the entries of the covariance matrix $\bm{K}$ are computed as $K_{ij}=k(\tilde{\bm{x}}_i,\tilde{\bm{x}}_j)$, $i,j=1,\ldots,n$.
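To make the expressions above concrete, the following is a minimal NumPy sketch of the posterior mean and variance. It is not part of the original paper; the squared-exponential kernel, its hyperparameter values, and all function names are illustrative assumptions.

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between two sets of augmented states (rows)."""
    diff = (A[:, None, :] - B[None, :, :]) / lengthscale
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def gp_posterior(x_query, X_train, y_train, sigma_w=0.1):
    """Posterior mean mu_n and variance sigma_n^2 at the query points x_query."""
    # y_train may be (n,) or (n, d_x); a shared kernel across output dimensions is a simplifying assumption.
    K = se_kernel(X_train, X_train)                     # K_ij = k(x_i, x_j)
    k_star = se_kernel(X_train, x_query)                # k(x_query), one column per query point
    A = K + sigma_w**2 * np.eye(len(X_train))
    mu = k_star.T @ np.linalg.solve(A, y_train)
    var = se_kernel(x_query, x_query) - k_star.T @ np.linalg.solve(A, k_star)
    return mu, np.diag(var)
```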

In the case where the state is multidimensional, we model each dimension of the state transition function using a separate GP. This corresponds to the assumption that the entries of the state transition function are conditionally independent. For simplicity of exposition, unless stated otherwise, we henceforth assume $d_x=1$. However, the methods presented in this paper extend straightforwardly to the multivariate case.

3.1 Performing multi-step ahead predictions

The GP model presented in the previous section serves as a one-step predictor given a known test input $\tilde{\bm{x}}_t$. However, if only a distribution $p(\tilde{\bm{x}}_t)$ is available, the distribution of the successor state generally cannot be computed analytically. Hence, the distributions of future states cannot be computed exactly, but only approximated, e.g., using Monte Carlo methods (Candela et al., 2003). Alternatively, approximate techniques exist that propagate the GP uncertainty over multiple time steps, such as moment matching and GP linearization (Deisenroth et al., 2015). In this paper, we employ the GP mean to perform multi-step ahead predictions without propagating uncertainty, i.e., $x_{t+1}=f(\tilde{\bm{x}}_t)+\mu_n(\tilde{\bm{x}}_t)$, $t\in\mathbb{N}$. However, the proposed method is also applicable with models that propagate uncertainty, e.g., moment matching (Deisenroth et al., 2015).
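A hedged sketch of this mean-based multi-step prediction, reusing the `gp_posterior` helper from the previous snippet (names and defaults are again assumptions, not the paper's implementation):

```python
def rollout_mean(x0, inputs, f_prior, X_train, y_train, sigma_w=0.1):
    """Multi-step prediction x_{t+1} = f(x_t, u_t) + mu_n(x_t, u_t); uncertainty is not propagated."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    traj = [x]
    for u in inputs:
        x_aug = np.concatenate([x, np.atleast_1d(u)])[None, :]   # augmented state (x_t, u_t)
        mu, _ = gp_posterior(x_aug, X_train, y_train, sigma_w)
        x = np.atleast_1d(f_prior(x, u)) + np.ravel(mu)
        traj.append(x)
    return np.array(traj)
```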

3.2 Quantifying utility of data

In order to steer the system along informative trajectories, we need to quantify the utility of data points in the augmented state space $\tilde{\mathcal{X}}$. To this end, we consider the mutual information between observations $\bm{y}_{\tilde{\bm{X}}}$ at training inputs $\tilde{\bm{X}}$ and evaluations $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$ at reference points $\tilde{\bm{X}}_{\text{ref}}$. Here $\tilde{\bm{X}}_{\text{ref}}\subset\tilde{\mathcal{X}}_{\text{ref}}$ is a discretization of the bounded subset $\tilde{\mathcal{X}}_{\text{ref}}$. Formally, the mutual information between $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$ and $\bm{y}_{\tilde{\bm{X}}}$ is given by

\[
I(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=\int\limits_{\mathcal{X}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert}\times\mathcal{X}^{\lvert\tilde{\bm{X}}\rvert}}p\left(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}\right)\log\left(\frac{p\left(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}\right)}{p\left(\bm{y}_{\tilde{\bm{X}}}\right)p\left(\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}\right)}\right)d\bm{y}_{\tilde{\bm{X}}}\,d\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}, \tag{3}
\]

which can equivalently be written as $I(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=H(\bm{y}_{\tilde{\bm{X}}})-H(\bm{y}_{\tilde{\bm{X}}}\,|\,\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})$, where $H(\bm{y}_{\tilde{\bm{X}}})$ and $H(\bm{y}_{\tilde{\bm{X}}}\,|\,\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})$ respectively denote the differential entropy of $\bm{y}_{\tilde{\bm{X}}}$ and the conditional differential entropy of $\bm{y}_{\tilde{\bm{X}}}$ given $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$. In practice, computing (3) for a multi-step GP prediction is intractable. However, we can obtain the single most informative data point $\bm{\xi}^*\in\tilde{\mathcal{X}}$ with respect to $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$ by computing the unconstrained maximum of

\[
I(\bm{y}_{\bm{\xi}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=\frac{1}{2}\log\left(\frac{\left(k(\bm{\xi},\bm{\xi})+\sigma_w\right)\bigl\lvert\bm{K}_{\tilde{\bm{X}}_{\text{ref}}}+\sigma_w\bm{I}\bigr\rvert}{\bigl\lvert\bm{K}_{\tilde{\bm{X}}_{\text{ref}}\cup\bm{\xi}}+\sigma_w\bm{I}\bigr\rvert}\right), \tag{4}
\]

where $\lvert\cdot\rvert$ denotes the determinant of a square matrix. In settings with unconstrained decision spaces, sequentially computing a maximizer of (4) has been shown to yield a solution that achieves at least $63\%$ of the optimal value (Krause et al., 2008).
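A minimal sketch of how the objective in (4) can be evaluated for a candidate point; the kernel, the noise term, and the function names are assumptions carried over from the earlier snippets.

```python
def info_objective(xi, X_ref, noise=1e-2):
    """Mutual information I(y_xi, y_Xref) as in Eq. (4) for a single candidate point xi."""
    xi = np.atleast_2d(xi)
    K_ref = se_kernel(X_ref, X_ref) + noise * np.eye(len(X_ref))
    X_joint = np.vstack([X_ref, xi])
    K_joint = se_kernel(X_joint, X_joint) + noise * np.eye(len(X_joint))
    k_xi = se_kernel(xi, xi)[0, 0] + noise
    _, logdet_ref = np.linalg.slogdet(K_ref)      # log|K_ref + noise*I|
    _, logdet_joint = np.linalg.slogdet(K_joint)  # log|K_{ref union xi} + noise*I|
    return 0.5 * (np.log(k_xi) + logdet_ref - logdet_joint)
```

The most informative point $\bm{\xi}^*$ can then be obtained by maximizing this objective over the augmented state space, e.g., with a multi-start gradient-based optimizer.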

4 The LocAL algorithm

Algorithm 1 LocAL (Localized Active Learning)
0: Require: $\bm{x}_0$, $\bm{f}(\cdot)$, $k(\cdot,\cdot)$
1: for $t=0,1,2,3,\ldots$ do
2:  Solve
\[
\bm{\xi}^*=\operatorname*{arg\,max}_{\bm{\xi}\in\tilde{\mathcal{X}}}\ \frac{1}{2}\log\left(\frac{\left(k(\bm{\xi},\bm{\xi})+\sigma_w\right)\bigl\lvert\bm{K}_{\tilde{\bm{X}}_{\text{ref}}}+\sigma_w\bm{I}\bigr\rvert}{\bigl\lvert\bm{K}_{\tilde{\bm{X}}_{\text{ref}}\cup\bm{\xi}}+\sigma_w\bm{I}\bigr\rvert}\right)
\]
3:  Solve
\[
\bm{U}^*=\operatorname*{arg\,min}_{\bm{U}\in\mathcal{U}^{N_H}}\ \sum_{\tau=1}^{N_H}\left(\bm{\xi}^*-\tilde{\bm{x}}_{t+\tau}\right)^{\mathrm{T}}\bm{Q}\left(\bm{\xi}^*-\tilde{\bm{x}}_{t+\tau}\right)
\quad\text{s.t.}\quad x_{t+\tau+1}=f(\tilde{\bm{x}}_{t+\tau})+\mu_t(\tilde{\bm{x}}_{t+\tau}),\ \bm{u}_{t+\tau}\in\mathcal{U},\quad\forall\tau\in\{0,\ldots,N_H-1\}
\]
4:  Apply $\bm{u}_t^*$ to system
5:  Measure $\bm{x}_{t+1}$, add the pair $(\tilde{\bm{x}}_t,\ \bm{x}_{t+1}-\bm{f}(\tilde{\bm{x}}_t))$ to the training data, and update the GP model $\mu_t(\cdot)$, $\sigma_t^2(\cdot)$
6: end for

The system dynamics (1) considerably limit the decision space at every time step $t$. Furthermore, after a data point is collected, both the GP model and the mutual information change. Hence, we employ a model predictive control (MPC)-based approach to steer the system towards areas of high information. Ideally, at every MPC step $t$, we would like to maximize (3) with respect to a sequence of $N_H$ inputs $\bm{U}\coloneqq\{\bm{u}_t,\ldots,\bm{u}_{t+N_H-1}\}$. However, this is generally intractable, limiting its applicability in an MPC setting. Hence, we consider an approximate solution approach that computes the most informative data point by maximizing (4) separately from the MPC optimization. This is achieved as follows. At every time step $t$, an unconstrained maximizer $\bm{\xi}^*$ of (4) is computed. Afterwards, the MPC computes the approximately optimal inputs $\bm{U}^*$ by solving a constrained optimization problem that penalizes the weighted distance to the reference point $\bm{\xi}^*$ using a positive semi-definite weight matrix $\bm{Q}$. The ensuing state is then measured, the GP model is updated, and the procedure is repeated. These steps yield the Localized Active Learning (LocAL) algorithm, presented in Algorithm 1.

The square weight matrix $\bm{Q}\in\mathbb{R}^{(d_x+d_u)\times(d_x+d_u)}$ should be chosen such that the MPC steers the system as close to $\bm{\xi}^*$ as possible, which is a system-dependent task. Alternatively, $\bm{Q}$ can be chosen such that the MPC cost function corresponds to a quadratic approximation of the mutual information, e.g., such that $I(\bm{y}_{\tilde{\bm{x}}_t},\bm{y}_{\bm{\xi}^*})\approx\sum_{t=1}^{N_H}\left(\bm{\xi}^*-\tilde{\bm{x}}_t\right)^{\mathrm{T}}\bm{Q}\left(\bm{\xi}^*-\tilde{\bm{x}}_t\right)$ holds for $\tilde{\bm{x}}_t\approx\bm{\xi}^*$.
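The following sketch outlines one iteration of the decoupled procedure described above, i.e., Steps 2 and 3 of Algorithm 1. It is a simplified illustration rather than the authors' implementation: a generic optimizer stands in for a dedicated MPC solver, input constraints are omitted, and the helper functions from the earlier snippets are reused.

```python
from scipy.optimize import minimize

def local_step(x, X_ref, X_train, y_train, f_prior, Q, N_H=10, u_dim=1):
    """One LocAL iteration: (i) find the most informative point xi*,
    (ii) compute an input sequence tracking it with the mean-based GP model."""
    # (i) maximize the mutual-information objective (4) over candidate points
    res_xi = minimize(lambda xi: -info_objective(xi, X_ref), x0=X_ref.mean(axis=0))
    xi_star = res_xi.x

    # (ii) MPC step: penalize the weighted distance of predicted augmented states to xi*
    def mpc_cost(U_flat):
        U = U_flat.reshape(N_H, u_dim)
        traj = rollout_mean(x, U, f_prior, X_train, y_train)
        cost = 0.0
        for x_t, u_t in zip(traj[:-1], U):
            e = xi_star - np.concatenate([x_t, u_t])
            cost += e @ Q @ e
        return cost

    res_u = minimize(mpc_cost, x0=np.zeros(N_H * u_dim))
    return xi_star, res_u.x.reshape(N_H, u_dim)
```

After applying the first input (or the first few, as done in Section 5), the new measurement is appended to the training data and the GP is updated before the next iteration.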

4.1 Sensitivity analysis

We now provide a sensitivity analysis of (4). Specifically, we show that the difference in mutual information between $\bm{\xi}^*$ and $\tilde{\bm{x}}_t$ is upper bounded. To this end, we require the following assumption.

Assumption 1

The kernel $k(\cdot,\cdot)$ is Lipschitz continuous and upper bounded by $k_{\text{max}}>0$.

This assumption holds for many commonly used kernels, e.g., the squared exponential kernel.

Using Assumption 1, we are able to bound the difference in mutual information between $\bm{\xi}^*$ and an arbitrary augmented state $\tilde{\bm{x}}_t$, as detailed in the following.

Theorem 4.1.

Let $\Delta\bm{\xi}^*_t:=\bm{\xi}^*-\tilde{\bm{x}}_t$ be the difference between the most informative data point $\bm{\xi}^*$ and the augmented state at time step $t$. Moreover, let $\Delta I^*_t\coloneqq I(\bm{y}_{\bm{\xi}^*},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})-I(\bm{y}_{\tilde{\bm{x}}_t},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})$ denote the corresponding difference in mutual information, and let Assumption 1 hold. Then there exists a constant $L\geq 0$ such that $\Delta I^*_t\leq\lvert\tilde{\bm{X}}_{\text{ref}}\rvert\log\left(1+C(\Delta\bm{\xi}^*_t)\right)/2$ holds, where $C(\Delta\bm{\xi}^*_t)\coloneqq\sigma_w^{-2}\min\{k_{\text{max}},\,L(t+\lvert\tilde{\bm{X}}_{\text{ref}}\rvert)^{-1/2}\lVert\Delta\bm{\xi}^*_t\rVert_2\}$.

To prove Theorem 4.1, we require the following preliminary results.

Lemma 4.10.

For any $i\in\{1,\ldots,\lvert\tilde{\bm{X}}_{\text{ref}}\rvert\}$, let $\tilde{\bm{X}}_{\text{ref},i}:=\{\tilde{\bm{x}}_{\text{ref},1},\ldots,\tilde{\bm{x}}_{\text{ref},i}\}\subseteq\tilde{\bm{X}}_{\text{ref}}$ denote the subset containing the first $i$ elements of $\tilde{\bm{X}}_{\text{ref}}$. Then

\[
\operatorname*{arg\,max}_{\tilde{\bm{X}}\subset\tilde{\mathcal{X}},\,\lvert\tilde{\bm{X}}\rvert=n}I(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=\operatorname*{arg\,max}_{\tilde{\bm{X}}\subset\tilde{\mathcal{X}},\,\lvert\tilde{\bm{X}}\rvert=n}\,-\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2, \tag{5}
\]

where $\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\coloneqq\sigma^2(\tilde{\bm{x}}_{\text{ref},i+1}\,|\,\bm{y}_{\tilde{\bm{X}}\cup\tilde{\bm{X}}_{\text{ref},i}})$ denotes the posterior GP variance evaluated at $\tilde{\bm{x}}_{\text{ref},i+1}$ conditioned on training data observed at $\tilde{\bm{X}}\cup\tilde{\bm{X}}_{\text{ref},i}$.

Proof 4.14.

The proof follows directly from the proof of (Srinivas et al., 2012, Lemma 5.3). Let $y_i\in\bm{y}_{\tilde{\bm{X}}}$ denote a single training target. For a GP,

\[
\begin{split}
H(\bm{y}_{\tilde{\bm{X}}})&=H(\bm{y}_{\tilde{\bm{X}}}\backslash y_n)+H(y_n\,|\,\bm{y}_{\tilde{\bm{X}}}\backslash y_n)=H(\bm{y}_{\tilde{\bm{X}}}\backslash y_n)+\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{n-1}(\tilde{\bm{x}}_n)\bigr)\right)/2\\
&=\sum_{i=0}^{n-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_i(\tilde{\bm{x}}_{i+1})\bigr)\right)/2
\end{split} \tag{6}
\]

holds. Hence,

\[
H(\bm{y}_{\tilde{\bm{X}}}\cup\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=H(\bm{y}_{\tilde{\bm{X}}})+\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2. \tag{7}
\]

Since the points $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$ are fixed, optimizing (3) with respect to $\bm{y}_{\tilde{\bm{X}}}$ is equivalent to optimizing $H(\bm{y}_{\tilde{\bm{X}}})-H(\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}\cup\bm{y}_{\tilde{\bm{X}}})=-\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\bigl(2\pi e(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1}))\bigr)/2$.

Hence, we employ the objective function

\[
J(\tilde{\bm{X}})=H(\bm{y}_{\tilde{\bm{X}}})-H(\bm{y}_{\tilde{\bm{X}}}\cup\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=-\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2, \tag{8}
\]

with summands

\[
J_i(\tilde{\bm{X}})=-\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2. \tag{9}
\]
Proof 4.15.

This follows directly from the property $I(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})\leq I(\bm{y}_{\tilde{\bm{X}}_{\text{ref}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})$ and Lemma 4.10.

Lemma 4.16.

Let $\bm{K}$ be the covariance matrix used to compute $\sigma^2_n(\cdot)$, and let $\partial\bm{K}/\partial\tilde{x}_{ij}$ denote the matrix of partial derivatives of $\bm{K}$ with respect to the $i$-th entry of the $j$-th data point $\tilde{\bm{x}}_j$. Then $\lambda_{\text{max}}\left(\partial\bm{K}/\partial\tilde{x}_{ij}\right)\leq 2L_k\sqrt{t}$ holds, where $L_k$ is the Lipschitz constant of the kernel from Assumption 1, $t$ is the number of data points, and $\lambda_{\text{max}}(\cdot)$ denotes the maximal eigenvalue.

Proof 4.21.

The matrix of partial derivatives has a single nonzero column and a single nonzero row. Its entries are

\[
\left[\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\right]_{\alpha\beta}=\begin{cases}\dfrac{\partial k(\tilde{\bm{x}}_{\alpha},\tilde{\bm{x}}_{\beta})}{\partial\tilde{x}_{i\alpha}},&\text{if }\alpha=j,\\[2mm]\dfrac{\partial k(\tilde{\bm{x}}_{\alpha},\tilde{\bm{x}}_{\beta})}{\partial\tilde{x}_{i\beta}},&\text{if }\beta=j,\\[2mm]0,&\text{otherwise},\end{cases} \tag{10}
\]

and its eigenvalues are given by

\[
\lambda\left(\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\right)=\frac{\dfrac{\partial k(\tilde{\bm{x}}_j,\tilde{\bm{x}}_j)}{\partial\tilde{x}_{ij}}\pm\sqrt{\left(\dfrac{\partial k(\tilde{\bm{x}}_j,\tilde{\bm{x}}_j)}{\partial\tilde{x}_{ij}}\right)^{2}+4\sum\limits_{\alpha=1,\alpha\neq j}^{t}\left(\dfrac{\partial k(\tilde{\bm{x}}_{\alpha},\tilde{\bm{x}}_j)}{\partial\tilde{x}_{ij}}\right)^{2}}}{2}\leq\frac{L_k\left(1+\sqrt{1+4(t-1)}\right)}{2}\leq 2L_k\sqrt{t}.
\]
Corollary 4.26.

Let Assumption 1 hold. Then, for a fixed set of $n$ data points, the posterior variance $\sigma^2_n(\tilde{\bm{x}}_{\text{ref}})$ is Lipschitz continuous with respect to the data, with Lipschitz constant $L_{\sigma_n}=2L_k\sqrt{k_{\text{max}}}/\sigma_{\text{on}}\bigl(1+\sqrt{nk_{\text{max}}}/\sigma_{\text{on}}\bigr)$, where $\sigma_{\text{on}}$ denotes the observation noise standard deviation.

Proof 4.27.

For an arbitrary $\tilde{\bm{x}}_{\text{ref}}\in\tilde{\bm{X}}_{\text{ref}}$, data point $\tilde{\bm{x}}_j$, and corresponding entry $\tilde{x}_{ij}$,

\[
\begin{split}
\frac{\partial\sigma^2_n(\tilde{\bm{x}}_{\text{ref}})}{\partial\tilde{x}_{ij}}=&-2\frac{\partial\bm{k}(\tilde{\bm{x}}_{\text{ref}})}{\partial\tilde{x}_{ij}}^{\mathrm{T}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\\
&-\bm{k}(\tilde{\bm{x}}_{\text{ref}})^{\mathrm{T}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\\
\leq&\ 2\lambda_{\text{max}}\!\left(\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\right)\Big\lVert\frac{\partial\bm{k}(\tilde{\bm{x}}_{\text{ref}})}{\partial\tilde{x}_{ij}}\Big\rVert_2\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2\\
&+\lambda^2_{\text{max}}\!\left(\left(\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\right)^{\frac{1}{2}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\right)\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2^2\\
\leq&\ \frac{2L_k}{\sigma_{\text{on}}}\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2+\frac{2L_k\sqrt{n}}{\sigma_{\text{on}}^2}\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2^2\\
\leq&\ \frac{2L_k\sqrt{k_{\text{max}}}}{\sigma_{\text{on}}}+\frac{2L_k\sqrt{n}\,k_{\text{max}}}{\sigma_{\text{on}}^2}.
\end{split}
\]

Here we employ the identities

\[
\bm{k}(\tilde{\bm{x}}_{\text{ref}})^{\mathrm{T}}\frac{\partial\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}}{\partial\tilde{x}_{ij}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})=\bm{k}(\tilde{\bm{x}}_{\text{ref}})^{\mathrm{T}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_{\text{ref}}) \tag{11}
\]

and

\[
\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2^2=\bm{k}(\tilde{\bm{x}}_{\text{ref}})^{\mathrm{T}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\leq k(\tilde{\bm{x}}_{\text{ref}},\tilde{\bm{x}}_{\text{ref}})\leq k_{\text{max}}, \tag{12}
\]

where the first inequality follows from the non-negativity of $\sigma^2_n(\cdot)$ and the second from Assumption 1.

We can now prove Theorem 4.1.

Proof of Theorem 4.1

Define $\sigma^2_{i+t,*}(\cdot)\coloneqq\sigma^2(\cdot\,|\,\bm{y}_{\tilde{\bm{X}}_t\cup\tilde{\bm{X}}_{\text{ref},i}\cup\bm{\xi}^*})$ and $\sigma^2_{i+t}(\cdot)\coloneqq\sigma^2(\cdot\,|\,\bm{y}_{\tilde{\bm{X}}_t\cup\tilde{\bm{X}}_{\text{ref},i}\cup\tilde{\bm{x}}_t})$. Moreover, assume without loss of generality that $\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})\geq\sigma^2_{i+t}(\tilde{\bm{x}}_{\text{ref},i+1})$ holds. Then, due to Assumption 1 and Lemma 4.10,

\[
\begin{split}
\Delta I^*&=-\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2+\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+t}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2\\
&=\underbrace{\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\frac{1}{2}\log\left(\frac{\sigma_w^2+\sigma^2_{i+t}(\tilde{\bm{x}}_{\text{ref},i+1})}{\sigma_w^2+\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})}\right)}_{\leq\,\lvert\tilde{\bm{X}}_{\text{ref}}\rvert\frac{1}{2}\log\left(1+\frac{k_{\text{max}}}{\sigma_{\text{on}}^2}\right)}
=\underbrace{\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\frac{1}{2}\log\left(1+\frac{\sigma^2_{i+t}(\tilde{\bm{x}}_{\text{ref},i+1})-\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})}{\sigma_w^2+\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})}\right)}_{\leq\,\lvert\tilde{\bm{X}}_{\text{ref}}\rvert\frac{1}{2}\log\left(1+L_{\sigma}\frac{\lVert\Delta\bm{\xi}^*_t\rVert_2}{\sigma^2_{\text{on}}}\right)}.
\end{split} \tag{13}
\]

In particular, Theorem 4.1 implies that the mutual information with respect to $\bm{\xi}^*$ can be approximated arbitrarily accurately by reducing the difference $\bm{\xi}^*-\tilde{\bm{x}}_t$. In many control applications, this can be achieved in spite of the model error, e.g., by increasing control gains (Capone and Hirche, 2019; Umlauft and Hirche, 2020). Hence, Theorem 4.1 can potentially be employed to guarantee a gradual improvement in model accuracy, or even to derive the worst-case number of iterations required to learn the system.

5 Numerical Experiments

In this section, we apply the proposed algorithm to four different dynamical systems. We begin with a toy example, with which we can easily illustrate the explored portions of the state space. Afterwards, we apply the proposed approach to a pendulum, a cart-pole, and a synthetic model that generalizes the mountain car problem. The exploration is repeated 50 times for each system using different starting points $\bm{x}_0$ sampled from a normal distribution. To quantify the performance of each approach, we compute the root mean square model error (RMSE) on 500 points sampled from a uniform distribution on the region of interest $\tilde{\mathcal{X}}_{\text{ref}}$.

We employ a squared-exponential kernel in all examples and train the hyperparameters online using gradient-based log-likelihood maximization (Rasmussen and Williams, 2006). We employ an MPC horizon of $N_H=10$ and choose the weight matrix for the MPC optimization step as

\[
\bm{Q}=\sum_{d=1}^{d_x}\sigma_{g_d}\,\mathrm{diag}\bigl(l^{-2}_{1,g_d},\ldots,l^{-2}_{d_x,g_d}\bigr),
\]

where $\sigma_{g_d}$ denotes the standard deviation of the GP kernel corresponding to the $d$-th output dimension, and $l_{1,g_d},\ldots,l_{d_x,g_d}$ denote the corresponding lengthscales. In order to ease the computational burden, we apply the first 7 inputs computed by the LocAL algorithm before computing a new solution.
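As an illustrative sketch of the weight-matrix construction above (the hyperparameter values in the usage comment are made up, and extending the lengthscale vectors over the full augmented-state dimension is an assumption):

```python
def mpc_weight_matrix(signal_stds, lengthscales):
    """Q = sum_d sigma_{g_d} * diag(l_{1,g_d}^{-2}, ..., l_{D,g_d}^{-2}),
    with one (sigma_{g_d}, lengthscale vector) pair per output GP."""
    D = len(lengthscales[0])
    Q = np.zeros((D, D))
    for sigma_gd, ls in zip(signal_stds, lengthscales):
        Q += sigma_gd * np.diag(np.asarray(ls, dtype=float) ** -2)
    return Q

# e.g., a pendulum-like system with d_x = 2 output GPs over a 3-dimensional augmented state:
# Q = mpc_weight_matrix(signal_stds=[1.3, 0.9],
#                       lengthscales=[[0.5, 1.0, 2.0], [0.7, 0.8, 1.5]])
```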

We additionally explore each system using a greedy entropy-based cost function, as suggested in Koller et al. (2018) and Schreiter et al. (2015), and compare the results. In all four cases, the LocAL algorithm yields a better model in the region of interest than the entropy-based algorithm.

5.1 Toy Problem

Consider the continuous-time nonlinear dynamical system

\[
\dot{x}=10\left(\sin(x)+\arctan(x)+u\right), \tag{14}
\]

with state space $\mathcal{X}=\mathbb{R}$ and input space $\mathcal{U}=[-5,5]$. We are interested in obtaining an accurate dynamical model within the region $\tilde{\mathcal{X}}_{\text{ref}}=\{[x\ u]^{\mathrm{T}}\in\tilde{\mathcal{X}}\ |\ x\in[-\pi,\pi],\ u\in[-1,1]\}$. To obtain a discrete-time system in the form of (1), we discretize (14) with a discretization step of $\Delta t=0.1$ and set the prior model to $f(x,u)=x$. The results are displayed in Figure 1.
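For reference, a minimal sketch of the discretized toy system and its prior model, which together define what the GP has to learn. The forward-Euler scheme is an assumption, as the text does not state the discretization method.

```python
import numpy as np

DT = 0.1  # discretization step from the text

def toy_dynamics(x, u):
    """Forward-Euler discretization of Eq. (14): x_{t+1} = x_t + DT * 10 * (sin x + arctan x + u)."""
    return x + DT * 10.0 * (np.sin(x) + np.arctan(x) + u)

def prior_model(x, u):
    """Known component f(x, u) = x; the GP learns the residual g plus noise."""
    return x

# GP training pairs: input (x_t, u_t), target x_{t+1} - prior_model(x_t, u_t).
```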

The LocAL algorithm yields a substantial improvement in model accuracy in every run. This is because the system stays close to the region of interest 𝒳~ref{{\tilde{\mathcal{X}}}_{\text{ref}}} during the whole simulation. By contrast, the greedy entropy-based method covers a considerably more extensive portion of the state space. This comes at the cost of a poorer model on 𝒳~ref{{\tilde{\mathcal{X}}}_{\text{ref}}}, as indicated by the respective RMSE.


Figure 1: Toy problem results. Collected data in the augmented state space $\mathcal{X}\times\mathcal{U}$ after 200 time steps (top). Median, lower and upper quartile of RMSE on the region of interest (bottom). The LocAL algorithm explores the region of interest more thoroughly than the entropy-based approach.

5.2 Surface exploration

We apply the LocAL algorithm to the dynamical system given by

\[
\dot{x}_1=3u_1+10\cos(5x_1)\cos(5x_2),\qquad\dot{x}_2=3u_2+10\sin(5x_1)\sin(5x_2). \tag{15}
\]

This setting can be seen as a surface exploration problem, i.e., an agent navigates a surface to learn its curvature. We aim to obtain an accurate model of the dynamics within the region $\tilde{\mathcal{X}}_{\text{ref}}=\{[\bm{x}^{\mathrm{T}}\ \bm{u}^{\mathrm{T}}]^{\mathrm{T}}\in\tilde{\mathcal{X}}\ |\ \bm{x}\in[-\pi/4,\pi/4]^2,\ \bm{u}\in[-1,1]^2\}$. To run the LocAL algorithm, we discretize (15) with a discretization step of $\Delta t$ and set the prior model to $\bm{f}(\tilde{\bm{x}})=\bm{x}+\Delta t\,\bm{u}$. The results are shown in Figure 2.

The LocAL algorithm manages to significantly improve its model after 400 time steps, while the entropy-based strategy does not yield any improvement. This is because every variable of the state space is unbounded, i.e., the state space can be explored for a potentially infinite amount of time without ever reaching the region of interest $\tilde{\mathcal{X}}_{\text{ref}}$.


Figure 2: Surface exploration results. Median, lower and upper quartile of RMSE on region of interest.

5.3 Pendulum

We now consider a two-dimensional pendulum, whose state $\bm{x}=[\vartheta,\dot{\vartheta}]$ is given by the angle $\vartheta$ and angular velocity $\dot{\vartheta}$. The input torque $u$ is constrained to the interval $\mathcal{U}=[-10,10]$. Our goal is to obtain a suitable model within $\tilde{\mathcal{X}}_{\text{ref}}=\{[\bm{x}^{\mathrm{T}}\ u]^{\mathrm{T}}\in\tilde{\mathcal{X}}\ |\ x_1\in[\pi/2,3\pi/2],\ x_2\in[-5,5],\ u\in[-3,3]\}$. Obtaining a precise model around this region is particularly useful for the commonly considered task of stabilizing the pendulum around the upward position $\vartheta=\dot{\vartheta}=0$. The results are depicted in Figure 3.


Figure 3: Pendulum results. Median, lower and upper quartile of RMSE on region of interest.

The RMSE indicates that the LocAL algorithm yields a similar model improvement in every run. The model obtained with the entropy-based strategy, by contrast, exhibits higher errors.

5.4 Cart-pole

We apply the LocAL algorithm to the cart-pole system (Barto et al., 1983). The state is given by $\bm{x}=[v,\vartheta,\dot{\vartheta}]$, where $v$ is the cart velocity, $\vartheta$ is the pendulum angle, and $\dot{\vartheta}$ is its angular velocity. Here we ignore the cart position, as it has no influence on the system dynamics. The region of interest is given by $\tilde{\mathcal{X}}_{\text{ref}}=\{[\bm{x}^{\mathrm{T}}\ \bm{u}^{\mathrm{T}}]^{\mathrm{T}}\in\tilde{\mathcal{X}}\ |\ x_1\in[-2,2],\ x_2\in[-\pi/4,\pi/4],\ u_1,u_2\in[-5,5]\}$. Similarly to the pendulum case, obtaining an accurate model on this region is useful for addressing the balancing task. The discretization step is set to $\Delta t=0.05\,\mathrm{s}$, and the prior model is $\bm{f}(\tilde{\bm{x}})=\bm{x}$. The results are shown in Figure 4.


Figure 4: Cart-pole results. Median, lower and upper quartile of RMSE on region of interest.

Similarly to the pendulum case, the RMSE of the model obtained with the LocAL algorithm exhibits a low standard deviation across runs compared to that of the entropy-based approach. This is because the region of interest is explored more thoroughly with our approach.

6 Conclusion

A technique for efficiently exploring bounded subsets of the state-action space of a system has been presented. The proposed technique aims to maximize the mutual information of the system trajectories with respect to a discretization of the region of interest. It employs Gaussian processes both to model the unknown system dynamics and to quantify the informativeness of potentially collected data points. In numerical simulations of four different dynamical systems, we have demonstrated that the proposed approach yields a better model after a limited number of time steps than a greedy entropy-based approach.

Acknowledgements

This work was supported by the European Research Council (ERC) Consolidator Grant "Safe data-driven control for human-centric systems (CO-MAN)" under grant agreement number 864686.

References

  • Alpcan and Shames (2015) T. Alpcan and I. Shames. An information-based learning approach to dual control. IEEE transactions on neural networks and learning systems, 26(11):2736–2748, 2015.
  • Ay et al. (2008) N. Ay, N. Bertschinger, R. Der, F. Güttler, and E. Olbrich. Predictive information and explorative behavior of autonomous robots. The European Physical Journal B, 63(3):329–339, 2008.
  • Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, 1983.
  • Beckers et al. (2019) T. Beckers, D. Kulić, and S. Hirche. Stable Gaussian process based tracking control of Euler-Lagrange systems. Automatica, (103):390–397, 2019.
  • Berkenkamp and Schoellig (2015) F. Berkenkamp and A. P. Schoellig. Safe and robust learning control with Gaussian processes. In Control Conference (ECC), 2015 European, pages 2496–2501. IEEE, 2015.
  • Burgard et al. (2005) W. Burgard, D. Fox, and S. Thrun. Probabilistic robotics. The MIT Press, 2005.
  • Candela et al. (2003) J. Q. Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., volume 2, pages II–701, April 2003.
  • Capone and Hirche (2019) A. Capone and S. Hirche. Backstepping for partially unknown nonlinear systems using Gaussian processes. IEEE Control Systems Letters, 3:416–421, 2019.
  • Dayan and Sejnowski (1996) P. Dayan and T. J. Sejnowski. Exploration bonuses and dual control. Machine Learning, 25(1):5–22, 1996.
  • Deisenroth et al. (2015) M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
  • Kamthe and Deisenroth (2018) S. Kamthe and M. P. Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive control. In AISTATS, volume 84, pages 1701–1710. Proceedings of Machine Learning (PMLR), 2018.
  • Koller et al. (2018) T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause. Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pages 6059–6066, 2018.
  • Krause et al. (2008) A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.
  • Pillonetto et al. (2014) G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: A survey. Automatica, 50(3):657–682, 2014.
  • Rasmussen and Williams (2006) C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
  • Schreiter et al. (2015) F. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, and M. Toussaint. Safe exploration for active learning with Gaussian processes. In Machine Learning and Knowledge Discovery in Databases, pages 133–149, Cham, 2015. Springer International Publishing.
  • Srinivas et al. (2012) N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
  • Umlauft and Hirche (2020) J. Umlauft and S. Hirche. Feedback linearization based on Gaussian processes with event-triggered online learning. IEEE Transactions on Automatic Control, 2020.
  • Umlauft et al. (2018) J. Umlauft, L. Pöhler, and S. Hirche. An uncertainty-based control Lyapunov approach for control-affine systems modeled by Gaussian process. IEEE Control Systems Letters, 2018.