
Localized active learning of Gaussian process state space models

Alexandre Capone (alexandre.capone@tum.de), Gerrit Noske (ga24coq@mytum.de), Jonas Umlauft (jonas.umlauft@tum.de), Thomas Beckers (t.beckers@tum.de), Armin Lederer (armin.lederer@tum.de), Sandra Hirche (hirche@tum.de)
Chair of Information-oriented Control, Department of Electrical and Computer Engineering, Technical University of Munich, D-80333 Munich, Germany
Abstract

While most dynamic system exploration techniques aim to achieve a globally accurate model, this is generally unsuited for systems with unbounded state spaces. Furthermore, many applications do not require a globally accurate model, e.g., local stabilization tasks. In this paper, we propose an active learning strategy for Gaussian process state space models that aims to obtain an accurate model on a bounded subset of the state-action space. Our approach aims to maximize the mutual information of the exploration trajectories with respect to a discretization of the region of interest. By employing model predictive control, the proposed technique integrates information collected during exploration and adaptively improves its exploration strategy. To enable computational tractability, we decouple the choice of most informative data points from the model predictive control optimization step. This yields two optimization problems that can be solved in parallel. We apply the proposed method to explore the state space of various dynamical systems and compare our approach to a commonly used entropy-based exploration strategy. In all experiments, our method yields a better model within the region of interest than the entropy-based method.

keywords:
exploration, Bayesian inference, data-driven control, model predictive control

1 Introduction

Autonomous systems often need to operate in complex environments, of which a model is difficult or even impossible to derive from first principles. Learning-based techniques have become a promising paradigm to address these issues (Pillonetto et al., 2014). In particular, Gaussian processes (GPs) have been increasingly employed for system identification and control (Umlauft et al., 2018; Capone and Hirche, 2019; Berkenkamp and Schoellig, 2015; Deisenroth et al., 2015). GPs possess very good generalization properties (Rasmussen and Williams, 2006), which can be leveraged to obtain data-efficient learning-based approaches (Deisenroth et al., 2015; Kamthe and Deisenroth, 2018). By employing a Bayesian framework, GPs provide an automatic trade-off between model smoothness and data fitness. Moreover, GPs provide an explicit estimate of the model uncertainty that is used to derive probabilistic bounds in control settings (Capone and Hirche, 2019; Beckers et al., 2019; Umlauft and Hirche, 2020).

A performance-determining factor of data-driven techniques is the quality of the available data. In settings where data is insufficient to achieve accurate predictions, new data needs to be gathered via exploration (Umlauft and Hirche, 2020). In classical reinforcement learning settings, exploration is often enforced by randomly selecting a control action with a predetermined probability that tends to zero over time (Dayan and Sejnowski, 1996). However, this is generally inefficient, as regions of low uncertainty are potentially revisited in multiple iterations. These issues have been addressed by techniques that choose the most informative exploration trajectories (Alpcan and Shames, 2015; Ay et al., 2008; Burgard et al., 2005; Schreiter et al., 2015). The goal of these methods is to obtain a globally accurate model. While this is reasonable for systems with a bounded state-action space, it is unsuited for systems with an unbounded one, particularly if a non-parametric model is used. This is because a potentially infinite number of points is required to achieve a globally accurate model. Furthermore, in practice a model often only needs to be accurate locally, e.g., for stabilization tasks.

In this paper, we propose a model predictive control-based exploration approach that steers the system towards the most informative points within a bounded subset of the state-action space. By modeling the system with a Gaussian process, we are able to quantify the information inherent in each data point. Our approach chooses actions by approximating the mutual information of the system trajectory with respect to a discretization of the region of interest. This is achieved by first selecting the single most informative data point within the region of interest, then steering the system towards that point using model predictive control. Through this approximation, the solution approach is rendered computationally tractable.

The remainder of this paper is structured as follows. Section 2 describes the general problem. Section 3 discusses how GPs are employed for modeling and exploration. Section 4 presents the LocAL algorithm, followed by numerical simulation examples in Section 5.

2 Problem Statement

We consider the problem of exploring the state and control space of a discrete-time nonlinear system with Markovian dynamics of the form

\[
\bm{x}_{t+1}=\bm{f}(\bm{x}_t,\bm{u}_t)+\bm{g}(\bm{x}_t,\bm{u}_t)+\bm{w}_t:=\bm{f}(\tilde{\bm{x}}_t)+\bm{g}(\tilde{\bm{x}}_t)+\bm{w}_t, \tag{1}
\]

where $t\in\mathbb{N}_0$, and $\bm{x}_t\in\mathcal{X}\subseteq\mathbb{R}^{d_x}$ and $\bm{u}_t\in\mathcal{U}\subseteq\mathbb{R}^{d_u}$ are the system's state vector and control vector at the $t$-th time step, respectively. The system is disturbed by multivariate Gaussian process noise $\bm{w}_t\sim\mathcal{N}(\bm{0},\bm{\Sigma}_w)$ with $\bm{\Sigma}_w=\mathrm{diag}(\sigma_{w,1}^2,\ldots,\sigma_{w,d_x}^2)$, $\sigma_{w,i}^2\in\mathbb{R}_{+,0}$. For simplicity of exposition, we employ the concatenation $\tilde{\bm{x}}_t:=(\bm{x}_t^{\mathrm{T}}\ \bm{u}_t^{\mathrm{T}})^{\mathrm{T}}\in\tilde{\mathcal{X}}$, where $\tilde{\mathcal{X}}:=\mathcal{X}\times\mathcal{U}$. The nonlinear function $\bm{f}:\tilde{\mathcal{X}}\rightarrow\mathcal{X}$ represents the known component of the system dynamics, e.g., a model obtained from first principles, while $\bm{g}:\tilde{\mathcal{X}}\rightarrow\mathcal{X}$ corresponds to the unknown component of the system dynamics.

We aim to obtain an approximation $\widehat{\bm{g}}(\cdot)$ of the function $\bm{g}(\cdot)$ that provides an accurate estimate of $\bm{g}(\cdot)$ on a predefined bounded subset $\tilde{\mathcal{X}}_B\subset\tilde{\mathcal{X}}$ of the augmented state space. This is often required in practice, e.g., for local stabilization tasks.

3 Gaussian Processes

In order to faithfully capture the stochastic behavior of (1), we model the system as a Gaussian process (GP), where we employ measurements of the augmented state vector $\tilde{\bm{x}}_t$ as training inputs and the differences $\bm{x}_{t+1}-\bm{f}(\tilde{\bm{x}}_t)=\bm{g}(\tilde{\bm{x}}_t)+\bm{w}_t$ as training targets.

A GP is a collection of dependent random variables, any finite subset of which is jointly normally distributed (Rasmussen and Williams, 2006). It is specified by a mean function $m:\tilde{\mathcal{X}}\rightarrow\mathbb{R}$ and a positive definite covariance function $k:\tilde{\mathcal{X}}\times\tilde{\mathcal{X}}\rightarrow\mathbb{R}$, also known as kernel. In this paper, we set $m\equiv 0$ without loss of generality, as all prior knowledge is already encoded in $\bm{f}(\cdot)$. The kernel $k(\cdot,\cdot)$ is a similarity measure for evaluations of $\bm{g}(\cdot)$ and encodes function properties such as smoothness and periodicity.

In the case where the state is a scalar, i.e., $d_x=1$, given $n$ training input samples $\tilde{\bm{X}}=\{\tilde{\bm{x}}_1,\ldots,\tilde{\bm{x}}_n\}\subset\tilde{\mathcal{X}}$ and training targets $\bm{y}_{\tilde{\bm{X}}}=(g(\tilde{\bm{x}}_1)+w_1\ \ldots\ g(\tilde{\bm{x}}_n)+w_n)^{\mathrm{T}}$, the posterior mean and variance of the GP yield a one-step transition model. Starting at a point $\tilde{\bm{x}}_t$, the difference between the subsequent state and the known component is normally distributed, i.e.,

\[
\bm{x}_{t+1}-\bm{f}(\tilde{\bm{x}}_t)\sim\mathcal{N}\left(\mu_n(\tilde{\bm{x}}_t),\sigma_n^2(\tilde{\bm{x}}_t)\right), \tag{2}
\]

with mean and variance given by

\[
\mu_n(\tilde{\bm{x}}_t)\coloneqq\bm{k}^{\mathrm{T}}(\tilde{\bm{x}}_t)\left(\bm{K}+\sigma_w^2\bm{I}\right)^{-1}\bm{y}_{\tilde{\bm{X}}},\qquad
\sigma_n^2(\tilde{\bm{x}}_t)\coloneqq k(\tilde{\bm{x}}_t,\tilde{\bm{x}}_t)-\bm{k}^{\mathrm{T}}(\tilde{\bm{x}}_t)\left(\bm{K}+\sigma_w^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_t),
\]

respectively, where $\bm{k}(\cdot)=(k(\tilde{\bm{x}}_1,\cdot)\ \ldots\ k(\tilde{\bm{x}}_n,\cdot))^{\mathrm{T}}$, and the entries of the covariance matrix $\bm{K}$ are computed as $K_{ij}=k(\tilde{\bm{x}}_i,\tilde{\bm{x}}_j)$, $i,j=1,\ldots,n$.
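To make the expressions above concrete, the following is a minimal NumPy sketch of the posterior mean and variance. It is not part of the original paper; the squared-exponential kernel, its hyperparameter values, and all function names are illustrative assumptions.

```python
import numpy as np

def se_kernel(A, B, sigma_f=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between two sets of augmented states (rows)."""
    diff = (A[:, None, :] - B[None, :, :]) / lengthscale
    return sigma_f**2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def gp_posterior(x_query, X_train, y_train, sigma_w=0.1):
    """Posterior mean mu_n and variance sigma_n^2 at the query points x_query."""
    # y_train may be (n,) or (n, d_x); a shared kernel across output dimensions is a simplifying assumption.
    K = se_kernel(X_train, X_train)                     # K_ij = k(x_i, x_j)
    k_star = se_kernel(X_train, x_query)                # k(x_query), one column per query point
    A = K + sigma_w**2 * np.eye(len(X_train))
    mu = k_star.T @ np.linalg.solve(A, y_train)
    var = se_kernel(x_query, x_query) - k_star.T @ np.linalg.solve(A, k_star)
    return mu, np.diag(var)
```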

In the case where the state is multidimensional, we model each dimension of the state transition function using a separate GP. This corresponds to the assumption that the entries of the state transition function are conditionally independent. For simplicity of exposition, unless stated otherwise, we henceforth assume $d_x=1$. However, the methods presented in this paper extend straightforwardly to the multivariate case.

3.1 Performing multi-step ahead predictions

The GP model presented in the previous section serves as a one-step predictor given a known test input $\tilde{\bm{x}}_t$. However, if only a distribution $p(\tilde{\bm{x}}_t)$ is available, the distribution of the successor state generally cannot be computed analytically. Hence, the distributions of future states cannot be computed exactly, but only approximated, e.g., using Monte Carlo methods (Candela et al., 2003). Alternatively, approximate techniques exist that propagate the GP uncertainty over multiple time steps, such as moment matching and GP linearization (Deisenroth et al., 2015). In this paper, we employ the GP mean to perform multi-step ahead predictions without propagating uncertainty, i.e., $x_{t+1}=f(\tilde{\bm{x}}_t)+\mu_n(\tilde{\bm{x}}_t)$, $t\in\mathbb{N}$. However, the proposed method is also applicable with models that propagate uncertainty, e.g., moment matching (Deisenroth et al., 2015).
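A hedged sketch of this mean-based multi-step prediction, reusing the `gp_posterior` helper from the previous snippet (names and defaults are again assumptions, not the paper's implementation):

```python
def rollout_mean(x0, inputs, f_prior, X_train, y_train, sigma_w=0.1):
    """Multi-step prediction x_{t+1} = f(x_t, u_t) + mu_n(x_t, u_t); uncertainty is not propagated."""
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    traj = [x]
    for u in inputs:
        x_aug = np.concatenate([x, np.atleast_1d(u)])[None, :]   # augmented state (x_t, u_t)
        mu, _ = gp_posterior(x_aug, X_train, y_train, sigma_w)
        x = np.atleast_1d(f_prior(x, u)) + np.ravel(mu)
        traj.append(x)
    return np.array(traj)
```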

3.2 Quantifying utility of data

In order to steer the system along informative trajectories, we need to quantify the utility of data points in the augmented state space $\tilde{\mathcal{X}}$. To this end, we consider the mutual information between observations $\bm{y}_{\tilde{\bm{X}}}$ at training inputs $\tilde{\bm{X}}$ and evaluations $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$ at reference points $\tilde{\bm{X}}_{\text{ref}}$. Here $\tilde{\bm{X}}_{\text{ref}}\subset\tilde{\mathcal{X}}_{\text{ref}}$ is a discretization of the bounded subset $\tilde{\mathcal{X}}_{\text{ref}}$. Formally, the mutual information between $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$ and $\bm{y}_{\tilde{\bm{X}}}$ is given by

\[
I(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=\int\limits_{\mathcal{X}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert}\times\mathcal{X}^{\lvert\tilde{\bm{X}}\rvert}}p\left(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}\right)\log\left(\frac{p\left(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}\right)}{p\left(\bm{y}_{\tilde{\bm{X}}}\right)p\left(\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}\right)}\right)d\bm{y}_{\tilde{\bm{X}}}\,d\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}, \tag{3}
\]

which can equivalently be written as $I(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=H(\bm{y}_{\tilde{\bm{X}}})-H(\bm{y}_{\tilde{\bm{X}}}\,|\,\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})$, where $H(\bm{y}_{\tilde{\bm{X}}})$ and $H(\bm{y}_{\tilde{\bm{X}}}\,|\,\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})$ respectively denote the differential entropy of $\bm{y}_{\tilde{\bm{X}}}$ and the conditional differential entropy of $\bm{y}_{\tilde{\bm{X}}}$ given $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$. In practice, computing (3) for a multi-step GP prediction is intractable. However, we can obtain the single most informative data point $\bm{\xi}^*\in\tilde{\mathcal{X}}$ with respect to $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$ by computing the unconstrained maximum of

\[
I(\bm{y}_{\bm{\xi}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=\frac{1}{2}\log\left(\frac{\left(k(\bm{\xi},\bm{\xi})+\sigma_w\right)\bigl\lvert\bm{K}_{\tilde{\bm{X}}_{\text{ref}}}+\sigma_w\bm{I}\bigr\rvert}{\bigl\lvert\bm{K}_{\tilde{\bm{X}}_{\text{ref}}\cup\bm{\xi}}+\sigma_w\bm{I}\bigr\rvert}\right), \tag{4}
\]

where $\lvert\cdot\rvert$ denotes the determinant of a square matrix. In settings with unconstrained decision spaces, sequentially computing a maximizer of (4) has been shown to yield a solution that achieves at least $63\%$ of the optimal value (Krause et al., 2008).
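A minimal sketch of how the objective in (4) can be evaluated for a candidate point; the kernel, the noise term, and the function names are assumptions carried over from the earlier snippets.

```python
def info_objective(xi, X_ref, noise=1e-2):
    """Mutual information I(y_xi, y_Xref) as in Eq. (4) for a single candidate point xi."""
    xi = np.atleast_2d(xi)
    K_ref = se_kernel(X_ref, X_ref) + noise * np.eye(len(X_ref))
    X_joint = np.vstack([X_ref, xi])
    K_joint = se_kernel(X_joint, X_joint) + noise * np.eye(len(X_joint))
    k_xi = se_kernel(xi, xi)[0, 0] + noise
    _, logdet_ref = np.linalg.slogdet(K_ref)      # log|K_ref + noise*I|
    _, logdet_joint = np.linalg.slogdet(K_joint)  # log|K_{ref union xi} + noise*I|
    return 0.5 * (np.log(k_xi) + logdet_ref - logdet_joint)
```

The most informative point $\bm{\xi}^*$ can then be obtained by maximizing this objective over the augmented state space, e.g., with a multi-start gradient-based optimizer.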

4 The LocAL algorithm

Algorithm 1 LocAL (Localized Active Learning)
0: Require: $\bm{x}_0$, $\bm{f}(\cdot)$, $k(\cdot,\cdot)$
1: for $t=0,1,2,3,\ldots$ do
2:  Solve
\[
\bm{\xi}^*=\operatorname*{arg\,max}_{\bm{\xi}\in\tilde{\mathcal{X}}}\ \frac{1}{2}\log\left(\frac{\left(k(\bm{\xi},\bm{\xi})+\sigma_w\right)\bigl\lvert\bm{K}_{\tilde{\bm{X}}_{\text{ref}}}+\sigma_w\bm{I}\bigr\rvert}{\bigl\lvert\bm{K}_{\tilde{\bm{X}}_{\text{ref}}\cup\bm{\xi}}+\sigma_w\bm{I}\bigr\rvert}\right)
\]
3:  Solve
\[
\bm{U}^*=\operatorname*{arg\,min}_{\bm{U}\in\mathcal{U}^{N_H}}\ \sum_{\tau=1}^{N_H}\left(\bm{\xi}^*-\tilde{\bm{x}}_{t+\tau}\right)^{\mathrm{T}}\bm{Q}\left(\bm{\xi}^*-\tilde{\bm{x}}_{t+\tau}\right)
\quad\text{s.t.}\quad x_{t+\tau+1}=f(\tilde{\bm{x}}_{t+\tau})+\mu_t(\tilde{\bm{x}}_{t+\tau}),\ \bm{u}_{t+\tau}\in\mathcal{U},\quad\forall\tau\in\{0,\ldots,N_H-1\}
\]
4:  Apply $\bm{u}_t^*$ to system
5:  Measure $\bm{x}_{t+1}$, add the pair $(\tilde{\bm{x}}_t,\ \bm{x}_{t+1}-\bm{f}(\tilde{\bm{x}}_t))$ to the training data, and update the GP model $\mu_t(\cdot)$, $\sigma_t^2(\cdot)$
6: end for

The system dynamics (1) considerably limit the decision space at every time step $t$. Furthermore, after a data point is collected, both the GP model and the mutual information change. Hence, we employ a model predictive control (MPC)-based approach to steer the system towards areas of high information. Ideally, at every MPC step $t$, we would like to maximize (3) with respect to a sequence of $N_H$ inputs $\bm{U}\coloneqq\{\bm{u}_t,\ldots,\bm{u}_{t+N_H-1}\}$. However, this is generally intractable, limiting its applicability in an MPC setting. Hence, we consider an approximate solution approach that computes the most informative data point by maximizing (4) separately from the MPC optimization. This is achieved as follows. At every time step $t$, an unconstrained maximizer $\bm{\xi}^*$ of (4) is computed. Afterwards, the MPC computes the approximately optimal inputs $\bm{U}^*$ by solving a constrained optimization problem that penalizes the weighted distance to the reference point $\bm{\xi}^*$ using a positive semi-definite weight matrix $\bm{Q}$. The ensuing state is then measured, the GP model is updated, and the procedure is repeated. These steps yield the Localized Active Learning (LocAL) algorithm, presented in Algorithm 1.

The square weight matrix $\bm{Q}\in\mathbb{R}^{(d_x+d_u)\times(d_x+d_u)}$ should be chosen such that the MPC steers the system as close to $\bm{\xi}^*$ as possible, which is a system-dependent task. Alternatively, $\bm{Q}$ can be chosen such that the MPC cost function corresponds to a quadratic approximation of the mutual information, e.g., such that $I(\bm{y}_{\tilde{\bm{x}}_t},\bm{y}_{\bm{\xi}^*})\approx\sum_{t=1}^{N_H}\left(\bm{\xi}^*-\tilde{\bm{x}}_t\right)^{\mathrm{T}}\bm{Q}\left(\bm{\xi}^*-\tilde{\bm{x}}_t\right)$ holds for $\tilde{\bm{x}}_t\approx\bm{\xi}^*$.
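The following sketch outlines one iteration of the decoupled procedure described above, i.e., Steps 2 and 3 of Algorithm 1. It is a simplified illustration rather than the authors' implementation: a generic optimizer stands in for a dedicated MPC solver, input constraints are omitted, and the helper functions from the earlier snippets are reused.

```python
from scipy.optimize import minimize

def local_step(x, X_ref, X_train, y_train, f_prior, Q, N_H=10, u_dim=1):
    """One LocAL iteration: (i) find the most informative point xi*,
    (ii) compute an input sequence tracking it with the mean-based GP model."""
    # (i) maximize the mutual-information objective (4) over candidate points
    res_xi = minimize(lambda xi: -info_objective(xi, X_ref), x0=X_ref.mean(axis=0))
    xi_star = res_xi.x

    # (ii) MPC step: penalize the weighted distance of predicted augmented states to xi*
    def mpc_cost(U_flat):
        U = U_flat.reshape(N_H, u_dim)
        traj = rollout_mean(x, U, f_prior, X_train, y_train)
        cost = 0.0
        for x_t, u_t in zip(traj[:-1], U):
            e = xi_star - np.concatenate([x_t, u_t])
            cost += e @ Q @ e
        return cost

    res_u = minimize(mpc_cost, x0=np.zeros(N_H * u_dim))
    return xi_star, res_u.x.reshape(N_H, u_dim)
```

After applying the first input (or the first few, as done in Section 5), the new measurement is appended to the training data and the GP is updated before the next iteration.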

4.1 Sensitivity analysis

We now provide a sensitivity analysis of (4). Specifically, we show that the difference in mutual information between $\bm{\xi}^*$ and $\tilde{\bm{x}}_t$ is upper bounded. To this end, we require the following assumption.

Assumption 1

The kernel $k(\cdot,\cdot)$ is Lipschitz continuous and upper bounded by $k_{\text{max}}>0$.

This assumption holds for many commonly used kernels, e.g., the squared exponential kernel.

Using Assumption 1, we are able to bound the difference in mutual information between $\bm{\xi}^*$ and an arbitrary augmented state $\tilde{\bm{x}}_t$, as detailed in the following.

Theorem 4.1.

Let $\Delta\bm{\xi}^*_t:=\bm{\xi}^*-\tilde{\bm{x}}_t$ be the difference between the most informative data point $\bm{\xi}^*$ and the augmented state at time step $t$. Moreover, let $\Delta I^*_t\coloneqq I(\bm{y}_{\bm{\xi}^*},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})-I(\bm{y}_{\tilde{\bm{x}}_t},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})$ denote the corresponding difference in mutual information, and let Assumption 1 hold. Then there exists a constant $L\geq 0$ such that $\Delta I^*_t\leq\lvert\tilde{\bm{X}}_{\text{ref}}\rvert\log\left(1+C(\Delta\bm{\xi}^*_t)\right)/2$ holds, where $C(\Delta\bm{\xi}^*_t)\coloneqq\sigma_w^{-2}\min\{k_{\text{max}},\,L(t+\lvert\tilde{\bm{X}}_{\text{ref}}\rvert)^{-1/2}\lVert\Delta\bm{\xi}^*_t\rVert_2\}$.

To prove Theorem 4.1, we require the following preliminary results.

Lemma 4.10.

For any $i\in\{1,\ldots,\lvert\tilde{\bm{X}}_{\text{ref}}\rvert\}$, let $\tilde{\bm{X}}_{\text{ref},i}:=\{\tilde{\bm{x}}_{\text{ref},1},\ldots,\tilde{\bm{x}}_{\text{ref},i}\}\subseteq\tilde{\bm{X}}_{\text{ref}}$ denote the subset containing the first $i$ elements of $\tilde{\bm{X}}_{\text{ref}}$. Then

\[
\operatorname*{arg\,max}_{\tilde{\bm{X}}\subset\tilde{\mathcal{X}},\,\lvert\tilde{\bm{X}}\rvert=n}I(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=\operatorname*{arg\,max}_{\tilde{\bm{X}}\subset\tilde{\mathcal{X}},\,\lvert\tilde{\bm{X}}\rvert=n}\,-\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2, \tag{5}
\]

where $\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\coloneqq\sigma^2(\tilde{\bm{x}}_{\text{ref},i+1}\,|\,\bm{y}_{\tilde{\bm{X}}\cup\tilde{\bm{X}}_{\text{ref},i}})$ denotes the posterior GP variance evaluated at $\tilde{\bm{x}}_{\text{ref},i+1}$ conditioned on training data observed at $\tilde{\bm{X}}\cup\tilde{\bm{X}}_{\text{ref},i}$.

Proof 4.14.

The proof follows directly from the proof of (Srinivas et al., 2012, Lemma 5.3). Let $y_i\in\bm{y}_{\tilde{\bm{X}}}$ denote a single training target. For a GP,

\[
\begin{split}
H(\bm{y}_{\tilde{\bm{X}}})&=H(\bm{y}_{\tilde{\bm{X}}}\backslash y_n)+H(y_n\,|\,\bm{y}_{\tilde{\bm{X}}}\backslash y_n)=H(\bm{y}_{\tilde{\bm{X}}}\backslash y_n)+\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{n-1}(\tilde{\bm{x}}_n)\bigr)\right)/2\\
&=\sum_{i=0}^{n-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_i(\tilde{\bm{x}}_{i+1})\bigr)\right)/2
\end{split} \tag{6}
\]

holds. Hence,

\[
H(\bm{y}_{\tilde{\bm{X}}}\cup\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=H(\bm{y}_{\tilde{\bm{X}}})+\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2. \tag{7}
\]

Since the points $\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}$ are fixed, optimizing (3) with respect to $\bm{y}_{\tilde{\bm{X}}}$ is equivalent to optimizing $H(\bm{y}_{\tilde{\bm{X}}})-H(\bm{y}_{\tilde{\bm{X}}_{\text{ref}}}\cup\bm{y}_{\tilde{\bm{X}}})=-\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\bigl(2\pi e(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1}))\bigr)/2$.

Hence, we employ the objective function

\[
J(\tilde{\bm{X}})=H(\bm{y}_{\tilde{\bm{X}}})-H(\bm{y}_{\tilde{\bm{X}}}\cup\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})=-\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2, \tag{8}
\]

with summands

\[
J_i(\tilde{\bm{X}})=-\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+\lvert\tilde{\bm{X}}\rvert}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2. \tag{9}
\]
Proof 4.15.

This follows directly from the property $I(\bm{y}_{\tilde{\bm{X}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})\leq I(\bm{y}_{\tilde{\bm{X}}_{\text{ref}}},\bm{y}_{\tilde{\bm{X}}_{\text{ref}}})$ and Lemma 4.10.

Lemma 4.16.

Let $\bm{K}$ be the covariance matrix used to compute $\sigma^2_n(\cdot)$, and let $\partial\bm{K}/\partial\tilde{x}_{ij}$ denote the matrix of partial derivatives of $\bm{K}$ with respect to the $i$-th entry of the $j$-th data point $\tilde{\bm{x}}_j$. Then $\lambda_{\text{max}}\left(\partial\bm{K}/\partial\tilde{x}_{ij}\right)\leq 2L_k\sqrt{t}$ holds, where $L_k$ is the Lipschitz constant of the kernel from Assumption 1, $t$ is the number of data points, and $\lambda_{\text{max}}(\cdot)$ denotes the maximal eigenvalue.

Proof 4.21.

The matrix of partial derivatives has a single nonzero column and a single nonzero row. Its entries are

\[
\left[\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\right]_{\alpha\beta}=\begin{cases}\dfrac{\partial k(\tilde{\bm{x}}_{\alpha},\tilde{\bm{x}}_{\beta})}{\partial\tilde{x}_{i\alpha}},&\text{if }\alpha=j,\\[2mm]\dfrac{\partial k(\tilde{\bm{x}}_{\alpha},\tilde{\bm{x}}_{\beta})}{\partial\tilde{x}_{i\beta}},&\text{if }\beta=j,\\[2mm]0,&\text{otherwise},\end{cases} \tag{10}
\]

and its eigenvalues are given by

\[
\lambda\left(\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\right)=\frac{\dfrac{\partial k(\tilde{\bm{x}}_j,\tilde{\bm{x}}_j)}{\partial\tilde{x}_{ij}}\pm\sqrt{\left(\dfrac{\partial k(\tilde{\bm{x}}_j,\tilde{\bm{x}}_j)}{\partial\tilde{x}_{ij}}\right)^{2}+4\sum\limits_{\alpha=1,\alpha\neq j}^{t}\left(\dfrac{\partial k(\tilde{\bm{x}}_{\alpha},\tilde{\bm{x}}_j)}{\partial\tilde{x}_{ij}}\right)^{2}}}{2}\leq\frac{L_k\left(1+\sqrt{1+4(t-1)}\right)}{2}\leq 2L_k\sqrt{t}.
\]
Corollary 4.26.

Let Assumption 1 hold. Then, for a fixed set of $n$ data points, the posterior variance $\sigma^2_n(\tilde{\bm{x}}_{\text{ref}})$ is Lipschitz continuous with respect to the data, with Lipschitz constant $L_{\sigma_n}=2L_k\sqrt{k_{\text{max}}}/\sigma_{\text{on}}\bigl(1+\sqrt{nk_{\text{max}}}/\sigma_{\text{on}}\bigr)$, where $\sigma_{\text{on}}$ denotes the observation noise standard deviation.

Proof 4.27.

For an arbitrary $\tilde{\bm{x}}_{\text{ref}}\in\tilde{\bm{X}}_{\text{ref}}$, data point $\tilde{\bm{x}}_j$, and corresponding entry $\tilde{x}_{ij}$,

\[
\begin{split}
\frac{\partial\sigma^2_n(\tilde{\bm{x}}_{\text{ref}})}{\partial\tilde{x}_{ij}}=&-2\frac{\partial\bm{k}(\tilde{\bm{x}}_{\text{ref}})}{\partial\tilde{x}_{ij}}^{\mathrm{T}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\\
&-\bm{k}(\tilde{\bm{x}}_{\text{ref}})^{\mathrm{T}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\\
\leq&\ 2\lambda_{\text{max}}\!\left(\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\right)\Big\lVert\frac{\partial\bm{k}(\tilde{\bm{x}}_{\text{ref}})}{\partial\tilde{x}_{ij}}\Big\rVert_2\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2\\
&+\lambda^2_{\text{max}}\!\left(\left(\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\right)^{\frac{1}{2}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\right)\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2^2\\
\leq&\ \frac{2L_k}{\sigma_{\text{on}}}\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2+\frac{2L_k\sqrt{n}}{\sigma_{\text{on}}^2}\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2^2\\
\leq&\ \frac{2L_k\sqrt{k_{\text{max}}}}{\sigma_{\text{on}}}+\frac{2L_k\sqrt{n}\,k_{\text{max}}}{\sigma_{\text{on}}^2}.
\end{split}
\]

Here we employ the identities

\[
\bm{k}(\tilde{\bm{x}}_{\text{ref}})^{\mathrm{T}}\frac{\partial\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}}{\partial\tilde{x}_{ij}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})=\bm{k}(\tilde{\bm{x}}_{\text{ref}})^{\mathrm{T}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\frac{\partial\bm{K}}{\partial\tilde{x}_{ij}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_{\text{ref}}) \tag{11}
\]

and

\[
\Big\lVert\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-\frac{1}{2}}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\Big\rVert_2^2=\bm{k}(\tilde{\bm{x}}_{\text{ref}})^{\mathrm{T}}\left(\bm{K}+\sigma_{\text{on}}^2\bm{I}\right)^{-1}\bm{k}(\tilde{\bm{x}}_{\text{ref}})\leq k(\tilde{\bm{x}}_{\text{ref}},\tilde{\bm{x}}_{\text{ref}})\leq k_{\text{max}}, \tag{12}
\]

where the first inequality follows from the non-negativity of $\sigma^2_n(\cdot)$ and the second from Assumption 1.

We can now prove Theorem 4.1.

Proof of Theorem 4.1

Define $\sigma^2_{i+t,*}(\cdot)\coloneqq\sigma^2(\cdot\,|\,\bm{y}_{\tilde{\bm{X}}_t\cup\tilde{\bm{X}}_{\text{ref},i}\cup\bm{\xi}^*})$ and $\sigma^2_{i+t}(\cdot)\coloneqq\sigma^2(\cdot\,|\,\bm{y}_{\tilde{\bm{X}}_t\cup\tilde{\bm{X}}_{\text{ref},i}\cup\tilde{\bm{x}}_t})$. Moreover, assume without loss of generality that $\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})\geq\sigma^2_{i+t}(\tilde{\bm{x}}_{\text{ref},i+1})$ holds. Then, due to Assumption 1 and Lemma 4.10,

\[
\begin{split}
\Delta I^*&=-\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2+\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\log\left(2\pi e\bigl(\sigma_w^2+\sigma^2_{i+t}(\tilde{\bm{x}}_{\text{ref},i+1})\bigr)\right)/2\\
&=\underbrace{\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\frac{1}{2}\log\left(\frac{\sigma_w^2+\sigma^2_{i+t}(\tilde{\bm{x}}_{\text{ref},i+1})}{\sigma_w^2+\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})}\right)}_{\leq\,\lvert\tilde{\bm{X}}_{\text{ref}}\rvert\frac{1}{2}\log\left(1+\frac{k_{\text{max}}}{\sigma_{\text{on}}^2}\right)}
=\underbrace{\sum_{i=0}^{\lvert\tilde{\bm{X}}_{\text{ref}}\rvert-1}\frac{1}{2}\log\left(1+\frac{\sigma^2_{i+t}(\tilde{\bm{x}}_{\text{ref},i+1})-\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})}{\sigma_w^2+\sigma^2_{i+t,*}(\tilde{\bm{x}}_{\text{ref},i+1})}\right)}_{\leq\,\lvert\tilde{\bm{X}}_{\text{ref}}\rvert\frac{1}{2}\log\left(1+L_{\sigma}\frac{\lVert\Delta\bm{\xi}^*_t\rVert_2}{\sigma^2_{\text{on}}}\right)}.
\end{split} \tag{13}
\]

In particular, Theorem 4.1 implies that the mutual information with respect to $\bm{\xi}^*$ can be approximated arbitrarily accurately by reducing the difference $\bm{\xi}^*-\tilde{\bm{x}}_t$. In many control applications, this can be achieved in spite of the model error, e.g., by increasing control gains (Capone and Hirche, 2019; Umlauft and Hirche, 2020). Hence, Theorem 4.1 can potentially be employed to guarantee a gradual improvement in model accuracy, or even to derive the worst-case number of iterations required to learn the system.

5 Numerical Experiments

In this section, we apply the proposed algorithm to four different dynamical systems. We begin with a toy example, with which we can easily illustrate the explored portions of the state space. Afterwards, we apply the proposed approach to a pendulum, a cart-pole, and a synthetic model that generalizes the mountain car problem. The exploration is repeated 50 times for each system using different starting points $\bm{x}_0$ sampled from a normal distribution. To quantify the performance of each approach, we compute the root mean square model error (RMSE) on 500 points sampled from a uniform distribution on the region of interest $\tilde{\mathcal{X}}_{\text{ref}}$.

We employ a squared-exponential kernel in all examples and train the hyperparameters online using gradient-based log-likelihood maximization (Rasmussen and Williams, 2006). We employ an MPC horizon of $N_H=10$ and choose the weight matrix for the MPC optimization step as

\[
\bm{Q}=\sum_{d=1}^{d_x}\sigma_{g_d}\,\mathrm{diag}\bigl(l^{-2}_{1,g_d},\ldots,l^{-2}_{d_x,g_d}\bigr),
\]

where $\sigma_{g_d}$ denotes the standard deviation of the GP kernel corresponding to the $d$-th output dimension, and $l_{1,g_d},\ldots,l_{d_x,g_d}$ denote the corresponding lengthscales. In order to ease the computational burden, we apply the first 7 inputs computed by the LocAL algorithm before computing a new solution.
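As an illustrative sketch of the weight-matrix construction above (the hyperparameter values in the usage comment are made up, and extending the lengthscale vectors over the full augmented-state dimension is an assumption):

```python
def mpc_weight_matrix(signal_stds, lengthscales):
    """Q = sum_d sigma_{g_d} * diag(l_{1,g_d}^{-2}, ..., l_{D,g_d}^{-2}),
    with one (sigma_{g_d}, lengthscale vector) pair per output GP."""
    D = len(lengthscales[0])
    Q = np.zeros((D, D))
    for sigma_gd, ls in zip(signal_stds, lengthscales):
        Q += sigma_gd * np.diag(np.asarray(ls, dtype=float) ** -2)
    return Q

# e.g., a pendulum-like system with d_x = 2 output GPs over a 3-dimensional augmented state:
# Q = mpc_weight_matrix(signal_stds=[1.3, 0.9],
#                       lengthscales=[[0.5, 1.0, 2.0], [0.7, 0.8, 1.5]])
```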

We additionally explore each system using a greedy entropy-based cost function, as suggested in Koller et al. (2018) and Schreiter et al. (2015), and compare the results. In all four cases, the LocAL algorithm yields a better model in the region of interest than the entropy-based algorithm.

5.1 Toy Problem

Consider the continuous-time nonlinear dynamical system

\[
\dot{x}=10\left(\sin(x)+\arctan(x)+u\right), \tag{14}
\]

with state space $\mathcal{X}=\mathbb{R}$ and input space $\mathcal{U}=[-5,5]$. We are interested in obtaining an accurate dynamical model within the region $\tilde{\mathcal{X}}_{\text{ref}}=\{[x\ u]^{\mathrm{T}}\in\tilde{\mathcal{X}}\ |\ x\in[-\pi,\pi],\ u\in[-1,1]\}$. To obtain a discrete-time system in the form of (1), we discretize (14) with a discretization step of $\Delta t=0.1$ and set the prior model to $f(x,u)=x$. The results are displayed in Figure 1.
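For reference, a minimal sketch of the discretized toy system and its prior model, which together define what the GP has to learn. The forward-Euler scheme is an assumption, as the text does not state the discretization method.

```python
import numpy as np

DT = 0.1  # discretization step from the text

def toy_dynamics(x, u):
    """Forward-Euler discretization of Eq. (14): x_{t+1} = x_t + DT * 10 * (sin x + arctan x + u)."""
    return x + DT * 10.0 * (np.sin(x) + np.arctan(x) + u)

def prior_model(x, u):
    """Known component f(x, u) = x; the GP learns the residual g plus noise."""
    return x

# GP training pairs: input (x_t, u_t), target x_{t+1} - prior_model(x_t, u_t).
```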

The LocAL algorithm yields a substantial improvement in model accuracy in every run. This is because the system stays close to the region of interest 𝒳~ref{{\tilde{\mathcal{X}}}_{\text{ref}}} during the whole simulation. By contrast, the greedy entropy-based method covers a considerably more extensive portion of the state space. This comes at the cost of a poorer model on 𝒳~ref{{\tilde{\mathcal{X}}}_{\text{ref}}}, as indicated by the respective RMSE.


Figure 1: Toy problem results. Collected data in the augmented state space $\mathcal{X}\times\mathcal{U}$ after 200 time steps (top). Median, lower and upper quartile of RMSE on the region of interest (bottom). The LocAL algorithm explores the region of interest more thoroughly than the entropy-based approach.

5.2 Surface exploration

We apply the LocAL algorithm to the dynamical system given by

\[
\dot{x}_1=3u_1+10\cos(5x_1)\cos(5x_2),\qquad\dot{x}_2=3u_2+10\sin(5x_1)\sin(5x_2). \tag{15}
\]

This setting can be seen as a surface exploration problem, i.e., an agent navigates a surface to learn its curvature. We aim to obtain an accurate model of the dynamics within the region $\tilde{\mathcal{X}}_{\text{ref}}=\{[\bm{x}^{\mathrm{T}}\ \bm{u}^{\mathrm{T}}]^{\mathrm{T}}\in\tilde{\mathcal{X}}\ |\ \bm{x}\in[-\pi/4,\pi/4]^2,\ \bm{u}\in[-1,1]^2\}$. To run the LocAL algorithm, we discretize (15) with a discretization step of $\Delta t$ and set the prior model to $\bm{f}(\tilde{\bm{x}})=\bm{x}+\Delta t\,\bm{u}$. The results are shown in Figure 2.

The LocAL algorithm manages to significantly improve its model after 400 time steps, while the entropy-based strategy does not yield any improvement. This is because every variable of the state space is unbounded, i.e., the state space can be explored for a potentially infinite amount of time without ever reaching the region of interest $\tilde{\mathcal{X}}_{\text{ref}}$.


Figure 2: Surface exploration results. Median, lower and upper quartile of RMSE on region of interest.

5.3 Pendulum

We now consider a two-dimensional pendulum, whose state $\bm{x}=[\vartheta,\dot{\vartheta}]$ is given by the angle $\vartheta$ and angular velocity $\dot{\vartheta}$. The input torque $u$ is constrained to the interval $\mathcal{U}=[-10,10]$. Our goal is to obtain a suitable model within $\tilde{\mathcal{X}}_{\text{ref}}=\{[\bm{x}^{\mathrm{T}}\ u]^{\mathrm{T}}\in\tilde{\mathcal{X}}\ |\ x_1\in[\pi/2,3\pi/2],\ x_2\in[-5,5],\ u\in[-3,3]\}$. Obtaining a precise model around this region is particularly useful for the commonly considered task of stabilizing the pendulum around the upward position $\vartheta=\dot{\vartheta}=0$. The results are depicted in Figure 3.


Figure 3: Pendulum results. Median, lower and upper quartile of RMSE on region of interest.

The RMSE indicates that the LocAL algorithm yields a similar model improvement in every run. The model obtained with the entropy-based strategy, by contrast, exhibits higher errors.

5.4 Cart-pole

We apply the LocAL algorithm to the cart-pole system (Barto et al., 1983). The state is given by $\bm{x}=[v,\vartheta,\dot{\vartheta}]$, where $v$ is the cart velocity, $\vartheta$ is the pendulum angle, and $\dot{\vartheta}$ is its angular velocity. Here we ignore the cart position, as it has no influence on the system dynamics. The region of interest is given by $\tilde{\mathcal{X}}_{\text{ref}}=\{[\bm{x}^{\mathrm{T}}\ \bm{u}^{\mathrm{T}}]^{\mathrm{T}}\in\tilde{\mathcal{X}}\ |\ x_1\in[-2,2],\ x_2\in[-\pi/4,\pi/4],\ u_1,u_2\in[-5,5]\}$. Similarly to the pendulum case, obtaining an accurate model on this region is useful for addressing the balancing task. The discretization step is set to $\Delta t=0.05\,\mathrm{s}$, and the prior model is $\bm{f}(\tilde{\bm{x}})=\bm{x}$. The results are shown in Figure 4.


Figure 4: Cart-pole results. Median, lower and upper quartile of RMSE on region of interest.

Similarly to the pendulum case, the RMSE of the model obtained with the LocAL algorithm exhibits a low standard deviation across runs compared to that of the entropy-based approach. This is because the region of interest is explored more thoroughly with our approach.

6 Conclusion

A technique for efficiently exploring bounded subsets of the state-action space of a system has been presented. The proposed technique aims to maximize the mutual information of the system trajectories with respect to a discretization of the region of interest. It employs Gaussian processes both to model the unknown system dynamics and to quantify the informativeness of potentially collected data points. In numerical simulations of four different dynamical systems, we have demonstrated that the proposed approach yields a better model after a limited number of time steps than a greedy entropy-based approach.

Acknowledgements

This work was supported by the European Research Council (ERC) Consolidator Grant "Safe data-driven control for human-centric systems (CO-MAN)" under grant agreement number 864686.

References

  • Alpcan and Shames (2015) T. Alpcan and I. Shames. An information-based learning approach to dual control. IEEE transactions on neural networks and learning systems, 26(11):2736–2748, 2015.
  • Ay et al. (2008) N. Ay, N. Bertschinger, R. Der, F. Güttler, and E. Olbrich. Predictive information and explorative behavior of autonomous robots. The European Physical Journal B, 63(3):329–339, 2008.
  • Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, 1983.
  • Beckers et al. (2019) T. Beckers, D. Kulić, and S. Hirche. Stable Gaussian process based tracking control of Euler-Lagrange systems. Automatica, (103):390–397, 2019.
  • Berkenkamp and Schoellig (2015) F. Berkenkamp and A. P. Schoellig. Safe and robust learning control with Gaussian processes. In Control Conference (ECC), 2015 European, pages 2496–2501. IEEE, 2015.
  • Burgard et al. (2005) W. Burgard, D. Fox, and S. Thrun. Probabilistic robotics. The MIT Press, 2005.
  • Candela et al. (2003) J. Q. Candela, A. Girard, J. Larsen, and C. E. Rasmussen. Propagation of uncertainty in Bayesian kernel models - application to multiple-step ahead forecasting. In 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP ’03)., volume 2, pages II–701, April 2003.
  • Capone and Hirche (2019) A. Capone and S. Hirche. Backstepping for partially unknown nonlinear systems using Gaussian processes. IEEE Control Systems Letters, 3:416–421, 2019.
  • Dayan and Sejnowski (1996) P. Dayan and T. J. Sejnowski. Exploration bonuses and dual control. Machine Learning, 25(1):5–22, 1996.
  • Deisenroth et al. (2015) M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
  • Kamthe and Deisenroth (2018) S. Kamthe and M. P. Deisenroth. Data-efficient reinforcement learning with probabilistic model predictive control. In AISTATS, volume 84, pages 1701–1710. Proceedings of Machine Learning (PMLR), 2018.
  • Koller et al. (2018) T. Koller, F. Berkenkamp, M. Turchetta, and A. Krause. Learning-based model predictive control for safe exploration. In 2018 IEEE Conference on Decision and Control (CDC), pages 6059–6066, 2018.
  • Krause et al. (2008) A. Krause, A. Singh, and C. Guestrin. Near-optimal sensor placements in Gaussian processes: Theory, efficient algorithms and empirical studies. Journal of Machine Learning Research, 9(Feb):235–284, 2008.
  • Pillonetto et al. (2014) G. Pillonetto, F. Dinuzzo, T. Chen, G. De Nicolao, and L. Ljung. Kernel methods in system identification, machine learning and function estimation: A survey. Automatica, 50(3):657–682, 2014.
  • Rasmussen and Williams (2006) C. E. Rasmussen and C. K. Williams. Gaussian Processes for Machine Learning. The MIT Press, Cambridge, MA, USA, 2006.
  • Schreiter et al. (2015) F. Schreiter, D. Nguyen-Tuong, M. Eberts, B. Bischoff, H. Markert, and M. Toussaint. Safe exploration for active learning with Gaussian processes. In Machine Learning and Knowledge Discovery in Databases, pages 133–149, Cham, 2015. Springer International Publishing.
  • Srinivas et al. (2012) N. Srinivas, A. Krause, S. M. Kakade, and M. W. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5):3250–3265, 2012.
  • Umlauft and Hirche (2020) J. Umlauft and S. Hirche. Feedback linearization based on Gaussian processes with event-triggered online learning. IEEE Transactions on Automatic Control, 2020.
  • Umlauft et al. (2018) J. Umlauft, L. Pöhler, and S. Hirche. An uncertainty-based control Lyapunov approach for control-affine systems modeled by Gaussian process. IEEE Control Systems Letters, 2018.