
Sufficiently Accurate Model Learning for Planning

Clark Zhang, Santiago Paternain, and Alejandro Ribeiro C. Zhang and A. Ribeiro are with the Electrical Engineering department at the University of Pennsylvania: {clarkz, aribeiro}@seas.upenn.edu. S. Paternain is with the Electrical Computer and Systems Engineering department at the Rensselaer Polytechnic Institute: paters@rpi.edu
Abstract

Data-driven models of dynamical systems help planners and controllers to provide more precise and accurate motions. Most model learning algorithms try to minimize a loss function between the observed data and the model's predictions. This can be improved using prior knowledge about the task at hand, which can be encoded in the form of constraints. This turns the unconstrained model learning problem into a constrained one. These constraints allow models with finite capacity to focus their expressive power on important aspects of the system. This can lead to models that are better suited for certain tasks. This paper introduces the constrained Sufficiently Accurate model learning approach, provides examples of such problems, and presents a theorem bounding how close approximate solutions can be to the exact one. The quality of the approximate solution depends on the function parameterization, the smoothness of the loss and constraint functions, and the number of samples used in model learning.

Index Terms:
Model Learning, Dynamics Model, Machine Learning

I Introduction

Dynamics models play an essential role in many modern controllers and planners. This is for instance the case in Model Predictive Control (MPC) [1]. They can also be used to compute feedforward terms for feedback controllers [2], or be used with a planner to find a trajectory from which stabilizing controllers can be generated [3, 4]. While model-free Reinforcement Learning techniques can solve similar problems without the use of a dynamics model [5, 6, 7], they generally do not scale to new problems or changes in problem parameters. Model learning methods perform admirably when the models can approximate the dynamics of the system accurately. However, the performance of these controllers can be degraded by uncertainty in the model. While robust controllers can be designed to attempt to alleviate these issues [8], these methods typically perform conservatively because they must cater to worst case approximations of the model. In addition, there may be effects that a robust controller designer may not be aware of. For example, consider the problem of landing a quadrotor precisely at a target as shown in Figure 9. Complex aerodynamic effects arise when nearby surfaces disturb the airflow. This may result in large torques when the quadrotor is hovering close to the ground and hamper precise landings. This is known as the "ground effect." These aerodynamic effects can be hard to model from just prior knowledge and may show up as a highly correlated, state dependent, non-zero mean noise. A common method that has been suggested to model similar effects is to learn or adjust a dynamics model with real data taken from running the system.

For linear systems, this method of System Identification (SI) or Model Learning has many results [9, 10, 11, 12, 13, 14], such as the recoverability of the linear system dynamics when the data contains sufficient excitation. System identification methods have also been proposed for non-linear systems [15, 16, 17]. However, these suffer from issues such as the need for large amounts of data or the requirement of a special problem structure, such as block-based systems.

While generic methods for non-linear systems do not provide the same theoretical guarantees, there has been some success in practice. In robotics, Gaussian Processes, Gaussian Mixture Models, or Neural Networks have been used to learn models of dynamics [18, 19, 20, 21]. A typical process for learning these models involves selecting a parameterized model, such as a neural network with a fixed number of layers and neurons, and choosing a loss function that penalizes the output of the model for not matching the data gathered from running the real system. Then, one optimizes the parameters by minimizing the empirical risk using, for instance, algorithms like stochastic gradient descent. This formulation assumes that all transitions are equally important, since it penalizes the mismatch between model and data uniformly on all portions of the state-action space. While this formulation has shown success in some applications, prior knowledge about the task and system can inform better learning objectives. A control designer may know that a certain part of the state space requires a certain accuracy for a robust controller to work well, or that some part of the state space is more important and should have hard constraints on the model accuracy. For example, to precisely land a quadrotor, a designer may note that accurately modeling the complex ground effect forces matters most near the landing site. Incorporating this prior knowledge can lead to better performing models.

To address the problem of incorporating prior knowledge into model learning, we introduced the idea of Sufficiently Accurate Model Learning in [22]. This formulation is based on the inclusion of constraints in the optimization problem whose role is to introduce prior knowledge about the importance of different state-control subsets. In the example of the quadrotor, notice that when the quadrotor is away from surfaces, the ground effect is minor; it is thus important to focus the learned model's expressiveness on the region of the state space that is most heavily affected. This can be easily captured by a constraint that the average error in the important state-space regions is smaller than a desired value. These constraints will allow models with finite expressiveness to concentrate on modeling important aspects of a system. One point to note is that this constrained objective can be used orthogonally to many existing methods. For example, the constrained objective can replace the unconstrained objective in [20, 21], and all other aspects of the methods can remain the same. The data can be collected the same way. The idea of using extra sensors during training for [20] need not change. While not trivial, the idea of constraints can also be applied to Gaussian process models such as [18, 19].

In its most generic form, the problem of model learning is an infinite dimensional non-convex optimization problem that involves the computation of expectations with respect to an unknown distribution over the state-action space. In addition, the formulation proposed here introduces constraints, which seems to make the learning process even more challenging. However, in this work we show that solving this problem accurately is not more challenging than solving an unconstrained parametric learning problem. To reach this conclusion we solve a relaxation of the problem of interest with three common modifications: (i) function parameterization, (ii) empirical approximation, and (iii) dual problem solving. Function parameterization turns the infinite dimensional problem into one over finite function parameters. Empirical approximation allows for efficient computation of approximate expectations, and solving the dual problem leads to a convex unconstrained optimization problem. The three approximations introduced, however, may not yield solutions that are good approximations of the original problem. To that end, we establish a bound on the difference between the values of these solutions. This gap between the original and approximate problem depends on the number of data samples as well as the expressiveness of the function approximation (Theorem 1). In particular, the bound can be made arbitrarily small with a sufficient number of samples and the selection of a rich enough function approximator. This implies that solving the functional constrained problem is nearly equivalent to solving a sequence of unconstrained approximate problems using primal-dual methods.

This paper extends [22] to the case of empirical samples and presents Theorem 1, which relates the number of samples, the expressiveness of the function approximation, and the smoothness of the loss function to the approximation error. In addition, there is experimental validation of Theorem 1 as well as different examples to showcase the framework of Sufficiently Accurate model learning. This paper is organized as follows: Section II introduces the Sufficiently Accurate model learning framework, and Section III presents in detail the three approximations introduced to solve the problem as well as a result that bounds the error on the approximate problem (Theorem 1). Section IV provides the proof for the main theorem. Section V presents a simple primal-dual method to solve the constrained problem, while experimental results are presented in Section VI on a double integrator system with friction, a ball bouncing task, and a quadrotor landing with ground effect in simulation. Theoretical results are experimentally tested for the double integrator system. Section VII presents the paper's conclusions.

II Sufficiently Accurate Model Learning

In this paper we consider the problem of learning a discrete time dynamical system. Let us denote $t\in\mathbb{Z}$ as the time index and let $x_t\in\mathbb{R}^n$, $u_t\in\mathbb{R}^p$ be the state and input of the system at time $t$. The dynamical system of interest is represented by a function $f\colon\mathbb{R}^n\times\mathbb{R}^p\to\mathbb{R}^n$ that relates the state and input at time $t$ to the state at time $t+1$

$x_{t+1}=f(x_t,u_t)$ (1)

One approach to System Identification or Model Learning consists of fitting an estimated dynamical model, $\phi\colon\mathbb{R}^n\times\mathbb{R}^p\rightarrow\mathbb{R}^n$, to state transition data [18]. This state transition data consists of tuples $s=(x_t,u_t,x_{t+1})$ drawn from a distribution $\mathcal{S}_{\mathcal{D}}$ with sample space $\mathcal{S}\subseteq\mathbb{R}^n\times\mathbb{R}^p\times\mathbb{R}^n$. The estimated model $\phi$ belongs to a class of functions, $\Phi$, of which $f$ is an element. $\Phi$ could be, for example, the space of all continuous functions. Then, the problem of model learning reduces to finding the function $\phi$ in the class $\Phi$ that best fits the transition data. The figure of merit is a loss function $\ell\colon\mathcal{S}\times\Phi\to\mathbb{R}$. With these definitions, the problem of interest can be written as the following stochastic optimization problem

$\min_{\phi\in\Phi}\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\ell(s,\phi)].$ (2)

The loss function needs to be selected to encourage the model $\phi$ to match its output with the state transition data. An example of a loss function is the $p$-norm, $\ell(s,\phi)=\lVert\phi(x_t,u_t)-x_{t+1}\rVert_p$. For $p=1$, this reduces to a sum of absolute differences between each output dimension of $\phi$ and the true next state, $x_{t+1}$. When $p=2$, this is simply a Euclidean loss. Other common losses include the Huber loss and, in discrete state settings, a 0-1 loss.

Oftentimes, one can derive, from first principles, a model $\hat{f}$ of the dynamical system of interest. Depending on the complexity of the system, these models may be inaccurate since they may ignore hard-to-model dynamics or higher order effects. For instance, one can derive a model $\hat{f}$ for a quadrotor from rigid body dynamics where forces and torques are functions of each motor's propeller speed. However, the accuracy of this model will depend on other effects that are harder to model, such as aerodynamic forces near the ground. If these aerodynamic effects are ignored, the result can be a failure to control the system or poor performance [23]. In these cases, the target model, denoted by $\tilde{\phi}$, can be decomposed as the sum of an analytic model and an error term

$\tilde{\phi}(x_t,u_t)=\hat{f}(x_t,u_t)+\phi(x_t,u_t).$ (3)

The learning of the error term, or residual model, fits the framework described in (2). For instance, the $p$-norm loss can be modified to take the form $\ell(s,\phi)=\lVert\tilde{\phi}(x_t,u_t)-x_{t+1}\rVert_p=\lVert\phi(x_t,u_t)-(x_{t+1}-\hat{f}(x_t,u_t))\rVert_p$.
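As a concrete illustration, the following sketch computes this residual $p$-norm loss for a batch of transition tuples using PyTorch tensors. The function and variable names are ours, not from the paper's code; `phi` is the learned residual model and `f_hat` the analytic base model.

```python
import torch

def residual_pnorm_loss(phi, f_hat, x_t, u_t, x_next, p=2):
    """Per-sample loss l(s, phi) = || phi(x_t, u_t) - (x_{t+1} - f_hat(x_t, u_t)) ||_p.

    phi   : learned residual model (callable), maps (x_t, u_t) -> predicted residual
    f_hat : analytic base model (callable), maps (x_t, u_t) -> predicted next state
    x_t, u_t, x_next : tensors of shape (batch, n), (batch, p_dim), (batch, n)
    """
    target_residual = x_next - f_hat(x_t, u_t)      # what the residual model should predict
    return torch.norm(phi(x_t, u_t) - target_residual, p=p, dim=1)  # shape (batch,)
```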

A characteristic of the classic model learning problem defined in (2) is that errors are uniformly weighted across the state-input space. In principle, one can craft the loss so as to represent different properties on different subsets of the state-input space. However, this design is challenging, system dependent, and dependent on the transition data available. In contrast, our approach aims to exploit prior knowledge about the errors and how they impact the control performance. For instance, based on the analysis of robust controllers, one can have bounds on the error required for successful control. This information can be used to formulate the Sufficiently Accurate model learning problem, where we introduce the prior information in the form of constraints. Formally, we encode the prior information by introducing $K\in\mathbb{N}$ functions $g_k\colon\mathcal{S}\times\Phi\rightarrow\mathbb{R}$. Define, in addition, a collection of subsets of transition tuples where this prior information is relevant, $\mathcal{S}_k\subset\mathcal{S}$, and corresponding indicator functions $\mathbb{I}_k\colon\mathcal{S}\rightarrow\{0,1\}$ taking the value one if $s\in\mathcal{S}_k$ and zero otherwise. With these definitions, the sufficiently accurate model learning problem is defined as

$P^{\star}=\min_{\phi\in\Phi}\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\ell(s,\phi)\mathbb{I}_0(s)]$ (4)
s.t. $\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[g_k(s,\phi)\mathbb{I}_k(s)]\leq 0,\ k=1,2,\ldots,K$

Note that the sets $\mathcal{S}_k$ that define the indicator functions $\mathbb{I}_k(s)$ are not necessarily disjoint. In fact, in practice, they often are not. The sets can be arbitrary and need not have any relation to each other. Examples of how these sets are used are given at the end of this section. Notice that (4) is an infinite dimensional problem since the optimization variable is a function, and it involves the computation of expectations with respect to a possibly unknown distribution. An approximation to this problem is presented in Section III. For technical reasons, the functions $g_k$ and $\ell$ should be expectation-wise Lipschitz continuous.

Assumption 1.

The functions $\ell_0(s,\phi)=\ell(s,\phi)\mathbb{I}_0(s)$ and $\bm{g}_k(s,\phi)=g_k(s,\phi)\mathbb{I}_k(s)$ are $L$-expectation-wise Lipschitz continuous in $\Phi$, i.e.,

$\mathbb{E}_s\lVert\ell_0(s,\phi_1)-\ell_0(s,\phi_2)\rVert_\infty\leq L\,\mathbb{E}_s\lVert\phi_1-\phi_2\rVert_\infty,\quad\forall\phi_1,\phi_2\in\Phi$ (5)

Here, $\lVert\phi_1-\phi_2\rVert_\infty$ is the infinity norm for functions, defined as $\lVert f\rVert_\infty=\sup_x\lvert f(x)\rvert$.

The expectation-wise Lipschitz assumption is weaker than Lipschitz continuity, as any Lipschitz continuous function with Lipschitz constant $L$ is also expectation-wise Lipschitz continuous with constant $L$. In particular, the loss functions in Examples 1 and 2 are expectation-wise Lipschitz continuous with some constant (cf. Appendix A). There is no assumption that the functions be convex or continuous.

Before we proceed, we present two examples of Sufficiently Accurate model learning. For notational brevity, when an expectation does not have a subscript, it is always taken over $s\sim\mathcal{S}_{\mathcal{D}}$.

Example 1.

Selective Accuracy

$\min_{\phi\in\Phi}\mathbb{E}\lVert\phi(x_t,u_t)-x_{t+1}\rVert_2$ (6)
s.t. $\mathbb{E}\left[\lVert\phi(x_t,u_t)-x_{t+1}\rVert_2\,\mathbb{I}_A(s)\right]\leq\epsilon_c$

This problem is a simple modification of (2). It has the same objective, but adds a constraint that the error on a certain state-control subset, defined by a set $A$, should be within $\epsilon_c$ on average. The indicator $\mathbb{I}_A(s)$ is 1 when $s$ is in the set $A$. Here, $g(s,\phi)=\lVert\phi(x_t,u_t)-x_{t+1}\rVert_2-\frac{\epsilon_c}{\mathbb{E}\,\mathbb{I}_A(s)}$. This formulation allows one to trade off the accuracy in one part of the state-control space against everything else, as it may be more important to the task. Another use case is to provide an error bound for robust controllers. This is the formulation used in the quadrotor precise landing experiments detailed in Section VI-C, where the set $A$ is defined to be all states close to the ground, where the ground effect is more prominent.
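For illustration, the sketch below forms empirical estimates of the objective and the constraint in (6) for a batch of samples, with the indicator implemented as a boolean mask over the samples that lie in $A$. The naming and batching conventions are ours and are not taken from the paper's implementation.

```python
import torch

def selective_accuracy_terms(phi, x_t, u_t, x_next, in_A, eps_c):
    """Empirical objective and constraint for Example 1 (Selective Accuracy).

    in_A  : boolean tensor, in_A[i] == True if sample i lies in the set A
    eps_c : desired accuracy inside A
    Returns (objective, constraint); feasibility requires constraint <= 0.
    """
    err = torch.norm(phi(x_t, u_t) - x_next, p=2, dim=1)   # per-sample 2-norm error
    objective = err.mean()                                  # estimate of E[ ||phi - x_{t+1}||_2 ]
    constraint = (err * in_A.float()).mean() - eps_c        # estimate of E[ err * I_A(s) ] - eps_c
    return objective, constraint
```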

Example 2.

Normalized Objective

$\min_{\phi\in\Phi}\mathbb{E}\left[\frac{\lVert\phi(x_t,u_t)-x_{t+1}\rVert_2}{\lVert x_{t+1}\rVert_2}\,\mathbb{I}_A(s)\right]$ (7)
s.t. $\mathbb{E}\left[\lVert\phi(x_t,u_t)-x_{t+1}\rVert_2\,\mathbb{I}_{A^C}(s)\right]\leq\epsilon_c$

where $\mathbb{I}_A$ is the indicator for the subset $A=\{s\in\mathcal{S}\colon\lVert x_{t+1}\rVert_2\geq\delta_c\}$, and $A^C$ is the complement of the set $A$. This formulation minimizes an objective in which the error term is normalized by the size of the next state. This can be useful in cases where the states take values in a very large range. An error of 1 unit is large if the true value is 0.1 units, but small if the true value is 100 units. The set $A$ contains all data samples where the true next state is large enough for this normalization to be significant, which reduces numerical issues when the denominator is small. For all small state values, the error is simply bounded by $\epsilon_c$. From a practical point of view, sensors will always have noise. When the state is small, the "true" measurement of the state can be dominated by noise, and the model can be better off bounding the error rather than fitting the noise. This is the formulation used in the ball bouncing experiment in Section VI-B, where we would like the errors in velocity prediction to be scaled by the speed, and all errors below a small speed are constrained by a simple constant.

III Problem Approximation

The unconstrained problem (2) and the constrained problem (4) are functional optimization problems. In general, these are infinite dimensional and usually intractable. Instead of optimizing over the entire function space $\Phi$, one may consider function spaces $\Phi_\theta\subset\Phi$ parameterized by a $d$-dimensional vector $\theta\in\Theta=\mathbb{R}^d$. Examples of such classes are linear functions of the form $\phi_\theta(x,u)=\theta_x^\top x+\theta_u^\top u$, where $\theta=[\theta_x,\theta_u]$ is a vector of weights for the state and control input. More complex function approximators, such as neural networks, may be used to express a richer class of functions [21, 24]. Restricting the function space poses a problem in that the optimal solution to (4) may no longer exist in the set $\Phi_\theta$. The goal under these circumstances should be to find the closest solution in $\Phi_\theta$ to the true optimal solution $\phi^\star$. Additionally, the expectations of the loss and constraint functions are in general intractable, since the distributions can be unknown or hard to compute in closed form. In practice, the expectation is approximated with a finite number of data samples $s_i\sim\mathcal{S}_{\mathcal{D}}$, $i=1,\ldots,N$. This yields the following empirical parameterized risk minimization problem

$P_N^{\star}=\min_{\theta\in\Theta}\frac{1}{N}\sum_{i=1}^N \ell(s_i,\phi_\theta)\mathbb{I}_0(s_i)$ (8)
s.t. $\frac{1}{N}\sum_{i=1}^N g_k(s_i,\phi_\theta)\mathbb{I}_k(s_i)\leq 0,\ k=1,2,\ldots,K$

While both function and empirical approximations are common ways to simplify the problem, the approximate problem introduced in (8) is still a constrained optimization problem and can be difficult to solve in general, as it can be nonconvex in the parameters $\theta$. This is the case, for instance, when the function approximator is a neural network. One approach is to solve the dual problem associated with (8). To define the dual problem, we first introduce the dual variables (also known as Lagrange multipliers), $\lambda\in\mathbb{R}_+^K$, along with the Lagrangian associated with (8)

$\mathcal{L}_N(\theta,\lambda)=\frac{1}{N}\sum_{i=1}^N\ell_0(s_i,\phi_\theta)+\lambda^\top\frac{1}{N}\sum_{i=1}^N\bm{g}(s_i,\phi_\theta).$ (9)

Here, the symbol $\ell_0(s_i,\phi_\theta)$ is defined as $\ell(s_i,\phi_\theta)\mathbb{I}_0(s_i)$ to condense the notation. Similarly, the bolded vector $\bm{g}(s_i,\phi_\theta)$ is a vector whose $k^{\text{th}}$ entry is $g_k(s_i,\phi_\theta)\mathbb{I}_k(s_i)$. The dual problem is now defined as

$D_N^{\star}=\max_{\lambda\geq 0}\min_{\theta\in\Theta}\mathcal{L}_N(\theta,\lambda)$ (10)

Notice that (10) is similar to a regularized form of (8) where each constraint is penalized by a coefficient $\omega_k$

$D_N^{\star}=\min_{\theta\in\Theta}\frac{1}{N}\sum_{i=1}^N\ell_0(s_i,\phi_\theta)+\bm{\omega}^\top\frac{1}{N}\sum_{i=1}^N\bm{g}(s_i,\phi_\theta).$ (11)

Adding this type of regularization can weight certain parts of the state-action space more heavily. In fact, if $\bm{\omega}$ is chosen to be the optimal dual variable $\lambda_N^{\star}$, solving (11) is equivalent to solving (10). However, arbitrary choices of $\omega_k$ provide no guarantees on the constraint function values. By defining the constraint functions directly, the constraint values are determined independently of any tuning factor. For problems where strong guarantees are required or easier to define, the Sufficiently Accurate framework will satisfy them by design. An alternative interpretation is that (10) provides a principled way of selecting the regularization coefficients. In Section V, we discuss an implementation of a primal-dual algorithm to do so.

The dual problem has two important properties that hold regardless of the structure of the optimization problem (8). For any $\theta$ that minimizes the Lagrangian $\mathcal{L}_N$, the resulting function of the multipliers, termed the dual function, is concave, since it is the point-wise minimum of affine functions (see e.g. [25]). Therefore, its maximization is tractable and can be done efficiently, for instance using stochastic gradient ascent. In addition, the dual function is always a lower bound on the value $P_N^{\star}$, and in that sense solving the dual problem (10) provides the tightest such lower bound. In the case of convex problems (that fulfill Slater's condition), it is well known that the duality gap is zero, and therefore $P_N^{\star}=D_N^{\star}$ [25, Section 5.2.3]. However, the problem (8) is non-convex and, a priori, we do not have guarantees on how far apart the primal and dual values are. Moreover, recall that the primal problem (8) is an empirical approximation of the problem that we are actually interested in solving, (4).

The previous discussion leads to the question of the quality of the solution of (10) as an approximation to (4). The duality gap is defined as the difference between the primal and dual solutions of the same problem. Here, the gap is the difference between the primal and dual of different but closely related problems. Hence, the quantity we are interested in bounding is the surrogate duality gap, defined as

$|P^{\star}-D_N^{\star}|.$ (12)

To provide specific bounds on the difference in the previous expression, we consider the family of function classes $\Phi_\theta$ termed $\epsilon$-universal function approximators. We define this notion next.

Definition 1.

The function class $\Phi_\theta$ is an $\epsilon$-universal function approximator for $\Phi$ if, for any $\phi\in\Phi$, there exists a $\phi_\theta\in\Phi_\theta$ such that $\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}\lVert\phi(s)-\phi_\theta(s)\rVert_\infty\leq\epsilon$.

To provide some intuition, consider the case where $\Phi$ is the space of all continuous functions; the above property is then satisfied by some neural network architecture. That is, for any $\epsilon$, there exists a class of neural network architectures $\Phi_\theta$ such that $\Phi_\theta$ is an $\epsilon$-universal approximator for the set of continuous functions [24, Corollary 2.1]. Thus, for any dynamical system with continuous dynamics, this assumption is mild. Other parameterizations, such as Reproducing Kernel Hilbert Spaces, are $\epsilon$-universal as well [26]. Notice that the previous definition is an approximation in total norm variation and hence is a milder assumption than the universal approximation property that fully connected neural networks exhibit [24].

Next, we define an intermediate problem on which the surrogate duality gap depends: a perturbed version of problem (4) where the constraints are tightened by $L\epsilon\geq 0$, where $L$ is the constant defined in Assumption 1 and $\epsilon$ is the universal approximation constant in Definition 1

$P_{L\epsilon}^{\star}=\min_{\phi\in\Phi}\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\ell_0(s,\phi)]$ (13)
s.t. $\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\bm{g}(s,\phi)]+\bm{1}L\epsilon\leq 0.$

Here $\bm{1}$ is a vector of ones. The perturbation results in a problem whose constraints are tighter than those of (4). The set of feasible solutions for the perturbed problem (13) is a subset of the feasible solutions for the unperturbed problem (4). The perturbed problem accounts for the approximation introduced by the parameterization. In the worst case, if the problem (13) is infeasible, the parameterized approximation (8) may become infeasible as the number of samples increases.

Let $\lambda_{L\epsilon}^{\star}$ be the solution to the dual of (13)

$\lambda_{L\epsilon}^{\star}=\operatorname*{arg\,max}_{\lambda\geq 0}\min_{\phi\in\Phi}\mathbb{E}[\ell_0(s,\phi)]+\lambda^\top\left(\mathbb{E}[\bm{g}(s,\phi)]+\bm{1}L\epsilon\right)$ (14)

With these definitions, we can present the main theorem that bounds the surrogate duality gap.

Theorem 1.

Let $\Phi$ be a compact class of functions over a compact space such that there exists $\phi\in\Phi$ for which (4) is feasible, and let $\Phi_\theta$ be an $\epsilon$-universal approximator of $\Phi$ as in Definition 1. Let the space of Lagrange multipliers $\lambda$ be a compact set as in [27]. In addition, let Assumption 1 hold and let $\Phi_\theta$ satisfy the following property

$\lim_{N\rightarrow\infty}\frac{H^{\Phi_\theta}\left(\delta-\epsilon L(\lVert\lambda_{L\epsilon}^{\star}\rVert_1+1),\,N\right)}{N}=0$ (15)

where $\lVert\lambda_{L\epsilon}^{\star}\rVert_1$ is the $1$-norm of the optimal dual variable of problem (14), $L$ is the Lipschitz constant for the loss function, and $H^{\Phi_\theta}$ is the random VC-entropy [28, Section II.B]. Note that both arguments of $H^{\Phi_\theta}$ must be positive. Then $P^{\star}$ and $D_N^{\star}$, the values of (4) and (10) respectively, satisfy

$\lim_{N\rightarrow\infty}\mathbb{P}\left(|P^{\star}-D_N^{\star}|\leq\delta\right)=1,$ (16)

where the probability is over independent samples $\{s_1,s_2,\ldots,s_N\}$ drawn from the distribution $\mathcal{S}_{\mathcal{D}}$ as defined in problem (8).

Proof.

See Section IV

The intuition behind the theorem is that, given some acceptable surrogate duality gap $\delta$, there exists a neural network architecture $\Phi_\theta$ and a number of samples $N$ such that the probability that the solution to (10) is within $\delta$ of the solution to (4) is very high. The choice of neural network influences the values of $\epsilon$ and $\lambda_{L\epsilon}^{\star}$. These in turn determine the achievable duality gap $\delta$, as the quantity $\delta-\epsilon L(\lVert\lambda_{L\epsilon}^{\star}\rVert_1+1)$ must be positive. A larger neural network corresponds to a smaller $\epsilon$, which also has an impact on the perturbed problem (13). A smoother function and a smaller $\epsilon$ lead to smaller perturbations, and smaller perturbations can lead to a smaller dual variable $\lambda_{L\epsilon}^{\star}$. Thus, larger neural networks and smoother dynamical systems will have smaller duality gaps. If $L\epsilon$ is large, the perturbed problem may be infeasible. In these cases, $\lambda_{L\epsilon}^{\star}$ will be infinite. This corresponds to problems where the function approximation simply cannot satisfy the original constraint functions. For example, using constant functions to approximate a complicated system may violate the constraint functions $g_k$ for all possible constant functions. In that case, no $\delta$ exists to bound the gap, as the parameterized empirical problem (8) has no feasible solution. This theorem suggests that with a good enough function approximation and large enough $N$, solving (10) is, with high probability, a good approximation to solving (4).

There are some details to point out in Theorem 1. First, the function $H^{\Phi_\theta}$ is a complicated function that will usually scale with the size of the neural network. A larger neural network leads to a smaller $\epsilon$, but may require a larger number of samples $N$ to adequately converge to an acceptable solution. The assumption on the limiting behavior of $H^{\Phi_\theta}$ is fulfilled by some neural network architectures [29], but the general behavior of this function for all neural network architectures is still a topic of research. Additionally, we assume the space of Lagrange multipliers is a compact set. This implies, along with the compact state-action space, that $\mathcal{L}$ is bounded. A finite Lagrange multiplier is a reasonable assumption as the problem (4) is required to be feasible [27].

The bound established in Theorem 1 depends on quantities that are in general difficult to estimate, including $H^{\Phi_\theta}$, $\lVert\lambda_{L\epsilon}^{\star}\rVert_1$, $L$, and $\epsilon$. Thus, while the theorem provides insight into how these quantities influence the gap between solutions, it is mainly a statement of the existence of values that can provide the desired result. In practice, this result can be pursued by increasing the size of the neural network as well as the number of data samples until the desired performance is reached. Note that the theorem follows our intuition that larger neural networks and more data give more accurate results. However, it formalizes not only that the solution is more accurate, but that the error tends to 0 as the number of samples and the number of parameters increase.

IV Proof of Theorem 1

To begin, we define an intermediate problem

$P_\theta^{\star}=\min_{\theta\in\Theta}\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\ell_0(s,\phi_\theta)]$ (17)
s.t. $\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\bm{g}(s,\phi_\theta)]\leq 0.$

Note that this is the unperturbed counterpart of (13), restricted to the parameterized class. As a reminder, this problem uses the class of parameterized functions but does not use data samples to approximate the expectation. Thus, it can be seen as a step in between (4) and (8). As with the dual problem to (8), we can define the Lagrangian associated with (17)

$\mathcal{L}_\theta(\theta,\lambda)=\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\ell_0(s,\phi_\theta)]+\lambda^\top\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\bm{g}(s,\phi_\theta)]$ (18)

and the dual problem

$D_\theta^{\star}=\max_{\lambda\geq 0}\min_{\theta\in\Theta}\mathcal{L}_\theta(\theta,\lambda).$ (19)

Using this intermediate problem, we can break the gap $|P^{\star}-D_N^{\star}|$ into two components.

$|P^{\star}-D_N^{\star}|=|(P^{\star}-D_\theta^{\star})+(D_\theta^{\star}-D_N^{\star})|\leq|P^{\star}-D_\theta^{\star}|+|D_\theta^{\star}-D_N^{\star}|$ (20)

As a reminder, $P^{\star}$ is the value of the problem we want to solve, (4), $D_N^{\star}$ is the value of the problem we can feasibly solve, (10), and $D_\theta^{\star}$ is the value of the intermediate problem (19). The first term of this bound, $|P^{\star}-D_\theta^{\star}|$, is the error that arises from using a parameterized function and the dual approximation. The second term, $|D_\theta^{\star}-D_N^{\star}|$, is the error that arises from using empirical data samples; it can be seen as a kind of generalization error. The proof is now split into two parts, each bounding one of these errors.

IV-A Function Approximation Error

We first look at the quantity $|P^{\star}-D_\theta^{\star}|$. This can be further split as follows

$|P^{\star}-D_\theta^{\star}|=|(P^{\star}-D^{\star})+(D^{\star}-D_\theta^{\star})|\leq|P^{\star}-D^{\star}|+|D^{\star}-D_\theta^{\star}|$ (21)

where $D^{\star}$ is the value of the dual problem associated with (4). This is defined with the Lagrangian

$\mathcal{L}(\phi,\lambda)=\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\ell_0(s,\phi)]+\lambda^\top\mathbb{E}_{s\sim\mathcal{S}_{\mathcal{D}}}[\bm{g}(s,\phi)]$ (22)

and the dual problem

$D^{\star}=\max_{\lambda\geq 0}\min_{\phi\in\Phi}\mathcal{L}(\phi,\lambda).$ (23)

We note that the quantity $|P^{\star}-D^{\star}|$ is in fact 0 due to a result from [30, Theorem 1]. The theorem is reproduced here using the notation of this paper.

Theorem 2 ([30], Theorem 1).

There is zero duality gap between the primal problem (4) and the dual problem (23) if

  1. There exists a strictly feasible solution ($\phi$, $\lambda$) to (23).

  2. The distribution $\mathcal{S}_{\mathcal{D}}$ is nonatomic.

While the problem defined in [30] is different from the sufficiently accurate problem defined in (4), there is an equivalent problem formulation (see Appendix B). Since Theorem 1 fulfills the assumptions of Theorem 2, we get $|P^{\star}-D^{\star}|=0$.

The second term of this approximation error, $|D^{\star}-D_\theta^{\star}|$, has also been studied previously in [31, Theorem 1] in the context of a slightly different problem formulation. The following theorem adapts [31, Theorem 1] to the Sufficiently Accurate problem formulation (4).

Theorem 3.

Given the primal problem (4) and the dual problem (19), along with the following assumptions

  1. $\Phi_\theta$ is an $\epsilon$-universal function approximator for $\Phi$, and there exists a strictly feasible solution $\phi_\theta$ for (17).

  2. The loss and constraint functions are expectation-wise Lipschitz continuous with constant $L$.

  3. All assumptions of Theorem 2 hold.

Then the dual value $D_\theta^{\star}$ is bounded by

$D^{\star}\leq D_\theta^{\star}\leq D^{\star}+(\lVert\lambda_{L\epsilon}^{\star}\rVert_1+1)L\epsilon,$ (24)

where $\lambda_{L\epsilon}^{\star}$ is the dual variable that achieves the optimal solution of (14).

Proof.

See Appendix C

Again, the assumptions of Theorem 1 fulfill the assumptions of Theorem 3. Due to notational differences, as well as a different way of framing the optimization problem, the proof has been adapted from [31] and is given in Appendix C. With Theorems 2 and 3, the following can be stated

$|P^{\star}-D_\theta^{\star}|\leq(\lVert\lambda_{L\epsilon}^{\star}\rVert_1+1)L\epsilon$ (25)

IV-B Empirical Error

We now look at the empirical error, $|D_\theta^{\star}-D_N^{\star}|$. We first establish the following lemma.

Lemma 1.

Let $\Delta\mathcal{L}(\theta,\lambda)=|\mathcal{L}_\theta(\theta,\lambda)-\mathcal{L}_N(\theta,\lambda)|$. Then, under the assumptions of Theorem 1, it follows that

$|D_\theta^{\star}-D_N^{\star}|\leq\sup_{\theta,\lambda}\Delta\mathcal{L}(\theta,\lambda).$ (26)
Proof.

See Appendix D

IV-C Probabilistic Bound

Substituting the parameterization bound (25) and the empirical bound (26) into (20) yields the following implication

$\sup_{\theta,\lambda}\Delta\mathcal{L}(\theta,\lambda)\leq\delta-\epsilon L(\lVert\lambda_{L\epsilon}^{\star}\rVert_1+1)\ \Rightarrow\ |P^{\star}-D_N^{\star}|\leq\delta.$ (27)

Let $\mathbb{P}\left(|P^{\star}-D_N^{\star}|\leq\delta\right)$ be a probability over the samples $\{s_1,s_2,\ldots,s_N\}$ that are drawn to estimate the expectation in the primal problem (8). Using the implication (27), it follows that

$\mathbb{P}\left(|P^{\star}-D_N^{\star}|\leq\delta\right)\geq\mathbb{P}\left(\sup_{\theta,\lambda}\Delta\mathcal{L}(\theta,\lambda)\leq\delta-\epsilon L(\lVert\lambda_{L\epsilon}^{\star}\rVert_1+1)\right)$ (28)
$=1-\mathbb{P}\left(\sup_{\theta,\lambda}\Delta\mathcal{L}(\theta,\lambda)>\delta-\epsilon L(\lVert\lambda_{L\epsilon}^{\star}\rVert_1+1)\right),$

where the equality follows directly from the fact that for any event $A$, $\mathbb{P}(A)=1-\mathbb{P}(A^c)$. The assumptions of Theorem 1 allow us to use the following result from Statistical Learning Theory [28, Section II.B],

$\lim_{N\rightarrow\infty}\mathbb{P}\left(\sup_{\theta,\lambda}\Delta\mathcal{L}(\theta,\lambda)>\delta-\epsilon L(\lVert\lambda_{L\epsilon}^{\star}\rVert_1+1)\right)=0.$ (29)

Note that this result requires bounded loss functions. The assumptions of a bounded dual variable and a compact state-action space in Theorem 1 satisfy this requirement.

Thus, this establishes that for any $\delta>0$, we have $\lim_{N\rightarrow\infty}\mathbb{P}\left(|P^{\star}-D_N^{\star}|\leq\delta\right)=1$. This concludes the proof of the theorem.

V Constrained Solution Via Primal-Dual Method

Figure 1: Primal-Dual Training Curve: Example of the evolution of the constraint, loss, and dual variables during training. The red dotted line shows 0. The curves have been scaled so that they fit on the same y-axis, since they are of different magnitudes. When the constraint is infeasible, the dual variable grows. This can cause the loss function to increase until the constraint is feasible again.
Input: Data samples $S=\{s_1,s_2,\ldots,s_N\}$
 Neural network with parameter space $\Theta$
 Batch size $M$
 Primal learning rate $\alpha_\theta$
 Dual learning rate $\alpha_\lambda$
Initialize $\theta\in\Theta$;
Initialize $\lambda=\bm{0}$;
while Not Converged do
   Sample batch of data $\hat{S}=\{s_{i_1},\ldots,s_{i_M}\}$ from $S$;
   Use $\hat{S}$ to compute estimates of $\nabla_\theta\mathcal{L}_N(\theta,\lambda)$ and $\nabla_\lambda\mathcal{L}_N(\theta,\lambda)$ (see (31) and (32));
   $\theta\leftarrow\theta-\alpha_\theta\nabla_\theta\mathcal{L}_N(\theta,\lambda)$;
   $\lambda\leftarrow\lambda+\alpha_\lambda\nabla_\lambda\mathcal{L}_N(\theta,\lambda)$;
   $\lambda\leftarrow[\lambda]_+$
end while
Algorithm 1 Primal-Dual

Section III has shown that problem (10) can approximate (4) given a large enough neural network and enough samples. This section discusses how to compute a solution to (10). There are many primal-dual methods [32, 33, 34] in the literature to solve this exact problem, and Algorithm 1 is an example of a simple primal-dual algorithm. One way to approach this problem is to consider the optimal dual variable $\lambda_N^{\star}$. Given knowledge of $\lambda_N^{\star}$, the problem reduces to the following unconstrained minimization

$D_N^{\star}=\min_{\theta\in\Theta}\mathcal{L}_N(\theta,\lambda_N^{\star})$ (30)

A possible solution method is to start with an estimate of $\lambda_N^{\star}$ and solve the inner minimization. Then, holding the primal variables fixed, update the dual variables by solving the outer maximization. This approach can be seen as solving a sequence of unconstrained minimization problems. It can be further approximated: instead of fully minimizing with respect to the primal variables, a gradient descent step is taken, and instead of fully maximizing with respect to the dual variables, a gradient ascent step is taken. This leads to Algorithm 1, where we alternate between the inner minimization step and the outer maximization step. At each iteration, the dual variables are projected onto the positive orthant of $\mathbb{R}^K$, denoted by the projection operator $[\lambda]_+$. This ensures non-negativity of the dual variables.

Figure 2: Double integrator model errors: The error in predicted velocity of the sufficiently accurate and the uniformly accurate models. A constant control input of $u=1$ is used to generate these plots. The top row contains three different views of the error for the sufficiently accurate model, while the bottom row contains the same three views for the uniformly accurate model. The z-axis on each plot is the velocity error $|f(x_t,u_t)-(\hat{f}(x_t,u_t)+\phi_\theta(x_t,u_t))|$. Best viewed in color.

In many cases, the full gradients $\nabla_\theta\mathcal{L}_N(\theta,\lambda)$ and $\nabla_\lambda\mathcal{L}_N(\theta,\lambda)$ can be too expensive to compute due to the possibly large number of samples $N$. An alternative is to take stochastic gradient descent and ascent steps. The gradients can be approximated by taking $M$ random samples from the whole dataset $S=\{s_1,\ldots,s_N\}$. The samples are denoted as $\hat{S}=\{s_{i_1},\ldots,s_{i_M}\}$, where $i_j$ is an integer index into the whole dataset $S$. Using $\hat{S}$, we obtain

$\nabla_\theta\mathcal{L}_N(\theta,\lambda)=\nabla_\theta\left[\frac{1}{N}\sum_{i=1}^N\ell_0(s_i,\phi_\theta)+\lambda^\top\frac{1}{N}\sum_{i=1}^N\bm{g}(s_i,\phi_\theta)\right]$ (31)
$\approx\nabla_\theta\left[\frac{1}{M}\sum_{j=1}^M\ell_0(s_{i_j},\phi_\theta)+\lambda^\top\frac{1}{M}\sum_{j=1}^M\bm{g}(s_{i_j},\phi_\theta)\right]$
$=\frac{1}{M}\sum_{j=1}^M\nabla_\theta\ell_0(s_{i_j},\phi_\theta)+\lambda^\top\frac{1}{M}\sum_{j=1}^M\nabla_\theta\bm{g}(s_{i_j},\phi_\theta).$

The gradients $\nabla_\theta\ell_0(s_{i_j},\phi_\theta)$ and $\nabla_\theta\bm{g}(s_{i_j},\phi_\theta)$ can be computed easily using backpropagation. Similarly, for $\nabla_\lambda\mathcal{L}_N(\theta,\lambda)$,

$\nabla_\lambda\mathcal{L}_N(\theta,\lambda)\approx\nabla_\lambda\left[\frac{1}{M}\sum_{j=1}^M\ell_0(s_{i_j},\phi_\theta)+\lambda^\top\frac{1}{M}\sum_{j=1}^M\bm{g}(s_{i_j},\phi_\theta)\right]=\frac{1}{M}\sum_{j=1}^M\bm{g}(s_{i_j},\phi_\theta).$ (32)

The dual gradient can be estimated as simply the average of the constraint functions over the sampled dataset.
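A minimal PyTorch sketch of one stochastic primal-dual iteration, following Algorithm 1 and the mini-batch gradient estimates (31)-(32), is given below. The function and argument names are illustrative assumptions on our part rather than the paper's implementation; in practice the plain gradient steps would typically be replaced by an optimizer such as ADAM, as done in the experiments.

```python
import torch

def primal_dual_step(phi, lam, batch, loss_fn, constraint_fns, alpha_theta, alpha_lam):
    """One stochastic primal-dual iteration on a mini-batch.

    phi            : parameterized model (torch.nn.Module)
    lam            : dual variables, tensor of shape (K,), kept non-negative
    loss_fn        : maps (batch, phi) -> per-sample masked loss  l_0(s_i, phi)
    constraint_fns : list of K callables, each maps (batch, phi) -> per-sample g_k(s_i, phi) I_k(s_i)
    """
    # Empirical Lagrangian L_N(theta, lambda) on the mini-batch, as in (9)
    obj = loss_fn(batch, phi).mean()
    g_hat = torch.stack([g(batch, phi).mean() for g in constraint_fns])  # shape (K,)
    lagrangian = obj + (lam * g_hat).sum()

    # Primal step: stochastic gradient descent on theta, cf. (31)
    phi.zero_grad()
    lagrangian.backward()
    with torch.no_grad():
        for w in phi.parameters():
            if w.grad is not None:
                w -= alpha_theta * w.grad

    # Dual step: stochastic gradient ascent on lambda, cf. (32), then projection onto lambda >= 0
    lam = torch.clamp(lam + alpha_lam * g_hat.detach(), min=0.0)
    return lam
```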

In the simplest form of the primal-dual algorithm, the variables are updated with plain gradient ascent/descent steps. These updates can be replaced with more sophisticated update schemes, such as momentum [35] or adaptive learning rates [36]. Higher order optimization methods, such as Newton's method, can also replace the gradient ascent and descent steps. For large neural networks, however, this can be infeasible as it requires computing Hessians with respect to the neural network weights; the memory complexity of the Hessian is quadratic in the number of weights.

The primal-dual algorithm presented here is not guaranteed to converge to the global optimum. With proper choice of learning rate, it can converge to a local optimum or saddle point. This issue is present in unconstrained formulations like (2) as well. An example of the evolution of the loss and constraint functions is shown in Figure 1.

VI Experiments

This section shows examples of the Sufficiently Accurate model learning problem. First, experiments are performed using a simple double integrator experiencing unknown dynamic friction. The simplicity of this problem, along with its small state space, allows us to explore and visualize some properties of the approximated solution. Next, two richer examples are shown. One learns how a ball bounces on a paddle with unknown paddle orientation and coefficient of restitution. The other mitigates the ground effect, which can disturb the landing sequence of a quadrotor. The experiments compare the sufficiently accurate problem (4) with the unconstrained problem (2), which is denoted as the Uniformly Accurate problem. Each experimental subsection is broken down into three parts: 1) system and task introduction, 2) experimental details, and 3) results.

VI-A Double Integrator with Friction

Figure 3: Double integrator friction force: The magnitude of the acceleration due to the position-varying kinetic friction force.
Figure 4: Experimental Surrogate Duality Gaps: Duality gaps of learning the double integrator problem with different neural networks and sample sizes (best viewed in color). The y-axis shows the Lagrangian value at the end of training, which approximates $D_N^{\star}$. The model numbers indicate how large the neural network is, with Model 0 being the smallest network. For each $N$, 15 tests with each model were run, and a box plot is shown in which the median is a solid bolded line. The ends of each box are the 1st and 3rd quartiles, while the whiskers are the minimum and maximum values. The red line is the optimal value of the original problem (4). Note that for $N=10000$, the median is not shown for Model 0 as it is very large (0.21).
Figure 5: Trajectory of the double integrator position: The evolution of the trajectory of the double integrator when controlled using MPC with both a sufficiently accurate and a uniformly accurate model.

VI-A1 Introduction

To analyze aspects of the Sufficiently Accurate model learning formulation, experiments are performed on a simple double integrator system with dynamic friction. When trying to control the system, a simple base model to use is that of a double integrator without any friction

$\begin{bmatrix}p_{t+1}\\ v_{t+1}\end{bmatrix}=\begin{bmatrix}1&\Delta t\\ 0&1\end{bmatrix}\begin{bmatrix}p_t\\ v_t\end{bmatrix}+\begin{bmatrix}0\\ \Delta t\end{bmatrix}u_t$ (33)

where $p$ is the position of the system, $v$ is the velocity, $u$ is the control input, and $\Delta t$ is the sampling time. The state of the system is $x=[p,v]$. The true model of the system, which is unknown to the controller, is

$\begin{bmatrix}p_{t+1}\\ v_{t+1}\end{bmatrix}=\begin{bmatrix}1&\Delta t\\ 0&1\end{bmatrix}\begin{bmatrix}p_t\\ v_t\end{bmatrix}+\begin{bmatrix}0\\ u_t\Delta t-c(v_t,u_t,b(p_t))\end{bmatrix}$ (34)

where $b(p_t)$ is a position-varying kinetic friction term. $c$ is a function that ensures the friction cannot reverse the direction of the velocity (it is an artifact of the discrete time formulation)

$c(v_t,u_t,b(p_t))=\begin{cases}v_t+u_t\Delta t,&\text{if }\operatorname{sign}(v_t+u_t\Delta t)\neq\operatorname{sign}(v_t+u_t\Delta t+b(p_t))\\ b(p_t),&\text{otherwise}.\end{cases}$ (35)

If, within a single time step, the friction force would change the sign of the velocity, $c$ sets $v_{t+1}$ to 0. Otherwise, $c$ passes the friction term through unmodified. The specific $b(p)$ used is shown in Figure 3 and the sampling time is set to $\Delta t=0.1$. The task is to drive the system to the origin $[p\ v]=[0\ 0]$.
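For concreteness, a small NumPy sketch of one step of the true dynamics (34)-(35) is shown below. The friction profile `b` is left as a user-supplied function since only its plot is given in Figure 3, and its sign convention is taken to follow (35) (i.e., `b(p)` is the per-step velocity change due to friction).

```python
import numpy as np

def true_step(p, v, u, b, dt=0.1):
    """One step of the double integrator with position-dependent kinetic friction, cf. (34)-(35)."""
    v_free = v + u * dt                          # velocity update ignoring friction
    if np.sign(v_free) != np.sign(v_free + b(p)):
        # friction would flip the sign of the velocity within one step: stop instead (v_next = 0)
        c = v_free
    else:
        c = b(p)
    p_next = p + dt * v
    v_next = v + u * dt - c
    return p_next, v_next
```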

VI-A2 Experimental Details

The goal of model learning in this experiment is to learn $\phi_\theta(x,u)$ such that $f(x,u)\approx\hat{f}(x,u)+\phi_\theta(x,u)$, where $f$ is (34) and $\hat{f}$ is (33). A Uniformly Accurate model is learned using (2), along with a Sufficiently Accurate model using the problem defined in Example 1. In the scenario defined by (6), $\mathbb{I}(s)$ is active in the region $\{(p,v)\in\mathbb{R}^2\colon\lVert[p,v]^\top\rVert_\infty\leq 0.5\}$ and $\epsilon_c=0.035$. The constraint therefore enforces a high accuracy in the part of the state space near the origin.

The neural network $\phi_\theta$ used to approximate the residual dynamics has two hidden layers. The first hidden layer has four neurons, while the second has two. Both hidden layers use a parametric rectified linear unit (PReLU) activation [37]. The input to the network is a concatenated vector $[p_t,v_t,u_t]$. The output layer's weights are initially set to zero, so before learning the residual error, the network outputs zero. The dataset used to train both the sufficiently and uniformly accurate models is generated by uniformly sampling 15,000 positions from [-2, 2], velocities from [-2.5, 2.5], and control inputs from [-10, 10]. The real model (34) is then used to obtain the true next state. Instead of simple gradient descent/ascent, ADAM [36] is used as the update rule with $\alpha_\theta=1\times 10^{-3}$ and $\alpha_\lambda=1\times 10^{-4}$. Both models were trained for 200 epochs.
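A PyTorch sketch of a residual network matching this description is shown below. The class and layer names are ours, and the two-dimensional output (a correction to $[p_{t+1}, v_{t+1}]$) is an assumption consistent with the residual formulation (3); the zero-initialized output layer matches the statement that the untrained network outputs zero.

```python
import torch.nn as nn

class ResidualModel(nn.Module):
    """Small residual-dynamics network: input [p_t, v_t, u_t], output correction to the base model (33)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, 4), nn.PReLU(),   # first hidden layer: 4 neurons
            nn.Linear(4, 2), nn.PReLU(),   # second hidden layer: 2 neurons
            nn.Linear(2, 2),               # output: residual on [p_{t+1}, v_{t+1}]
        )
        # zero-initialize the output layer so the initial model equals the base model
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, x):                  # x has shape (batch, 3)
        return self.net(x)
```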

The models are then evaluated on how well they perform within the MPC controller defined in (36). This controller seeks to drive the system to the origin while obeying control constraints. The controller is solved using a Sequential Quadratic Programming solver [38] with a time horizon of $T=10$. The models are evaluated in 200 different simulations where $x_{start}$ is drawn uniformly from $[-2,2]$.

$\min_{\{x_t,u_t\}_{t=1}^T}\ \sum_{t=1}^T|x_t|$ (36)
s.t. $|u_t|\leq 10,\ t=1,\ldots,T$
$x_{t+1}=\hat{f}(x_t,u_t)+\phi_\theta(x_t,u_t),\ t=1,\ldots,T-1$
$x_1=x_{start}$
$\dot{x}_1=0$
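As an illustration of how such a controller can be realized, the sketch below solves a single-shooting simplification of (36) with SciPy's SLSQP solver, rolling the learned model forward over the horizon and optimizing only the control sequence. This is our own simplified sketch, not the SQP formulation of [38]; it also treats $|x_t|$ as the 1-norm of the state, which is an assumption.

```python
import numpy as np
from scipy.optimize import minimize

def mpc_control(x_start, step_fn, T=10, u_max=10.0):
    """Receding-horizon control: optimize u_1..u_T for a single-shooting version of (36).

    step_fn : learned model, maps (x, u) -> next state x', with x the 2-vector [p, v]
    Returns the first control of the optimized sequence.
    """
    def cost(u_seq):
        x = np.array(x_start, dtype=float)
        total = 0.0
        for u in u_seq:
            x = step_fn(x, u)
            total += np.sum(np.abs(x))    # accumulate |x_t| over the horizon
        return total

    bounds = [(-u_max, u_max)] * T        # control constraint |u_t| <= 10
    res = minimize(cost, np.zeros(T), method="SLSQP", bounds=bounds)
    return res.x[0]
```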
Figure 6: Ball Bouncing Simulation: The paddle seeks to bounce the orange ball above a target position on the xy plane, represented by the red ball. This figure shows a time sequence of a bounce from left to right.

VI-A3 Results

The sufficiently accurate formulation utilizes the prior knowledge that the model should be more accurate near the goal in order to stop efficiently. While the system is far from the origin, the control is simple regardless of the friction; the controller only needs to know which direction to push in. A plot of the accuracy of both models is shown in Figure 2 and summarized in Table II. The Sufficiently Accurate model has low average error near the origin, but suffers from higher average error outside the region defined by $\mathbb{I}(s)$. This is the expected behavior.

The performance of the controllers is summarized in Table I. Even though the Sufficiently Accurate model has higher error outside the constraint region and lower error within it, it leads to lower costs when controlling the double integrator. The reason is shown in Figure 5: the sufficiently accurate model may reach steady state a bit more slowly, but it controls the overshoot better and does not oscillate near the origin. This is because the model is purposefully more accurate near the origin, where it matters most for this task.

                         $\sum_{t=1}^T|x_t|$       $\sum_{t=1}^T|x_t|/|x_{start}|$
Uniformly Accurate       $5.51\pm 3.46$            $5.66\pm 0.89$
Sufficiently Accurate    $\bm{4.73\pm 3.38}$       $\bm{4.57\pm 0.73}$
TABLE I: Controller performance: This table shows the results of 200 trials of simulating the double integrator starting from different positions. Each entry shows the mean and one standard deviation. The first column shows the raw cost function of the MPC problem averaged over all trials. The second column shows an average normalized cost where each cost is normalized by the absolute value of the starting position. This is due to the fact that larger magnitude starting locations will have higher costs.
               All state space          $\mathcal{I}_K$           $\mathcal{I}_K^C$
Uniform        $\bm{0.110\pm 0.098}$    $0.083\pm 0.054$          $\bm{0.113\pm 0.10}$
Sufficient     $0.136\pm 0.126$         $\bm{0.048\pm 0.040}$     $0.146\pm 0.13$
TABLE II: Mean errors between the true model and the learned models in various state space subsets: The error is the absolute difference between the true model and the learned models. Each entry shows the mean and one standard deviation. The first column shows the average error for the whole state space. The second column shows the average error for the state space near the origin, while the third column shows the average error for the complement of that set (states far from the origin). The errors are evaluated on a test set not seen during training.

VI-A4 Convergence Experiments

The double integrator is a simple system, which enables more comprehensive tests that experimentally illustrate some aspects of Theorem 1. For this system, we run one more experiment in which 4 different neural network architectures are used. Each network has two hidden layers with PReLU activations, and the only difference is the number of neurons in each layer. Denoting a network as (number of neurons in the first layer, number of neurons in the second layer), the network sizes used are: (2, 1), (4, 2), (8, 4), (16, 8). A set of values of the number of samples $N$ is also chosen: $\{100, 1000, 5000, 10000, 15000\}$. For each $N$, 15 random datasets are sampled, and each neural network is trained with each dataset using the Sufficiently Accurate objective described in Section VI-A2. There is one minor difference in how the data is collected: zero mean Gaussian noise with $\sigma=0.2$ is added to $v_{t+1}$. With noisy observations of velocity, the optimal model that can be learned for (4) has an objective value of $P^{\star}=0.04$. The results of training each neural network model with each random dataset are shown in Figure 4. Each boxplot in the figure shows the distribution of the final value of the Lagrangian, $\mathcal{L}_N$, at the end of training. This is an approximation of $D_N^{\star}$. The primal-dual algorithm may not be able to solve for the optimal $D_N^{\star}$, but the expectation is that for a simple problem like the double integrator, the solution is fairly close. In fact, Figure 4 shows that with increasing model sizes and larger $N$, the distribution of the solutions appears to converge to $P^{\star}$. Note that the figure shows the value of the Lagrangian on training data. Thus, for small $N$, networks can overfit and have a near zero Lagrangian value. As $N$ increases, the networks have less of a chance to overfit the training data.

VI-B Ball Bouncing

Figure 7: Model errors vs. velocity magnitude: The scatter plot shows the distribution of model errors versus the magnitude of the velocity of the true result. Both the sufficiently and uniformly accurate models are evaluated using a validation set that is not used during training. The blue dotted line represents the boundary of the constraint set, and the red dotted line represents the bound of the constraint function.
Figure 8: Ball bouncing trajectory: The (x, y, z) trajectory of the ball is plotted for both the base model (with wrong parameters) and a learned model using the sufficiently accurate objective. This plot shows that the base model is not sufficient by itself to bounce the ball at a desired location.

VI-B1 Introduction

This experiment involves bouncing a ball on a paddle as in Figure 6. The ball has the state $x=[\bm{p}_{ball},\bm{v}_{ball}]$, where $\bm{p}_{ball}$ is the three-dimensional position of the ball and $\bm{v}_{ball}$ is its three-dimensional velocity. The control input is $u=[\bm{v}_{paddle},\bm{n}]$, where $\bm{v}_{paddle}$ is the velocity of the paddle at the moment of impact with the ball and $\bm{n}$ is the normal vector of the paddle, representing its orientation. This control input is a high-level action and is realized by lower-level controllers that attempt to match the desired paddle velocity and orientation. A basic model of how the velocity of the ball changes during collision is

\bm{v}_{ball}^{+} = \alpha_{r}\left(\bm{v}_{rel}^{-} - 2\bm{n}(\bm{n}\cdot\bm{v}_{rel}^{-})\right) + \bm{v}_{paddle}   (37)
\bm{v}_{rel}^{-} = \bm{v}_{ball}^{-} - \bm{v}_{paddle}

where the superscript $-$ refers to quantities before the paddle-ball collision and the superscript $+$ refers to quantities after the paddle-ball collision (the paddle velocity and orientation are assumed to be unchanged during and directly after the collision). $\alpha_{r}$ is the coefficient of restitution. In this experiment, a neural network is tasked with learning the model of how the ball velocity changes, i.e. (37).
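As a concrete illustration, a minimal sketch of the reflection model (37) is given below; the default coefficient of restitution here is a placeholder, not the value used in the experiments.

```python
import numpy as np

def ball_velocity_after_impact(v_ball, v_paddle, n, alpha_r=0.8):
    """Post-impact ball velocity from the simple restitution model in (37).

    v_ball, v_paddle: 3D velocities before impact; n: paddle normal.
    alpha_r (coefficient of restitution) is a placeholder value for this sketch.
    """
    n = n / np.linalg.norm(n)                          # ensure the normal is a unit vector
    v_rel = v_ball - v_paddle                          # relative velocity before impact
    v_reflected = v_rel - 2.0 * n * np.dot(n, v_rel)   # mirror about the paddle plane
    return alpha_r * v_reflected + v_paddle
```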

VI-B2 Experimental Details

First, a neural network is trained without knowledge of any base model of how the ball bounces. This will be denoted as learning a full model, as opposed to a residual model. This network is trained in two ways: with the Uniformly Accurate problem (2) as well as with the Sufficiently Accurate problem realized in Example 2. The constants used in Example 2 are set to $\epsilon_{c}=0.1$ and $\delta_{c}=0.1$.

A second neural network is trained for both the uniformly and sufficiently accurate formulations; it utilizes the base model $\hat{f}(x,u)$ given in (37) to learn a residual error. In the base model, the coefficient of restitution, $\alpha_{r}$, is wrong, and the control $\bm{n}$ has a constant bias in which a rotation of 0.2 radians is applied about the y-axis. This simulates a robot arm picking up the paddle and not observing the rotation from the hand to the paddle correctly.

The neural network used for all models has 2 hidden layers with 128 neurons each, using the PReLU activation. The input to the network is the state of the ball and the control input at the time of collision, $[x^{-},u]$, and it outputs the ball velocity after the collision, $\bm{v}_{ball}^{+}$. The network was trained using the ADAM optimizer with an initial learning rate of $10^{-3}$ for both the primal and dual variables. The data used for all model training was gathered by simulating random ball bounces in MuJoCo for the equivalent of 42 minutes of real time.
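A minimal PyTorch sketch of this setup is shown below. The layer sizes and activation follow the description above, but the input/output dimensions, batch handling, and the use of a plain projected dual ascent step (rather than ADAM on the dual variable) are simplifications of this sketch, not the exact implementation.

```python
import torch
import torch.nn as nn

# Two hidden layers of 128 units with PReLU; input [x^-, u] (12-D here: ball position
# and velocity plus paddle velocity and normal), output v_ball^+ (3-D).
model = nn.Sequential(
    nn.Linear(12, 128), nn.PReLU(),
    nn.Linear(128, 128), nn.PReLU(),
    nn.Linear(128, 3),
)
opt_theta = torch.optim.Adam(model.parameters(), lr=1e-3)
lam = torch.zeros(1)           # dual variable for the single expected constraint
dual_lr = 1e-3

def primal_dual_step(inputs, targets, loss_fn, constraint_fn):
    """One saddle-point step on the empirical Lagrangian: descend in theta, ascend in lambda."""
    pred = model(inputs)
    loss = loss_fn(pred, targets)         # objective term, averaged over the batch
    g = constraint_fn(pred, targets)      # constraint term (e.g., mean of g(s, phi) * I(s))
    lagrangian = loss + lam * g
    opt_theta.zero_grad()
    lagrangian.backward()
    opt_theta.step()
    with torch.no_grad():                 # projected dual ascent keeps lambda >= 0
        lam.add_(dual_lr * g.detach()).clamp_(min=0.0)
    return lagrangian.item()
```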

All learned models are then evaluated by how well a controller utilizes them. The controller attempts to bounce the ball at a specific $xy$ location. This is represented by the following optimization problem that the controller solves

\min_{u} \lvert loc(\phi_{\theta}(x,u)) - loc_{desired}\rvert   (38)
\text{s.t. } roll_{min} \leq roll \leq roll_{max}
\phantom{\text{s.t. }} pitch_{min} \leq pitch \leq pitch_{max}
\phantom{\text{s.t. }} v_{min} \leq \lvert\bm{v}_{rel}\rvert \leq v_{max}

where $loc(\cdot)$ is a function that maps the velocity of the ball to the $xy$ location it will reach when it falls back to its current height. $roll$ and $pitch$ are both derived from the paddle normal $\bm{n}$. $[loc_{desired}, roll_{min}, roll_{max}, pitch_{min}, pitch_{max}, v_{min}, v_{max}]$ are parameters of the controller that can be chosen. The system and controller are then simulated in MuJoCo [39] using libraries from the DeepMind Control Suite [40].
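A sketch of how a controller of this form can be implemented is shown below. The ballistic `loc` function, the SLSQP solver, the roll/pitch expressions, the initial guess, and the default bounds are all assumptions of this sketch rather than the exact setup used in the experiments; `model(x, u)` stands in for the learned predictor $\phi_\theta$ of the post-impact ball velocity.

```python
import numpy as np
from scipy.optimize import minimize

G = 9.81

def loc(p, v):
    """xy point where the ball returns to its current height, from position p and
    post-impact velocity v (simple ballistic flight, an assumption of this sketch)."""
    t_flight = 2.0 * max(v[2], 0.0) / G
    return p[:2] + v[:2] * t_flight

def controller(model, x_ball, loc_desired, v_min, v_max, roll_max=0.3, pitch_max=0.3):
    """Solve a simplified version of (38); u = [v_paddle (3), n (3)]."""
    p_ball, v_ball = x_ball[:3], x_ball[3:]

    def objective(u):
        v_plus = model(x_ball, u)                       # predicted post-impact velocity
        return np.linalg.norm(loc(p_ball, v_plus) - loc_desired)

    def rel_speed(u):
        return np.linalg.norm(v_ball - u[:3])           # |v_rel| before impact

    cons = [
        {"type": "ineq", "fun": lambda u: rel_speed(u) - v_min},
        {"type": "ineq", "fun": lambda u: v_max - rel_speed(u)},
        # roll/pitch limits derived loosely from the paddle normal n = u[3:6]
        {"type": "ineq", "fun": lambda u: roll_max - abs(np.arctan2(u[4], u[5]))},
        {"type": "ineq", "fun": lambda u: pitch_max - abs(np.arctan2(u[3], u[5]))},
    ]
    u0 = np.concatenate([np.zeros(3), np.array([0.0, 0.0, 1.0])])  # upright, static paddle
    return minimize(objective, u0, constraints=cons, method="SLSQP").x
```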

Each model is evaluated 500 times with varying controller parameters. $loc_{desired}$ is uniformly distributed in the region $\{(x,y)\,|\,-1\,m\leq x\leq 1\,m,\ -1\,m\leq y\leq 1\,m\}$, $v_{min}$ is uniformly sampled from the interval $[3\,m/s, 4\,m/s)$, and $v_{max}$ is selected to be above $v_{min}$ by between $1\,m/s$ and $2\,m/s$.

VI-B3 Results

A plot of the model errors is shown in Figure 7. While the uniformly accurate model has errors that are distributed more or less uniformly across all magnitudes of ball velocity, the sufficiently accurate model shows a clear linear relationship between error and velocity magnitude. This is expected from the normalized objective, which penalizes errors relative to how large the velocity of the ball is. Therefore, larger velocities can have larger errors while incurring the same penalty as smaller velocities with small errors.
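For concreteness, a sketch of such a normalized per-sample loss, in the spirit of the loss analyzed in Appendix A-B, is shown below; the threshold guarding small velocities is a placeholder value.

```python
import numpy as np

def normalized_loss(pred_v, true_v, v_thresh=0.1):
    """Velocity-normalized error: absolute error divided by the true velocity magnitude.

    Samples with very small true velocity (below v_thresh, a placeholder) contribute
    nothing to this term, mirroring the indicator in the Appendix A-B loss.
    """
    speed = np.linalg.norm(true_v)
    if speed < v_thresh:
        return 0.0
    return np.linalg.norm(np.asarray(pred_v) - np.asarray(true_v)) / speed
```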

Figure 9: 3D Quadrotor Simulation: Trajectory of the Sufficiently Accurate and Uniformly Accurate models used in an MPC controller (best viewed in color). The goal is to land at a precise point, represented by the red dot. Each plot is a different view of the same data. The dotted lines represent the trajectory of the center of mass of the vehicle for 50 time steps. The state of the quadrotor at the last time step is drawn.

The results of running each model with the controller 500 times are shown in Table III. The error characteristics of the Sufficiently Accurate model (Figure 7) allow it to outperform its Uniformly Accurate counterpart with both a full model and a residual model. For the full model, the uniformly accurate problem yields a failure rate of over $20\%$, while the sufficiently accurate problem yields a failure rate of under $1\%$. Here, failure means the paddle fails to keep the ball bouncing. For the residual model, neither model failed, because the base model provides a decent guess (though the base model by itself is not good enough for control; see Figure 8). The sufficiently accurate model still provided lower mean errors.

We hypothesize that the large errors spread randomly across the Uniform model lead to high-variance estimates of the output given small changes in the input. For optimizers that use gradient information, this leads to a poor estimate of the gradient. For gradient-free optimizers, this still causes problems due to the high variance of the values themselves.

                              Uniform      Sufficient
Full Model      Failure       $20.8\%$     $0.8\%$
                Mean error    $0.3136$     $0.2124$
Residual Model  Failure       $0\%$        $0\%$
                Mean error    $0.164$      $0.156$
TABLE III: Ball bouncing controller performance: Results of 500 trials of ball bouncing for each model. There is a full and residual model trained for both the uniformly accurate and sufficiently accurate model learning problems.

VI-C Quadrotor with ground effects

VI-C1 Introduction

The last experiment deals with landing a quadrotor while undergoing disturbances from the ground effect. This disturbance occurs when a quadrotor operates near surfaces that change the airflow [23]. The quadrotor model has 12 degrees of freedom, with state $x=[\bm{p},\bm{v},\bm{q},\bm{\omega}]$, where $\bm{p}\in\mathbb{R}^{3}$ is the position of the center of mass, $\bm{v}\in\mathbb{R}^{3}$ is the center of mass velocity, $\bm{q}$ is a unit quaternion that represents the orientation of the quadrotor, and $\bm{\omega}\in\mathbb{R}^{3}$ is the angular velocity expressed in the body frame. The control input is $u=[u^{(1)},u^{(2)},u^{(3)},u^{(4)}]$, where $u^{(i)}$ is the force from the $i^{\text{th}}$ motor. The base model of the quadrotor, $\hat{f}(x,u)$, is as follows

\begin{bmatrix}\bm{p}_{t+1}\\ \bm{v}_{t+1}\\ \bm{q}_{t+1}\\ \bm{\omega}_{t+1}\end{bmatrix} = \begin{bmatrix}\bm{p}_{t}+\dot{\bm{p}}_{t}\Delta t\\ \bm{v}_{t}+\dot{\bm{v}}_{t}\Delta t\\ \frac{\bm{q}_{t}+\dot{\bm{q}}_{t}\Delta t}{\lVert\bm{q}_{t}+\dot{\bm{q}}_{t}\Delta t\rVert_{2}}\\ \bm{\omega}_{t}+\dot{\bm{\omega}}_{t}\Delta t\end{bmatrix}   (39)
\begin{bmatrix}\dot{\bm{p}}\\ \dot{\bm{v}}\\ \dot{\bm{q}}\\ \dot{\bm{\omega}}\end{bmatrix} = \begin{bmatrix}\bm{v}\\ \bm{q}\otimes[0,0,\sum_{i=1}^{4}u^{(i)}/m]^{\top}\otimes\bm{q}^{-1}-[0,0,9.81]^{\top}\\ \frac{1}{2}\bm{\omega}\otimes\bm{q}\\ \mathcal{I}^{-1}(\bm{T}-\bm{\omega}\times(\mathcal{I}\bm{\omega}))\end{bmatrix}
\bm{T} = \begin{bmatrix}u^{(4)}-u^{(2)}\\ u^{(3)}-u^{(1)}\\ (u^{(1)}+u^{(3)})-(u^{(2)}+u^{(4)})\end{bmatrix}

where $m$ is the total mass of the quadrotor (set to 1 kg for all experiments) and $\mathcal{I}$ is the inertia matrix about the principal axes (set to the identity for all experiments). The $\times$ symbol represents the cross product, and $\otimes$ represents quaternion multiplication. When using $\otimes$ between a vector and a quaternion, the vector components are treated as the imaginary components of a quaternion with a real component of 0. The discrete model normalizes the quaternion at each state update so that it remains a unit quaternion. The body frame of the quadrotor is such that the x axis aligns with one of the quadrotor arms, and the z axis points “up.”
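A minimal sketch of one Euler step of this base model is given below. The quaternion convention ([w, x, y, z]), the time step, and the mass and inertia defaults are assumptions of this sketch.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of quaternions a, b given as [w, x, y, z]."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

def rotate(q, v):
    """Rotate vector v by unit quaternion q, i.e. q ⊗ [0, v] ⊗ q^{-1}."""
    qv = np.concatenate([[0.0], v])
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, qv), q_conj)[1:]

def base_model_step(p, v, q, w, u, dt=0.02, m=1.0, I=np.eye(3)):
    """One Euler step of the base quadrotor model (39); dt is an assumed step size."""
    v_dot = rotate(q, np.array([0.0, 0.0, np.sum(u) / m])) - np.array([0.0, 0.0, 9.81])
    q_dot = 0.5 * quat_mul(np.concatenate([[0.0], w]), q)
    T = np.array([u[3] - u[1], u[2] - u[0], (u[0] + u[2]) - (u[1] + u[3])])
    w_dot = np.linalg.solve(I, T - np.cross(w, I @ w))

    p_next = p + v * dt
    v_next = v + v_dot * dt
    q_next = q + q_dot * dt
    q_next /= np.linalg.norm(q_next)          # re-normalize to a unit quaternion
    w_next = w + w_dot * dt
    return p_next, v_next, q_next, w_next
```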

The true model used in simulation adds disturbances to the force on each propeller, but is otherwise the same as the base model:

f(x,u)=\hat{f}(x,h(x,u))   (40)

where $h\colon\mathbb{R}^{n}\times\mathbb{R}^{p}\rightarrow\mathbb{R}^{p}$ is the ground effect model. In this experiment we use a simplified model of the ground effect in which each motor has an independent disturbance. The $i^{\text{th}}$ output of the ground effect model, $h_{i}(x,u)$, is

h_{i}(x,u) = u^{(i)}\left(1+K_{ground}\right)   (41)
K_{ground} = \left[1-\frac{h_{prop}}{h_{max}}\right]_{+}\frac{4\left[\theta_{ground}-\frac{\pi}{2}\right]_{+}^{2}}{\pi^{2}}\,\alpha

where $h_{prop}$ is the height of the propeller above the ground (not the height of the center of mass), and $h_{max}$ is a constant that determines the height above which the ground effect is no longer present. $\theta_{ground}$ is the angle between the unit vector aligned with the negative $z$ axis of the quadrotor and the unit vector $[0,0,-1]$. $\alpha$ is a number in the interval $[0,1]$ that represents the maximum fraction of the propeller’s generated force that can be added as a result of the ground effect. As a reminder, the $[\cdot]_{+}$ operator projects its argument onto the positive orthant. A visualization of $K_{ground}$ is shown in Figure 10. In the experiments, $h_{max}=1.5$ and $\alpha=0.5$.
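A short sketch of this disturbance model, matching (40)-(41) and the constants above, is shown below.

```python
import numpy as np

def k_ground(h_prop, theta_ground, h_max=1.5, alpha=0.5):
    """Ground-effect gain from (41): fraction of extra thrust added near the ground."""
    height_term = max(1.0 - h_prop / h_max, 0.0)                       # [1 - h/h_max]_+
    angle_term = 4.0 * max(theta_ground - np.pi / 2.0, 0.0) ** 2 / np.pi ** 2
    return height_term * angle_term * alpha

def disturbed_motor_forces(u, h_prop, theta_ground):
    """Per-motor forces with the simplified ground-effect disturbance of (40)-(41)."""
    return np.asarray(u) * (1.0 + k_ground(h_prop, theta_ground))
```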

Figure 10: Ground effect: A visualization of KgroundK_{ground} as a function of θground\theta_{ground} and hproph_{prop}.

VI-C2 Experimental Details

The Sufficiently Accurate model is trained using the problem presented in Example 1, where $\epsilon_{c}=0.001$ and the indicator $\mathbb{I}(s)$ is active when the height of the quadrotor is less than $1.5$. A Uniformly Accurate and a Sufficiently Accurate model are trained to learn the residual error between $f(x,u)$ and $\hat{f}(x,u)$. Both models use a neural network with 2 hidden layers of 16 and 8 neurons, using PReLU activations. The updates for the primal and dual variables used ADAM with $\alpha_{\theta}=1\times 10^{-3}$ and $\alpha_{\lambda}=1\times 10^{-4}$, and both models were trained for $3{,}000$ epochs. The training data consists of $10{,}000$ randomly sampled quadrotor states. The $(x,y)$ positions of the quadrotor were uniformly sampled from $[-10,10]$. The $z$ positions of the quadrotor were uniformly sampled from $[0.1,5]$. Linear velocity components were uniformly sampled from $[-6,6]$. Angular velocity components were uniformly sampled from $[-2.5,2.5]$. Control inputs for each motor were sampled from $[0,10]$. Quaternions were sampled by drawing a random unit vector along with a random angle in $[-0.4\,rad,0.4\,rad]$; this angle-axis rotation is then transformed into a quaternion.
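A sketch of this state sampling procedure is given below; the random seed and the [w, x, y, z] quaternion convention are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_state():
    """Sample one random quadrotor state and control as described above (a sketch)."""
    p = np.concatenate([rng.uniform(-10, 10, size=2), rng.uniform(0.1, 5, size=1)])
    v = rng.uniform(-6, 6, size=3)
    w = rng.uniform(-2.5, 2.5, size=3)
    u = rng.uniform(0, 10, size=4)
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)                   # random unit rotation axis
    angle = rng.uniform(-0.4, 0.4)                 # radians
    q = np.concatenate([[np.cos(angle / 2)], np.sin(angle / 2) * axis])  # axis-angle -> quaternion
    return p, v, q, w, u
```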

Both models are tested by sampling a random starting location and asking the quadrotor to land at the origin. The controller used for landing is an MPC controller that repeatedly solves the following problem

\min_{\{u_{t},x_{t}\}_{t=1}^{T}} \sum_{t=1}^{T}\lVert\bm{p}_{t}-\bm{p}_{target}\rVert_{2}   (42)
\text{s.t. } 0\leq u_{t}\leq 10,\ \forall t
\phantom{\text{s.t. }} x_{t+1}=\hat{f}(x_{t},u_{t})+\phi_{\theta}(x_{t},u_{t}),\ t=1,\ldots,T-1
\phantom{\text{s.t. }} h_{t,com}\geq 0.2,\ \forall t
\phantom{\text{s.t. }} \lvert q_{w}-1\rvert\leq 0.05

where $\bm{p}$ is the position of the quadrotor, $h_{t,com}$ is the height of the center of mass, and $q_{w}$ is the real component of the quaternion at the last time step. This problem encourages reaching a target, subject to control and dynamics constraints. It also constrains the height of the quadrotor to always remain above a small altitude, and constrains the orientation at the last time step so that the vehicle is mostly upright when it lands.
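A single-shooting sketch of this MPC problem is shown below. The horizon length, the solver choice, the hover-level initial guess, and the omission of the altitude and orientation constraints are simplifications of this sketch; `base_step` and `residual` stand in for $\hat{f}$ and the learned $\phi_{\theta}$, and the state layout $[\bm{p},\bm{v},\bm{q},\bm{\omega}]$ is assumed.

```python
import numpy as np
from scipy.optimize import minimize

def mpc_cost(u_flat, x0, base_step, residual, p_target, T=10):
    """Roll out x_{t+1} = f_hat(x_t, u_t) + phi_theta(x_t, u_t) and sum position errors."""
    u_seq = u_flat.reshape(T, 4)
    x, cost = x0, 0.0
    for t in range(T):
        x = base_step(x, u_seq[t]) + residual(x, u_seq[t])
        cost += np.linalg.norm(x[:3] - p_target)   # first three entries are position
    return cost

def solve_mpc(x0, base_step, residual, p_target, T=10, u_max=10.0):
    """A single-shooting sketch of (42); altitude/orientation constraints would be
    added as inequality constraints on the rolled-out states."""
    u0 = np.full(T * 4, 2.5)                        # hover-ish initial guess per motor
    bounds = [(0.0, u_max)] * (T * 4)
    res = minimize(mpc_cost, u0, args=(x0, base_step, residual, p_target, T),
                   bounds=bounds, method="SLSQP")
    return res.x.reshape(T, 4)[0]                   # apply only the first control
```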

VI-C3 Results

The results of running this controller from several different starting locations are shown in Table IV. As in previous experiments, the Sufficiently Accurate model has a higher loss overall, but better accuracy in the constrained region, which is more important to the task. This allows the controller to use the higher accuracy to land the quadrotor precisely. An example of one of the landing trajectories is shown in Figure 9. It can be seen that the Sufficiently Accurate model can land more precisely at the origin $(0,0)$. It is also able to reach the ground faster, as it can more accurately compensate for the extra force caused by the ground surface. The ground effect can also disturb roll and pitch maneuvers, which can offset the center of mass as well.

                                                Sufficiently Accurate       Uniformly Accurate
MPC cost                                        $\bm{4.21\pm 1.88}$         $4.55\pm 1.90$
$\mathbb{E}[l(s,\phi_{\theta})]$                $6.52\times 10^{-4}$        $\bm{6.16\times 10^{-4}}$
$\mathbb{E}[g(s,\phi_{\theta})\mathbb{I}(s)]$   $\bm{1.00\times 10^{-4}}$   $8.28\times 10^{-4}$
TABLE IV: Quadrotor Results: The first row shows the MPC cost average (42) and one standard deviation over 5 runs. The second row shows the training loss, while the third row shows the constraint function. The expected errors are evaluated on a test set not seen during training.

VII Conclusion

This paper presents Sufficiently Accurate model learning as a way to incorporate prior information about a system in the form of constraints. In addition, it proves that this constrained learning formulation can have an arbitrarily small duality gap. This means that existing methods like primal-dual algorithms can find decent solutions to the problem.

With good constraints, the model and learning method can focus on important aspects of the problem and improve the performance of the system even if the overall accuracy of the model is lower. These constraints can come from robust control formulations or from knowledge of sensor noise. An important question to consider when using this method is how to choose good constraints. For some systems and tasks this can be simple, while for others it can be quite difficult. This objective is not useful for all tasks and systems, but rather for the subset of tasks and systems where prior knowledge is more easily expressed as a constraint.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE-1321851. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Appendix A Expectation-wise Lipschitz-Continuity of Loss functions

A-A $l(s,\phi)=\lVert\phi(x_{t},u_{t})-x_{t+1}\rVert_{2}$

The loss function $l(s,\phi)=\lVert\phi(x_{t},u_{t})-x_{t+1}\rVert_{2}$ is expectation-wise Lipschitz-continuous in $\phi$.

Using the reverse triangle inequality, we get

l(s,ϕ1)l(s,ϕ2)=\displaystyle\mathinner{\!\left\lVert l(s,\phi_{1})-l(s,\phi_{2})\right\rVert}_{\infty}= (43)
ϕ1(xt,ut)xt+12ϕ2(xt,ut)xt+12\displaystyle\quad\mathinner{\!\left\lVert\mathinner{\!\left\lVert\phi_{1}(x_{t},u_{t})-x_{t+1}\right\rVert}_{2}-\mathinner{\!\left\lVert\phi_{2}(x_{t},u_{t})-x_{t+1}\right\rVert}_{2}\right\rVert}_{\infty}
ϕ1(xt,ut)ϕ2(xt,ut)2.\displaystyle\quad\leq\mathinner{\!\left\lVert\mathinner{\!\left\lVert\phi_{1}(x_{t},u_{t})-\phi_{2}(x_{t},u_{t})\right\rVert}_{2}\right\rVert}_{\infty}.

By the equivalence of norms, there exists $L$ such that $\lVert\phi_{1}(x,u)-\phi_{2}(x,u)\rVert_{2}\leq L\lVert\phi_{1}(x,u)-\phi_{2}(x,u)\rVert_{\infty}$ for all $(x,u)$. Thus,

l(s,ϕ1)l(s,ϕ2)Lϕ1ϕ2\mathinner{\!\left\lVert l(s,\phi_{1})-l(s,\phi_{2})\right\rVert}_{\infty}\leq L\mathinner{\!\left\lVert\phi_{1}-\phi_{2}\right\rVert}_{\infty} (44)

Since this is true for any ss, it is also true in expectation.

𝔼sl(s,ϕ1)l(s,ϕ2)L𝔼sϕ1ϕ2\mathbb{E}_{s}\mathinner{\!\left\lVert l(s,\phi_{1})-l(s,\phi_{2})\right\rVert}_{\infty}\leq L\mathbb{E}_{s}\mathinner{\!\left\lVert\phi_{1}-\phi_{2}\right\rVert}_{\infty} (45)

A-B $l(s,\phi)=\frac{\lVert\phi(x_{t},u_{t})-x_{t+1}\rVert_{2}}{\lVert x_{t+1}\rVert_{2}}\mathbb{I}_{A}(s)$

The loss function $l(s,\phi)=\frac{\lVert\phi(x_{t},u_{t})-x_{t+1}\rVert_{2}}{\lVert x_{t+1}\rVert_{2}}\mathbb{I}_{A}(s)$ is also expectation-wise Lipschitz-continuous in $\phi$. Following the same logic as for the Euclidean norm, we get

l(s,ϕ1)l(s,ϕ2)L𝕀A(s)xt+12(ϕ1ϕ2)\mathinner{\!\left\lVert l(s,\phi_{1})-l(s,\phi_{2})\right\rVert}_{\infty}\leq L\mathinner{\!\left\lVert\frac{\mathbb{I}_{A}(s)}{\mathinner{\!\left\lVert x_{t+1}\right\rVert}_{2}}(\phi_{1}-\phi_{2})\right\rVert}_{\infty} (46)

This reduces to

l(s,ϕ1)l(s,ϕ2)L1minxt+1xt+12(ϕ1ϕ2)\mathinner{\!\left\lVert l(s,\phi_{1})-l(s,\phi_{2})\right\rVert}_{\infty}\leq L\mathinner{\!\left\lVert\frac{1}{\min_{x_{t+1}}\mathinner{\!\left\lVert x_{t+1}\right\rVert}_{2}}(\phi_{1}-\phi_{2})\right\rVert}_{\infty} (47)

when considering the case where $s\in A$. The largest that $\frac{\mathbb{I}_{A}(s)}{\lVert x_{t+1}\rVert_{2}}$ can be is $1$ over the smallest value of $\lVert x_{t+1}\rVert_{2}$. If $s\not\in A$, both sides reduce to $0$, as the indicator variable is $0$. This leads to

l(s,ϕ1)l(s,ϕ2)\displaystyle\mathinner{\!\left\lVert l(s,\phi_{1})-l(s,\phi_{2})\right\rVert}_{\infty} L1ϵ(ϕ1ϕ2)\displaystyle\leq L\mathinner{\!\left\lVert\frac{1}{\epsilon}(\phi_{1}-\phi_{2})\right\rVert}_{\infty} (48)
𝔼l(s,ϕ1)l(s,ϕ2)\displaystyle\mathbb{E}\mathinner{\!\left\lVert l(s,\phi_{1})-l(s,\phi_{2})\right\rVert}_{\infty} Lϵ𝔼ϕ1ϕ2\displaystyle\leq\frac{L}{\epsilon}\mathbb{E}\mathinner{\!\left\lVert\phi_{1}-\phi_{2}\right\rVert}_{\infty}

Appendix B Equivalent Problem Formulation

Using the notation in this paper, the problem defined in [30] is

P=\displaystyle P^{\star}= max𝒙,ϕf0(𝒙)\displaystyle\max_{\bm{x},\phi}f_{0}(\bm{x}) (49)
s.t. 𝒙𝔼[𝒇𝟏(s,ϕ(s))]\displaystyle\text{s.t. }\bm{x}\leq\mathbb{E}[\bm{f_{1}}(s,\phi(s))]
𝒇𝟐(𝒙)0,𝒙𝒳,ϕΦ\displaystyle\quad\bm{f_{2}}(\bm{x})\geq 0,\bm{x}\in\mathcal{X},\phi\in\Phi

where $f_{0}$ and $\bm{f_{2}}$ are concave functions. $\bm{f_{1}}$ is not necessarily convex with respect to $\phi$. $\mathcal{X}$ is a convex set and $\Phi$ is compact. Note that $f_{0}$, $\bm{f_{1}}$, $\bm{f_{2}}$, $\bm{x}$, and $\mathcal{X}$ are not directly present in the Sufficiently Accurate problem.

To translate the problem, let us assume that $\bm{x}$ is $K+1$ dimensional, where $K$ is the number of constraints in (4). Let the last element of $\bm{f_{1}}(s,\phi)$ be equal to $-l(s,\phi)\mathbb{I}_{0}(s)$, the negative of the objective function in the Sufficiently Accurate problem. For $k=1,\ldots,K$, let the $k^{\text{th}}$ element of $\bm{f_{1}}(s,\phi)$ be equal to $-g_{k}(s,\phi)\mathbb{I}_{k}(s)$, the negative of a constraint function in (4). Set the objective function to be $f_{0}(\bm{x})=\bm{x}_{K+1}$, where $\bm{x}_{K+1}$ is the $(K+1)^{\text{th}}$ element of $\bm{x}$. $\bm{f_{2}}$ can be ignored by setting it to be the zero function. Under these assumptions, (49) is equivalent to the following

P=\displaystyle P^{\star}= max𝒙,ϕ𝒙K+1\displaystyle\max_{\bm{x},\phi}\bm{x}_{K+1} (50)
s.t. 𝒙k𝔼[gk(s,ϕ)𝕀k(s)],k=1,,K\displaystyle\text{s.t. }\bm{x}_{k}\leq-\mathbb{E}[g_{k}(s,\phi)\mathbb{I}_{k}(s)],k=1,\ldots,K
𝒙K+1𝔼[l(s,ϕ)𝕀0(s)]\displaystyle\phantom{s.t.}\bm{x}_{K+1}\leq-\mathbb{E}[l(s,\phi)\mathbb{I}_{0}(s)]
𝒙𝒳,ϕΦ.\displaystyle\phantom{s.t.}\bm{x}\in\mathcal{X},\phi\in\Phi.

Now, define the set $\mathcal{X}=\{(x_{1},x_{2},\ldots,x_{K+1})\colon x_{k}=0,\ k=1,\ldots,K,\ x_{K+1}\in\mathcal{X}_{K+1}\}$, where $\mathcal{X}_{K+1}$ is an arbitrary compact set in one dimension. This set of vectors, $\mathcal{X}$, is $0$ in the first $K$ components and compact in the last component. This will further simplify (50) to the following

P=\displaystyle P^{\star}= maxϕ𝔼[l(s,ϕ)𝕀0(s)]\displaystyle\max_{\phi}-\mathbb{E}[l(s,\phi)\mathbb{I}_{0}(s)] (51)
s.t. 0𝔼[gk(s,ϕ)𝕀k(s)],k=1,,K\displaystyle\text{s.t. }0\leq-\mathbb{E}[g_{k}(s,\phi)\mathbb{I}_{k}(s)],k=1,\ldots,K
ϕΦ\displaystyle\phantom{s.t.}\phi\in\Phi

as long as ll takes on values in a compact set. Finally, flipping all the negatives in (51),

P=\displaystyle P^{\star}=- minϕΦ𝔼[l(s,ϕ)𝕀0(s)]\displaystyle\min_{\phi\in\Phi}\mathbb{E}[l(s,\phi)\mathbb{I}_{0}(s)] (52)
s.t. 𝔼[gk(s,ϕ)𝕀k(s)]0,k=1,,K\displaystyle\text{s.t. }\mathbb{E}[g_{k}(s,\phi)\mathbb{I}_{k}(s)]\leq 0,k=1,\ldots,K

This completes the translation of problem (49) to (4).

Appendix C Proof of Theorem 3

This proof follows some of the steps of the proof of [31, Theorem 1]. Let $(\phi^{\star},\lambda^{\star})$ be the primal and dual variables that attain the solution value $P^{\star}=D^{\star}$ in problems (4) and (23). Similarly, let $(\theta^{\star},\lambda_{\theta}^{\star})$ be the primal and dual variables that attain the solution value $D_{\theta}^{\star}$ in problem (19). $\phi_{\theta^{\star}}$ is the function that $\theta^{\star}$ induces. Note that the optimal dual variables for (23), $\lambda^{\star}$, are not necessarily the same as the optimal dual variables for (19), $\lambda_{\theta}^{\star}$.

C-A Lower Bound

We first show the lower bound for DθD_{\theta}^{\star}. Writing out the dual problem (23), we obtain

D=maxλ0minϕΦ(ϕ,λ)=(ϕ,λ)D^{\star}=\max_{\lambda\geq 0}\min_{\phi\in\Phi}\mathcal{L}(\phi,\lambda)=\mathcal{L}(\phi^{\star},\lambda^{\star}) (53)

Since $(\phi^{\star},\lambda^{\star})$ attains the max-min value of the Lagrangian, $\phi^{\star}$ minimizes the Lagrangian for the fixed optimal dual variable $\lambda^{\star}$, that is,

ϕ=argminϕ(ϕ,λ).\phi^{\star}=\operatorname*{arg\,min}_{\phi}\mathcal{L}(\phi,\lambda^{\star}). (54)

Thus for any ϕ\phi,

(ϕ,λ)(ϕ,λ),ϕΦ.\mathcal{L}(\phi^{\star},\lambda^{\star})\leq\mathcal{L}(\phi,\lambda^{\star}),\forall\phi\in\Phi. (55)

We now look at the parameterized dual problem (19).

Dθ=maxλ0minθΘθ(θ,λ)=maxλ0minθΘ(ϕθ,λ)D_{\theta}^{\star}=\max_{\lambda\geq 0}\min_{\theta\in\Theta}\mathcal{L}_{\theta}(\theta,\lambda)=\max_{\lambda\geq 0}\min_{\theta\in\Theta}\mathcal{L}(\phi_{\theta},\lambda) (56)

This simply rewrites $\mathcal{L}_{\theta}$ in terms of $\mathcal{L}$, since the only difference is that $\mathcal{L}_{\theta}$ is defined over a subset of the primal variables on which $\mathcal{L}$ is defined. By definition, $\lambda_{\theta}^{\star}$ maximizes the minimization of $\mathcal{L}_{\theta}$ over $\theta$. That is to say, for the dual solution $(\theta^{\star},\lambda_{\theta}^{\star})$, $\lambda_{\theta}^{\star}$ maximizes $\mathcal{L}(\phi_{\theta^{\star}},\cdot)$

λθ\displaystyle\lambda_{\theta}^{\star} =argmaxλ0minθΘ(ϕθ,λ)\displaystyle=\operatorname*{arg\,max}_{\lambda\geq 0}\min_{\theta\in\Theta}\mathcal{L}(\phi_{\theta},\lambda) (57)
=argmaxλ0(ϕθ,λ).\displaystyle=\operatorname*{arg\,max}_{\lambda\geq 0}\mathcal{L}(\phi_{\theta^{\star}},\lambda).

Thus, for all λ\lambda, it is the case that

(ϕθ,λθ)(ϕθ,λ)\mathcal{L}(\phi_{\theta^{\star}},\lambda_{\theta}^{\star})\geq\mathcal{L}(\phi_{\theta^{\star}},\lambda) (58)

Putting together (55) and (58), we obtain

Dθ=(ϕθ,λθ)\displaystyle D_{\theta}^{\star}=\mathcal{L}(\phi_{\theta^{\star}},\lambda_{\theta}^{\star}) (ϕθ,λ)(ϕ,λ)=D\displaystyle\geq\mathcal{L}(\phi_{\theta^{\star}},\lambda^{\star})\geq\mathcal{L}(\phi^{\star},\lambda^{\star})=D^{\star} (59)

C-B Upper Bound

Next, we show the upper bound for DθD_{\theta}^{\star}. We begin by writing the Lagrangian (18)

Dθ=maxλ0minϕΦθ(ϕθ,λ)D_{\theta}^{\star}=\max_{\lambda\geq 0}\min_{\phi\in\Phi_{\theta}}\mathcal{L}(\phi_{\theta},\lambda) (60)

as previously written in (56). By adding and subtracting (ϕ,λ)\mathcal{L}(\phi,\lambda), we obtain

Dθ\displaystyle D_{\theta}^{\star} =maxλ0minθΘ(ϕ,λ)+(ϕθ,λ)(ϕ,λ)\displaystyle=\max_{\lambda\geq 0}\min_{\theta\in\Theta}\mathcal{L}(\phi,\lambda)+\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda) (61)
=maxλ0[(ϕ,λ)+minθΘ[(ϕθ,λ)(ϕ,λ)]]\displaystyle=\max_{\lambda\geq 0}[\mathcal{L}(\phi,\lambda)+\min_{\theta\in\Theta}[\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda)]]
maxλ0[(ϕ,λ)+minθΘ|(ϕθ,λ)(ϕ,λ)|]\displaystyle\leq\max_{\lambda\geq 0}[\mathcal{L}(\phi,\lambda)+\min_{\theta\in\Theta}\mathinner{\!\left\lvert\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda)\right\rvert}]

where the last line comes from the fact that the absolute value of an expression is always at least as large as the original expression, i.e. x|x|x\leq\mathinner{\!\left\lvert x\right\rvert}. Looking just at the quantity |(ϕθ,λ)(ϕ,λ)|\mathinner{\!\left\lvert\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda)\right\rvert}, we can expand it as

|(ϕθ,λ)(ϕ,λ)|\displaystyle\mathinner{\!\left\lvert\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda)\right\rvert} =|𝔼[0(s,ϕθ)+λ𝒈(s,ϕθ)]\displaystyle=|\mathbb{E}[\ell_{0}(s,\phi_{\theta})+{\lambda}^{\top}\bm{g}(s,\phi_{\theta})]- (62)
𝔼[0(s,ϕ)+λ𝒈(s,ϕ)]|\displaystyle\mathbb{E}[\ell_{0}(s,\phi)+{\lambda}^{\top}\bm{g}(s,\phi)]|

Using the triangle inequality, this is upper bounded as

|(ϕθ,λ)(ϕ,λ)|\displaystyle\mathinner{\!\left\lvert\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda)\right\rvert} |𝔼[0(s,ϕθ)0(s,ϕ)]|+\displaystyle\leq\mathinner{\!\left\lvert\mathbb{E}[\ell_{0}(s,\phi_{\theta})-\ell_{0}(s,\phi)]\right\rvert}+ (63)
|𝔼[λ(𝒈(s,ϕθ)𝒈(s,ϕ))]|\displaystyle\mathinner{\!\left\lvert\mathbb{E}[{\lambda}^{\top}(\bm{g}(s,\phi_{\theta})-\bm{g}(s,\phi))]\right\rvert}

Using Hölder’s inequality, we can create a further upper bound

|(ϕθ,λ)(ϕ,λ)|\displaystyle\mathinner{\!\left\lvert\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda)\right\rvert} 𝔼[0(s,ϕθ)0(s,ϕ)]+\displaystyle\leq\mathinner{\!\left\lVert\mathbb{E}[\ell_{0}(s,\phi_{\theta})-\ell_{0}(s,\phi)]\right\rVert}_{\infty}+ (64)
λ1𝔼[𝒈(s,ϕθ)𝒈(s,ϕ)]\displaystyle\mathinner{\!\left\lVert\lambda\right\rVert}_{1}\mathinner{\!\left\lVert\mathbb{E}[\bm{g}(s,\phi_{\theta})-\bm{g}(s,\phi)]\right\rVert}_{\infty}

where the infinity norm of the scalar value 𝔼[0(s,ϕθ)0(s,ϕ)]\mathbb{E}[\ell_{0}(s,\phi_{\theta})-\ell_{0}(s,\phi)] is the same as its absolute value. Using the fact that the infinity norm is convex and Jensen’s inequality, we can move the norm inside of the expectation.

|(ϕθ,λ)(ϕ,λ)|\displaystyle\mathinner{\!\left\lvert\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda)\right\rvert} 𝔼0(s,ϕθ)0(s,ϕ)+\displaystyle\leq\mathbb{E}\mathinner{\!\left\lVert\ell_{0}(s,\phi_{\theta})-\ell_{0}(s,\phi)\right\rVert}_{\infty}+ (65)
λ1𝔼𝒈(s,ϕθ)𝒈(s,ϕ)\displaystyle\mathinner{\!\left\lVert\lambda\right\rVert}_{1}\mathbb{E}\mathinner{\!\left\lVert\bm{g}(s,\phi_{\theta})-\bm{g}(s,\phi)\right\rVert}_{\infty}

By expectation-wise Lipschitz-continuity of both the loss and constraint functions,

|(ϕθ,λ)(ϕ,λ)|\displaystyle\mathinner{\!\left\lvert\mathcal{L}(\phi_{\theta},\lambda)-\mathcal{L}(\phi,\lambda)\right\rvert} L𝔼ϕθϕ+\displaystyle\leq L\mathbb{E}\mathinner{\!\left\lVert\phi_{\theta}-\phi\right\rVert}_{\infty}+ (66)
λ1L𝔼ϕθϕ\displaystyle\quad\mathinner{\!\left\lVert\lambda\right\rVert}_{1}L\mathbb{E}\mathinner{\!\left\lVert\phi_{\theta}-\phi\right\rVert}_{\infty}
=(1+λ1)L𝔼ϕθϕ.\displaystyle=(1+\mathinner{\!\left\lVert\lambda\right\rVert}_{1})L\mathbb{E}\mathinner{\!\left\lVert\phi_{\theta}-\phi\right\rVert}_{\infty}.

Combining (61) with (66), we obtain

D_{\theta}^{\star} \leq \max_{\lambda\geq 0}\left[\mathcal{L}(\phi,\lambda)+(\lVert\lambda\rVert_{1}+1)L\min_{\theta\in\Theta}\mathbb{E}\lVert\phi_{\theta}-\phi\rVert_{\infty}\right]   (67)

Since $\Phi_{\theta}$ is an $\epsilon$-universal approximation for $\Phi$, we can write $\min_{\theta\in\Theta}\mathbb{E}\lVert\phi_{\theta}-\phi\rVert_{\infty}\leq\epsilon$. This further reduces (67) to

Dθmaxλ0[(ϕ,λ)+(λ1+1)Lϵ].D_{\theta}^{\star}\leq\max_{\lambda\geq 0}[\mathcal{L}(\phi,\lambda)+(\mathinner{\!\left\lVert\lambda\right\rVert}_{1}+1)L\epsilon]. (68)

Note that (68) is true for all $\phi$. In particular, it must also be true for the $\phi$ that minimizes the inner value, i.e.

Dθ\displaystyle D_{\theta}^{\star} maxλ0minϕΦ[(ϕ,λ)+(λ1+1)Lϵ]\displaystyle\leq\max_{\lambda\geq 0}\min_{\phi\in\Phi}[\mathcal{L}(\phi,\lambda)+(\mathinner{\!\left\lVert\lambda\right\rVert}_{1}+1)L\epsilon] (69)
=Lϵ+maxλ0minϕΦ[𝔼[0(s,ϕ)]+λ(𝔼[𝒈(s,ϕ)]+𝟏Lϵ)]\displaystyle=L\epsilon+\max_{\lambda\geq 0}\min_{\phi\in\Phi}[\mathbb{E}[\ell_{0}(s,\phi)]+\lambda^{\top}(\mathbb{E}[\bm{g}(s,\phi)]+\bm{1}L\epsilon)]

The second half of (69) is actually the solution to the dual problem (14). The primal problem is reproduced here for reference,

PLϵ=\displaystyle P_{L\epsilon}^{\star}= minϕΦ𝔼[0(s,ϕ)]\displaystyle\min_{\phi\in\Phi}\mathbb{E}[\ell_{0}(s,\phi)]
s.t. 𝔼[𝒈(s,ϕ)]+𝟏Lϵ0.\displaystyle\text{s.t. }\mathbb{E}[\bm{g}(s,\phi)]+\bm{1}L\epsilon\leq 0.

That is to say, $D_{\theta}^{\star}\leq L\epsilon+D_{L\epsilon}^{\star}$. The primal problem (13) is a perturbed version of (4), where all the constraints are tightened by $L\epsilon$. There exists a relationship between the solutions of (13) and (4) from [25, Eq. 5.57]. Treating (4) as the perturbed version of (13) (one that relaxes the constraints by $L\epsilon$), the relationship between the two solutions is

PPLϵλLϵ𝟏Lϵ.P^{\star}\geq P_{L\epsilon}^{\star}-{\lambda_{L\epsilon}^{\star}}^{\top}\bm{1}L\epsilon. (70)

Since both (4) and (13) have zero duality gap by Theorem 2, this is the same as

DDLϵλLϵ1Lϵ.D^{\star}\geq D_{L\epsilon}^{\star}-\mathinner{\!\left\lVert\lambda_{L\epsilon}^{\star}\right\rVert}_{1}L\epsilon. (71)

Combining (71) with the fact that DθLϵ+DLϵD_{\theta}^{\star}\leq L\epsilon+D_{L\epsilon}^{\star}, the following bound is obtained.

DθD+Lϵ(λLϵ1+1)D_{\theta}^{\star}\leq D^{\star}+L\epsilon(\mathinner{\!\left\lVert\lambda_{L\epsilon}^{\star}\right\rVert}_{1}+1) (72)

This gives us the desired upper bound.

Appendix D Proof of Lemma 1

We start by establishing an upper bound on the difference |DθDN|\mathinner{\!\left\lvert D_{\theta}^{\star}-D_{N}^{\star}\right\rvert}. By definition of the Dual Problems (19) and (10), it follows that Dθ=minθθ(θ,λθ)D_{\theta}^{\star}=\min_{\theta}\mathcal{L}_{\theta}(\theta,\lambda_{\theta}^{\star}) and DN=minθN(θ,λN)D_{N}^{\star}=\min_{\theta}\mathcal{L}_{N}(\theta,\lambda_{N}^{\star}). Hence we have that

DθDN\displaystyle D_{\theta}^{\star}-D_{N}^{\star} =minθθ(θ,λθ)minθN(θ,λN)\displaystyle=\min_{\theta}\mathcal{L}_{\theta}(\theta,\lambda_{\theta}^{\star})-\min_{\theta}\mathcal{L}_{N}(\theta,\lambda_{N}^{\star})
minθθ(θ,λθ)minθN(θ,λθ),\displaystyle\leq\min_{\theta}\mathcal{L}_{\theta}(\theta,\lambda_{\theta}^{\star})-\min_{\theta}\mathcal{L}_{N}(\theta,\lambda_{\theta}^{\star}), (73)

where the inequality follows from the fact that $\lambda_{N}^{\star}$ maximizes the function $d_{N}(\lambda)=\min_{\theta}\mathcal{L}_{N}(\theta,\lambda)$. Thus, any other $\lambda$, in particular $\lambda_{\theta}^{\star}$, results in a value that is less than or equal to $D_{N}^{\star}$. Let $\theta_{N}^{\star}(\lambda)=\operatorname*{arg\,min}_{\theta\in\Theta}\mathcal{L}_{N}(\theta,\lambda)$. Substituting this definition and using the definition of the minimum, (73) can be further upper bounded by

DθDNθ(θN(λθ),λθ)N(θN(λθ),λθ).D_{\theta}^{\star}-D_{N}^{\star}\leq\mathcal{L}_{\theta}(\theta_{N}^{\star}(\lambda_{\theta}^{\star}),\lambda_{\theta}^{\star})-\mathcal{L}_{N}(\theta_{N}^{\star}(\lambda_{\theta}^{\star}),\lambda_{\theta}^{\star}). (74)

We now set out to establish a similar lower bound. Analogous to the step for the upper bound, we can use the definition of the dual problem to lower bound $D_{\theta}^{\star}-D_{N}^{\star}$

DθDNminθθ(θ,λN)minθN(θ,λN).D_{\theta}^{\star}-D_{N}^{\star}\geq\min_{\theta}\mathcal{L}_{\theta}(\theta,\lambda_{N}^{\star})-\min_{\theta}\mathcal{L}_{N}(\theta,\lambda_{N}^{\star}). (75)

Likewise, define $\theta^{\star}(\lambda)=\operatorname*{arg\,min}_{\theta\in\Theta}\mathcal{L}_{\theta}(\theta,\lambda)$ and further lower bound (75) as

DθDNθ(θ(λN),λN)N(θ(λN),λN).D_{\theta}^{\star}-D_{N}^{\star}\geq\mathcal{L}_{\theta}(\theta^{\star}(\lambda_{N}^{\star}),\lambda_{N}^{\star})-\mathcal{L}_{N}(\theta^{\star}(\lambda_{N}^{\star}),\lambda_{N}^{\star}). (76)

Using the upper and lower bounds for DθDND_{\theta}^{\star}-D_{N}^{\star} derived in (74) and (76), we can bound the absolute value of the difference by the maximum of the absolute values of the lower and upper bounds

|DθDN|max{|θ(θ(λN),λN)N(θ(λN),λN)||θ(θN(λθ),λθ)N(θN(λθ),λθ)||D_{\theta}^{\star}-D_{N}^{\star}|\leq\max\begin{cases}|\mathcal{L}_{\theta}(\theta^{\star}(\lambda_{N}^{\star}),\lambda_{N}^{\star})-\mathcal{L}_{N}(\theta^{\star}(\lambda_{N}^{\star}),\lambda_{N}^{\star})|\\ |\mathcal{L}_{\theta}(\theta_{N}^{\star}(\lambda_{\theta}^{\star}),\lambda_{\theta}^{\star})-\mathcal{L}_{N}(\theta_{N}^{\star}(\lambda_{\theta}^{\star}),\lambda_{\theta}^{\star})|\end{cases} (77)

A conservative upper bound for the previous expression is

\lvert D_{\theta}^{\star}-D_{N}^{\star}\rvert \leq \sup_{\theta\in\Theta,\lambda\in\mathbb{R}_{+}^{K}}\left\lvert\mathcal{L}_{\theta}(\theta,\lambda)-\mathcal{L}_{N}(\theta,\lambda)\right\rvert   (78)

This completes the proof of the Lemma.

References

  • [1] M. Morari and J. H. Lee, “Model predictive control: past, present and future,” Computers & Chemical Engineering, vol. 23, no. 4-5, pp. 667–682, 1999.
  • [2] K. J. Åström and R. M. Murray, Feedback systems: an introduction for scientists and engineers. Princeton university press, 2010.
  • [3] G. Hoffmann, S. Waslander, and C. Tomlin, “Quadrotor helicopter trajectory tracking control,” in AIAA guidance, navigation and control conference and exhibit, p. 7410.
  • [4] D. Mellinger and V. Kumar, “Minimum snap trajectory generation and control for quadrotors,” in 2011 IEEE International Conference on Robotics and Automation, 2011, pp. 2520–2525.
  • [5] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning, 2015, pp. 1889–1897.
  • [6] A. Khan, V. Kumar, and A. Ribeiro, “Learning sample-efficient target reaching for mobile robots,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3080–3087.
  • [7] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” in 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017, pp. 3389–3396.
  • [8] I. R. Petersen, V. A. Ugrinovskii, and A. V. Savkin, Robust Control Design Using H-infinity Methods. Springer Science & Business Media, 2012.
  • [9] S. J. Qin, “An overview of subspace identification,” Computers & chemical engineering, vol. 30, no. 10-12, pp. 1502–1513, 2006.
  • [10] M. Viberg, “Subspace-based methods for the identification of linear time-invariant systems,” Automatica, vol. 31, no. 12, pp. 1835–1851, 1995.
  • [11] G. Pillonetto and G. De Nicolao, “A new kernel-based approach for linear system identification,” Automatica, vol. 46, no. 1, pp. 81–93, 2010.
  • [12] J. Bruls, C. T. Chou, B. Haverkamp, and M. Verhaegen, “Linear and non-linear system identification using separable least-squares.”
  • [13] L. Ljung, “System identification,” Wiley encyclopedia of electrical and electronics engineering, pp. 1–19, 1999.
  • [14] R. Pasquier and I. F. Smith, “Robust system identification and model predictions in the presence of systematic uncertainty,” Advanced Engineering Informatics, vol. 29, no. 4, pp. 1096–1109, 2015.
  • [15] S. Ozer and H. Zorlu, “Identification of bilinear systems using differential evolution algorithm,” Sadhana, vol. 36, no. 3, pp. 281–292, 2011.
  • [16] S. A. Billings, “Identification of nonlinear systems–a survey,” in IEE Proceedings D (Control Theory and Applications), vol. 127, no. 6. IET, 1980, pp. 272–285.
  • [17] M. Schoukens and K. Tiels, “Identification of block-oriented nonlinear systems starting from linear approximations: A survey,” Automatica, vol. 85, pp. 272–292, 2017.
  • [18] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472.
  • [19] D. Nguyen-Tuong, J. R. Peters, and M. Seeger, “Local gaussian process regression for real time online model learning,” in Advances in neural information processing systems, 2009, pp. 1193–1200.
  • [20] S. Levine and V. Koltun, “Guided policy search,” in International Conference on Machine Learning, 2013, pp. 1–9.
  • [21] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559–7566.
  • [22] C. Zhang, A. Khan, S. Paternain, and A. Ribeiro, “Learning sufficiently accurate models,” in 2020 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2020.
  • [23] P. Sanchez-Cuevas, G. Heredia, and A. Ollero, “Characterization of the aerodynamic ground effect and its influence in multirotor control,” International Journal of Aerospace Engineering, vol. 2017, 2017.
  • [24] K. Hornik, M. Stinchcombe, H. White et al., “Multilayer feedforward networks are universal approximators.”
  • [25] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization. Cambridge university press, 2004.
  • [26] B. Sriperumbudur, K. Fukumizu, and G. Lanckriet, “On the relation between universality, characteristic kernels and rkhs embedding of measures,” in Proceedings of the thirteenth international conference on artificial intelligence and statistics, 2010, pp. 773–780.
  • [27] A. Nedić and A. Ozdaglar, “Subgradient methods for saddle-point problems,” Journal of optimization theory and applications, vol. 142, no. 1, pp. 205–228, 2009.
  • [28] V. N. Vapnik, “An overview of statistical learning theory,” IEEE transactions on neural networks, vol. 10, no. 5, pp. 988–999, 1999.
  • [29] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky, “Spectrally-normalized margin bounds for neural networks,” in Advances in Neural Information Processing Systems, 2017, pp. 6240–6249.
  • [30] A. Ribeiro, “Optimal resource allocation in wireless communication and networking,” EURASIP Journal on Wireless Communications and Networking, vol. 2012, no. 1, p. 272, 2012.
  • [31] M. Eisen, C. Zhang, L. F. Chamon, D. D. Lee, and A. Ribeiro, “Learning optimal resource allocations in wireless systems,” arXiv preprint arXiv:1807.08088, 2018.
  • [32] T. Goldstein, B. O’Donoghue, S. Setzer, and R. Baraniuk, “Fast alternating direction optimization methods,” SIAM Journal on Imaging Sciences, vol. 7, no. 3, pp. 1588–1623, 2014.
  • [33] D. M. Gay, M. L. Overton, and M. H. Wright, “A primal-dual interior method for nonconvex nonlinear programming,” in Advances in nonlinear programming. Springer, 1998, pp. 31–56.
  • [34] P. E. Gill and D. P. Robinson, “A primal-dual augmented lagrangian,” Computational Optimization and Applications, vol. 51, no. 1, pp. 1–25, 2012.
  • [35] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International conference on machine learning, 2013, pp. 1139–1147.
  • [36] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [37] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1026–1034.
  • [38] D. H. Kraft, “A software package for sequential quadratic programming,” 1988.
  • [39] E. Todorov, T. Erez, and Y. Tassa, “Mujoco: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
  • [40] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq et al., “Deepmind control suite,” arXiv preprint arXiv:1801.00690, 2018.