
Stabilizing Dynamical Systems via Policy Gradient Methods

Juan C. Perdomo, University of California, Berkeley, jcperdomo@berkeley.edu    Jack Umenberger, MIT, umnbrgr@mit.edu    Max Simchowitz, MIT, msimchow@csail.mit.edu
Abstract

Stabilizing an unknown control system is one of the most fundamental problems in control systems engineering. In this paper, we provide a simple, model-free algorithm for stabilizing fully observed dynamical systems. While model-free methods have become increasingly popular in practice due to their simplicity and flexibility, stabilization via direct policy search has received surprisingly little attention. Our algorithm proceeds by solving a series of discounted LQR problems, where the discount factor is gradually increased. We prove that this method efficiently recovers a stabilizing controller for linear systems, and for smooth, nonlinear systems within a neighborhood of their equilibria. Our approach overcomes a significant limitation of prior work, namely the need for a pre-given stabilizing control policy. We empirically evaluate the effectiveness of our approach on common control benchmarks.

1 Introduction

Stabilizing an unknown control system is one of the most fundamental problems in control systems engineering. A wide variety of tasks, from maintaining a dynamical system around a desired equilibrium point to tracking a reference signal (e.g., a pilot's input to a plane), can be recast in terms of stability. More generally, synthesizing an initial stabilizing controller is often a necessary first step towards solving more complex tasks, such as adaptive or robust control design (Sontag, 1999, 2009).

In this work, we consider the problem of finding a stabilizing controller for an unknown dynamical system via direct policy search methods. We introduce a simple procedure based on policy gradients which provably stabilizes a dynamical system around an equilibrium point. Our algorithm only requires access to a simulator which can return rollouts of the system under different control policies, and it can efficiently stabilize both linear and smooth, nonlinear systems.

Relative to model-based approaches, model-free procedures, such as policy gradients, have two key advantages: they are conceptually simple to implement, and they are easily adaptable; that is, the same method can be applied in a wide variety of domains without much regard to the intricacies of the underlying dynamics. Due to their simplicity and flexibility, direct policy search methods have become increasingly popular amongst practitioners, especially in settings with complex, nonlinear dynamics which may be challenging to model. In particular, they have served as the main workhorse for recent breakthroughs in reinforcement learning and control (Silver et al., 2016; Mnih et al., 2015; Andrychowicz et al., 2020).

Despite their popularity amongst practitioners, model-free approaches for continuous control have only recently started to receive attention from the theory community (Fazel et al., 2018; Kakade et al., 2020; Tu and Recht, 2019). While these analyses have begun to map out the computational and statistical tradeoffs that emerge in choosing between model-based and model-free approaches, they all share a common assumption: that the unknown dynamical system in question is stable, or that an initial stabilizing controller is known. As such, they do not address the perhaps more basic question: how do we arrive at a stabilizing controller in the first place?

1.1 Contributions

We establish a reduction from stabilizing an unknown dynamical system to solving a series of discounted, infinite-horizon LQR problems via policy gradients, for which no knowledge of an initial stable controller is needed. Our approach, which we call discount annealing, gradually increases the discount factor and yields a control policy which is near optimal for the undiscounted LQR objective. To the best of our knowledge, our algorithm is the first model-free procedure shown to provably stabilize unknown dynamical systems, thereby solving an open problem from Fazel et al. (2018).

We begin by studying linear, time-invariant dynamical systems with full state observation and assume access to inexact cost and gradient evaluations of the discounted, infinite-horizon LQR cost of a state-feedback controller K. Previous analyses (e.g., Fazel et al., 2018) establish how such evaluations can be implemented with access to (finitely many, finite-horizon) trajectories sampled from a simulator. We show that our method recovers the controller K_{\star} which is the optimal solution of the undiscounted LQR problem in a bounded number of iterations, up to optimization and simulator error. The stability of the resulting K_{\star} is guaranteed by known stability margin results for LQR. In short, we prove the following guarantee:

Theorem 1 (informal).

For linear systems, discount annealing returns a stabilizing state-feedback controller which is also near-optimal for the LQR problem. It uses at most polynomially many \varepsilon-inexact gradient and cost evaluations, where the tolerance \varepsilon also depends polynomially on the relevant problem parameters.

Since both the number of queries and error tolerance are polynomial, discount annealing can be efficiently implemented using at most polynomially many samples from a simulator.

Furthermore, our results extend to smooth, nonlinear dynamical systems. Given access to a simulator that can return damped system rollouts, we show that our algorithm finds a controller that attains near-optimal LQR cost for the Jacobian linearization of the nonlinear dynamics at the equilibrium. We then show that this controller stabilizes the nonlinear system within a neighborhood of its equilibrium.

Theorem 2 (informal).

Discount annealing returns a state-feedback controller which is exponentially stabilizing for smooth, nonlinear systems within a neighborhood of their equilibrium, again using only polynomially many samples drawn from a simulator.

In each case, the algorithm returns a near-optimal solution K to the relevant LQR problem (or a local approximation thereof). Hence, the stability properties of K are, in theory, no better than those of the optimal LQR controller K_{\star}. Importantly, the latter may have worse stability guarantees than the optimal solution of a corresponding robust control objective (e.g., \mathcal{H}_{\infty} synthesis). Nevertheless, we focus on the LQR subroutine in the interest of simplicity and clarity, and in order to leverage prior analyses of model-free methods for LQR. Extending our procedure to robust-control objectives is an exciting direction for future work.

Lastly, while our theoretical analysis only guarantees that the resulting controller will be stabilizing within a small neighborhood of the equilibrium, our simulations on nonlinear systems, such as the nonlinear cartpole, illustrate that discount annealing produces controllers that are competitive with established robust control procedures, such as \mathcal{H}_{\infty} synthesis, without requiring any knowledge of the underlying dynamics.

1.2 Related work

Given its central importance to the field, stabilization of unknown and uncertain dynamical systems has received extensive attention within the controls literature. We review some of the relevant literature and point the reader towards classical texts for a more comprehensive treatment (Sontag, 2009; Sastry and Bodson, 2011; Zhou et al., 1996; Callier and Desoer, 2012; Zhou and Doyle, 1998).

Model-based approaches. Model-based methods construct approximate system models in order to synthesize stabilizing control policies. Traditional analyses consider stabilization of both linear and nonlinear dynamical systems in the asymptotic limit of sufficient data (Sastry and Bodson, 2011; Sastry, 2013). More recent, non-asymptotic studies have focused almost entirely on linear systems, where the controller is generated using data from multiple independent trajectories (Fiechter, 1997; Dean et al., 2019; Faradonbeh et al., 2018a, 2019). Assuming the model is known, stabilizing policies may also be synthesized via convex optimization (Prajna et al., 2004) by combining a ‘dual Lyapunov theorem’ (Rantzer, 2001) with sum-of-squares programming (Parrilo, 2003). Relative to these analyses, our focus is on strengthening the theoretical foundations of model-free procedures and establishing rigorous guarantees that policy gradient methods can also be used to generate stabilizing controllers.

Online control. Online control studies the problem of adaptively fine-tuning the performance of an already-stabilizing control policy on a single trajectory (Dean et al., 2018; Faradonbeh et al., 2018b; Cohen et al., 2019; Mania et al., 2019; Simchowitz and Foster, 2020; Hazan et al., 2020; Simchowitz et al., 2020; Kakade et al., 2020). Though early papers in this direction consider systems without pre-given stabilizing controllers (Abbasi-Yadkori and Szepesvári, 2011), their guarantees degrade exponentially in the system dimension (a penalty ultimately shown to be unavoidable by Chen and Hazan (2021)). Rather than fine-tuning an already stabilizing controller, we focus on the more basic problem of finding a controller which is stabilizing in the first place, and allow for the use of multiple independent trajectories.

Model-free approaches. Model-free approaches eschew trying to approximate the underlying dynamics and instead directly search over the space of control policies. The landmark paper of Fazel et al. (2018) proves that, despite the non-convexity of the problem, direct policy search on the infinite-horizon LQR objective efficiently converges to the globally optimal policy, assuming the search is initialized at an already stabilizing controller. Fazel et al. (2018) pose the synthesis of this initial stabilizing controller via policy gradients as an open problem; one that we solve in this work.

Following this result, there have been a large number of works studying policy gradient procedures in continuous control; see for example Feng and Lavaei (2020); Malik et al. (2019); Mohammadi et al. (2020, 2021); Zhang et al. (2021), just to name a few. Relative to our analysis, these papers consider questions of policy fine-tuning, derivative-free methods, and robust (or distributed) control which are important, yet somewhat orthogonal to the stabilization question considered herein. The recent analysis by Lamperski (2020) is perhaps the most closely related piece of prior work. It proposes a model-free, off-policy algorithm for computing a stabilizing controller for deterministic LQR systems. Much like discount annealing, the algorithm also works by alternating between policy optimization (in their case by a closed-form policy improvement step based on the Riccati update) and increasing a damping factor. However, whereas we provide precise finite-time convergence guarantees to a stabilizing controller for both linear and nonlinear systems, the guarantees in Lamperski (2020) are entirely asymptotic and restricted to linear systems. Furthermore, we pay special attention to quantifying the various error tolerances in the gradient and cost queries to ensure that the algorithm can be efficiently implemented in finite samples.

1.3 Background on stability of dynamical systems

Before introducing our results, we first review some basic concepts and definitions regarding stability of dynamical systems. In this paper, we study discrete-time, noiseless, time-invariant dynamical systems with states \mathbf{x}_{t}\in\mathbb{R}^{d_{x}} and control inputs \mathbf{u}_{t}\in\mathbb{R}^{d_{u}}. In particular, given an initial state \mathbf{x}_{0}, the dynamics evolve according to \mathbf{x}_{t+1}=G(\mathbf{x}_{t},\mathbf{u}_{t}), where G:\mathbb{R}^{d_{x}}\times\mathbb{R}^{d_{u}}\rightarrow\mathbb{R}^{d_{x}} is a state transition map. An equilibrium point of a dynamical system is a state \mathbf{x}_{\star}\in\mathbb{R}^{d_{x}} such that G(\mathbf{x}_{\star},0)=\mathbf{x}_{\star}. As per convention, we assume that the origin \mathbf{x}_{\star}=0 is the desired equilibrium point around which we wish to stabilize the system.

This paper restricts its attention to static state-feedback policies of the form \mathbf{u}_{t}=K\mathbf{x}_{t} for a fixed matrix K\in\mathbb{R}^{d_{u}\times d_{x}}. Abusing notation slightly, we conflate the matrix K with its induced policy. Our aim is to find a policy K which is exponentially stabilizing around the equilibrium point.

Time-invariant, linear systems, where G(\mathbf{x},\mathbf{u})=A\mathbf{x}+B\mathbf{u}, are stabilizable if and only if there exists a K such that A+BK is a stable matrix (Callier and Desoer, 2012); that is, if \rho(A+BK)<1, where \rho(X) denotes the spectral radius, or the largest eigenvalue magnitude, of a matrix X. For general nonlinear systems, our goal is to find controllers which satisfy the following general, quantitative definition of exponential stability (e.g., Chapter 5.2 in Sastry (2013)). Throughout, \|\cdot\| denotes the Euclidean norm.

Definition 1.1.

A controller K is (m,\alpha)-exponentially stable for dynamics G if there exist constants m,\alpha>0 such that, if inputs are chosen according to \mathbf{u}_{t}=K\mathbf{x}_{t}, the sequence of states \mathbf{x}_{t+1}=G(\mathbf{x}_{t},\mathbf{u}_{t}) satisfies

\|\mathbf{x}_{t}\|\leq m\cdot\exp(-\alpha\cdot t)\,\|\mathbf{x}_{0}\|. (1.1)

Likewise, K is (m,\alpha)-exponentially stable on radius r>0 if (1.1) holds for all \mathbf{x}_{0} such that \|\mathbf{x}_{0}\|\leq r.

For linear systems, a controller K is stabilizing if and only if it is stable over the entire state space; however, the restriction to stabilization over a particular radius is in general needed for nonlinear systems. Our approach for stabilizing nonlinear systems relies on analyzing their Jacobian linearization about the origin equilibrium. Given a continuously differentiable transition operator G, the local dynamics can be approximated by the Jacobian linearization (A_{\mathrm{jac}},B_{\mathrm{jac}}) of G about the zero equilibrium; that is,

A_{\mathrm{jac}}:=\nabla_{\mathbf{x}}G(\mathbf{x},\mathbf{u})\big|_{(\mathbf{x},\mathbf{u})=(0,0)},\quad B_{\mathrm{jac}}:=\nabla_{\mathbf{u}}G(\mathbf{x},\mathbf{u})\big|_{(\mathbf{x},\mathbf{u})=(0,0)}. (1.2)

In particular, for \mathbf{x} and \mathbf{u} sufficiently small, G(\mathbf{x},\mathbf{u})=A_{\mathrm{jac}}\mathbf{x}+B_{\mathrm{jac}}\mathbf{u}+f_{\mathrm{nl}}(\mathbf{x},\mathbf{u}), where f_{\mathrm{nl}}(\mathbf{x},\mathbf{u}) is a nonlinear remainder from the Taylor expansion of G. To ensure stabilization via state feedback is feasible, we assume throughout our presentation that the linearized dynamics (A_{\mathrm{jac}},B_{\mathrm{jac}}) are stabilizable.
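To make the linearization concrete, the short sketch below computes (A_{\mathrm{jac}},B_{\mathrm{jac}}) for a user-supplied transition map via automatic differentiation in PyTorch, matching the tooling used in our experiments. The pendulum-style map G_nl here is a hypothetical stand-in, not one of the systems studied later.

import torch

def G_nl(x, u):
    # Hypothetical smooth nonlinear map with the origin as an equilibrium;
    # any continuously differentiable G: R^{dx} x R^{du} -> R^{dx} works here.
    dt = 0.05
    return torch.stack([x[0] + dt * x[1],
                        x[1] + dt * (torch.sin(x[0]) + u[0])])

def jacobian_linearization(G, dx, du):
    """Return (A_jac, B_jac): the Jacobians of G at the equilibrium (0, 0)."""
    x0 = torch.zeros(dx)
    u0 = torch.zeros(du)
    A_jac, B_jac = torch.autograd.functional.jacobian(G, (x0, u0))
    return A_jac, B_jac

A_jac, B_jac = jacobian_linearization(G_nl, dx=2, du=1)
print(A_jac, B_jac)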

2 Stabilizing Linear Dynamical Systems

We now present our main results establishing how our algorithm, discount annealing, provably stabilizes linear dynamical systems via a reduction to direct policy search methods. We begin with the following preliminaries on the Linear Quadratic Regulator (LQR).

Definition 2.1 (LQR Objective).

For a given starting state \mathbf{x}, we define the LQR problem J_{\mathrm{lin}} with discount factor \gamma\in(0,1], dynamics matrices (A,B), and state-feedback controller K as

J_{\mathrm{lin}}(K\mid\mathbf{x},\gamma,A,B):=\sum_{t=0}^{\infty}\gamma^{t}\left(\mathbf{x}_{t}^{\top}Q\mathbf{x}_{t}+\mathbf{u}_{t}^{\top}R\mathbf{u}_{t}\right)\quad\text{s.t.}\quad\mathbf{u}_{t}=K\mathbf{x}_{t},\;\mathbf{x}_{t+1}=A\mathbf{x}_{t}+B\mathbf{u}_{t},\;\mathbf{x}_{0}=\mathbf{x}.

Here, \mathbf{x}_{t}\in\mathbb{R}^{d_{x}}, \mathbf{u}_{t}\in\mathbb{R}^{d_{u}}, and Q,R are positive definite matrices. Slightly overloading notation, we define

J_{\mathrm{lin}}(K\mid\gamma,A,B):=\mathbb{E}_{\mathbf{x}_{0}\sim\sqrt{d_{x}}\cdot\mathcal{S}^{d_{x}-1}}\left[J_{\mathrm{lin}}(K\mid\mathbf{x},\gamma,A,B)\right],

to be the same as the problem above, but where the initial state is now drawn from the uniform distribution over the sphere in \mathbb{R}^{d_{x}} of radius \sqrt{d_{x}}. (This scaling is chosen so that the initial state distribution has identity covariance, and yields cost equivalent to \mathbf{x}_{0}\sim\mathcal{N}(0,I).)
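For intuition, when the damped closed-loop matrix \sqrt{\gamma}(A+BK) is stable, this cost can be computed in closed form by solving a damped discrete-time Lyapunov equation and taking the trace of its solution (cf. Eq. 2.4 and Fact A.2 in Appendix A). The sketch below does exactly this with SciPy on an illustrative, hypothetical system; it is a sanity-check utility, not part of the algorithm itself.

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def lqr_cost(K, A, B, Q, R, gamma):
    """J_lin(K | gamma, A, B) = tr[P_{K,gamma}], where P_{K,gamma} solves
    P = Q + K^T R K + gamma (A + BK)^T P (A + BK); returns inf when the
    gamma-damped closed loop is unstable."""
    Acl = np.sqrt(gamma) * (A + B @ K)
    if np.max(np.abs(np.linalg.eigvals(Acl))) >= 1.0:
        return np.inf
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    return float(np.trace(P))

# Illustrative (hypothetical) system: A is unstable but (A, B) is stabilizable.
A = np.array([[1.2, 0.5], [0.0, 0.9]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K0 = np.zeros((1, 2))
print(lqr_cost(K0, A, B, Q, R, gamma=0.5))   # finite: sqrt(0.5)*A is stable
print(lqr_cost(K0, A, B, Q, R, gamma=1.0))   # inf: rho(A) = 1.2 > 1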

To simplify our presentation, we adopt the shorthand J_{\mathrm{lin}}(K\mid\gamma):=J_{\mathrm{lin}}(K\mid\gamma,A,B) in cases where the system dynamics (A,B) are understood from context. Furthermore, we assume that (A,B) is stabilizable and that \lambda_{\min}(Q),\lambda_{\min}(R)\geq 1. It is a well-known fact that K_{\star,\gamma}:=\arg\min_{K}J_{\mathrm{lin}}(K\mid\gamma,A,B) achieves the minimum LQR cost over all possible control laws. We begin our analysis with the observation that the discounted LQR problem is equivalent to the undiscounted LQR problem with damped dynamics matrices. (This lemma is folklore within the controls community; see, e.g., Lamperski (2020).)

Lemma 2.1.

For all controllers K such that J_{\mathrm{lin}}(K\mid\gamma,A,B)<\infty,

J_{\mathrm{lin}}(K\mid\gamma,A,B)=J_{\mathrm{lin}}(K\mid 1,\sqrt{\gamma}\cdot A,\sqrt{\gamma}\cdot B).

From this equivalence, it follows from basic facts about LQR that a controller K satisfies J_{\mathrm{lin}}(K\mid\gamma,A,B)<\infty if and only if \sqrt{\gamma}(A+BK) is stable. Consequently, for \gamma<\rho(A)^{-2}, the zero controller stabilizes the damped dynamics, and one can solve the discounted LQR problem via direct policy search initialized at K=0 (Fazel et al., 2018). At this point, one may wonder whether the solution to this highly discounted problem yields a controller which stabilizes the undiscounted system. If this were true, running policy gradients (defined in Eq. 2.1) to convergence on a single discounted LQR problem would suffice to find a stabilizing controller.

K_{t+1}=K_{t}-\eta\nabla_{K}J_{\mathrm{lin}}(K_{t}\mid\gamma),\quad K_{0}=0,\quad\eta>0 (2.1)

Unfortunately, the following proposition shows that this is not the case.

Proposition 2.2 (Impossibility of Reward Shaping).

Fix A=\mathrm{diag}(0,2). For any positive definite cost matrices Q,R and discount factor \gamma such that \sqrt{\gamma}A is stable, there exists a matrix B such that (A,B) is controllable (and thus stabilizable), yet the optimal controller K_{\star,\gamma}:=\arg\min_{K}J(K\mid\gamma,A,B) on the discounted problem is such that A+BK_{\star,\gamma} is unstable.

Discount Annealing
Initialize: objective J(\cdot\mid\cdot), \gamma_{0}\in(0,\rho(A)^{-2}), K_{0}\leftarrow 0, Q\leftarrow I, R\leftarrow I.
For t=0,1,\dots:
  1. If \gamma_{t}=1, run policy gradients once more as in Step 2, break, and return the resulting K^{\prime}.
  2. Using policy gradients (see Eq. 2.1) initialized at K_{t}, find K^{\prime} such that
     J_{\mathrm{lin}}(K^{\prime}\mid\gamma_{t})-\min_{K}J_{\mathrm{lin}}(K\mid\gamma_{t})\leq d_{x}. (2.2)
  3. Update the initial controller: K_{t+1}\leftarrow K^{\prime}.
  4. Using binary or random search, find a discount factor \gamma^{\prime}\in[\gamma_{t},1] such that
     2.5\,J(K_{t+1}\mid\gamma_{t})\leq J(K_{t+1}\mid\gamma^{\prime})\leq 8\,J(K_{t+1}\mid\gamma_{t}). (2.3)
  5. Update the discount factor: \gamma_{t+1}\leftarrow\gamma^{\prime}.

Figure 1: Discount annealing algorithm. The procedure is identical for both linear and nonlinear systems. For linear systems we initialize J=J_{\mathrm{lin}}(\cdot\mid\gamma_{0}), and for nonlinear systems J=J_{\mathrm{nl}}(\cdot\mid\gamma_{0},r_{\star}), where r_{\star} is chosen as in Theorem 2. See Theorem 1, Theorem 2, and Appendix C for details regarding policy gradients and binary (or random) search. The constants above are chosen for convenience; any constants c_{1},c_{2} such that 1<c_{1}<c_{2} suffice.
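For readers who prefer code, a minimal sketch of the loop in Figure 1 against a black-box cost/gradient oracle follows. The step size, gradient-step budget, and search tolerances are placeholder values, not the quantities analyzed in Theorems 1 and 2.

import numpy as np

def discount_annealing(J, grad_J, dx, du, gamma0, eta=1e-3, n_pg_steps=500):
    """Sketch of the loop in Figure 1. J(K, gamma) returns the (possibly inexact)
    discounted cost of K and grad_J(K, gamma) its gradient, e.g. via a simulator."""
    K, gamma = np.zeros((du, dx)), gamma0
    while True:
        # Steps 1-2: policy gradients on the gamma-discounted objective.
        for _ in range(n_pg_steps):
            K = K - eta * grad_J(K, gamma)
        if gamma >= 1.0:
            return K                          # final (undiscounted) solve: done.
        # Step 4: search for gamma' in [gamma, 1] with
        #   2.5 * J(K | gamma) <= J(K | gamma') <= 8 * J(K | gamma).
        base = J(K, gamma)
        if J(K, 1.0) <= 8 * base:             # cost at gamma = 1 already small
            gamma = 1.0
            continue
        lo, hi = gamma, 1.0
        for _ in range(60):                   # bisection on the discount factor
            mid = 0.5 * (lo + hi)
            c = J(K, mid)
            if c < 2.5 * base:
                lo = mid
            elif c > 8 * base:
                hi = mid
            else:
                lo = mid
                break
        gamma = lo                            # Step 5: anneal the discount factor.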

We now describe the discount annealing procedure for linear systems (Figure 1), which provably recovers a stabilizing controller K. For simplicity, we present the algorithm assuming access to noisy, bounded cost and gradient evaluations which satisfy the following definition. Employing standard arguments from Fazel et al. (2018) and Flaxman et al. (2005), we illustrate in Appendix C how these evaluations can be efficiently implemented using polynomially many samples drawn from a simulator.

Definition 2.2 (Gradient and Cost Queries).

Given an error parameter \varepsilon>0 and a function J:\mathbb{R}^{d}\rightarrow\mathbb{R}, \varepsilon\text{-}\mathtt{Grad}(J,\mathbf{z}) returns a vector \widetilde{\nabla} such that \|\widetilde{\nabla}-\nabla J(\mathbf{z})\|_{\mathrm{F}}\leq\varepsilon. Similarly, \varepsilon\text{-}\mathtt{Eval}(J,\mathbf{z},c) returns a scalar v such that |v-\min\{J(\mathbf{z}),c\}|\leq\varepsilon.
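One standard way to implement \varepsilon\text{-}\mathtt{Grad} from cost evaluations alone is the single-point smoothing estimator of Flaxman et al. (2005), averaged over many random perturbations. The sketch below illustrates the idea for a generic cost oracle J, with a baseline subtracted purely for variance reduction; the precise estimator and error analysis we rely on are given in Appendix C.

import numpy as np

def eps_grad(J, K, sigma=0.01, n_samples=10000, seed=0):
    """Smoothing-based estimate of grad J(K), in the spirit of Flaxman et al. (2005):
    E[(d / sigma) * (J(K + sigma*U) - J(K)) * U] equals the gradient of a smoothed
    version of J, where U is uniform on the unit (Frobenius) sphere in R^{du x dx}."""
    rng = np.random.default_rng(seed)
    d = K.size
    g = np.zeros_like(K, dtype=float)
    base = J(K)                               # baseline term, zero-mean contribution
    for _ in range(n_samples):
        U = rng.standard_normal(K.shape)
        U /= np.linalg.norm(U)                # uniform direction on the unit sphere
        g += (d / sigma) * (J(K + sigma * U) - base) * U
    return g / n_samples

# Tiny illustration with a hypothetical quadratic cost: grad should be 2*K.
J = lambda K: float(np.sum(K ** 2))
print(eps_grad(J, np.array([[1.0, -2.0]])))   # approximately [[2, -4]]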

The procedure leverages the equivalence (Lemma 2.1) between discounted costs and damped dynamics for LQR, and the consequence that the zero controller is stabilizing if we choose \gamma_{0} sufficiently small. Hence, for this discount factor, we may apply policy gradients initialized at the zero controller in order to recover a controller K_{1} which is near-optimal for the \gamma_{0}-discounted objective.

Our key insight is that, due to known stability margins for LQR controllers, K_{1} is stabilizing for the \gamma_{1}-discounted dynamics for some discount factor \gamma_{1}>(1+c)\gamma_{0}, where c is a small constant that has a uniform lower bound. Therefore, K_{1} has finite cost on the \gamma_{1}-discounted problem, so that we may again use policy gradients initialized at K_{1} to compute a near-optimal controller K_{2} for this larger discount factor. By iterating, we have that \gamma_{t}\geq(1+c)^{t}\gamma_{0} and can increase the discount factor up to 1, yielding a near-optimal stabilizing controller for the undiscounted LQR objective.

The rate at which we can increase the discount factors \gamma_{t} depends on certain properties of the (unknown) dynamical system. Therefore, we opt for binary search to compute the desired \gamma in the absence of system knowledge. This yields the following guarantee, which we state in terms of properties of the matrix P_{\star}, the optimal value function for the undiscounted LQR problem, which satisfies \min_{K}J_{\mathrm{lin}}(K\mid 1)=\mathrm{tr}[P_{\star}] (see Appendix A for further details).

Theorem 1 (Linear Systems).

Let M_{\mathrm{lin}}:=\max\{16\,\mathrm{tr}[P_{\star}],J_{\mathrm{lin}}(K_{0}\mid\gamma_{0})\}. The following statements are true regarding the discount annealing algorithm when run on linear dynamical systems:

  a) Discount annealing returns a controller \widehat{K} which is (\sqrt{2\,\mathrm{tr}[P_{\star}]},(4\,\mathrm{tr}[P_{\star}])^{-1})-exponentially stable.

  b) If \gamma_{0}<1, the algorithm is guaranteed to halt whenever t is greater than 64\,\mathrm{tr}[P_{\star}]^{4}\log(1/\gamma_{0}).

Furthermore, at each iteration t:

  c) Policy gradients as defined in Eq. 2.1 achieves the guarantee in Eq. 2.2 using only \mathrm{poly}(M_{\mathrm{lin}},\|A\|_{\mathrm{op}},\|B\|_{\mathrm{op}}) many queries to \varepsilon\text{-}\mathtt{Grad}(\cdot,J_{\mathrm{lin}}(\cdot\mid\gamma)), as long as \varepsilon is less than \mathrm{poly}(M_{\mathrm{lin}}^{-1},\|A\|_{\mathrm{op}}^{-1},\|B\|_{\mathrm{op}}^{-1}).

  d) The noisy binary search algorithm (see Figure 2) returns a discount factor \gamma^{\prime} satisfying Eq. 2.3 using at most \lceil 4\log(\mathrm{tr}[P_{\star}])\rceil+10 many queries to \varepsilon\text{-}\mathtt{Eval}(\cdot,J_{\mathrm{lin}}(\cdot\mid\gamma)) for \varepsilon=0.1\,d_{x}.

We remark that since \varepsilon need only be polynomially small in the relevant problem parameters, each call to \varepsilon\text{-}\mathtt{Grad} and \varepsilon\text{-}\mathtt{Eval} can be carried out using only polynomially many samples from a simulator which returns finite-horizon system trajectories under various control policies. We make this claim formal in Appendix C.

Proof.

We prove part b) of the theorem and defer the proofs of the remaining parts to Appendix A. Define P_{K,\gamma} to be the solution to the discrete-time Lyapunov equation; that is, for \sqrt{\gamma}(A+BK) stable, P_{K,\gamma} solves:

P_{K,\gamma}:=Q+K^{\top}RK+\gamma(A+BK)^{\top}P_{K,\gamma}(A+BK). (2.4)

Using this notation, P_{\star}=P_{K_{\star},1} is the solution to the above Lyapunov equation with \gamma=1. The key step of the proof is Proposition A.4, which uses Lyapunov theory to verify the following: given the current discount factor \gamma_{t}, an idealized discount factor \gamma_{t+1}^{\prime} defined by

\gamma_{t+1}^{\prime}:=\left(\frac{1}{8\|P_{K_{t+1},\gamma_{t}}\|_{\mathrm{op}}^{4}}+1\right)^{2}\gamma_{t},

satisfies J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t+1}^{\prime})=\mathrm{tr}[P_{K_{t+1},\gamma_{t+1}^{\prime}}]\leq 2\,\mathrm{tr}[P_{K_{t+1},\gamma_{t}}]=2\,J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}). Since the control cost is non-decreasing in \gamma, the binary search update in Step 4 ensures that the actual \gamma_{t+1} also satisfies

\gamma_{t+1}\geq\left(\frac{1}{8\|P_{K_{t+1},\gamma_{t}}\|_{\mathrm{op}}^{4}}+1\right)^{2}\gamma_{t}\geq\left(\frac{1}{128\,\mathrm{tr}[P_{\star}]^{4}}+1\right)^{2}\gamma_{t}.

The following calculation (which uses d_{x}\leq\mathrm{tr}[P_{\star}] for \lambda_{\min}(Q)\geq 1) justifies the second inequality above:

\|P_{K_{t+1},\gamma_{t}}\|_{\mathrm{op}}^{4}\leq\mathrm{tr}[P_{K_{t+1},\gamma_{t}}]^{4}=J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})^{4}\leq(\min_{K}J_{\mathrm{lin}}(K\mid\gamma_{t})+d_{x})^{4}\leq 16\,\mathrm{tr}[P_{\star}]^{4}.

Therefore, \gamma_{t}\geq(1/(128\,\mathrm{tr}[P_{\star}]^{4})+1)^{2t}\gamma_{0}. The precise bound follows from taking logs of both sides and using the numerical inequality \log(1+x)\leq x to simplify the denominator.
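The discount-factor update at the heart of this argument is easy to probe numerically: for a controller K that is stabilizing for the \gamma_{t}-damped dynamics, inflating \gamma_{t} by the factor (1/(8\|P_{K,\gamma_{t}}\|_{\mathrm{op}}^{4})+1)^{2} should at most double \mathrm{tr}[P]. The sketch below checks this on a small hypothetical system; it is an illustration of the inequality, not part of the proof.

import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def P_of(K, A, B, Q, R, gamma):
    """P_{K,gamma}: solution of Eq. 2.4 for the gamma-damped closed loop."""
    Acl = np.sqrt(gamma) * (A + B @ K)
    return solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)

# Hypothetical system; K is stabilizing for the gamma_t-damped dynamics.
A = np.array([[1.2, 0.5], [0.0, 0.9]]); B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K = np.array([[-0.3, -0.6]]); gamma_t = 0.6

P_t = P_of(K, A, B, Q, R, gamma_t)
inflate = (1.0 / (8.0 * np.linalg.norm(P_t, 2) ** 4) + 1.0) ** 2
gamma_next = min(1.0, inflate * gamma_t)
P_next = P_of(K, A, B, Q, R, gamma_next)
print(np.trace(P_next) <= 2 * np.trace(P_t))   # expected: True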

3 Stabilizing Nonlinear Dynamical Systems

We now extend the guarantees of the discount annealing algorithm to smooth, nonlinear systems. Whereas our study of linear systems explicitly leveraged the equivalence of discounted costs and damped dynamics, our analysis for nonlinear systems requires access to system rollouts under damped dynamics, since the previous equivalence between discounting and damping breaks down in nonlinear settings.

More specifically, in this section we assume access to a simulator which, given a controller K, returns trajectories generated according to \mathbf{x}_{t+1}=\sqrt{\gamma}\,G_{\mathrm{nl}}(\mathbf{x}_{t},K\mathbf{x}_{t}) for any damping factor \gamma\in(0,1], where G_{\mathrm{nl}} is the transition operator for the nonlinear system. While such trajectories may be infeasible to generate on a physical system, we believe these are reasonable to consider when dynamics are represented using software simulators, as is often the case in practice (Lewis et al., 2003; Peng et al., 2018).

The discount annealing algorithm for nonlinear systems is almost identical to the algorithm for linear systems. It again works by repeatedly solving a series of quadratic-cost objectives on the nonlinear dynamics, as defined below, and progressively increasing the damping factor \gamma.

Definition 3.1 (Nonlinear Objective).

For a state-feedback controller K:\mathbb{R}^{d_{x}}\rightarrow\mathbb{R}^{d_{u}}, damping factor \gamma\in(0,1], and an initial state \mathbf{x}, we define:

J_{\mathrm{nl}}(K\mid\mathbf{x},\gamma):=\sum_{t=0}^{\infty}\mathbf{x}_{t}^{\top}Q\mathbf{x}_{t}+\mathbf{u}_{t}^{\top}R\mathbf{u}_{t} (3.1)
\quad\text{s.t.}\quad\mathbf{u}_{t}=K\mathbf{x}_{t},\quad\mathbf{x}_{t+1}=\sqrt{\gamma}\cdot G_{\mathrm{nl}}(\mathbf{x}_{t},\mathbf{u}_{t}),\quad\mathbf{x}_{0}=\mathbf{x}. (3.2)

Overloading notation as before, we let J_{\mathrm{nl}}(K\mid\gamma,r):=\mathbb{E}_{\mathbf{x}\sim r\cdot\mathcal{S}^{d_{x}-1}}\left[J_{\mathrm{nl}}(K\mid\mathbf{x},\gamma)\right]\cdot\frac{d_{x}}{r^{2}}.

The normalization by d_{x}/r^{2} above is chosen so that the nonlinear objective coincides with the LQR objective when G_{\mathrm{nl}} is in fact linear. Relative to the linear case, the only algorithmic difference for nonlinear systems is that we introduce an extra parameter r which determines the radius for the initial state distribution. As established in Theorem 2, this parameter must be chosen small enough to ensure that discount annealing succeeds. Our analysis pertains to dynamics which satisfy the following smoothness definition.

Assumption 1 (Local Smoothness).

The transition map G_{\mathrm{nl}} is continuously differentiable. Furthermore, there exist r_{\mathrm{nl}},\beta_{\mathrm{nl}}>0 such that for all (\mathbf{x},\mathbf{u})\in\mathbb{R}^{d_{x}+d_{u}} with \|\mathbf{x}\|+\|\mathbf{u}\|\leq r_{\mathrm{nl}},

\|\nabla_{\mathbf{x},\mathbf{u}}G_{\mathrm{nl}}(\mathbf{x},\mathbf{u})-\nabla_{\mathbf{x},\mathbf{u}}G_{\mathrm{nl}}(0,0)\|_{\mathrm{op}}\leq\beta_{\mathrm{nl}}(\|\mathbf{x}\|+\|\mathbf{u}\|).

For simplicity, we assume \beta_{\mathrm{nl}}\geq 1 and r_{\mathrm{nl}}\leq 1. Using Assumption 1, we can apply Taylor’s theorem to rewrite G_{\mathrm{nl}} as its Jacobian linearization around the equilibrium point, plus a nonlinear remainder term.

Lemma 3.1.

If G_{\mathrm{nl}} satisfies Assumption 1, then for all \mathbf{x},\mathbf{u} for which \|\mathbf{x}\|+\|\mathbf{u}\|\leq r_{\mathrm{nl}},

G_{\mathrm{nl}}(\mathbf{x},\mathbf{u})=A_{\mathrm{jac}}\mathbf{x}+B_{\mathrm{jac}}\mathbf{u}+f_{\mathrm{nl}}(\mathbf{x},\mathbf{u}), (3.3)

where \|f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\|\leq\beta_{\mathrm{nl}}(\|\mathbf{x}\|^{2}+\|\mathbf{u}\|^{2}), \|\nabla f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\|\leq\beta_{\mathrm{nl}}(\|\mathbf{x}\|+\|\mathbf{u}\|), and where (A_{\mathrm{jac}},B_{\mathrm{jac}}) are the system’s Jacobian linearization matrices defined in Eq. 1.2.

Rather than trying to directly understand the behavior of stabilization procedures on the nonlinear system, the key insight of our nonlinear analysis is that we can reason about the performance of a state-feedback controller on the nonlinear system via its behavior on the system’s Jacobian linearization. In particular, the following lemma establishes that any controller which achieves finite discounted LQR cost for the Jacobian linearization is guaranteed to be exponentially stabilizing on the damped nonlinear system for initial states that are small enough. Throughout the remainder of this section, we define J_{\mathrm{lin}}(\cdot\mid\gamma):=J_{\mathrm{lin}}(\cdot\mid\gamma,A_{\mathrm{jac}},B_{\mathrm{jac}}) as the LQR objective from Definition 2.1 with (A,B)=(A_{\mathrm{jac}},B_{\mathrm{jac}}).

Lemma 3.2 (Restatement of Lemma B.2).

Suppose that \mathcal{C}_{J}=J_{\mathrm{lin}}(K\mid\gamma)<\infty. Then K is (\mathcal{C}_{J}^{1/2},(4\mathcal{C}_{J})^{-1})-exponentially stable on the damped system \sqrt{\gamma}\,G_{\mathrm{nl}} over radius r=r_{\mathrm{nl}}/(\beta_{\mathrm{nl}}\mathcal{C}_{J}^{3/2}).

The second main building block of our nonlinear analysis is the observation that if the dynamics are locally smooth around the equilibrium point, then by Lemma 3.1, decreasing the radius r of the initial state distribution \mathbf{x}_{0}\sim r\cdot\mathcal{S}^{d_{x}-1} reduces the magnitude of the nonlinear remainder term f_{\mathrm{nl}}. Hence, the nonlinear system smoothly approximates its Jacobian linearization. More precisely, we establish that the differences in gradients and costs between J_{\mathrm{nl}}(K\mid\gamma,r) and J_{\mathrm{lin}}(K\mid\gamma) decrease linearly with the radius r.

Proposition 3.3.

Assume J_{\mathrm{lin}}(K\mid\gamma)<\infty. Then, for P_{K,\gamma} defined as in Eq. 2.4:

  a) If r\leq\frac{r_{\mathrm{nl}}}{2\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}, then \big|J_{\mathrm{nl}}(K\mid\gamma,r)-J_{\mathrm{lin}}(K\mid\gamma)\big|\leq 8d_{x}\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}\cdot r.

  b) If r\leq\frac{1}{12\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{5/2}}, then \|\nabla_{K}J_{\mathrm{nl}}(K\mid\gamma,r)-\nabla_{K}J_{\mathrm{lin}}(K\mid\gamma)\|_{\mathrm{F}}\leq 48d_{x}\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})\|P_{K,\gamma}\|_{\mathrm{op}}^{7}\cdot r.

Lastly, because policy gradients on linear dynamical systems is robust to inexact gradient queries, we show that for r sufficiently small, running policy gradients on J_{\mathrm{nl}} converges to a controller whose performance is close to that of the optimal controller for the LQR problem with dynamics matrices (A_{\mathrm{jac}},B_{\mathrm{jac}}). As noted previously, we can then use Lemma 3.2 to translate the performance of the optimal LQR controller for the Jacobian linearization into an exponential stability guarantee for the nonlinear dynamics. Using these insights, we establish the following theorem regarding discount annealing for nonlinear dynamics.

Theorem 2 (Nonlinear Systems).

Let M_{\mathrm{nl}}:=\max\{21\,\mathrm{tr}[P_{\star}],J_{\mathrm{lin}}(K_{0}\mid\gamma_{0})\}. The following statements are true regarding the discount annealing algorithm for nonlinear dynamical systems when r_{\star} is less than a fixed quantity that is \mathrm{poly}(1/M_{\mathrm{nl}},1/\|A\|_{\mathrm{op}},1/\|B\|_{\mathrm{op}},r_{\mathrm{nl}}/\beta_{\mathrm{nl}}):

  a) Discount annealing returns a controller \widehat{K} which is (\sqrt{2\,\mathrm{tr}[P_{\star}]},(8\,\mathrm{tr}[P_{\star}])^{-1})-exponentially stable over a radius r=r_{\mathrm{nl}}/(8\beta_{\mathrm{nl}}\mathrm{tr}[P_{\star}]^{2}).

  b) If \gamma_{0}<1, the algorithm is guaranteed to halt whenever t is greater than 64\,\mathrm{tr}[P_{\star}]^{4}\log(1/\gamma_{0}).

Furthermore, at each iteration t:

  c) Policy gradients achieves the guarantee in Eq. 2.2 using only \mathrm{poly}(M_{\mathrm{nl}},\|A\|_{\mathrm{op}},\|B\|_{\mathrm{op}}) many queries to \varepsilon\text{-}\mathtt{Grad}(\cdot,J_{\mathrm{nl}}(\cdot\mid\gamma)), as long as \varepsilon is less than some fixed polynomial \mathrm{poly}(M_{\mathrm{nl}}^{-1},\|A\|_{\mathrm{op}}^{-1},\|B\|_{\mathrm{op}}^{-1}).

  d) Let c_{0} denote a universal constant. With probability 1-\delta, the noisy random search algorithm (see Figure 2) returns a discount factor \gamma^{\prime} satisfying Eq. 2.3 using at most c_{0}\cdot\mathrm{tr}[P_{\star}]^{4}\log(1/\delta) queries to \varepsilon\text{-}\mathtt{Eval}(\cdot,J_{\mathrm{nl}}(\cdot\mid\gamma,r_{\star})) for \varepsilon=0.01\,d_{x}.

We note that while our theorem only guarantees that the controller is stabilizing within a polynomially small neighborhood of the equilibrium, in experiments we find that the resulting controller successfully stabilizes the dynamics for a wide range of initial conditions. In the linear case, we leveraged the monotonicity of the LQR cost in \gamma to search for discount factors using binary search; this monotonicity breaks down for nonlinear systems, so we instead analyze a random search algorithm to simplify the analysis.

4 Experiments

In this section, we evaluate the ability of the discount annealing algorithm to stabilize a simulated nonlinear system. Specifically, we consider the familiar cart-pole, with d_{x}=4 (positions and velocities of the cart and pole) and d_{u}=1 (horizontal force applied to the cart). The goal is to stabilize the system with the pole in the unstable ‘upright’ equilibrium position. For further details, including the precise dynamics, see Section D.1. The system was simulated in discrete time with a simple forward Euler discretization, i.e., \mathbf{x}_{t+1}=\mathbf{x}_{t}+T_{s}\dot{\mathbf{x}}_{t}, where \dot{\mathbf{x}}_{t} is given by the continuous-time dynamics and T_{s}=0.05 (20 Hz). Simulations were carried out in PyTorch (Paszke et al., 2019) and run on a single GPU.

Setup. The discount annealing algorithm of Figure 1 was implemented as follows. In place of the true infinite-horizon discounted cost J_{\mathrm{nl}}(K\mid\gamma,r) in Eq. C.4, we use a finite-horizon, finite-sample Monte Carlo approximation as described in Appendix C,

J_{\mathrm{nl}}(K\mid\gamma)\approx\frac{1}{N}\sum_{i=1}^{N}J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}^{(i)}),\quad\mathbf{x}^{(i)}\sim r\cdot\mathcal{S}^{d_{x}-1}.

Here, J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x})=\sum_{t=0}^{H-1}\mathbf{x}_{t}^{\top}Q\mathbf{x}_{t}+\mathbf{u}_{t}^{\top}R\mathbf{u}_{t} is the length-H, finite-horizon cost of a controller K in which the states evolve according to the \sqrt{\gamma}-damped dynamics from Eq. C.5 and \mathbf{u}_{t}=K\mathbf{x}_{t}. We used N=5000 and H=1000 in our experiments. For the cost function, we used Q=T_{s}\cdot I and R=T_{s}. We compute unbiased approximations of the gradients using automatic differentiation on the finite-horizon objective J_{\mathrm{nl}}^{(H)}.
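A minimal PyTorch sketch of this finite-horizon estimator and its autodiff gradient is given below. The dynamics function is a placeholder standing in for the discretized cart-pole, and the horizon and sample counts are reduced from the values reported above.

import torch

def G_nl(x, u):
    # Placeholder dynamics standing in for the discretized cart-pole map.
    dt = 0.05
    return torch.stack([x[0] + dt * x[1], x[1] + dt * (torch.sin(x[0]) + u[0])])

def finite_horizon_cost(K, gamma, r, Q, R, H=400, N=64):
    """Monte Carlo, finite-horizon estimate of J_nl(K | gamma, r)."""
    dx = Q.shape[0]
    total = 0.0
    for _ in range(N):
        x = torch.randn(dx)
        x = r * x / x.norm()                    # x_0 uniform on the sphere r*S^{dx-1}
        cost = 0.0
        for _ in range(H):
            u = K @ x
            cost = cost + x @ Q @ x + u @ R @ u
            x = (gamma ** 0.5) * G_nl(x, u)     # sqrt(gamma)-damped rollout
        total = total + cost
    return total / N * dx / r ** 2              # normalization from Definition 3.1

K = torch.zeros(1, 2, requires_grad=True)
J = finite_horizon_cost(K, gamma=0.3, r=0.1, Q=torch.eye(2), R=torch.eye(1))
J.backward()                                    # autodiff gradient with respect to K
print(K.grad)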

Table 1: Final region of attraction radius r_{\text{roa}} as a function of the initial state radius r used during training (discount annealing). We report the [min, max] values of r_{\text{roa}} over 5 independent trials. The optimal LQR policy for the linearized system achieved r_{\text{roa}}=0.703 when applied to the nonlinear system. We also synthesized an \mathcal{H}_{\infty}-optimal controller for the linearized dynamics, which achieved r_{\text{roa}}=0.506.
r               0.1              0.3              0.5              0.6              0.7
r_{\text{roa}}  [0.702, 0.704]   [0.711, 0.713]   [0.727, 0.734]   [0.731, 0.744]   [0.769, 0.777]

Instead of using SGD updates for policy gradients, we use Adam (Kingma and Ba, 2014) with a learning rate of \eta=0.01/r. Furthermore, we replace the policy gradient termination criterion in Step 2 (Eq. 2.2) by halting after a fixed number (M=200) of gradient descent steps. We wish to emphasize that the hyperparameters (N,H,\eta,M) were not optimized for performance. In particular, for r=0.1, we found that as few as M=40 iterations of policy gradient and horizons as short as H=400 were sufficient. Finally, we used an initial discount factor \gamma_{0}=0.9\cdot\|A_{\mathrm{jac}}\|_{2}^{-2}, where A_{\mathrm{jac}} denotes the linearization of the (discrete-time) cart-pole about the vertical equilibrium.
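Putting the pieces together, the inner policy-optimization step then amounts to a short Adam loop. The sketch below assumes the finite_horizon_cost function from the previous snippet and mirrors the (unoptimized) hyperparameters reported here; it is an illustration rather than the exact experiment code.

import torch

def policy_gradient_phase(K, gamma, r, Q, R, n_steps=200, lr=None):
    """Inner loop of discount annealing for one damping factor gamma.
    Assumes finite_horizon_cost as sketched above; eta = 0.01 / r as reported."""
    lr = 0.01 / r if lr is None else lr
    opt = torch.optim.Adam([K], lr=lr)
    for _ in range(n_steps):            # fixed gradient budget replaces the Eq. 2.2 test
        opt.zero_grad()
        J = finite_horizon_cost(K, gamma, r, Q, R)
        J.backward()
        opt.step()
    return K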

Results. We now discuss the performance of the algorithm, focusing on three main properties of interest: i) the number of iterations of discount annealing required to find a stabilizing controller (that is, to increase \gamma_{t} to 1), ii) the maximum radius r of the ball of initial conditions \mathbf{x}_{0}\sim r\cdot\mathcal{S}^{d_{x}-1} for which discount annealing succeeds at stabilizing the system, and iii) the radius r_{\text{roa}} of the largest ball contained within the region of attraction (ROA) for the policy returned by discount annealing. Although the true ROA (the set of all initial conditions from which the closed-loop system converges asymptotically to the equilibrium point) is not necessarily shaped like a ball (the system is more sensitive to perturbations in the position and velocity of the pole than of the cart), we use the term region of attraction radius to refer to the radius of the largest ball contained in the ROA.

Concerning (i), discount annealing reliably returned a stabilizing policy in at most 9 iterations. Specifically, over 5 independent trials for each initial radius r\in\{0.1,0.3,0.5,0.7\} (giving 20 independent trials in total), the algorithm never required more than 9 iterations to return a stabilizing policy.

Concerning (ii), discount annealing reliably stabilized the system for r\leq 0.7. For r\approx 0.75, we observed trials in which the state of the damped system (\gamma<1) diverged to infinity. For such a rollout, the gradient of the cost is not well defined, and policy gradient is unable to improve the policy, which prevents discount annealing from finding a stabilizing policy.

Concerning (iii), in Table 1 we report the radius r_{\text{roa}} of the region of attraction of the final controller returned by discount annealing as a function of the training radius r. We make the following observations. Foremost, the policy returned by discount annealing extends the radius of the ROA beyond the radius used during training, i.e., r_{\text{roa}}>r. Moreover, for each r>0.1, the r_{\text{roa}} achieved by discount annealing is greater than the r_{\text{roa}}=0.703 achieved by the exact optimal LQR controller and the r_{\text{roa}}=0.506 achieved by the exact optimal \mathcal{H}_{\infty} controller for the system’s Jacobian linearization (see Table 1). (The \mathcal{H}_{\infty}-optimal controller mitigates the effect of worst-case additive state disturbances on the cost; cf. Section D.2 for details.)

One may hypothesize that this is due to the fact that discount annealing operates directly on the true nonlinear dynamics, whereas the other baselines (LQR and \mathcal{H}_{\infty} control) find the optimal controller for an idealized linearization of the dynamics. Indeed, there is evidence to support this hypothesis. In Figure 3, presented in Appendix D, we plot the error \|K_{\text{pg}}^{\star}(\gamma_{t})-K_{\text{lin}}^{\star}(\gamma_{t})\|_{\mathrm{F}} between the policy K_{\text{pg}}^{\star}(\gamma_{t}) returned by policy gradients and the optimal LQR policy K_{\text{lin}}^{\star}(\gamma_{t}) for the (damped) linearized system, as a function of the discount factor \gamma_{t} used in each iteration of the discount annealing algorithm. For a small training radius, such as r=0.05, K_{\text{pg}}^{\star}(\gamma_{t})\approx K_{\text{lin}}^{\star}(\gamma_{t}) for all \gamma_{t}. However, for larger radii (e.g., r=0.7), we see that \|K_{\text{pg}}^{\star}(\gamma_{t})-K_{\text{lin}}^{\star}(\gamma_{t})\|_{\mathrm{F}} steadily increases as \gamma_{t} increases.

That is, as discount annealing increases the discount factor \gamma and the closed-loop trajectories explore regions of the state space where the dynamics are increasingly nonlinear, K_{\text{pg}}^{\star} begins to diverge from K_{\text{lin}}^{\star}. Moreover, at the conclusion of discount annealing, K_{\text{pg}}^{\star}(1) achieves a lower cost, namely [15.2, 15.4] vs. [16.5, 16.8] (here [a,b] denotes [min, max] over 5 trials), and a larger r_{\text{roa}}, namely [0.769, 0.777] vs. [0.702, 0.703], than K_{\text{lin}}^{\star}(1), suggesting that the method has indeed adapted to the nonlinearity of the system. Similar observations about the behavior of controllers fine-tuned via policy gradient methods are predicted by the theoretical results of Qu et al. (2020).

5 Discussion

This work illustrates how one can provably stabilize a broad class of dynamical systems via a simple model-free procedure based on policy gradients. In line with the simplicity and flexibility that have made model-free methods so popular in practice, our algorithm works under relatively weak assumptions and with little knowledge of the underlying dynamics. Furthermore, we solve an open problem from previous work (Fazel et al., 2018) and take a step towards placing model-free methods on more solid theoretical footing. We believe that our results raise a number of interesting questions and directions for future work.

In particular, our theoretical analysis states that discount annealing returns a controller whose stability properties are similar to those of the optimal LQR controller for the system’s Jacobian linearization. We were therefore quite surprised when, in experiments, the resulting controller had a significantly larger region of attraction radius than the exact optimal LQR and \mathcal{H}_{\infty} controllers for the linearization of the dynamics. It is an interesting and important direction for future work to gain a better understanding of exactly when and how model-free procedures adapt to the nonlinearities of the system and improve upon these model-based baselines. Furthermore, for our analysis of nonlinear systems, we require access to damped system trajectories. It would be valuable to understand whether this is indeed necessary or whether our analysis could be extended to work without access to damped trajectories.

As a final note, in this work we reduce the problem of stabilizing dynamical systems to running policy gradients on a discounted LQR objective. This choice of reducing to LQR was made in part for simplicity and to leverage previous analyses. However, it is possible that overall performance could be improved if, rather than reducing to LQR, we instead ran a model-free method that directly optimizes a robust control objective (which explicitly deals with uncertainty in the system dynamics). We believe that understanding these tradeoffs in objectives and their relevant sample complexities is an interesting avenue for future inquiry.

Acknowledgments

We would like to thank Peter Bartlett for many helpful conversations and comments throughout the course of this project, and Russ Tedrake for support with the numerical experiments. JCP is supported by an NSF Graduate Research Fellowship. MS is supported by an Open Philanthropy Fellowship grant. JU is supported by the National Science Foundation, Award No. EFMA-1830901, and the Department of the Navy, Office of Naval Research, Award No. N00014-18-1-2210.

References

  • Abbasi-Yadkori and Szepesvári [2011] Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In Proceedings of the 24th Annual Conference on Learning Theory, pages 1–26, 2011.
  • Andrychowicz et al. [2020] OpenAI: Marcin Andrychowicz, Bowen Baker, Maciek Chociej, Rafal Jozefowicz, Bob McGrew, Jakub Pachocki, Arthur Petron, Matthias Plappert, Glenn Powell, Alex Ray, et al. Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1):3–20, 2020.
  • Callier and Desoer [2012] Frank M Callier and Charles A Desoer. Linear system theory. Springer Science & Business Media, 2012.
  • Chen and Hazan [2021] Xinyi Chen and Elad Hazan. Black-box control for linear dynamical systems. In Conference on Learning Theory, pages 1114–1143. PMLR, 2021.
  • Cohen et al. [2019] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only sqrt(t) regret. In International Conference on Machine Learning, pages 1300–1309. PMLR, 2019.
  • Dean et al. [2018] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
  • Dean et al. [2019] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. On the sample complexity of the linear quadratic regulator. Foundations of Computational Mathematics, pages 1–47, 2019.
  • Faradonbeh et al. [2018a] Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Finite-time adaptive stabilization of linear systems. IEEE Transactions on Automatic Control, 64(8):3498–3505, 2018a.
  • Faradonbeh et al. [2018b] Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. On optimality of adaptive linear-quadratic regulators. arXiv preprint arXiv:1806.10749, 2018b.
  • Faradonbeh et al. [2019] Mohamad Kazem Shirani Faradonbeh, Ambuj Tewari, and George Michailidis. Randomized algorithms for data-driven stabilization of stochastic linear systems. In 2019 IEEE Data Science Workshop (DSW), pages 170–174. IEEE, 2019.
  • Fazel et al. [2018] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476. PMLR, 2018.
  • Feng and Lavaei [2020] Han Feng and Javad Lavaei. Escaping locally optimal decentralized control polices via damping. In 2020 American Control Conference (ACC), pages 50–57. IEEE, 2020.
  • Fiechter [1997] Claude-Nicolas Fiechter. Pac adaptive control of linear systems. In Proceedings of the tenth annual conference on Computational learning theory, pages 72–80, 1997.
  • Flaxman et al. [2005] Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’05, page 385–394, USA, 2005. Society for Industrial and Applied Mathematics.
  • Hazan et al. [2020] Elad Hazan, Sham Kakade, and Karan Singh. The nonstochastic control problem. In Algorithmic Learning Theory, pages 408–421. PMLR, 2020.
  • Kakade et al. [2020] Sham Kakade, Akshay Krishnamurthy, Kendall Lowrey, Motoya Ohnishi, and Wen Sun. Information theoretic regret bounds for online nonlinear control. In Advances in Neural Information Processing Systems, volume 33, pages 15312–15325, 2020.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Lamperski [2020] Andrew Lamperski. Computing stabilizing linear controllers via policy iteration. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1902–1907. IEEE, 2020.
  • Laurent and Massart [2000] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.
  • Lewis et al. [2003] Frank L Lewis, Darren M Dawson, and Chaouki T Abdallah. Robot manipulator control: theory and practice. CRC Press, 2003.
  • Malik et al. [2019] Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter Bartlett, and Martin Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2916–2925. PMLR, 2019.
  • Mania et al. [2019] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. In Advances in Neural Information Processing Systems, pages 10154–10164, 2019.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mohammadi et al. [2020] Hesameddin Mohammadi, Mahdi Soltanolkotabi, and Mihailo R Jovanović. On the linear convergence of random search for discrete-time lqr. IEEE Control Systems Letters, 5(3):989–994, 2020.
  • Mohammadi et al. [2021] Hesameddin Mohammadi, Armin Zare, Mahdi Soltanolkotabi, and Mihailo R Jovanovic. Convergence and sample complexity of gradient methods for the model-free linear quadratic regulator problem. IEEE Transactions on Automatic Control, 2021.
  • Parrilo [2003] Pablo A Parrilo. Semidefinite programming relaxations for semialgebraic problems. Mathematical programming, 96(2):293–320, 2003.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019.
  • Peng et al. [2018] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 3803–3810, 2018. doi: 10.1109/ICRA.2018.8460528.
  • Perdomo et al. [2021] Juan C. Perdomo, Max Simchowitz, Alekh Agarwal, and Peter L. Bartlett. Towards a dimension-free understanding of adaptive linear control. In Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3681–3770. PMLR, 2021.
  • Prajna et al. [2004] Stephen Prajna, Pablo A Parrilo, and Anders Rantzer. Nonlinear control synthesis by convex optimization. IEEE Transactions on Automatic Control, 49(2):310–314, 2004.
  • Qu et al. [2020] Guannan Qu, Chenkai Yu, Steven Low, and Adam Wierman. Combining model-based and model-free methods for nonlinear control: A provably convergent policy gradient approach. arXiv preprint arXiv:2006.07476, 2020.
  • Rantzer [2001] Anders Rantzer. A dual to lyapunov’s stability theorem. Systems & Control Letters, 42(3):161–168, 2001.
  • Sastry [2013] Shankar Sastry. Nonlinear systems: analysis, stability, and control, volume 10. Springer Science & Business Media, 2013.
  • Sastry and Bodson [2011] Shankar Sastry and Marc Bodson. Adaptive control: stability, convergence and robustness. Courier Corporation, 2011.
  • Silver et al. [2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Simchowitz and Foster [2020] Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
  • Simchowitz et al. [2020] Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR, 2020.
  • Sontag [1999] Eduardo D Sontag. Nonlinear feedback stabilization revisited. In Dynamical Systems, Control, Coding, Computer Vision, pages 223–262. Springer, 1999.
  • Sontag [2009] Eduardo D. Sontag. Stability and feedback stabilization. Springer New York, New York, NY, 2009. ISBN 978-0-387-30440-3. doi: 10.1007/978-0-387-30440-3_515.
  • Tu and Recht [2019] Stephen Tu and Benjamin Recht. The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory, pages 3036–3083. PMLR, 2019.
  • Zhang et al. [2021] Kaiqing Zhang, Xiangyuan Zhang, Bin Hu, and Tamer Başar. Derivative-free policy optimization for risk-sensitive and robust control design: implicit regularization and sample complexity. arXiv preprint arXiv:2101.01041, 2021.
  • Zhou and Doyle [1998] Kemin Zhou and John Comstock Doyle. Essentials of robust control, volume 104. Prentice Hall, 1998.
  • Zhou et al. [1996] Kemin Zhou, John Comstock Doyle, Keith Glover, et al. Robust and optimal control, volume 40. Prentice Hall, New Jersey, 1996.

Appendix A Deferred Proofs and Analysis for the Linear Setting

Preliminaries on Linear Quadratic Control.

The cost of a state-feedback controller Jlin(Kγ)J_{\mathrm{lin}}(K\mid\gamma) is intimately related to the solution of the discrete-time Lyapunov equation. Given a stable matrix AclA_{\mathrm{cl}} and a symmetric positive definite matrix Σ\Sigma, we define 𝖽𝗅𝗒𝖺𝗉(Acl,Σ)\mathsf{dlyap}(A_{\mathrm{cl}},\Sigma) to be the unique solution (over XX) to the matrix equation,

X=Σ+AclXAcl.\displaystyle X=\Sigma+A_{\mathrm{cl}}^{\top}XA_{\mathrm{cl}}.

A classical result in Lyapunov theory states that 𝖽𝗅𝗒𝖺𝗉(Acl,Σ)=j=0(Aclj)ΣAclj\mathsf{dlyap}(A_{\mathrm{cl}},\Sigma)=\sum_{j=0}^{\infty}(A_{\mathrm{cl}}^{j})^{\top}\Sigma A_{\mathrm{cl}}^{j}. Recalling our earlier definition, for a controller KK such that γ(A+BK)\sqrt{\gamma}(A+BK) is stable, we let PK,γ:=𝖽𝗅𝗒𝖺𝗉(γ(A+BK),Q+KRK)P_{K,\gamma}:=\mathsf{dlyap}(\sqrt{\gamma}(A+BK),Q+K^{\top}RK), where Q,RQ,R are the cost matrices for the LQR problem defined in Definition 2.1. As a special case, we let P:=𝖽𝗅𝗒𝖺𝗉(A+BK,Q+KRK)P_{\star}:=\mathsf{dlyap}(A+BK_{\star},Q+K_{\star}^{\top}RK_{\star}) where K:=K,1K_{\star}:=K_{\star,1} is the optimal controller for the undiscounted LQR problem. Using these definitions, we have the following facts:

Fact A.1.

Jlin(Kγ,A,B)<J_{\mathrm{lin}}(K\mid\gamma,A,B)<\infty if and only if γ(A+BK)\sqrt{\gamma}\cdot(A+BK) is a stable matrix.

Fact A.2.

If Jlin(Kγ,A,B)<J_{\mathrm{lin}}(K\mid\gamma,A,B)<\infty then for all 𝐱dx\mathbf{x}\in\mathbb{R}^{d_{x}}, Jlin(K𝐱,γ,A,B)=𝐱PK,γ𝐱J_{\mathrm{lin}}(K\mid\mathbf{x},\gamma,A,B)=\mathbf{x}^{\top}P_{K,\gamma}\mathbf{x}. Furthermore,

Jlin(Kγ,A,B)=𝔼𝐱0dx𝒮dx1𝐱0PK,γ𝐱0=tr[PK,γ].J_{\mathrm{lin}}(K\mid\gamma,A,B)=\operatorname*{\mathbb{E}}_{\mathbf{x}_{0}\sim\sqrt{d_{x}}\mathcal{S}^{d_{x}-1}}\mathbf{x}_{0}^{\top}P_{K,\gamma}\mathbf{x}_{0}=\mathrm{tr}\left[P_{K,\gamma}\right].
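
As a concrete illustration of these identities, the following minimal numerical sketch (not part of the analysis; the matrices and the dlyap wrapper are illustrative choices) computes P_{K,γ} with an off-the-shelf Lyapunov solver and checks Facts A.1 and A.2 against a truncated rollout.

```python
# Illustrative sanity check of Facts A.1 and A.2 (not part of the proofs).
# The matrices below are arbitrary; `dlyap` wraps scipy's discrete Lyapunov solver.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def dlyap(A_cl, Sigma):
    # Solves X = Sigma + A_cl^T X A_cl; scipy solves A X A^H - X + Q = 0, so pass A_cl^T.
    return solve_discrete_lyapunov(A_cl.T, Sigma)

rng = np.random.default_rng(0)
dx, du = 3, 2
A = rng.normal(size=(dx, dx)); B = rng.normal(size=(dx, du))
K = rng.normal(scale=0.1, size=(du, dx))
Q, R = np.eye(dx), np.eye(du)
gamma = 0.5 / np.max(np.abs(np.linalg.eigvals(A + B @ K)))**2   # ensures sqrt(gamma)(A+BK) is stable

A_cl = np.sqrt(gamma) * (A + B @ K)
assert np.max(np.abs(np.linalg.eigvals(A_cl))) < 1              # Fact A.1: finite cost iff stable
P = dlyap(A_cl, Q + K.T @ R @ K)                                 # P_{K,gamma}

# Fact A.2: the discounted cost from x0 equals x0^T P x0; averaging over
# x0 ~ sqrt(dx) * S^{dx-1} gives tr[P].  Compare with a (damped) truncated rollout.
x0 = rng.normal(size=dx); x0 *= np.sqrt(dx) / np.linalg.norm(x0)
cost, x = 0.0, x0.copy()
for _ in range(2000):
    cost += x @ (Q + K.T @ R @ K) @ x
    x = A_cl @ x
print(cost, x0 @ P @ x0, np.trace(P))   # first two agree; the third is J_lin(K | gamma)
```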

Employing these identities, we can now restate and prove Lemma 2.1:

Lemma 2.1.

For all KK such that Jlin(Kγ,A,B)<J_{\mathrm{lin}}(K\mid\gamma,A,B)<\infty, Jlin(Kγ,A,B)=Jlin(K1,γA,γB)J_{\mathrm{lin}}(K\mid\gamma,A,B)=J_{\mathrm{lin}}(K\mid 1,\sqrt{\gamma}\cdot A,\sqrt{\gamma}\cdot B).

Proof.

From the definition of the LQR cost and linear dynamics in Definition 2.1, for Acl:=A+BKA_{\mathrm{cl}}:=A+BK,

Jlin(Kγ,A,B)\displaystyle J_{\mathrm{lin}}(K\mid\gamma,A,B) =𝔼𝐱0[t=0γt𝐱0(Aclt)(Q+KRK)Aclt𝐱0]\displaystyle=\mathbb{E}_{\mathbf{x}_{0}}\left[\sum_{t=0}^{\infty}\gamma^{t}\cdot\mathbf{x}_{0}^{\top}\left(A_{\mathrm{cl}}^{t}\right)^{\top}\left(Q+K^{\top}RK\right)A_{\mathrm{cl}}^{t}\mathbf{x}_{0}\right]
=𝔼𝐱0[t=0𝐱0((γAcl)t)(Q+KRK)(γAcl)t𝐱0]\displaystyle=\mathbb{E}_{\mathbf{x}_{0}}\left[\sum_{t=0}^{\infty}\mathbf{x}_{0}^{\top}\left((\sqrt{\gamma}A_{\mathrm{cl}})^{t}\right)^{\top}\left(Q+K^{\top}RK\right)\left(\sqrt{\gamma}A_{\mathrm{cl}}\right)^{t}\mathbf{x}_{0}\right]
=Jlin(K1,γA,γB).\displaystyle=J_{\mathrm{lin}}(K\mid 1,\sqrt{\gamma}\cdot A,\sqrt{\gamma}\cdot B).

Therefore, since LQR with a discount factor is equivalent to LQR with damped dynamics, it follows that for γ0<ρ(A)2\gamma_{0}<\rho(A)^{-2}, (noisy) policy gradients initialized at the zero controller converges to the global optimum of the discounted LQR problem. The following lemma is essentially a restatement of the finite sample convergence result for gradient descent on LQR (Theorem 31) from Fazel et al. [2018], where we have set Q=R=IQ=R=I and 𝔼[𝐱0𝐱0]=I\mathbb{E}\left[\mathbf{x}_{0}\mathbf{x}_{0}^{\top}\right]=I as in the discount annealing algorithm. We include the proof of this result in Section A.2 for the sake of completeness.
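
Before restating that result, the damping equivalence and the initialization condition can be sanity-checked numerically; the sketch below uses toy matrices that are illustrative assumptions only.

```python
# A small numerical check of the damping equivalence (Lemma 2.1) and of the
# initialization gamma_0 < rho(A)^{-2} (toy matrices, illustrative only).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def J_lin(K, gamma, A, B, Q, R):
    A_cl = np.sqrt(gamma) * (A + B @ K)
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1:
        return np.inf
    return np.trace(solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K))

A = np.array([[1.2, 0.3], [0.0, 1.5]])      # unstable open loop, rho(A) = 1.5
B = np.eye(2); Q = R = np.eye(2)
K = np.array([[-0.4, 0.0], [0.0, -0.9]])

gamma0 = 0.9 / np.max(np.abs(np.linalg.eigvals(A)))**2      # gamma_0 < rho(A)^{-2}
print(J_lin(np.zeros((2, 2)), gamma0, A, B, Q, R))          # finite for the zero controller
print(J_lin(K, gamma0, A, B, Q, R),                         # Lemma 2.1: these two agree
      J_lin(K, 1.0, np.sqrt(gamma0) * A, np.sqrt(gamma0) * B, Q, R))
```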

Lemma A.3 (Fazel et al. [2018]).

For K0K_{0} such that Jlin(K0γ)<J_{\mathrm{lin}}(K_{0}\mid\gamma)<\infty, define (noisy) policy gradients as the procedure which computes updates according to,

Kt+1=Ktη~t,\displaystyle K_{t+1}=K_{t}-\eta\widetilde{\nabla\mkern-2.5mu}_{t},

for some matrix ~t\widetilde{\nabla\mkern-2.5mu}_{t}. There exists a choice of a constant step size η>0\eta>0 and a fixed polynomial,

𝒞PG:=poly(1Jlin(K0γ),1Aop,1Bop),\displaystyle\mathcal{C}_{\mathrm{PG}}:=\mathrm{poly}\left(\frac{1}{J_{\mathrm{lin}}(K_{0}\mid\gamma)},\frac{1}{\|A\|_{\mathrm{op}}},\frac{1}{\|B\|_{\mathrm{op}}}\right),

such that if the following inequality holds for all t=1Tt=1\dots T,

KJlin(Ktγ)~tF𝒞PGε,\displaystyle\left\|\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K_{t}\mid\gamma)-\widetilde{\nabla\mkern-2.5mu}_{t}\right\|_{\mathrm{F}}~{}\leq~{}\mathcal{C}_{\mathrm{PG}}\cdot\varepsilon, (A.1)

then Jlin(KTγ)minKJlin(Kγ)εJ_{\mathrm{lin}}(K_{T}\mid\gamma)-\min_{K}J_{\mathrm{lin}}(K\mid\gamma)~{}\leq~{}\varepsilon whenever

Tpoly(Aop,Bop,Jlin(K0γ))log(Jlin(K0γ)minKJlin(Kγ)ε).\displaystyle T~{}\geq~{}\mathrm{poly}\Big{(}\|A\|_{\mathrm{op}},\|B\|_{\mathrm{op}},J_{\mathrm{lin}}(K_{0}\mid\gamma)\Big{)}\log\left(\frac{J_{\mathrm{lin}}(K_{0}\mid\gamma)-\min_{K}J_{\mathrm{lin}}(K\mid\gamma)}{\varepsilon}\right).
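
The iteration analyzed in this lemma can be sketched as follows; the finite-difference gradient estimate stands in for the ε-Grad oracle and the backtracking safeguard replaces the lemma’s fixed admissible step size, so the snippet is an illustrative approximation rather than the procedure with the stated guarantee.

```python
# A schematic of the noisy policy gradient iteration from Lemma A.3 (illustrative only).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def J_lin(K, gamma, A, B, Q, R):
    A_cl = np.sqrt(gamma) * (A + B @ K)
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1:
        return np.inf                                   # Fact A.1: infinite cost if unstable
    return np.trace(solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K))

def grad_estimate(K, gamma, A, B, Q, R, h=1e-5):
    # central finite differences, used here as a stand-in for the eps-Grad oracle
    G = np.zeros_like(K)
    for i in range(K.shape[0]):
        for j in range(K.shape[1]):
            E = np.zeros_like(K); E[i, j] = h
            G[i, j] = (J_lin(K + E, gamma, A, B, Q, R)
                       - J_lin(K - E, gamma, A, B, Q, R)) / (2 * h)
    return G

A = np.array([[1.1, 0.5], [0.0, 1.3]]); B = np.eye(2); Q = R = np.eye(2)
gamma = 0.9 / np.max(np.abs(np.linalg.eigvals(A)))**2   # zero controller is feasible here
K = np.zeros((2, 2))
for _ in range(300):
    G = grad_estimate(K, gamma, A, B, Q, R)
    step, J_now = 1e-2, J_lin(K, gamma, A, B, Q, R)
    # crude backtracking safeguard; Lemma A.3 instead prescribes a fixed admissible step
    while J_lin(K - step * G, gamma, A, B, Q, R) >= J_now and step > 1e-12:
        step /= 2
    K = K - step * G
print(J_lin(K, gamma, A, B, Q, R))                      # approaches min_K J_lin(K | gamma)
```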

With this lemma, we can now present the proof of Theorem 1.

A.1 Discount annealing on linear systems: Proof of Theorem 1

We organize the proof into two main parts. First, we prove statements c)c) and d)d) by an inductive argument. Then, having already proved part b)b) in the main body of the paper, we finish the proof by establishing the stability guarantees of the resulting controller outlined in part a)a).

A.1.1 Proof of c)c) and d)d)

Base case.

Recall that at iteration tt of discount annealing, policy gradients is initialized at Kt,0:=KtK_{t,0}:=K_{t} and (in the case of linear systems) computes updates according to:

Kt,j+1=Kt,jηε-𝙶𝚛𝚊𝚍(Jlin(γt),Kt,j).\displaystyle K_{t,j+1}=K_{t,j}-\eta\cdot\varepsilon\text{-}\mathtt{Grad}\left(J_{\mathrm{lin}}(\cdot\mid\gamma_{t}),K_{t,j}\right).

From Lemma A.3, if Jlin(Kt,0γt)<J_{\mathrm{lin}}(K_{t,0}\mid\gamma_{t})<\infty and

ε-𝙶𝚛𝚊𝚍(Jlin(γt),Kt,j)Jlin(Kt,jγt)F𝒞pg,tdxj0,\displaystyle\|\varepsilon\text{-}\mathtt{Grad}\left(J_{\mathrm{lin}}(\cdot\mid\gamma_{t}),K_{t,j}\right)-\nabla\mkern-2.5muJ_{\mathrm{lin}}(K_{t,j}\mid\gamma_{t})\|_{\mathrm{F}}~{}\leq~{}\mathcal{C}_{\mathrm{pg},t}d_{x}\quad\forall j~{}\geq~{}0, (A.2)

where 𝒞pg,t=poly(1/Jlin(Kt,0γt),1/Aop,1/Bop)\mathcal{C}_{\mathrm{pg},t}=\mathrm{poly}(1/J_{\mathrm{lin}}(K_{t,0}\mid\gamma_{t}),1/\|A\|_{\mathrm{op}},1/\|B\|_{\mathrm{op}}), then policy gradients will converge to a Kt,jK_{t,j} such that Jlin(Kt,jγt)minKJlin(Kγt)dxJ_{\mathrm{lin}}(K_{t,j}\mid\gamma_{t})-\min_{K}J_{\mathrm{lin}}(K\mid\gamma_{t})~{}\leq~{}d_{x} after poly(Aop,Bop,Jlin(Kt,0γt))\mathrm{poly}(\|A\|_{\mathrm{op}},\|B\|_{\mathrm{op}},J_{\mathrm{lin}}(K_{t,0}\mid\gamma_{t})) many iterations.

By our choice of discount factor, we have that Jlin(0γ0)<J_{\mathrm{lin}}(0\mid\gamma_{0})<\infty. Furthermore, since εpoly(1/Jlin(0γ0),1/Aop,1/Bop)\varepsilon~{}\leq~{}\mathrm{poly}(1/J_{\mathrm{lin}}(0\mid\gamma_{0}),1/\|A\|_{\mathrm{op}},1/\|B\|_{\mathrm{op}}), the condition outlined in Eq. A.2 holds and policy gradients achieves the desired guarantee when t=0t=0.

The correctness of binary search at iteration t=0t=0 follows from Lemma C.6. In particular, we instantiate the lemma with f=Jlin(Kt+1)f=J_{\mathrm{lin}}(K_{t+1}\mid\cdot), the LQR objective viewed as a function of the discount factor, which is nondecreasing in γ\gamma by definition, along with f1=2.5Jlin(Kt+1γt)f_{1}=2.5J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}) and f2=8Jlin(Kt+1γt)f_{2}=8J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}). The algorithm requires auxiliary values f¯1[2.5Jlin(Kt+1γt),3Jlin(Kt+1γt)]\overline{f}_{1}\in[2.5J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}),3J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})] and f¯2[7Jlin(Kt+1γt),7.5Jlin(Kt+1γt)]\overline{f}_{2}\in[7J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}),7.5J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})], which we can always compute by using ε-𝙴𝚟𝚊𝚕\varepsilon\text{-}\mathtt{Eval} to estimate the cost Jlin(Kt+1γt)J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}) to precision .1dx.1d_{x} (recall that Jlin(Kγ)dxJ_{\mathrm{lin}}(K\mid\gamma)~{}\geq~{}d_{x} for any KK and γ\gamma). The last step needed to apply the lemma is to lower bound the width of the feasible region of γ\gamma’s which satisfy the desired criterion that Jlin(Kt+1γ)[f¯1,f¯2]J_{\mathrm{lin}}(K_{t+1}\mid\gamma)\in[\overline{f}_{1},\overline{f}_{2}].

Let γγt\gamma^{\prime}~{}\geq~{}\gamma_{t} be such that Jlin(Kt+1γ)=3Jlin(Kt+1γt)J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})=3J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}). Such a γ\gamma^{\prime} is guaranteed to exist since Jlin(Kt+1)J_{\mathrm{lin}}(K_{t+1}\mid\cdot) is nondecreasing in γ\gamma and it is a continuous function for all γ<γ¯:=sup{γ:Jlin(Kt+1γ)<}\gamma<\bar{\gamma}:=\sup\{\gamma:J_{\mathrm{lin}}(K_{t+1}\mid\gamma)<\infty\}. By the calculation from the proof of part b)b) presented in the main body of the paper, for

γ′′:=(18PKt+1,γop4+1)2γ,\displaystyle\gamma^{\prime\prime}:=\left(\frac{1}{8\|P_{K_{t+1},\gamma^{\prime}}\|_{\mathrm{op}}^{4}}+1\right)^{2}\gamma^{\prime},

we have that Jlin(Kt+1γ′′)2Jlin(Kt+1γ)=6Jlin(Kt+1γt)J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime\prime})~{}\leq~{}2J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})=6J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}). By monotonicity and continuity of Jlin(Kt+1)J_{\mathrm{lin}}(K_{t+1}\mid\cdot), when restricted to γγ¯\gamma~{}\leq~{}\bar{\gamma}, all γ[γ,γ′′]\gamma\in[\gamma^{\prime},\gamma^{\prime\prime}] satisfy Jlin(Kt+1γ)[3Jlin(Kt+1γt),6Jlin(Kt+1γt)]J_{\mathrm{lin}}(K_{t+1}\mid\gamma)\in[3J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}),6J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})]. Moreover,

γ′′γ=[(18PKt+1,γop4+1)21]γ14PKt+1,γop4γ14tr[PKt+1,γ]4γt,\displaystyle\gamma^{\prime\prime}-\gamma^{\prime}=\left[\left(\frac{1}{8\|P_{K_{t+1},\gamma^{\prime}}\|_{\mathrm{op}}^{4}}+1\right)^{2}-1\right]\gamma^{\prime}~{}\geq~{}\frac{1}{4\|P_{K_{t+1},\gamma^{\prime}}\|_{\mathrm{op}}^{4}}\gamma^{\prime}~{}\geq~{}\frac{1}{4\mathrm{tr}\left[P_{K_{t+1},\gamma^{\prime}}\right]^{4}}\gamma_{t},

where the last line follows from the fact that γγt\gamma^{\prime}~{}\geq~{}\gamma_{t} and that the trace of a PSD matrix is always at least as large as the operator norm. Lastly, since tr[PKt+1,γ]=Jlin(Kt+1γ)=3Jlin(Kt+1γt)\mathrm{tr}\left[P_{K_{t+1},\gamma^{\prime}}\right]=J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})=3J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}), by the guarantee of policy gradients, we have that for t=0t=0, Jlin(K1γ0)2tr[P]J_{\mathrm{lin}}(K_{1}\mid\gamma_{0})~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right]. Therefore, for t=0t=0:

γ′′γ14(6tr[P])4γ0.\displaystyle\gamma^{\prime\prime}-\gamma^{\prime}~{}\geq~{}\frac{1}{4(6\mathrm{tr}\left[P_{\star}\right])^{4}}\gamma_{0}.

Hence the width of the feasible region is at least 15184tr[P]4γ0\frac{1}{5184\mathrm{tr}\left[P_{\star}\right]^{4}}\gamma_{0}.
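
For intuition, the discount-factor update whose feasibility was just established can be sketched as a bisection over γ; exact cost evaluations replace ε-Eval, the bisection stands in for Lemma C.6 (which is not restated here), and all specific constants below are illustrative.

```python
# Sketch of the discount-factor update: bisect over gamma until the cost of K_{t+1}
# has grown into a constant-factor window above its value at gamma_t (illustrative).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def J_lin(K, gamma, A, B, Q, R):
    A_cl = np.sqrt(gamma) * (A + B @ K)
    if np.max(np.abs(np.linalg.eigvals(A_cl))) >= 1:
        return np.inf
    return np.trace(solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K))

def next_discount(K, gamma_t, A, B, Q, R, iters=60):
    base = J_lin(K, gamma_t, A, B, Q, R)
    if J_lin(K, 1.0, A, B, Q, R) <= 6 * base:       # K is already acceptable at gamma = 1
        return 1.0
    lo, hi = gamma_t, 1.0
    for _ in range(iters):                          # the cost is nondecreasing in gamma, so
        mid = (lo + hi) / 2                         # bisection brackets the target window
        if J_lin(K, mid, A, B, Q, R) < 3 * base:
            lo = mid
        else:
            hi = mid
    return hi

A = np.array([[1.1, 0.5], [0.0, 1.3]]); B = np.eye(2); Q = R = np.eye(2)
gamma_t = 0.9 / np.max(np.abs(np.linalg.eigvals(A)))**2
K = np.array([[-0.3, -0.2], [0.0, -0.2]])           # stand-in for the policy gradient output
gamma_next = next_discount(K, gamma_t, A, B, Q, R)
print(gamma_t, gamma_next,
      J_lin(K, gamma_next, A, B, Q, R) / J_lin(K, gamma_t, A, B, Q, R))
```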

Inductive step.

To show that policy gradients achieves the desired guarantee at iteration t+1t+1, we can repeat the exact same argument as in the base case. The only difference is that we need to argue that the cost of the initial controller, supt0Jlin(Ktγt)\sup_{t~{}\geq~{}0}J_{\mathrm{lin}}(K_{t}\mid\gamma_{t}), is uniformly bounded across iterations. By the inductive hypothesis on the success of the binary search algorithm at iteration t1t-1, we have that,

Jlin(Ktγt)\displaystyle J_{\mathrm{lin}}(K_{t}\mid\gamma_{t}) 8Jlin(Ktγt1)\displaystyle~{}\leq~{}8J_{\mathrm{lin}}(K_{t}\mid\gamma_{t-1})
8(minKJlin(Kγt1)+dx)\displaystyle~{}\leq~{}8\left(\min_{K}J_{\mathrm{lin}}(K\mid\gamma_{t-1})+d_{x}\right)
8(minKJlin(K1)+dx)\displaystyle~{}\leq~{}8\left(\min_{K}J_{\mathrm{lin}}(K\mid 1)+d_{x}\right)
16tr[P].\displaystyle~{}\leq~{}16\mathrm{tr}\left[P_{\star}\right].

Hence, by Lemma A.3, policy gradients achieves the desired guarantee using poly(Mlin,Aop,Bop)\mathrm{poly}(M_{\mathrm{lin}},\|A\|_{\mathrm{op}},\|B\|_{\mathrm{op}}) many queries to ε-𝙶𝚛𝚊𝚍(,Jlin(γ))\varepsilon\text{-}\mathtt{Grad}(\cdot,J_{\mathrm{lin}}(\cdot\mid\gamma)) as long as ε\varepsilon is less than poly(Mlin1,Aop1,Bop1)\mathrm{poly}(M_{\mathrm{lin}}^{-1},\|A\|_{\mathrm{op}}^{-1},\|B\|_{\mathrm{op}}^{-1}).

Likewise, the argument for the correctness of the binary search procedure is identical to that of the base case. Because of the success of policy gradients and binary search at the previous iteration, we can upper bound tr[PKt+1,γ]\mathrm{tr}\left[P_{K_{t+1},\gamma^{\prime}}\right] by 6tr[P]6\mathrm{tr}\left[P_{\star}\right] and obtain a uniform lower bound on the width of the feasible region.

A.1.2 Proof of a)a)

After halting, we see that discount annealing returns a controller K^\widehat{K} satisfying the stated condition from Step 2 requiring that,

Jlin(K^1)Jlin(K1)=tr[P^P]dx.\displaystyle J_{\mathrm{lin}}(\widehat{K}\mid 1)-J_{\mathrm{lin}}(K_{\star}\mid 1)=\mathrm{tr}\left[\widehat{P}-P_{\star}\right]~{}\leq~{}d_{x}.

Here, we have used Fact A.2 to rewrite Jlin(K^1)J_{\mathrm{lin}}(\widehat{K}\mid 1) as tr[P^]\mathrm{tr}[\widehat{P}] for P^:=𝖽𝗅𝗒𝖺𝗉(A+BK^,Q+K^RK^)\widehat{P}:=\mathsf{dlyap}(A+B\widehat{K},Q+\widehat{K}^{\top}R\widehat{K}) (and likewise for PP_{\star}). Since tr[P]dx\mathrm{tr}\left[P_{\star}\right]~{}\geq~{}d_{x}, we conclude that tr[P^]2tr[P]\mathrm{tr}[\widehat{P}]~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right]. Now, by properties of the Lyapunov equation (see Lemma A.5), the following holds for Acl:=A+BK^A_{\mathrm{cl}}:=A+B\widehat{K}:

Acltop2P^op(11P^op)ttr[P^]exp(t/tr[P^]).\displaystyle\|A_{\mathrm{cl}}^{t}\|_{\mathrm{op}}^{2}~{}\leq~{}\|\widehat{P}\|_{\mathrm{op}}\left(1-\frac{1}{\|\widehat{P}\|_{\mathrm{op}}}\right)^{t}~{}\leq~{}\mathrm{tr}[\widehat{P}]\exp\left(-t/\mathrm{tr}[\widehat{P}]\right).

Hence, we conclude that,

𝐱t=Aclt𝐱0Acltop𝐱02tr[P]exp(14tr[P]t)𝐱0.\displaystyle\|\mathbf{x}_{t}\|=\|A_{\mathrm{cl}}^{t}\mathbf{x}_{0}\|~{}\leq~{}\|A_{\mathrm{cl}}^{t}\|_{\mathrm{op}}\|\mathbf{x}_{0}\|~{}\leq~{}\sqrt{2\mathrm{tr}[P_{\star}]}\exp\left(-\frac{1}{4\mathrm{tr}\left[P_{\star}\right]}\cdot t\right)\|\mathbf{x}_{0}\|.

A.2 Convergence of policy gradients for LQR: Proof of Lemma A.3

Proof.

Note that, by Lemma 2.1, proving the above result for Jlin(γ,A,B)J_{\mathrm{lin}}(\cdot\mid\gamma,A,B) is the same as proving it for Jlin(1,γA,γB)J_{\mathrm{lin}}(\cdot\mid 1,\sqrt{\gamma}A,\sqrt{\gamma}B). We start by defining the following idealized updates,

K\displaystyle K^{\prime} =KηKJlin(Kγ)\displaystyle=K-\eta\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K\mid\gamma)
K′′\displaystyle K^{\prime\prime} =Kη~t.\displaystyle=K-\eta\widetilde{\nabla\mkern-2.5mu}_{t}.

From Lemmas 13, 24, and 25 of Fazel et al. [2018], there exists a fixed polynomial,

𝒞η:=poly(1γAop,1γBop,1Jlin(K0γ))\displaystyle\mathcal{C}_{\eta}:=\mathrm{poly}\left(\frac{1}{\sqrt{\gamma}\|A\|_{\mathrm{op}}},\frac{1}{\sqrt{\gamma}\|B\|_{\mathrm{op}}},\frac{1}{J_{\mathrm{lin}}(K_{0}\mid\gamma)}\right)

such that, for η𝒞η\eta~{}\leq~{}\mathcal{C}_{\eta}, the following inequality holds,

Jlin(Kγ)Jlin(Kγ)(1η1ΣKop)(Jlin(Kγ)Jlin(Kγ)),\displaystyle J_{\mathrm{lin}}(K^{\prime}\mid\gamma)-J_{\mathrm{lin}}(K_{\star}\mid\gamma)~{}\leq~{}\left(1-\eta\frac{1}{\|\Sigma_{K_{\star}}\|_{\mathrm{op}}}\right)\left(J_{\mathrm{lin}}(K\mid\gamma)-J_{\mathrm{lin}}(K_{\star}\mid\gamma)\right),

where ΣK=𝔼𝐱0[t=0𝐱t,𝐱t,]\Sigma_{K_{\star}}=\mathbb{E}_{\mathbf{x}_{0}}[\sum_{t=0}^{\infty}\mathbf{x}_{t,\star}\mathbf{x}_{t,\star}^{\top}] and {𝐱t,}t=0\{\mathbf{x}_{t,\star}\}_{t=0}^{\infty} is the sequence of states generated by the controller KK_{\star}. Therefore, if Jlin(K′′γ)J_{\mathrm{lin}}(K^{\prime\prime}\mid\gamma) and Jlin(Kγ)J_{\mathrm{lin}}(K^{\prime}\mid\gamma) satisfy,

|Jlin(K′′γ)Jlin(Kγ)|12ΣKopηε\displaystyle|J_{\mathrm{lin}}(K^{\prime\prime}\mid\gamma)-J_{\mathrm{lin}}(K^{\prime}\mid\gamma)|~{}\leq~{}\frac{1}{2\|\Sigma_{K_{\star}}\|_{\mathrm{op}}}\eta\varepsilon (A.3)

then, as long as Jlin(Kγ)Jlin(Kγ)εJ_{\mathrm{lin}}(K\mid\gamma)-J_{\mathrm{lin}}(K_{\star}\mid\gamma)~{}\geq~{}\varepsilon, this following inequality also holds:

Jlin(K′′γ)Jlin(Kγ)(1η12ΣKop)(Jlin(Kγ)Jlin(Kγ)).\displaystyle J_{\mathrm{lin}}(K^{\prime\prime}\mid\gamma)-J_{\mathrm{lin}}(K_{\star}\mid\gamma)~{}\leq~{}\left(1-\eta\frac{1}{2\|\Sigma_{K_{\star}}\|_{\mathrm{op}}}\right)\left(J_{\mathrm{lin}}(K\mid\gamma)-J_{\mathrm{lin}}(K_{\star}\mid\gamma)\right).

The proof then follows by unrolling the recursion and simplifying. We now focus on establishing Eq. A.3. By Lemma 27 in Fazel et al. [2018], if

K′′Kop=ηKJlin(Kjγ)~jop𝒞K\displaystyle\|K^{\prime\prime}-K^{\prime}\|_{\mathrm{op}}=\eta\|\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K_{j}\mid\gamma)-\widetilde{\nabla\mkern-2.5mu}_{j}\|_{\mathrm{op}}~{}\leq~{}\mathcal{C}_{K}

where 𝒞K\mathcal{C}_{K} is a fixed polynomial 𝒞K:=poly(1Jlin(K0γ),1γAop,1γBop)\mathcal{C}_{K}:=\mathrm{poly}(\frac{1}{J_{\mathrm{lin}}(K_{0}\mid\gamma)},\frac{1}{\sqrt{\gamma}\|A\|_{\mathrm{op}}},\frac{1}{\sqrt{\gamma}\|B\|_{\mathrm{op}}}), then

|Jlin(K′′γ)Jlin(Kγ)|𝒞costK′′Kop=𝒞costηKJlin(Kjγ)~jop,\displaystyle|J_{\mathrm{lin}}(K^{\prime\prime}\mid\gamma)-J_{\mathrm{lin}}(K^{\prime}\mid\gamma)|~{}\leq~{}\mathcal{C}_{\mathrm{cost}}\|K^{\prime\prime}-K^{\prime}\|_{\mathrm{op}}=\mathcal{C}_{\mathrm{cost}}\cdot\eta\|\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K_{j}\mid\gamma)-\widetilde{\nabla\mkern-2.5mu}_{j}\|_{\mathrm{op}},

where 𝒞cost:=poly(dx,Rop,Bop,Jlin(K0γ))\mathcal{C}_{\mathrm{cost}}:=\mathrm{poly}(d_{x},\|R\|_{\mathrm{op}},\|B\|_{\mathrm{op}},J_{\mathrm{lin}}(K_{0}\mid\gamma)). Therefore, Eq. A.3 holds if

KJlin(Kjγ)~jopmin{12ΣKop𝒞costε,𝒞K}.\displaystyle\|\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K_{j}\mid\gamma)-\widetilde{\nabla\mkern-2.5mu}_{j}\|_{\mathrm{op}}~{}\leq~{}\min\left\{\frac{1}{2\|\Sigma_{K_{\star}}\|_{\mathrm{op}}\mathcal{C}_{\mathrm{cost}}}\varepsilon,\;\mathcal{C}_{K}\right\}.

The exact statement follows from using dxJlin(K0γ)d_{x}~{}\leq~{}J_{\mathrm{lin}}(K_{0}\mid\gamma) and ΣK,γopJlin(K,γγ)Jlin(K0γ)\|\Sigma_{K_{\star,\gamma}}\|_{\mathrm{op}}~{}\leq~{}J_{\mathrm{lin}}(K_{\star,\gamma}\mid\gamma)~{}\leq~{}J_{\mathrm{lin}}(K_{0}\mid\gamma) by Lemma 13 in [Fazel et al., 2018] and taking the polynomial in the lemma statement to be the minimum of 𝒞K\mathcal{C}_{K} and 1/𝒞cost1/\mathcal{C}_{\mathrm{cost}}. ∎

A.3 Impossibility of reward shaping: Proof of Proposition 2.2

Proof.

Consider the linear dynamical system with dynamics matrices,

A=[0002],B=[1β]\displaystyle A=\begin{bmatrix}0&0\\ 0&2\end{bmatrix},\quad B=\begin{bmatrix}1\\ \beta\end{bmatrix}

where β>0\beta>0 is a parameter to be chosen later. Note that this linear dynamical system is controllable (and hence stabilizable) [Callier and Desoer, 2012], since the matrix

[BAB]=[10β2β]\displaystyle\begin{bmatrix}B&AB\end{bmatrix}=\begin{bmatrix}1&0\\ \beta&2\beta\end{bmatrix}

is full rank. For any controller K=[k1k2]K=\begin{bmatrix}k_{1}&k_{2}\end{bmatrix}, the closed loop system Acl:=A+BKA_{\mathrm{cl}}:=A+BK has the form,

Acl=[k1k2βk12+βk2].\displaystyle A_{\mathrm{cl}}=\begin{bmatrix}k_{1}&k_{2}\\ \beta k_{1}&2+\beta k_{2}\end{bmatrix}.

By Gershgorin’s circle theorem, AclA_{\mathrm{cl}} has an eigenvalue λ\lambda which satisfies,

|λ||2+βk2||βk1|22βmax{|k1|,|k2|}.\displaystyle|\lambda|~{}\geq~{}|2+\beta k_{2}|-|\beta k_{1}|~{}\geq~{}2-2\beta\max\{|k_{1}|,|k_{2}|\}.

Therefore, any controller KK for which the closed-loop system A+BKA+BK is stable must have the property that,

max{|k1|,|k2|}12β.\displaystyle\max\{|k_{1}|,|k_{2}|\}~{}\geq~{}\frac{1}{2\beta}.

Using this observation and A.2, for any discount factor γ\gamma, a stabilizing controller KK must satisfy,

Jlin(Kγ)\displaystyle J_{\mathrm{lin}}(K\mid\gamma) =tr[PK,γ]\displaystyle=\mathrm{tr}\left[P_{K,\gamma}\right]
tr[Q]+tr[KRK]\displaystyle~{}\geq~{}\mathrm{tr}\left[Q\right]+\mathrm{tr}\left[K^{\top}RK\right]
(k12+k22)σmin(R)\displaystyle~{}\geq~{}(k_{1}^{2}+k_{2}^{2})\cdot\sigma_{\min}(R)
14β2σmin(R).\displaystyle~{}\geq~{}\frac{1}{4\beta^{2}}\cdot\sigma_{\min}(R).

In the above calculation, we have used the identity PK,γ=Q+KRK+γAclPK,γAclP_{K,\gamma}=Q+K^{\top}RK+\gamma A_{\mathrm{cl}}^{\top}P_{K,\gamma}A_{\mathrm{cl}} as well as the assumption that RR is positive definite. Next, we observe that for a discount factor γ0=c2ρ(A)2\gamma_{0}=c^{2}\cdot\rho(A)^{-2}, where c(0,1)c\in(0,1) is chosen as in the initial iteration of our algorithm, the cost of the 0 controller has the following upper bound:

Jlin(0γ0)\displaystyle J_{\mathrm{lin}}(0\mid\gamma_{0}) =j=0c2jρ(A)2jtr[(A)jQAj]\displaystyle=\sum_{j=0}^{\infty}c^{2j}\rho(A)^{-2j}\cdot\mathrm{tr}\left[(A^{\top})^{j}QA^{j}\right]
Qopj=0c2jρ(A)2jAjF2\displaystyle~{}\leq~{}\|Q\|_{\mathrm{op}}\sum_{j=0}^{\infty}c^{2j}\rho(A)^{-2j}\|A^{j}\|_{\mathrm{F}}^{2}
=Qopj=0(cρ(A)A)jF2.\displaystyle=\|Q\|_{\mathrm{op}}\sum_{j=0}^{\infty}\left\|\left(\frac{c}{\rho(A)}\cdot A\right)^{j}\right\|_{\mathrm{F}}^{2}.

Using standard Lyapunov arguments (see for example Section D.2 in Perdomo et al. [2021]), the sum in the last line converges and is equal to some function f(c,A)<f(c,A)<\infty, which depends only on cc and AA, for all c(0,1)c\in(0,1). Using this calculation, it follows that

minKJlin(Kγ0)Jlin(0γ0)Qopf(c,A).\displaystyle\min_{K}J_{\mathrm{lin}}(K\mid\gamma_{0})~{}\leq~{}J_{\mathrm{lin}}(0\mid\gamma_{0})~{}\leq~{}\|Q\|_{\mathrm{op}}f(c,A).

Hence, for any Q,RQ,R, and discount factor γ(0,ρ(A)2)\gamma\in(0,\rho(A)^{-2}), we can choose β\beta small enough such that,

Qopf(c,A)<14β2σmin(R)\displaystyle\|Q\|_{\mathrm{op}}f(c,A)<\frac{1}{4\beta^{2}}\sigma_{\min}(R)

implying that the optimal controller K,γK_{\star,\gamma} for the discounted problem cannot be stabilizing for (A,B)(A,B). ∎
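
The construction can also be checked numerically. The sketch below (an illustration under the convention u_t = K x_t, using scipy's Riccati solver on the damped pair as justified by Lemma 2.1) computes the optimal discounted controller for a small β and confirms that the resulting closed loop A + BK is unstable.

```python
# A numerical illustration of the construction above (illustrative, not part of the proof).
import numpy as np
from scipy.linalg import solve_discrete_are

beta, gamma = 0.01, 0.2                      # gamma < rho(A)^{-2} = 1/4
A = np.array([[0.0, 0.0], [0.0, 2.0]])
B = np.array([[1.0], [beta]])
Q, R = np.eye(2), np.eye(1)

# By Lemma 2.1, the optimal gamma-discounted gain is the undiscounted optimum for the
# damped pair (sqrt(gamma) A, sqrt(gamma) B), obtained from the discrete Riccati equation.
Ad, Bd = np.sqrt(gamma) * A, np.sqrt(gamma) * B
P = solve_discrete_are(Ad, Bd, Q, R)
K = -np.linalg.solve(R + Bd.T @ P @ Bd, Bd.T @ P @ Ad)   # optimal discounted gain

print(K)                                                 # entries stay far below 1/(2 beta)
print(np.max(np.abs(np.linalg.eigvals(A + B @ K))))      # spectral radius of A + BK exceeds 1
```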

A.4 Auxiliary results for linear systems

Proposition A.4.

Let γ1(A+BK)\sqrt{\gamma_{1}}(A+BK) be a stable matrix and define P1:=𝖽𝗅𝗒𝖺𝗉(γ1(A+BK),Q+KRK)P_{1}:=\mathsf{dlyap}(\sqrt{\gamma_{1}}(A+BK),Q+K^{\top}RK). Then, for cc defined as

c:=18P1op4+1,\displaystyle c:=\frac{1}{8\|P_{1}\|_{\mathrm{op}}^{4}}+1,

the following holds for γ2:=c2γ1\gamma_{2}:=c^{2}\gamma_{1} and P2:=𝖽𝗅𝗒𝖺𝗉(γ2(A+BK),Q+KRK)P_{2}:=\mathsf{dlyap}(\sqrt{\gamma_{2}}(A+BK),Q+K^{\top}RK):

tr[P2P1]tr[P1].\displaystyle\mathrm{tr}\left[P_{2}-P_{1}\right]~{}\leq~{}\mathrm{tr}\left[P_{1}\right].
Proof.

The proof is a direct consequence of Proposition C.7 in Perdomo et al. [2021]. In particular, we apply their result for the trace norm with the following substitutions,

A1γ1(A+BK)\displaystyle A_{1}\leftarrow\sqrt{\gamma_{1}}(A+BK)\quad ΣQ+KRK\displaystyle\Sigma\leftarrow Q+K^{\top}RK
A2γ2(A+BK)\displaystyle A_{2}\leftarrow\sqrt{\gamma_{2}}(A+BK)\quad α1/2\displaystyle\alpha\leftarrow 1/2

where A1,A2,Σ,A_{1},A_{2},\Sigma, and α\alpha are defined as in Perdomo et al. [2021]. Note that for cc satisfying,

c18γ1(A+BK)opmin{1P1op3/2;tr[P1]dxP1op7/2}+1\displaystyle c~{}\leq~{}\frac{1}{8\|\sqrt{\gamma_{1}}(A+BK)\|_{\mathrm{op}}}\min\left\{\frac{1}{\|P_{1}\|_{\mathrm{op}}^{3/2}};\;\frac{\mathrm{tr}\left[P_{1}\right]}{d_{x}\|P_{1}\|_{\mathrm{op}}^{7/2}}\right\}+1

we get that,

A1A2op2=γ1(c1)2A+BKop2164P1op3=α216P1op3.\displaystyle\|A_{1}-A_{2}\|_{\mathrm{op}}^{2}=\gamma_{1}(c-1)^{2}\|A+BK\|_{\mathrm{op}}^{2}~{}\leq~{}\frac{1}{64\|P_{1}\|_{\mathrm{op}}^{3}}=\frac{\alpha^{2}}{16\|P_{1}\|_{\mathrm{op}}^{3}}.

Therefore, Proposition C.7 states that, for 𝒞:=tr[P11/2(Q+KRK)P11/2]tr[I]=dx\mathcal{C}:=\mathrm{tr}\left[P_{1}^{-1/2}(Q+K^{\top}RK)P_{1}^{-1/2}\right]~{}\leq~{}\mathrm{tr}\left[I\right]=d_{x},

tr[P2P1]\displaystyle\mathrm{tr}\left[P_{2}-P_{1}\right] 8𝒞γ1(c1)A+BKopP1op7/2\displaystyle~{}\leq~{}8\mathcal{C}\sqrt{\gamma_{1}}(c-1)\|A+BK\|_{\mathrm{op}}\|P_{1}\|_{\mathrm{op}}^{7/2}
tr[P1].\displaystyle~{}\leq~{}\mathrm{tr}\left[P_{1}\right].

Lastly, noting that,

P1=𝖽𝗅𝗒𝖺𝗉(γ1(A+BK),Q+KRK)γ1(A+BK)(A+BK)\displaystyle P_{1}=\mathsf{dlyap}(\sqrt{\gamma_{1}}(A+BK),Q+K^{\top}RK)\succeq\gamma_{1}(A+BK)^{\top}(A+BK)

we have that γ1(A+BK)opP1op1/2\|\sqrt{\gamma_{1}}(A+BK)\|_{\mathrm{op}}~{}\leq~{}\|P_{1}\|_{\mathrm{op}}^{1/2} and tr[P1]dx\mathrm{tr}\left[P_{1}\right]~{}\geq~{}d_{x}. Therefore, since P1op1\|P_{1}\|_{\mathrm{op}}~{}\geq~{}1, in order to apply Proposition C.7 from Perdomo et al. [2021] it suffices for cc to satisfy,

c18P1op4+1.\displaystyle c~{}\leq~{}\frac{1}{8\|P_{1}\|_{\mathrm{op}}^{4}}+1.

Lemma A.5 (Lemma D.9 in Perdomo et al. [2021]).

Let AA be a stable matrix, QIQ\succeq I, and define P:=𝖽𝗅𝗒𝖺𝗉(A,Q)P:=\mathsf{dlyap}(A,Q). Then, for all j0j~{}\geq~{}0,

(A)jAj(A)jPAjP(11Pop)j.\displaystyle(A^{\top})^{j}A^{j}\preceq(A^{\top})^{j}PA^{j}\preceq P\left(1-\frac{1}{\|P\|_{\mathrm{op}}}\right)^{j}.
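
For intuition, the contraction property in Lemma A.5 can be verified numerically on a random stable matrix; the sketch below is an illustrative check only.

```python
# An illustrative numerical check of Lemma A.5 on a random stable matrix with Q = I.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))
A *= 0.9 / np.max(np.abs(np.linalg.eigvals(A)))     # rescale so that A is stable
P = solve_discrete_lyapunov(A.T, np.eye(3))         # P = dlyap(A, I)
rate = 1 - 1 / np.linalg.norm(P, 2)
for j in range(1, 6):
    Aj = np.linalg.matrix_power(A, j)
    gap = rate**j * P - Aj.T @ P @ Aj               # Lemma A.5 asserts this is PSD
    print(j, np.min(np.linalg.eigvalsh(gap)) >= -1e-9)
```
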
Lemma A.6.

Let AA be a stable matrix and define P:=𝖽𝗅𝗒𝖺𝗉(A,Q)P:=\mathsf{dlyap}(A,Q) where QIQ\succeq I. Then, for any matrix Δ\Delta such that Δop16Pop2,\|\Delta\|_{\mathrm{op}}~{}\leq~{}\frac{1}{6\|P\|_{\mathrm{op}}^{2}}, it holds that for all j0j~{}\geq~{}0

((A+Δ))jP(A+Δ)jP(112Pop)j.\displaystyle\left((A+\Delta)^{\top}\right)^{j}P(A+\Delta)^{j}\preceq P\left(1-\frac{1}{2\|P\|_{\mathrm{op}}}\right)^{j}.
Proof.

Expanding out, we have that

(A+Δ)P(A+Δ)\displaystyle(A+\Delta)^{\top}P(A+\Delta) =APA+APΔ+ΔPA+ΔPΔ\displaystyle=A^{\top}PA+A^{\top}P\Delta+\Delta^{\top}PA+\Delta^{\top}P\Delta
P(11Pop)+APΔ+ΔPA+ΔPΔopI,\displaystyle\preceq P\left(1-\frac{1}{\|P\|_{\mathrm{op}}}\right)+\|A^{\top}P\Delta+\Delta^{\top}PA+\Delta^{\top}P\Delta\|_{\mathrm{op}}I,

where in the second line we have used properties of the Lyapunov function, Lemma A.5. Next, we observe that

ΔPAopΔP1/2opP1/2AopΔP1/2opP1/2opΔopPop,\displaystyle\|\Delta^{\top}PA\|_{\mathrm{op}}~{}\leq~{}\|\Delta^{\top}P^{1/2}\|_{\mathrm{op}}\|P^{1/2}A\|_{\mathrm{op}}~{}\leq~{}\|\Delta^{\top}P^{1/2}\|_{\mathrm{op}}\|P^{1/2}\|_{\mathrm{op}}~{}\leq~{}\|\Delta\|_{\mathrm{op}}\|P\|_{\mathrm{op}},

where we have again used Lemma A.5 to conclude that APAPA^{\top}PA\preceq P. Note that the exact same calculation holds for APΔop\|A^{\top}P\Delta\|_{\mathrm{op}}. Hence, we can conclude that for Δ\Delta such that Δop1\|\Delta\|_{\mathrm{op}}~{}\leq~{}1,

(A+Δ)P(A+Δ)P(11Pop)+3PopΔopI.\displaystyle(A+\Delta)^{\top}P(A+\Delta)\preceq P\left(1-\frac{1}{\|P\|_{\mathrm{op}}}\right)+3\|P\|_{\mathrm{op}}\|\Delta\|_{\mathrm{op}}\cdot I.

Using the fact that PIP\succeq I and that Δop1/(6Pop2)\|\Delta\|_{\mathrm{op}}~{}\leq~{}1/(6\|P\|_{\mathrm{op}}^{2}), we get that,

3PopΔopI12PopP,\displaystyle 3\|P\|_{\mathrm{op}}\|\Delta\|_{\mathrm{op}}\cdot I\preceq\frac{1}{2\|P\|_{\mathrm{op}}}P,

which finishes the proof. ∎

Appendix B Deferred Proofs and Analysis for the Nonlinear Setting

Establishing Lyapunov functions. Our analysis for nonlinear systems begins with the observation that any state-feedback controller KK which achieves finite cost on the γ\gamma-discounted LQR problem has an associated value function PK,γP_{K,\gamma} which can be used as a Lyapunov function for the γ\sqrt{\gamma}-damped nonlinear dynamics, for small enough initial states. We present the proof of this result in Section B.3.

Lemma B.1.

Let Jlin(Kγ)<J_{\mathrm{lin}}(K\mid\gamma)<\infty. Then, for all 𝐱dx\mathbf{x}\in\mathbb{R}^{d_{x}} such that,

𝐱PK,γ𝐱rnl24βnl2PK,γop3,\displaystyle\mathbf{x}^{\top}P_{K,\gamma}\mathbf{x}~{}\leq~{}\frac{r_{\mathrm{nl}}^{2}}{4\beta_{\mathrm{nl}}^{2}\|P_{K,\gamma}\|_{\mathrm{op}}^{3}},

the following inequality holds:

γGnl(𝐱,K𝐱)PK,γGnl(𝐱,K𝐱)𝐱PK,γ𝐱(112PK,γop).\displaystyle\gamma\cdot G_{\mathrm{nl}}(\mathbf{x},K\mathbf{x})^{\top}P_{K,\gamma}G_{\mathrm{nl}}(\mathbf{x},K\mathbf{x})~{}\leq~{}\mathbf{x}^{\top}P_{K,\gamma}\mathbf{x}\cdot\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right).

Using this observation, we can then show that any controller which has finite discounted LQR cost is exponentially stabilizing over states in a sufficiently small region of attraction.

Lemma B.2.

Assume Jlin(Kγ)<J_{\mathrm{lin}}(K\mid\gamma)<\infty and let {𝐱t,nl}t=0\{\mathbf{x}_{t,\mathrm{nl}}\}_{t=0}^{\infty} be the sequence of states generated according to 𝐱t+1,nl=γGnl(𝐱t,nl,K𝐱t,nl)\mathbf{x}_{t+1,\mathrm{nl}}=\sqrt{\gamma}G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}) where 𝐱0,nl=𝐱0\mathbf{x}_{0,\mathrm{nl}}=\mathbf{x}_{0}. If 𝐱0\mathbf{x}_{0} is such that,

V0:=𝐱0PK,γ𝐱0rnl24βnl2PK,γop3,\displaystyle V_{0}:=\mathbf{x}_{0}^{\top}P_{K,\gamma}\mathbf{x}_{0}~{}\leq~{}\frac{r_{\mathrm{nl}}^{2}}{4\beta_{\mathrm{nl}}^{2}\|P_{K,\gamma}\|_{\mathrm{op}}^{3}},

then for all t0t~{}\geq~{}0 and for V0V_{0} defined as above,

  a)

    The norm of the state 𝐱t,nl2\|\mathbf{x}_{t,\mathrm{nl}}\|^{2} is bounded by

    𝐱t,nl2𝐱t,nlPK,γ𝐱t,nlV0(112PK,γop)t.\displaystyle\|\mathbf{x}_{t,\mathrm{nl}}\|^{2}~{}\leq~{}\mathbf{x}_{t,\mathrm{nl}}^{\top}P_{K,\gamma}\mathbf{x}_{t,\mathrm{nl}}~{}\leq~{}V_{0}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t}. (B.1)
  b)

    The norms of fnl(𝐱t,nl,K𝐱t,nl)f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}) and fnl(𝐱t,nl,K𝐱t,nl)\nabla\mkern-2.5muf_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}) are bounded by

    fnl(𝐱t,nl,K𝐱t,nl)\displaystyle\|f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| βnl(1+Kop2)V0(112PK,γop)t.\displaystyle\leq\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}}^{2})V_{0}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t}. (B.2)
    fnl(𝐱t,nl,K𝐱t,nl)op\displaystyle\|\nabla\mkern-2.5muf_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\|_{\mathrm{op}} βnl(1+Kop)V01/2(112PK,γop)t/2.\displaystyle\leq\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}})V_{0}^{1/2}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t/2}. (B.3)
Proof.

The proof of (a)(a) follows by repeatedly applying Lemma B.1. Part (b)(b) follows from part (a)(a) after using Lemma 3.1. The statement of the lemma in the main body follows from using PK,γoptr[PK,γ]=Jlin(Kγ)\|P_{K,\gamma}\|_{\mathrm{op}}~{}\leq~{}\mathrm{tr}\left[P_{K,\gamma}\right]=J_{\mathrm{lin}}(K\mid\gamma) and simplifying via (1x)texp(tx)(1-x)^{t}~{}\leq~{}\exp(-tx). ∎
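
The following simulation sketch illustrates Lemmas B.1 and B.2 on a toy nonlinear system; the transition G_nl (a linear part plus a quadratic remainder), the controller, and the initial state are all illustrative assumptions rather than objects from the paper.

```python
# A simulation sketch of Lemmas B.1 and B.2 on a toy nonlinear system (illustrative only).
import numpy as np
from scipy.linalg import solve_discrete_are, solve_discrete_lyapunov

A = np.array([[1.0, 0.2], [0.0, 1.1]]); B = np.eye(2); Q = R = np.eye(2)
gamma = 0.7

def G_nl(x, u):
    # Jacobian linearization A x + B u plus a smooth quadratic remainder f_nl
    return A @ x + B @ u + 0.1 * np.array([x[0] * x[1], x[1] ** 2])

Ad, Bd = np.sqrt(gamma) * A, np.sqrt(gamma) * B
P_are = solve_discrete_are(Ad, Bd, Q, R)
K = -np.linalg.solve(R + Bd.T @ P_are @ Bd, Bd.T @ P_are @ Ad)    # J_lin(K | gamma) is finite
P = solve_discrete_lyapunov((np.sqrt(gamma) * (A + B @ K)).T, Q + K.T @ R @ K)   # P_{K, gamma}

x = np.array([0.02, -0.01])                # small initial state, inside the region of attraction
V = x @ P @ x
for t in range(30):
    x = np.sqrt(gamma) * G_nl(x, K @ x)    # sqrt(gamma)-damped nonlinear rollout
    V_next = x @ P @ x
    assert V_next <= V * (1 - 1 / (2 * np.linalg.norm(P, 2))) + 1e-12   # Lyapunov decrease
    V = V_next
print("final state norm:", np.linalg.norm(x))
```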

Relating GnlG_{\mathrm{nl}} to its Jacobian Linearization. Having established that any controller that achieves finite LQR cost is guaranteed to be stabilizing for the nonlinear system, we now go a step further and illustrate how this stability guarantee can be used to prove that the differences in costs and gradients between GnlG_{\mathrm{nl}} and its Jacobian linearization are small.

Proposition 3.3 (restated).

Assume Jlin(Kγ)<J_{\mathrm{lin}}(K\mid\gamma)<\infty. Then,

  a)

    If rrnl2βnlPK,γop2r~{}\leq~{}\frac{r_{\mathrm{nl}}}{2\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}, then |Jnl(Kγ,r)Jlin(Kγ)|8dxβnlPK,γop4r.\big{|}J_{\mathrm{nl}}(K\mid\gamma,r)-J_{\mathrm{lin}}(K\mid\gamma)\big{|}~{}\leq~{}8d_{x}\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}\cdot r.

  b)

    If rrnl12βnlPK,γop5/2r~{}\leq~{}\frac{r_{\mathrm{nl}}}{12\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{5/2}}, then,

    KJnl(Kγ,r)KJlin(Kγ)F48dxβnl(1+Bop)PK,γop7r.\displaystyle\|\nabla\mkern-2.5mu_{K}J_{\mathrm{nl}}(K\mid\gamma,r)-\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K\mid\gamma)\|_{\mathrm{F}}~{}\leq~{}48d_{x}\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})\|P_{K,\gamma}\|_{\mathrm{op}}^{7}r.
Proof.

Due to our assumption on r=𝐱0r=\|\mathbf{x}_{0}\|, we have that,

𝐱0PK,γ𝐱0PK,γop𝐱02rnl24βnl2PK,γop3.\displaystyle\mathbf{x}_{0}^{\top}P_{K,\gamma}\mathbf{x}_{0}~{}\leq~{}\|P_{K,\gamma}\|_{\mathrm{op}}\|\mathbf{x}_{0}\|^{2}~{}\leq~{}\frac{r_{\mathrm{nl}}^{2}}{4\beta_{\mathrm{nl}}^{2}\|P_{K,\gamma}\|_{\mathrm{op}}^{3}}.

Therefore, we can apply Lemma B.5 to conclude that,

|Jnl(Kγ,𝐱0)Jlin(Kγ,𝐱0)|8βnl𝐱03PK,γop4.\displaystyle\big{|}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x}_{0})-J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}_{0})\big{|}~{}\leq~{}8\beta_{\mathrm{nl}}\|\mathbf{x}_{0}\|^{3}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}.

Next, we multiply both sides by dx/r2d_{x}/r^{2}, take expectations, and apply Jensen’s inequality to get that,

|dxr2𝔼𝐱0r𝒮dx1Jnl(Kγ,𝐱0)dxr2𝔼𝐱0r𝒮dx1Jlin(Kγ,𝐱0)|8dxr2βnlPK,γop4𝔼𝐱0r𝒮dx1𝐱03.\displaystyle\big{|}\frac{d_{x}}{r^{2}}\mathbb{E}_{\mathbf{x}_{0}\sim r\cdot\mathcal{S}^{d_{x}-1}}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x}_{0})-\frac{d_{x}}{r^{2}}\mathbb{E}_{\mathbf{x}_{0}\sim r\cdot\mathcal{S}^{d_{x}-1}}J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}_{0})\big{|}~{}\leq~{}8\frac{d_{x}}{r^{2}}\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}\mathbb{E}_{\mathbf{x}_{0}\sim r\cdot\mathcal{S}^{d_{x}-1}}\|\mathbf{x}_{0}\|^{3}.

Given our definitions of the linear objective in Definition 2.1, we have that,

dxr2𝔼𝐱0r𝒮dx1Jlin(Kγ,𝐱0)=Jlin(Kγ),\frac{d_{x}}{r^{2}}\mathbb{E}_{\mathbf{x}_{0}\sim r\cdot\mathcal{S}^{d_{x}-1}}J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}_{0})=J_{\mathrm{lin}}(K\mid\gamma),

for all r>0r>0. Therefore, we can rewrite the inequality above as,

|Jnl(Kγ,r)Jlin(Kγ)|8dxβnlPK,γop4r.\displaystyle\big{|}J_{\mathrm{nl}}(K\mid\gamma,r)-J_{\mathrm{lin}}(K\mid\gamma)\big{|}~{}\leq~{}8d_{x}\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}\cdot r.

The second part of the proposition uses the same argument as part a, but this time employing Lemma B.6 to bound the difference in gradients (pointwise). ∎
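
The closeness of costs in part a) can be visualized with a Monte Carlo estimate of J_nl(K | γ, r); the sketch below uses a toy nonlinear system and a small radius r, both illustrative choices.

```python
# A Monte Carlo sketch of Proposition 3.3 a): estimate J_nl(K | gamma, r) by averaging
# damped nonlinear rollouts over x0 ~ r * S^{d_x - 1}, and compare with
# J_lin(K | gamma) = tr[P_{K, gamma}].  All specific choices are illustrative.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[1.0, 0.2], [0.0, 1.1]]); B = np.eye(2); Q = R = np.eye(2)
gamma, r, dx = 0.7, 1e-2, 2
K = np.array([[-0.5, -0.1], [0.0, -0.6]])

def G_nl(x, u):
    return A @ x + B @ u + 0.1 * np.array([x[0] * x[1], x[1] ** 2])   # quadratic remainder

def rollout_cost(x0, T=200):
    cost, x = 0.0, x0.copy()
    for _ in range(T):
        cost += x @ (Q + K.T @ R @ K) @ x
        x = np.sqrt(gamma) * G_nl(x, K @ x)
    return cost

rng = np.random.default_rng(0)
costs = []
for _ in range(1000):
    x0 = rng.normal(size=dx); x0 *= r / np.linalg.norm(x0)            # x0 ~ r * S^{dx - 1}
    costs.append(rollout_cost(x0))
J_nl_est = dx / r**2 * np.mean(costs)

P = solve_discrete_lyapunov((np.sqrt(gamma) * (A + B @ K)).T, Q + K.T @ R @ K)
print(J_nl_est, np.trace(P))     # close for small r, as the proposition predicts
```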

In short, the previous proposition states that if the cost on the linear system is bounded, then the costs and gradients of the nonlinear objective and its Jacobian linearization are close. We can also prove the analogous statement which establishes closeness while assuming that the cost on the nonlinear system is bounded.

Lemma B.3.

Let α>1\alpha>1 be such that 80dx2Jnl(Kγ,r)α80d_{x}^{2}J_{\mathrm{nl}}(K\mid\gamma,r)~{}\leq~{}\alpha.

  1.

    If rrnl264βnlα2(1+Kop)r~{}\leq~{}\frac{r_{\mathrm{nl}}^{2}}{64\beta_{\mathrm{nl}}\alpha^{2}(1+\|K\|_{\mathrm{op}})}, then |Jnl(Kγ,r)Jlin(Kγ)|8dxβnlα4r|J_{\mathrm{nl}}(K\mid\gamma,r)-J_{\mathrm{lin}}(K\mid\gamma)|~{}\leq~{}8d_{x}\beta_{\mathrm{nl}}\alpha^{4}r.

  2.

    If r112βnlα5/2r~{}\leq~{}\frac{1}{12\beta_{\mathrm{nl}}\alpha^{5/2}}, then KJnl(Kγ,r)KJlin(Kγ)F48dxβnl(1+Bop)α7r.\|\nabla\mkern-2.5mu_{K}J_{\mathrm{nl}}(K\mid\gamma,r)-\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K\mid\gamma)\|_{\mathrm{F}}~{}\leq~{}48d_{x}\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})\alpha^{7}\cdot r.

Proof.

The lemma is a consequence of combining Proposition 3.3 and Proposition C.5. In particular, from Proposition C.5 if rmin{αrnl2,dx64α2βnl(1+Kop)}r~{}\leq~{}\min\{\alpha r_{\mathrm{nl}}^{2},\;\frac{d_{x}}{64\alpha^{2}\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}})}\}, then

80dx2Jnl(Kγ,r)min{Jlin(Kγ),α}\displaystyle 80d_{x}^{2}J_{\mathrm{nl}}(K\mid\gamma,r)~{}\geq~{}\min\{J_{\mathrm{lin}}(K\mid\gamma),\alpha\}

However, since α80dx2Jnl(Kγ,r)\alpha~{}\geq~{}80d_{x}^{2}J_{\mathrm{nl}}(K\mid\gamma,r), we conclude that 80dx2Jnl(Kγ,r)Jlin(Kγ)80d_{x}^{2}J_{\mathrm{nl}}(K\mid\gamma,r)~{}\geq~{}J_{\mathrm{lin}}(K\mid\gamma). Having shown that the linear cost is bounded, we can now plug in Proposition 3.3. In particular, if

rrnl2βnlα2rnl2βnlJlin(Kγ)2\displaystyle r~{}\leq~{}\frac{r_{\mathrm{nl}}}{2\beta_{\mathrm{nl}}\alpha^{2}}~{}\leq~{}\frac{r_{\mathrm{nl}}}{2\beta_{\mathrm{nl}}J_{\mathrm{lin}}(K\mid\gamma)^{2}}

then, Proposition 3.3 states that

|Jnl(Kγ,r)Jlin(Kγ)|8dxβnlJlin(Kγ)4r8dxβnlα4r.\displaystyle|J_{\mathrm{nl}}(K\mid\gamma,r)-J_{\mathrm{lin}}(K\mid\gamma)|~{}\leq~{}8d_{x}\beta_{\mathrm{nl}}J_{\mathrm{lin}}(K\mid\gamma)^{4}r~{}\leq~{}8d_{x}\beta_{\mathrm{nl}}\alpha^{4}r.

To prove the second part of the statement, we again use Proposition 3.3. In particular, since

r112βnlα5/2112βnlJlin(Kγ)5/2\displaystyle r~{}\leq~{}\frac{1}{12\beta_{\mathrm{nl}}\alpha^{5/2}}~{}\leq~{}\frac{1}{12\beta_{\mathrm{nl}}J_{\mathrm{lin}}(K\mid\gamma)^{5/2}}

we can hence conclude that

KJnl(Kγ,r)KJlin(Kγ)F48dxβnl(1+Bop)α7r.\displaystyle\|\nabla\mkern-2.5mu_{K}J_{\mathrm{nl}}(K\mid\gamma,r)-\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K\mid\gamma)\|_{\mathrm{F}}~{}\leq~{}48d_{x}\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})\alpha^{7}\cdot r.

B.1 Discount annealing on nonlinear systems: Proof of Theorem 2

As in Theorem 1, we first prove parts c)c) and d)d) by induction and then prove parts a)a) and b)b) separately.

B.1.1 Proof of c)c) and d)d)

Base case. As before, at each iteration tt of discount annealing, policy gradients is initialized at Kt,0:=KtK_{t,0}:=K_{t} and computes updates according to,

Kt,j+1=Kt,jηε-𝙶𝚛𝚊𝚍(Jnl(γt,r),Kt,j).\displaystyle K_{t,j+1}=K_{t,j}-\eta\cdot\varepsilon\text{-}\mathtt{Grad}\left(J_{\mathrm{nl}}(\cdot\mid\gamma_{t},r_{\star}),K_{t,j}\right).

To prove correctness, we show that the noisy gradients on the nonlinear system are close to the true gradients on the linear system. That is,

ε-𝙶𝚛𝚊𝚍(Jnl(γt,r),Kt,j)Jlin(Kt,jγt)F𝒞pg,tdxj0,\displaystyle\|\varepsilon\text{-}\mathtt{Grad}\left(J_{\mathrm{nl}}(\cdot\mid\gamma_{t},r_{\star}),K_{t,j}\right)-\nabla\mkern-2.5muJ_{\mathrm{lin}}(K_{t,j}\mid\gamma_{t})\|_{\mathrm{F}}~{}\leq~{}\mathcal{C}_{\mathrm{pg},t}d_{x}\quad\forall j~{}\geq~{}0, (B.4)

where 𝒞pg,t=poly(1/Aop,1/Bop,1/Jlin(Ktγt))\mathcal{C}_{\mathrm{pg},t}=\mathrm{poly}(1/\|A\|_{\mathrm{op}},1/\|B\|_{\mathrm{op}},1/J_{\mathrm{lin}}(K_{t}\mid\gamma_{t})) is again a fixed polynomial from Lemma A.3.

Consider the first iteration of discount annealing. By our choice of γ0\gamma_{0}, we have that Jlin(K0γ0)<J_{\mathrm{lin}}(K_{0}\mid\gamma_{0})<\infty. Therefore, by Proposition 3.3, if

rmin{rnl12βnlJlin(K0,0γ0),𝒞pg,0100βnl(1+Bop)Jlin(K0,0γ0)7}\displaystyle r_{\star}~{}\leq~{}\min\left\{\frac{r_{\mathrm{nl}}}{12\beta_{\mathrm{nl}}J_{\mathrm{lin}}(K_{0,0}\mid\gamma_{0})},\frac{\mathcal{C}_{\mathrm{pg},0}}{100\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})J_{\mathrm{lin}}(K_{0,0}\mid\gamma_{0})^{7}}\right\}

it must hold that Jnl(K0,0γ0,r)Jlin(K0,0γ0)F.5𝒞pg,0dx\|\nabla\mkern-2.5muJ_{\mathrm{nl}}(K_{0,0}\mid\gamma_{0},r_{\star})-\nabla\mkern-2.5muJ_{\mathrm{lin}}(K_{0,0}\mid\gamma_{0})\|_{\mathrm{F}}~{}\leq~{}.5\mathcal{C}_{\mathrm{pg},0}d_{x}. Likewise, if we choose the tolerance parameter ε.5𝒞pg,0dx\varepsilon~{}\leq~{}.5\mathcal{C}_{\mathrm{pg},0}d_{x} in ε-𝙶𝚛𝚊𝚍\varepsilon\text{-}\mathtt{Grad} then we have that

ε-𝙶𝚛𝚊𝚍(Jnl(γ0,r),K0,0)Jnl(K0,0γ0,r)F.5𝒞pg,0dx.\|\varepsilon\text{-}\mathtt{Grad}\left(J_{\mathrm{nl}}(\cdot\mid\gamma_{0},r_{\star}),K_{0,0}\right)-\nabla\mkern-2.5muJ_{\mathrm{nl}}(K_{0,0}\mid\gamma_{0},r_{\star})\|_{\mathrm{F}}~{}\leq~{}.5\mathcal{C}_{\mathrm{pg},0}d_{x}.

By the triangle inequality, the inequality in Eq. B.4 holds for t=0t=0 and j=0j=0. Moreover, because Lemma A.3 shows that policy gradients is a descent method, that is, Jlin(K0,jγ0)Jlin(K0,0γ0)J_{\mathrm{lin}}(K_{0,j}\mid\gamma_{0})~{}\leq~{}J_{\mathrm{lin}}(K_{0,0}\mid\gamma_{0}) for all j0j~{}\geq~{}0, Eq. B.4 also holds for all j0j~{}\geq~{}0 for the same choice of rr_{\star} and tolerance parameter for ε-𝙶𝚛𝚊𝚍\varepsilon\text{-}\mathtt{Grad}. By the guarantee of Lemma A.3, for t=0t=0, policy gradients achieves the guarantee outlined in Step 2 using at most poly(Aop,Bop,Jlin(K0γ0))\mathrm{poly}(\|A\|_{\mathrm{op}},\|B\|_{\mathrm{op}},J_{\mathrm{lin}}(K_{0}\mid\gamma_{0})) many queries.

To prove that random search achieves the guarantee outlined in Step 4 at iteration 0 of discount annealing, we appeal to Lemma C.7. In particular, we instantiate the lemma with fJnl(K1,r)f\leftarrow J_{\mathrm{nl}}(K_{1}\mid\cdot,r_{\star}), f18Jnl(K1γ0,r)f_{1}\leftarrow 8J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star}), f22.5Jnl(K1γ0,r)f_{2}\leftarrow 2.5J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star}). As before, the algorithm requires values f¯1[2.9Jnl(K1γ0,r),3Jnl(K1γ0,r)]\overline{f}_{1}\in[2.9J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star}),3J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star})] and f¯2[6Jnl(K1γ0,r),6.1Jnl(K1γ0,r)]\overline{f}_{2}\in[6J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star}),6.1J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star})]. These can be estimated via two calls to ε-𝙴𝚟𝚊𝚕\varepsilon\text{-}\mathtt{Eval} with tolerance parameter .01dx.01d_{x}.

To show that the lemma applies, we only need to lower bound the width of the region of feasible γ\gamma such that

2.9Jnl(K1γ0,r)Jnl(K1γ,r)6.1Jnl(K1γ0,r)\displaystyle 2.9J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star})~{}\leq~{}J_{\mathrm{nl}}(K_{1}\mid\gamma,r_{\star})~{}\leq~{}6.1J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star}) (B.5)

From the guarantee of policy gradients, we know that Jlin(K1γ0)2tr[P]J_{\mathrm{lin}}(K_{1}\mid\gamma_{0})~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right]. Furthermore, from the proof of Theorem 1, we know that there exist γ′′,γ[0,1]\gamma^{\prime\prime},\gamma^{\prime}\in[0,1] satisfying γ′′γ15200tr[P]4γ0\gamma^{\prime\prime}-\gamma^{\prime}~{}\geq~{}\frac{1}{5200\mathrm{tr}\left[P_{\star}\right]^{4}}\gamma_{0}, such that for all γ[γ,γ′′]\gamma\in[\gamma^{\prime},\gamma^{\prime\prime}]

3Jlin(K1γ0)Jlin(K1γ)6Jlin(K1γ0).\displaystyle 3J_{\mathrm{lin}}(K_{1}\mid\gamma_{0})~{}\leq~{}J_{\mathrm{lin}}(K_{1}\mid\gamma)~{}\leq~{}6J_{\mathrm{lin}}(K_{1}\mid\gamma_{0}). (B.6)

To finish the proof of correctness, we show that any γ\gamma that satisfies Eq. B.6 must also satisfy Eq. B.5. In particular, since Jlin(K1γ0)2tr[P]J_{\mathrm{lin}}(K_{1}\mid\gamma_{0})~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right] and Jlin(K1γ)12tr[P]J_{\mathrm{lin}}(K_{1}\mid\gamma)~{}\leq~{}12\mathrm{tr}\left[P_{\star}\right] for

rmin{rnl2βnl(12tr[P])2,.018βnl(12tr[P])4},\displaystyle r_{\star}~{}\leq~{}\min\left\{\frac{r_{\mathrm{nl}}}{2\beta_{\mathrm{nl}}(12\mathrm{tr}\left[P_{\star}\right])^{2}},\;\frac{.01}{8\beta_{\mathrm{nl}}(12\mathrm{tr}\left[P_{\star}\right])^{4}}\right\},

it holds that |Jnl(K1γ0,r)Jlin(K1γ0)|.01dx|J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star})-J_{\mathrm{lin}}(K_{1}\mid\gamma_{0})|~{}\leq~{}.01d_{x} and |Jnl(K1γ,r)Jlin(K1γ)|.01dx|J_{\mathrm{nl}}(K_{1}\mid\gamma,r_{\star})-J_{\mathrm{lin}}(K_{1}\mid\gamma)|~{}\leq~{}.01d_{x}. Using these two inequalities along with Eq. B.6 implies that Eq. B.5 must also hold. Therefore, the width of the feasible region is at least γ0/(5200tr[P]4)\gamma_{0}/(5200\mathrm{tr}\left[P_{\star}\right]^{4}), and, by Lemma C.7, random search must return a valid discount factor after at most the reciprocal of this width many iterations.

Inductive step.

To show that policy gradients converges, we reuse the argument from the base case, where instead of referring to Jlin(K0γ0)J_{\mathrm{lin}}(K_{0}\mid\gamma_{0}) and Jnl(K0γ0,r)J_{\mathrm{nl}}(K_{0}\mid\gamma_{0},r_{\star}) we use Jlin(Ktγt)J_{\mathrm{lin}}(K_{t}\mid\gamma_{t}) and Jnl(Ktγt,r)J_{\mathrm{nl}}(K_{t}\mid\gamma_{t},r_{\star}). The only additional step is to ensure that supt1Jlin(Ktγt)\sup_{t~{}\geq~{}1}J_{\mathrm{lin}}(K_{t}\mid\gamma_{t}) is uniformly bounded.

To prove this, from the inductive hypothesis on the correctness of binary search at previous iterations, we know that Jnl(Ktγt,r)8Jnl(Ktγt1,r)J_{\mathrm{nl}}(K_{t}\mid\gamma_{t},r_{\star})~{}\leq~{}8J_{\mathrm{nl}}(K_{t}\mid\gamma_{t-1},r_{\star}). Again by the inductive hypothesis, at time step t1t-1 policy gradients achieves the desired guarantee from Step 2, implying that Jlin(Ktγt1)2tr[P]J_{\mathrm{lin}}(K_{t}\mid\gamma_{t-1})~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right]. By our choice of rr_{\star}, this implies that

|Jlin(Ktγt1)Jnl(Ktγt1,r)|.01dx|J_{\mathrm{lin}}(K_{t}\mid\gamma_{t-1})-J_{\mathrm{nl}}(K_{t}\mid\gamma_{t-1},r_{\star})|~{}\leq~{}.01d_{x}

and hence Jnl(Ktγt,r)20tr[P]J_{\mathrm{nl}}(K_{t}\mid\gamma_{t},r_{\star})~{}\leq~{}20\mathrm{tr}\left[P_{\star}\right]. Now, we can apply Lemma B.3 to conclude that for α:=(80×20)dx2tr[P]\alpha:=(80\times 20)d_{x}^{2}\mathrm{tr}\left[P_{\star}\right] and

rmin{rnl264βnlα2(1+Kop),.018βnlα4}\displaystyle r_{\star}~{}\leq~{}\min\left\{\frac{r_{\mathrm{nl}}^{2}}{64\beta_{\mathrm{nl}}\alpha^{2}(1+\|K\|_{\mathrm{op}})},\;\frac{.01}{8\beta_{\mathrm{nl}}\alpha^{4}}\right\}

it holds that |Jnl(Ktγt,r)Jlin(Ktγt)|.01dx|J_{\mathrm{nl}}(K_{t}\mid\gamma_{t},r_{\star})-J_{\mathrm{lin}}(K_{t}\mid\gamma_{t})|~{}\leq~{}.01d_{x} and hence Jlin(Ktγt)21tr[P]J_{\mathrm{lin}}(K_{t}\mid\gamma_{t})~{}\leq~{}21\mathrm{tr}\left[P_{\star}\right]. Therefore, supt1Jlin(Ktγt)21tr[P]\sup_{t~{}\geq~{}1}J_{\mathrm{lin}}(K_{t}\mid\gamma_{t})~{}\leq~{}21\mathrm{tr}\left[P_{\star}\right].

Similarly, the inductive step for the random search procedure follows from noting that the exact same argument can be repeated by replacing Jnl(K1γ0,r)J_{\mathrm{nl}}(K_{1}\mid\gamma_{0},r_{\star}) with Jnl(Ktγt1,r)J_{\mathrm{nl}}(K_{t}\mid\gamma_{t-1},r_{\star}) and Jlin(K1γ0)J_{\mathrm{lin}}(K_{1}\mid\gamma_{0}) with Jlin(Ktγt1)J_{\mathrm{lin}}(K_{t}\mid\gamma_{t-1}), since (by the inductive hypothesis) Jlin(Ktγt1)2tr[P]J_{\mathrm{lin}}(K_{t}\mid\gamma_{t-1})~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right].

B.1.2 Proof of a)a)

By the guarantee from Step 2, the algorithm returns a K^\widehat{K} which satisfies the following guarantee on the linear system:

Jlin(K^1)minKJlin(K1)dx.\displaystyle J_{\mathrm{lin}}(\widehat{K}\mid 1)-\min_{K}J_{\mathrm{lin}}(K\mid 1)~{}\leq~{}d_{x}.

Therefore, Jlin(K^1)2tr[P]J_{\mathrm{lin}}(\widehat{K}\mid 1)~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right]. Now, by Lemma B.2, the following holds,

𝐱t,nl2\displaystyle\|\mathbf{x}_{t,\mathrm{nl}}\|^{2} P^op𝐱02(112P^op)t\displaystyle~{}\leq~{}\|\widehat{P}\|_{\mathrm{op}}\|\mathbf{x}_{0}\|^{2}\left(1-\frac{1}{2\|\widehat{P}\|_{\mathrm{op}}}\right)^{t}
2tr[P]𝐱02exp(t4tr[P]),\displaystyle~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right]\|\mathbf{x}_{0}\|^{2}\exp\left(-\frac{t}{4\mathrm{tr}\left[P_{\star}\right]}\right),

for P^:=𝖽𝗅𝗒𝖺𝗉(A+BK^,Q+K^RK^)\widehat{P}:=\mathsf{dlyap}(A+B\widehat{K},Q+\widehat{K}^{\top}R\widehat{K}) and all 𝐱0\mathbf{x}_{0} such that 𝐱0rnl/(8βnltr[P]2).\|\mathbf{x}_{0}\|~{}\leq~{}r_{\mathrm{nl}}/(8\beta_{\mathrm{nl}}\mathrm{tr}\left[P_{\star}\right]^{2}).

B.1.3 Proof of b)b)

The bound on the number of subproblems solved by the discount annealing algorithm is similar to that of the linear case. The crux of the argument for part b)b) is to show that for any γ[γt,1]\gamma^{\prime}\in[\gamma_{t},1] such that

2.5Jnl(Kt+1γt,r)Jnl(Kt+1γ,r)8Jnl(Kt+1γt,r)\displaystyle 2.5J_{\mathrm{nl}}(K_{t+1}\mid\gamma_{t},r_{\star})~{}\leq~{}J_{\mathrm{nl}}(K_{t+1}\mid\gamma^{\prime},r_{\star})~{}\leq~{}8J_{\mathrm{nl}}(K_{t+1}\mid\gamma_{t},r_{\star})

the following inequality must also hold: Jlin(Kt+1γ)2Jlin(Kt+1γt)J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})~{}\geq~{}2J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}). Once we’ve lower bounded the cost on the linear system, we can repeat the same argument as in Theorem 1. Since the cost on the linear system is nondecreasing in γ\gamma, it must be the case that γ\gamma^{\prime} satisfies

γ(18PKt+1,γtop4+1)2γt(1128tr[P]4+1)2γt.\displaystyle\gamma^{\prime}~{}\geq~{}\left(\frac{1}{8\|P_{K_{t+1},\gamma_{t}}\|_{\mathrm{op}}^{4}}+1\right)^{2}\gamma_{t}~{}\geq~{}\left(\frac{1}{128\mathrm{tr}\left[P_{\star}\right]^{4}}+1\right)^{2}\gamma_{t}.

Here, we have again used the calculation that,

PKt+1,γtop4tr[PKt+1,γt]4=Jlin(Kt+1γt)4(minKJlin(Kγt)+dx)416tr[P]4,\displaystyle\|P_{K_{t+1},\gamma_{t}}\|_{\mathrm{op}}^{4}~{}\leq~{}\mathrm{tr}\left[P_{K_{t+1},\gamma_{t}}\right]^{4}=J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})^{4}~{}\leq~{}(\min_{K}J_{\mathrm{lin}}(K\mid\gamma_{t})+d_{x})^{4}~{}\leq~{}16\mathrm{tr}\left[P_{\star}\right]^{4},

which follows from the guarantee that (for our choice of rr_{\star}) policy gradients on the nonlinear system converges to a near optimal controller for the system’s Jacobian linearization. Hence, as in the linear setting, we conclude that γt(1/(128tr[P]4)+1)2tγ0\gamma_{t}~{}\geq~{}(1/(128\mathrm{tr}\left[P_{\star}\right]^{4})+1)^{2t}\gamma_{0} and discount annealing achieves the same rate as for linear systems.

We now focus on establishing that Jlin(Kt+1γ)2Jlin(Kt+1γt)J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})~{}\geq~{}2J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}). By the guarantee of policy gradients, we have that Jlin(Kt+1γt)minKJlin(Kγt)+dx2tr[P]J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})~{}\leq~{}\min_{K}J_{\mathrm{lin}}(K\mid\gamma_{t})+d_{x}~{}\leq~{}2\mathrm{tr}\left[P_{\star}\right]. Therefore, by Proposition 3.3, since

r.01rnl8×24tr[P]4βnlmin{rnl2βnlPKt+1,γtop2,.018βnlPKt+1,γtop4}\displaystyle r_{\star}~{}\leq~{}\frac{.01r_{\mathrm{nl}}}{8\times 2^{4}\cdot\mathrm{tr}\left[P_{\star}\right]^{4}\beta_{\mathrm{nl}}}~{}\leq~{}\min\left\{\frac{r_{\mathrm{nl}}}{2\beta_{\mathrm{nl}}\|P_{K_{t+1},\gamma_{t}}\|_{\mathrm{op}}^{2}}\;,\frac{.01}{8\beta_{\mathrm{nl}}\|P_{K_{t+1},\gamma_{t}}\|_{\mathrm{op}}^{4}}\right\}

it holds that |Jlin(Kt+1γt)Jnl(Kt+1γt,r)|.01dx|J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})-J_{\mathrm{nl}}(K_{t+1}\mid\gamma_{t},r_{\star})|~{}\leq~{}.01d_{x}.

Next, we show that |Jlin(Kt+1γ)Jnl(Kt+1γ,r)||J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})-J_{\mathrm{nl}}(K_{t+1}\mid\gamma^{\prime},r_{\star})| is also small. In particular, the previous statement, together with the guarantee from Step 4, implies that

Jnl(Kt+1γ,r)8Jnl(Kt+1γt,r)8(Jlin(Kt+1γt)+.01dx)8.08Jlin(Kt+1γt)16.16tr[P].\displaystyle J_{\mathrm{nl}}(K_{t+1}\mid\gamma^{\prime},r_{\star})~{}\leq~{}8J_{\mathrm{nl}}(K_{t+1}\mid\gamma_{t},r_{\star})~{}\leq~{}8(J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})+.01d_{x})~{}\leq~{}8.08J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t})~{}\leq~{}16.16\mathrm{tr}\left[P_{\star}\right].

Therefore, for α:=(80×17)tr[P]dx2\alpha:=(80\times 17)\mathrm{tr}\left[P_{\star}\right]d_{x}^{2}, Lemma B.3 implies that if,

rrnl264βnlα2(1+Kt+1op),\displaystyle r_{\star}~{}\leq~{}\frac{r_{\mathrm{nl}}^{2}}{64\beta_{\mathrm{nl}}\alpha^{2}(1+\|K_{t+1}\|_{\mathrm{op}})},

it holds that |Jnl(Kt+1γ,r)Jlin(Kt+1γ)|8dxβnlα4r|J_{\mathrm{nl}}(K_{t+1}\mid\gamma^{\prime},r_{\star})-J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})|~{}\leq~{}8d_{x}\beta_{\mathrm{nl}}\alpha^{4}r_{\star}. Hence, if r.01/(8βnlα4)r_{\star}~{}\leq~{}.01/(8\beta_{\mathrm{nl}}\alpha^{4}), we get that |Jlin(Kt+1γ)Jnl(Kt+1γ,r)|.01dx|J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})-J_{\mathrm{nl}}(K_{t+1}\mid\gamma^{\prime},r_{\star})|~{}\leq~{}.01d_{x}. Using again the fact that min{Jlin(Kγ),Jnl(Kγ,r)}dx\min\{J_{\mathrm{lin}}(K\mid\gamma),J_{\mathrm{nl}}(K\mid\gamma,r)\}~{}\geq~{}d_{x} for all K,γ,rK,\gamma,r, we conclude that

Jlin(Kt+1γ)\displaystyle J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime}) .99Jnl(Kt+1γ,r)\displaystyle~{}\geq~{}.99J_{\mathrm{nl}}(K_{t+1}\mid\gamma^{\prime},r_{\star})
2.5×.99Jnl(Kt+1γt,r)\displaystyle~{}\geq~{}2.5\times.99\cdot J_{\mathrm{nl}}(K_{t+1}\mid\gamma_{t},r_{\star})
2.5×.992Jlin(Kt+1γt),\displaystyle~{}\geq~{}2.5\times.99^{2}\cdot J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}),

which finishes the proof of the fact that Jlin(Kt+1γ)2Jlin(Kt+1γt)J_{\mathrm{lin}}(K_{t+1}\mid\gamma^{\prime})~{}\geq~{}2J_{\mathrm{lin}}(K_{t+1}\mid\gamma_{t}).

B.2 Relating costs and gradients to the linear system: Proof of Proposition 3.3

In order to relate the properties of the nonlinear system to its Jacobian linearization, we employ the following version of the performance difference lemma.

Lemma B.4 (Performance Difference).

Assume Jlin(Kγ)<J_{\mathrm{lin}}(K\mid\gamma)<\infty and let {𝐱t,nl}t=0\{\mathbf{x}_{t,\mathrm{nl}}\}_{t=0}^{\infty} be the sequence of states generated according to 𝐱t+1,nl=γGnl(𝐱t,nl,K𝐱t,nl)\mathbf{x}_{t+1,\mathrm{nl}}=\sqrt{\gamma}G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}) where 𝐱0,nl=𝐱0\mathbf{x}_{0,\mathrm{nl}}=\mathbf{x}_{0}. Then,

Jnl(Kγ,𝐱)Jlin(Kγ,𝐱)=t=0γfnl(𝐱t,nl,K𝐱t,nl)PK,γ(Gnl(𝐱t,nl,K𝐱t,nl)+Glin(𝐱t,nl,K𝐱t,nl)).\displaystyle J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x})=\sum_{t=0}^{\infty}\gamma\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}\left(G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})+G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right).
Proof.

From the definition of the relevant objectives and Fact A.2, we get that,

Jnl(Kγ,𝐱)Jlin(Kγ,𝐱)\displaystyle J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}) =(t=0𝐱t,nl(Q+KRK)𝐱t,nl)𝐱PK,γ𝐱\displaystyle=\left(\sum_{t=0}^{\infty}\mathbf{x}_{t,\mathrm{nl}}^{\top}(Q+K^{\top}RK)\mathbf{x}_{t,\mathrm{nl}}\right)-\mathbf{x}^{\top}P_{K,\gamma}\mathbf{x}
=(t=0𝐱t,nl(Q+KRK)𝐱t,nl±𝐱t,nlPK,γ𝐱t,nl)𝐱PK,γ𝐱\displaystyle=\left(\sum_{t=0}^{\infty}\mathbf{x}_{t,\mathrm{nl}}^{\top}(Q+K^{\top}RK)\mathbf{x}_{t,\mathrm{nl}}\pm\mathbf{x}_{t,\mathrm{nl}}^{\top}P_{K,\gamma}\mathbf{x}_{t,\mathrm{nl}}\right)-\mathbf{x}^{\top}P_{K,\gamma}\mathbf{x}
=t=0𝐱t,nl(Q+KRK)𝐱t,nl+𝐱t+1,nlPK,γ𝐱t+1,nl𝐱t,nlPK,γ𝐱t,nl,\displaystyle=\sum_{t=0}^{\infty}\mathbf{x}_{t,\mathrm{nl}}^{\top}(Q+K^{\top}RK)\mathbf{x}_{t,\mathrm{nl}}+\mathbf{x}_{t+1,\mathrm{nl}}^{\top}P_{K,\gamma}\mathbf{x}_{t+1,\mathrm{nl}}-\mathbf{x}_{t,\mathrm{nl}}^{\top}P_{K,\gamma}\mathbf{x}_{t,\mathrm{nl}}, (B.7)

where in the last line we have used the fact that 𝐱0,nl=𝐱\mathbf{x}_{0,\mathrm{nl}}=\mathbf{x}. The proof then follows from the following two observations. First, by definition of the state sequence 𝐱t,nl\mathbf{x}_{t,\mathrm{nl}},

𝐱t+1,nl=γGnl(𝐱t,nl,K𝐱t,nl).\displaystyle\mathbf{x}_{t+1,\mathrm{nl}}=\sqrt{\gamma}\cdot G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}).

Second, since PK,γP_{K,\gamma} is the solution to a Lyapunov equation,

𝐱t,nlPK,γ𝐱t,nl\displaystyle\mathbf{x}_{t,\mathrm{nl}}^{\top}P_{K,\gamma}\mathbf{x}_{t,\mathrm{nl}} =𝐱t,nl(Q+KRK)𝐱t,nl+γ𝐱t,nl(A+BK)PK,γ(A+BK)𝐱t,nl\displaystyle=\mathbf{x}_{t,\mathrm{nl}}^{\top}(Q+K^{\top}RK)\mathbf{x}_{t,\mathrm{nl}}+\gamma\cdot\mathbf{x}_{t,\mathrm{nl}}^{\top}(A+BK)^{\top}P_{K,\gamma}(A+BK)\mathbf{x}_{t,\mathrm{nl}}
=𝐱t,nl(Q+KRK)𝐱t,nl+γGlin(𝐱t,nl,K𝐱t,nl)PK,γGlin(𝐱t,nl,K𝐱t,nl).\displaystyle=\mathbf{x}_{t,\mathrm{nl}}^{\top}(Q+K^{\top}RK)\mathbf{x}_{t,\mathrm{nl}}+\gamma\cdot G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}).

Plugging these last two lines into Eq. B.7, we get that Jnl(Kγ,𝐱)Jlin(Kγ,𝐱)J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}) is equal to,

=t=0γGnl(𝐱t,nl,K𝐱t,nl)PK,γGnl(𝐱t,nl,K𝐱t,nl)γGlin(𝐱t,nl,K𝐱t,nl)PK,γGlin(𝐱t,nl,K𝐱t,nl)\displaystyle=\sum_{t=0}^{\infty}\gamma\cdot G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})-\gamma\cdot G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})
=t=0γ(Gnl(𝐱t,nl,K𝐱t,nl)Glin(𝐱t,nl,K𝐱t,nl))PK,γ(Gnl(𝐱t,nl,K𝐱t,nl)+Glin(𝐱t,nl,K𝐱t,nl))\displaystyle=\sum_{t=0}^{\infty}\gamma\cdot\left(G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})-G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right)^{\top}P_{K,\gamma}\left(G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})+G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right)
=t=0γfnl(𝐱t,nl,K𝐱t,nl)PK,γ(Gnl(𝐱t,nl,K𝐱t,nl)+Glin(𝐱t,nl,K𝐱t,nl)).\displaystyle=\sum_{t=0}^{\infty}\gamma\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}\left(G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})+G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right).
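
Since the identity above is exact, it can be checked numerically on a toy nonlinear system; all matrices and the quadratic remainder in the sketch below are illustrative assumptions.

```python
# An illustrative numerical check of the performance difference identity (Lemma B.4).
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[1.0, 0.2], [0.0, 1.1]]); B = np.eye(2); Q = R = np.eye(2)
gamma = 0.7
K = np.array([[-0.5, -0.1], [0.0, -0.6]])
P = solve_discrete_lyapunov((np.sqrt(gamma) * (A + B @ K)).T, Q + K.T @ R @ K)

G_lin = lambda x, u: A @ x + B @ u
f_nl  = lambda x, u: 0.1 * np.array([x[0] * x[1], x[1] ** 2])
G_nl  = lambda x, u: G_lin(x, u) + f_nl(x, u)

x0 = np.array([0.05, -0.02])
lhs = -(x0 @ P @ x0)             # subtract J_lin(K | gamma, x0) = x0^T P_{K,gamma} x0
rhs, x = 0.0, x0.copy()
for _ in range(200):             # damped nonlinear rollout, long enough for both sums to converge
    u = K @ x
    lhs += x @ (Q + K.T @ R @ K) @ x                                 # accumulates J_nl(K | gamma, x0)
    rhs += gamma * f_nl(x, u) @ (P @ (G_nl(x, u) + G_lin(x, u)))
    x = np.sqrt(gamma) * G_nl(x, u)
print(lhs, rhs)                  # the two sides of Lemma B.4 agree up to numerical precision
```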

B.2.1 Establishing similarity of costs

The following lemma follows by bounding the terms appearing in the performance difference lemma.

Lemma B.5 (Similarity of Costs).

Assume Jlin(Kγ)<J_{\mathrm{lin}}(K\mid\gamma)<\infty and let {𝐱t,nl}t=0\{\mathbf{x}_{t,\mathrm{nl}}\}_{t=0}^{\infty} be the sequence of states generated according to 𝐱t+1,nl=γGnl(𝐱t,nl,K𝐱t,nl)\mathbf{x}_{t+1,\mathrm{nl}}=\sqrt{\gamma}G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}) where 𝐱0,nl=𝐱0\mathbf{x}_{0,\mathrm{nl}}=\mathbf{x}_{0}. For 𝐱0\mathbf{x}_{0} such that,

𝐱0PK,γ𝐱0rnl24βnl2PK,γop3,\displaystyle\mathbf{x}_{0}^{\top}P_{K,\gamma}\mathbf{x}_{0}~{}\leq~{}\frac{r_{\mathrm{nl}}^{2}}{4\beta_{\mathrm{nl}}^{2}\|P_{K,\gamma}\|_{\mathrm{op}}^{3}},

then,

|Jnl(Kγ,𝐱0)Jlin(Kγ,𝐱0)|8βnl𝐱03PK,γop4.\displaystyle\big{|}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x}_{0})-J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}_{0})\big{|}~{}\leq~{}8\beta_{\mathrm{nl}}\|\mathbf{x}_{0}\|^{3}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}.
Proof.

We begin with the following observation. Due to our assumption on 𝐱0\mathbf{x}_{0}, we can use Lemma B.2 to conclude that for all t0t~{}\geq~{}0, the following relationship holds for V0:=𝐱0,nlPK,γ𝐱0,nlV_{0}:=\mathbf{x}_{0,\mathrm{nl}}^{\top}P_{K,\gamma}\mathbf{x}_{0,\mathrm{nl}},

𝐱t,nl2𝐱t,nlPK,γ𝐱t,nlV0(112PK,γop)t.\displaystyle\|\mathbf{x}_{t,\mathrm{nl}}\|^{2}~{}\leq~{}\mathbf{x}_{t,\mathrm{nl}}^{\top}P_{K,\gamma}\mathbf{x}_{t,\mathrm{nl}}~{}\leq~{}V_{0}\cdot\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t}. (B.8)

Now, from the performance difference lemma (Lemma B.4), we get that Jnl(Kγ,𝐱)Jlin(Kγ,𝐱)J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}) is equal to:

t=0γfnl(𝐱t,nl,K𝐱t,nl)PK,γ(Gnl(𝐱t,nl,K𝐱t,nl)+Glin(𝐱t,nl,K𝐱t,nl)).\displaystyle\sum_{t=0}^{\infty}\gamma\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}\left(G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})+G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right).

Therefore, the difference |Jnl(Kγ,𝐱0)Jlin(Kγ,𝐱0)|\big{|}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x}_{0})-J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}_{0})\big{|} can be bounded by,

t=0γPK,γ1/2fnl(𝐱t,nl,K𝐱t,nl):=T1γPK,γ1/2(Gnl(𝐱t,nl,K𝐱t,nl)+Glin(𝐱t,nl,K𝐱t,nl)):=T2.\displaystyle\sum_{t=0}^{\infty}\underbrace{\|\sqrt{\gamma}\cdot P_{K,\gamma}^{1/2}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\|}_{:=T_{1}}\cdot\underbrace{\|\sqrt{\gamma}\cdot P_{K,\gamma}^{1/2}\left(G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})+G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right)\|}_{:=T_{2}}. (B.9)

Now, we analyze each of T1T_{1} and T2T_{2} separately. For T1T_{1}, by Lemma 3.1, and the fact that γ1\gamma~{}\leq~{}1,

γPK,γ1/2fnl(𝐱t,nl,K𝐱t,nl)\displaystyle\|\sqrt{\gamma}\cdot P_{K,\gamma}^{1/2}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| PK,γop1/2fnl(𝐱t,nl,K𝐱t,nl)\displaystyle~{}\leq~{}\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}\|f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\|
=PK,γop1/2βnl(𝐱t,nl2+K𝐱t,nl2)\displaystyle=\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}\cdot\beta_{\mathrm{nl}}(\|\mathbf{x}_{t,\mathrm{nl}}\|^{2}+\|K\mathbf{x}_{t,\mathrm{nl}}\|^{2})
PK,γop1/2βnl(Kop2+1)V0(112PK,γop)t,\displaystyle~{}\leq~{}\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}\beta_{\mathrm{nl}}\cdot(\|K\|_{\mathrm{op}}^{2}+1)V_{0}\cdot\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t},

where in the last line, we have used our assumption on the initial state and Lemma B.2. Moving onto T2T_{2}, we use the triangle inequality to get that

T2γPK,γ1/2Gnl(𝐱t,nl,K𝐱t,nl)+γPK,γ1/2Glin(𝐱t,nl,K𝐱t,nl).\displaystyle T_{2}~{}\leq~{}\|\sqrt{\gamma}\cdot P_{K,\gamma}^{1/2}G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\|+\|\sqrt{\gamma}\cdot P_{K,\gamma}^{1/2}G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\|.

For the second term above, by Lemma A.5 and Lemma B.2, we have that

γPK,γ1/2Glin(𝐱t,nl,K𝐱t,nl)\displaystyle\|\sqrt{\gamma}\cdot P_{K,\gamma}^{1/2}G_{\mathrm{lin}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| =γ𝐱t,nl(A+BK)PK,γ(A+BK)𝐱t,nl\displaystyle=\sqrt{\gamma\cdot\mathbf{x}_{t,\mathrm{nl}}^{\top}(A+BK)^{\top}P_{K,\gamma}(A+BK)\mathbf{x}_{t,\mathrm{nl}}}
𝐱t,nlPK,γ𝐱t,nl\displaystyle~{}\leq~{}\sqrt{\mathbf{x}_{t,\mathrm{nl}}^{\top}P_{K,\gamma}\mathbf{x}_{t,\mathrm{nl}}}
V01/2(112PK,γop)t/2.\displaystyle~{}\leq~{}V_{0}^{1/2}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t/2}.

Lastly, we bound the first term by again using Lemma B.2,

γPK,γ1/2Gnl(𝐱t,nl,K𝐱t,nl)\displaystyle\|\sqrt{\gamma}\cdot P_{K,\gamma}^{1/2}G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| =PK,γ1/2𝐱t+1,nlV01/2(112PK,γop)t+12.\displaystyle=\|P_{K,\gamma}^{1/2}\mathbf{x}_{t+1,\mathrm{nl}}\|~{}\leq~{}V_{0}^{1/2}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{\frac{t+1}{2}}.

Therefore, T2T_{2} is bounded by 2V01/2(11/(2PK,γop))t/22V_{0}^{1/2}(1-1/(2\|P_{K,\gamma}\|_{\mathrm{op}}))^{t/2}. Going back to Eq. B.9, we can combine our bounds on T1T_{1} and T2T_{2} to conclude that,

Jnl(Kγ,𝐱)Jlin(Kγ,𝐱)\displaystyle J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}) V03/22βnlPK,γop1/2(Kop2+1)t=0(112PK,γop)t\displaystyle~{}\leq~{}V_{0}^{3/2}\cdot 2\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}(\|K\|_{\mathrm{op}}^{2}+1)\cdot\sum_{t=0}^{\infty}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t}
=V03/24βnlPK,γop3/2(Kop2+1).\displaystyle=V_{0}^{3/2}\cdot 4\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{3/2}(\|K\|_{\mathrm{op}}^{2}+1).

Using the fact that 1+Kop22PK,γop1+\|K\|_{\mathrm{op}}^{2}~{}\leq~{}2\|P_{K,\gamma}\|_{\mathrm{op}} and V0𝐱02PK,γopV_{0}~{}\leq~{}\|\mathbf{x}_{0}\|^{2}\|P_{K,\gamma}\|_{\mathrm{op}}, we get that

V03/24βnlPK,γop3/2(Kop2+1)8βnl𝐱03PK,γop4.\displaystyle V_{0}^{3/2}\cdot 4\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{3/2}(\|K\|_{\mathrm{op}}^{2}+1)~{}\leq~{}8\beta_{\mathrm{nl}}\|\mathbf{x}_{0}\|^{3}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}.

B.2.2 Establishing similarity of gradients

Much like the previous lemma which bounds the costs between the linear and nonlinear system via the performance difference lemma, this next lemma differentiates the performance difference lemma to bound the difference between gradients.

Lemma B.6.

Assume Jlin(Kγ)<J_{\mathrm{lin}}(K\mid\gamma)<\infty. If 𝐱0\mathbf{x}_{0} is such that

\displaystyle\mathbf{x}_{0}^{\top}P_{K,\gamma}\mathbf{x}_{0}~{}\leq~{}\frac{r_{\mathrm{nl}}^{2}}{144\beta_{\mathrm{nl}}^{2}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}}

then,

KJnl(Kγ,𝐱0)KJlin(Kγ,𝐱0)F48βnl(1+Bop)PK,γop7𝐱03.\displaystyle\|\nabla\mkern-2.5mu_{K}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x}_{0})-\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}_{0})\|_{\mathrm{F}}~{}\leq~{}48\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})\|P_{K,\gamma}\|_{\mathrm{op}}^{7}\|\mathbf{x}_{0}\|^{3}.
Proof.

Using the variational definition of the Frobenius norm,

KJnl(Kγ,𝐱)KJlin(Kγ,𝐱)F\displaystyle\|\nabla\mkern-2.5mu_{K}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x})\|_{\mathrm{F}} =supΔF1tr[(KJnl(Kγ,𝐱)KJlin(Kγ,𝐱))Δ]\displaystyle=\sup_{\|\Delta\|_{\mathrm{F}}~{}\leq~{}1}\mathrm{tr}\left[(\nabla\mkern-2.5mu_{K}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-\nabla\mkern-2.5mu_{K}J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}))^{\top}\Delta\right]
=supΔF1DKJnl(Kγ,𝐱)DKJlin(Kγ,𝐱),\displaystyle=\sup_{\|\Delta\|_{\mathrm{F}}~{}\leq~{}1}\mathrm{D}_{K}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-\mathrm{D}_{K}J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x}),

where DK\mathrm{D}_{K} is the directional derivative operator in the direction Δ\Delta. The argument follows by bounding the directional derivative appearing above. From the performance difference lemma, Lemma B.4, we have that

DKJnl(Kγ,𝐱)DKJlin(Kγ,𝐱)=t=0γDK[fnl(𝐱t,nl,K𝐱t,nl)PK,γ𝐱t+1,nl].\displaystyle\mathrm{D}_{K}J_{\mathrm{nl}}(K\mid\gamma,\mathbf{x})-\mathrm{D}_{K}J_{\mathrm{lin}}(K\mid\gamma,\mathbf{x})=\sum_{t=0}^{\infty}\gamma\cdot\mathrm{D}_{K}\left[\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}\mathbf{x}_{t+1,\mathrm{nl}}\right].

Each term appearing in the sum above can be decomposed into the following three terms,

γ(DKfnl(𝐱t,nl,K𝐱t,nl))PK,γ𝐱t+1,nl:=T1\displaystyle\underbrace{\gamma\cdot\left(\mathrm{D}_{K}\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right)^{\top}P_{K,\gamma}\mathbf{x}_{t+1,\mathrm{nl}}}_{:=T_{1}}
+γfnl(𝐱t,nl,K𝐱t,nl)(DKPK,γ)𝐱t+1,nl:=T2\displaystyle\qquad+\underbrace{\gamma\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})(\mathrm{D}_{K}P_{K,\gamma})\mathbf{x}_{t+1,\mathrm{nl}}}_{:=T_{2}}
+γfnl(𝐱t,nl,K𝐱t,nl)PK,γ(DK𝐱t+1,nl):=T3.\displaystyle\qquad\qquad+\underbrace{\gamma\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})P_{K,\gamma}(\mathrm{D}_{K}\mathbf{x}_{t+1,\mathrm{nl}})}_{:=T_{3}}.

In order to bound each of these three terms, we start by bounding the directional derivatives appearing above. Throughout the remainder of the proof, we will make repeated use of the following inequalities which follow from Lemma 3.1, Lemma B.2, and our assumption on 𝐱0\mathbf{x}_{0}. For all t0t~{}\geq~{}0,

𝐱t,nl2\displaystyle\|\mathbf{x}_{t,\mathrm{nl}}\|^{2} V0(112PK,γop)t\displaystyle~{}\leq~{}V_{0}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t} (B.10)
𝐱t,nl+K𝐱t,nl\displaystyle\|\mathbf{x}_{t,\mathrm{nl}}\|+\|K\mathbf{x}_{t,\mathrm{nl}}\| rnl.\displaystyle~{}\leq~{}r_{\mathrm{nl}}. (B.11)
Lemma B.7 (Bounding DK𝐱t,nl\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}.).

Let {𝐱t,nl}\{\mathbf{x}_{t,\mathrm{nl}}\} be the sequence of states generated according to 𝐱t+1,nl=γGnl(𝐱t,nl,K𝐱t,nl)\mathbf{x}_{t+1,\mathrm{nl}}=\sqrt{\gamma}G_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}). Under the same assumptions as in Lemma B.6, for all t0t~{}\geq~{}0

DK𝐱t,nl||tPK,γop1/2(16PK,γop2+Bop)(112PK,γop)t12V01/2.\displaystyle\|\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}||~{}\leq~{}t\cdot\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}\left(\frac{1}{6\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}+\|B\|_{\mathrm{op}}\right)\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{\frac{t-1}{2}}V_{0}^{1/2}.
Proof.

Taking derivatives, we get that

DKfnl(𝐱t,nl,K𝐱t,nl)=𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t,nl,K𝐱t,nl)[DK𝐱t,nlDK(K𝐱t,nl)].\displaystyle\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})=\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})}\begin{bmatrix}\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}\\ \mathrm{D}_{K}(K\mathbf{x}_{t,\mathrm{nl}})\end{bmatrix}.

Since \mathrm{D}_{K}(K\mathbf{x}_{t,\mathrm{nl}})=\Delta\mathbf{x}_{t,\mathrm{nl}}+K(\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}), we can rewrite the expression above as,

DKfnl(𝐱t,nl,K𝐱t,nl)\displaystyle\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}) =𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t,nl,K𝐱t,nl)[0Δ]𝐱t,nl\displaystyle=\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})}\begin{bmatrix}0\\ \Delta\end{bmatrix}\mathbf{x}_{t,\mathrm{nl}} (B.12)
+𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t,nl,K𝐱t,nl)[IK]DK𝐱t,nl.\displaystyle\qquad+\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})}\begin{bmatrix}I\\ K\end{bmatrix}\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}.

Next, we compute DK𝐱t,nl\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}},

DK𝐱t,nl\displaystyle\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}} =γ(DKfnl(𝐱t1,nl,K𝐱t1,nl)+DK[(A+BK)𝐱t1,nl])\displaystyle=\sqrt{\gamma}\cdot\left(\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t-1,\mathrm{nl}},K\mathbf{x}_{t-1,\mathrm{nl}})+\mathrm{D}_{K}[(A+BK)\mathbf{x}_{t-1,\mathrm{nl}}]\right)
=γ(DKfnl(𝐱t1,nl,K𝐱t1,nl)+BΔ𝐱t1,nl+(A+BK)DK𝐱t1,nl).\displaystyle=\sqrt{\gamma}\cdot\left(\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t-1,\mathrm{nl}},K\mathbf{x}_{t-1,\mathrm{nl}})+B\Delta\mathbf{x}_{t-1,\mathrm{nl}}+(A+BK)\mathrm{D}_{K}\mathbf{x}_{t-1,\mathrm{nl}}\right).

Plugging in our earlier calculation for DKfnl(𝐱t,nl,K𝐱t,nl)\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}), we get that the following recursion holds for ψt:=DK𝐱t,nl\psi_{t}:=\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}},

ψt:=DK𝐱t,nl\displaystyle\psi_{t}:=\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}} =γ[𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t1,nl,K𝐱t1,nl)[0Δ]+BΔ]𝐱t1,nl:=mt1\displaystyle=\underbrace{\sqrt{\gamma}\cdot\left[\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t-1,\mathrm{nl}},K\mathbf{x}_{t-1,\mathrm{nl}})}\begin{bmatrix}0\\ \Delta\end{bmatrix}+B\Delta\right]\mathbf{x}_{t-1,\mathrm{nl}}}_{:=m_{t-1}} (B.13)
+γ[𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t1,nl,K𝐱t1,nl)[IK]+(A+BK)]:=Nt1ψt1.\displaystyle+\underbrace{\sqrt{\gamma}\cdot\left[\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t-1,\mathrm{nl}},K\mathbf{x}_{t-1,\mathrm{nl}})}\begin{bmatrix}I\\ K\end{bmatrix}+(A+BK)\right]}_{:=N_{t-1}}\psi_{t-1}. (B.14)

Using the shorthand introduced above, we can re-express the recursion as,

ψt=mt1+Nt1ψt1.\displaystyle\psi_{t}=m_{t-1}+N_{t-1}\psi_{t-1}.

Unrolling this recursion, with the base case that ψ0=DK𝐱0,nl=0\psi_{0}=\mathrm{D}_{K}\mathbf{x}_{0,\mathrm{nl}}=0, we get that

ψt=j=0t1(i=j+1t1Ni)mj.\displaystyle\psi_{t}=\sum_{j=0}^{t-1}\left(\prod_{i=j+1}^{t-1}N_{i}\right)m_{j}.

Therefore,

ψtj=0t1i=j+1t1Niopmj.\displaystyle\|\psi_{t}\|~{}\leq~{}\sum_{j=0}^{t-1}\left\|\prod_{i=j+1}^{t-1}N_{i}\right\|_{\mathrm{op}}\|m_{j}\|. (B.15)

Next, we prove that each matrix N_{i} is stable, so that the product of the N_{i} is small. By Lemma 3.1 and our earlier inequalities, Eq. B.10 and Eq. B.11, we have that,

𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t1,nl,K𝐱t1,nl)[IK]op\displaystyle\left\|\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t-1,\mathrm{nl}},K\mathbf{x}_{t-1,\mathrm{nl}})}\begin{bmatrix}I\\ K\end{bmatrix}\right\|_{\mathrm{op}} 𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t1,nl,K𝐱t1,nl)op[IK]op\displaystyle~{}\leq~{}\left\|\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t-1,\mathrm{nl}},K\mathbf{x}_{t-1,\mathrm{nl}})}\right\|_{\mathrm{op}}\left\|\begin{bmatrix}I\\ K\end{bmatrix}\right\|_{\mathrm{op}}
βnl(1+Kop)2𝐱t1,nl\displaystyle~{}\leq~{}\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}})^{2}\|\mathbf{x}_{t-1,\mathrm{nl}}\|
16PK,γop2,\displaystyle~{}\leq~{}\frac{1}{6\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}, (B.16)

where we have used our initial condition on V0V_{0}. Therefore, Ni=γ(A+BK)+ΔNiN_{i}=\sqrt{\gamma}(A+BK)+\Delta_{N_{i}} where ΔNiop1/(6PK,γop2)\|\Delta_{N_{i}}\|_{\mathrm{op}}~{}\leq~{}1/(6\|P_{K,\gamma}\|_{\mathrm{op}}^{2}) for all ii. By definition of the operator norm, Lemma A.6, and the fact that PK,γIP_{K,\gamma}\succeq I:

i=j+1t1Niop2\displaystyle\left\|\prod_{i=j+1}^{t-1}N_{i}\right\|_{\mathrm{op}}^{2} =(i=j+1t1Ni)(i=j+1t1Ni)op\displaystyle=\left\|\left(\prod_{i=j+1}^{t-1}N_{i}\right)^{\top}\left(\prod_{i=j+1}^{t-1}N_{i}\right)\right\|_{\mathrm{op}}
(i=j+1t1Ni)PK,γ(i=j+1t1Ni)op\displaystyle~{}\leq~{}\left\|\left(\prod_{i=j+1}^{t-1}N_{i}\right)^{\top}P_{K,\gamma}\left(\prod_{i=j+1}^{t-1}N_{i}\right)\right\|_{\mathrm{op}}
PK,γop(112PK,γop)tj1.\displaystyle~{}\leq~{}\left\|P_{K,\gamma}\right\|_{\mathrm{op}}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t-j-1}.

Then, using our bound on 𝐱,𝐮fnl\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}} from Eq. B.16, and Lemma B.2, we bound mjm_{j} (defined in Eq. B.13) as,

mj\displaystyle\|m_{j}\| γ[𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱j,nl,K𝐱j,nl)[0Δ]+BΔ]op𝐱j,nl\displaystyle~{}\leq~{}\left\|\sqrt{\gamma}\left[\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{j,\mathrm{nl}},K\mathbf{x}_{j,\mathrm{nl}})}\begin{bmatrix}0\\ \Delta\end{bmatrix}+B\Delta\right]\right\|_{\mathrm{op}}\|\mathbf{x}_{j,\mathrm{nl}}\|
(16PK,γop2+Bop)V01/2(112PK,γop)j/2.\displaystyle~{}\leq~{}\left(\frac{1}{6\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}+\|B\|_{\mathrm{op}}\right)V_{0}^{1/2}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{j/2}.

Returning to our earlier bound on ψt\|\psi_{t}\| in Eq. B.15, we conclude that,

ψt\displaystyle\|\psi_{t}\| j=0t1PK,γop1/2(112PK,γop)(tj1)/2(16PK,γop2+Bop)V01/2(112PK,γop)j/2\displaystyle~{}\leq~{}\sum_{j=0}^{t-1}\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{(t-j-1)/2}\left(\frac{1}{6\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}+\|B\|_{\mathrm{op}}\right)V_{0}^{1/2}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{j/2}
tPK,γop1/2(16PK,γop2+Bop)(112PK,γop)t12V01/2,\displaystyle~{}\leq~{}t\cdot\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}\left(\frac{1}{6\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}+\|B\|_{\mathrm{op}}\right)\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{\frac{t-1}{2}}V_{0}^{1/2},

concluding the proof. ∎

Using this, we can now return to bounding DKfnl(𝐱t,nl,K𝐱t,nl)\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}}).

Lemma B.8 (Bounding DKfnl(𝐱t,nl,K𝐱t,nl)\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})).

For all t0t\geq 0, the following bound holds:

DKfnl(𝐱t,nl,K𝐱t,nl)\displaystyle\|\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| 8βnlPK,γop2(1+Bop)t(112PK,γop)t1/2𝐱02.\displaystyle~{}\leq~{}8\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{2}(1+\|B\|_{\mathrm{op}})\cdot t\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t-1/2}\|\mathbf{x}_{0}\|^{2}.
Proof.

From Eq. B.12, DKfnl(𝐱t,nl,K𝐱t,nl)\|\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| is less than,

𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t,nl,K𝐱t,nl)op[0Δ]op𝐱t,nl\displaystyle\left\|\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})}\right\|_{\mathrm{op}}\left\|\begin{bmatrix}0\\ \Delta\end{bmatrix}\right\|_{\mathrm{op}}\|\mathbf{x}_{t,\mathrm{nl}}\|
+𝐱,𝐮fnl(𝐱,𝐮)|(𝐱,𝐮)=(𝐱t,nl,K𝐱t,nl)op[IK]opDK𝐱t,nl.\displaystyle\qquad+\left\|\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\Big{|}_{(\mathbf{x},\mathbf{u})=(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})}\right\|_{\mathrm{op}}\left\|\begin{bmatrix}I\\ K\end{bmatrix}\right\|_{\mathrm{op}}\|\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}\|.

Again using the assumption on the nonlinear dynamics, we can bound the gradient terms as in Eq. B.16 and get that DKfnl(𝐱t,nl,K𝐱t,nl)\|\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| is no larger than,

βnl(1+Kop)𝐱t,nl2+βnl(1+Kop)2𝐱t,nlDK𝐱t,nl\displaystyle\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}})\|\mathbf{x}_{t,\mathrm{nl}}\|^{2}+\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}})^{2}\|\mathbf{x}_{t,\mathrm{nl}}\|\|\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}\|
\displaystyle~{}\leq~{} 2βnl(1+Kop)2𝐱t,nlmax{𝐱t,nl,DK𝐱t,nl}.\displaystyle 2\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}})^{2}\|\mathbf{x}_{t,\mathrm{nl}}\|\max\{\|\mathbf{x}_{t,\mathrm{nl}}\|,\;\|\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}\|\}.

Since the upper bound on \|\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}\| in Lemma B.7 always dominates the bound on \|\mathbf{x}_{t,\mathrm{nl}}\| in Eq. B.10, we can bound \max\{\|\mathbf{x}_{t,\mathrm{nl}}\|,\;\|\mathrm{D}_{K}\mathbf{x}_{t,\mathrm{nl}}\|\} by the bound from Lemma B.7. Consequently,

DKfnl(𝐱t,nl,K𝐱t,nl)\displaystyle\|\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| 2βnl(1+Kop)2PK,γop1/2V0(16PK,γop2+Bop)t(112PK,γop)t1/2\displaystyle~{}\leq~{}2\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}})^{2}\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}V_{0}\left(\frac{1}{6\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}+\|B\|_{\mathrm{op}}\right)\cdot t\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t-1/2}
8βnlPK,γop2(1+Bop)t(112PK,γop)t1/2𝐱02.\displaystyle~{}\leq~{}8\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{2}(1+\|B\|_{\mathrm{op}})\cdot t\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t-1/2}\|\mathbf{x}_{0}\|^{2}.

Finally, we bound DKPK,γ\mathrm{D}_{K}P_{K,\gamma}.

Lemma B.9 (Bounding DKPK,γ\mathrm{D}_{K}P_{K,\gamma}).

The following bound holds:

DKPK,γop2PK,γop2(Bop+Kop).\displaystyle\|\mathrm{D}_{K}P_{K,\gamma}\|_{\mathrm{op}}~{}\leq~{}2\|P_{K,\gamma}\|_{\mathrm{op}}^{2}(\|B\|_{\mathrm{op}}+\|K\|_{\mathrm{op}}).
Proof.

By definition of the discrete time Lyapunov equation,

PK,γ=Q+KRK+γ(A+BK)PK,γ(A+BK).\displaystyle P_{K,\gamma}=Q+K^{\top}RK+\gamma(A+BK)^{\top}P_{K,\gamma}(A+BK).

Therefore, the directional derivative DKPK,γ\mathrm{D}_{K}P_{K,\gamma} satisfies another Lyapunov equation,

DKPK,γ\displaystyle\mathrm{D}_{K}P_{K,\gamma} =ΔRK+KRΔ+γ(BΔ)PK,γ(A+BK)+γ(A+BK)PK,γBΔ:=EK\displaystyle=\underbrace{\Delta^{\top}RK+K^{\top}R\Delta+\gamma\cdot(B\Delta)^{\top}P_{K,\gamma}(A+BK)+\gamma\cdot(A+BK)^{\top}P_{K,\gamma}B\Delta}_{:=E_{K}}
+γ(A+BK)(DKPK,γ)(A+BK),\displaystyle+\gamma(A+BK)^{\top}(\mathrm{D}_{K}P_{K,\gamma})(A+BK),

implying that DKPK,γ=𝖽𝗅𝗒𝖺𝗉(γ(A+BK),EK)\mathrm{D}_{K}P_{K,\gamma}=\mathsf{dlyap}(\sqrt{\gamma}(A+BK),E_{K}). By properties of the Lyapunov equation,

\displaystyle\mathrm{D}_{K}P_{K,\gamma}\preceq\|E_{K}\|_{\mathrm{op}}\,\mathsf{dlyap}(\sqrt{\gamma}(A+BK),I)
EKop𝖽𝗅𝗒𝖺𝗉(γ(A+BK),KRK+Q)=EKopPK,γ.\displaystyle\preceq\|E_{K}\|_{\mathrm{op}}\mathsf{dlyap}(\sqrt{\gamma}(A+BK),K^{\top}RK+Q)=\|E_{K}\|_{\mathrm{op}}P_{K,\gamma}.

Therefore, to bound \|\mathrm{D}_{K}P_{K,\gamma}\|_{\mathrm{op}} it suffices to bound \|E_{K}\|_{\mathrm{op}}. Using the fact that \|P_{K,\gamma}^{1/2}\sqrt{\gamma}(A+BK)\|_{\mathrm{op}}~{}\leq~{}\|P_{K,\gamma}^{1/2}\|_{\mathrm{op}} and that \|\Delta\|_{\mathrm{op}}~{}\leq~{}1, a short calculation reveals that,

EKop2PK,γop(Bop+Kop),\displaystyle\|E_{K}\|_{\mathrm{op}}~{}\leq~{}2\|P_{K,\gamma}\|_{\mathrm{op}}(\|B\|_{\mathrm{op}}+\|K\|_{\mathrm{op}}),

which together with our previous bound on DKPK,γ\mathrm{D}_{K}P_{K,\gamma} implies that,

DKPK,γop2PK,γop2(Bop+Kop).\displaystyle\|\mathrm{D}_{K}P_{K,\gamma}\|_{\mathrm{op}}~{}\leq~{}2\|P_{K,\gamma}\|_{\mathrm{op}}^{2}(\|B\|_{\mathrm{op}}+\|K\|_{\mathrm{op}}).

With Lemmas B.7, B.8 and B.9 in place, we now return to bounding terms T1,T2,T3T_{1},T_{2},T_{3}.

Bounding T1T_{1}

Recall T1:=γ(DKfnl(𝐱t,nl,K𝐱t,nl))PK,γ𝐱t+1,nlT_{1}:=\gamma\cdot\left(\mathrm{D}_{K}\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right)^{\top}P_{K,\gamma}\mathbf{x}_{t+1,\mathrm{nl}}. We then have

T1(DKfnl(𝐱t,nl,K𝐱t,nl))PK,γ𝐱t+1,nlPK,γopDKfnl(𝐱t,nl,K𝐱t,nl)𝐱t+1,nl.\displaystyle\|T_{1}\|~{}\leq~{}\|\left(\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\right)^{\top}P_{K,\gamma}\mathbf{x}_{t+1,\mathrm{nl}}\|~{}\leq~{}\|P_{K,\gamma}\|_{\mathrm{op}}\|\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\|\|\mathbf{x}_{t+1,\mathrm{nl}}\|.

Using the bound on 𝐱t+1,nl\|\mathbf{x}_{t+1,\mathrm{nl}}\| stated in Eq. B.10 and on DKfnl(𝐱t,nl,K𝐱t,nl)\|\mathrm{D}_{K}f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| from Lemma B.8, the above simplifies to

8βnlPK,γop7/2(1+Bop)t(112PK,γop)1.5t𝐱03.\displaystyle 8\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{7/2}(1+\|B\|_{\mathrm{op}})\cdot t\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{1.5t}\|\mathbf{x}_{0}\|^{3}.
Bounding T2T_{2}

Recall T2=fnl(𝐱t,nl,K𝐱t,nl)DKPK,γ𝐱t+1,nlT_{2}=f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\mathrm{D}_{K}P_{K,\gamma}\mathbf{x}_{t+1,\mathrm{nl}}, so that

T2fnl(𝐱t,nl,K𝐱t,nl)DKPK,γop𝐱t+1,nl.\displaystyle\|T_{2}\|~{}\leq~{}\|f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\|\|\mathrm{D}_{K}P_{K,\gamma}\|_{\mathrm{op}}\|\mathbf{x}_{t+1,\mathrm{nl}}\|.

Using the bound on fnl(𝐱,𝐮)\|f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\| from Lemma 3.1,

fnl(𝐱t,nl,K𝐱t,nl)\displaystyle\|f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\| βnl(1+Kop2)𝐱t,nl2\displaystyle~{}\leq~{}\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}}^{2})\|\mathbf{x}_{t,\mathrm{nl}}\|^{2}
βnl(1+Kop2)PK,γop(112PK,γop)t𝐱02\displaystyle~{}\leq~{}\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}}^{2})\|P_{K,\gamma}\|_{\mathrm{op}}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t}\|\mathbf{x}_{0}\|^{2}
2βnlPK,γop2(112PK,γop)t𝐱02.\displaystyle~{}\leq~{}2\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{2}\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t}\|\mathbf{x}_{0}\|^{2}. (B.17)

Therefore, using Eq. B.10 again and Lemma B.9,

T2\displaystyle\|T_{2}\| 4βnlPK,γop9/2(Bop+Kop)(112PK,γop)32t+12𝐱03\displaystyle~{}\leq~{}4\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{9/2}(\|B\|_{\mathrm{op}}+\|K\|_{\mathrm{op}})\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{\frac{3}{2}t+\frac{1}{2}}\|\mathbf{x}_{0}\|^{3}
4βnlPK,γop5(Bop+1)(112PK,γop)32t+12𝐱03.\displaystyle~{}\leq~{}4\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{5}(\|B\|_{\mathrm{op}}+1)\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{\frac{3}{2}t+\frac{1}{2}}\|\mathbf{x}_{0}\|^{3}.
Bounding T3T_{3}.

Recall T_{3}:=\gamma\cdot f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}(\mathrm{D}_{K}\mathbf{x}_{t+1,\mathrm{nl}}). From Eq. B.17 and Lemma B.7, we have that

\displaystyle\|T_{3}\|~{}\leq~{}\|f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})\|\,\|P_{K,\gamma}\|_{\mathrm{op}}\,\|\mathrm{D}_{K}\mathbf{x}_{t+1,\mathrm{nl}}\|
2βnlPK,γop4(1+Bop)(t+1)(112PK,γop)1.5t𝐱03.\displaystyle~{}\leq~{}2\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{4}\left(1+\|B\|_{\mathrm{op}}\right)\cdot(t+1)\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{1.5t}\|\mathbf{x}_{0}\|^{3}.
Wrapping up

Therefore,

T1+T2+T312βnl(1+Bop)PK,γop5(t+1)(112PK,γop)32t𝐱03.\displaystyle\|T_{1}\|+\|T_{2}\|+\|T_{3}\|~{}\leq~{}12\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})\|P_{K,\gamma}\|_{\mathrm{op}}^{5}(t+1)\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{\frac{3}{2}t}\|\mathbf{x}_{0}\|^{3}.

And hence,

t=0DK[fnl(𝐱t,nl,K𝐱t,nl)PK,γ𝐱t+1,nl]\displaystyle\sum_{t=0}^{\infty}\mathrm{D}_{K}\left[f_{\mathrm{nl}}(\mathbf{x}_{t,\mathrm{nl}},K\mathbf{x}_{t,\mathrm{nl}})^{\top}P_{K,\gamma}\mathbf{x}_{t+1,\mathrm{nl}}\right] 12βnl(1+Bop)PK,γop5𝐱03t=0(t+1)(112PK,γop)t\displaystyle~{}\leq~{}12\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})\|P_{K,\gamma}\|_{\mathrm{op}}^{5}\|\mathbf{x}_{0}\|^{3}\sum_{t=0}^{\infty}(t+1)\left(1-\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}\right)^{t}
=48βnl(1+Bop)PK,γop7𝐱03.\displaystyle=48\beta_{\mathrm{nl}}(1+\|B\|_{\mathrm{op}})\|P_{K,\gamma}\|_{\mathrm{op}}^{7}\|\mathbf{x}_{0}\|^{3}.

B.3 Establishing Lyapunov functions: Proof of Lemma B.1

Proof.

To be concise, we use the shorthand \mathbf{z}=(\mathbf{x},K\mathbf{x}). We start by expanding out,

γGnl(𝐳)PKGnl(𝐳)=γ(Glin(𝐳)PKGlin(𝐳)+fnl(𝐳)PKfnl(𝐳)+2fnl(𝐳)PKGlin(𝐳)).\displaystyle\gamma\cdot G_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{nl}}(\mathbf{z})=\gamma\left(G_{\mathrm{lin}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{lin}}(\mathbf{z})+f_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K}f_{\mathrm{nl}}(\mathbf{z})+2f_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{lin}}(\mathbf{z})\right).

By the AM-GM inequality for vectors, the following holds for any τ>0\tau>0,

2fnl(𝐳)PKGlin(𝐳)τGlin(𝐳)PKGlin(𝐳)+1τfnl(𝐳)PKfnl(𝐳).\displaystyle 2f_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{lin}}(\mathbf{z})~{}\leq~{}\tau\cdot G_{\mathrm{lin}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{lin}}(\mathbf{z})+\frac{1}{\tau}\cdot f_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K}f_{\mathrm{nl}}(\mathbf{z}).

Combining these two relationships, we get that,

γGnl(𝐳)PKGnl(𝐳)γ(1+τ)Glin(𝐳)PKGlin(𝐳)+γ(1+1τ)fnl(𝐳)PKfnl(𝐳).\displaystyle\gamma\cdot G_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{nl}}(\mathbf{z})~{}\leq~{}\gamma\cdot(1+\tau)\cdot G_{\mathrm{lin}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{lin}}(\mathbf{z})+\gamma\cdot\left(1+\frac{1}{\tau}\right)\cdot f_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K}f_{\mathrm{nl}}(\mathbf{z}). (B.18)

Next, by properties of the Lyapunov function (Lemma A.5), we have that

γGlin(𝐳)PKGlin(𝐳)=γ𝐱(A+BK)PK,γ(A+BK)𝐱𝐱PK,γ𝐱(11PK,γop).\displaystyle\gamma G_{\mathrm{lin}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{lin}}(\mathbf{z})=\gamma\cdot\mathbf{x}^{\top}(A+BK)^{\top}P_{K,\gamma}(A+BK)\mathbf{x}~{}\leq~{}\mathbf{x}^{\top}P_{K,\gamma}\mathbf{x}\left(1-\frac{1}{\|P_{K,\gamma}\|_{\mathrm{op}}}\right).

Letting V_{\mathbf{x}}:=\mathbf{x}^{\top}P_{K,\gamma}\mathbf{x}, we can plug the previous expression into Eq. B.18 and optimize over \tau to get that,

γGnl(𝐳)PK,γGnl(𝐳)\displaystyle\gamma\cdot G_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K,\gamma}G_{\mathrm{nl}}(\mathbf{z}) V𝐱(11PK,γop)+fnl(𝐳)PK,γfnl(𝐳)+2V𝐱fnl(𝐳)PK,γfnl(𝐳)\displaystyle~{}\leq~{}V_{\mathbf{x}}\left(1-\frac{1}{\|P_{K,\gamma}\|_{\mathrm{op}}}\right)+f_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K,\gamma}f_{\mathrm{nl}}(\mathbf{z})+2\sqrt{V_{\mathbf{x}}f_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K,\gamma}f_{\mathrm{nl}}(\mathbf{z})}
V𝐱(11PK,γop)+PK,γopfnl(𝐳)2+2V𝐱PK,γopfnl(𝐳),\displaystyle~{}\leq~{}V_{\mathbf{x}}\left(1-\frac{1}{\|P_{K,\gamma}\|_{\mathrm{op}}}\right)+\|P_{K,\gamma}\|_{\mathrm{op}}\|f_{\mathrm{nl}}(\mathbf{z})\|^{2}+2\sqrt{V_{\mathbf{x}}\|P_{K,\gamma}\|_{\mathrm{op}}}\|f_{\mathrm{nl}}(\mathbf{z})\|,

where we have dropped a factor of \gamma from the last two terms since \gamma~{}\leq~{}1. Next, the proof follows by noting that the following inequality is satisfied,

PK,γopfnl(𝐳)2+2V𝐱PK,γopfnl(𝐳)V𝐱2PK,γop,\displaystyle\|P_{K,\gamma}\|_{\mathrm{op}}\|f_{\mathrm{nl}}(\mathbf{z})\|^{2}+2\sqrt{V_{\mathbf{x}}\|P_{K,\gamma}\|_{\mathrm{op}}}\|f_{\mathrm{nl}}(\mathbf{z})\|~{}\leq~{}\frac{V_{\mathbf{x}}}{2\|P_{K,\gamma}\|_{\mathrm{op}}}, (B.19)

whenever,

fnl(𝐳)\displaystyle\|f_{\mathrm{nl}}(\mathbf{z})\| V𝐱2PK,γop2+V𝐱PK,γopV𝐱PK,γop\displaystyle~{}\leq~{}\sqrt{\frac{V_{\mathbf{x}}}{2\|P_{K,\gamma}\|_{\mathrm{op}}^{2}}+\frac{V_{\mathbf{x}}}{\|P_{K,\gamma}\|_{\mathrm{op}}}}-\sqrt{\frac{V_{\mathbf{x}}}{\|P_{K,\gamma}\|_{\mathrm{op}}}}
=V𝐱PK,γop(12PK,γop+11).\displaystyle=\sqrt{\frac{V_{\mathbf{x}}}{\|P_{K,\gamma}\|_{\mathrm{op}}}}\left(\sqrt{\frac{1}{2\|P_{K,\gamma}\|_{\mathrm{op}}}+1}-1\right). (B.20)

Therefore, assuming the inequality in Eq. B.20, we get our desired result showing that

Gnl(𝐳)PKGnl(𝐳)𝐱PK𝐱(112PKop).\displaystyle G_{\mathrm{nl}}(\mathbf{z})^{\top}P_{K}G_{\mathrm{nl}}(\mathbf{z})~{}\leq~{}\mathbf{x}^{\top}P_{K}\mathbf{x}\cdot\left(1-\frac{1}{2\|P_{K}\|_{\mathrm{op}}}\right).

We conclude the proof by showing that Eq. B.20 is satisfied for all 𝐱\mathbf{x} small enough. In particular, using our bounds on fnlf_{\mathrm{nl}} from Lemma 3.1, if 𝐱+K𝐱rnl\|\mathbf{x}\|+\|K\mathbf{x}\|~{}\leq~{}r_{\mathrm{nl}},

fnl(𝐳)βnl(𝐱2+K𝐱2)βnl(1+Kop2)𝐱2.\displaystyle\|f_{\mathrm{nl}}(\mathbf{z})\|~{}\leq~{}\beta_{\mathrm{nl}}(\|\mathbf{x}\|^{2}+\|K\mathbf{x}\|^{2})~{}\leq~{}\beta_{\mathrm{nl}}(1+\|K\|_{\mathrm{op}}^{2})\|\mathbf{x}\|^{2}.

Since V𝐱𝐱2V_{\mathbf{x}}~{}\geq~{}\|\mathbf{x}\|^{2}, in order for Eq. B.20 to hold, it suffices for 𝐱\|\mathbf{x}\| to satisfy

𝐱min{rnl1+Kop,1βnlPK,γop1/2(1+Kop2)}.\displaystyle\|\mathbf{x}\|~{}\leq~{}\min\left\{\frac{r_{\mathrm{nl}}}{1+\|K\|_{\mathrm{op}}},\;\frac{1}{\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{1/2}(1+\|K\|_{\mathrm{op}}^{2})}\right\}.

Using the fact that \|K\|_{\mathrm{op}}^{2}~{}\leq~{}\|P_{K,\gamma}\|_{\mathrm{op}}, we can simplify the upper bound on \|\mathbf{x}\| to,

𝐱rnl2βnlPK,γop3/2.\displaystyle\|\mathbf{x}\|~{}\leq~{}\frac{r_{\mathrm{nl}}}{2\beta_{\mathrm{nl}}\|P_{K,\gamma}\|_{\mathrm{op}}^{3/2}}.

Note that this condition is always implied by the condition on 𝐱PK,γ𝐱\mathbf{x}^{\top}P_{K,\gamma}\mathbf{x} in the statement of the proposition. ∎

B.4 Bounding the nonlinearity: Proof of Lemma 3.1

Since the origin is an equilibrium point, Gnl(0,0)=0G_{\mathrm{nl}}(0,0)=0, we can rewrite GnlG_{\mathrm{nl}} as,

Gnl(𝐱,𝐮)\displaystyle G_{\mathrm{nl}}(\mathbf{x},\mathbf{u}) =Gnl(0,0)+Gnl(𝐱,𝐮)|(𝐱,𝐮)=(0,0)[𝐱𝐮]+fnl(𝐱,𝐮),\displaystyle=G_{\mathrm{nl}}(0,0)+\nabla\mkern-2.5muG_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\big{|}_{(\mathbf{x},\mathbf{u})=(0,0)}\begin{bmatrix}\mathbf{x}\\ \mathbf{u}\end{bmatrix}+f_{\mathrm{nl}}(\mathbf{x},\mathbf{u}),
=Ajac𝐱+Bjac𝐮+fnl(𝐱,𝐮),\displaystyle=A_{\mathrm{jac}}\mathbf{x}+B_{\mathrm{jac}}\mathbf{u}+f_{\mathrm{nl}}(\mathbf{x},\mathbf{u}),

for some function fnlf_{\mathrm{nl}}. Taking gradients,

\displaystyle\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})=\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}G_{\mathrm{nl}}(\mathbf{x},\mathbf{u})-\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}G_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\big{|}_{(\mathbf{x},\mathbf{u})=(0,0)}.

Hence, for all 𝐱,𝐮\mathbf{x},\mathbf{u} such that the smoothness assumption holds, we get that

𝐱,𝐮fnl(𝐱,𝐮)opβnl(𝐱+𝐮).\displaystyle\|\nabla\mkern-2.5mu_{\mathbf{x},\mathbf{u}}f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\|_{\mathrm{op}}~{}\leq~{}\beta_{\mathrm{nl}}(\|\mathbf{x}\|+\|\mathbf{u}\|).

Next, by Taylor’s theorem,

fnl(𝐱,𝐮)=fnl(0,0)+01fnl(t𝐱,t𝐮)[𝐱𝐮]𝑑t,\displaystyle f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})=f_{\mathrm{nl}}(0,0)+\int_{0}^{1}\nabla\mkern-2.5muf_{\mathrm{nl}}(t\cdot\mathbf{x},t\cdot\mathbf{u})\begin{bmatrix}\mathbf{x}\\ \mathbf{u}\end{bmatrix}dt,

and since fnl(0,0)=0f_{\mathrm{nl}}(0,0)=0, we can bound,

fnl(𝐱,𝐮)\displaystyle\|f_{\mathrm{nl}}(\mathbf{x},\mathbf{u})\| 01fnl(t𝐱,t𝐮)op[𝐱𝐮]𝑑t\displaystyle~{}\leq~{}\int_{0}^{1}\|\nabla\mkern-2.5muf_{\mathrm{nl}}(t\cdot\mathbf{x},t\cdot\mathbf{u})\|_{\mathrm{op}}\|\begin{bmatrix}\mathbf{x}\\ \mathbf{u}\end{bmatrix}\|dt
\displaystyle~{}\leq~{}\int_{0}^{1}\beta_{\mathrm{nl}}\,t\cdot(\|\mathbf{x}\|+\|\mathbf{u}\|)^{2}\,dt
\displaystyle~{}\leq~{}2\beta_{\mathrm{nl}}(\|\mathbf{x}\|^{2}+\|\mathbf{u}\|^{2})\int_{0}^{1}t\,dt
\displaystyle=\beta_{\mathrm{nl}}(\|\mathbf{x}\|^{2}+\|\mathbf{u}\|^{2}).

Appendix C Implementing ε-𝙴𝚟𝚊𝚕\varepsilon\text{-}\mathtt{Eval}, ε-𝙶𝚛𝚊𝚍\varepsilon\text{-}\mathtt{Grad}, and Search Algorithms

We conclude by discussing the implementations and relevant sample complexities of the noisy gradient and function evaluation methods, ε-𝙶𝚛𝚊𝚍\varepsilon\text{-}\mathtt{Grad} and ε-𝙴𝚟𝚊𝚕\varepsilon\text{-}\mathtt{Eval}. We then use these to establish the runtime and correctness of the noisy binary and random search algorithms.

We remark that throughout our analysis, we have assumed that \varepsilon\text{-}\mathtt{Grad} and \varepsilon\text{-}\mathtt{Eval} succeed with probability 1; this is purely to simplify the presentation. As established in Fazel et al. [2018], the relevant estimators in this section all return \varepsilon-approximate solutions with probability 1-\delta, where \delta factors into the sample complexity polynomially. Therefore, it is easy to union bound over all oracle calls and obtain a high-probability guarantee; we omit these union bounds.

C.1 Implementing ε-𝙴𝚟𝚊𝚕\varepsilon\text{-}\mathtt{Eval}

Linear setting.

From the analysis in Fazel et al. [2018], we know that if Jlin(Kγ)<J_{\mathrm{lin}}(K\mid\gamma)<\infty, then

Jlin(Kγ)1Ni=1NJlin(H)(Kγ,𝐱(i)),𝐱(i)r𝒮dx1,\displaystyle J_{\mathrm{lin}}(K\mid\gamma)\approx\frac{1}{N}{\sum}_{i=1}^{N}J_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x}^{(i)}),\quad\mathbf{x}^{(i)}\sim r\cdot\mathcal{S}^{d_{x}-1}, (C.1)

where Jlin(H)J_{\mathrm{lin}}^{(H)} is the length HH, finite horizon cost of KK:

\displaystyle J_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x})=\sum_{t=0}^{H-1}\gamma^{t}\cdot\left(\mathbf{x}_{t}^{\top}Q\mathbf{x}_{t}+\mathbf{u}_{t}^{\top}R\mathbf{u}_{t}\right),\quad\mathbf{u}_{t}=K\mathbf{x}_{t},\quad\mathbf{x}_{t+1}=A\mathbf{x}_{t}+B\mathbf{u}_{t},\quad\mathbf{x}_{0}=\mathbf{x}.

More formally, if Jlin(Kγ)J_{\mathrm{lin}}(K\mid\gamma) is smaller than a constant cc, Lemma 26 from Fazel et al. [2018] states that it suffices to set NN and HH to be polynomials in Aop,Bop,Jlin(Kγ)\|A\|_{\mathrm{op}},\|B\|_{\mathrm{op}},J_{\mathrm{lin}}(K\mid\gamma) and c/εc/\varepsilon in order to have an ε\varepsilon-approximate estimate of Jlin(Kγ)J_{\mathrm{lin}}(K\mid\gamma) with high probability.
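For concreteness, the following is a minimal sketch of the estimator in Eq. C.1; the function names, sample counts, and random number generator are illustrative placeholders rather than the implementation used in our experiments. It averages finite-horizon, discounted LQR rollout costs over initial states drawn uniformly from a sphere of radius r.

```python
# A minimal sketch of the Monte Carlo estimator in Eq. C.1 (illustrative names
# and constants). It averages finite-horizon, discounted LQR costs over
# initial states drawn uniformly from r * S^{d_x - 1}.
import numpy as np

def finite_horizon_cost(A, B, Q, R, K, x0, gamma, H):
    """Length-H discounted cost of u_t = K x_t on the linear system, from x0."""
    x, cost = x0, 0.0
    for t in range(H):
        u = K @ x
        cost += gamma**t * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return cost

def eval_lin_cost(A, B, Q, R, K, r, gamma, H, N, rng):
    """Monte Carlo estimate of J_lin(K | gamma) from N random rollouts."""
    d_x = A.shape[0]
    samples = []
    for _ in range(N):
        x0 = rng.standard_normal(d_x)
        x0 *= r / np.linalg.norm(x0)          # x0 uniform on r * S^{d_x - 1}
        samples.append(finite_horizon_cost(A, B, Q, R, K, x0, gamma, H))
    return float(np.mean(samples))
```

Here rng denotes a generator such as numpy.random.default_rng(); Lemma 26 of Fazel et al. [2018] dictates how large N and H must be for the estimate to be \varepsilon-accurate with high probability.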

On the other hand, if J_{\mathrm{lin}}(K\mid\gamma) is larger than c (recall that the costs should never be larger than some universal constant times M_{\mathrm{lin}} during discount annealing), we can detect that this is the case by setting N and H to be polynomials in \|A\|_{\mathrm{op}},\|B\|_{\mathrm{op}}, and c. This argument follows from the following two lemmas:

Lemma C.1.

Fix a constant cc, and take H=4cH=4c. Then for any KK (possibly even with Jlin(Kγ)=J_{\mathrm{lin}}(K\mid\gamma)=\infty),

𝐱dx𝒮dx1[Jlin(H)(Kγ,𝐱)min{12Jlin(Kγ),c}dx]110.\displaystyle\operatorname{\mathbb{P}}_{\mathbf{x}\sim\sqrt{d_{x}}\cdot\mathcal{S}^{d_{x}-1}}\left[J_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x})~{}\geq~{}\frac{\min\left\{\frac{1}{2}J_{\mathrm{lin}}(K\mid\gamma),c\right\}}{d_{x}}\right]~{}\geq~{}\frac{1}{10}.

Setting c=αMlinc=\alpha M_{\mathrm{lin}} for some universal constant α\alpha, we get that all calls to the ε-𝙴𝚟𝚊𝚕\varepsilon\text{-}\mathtt{Eval} oracle during discount annealing can be implemented with polynomially many samples. Using standard arguments, we can boost this result to hold with high probability by running NN independent trials, where NN is again a polynomial in the relevant problem parameters.
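To illustrate how such a detection test could be carried out, the sketch below draws initial states from \sqrt{d_x}\cdot\mathcal{S}^{d_x-1}, rolls out the horizon H=4c cost, and counts how often the threshold c/d_x from Lemma C.1 is exceeded; the sample count N and the decision rule are illustrative choices rather than the constants used in our analysis.

```python
# A hedged sketch of the detection test suggested by Lemma C.1 (illustrative
# constants). With H = 4c, a single random rollout exceeds
# min{J_lin(K | gamma)/2, c} / d_x with probability at least 1/10, so repeated
# trials let us flag controllers whose (possibly infinite) cost exceeds c.
import numpy as np

def cost_likely_exceeds(A, B, Q, R, K, gamma, c, N, rng):
    d_x = A.shape[0]
    H = int(4 * c)
    hits = 0
    for _ in range(N):
        x = rng.standard_normal(d_x)
        x *= np.sqrt(d_x) / np.linalg.norm(x)   # x ~ sqrt(d_x) * S^{d_x - 1}
        cost = 0.0
        for t in range(H):
            u = K @ x
            cost += gamma**t * (x @ Q @ x + u @ R @ u)
            x = A @ x + B @ u
        hits += cost >= c / d_x
    # If J_lin(K | gamma) is large, each trial fires with probability >= 1/10;
    # the fraction of hits (here an illustrative cutoff) reveals this regime.
    return hits >= N / 20
```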

Nonlinear setting.

Next, we sketch why the same estimator described in Eq. C.1 also works in the nonlinear case if we replace J_{\mathrm{lin}}^{(H)}(K\mid\gamma) with the analogous finite-horizon cost, J_{\mathrm{nl}}^{(H)}(K\mid\gamma,r_{\star}), for the nonlinear objective. By Lemma B.3 and Proposition 3.3, if the nonlinear cost is small, then the costs on the nonlinear and linear systems are pointwise close. Therefore, the concentration analysis for the linear setting from Fazel et al. [2018] can be carried over in order to implement \varepsilon\text{-}\mathtt{Eval} with N and H depending polynomially on the relevant problem parameters.

On the other hand if the nonlinear cost is large, then the cost on the linear system must also be large. Recall that if the linear cost was bounded by a constant, then the costs of both systems would be pointwise close by Proposition 3.3. By Proposition C.5, we know that the HH-step nonlinear cost is lower bounded by the cost on the linear system. Since we can always detect that the cost on the linear system is larger than a constant using polynomially many samples as per Lemma C.1, with high probability we can also detect if the cost on the nonlinear system is large using again only polynomially many samples.
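As a sketch of the nonlinear analogue, assuming query access to the simulator G_{\mathrm{nl}} as a callable, one can roll out the \sqrt{\gamma}-damped dynamics (see Eq. C.5) and rescale the average rollout cost by d_x/r^2, matching the overloaded notation J_{\mathrm{nl}}^{(H)}(K\mid\gamma,r) defined in Appendix C.3; all names and constants below are illustrative.

```python
# A minimal sketch (illustrative names) of the nonlinear analogue of Eq. C.1,
# assuming the simulator G_nl(x, u) is available as a Python callable. It rolls
# out x_{t+1} = sqrt(gamma) * G_nl(x_t, u_t) as in Eq. C.5 and rescales the
# average finite-horizon cost by d_x / r^2, as in Appendix C.3.
import numpy as np

def eval_nl_cost(G_nl, Q, R, K, r, gamma, H, N, rng):
    d_x = Q.shape[0]
    total = 0.0
    for _ in range(N):
        x = rng.standard_normal(d_x)
        x *= r / np.linalg.norm(x)                # x0 ~ r * S^{d_x - 1}
        cost = 0.0
        for _ in range(H):
            u = K @ x
            cost += x @ Q @ x + u @ R @ u
            x = np.sqrt(gamma) * G_nl(x, u)
        total += cost
    return (total / N) * d_x / r**2
```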

C.2 Implementing ε-𝙶𝚛𝚊𝚍\varepsilon\text{-}\mathtt{Grad}

Linear setting.

Just like in the case of ε-𝙴𝚟𝚊𝚕\varepsilon\text{-}\mathtt{Eval}, Fazel et al. [2018] (Lemma 30) prove that

Jlin(Kγ)1Ni=1NJlin(H)(Kγ,𝐱(i)),𝐱(i)r𝒮dx1\displaystyle\nabla\mkern-2.5muJ_{\mathrm{lin}}(K\mid\gamma)\approx\frac{1}{N}{\sum}_{i=1}^{N}\nabla\mkern-2.5muJ_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x}^{(i)}),\quad\mathbf{x}^{(i)}\sim r\cdot\mathcal{S}^{d_{x}-1} (C.2)

where N and H only need to be polynomially large in the relevant problem parameters and 1/\varepsilon in order to get an \varepsilon-accurate approximation of the gradient. Note that we can safely assume that J_{\mathrm{lin}}(K\mid\gamma) is finite (and hence the gradient is well defined) since we can always run the test outlined in Lemma C.1 to get a high probability guarantee of this fact. In order to approximate the gradients \nabla\mkern-2.5muJ_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x}), one can employ standard techniques from Flaxman et al. [2005]. We refer the interested reader to Appendix D in Fazel et al. [2018] for further details.
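To make the gradient oracle concrete, the following is a minimal sketch of a one-point zeroth-order estimator in the spirit of Flaxman et al. [2005], applied to the finite-horizon objective in Eq. C.2; the smoothing radius delta, the sample count N, and the scaling are illustrative choices rather than those prescribed by the analysis in Fazel et al. [2018].

```python
# A hedged sketch of a one-point zeroth-order gradient estimate in the spirit
# of Flaxman et al. [2005] for the finite-horizon objective in Eq. C.2.
# The smoothing radius delta and the averaging scheme are illustrative.
import numpy as np

def grad_estimate(A, B, Q, R, K, r, gamma, H, N, delta, rng):
    d_u, d_x = K.shape
    g = np.zeros_like(K, dtype=float)
    for _ in range(N):
        U = rng.standard_normal((d_u, d_x))
        U /= np.linalg.norm(U)                    # uniform direction on the Frobenius sphere
        x = rng.standard_normal(d_x)
        x *= r / np.linalg.norm(x)                # x0 ~ r * S^{d_x - 1}
        cost = 0.0
        for t in range(H):                        # rollout at the perturbed controller
            u = (K + delta * U) @ x
            cost += gamma**t * (x @ Q @ x + u @ R @ u)
            x = A @ x + B @ u
        g += (d_u * d_x / delta) * cost * U       # one-point gradient estimate
    return g / N
```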

Nonlinear setting.

Lastly, we remark that, as in the case of ε-𝙴𝚟𝚊𝚕\varepsilon\text{-}\mathtt{Eval}{}, Lemma B.6 establishes that the gradients between the linear and nonlinear system are again pointwise close if the cost on the linear system is bounded. In the proof of Theorem 2, we established that during all iterations of discount annealing, the cost on the linear system during executions of policy gradients is bounded by MnlM_{\mathrm{nl}}. Therefore, the analysis from Fazel et al. [2018] can be ported over to show that ε\varepsilon-approximate gradients of the nonlinear system can be computed using only polynomially many samples in the relevant problem parameters.

C.3 Auxiliary results

Lemma C.2.

Fix any constant c1c\geq 1. Then, for

\displaystyle P_{H}=\sum_{t=0}^{H-1}\left((\sqrt{\gamma}(A+BK))^{t}\right)^{\top}(Q+K^{\top}RK)(\sqrt{\gamma}(A+BK))^{t}

the following relationship holds

Jlin(H)(Kγ)min{12Jlin(Kγ),c},H4c,\displaystyle J_{\mathrm{lin}}^{(H)}(K\mid\gamma)\geq\min\left\{\frac{1}{2}J_{\mathrm{lin}}(K\mid\gamma),c\right\},\quad\forall H\geq 4c,

where Jlin(Kγ)J_{\mathrm{lin}}(K\mid\gamma) may be infinite.

Proof.

We have two cases. In the first case, A_{\mathrm{cl}}:=\sqrt{\gamma}(A+BK) has an eigenvalue \lambda\geq 1. Letting \mathbf{v} denote a corresponding eigenvector of unit norm, one can verify that \mathrm{tr}(P_{H})\geq\mathbf{v}^{\top}P_{H}\mathbf{v}\geq\sum_{i=0}^{H-1}\lambda^{2i}\geq H, which is at least c by assumption.

In the second case, AclA_{\mathrm{cl}} is stable, so PK,γP_{K,\gamma} exists and its trace is Jlin(Kγ)J_{\mathrm{lin}}(K\mid\gamma) (see A.2). Then, for QK=Q+KRKQ_{K}=Q+K^{\top}RK,

PH=i=0H1AcliQKAcli,\displaystyle P_{H}=\sum_{i=0}^{H-1}A_{\mathrm{cl}}^{i\top}Q_{K}A_{\mathrm{cl}}^{i},

we show that if tr(PH)c\mathrm{tr}(P_{H})\leq c, then tr(PH)12tr(PK,γ)\mathrm{tr}(P_{H})\geq\frac{1}{2}\mathrm{tr}(P_{K,\gamma}).

To show this, observe that if \mathrm{tr}(P_{H})\leq c, then \mathrm{tr}(P_{H})\leq H/4. Therefore, by the pigeonhole principle (and the fact that \mathrm{tr}(Q_{K})\geq 1), there exists some t\in\{1,\dots,H\} such that \mathrm{tr}(A_{\mathrm{cl}}^{t\top}Q_{K}A_{\mathrm{cl}}^{t})\leq 1/4. Since Q_{K}\succeq I, this means that \mathrm{tr}(A_{\mathrm{cl}}^{t\top}A_{\mathrm{cl}}^{t})=\|A_{\mathrm{cl}}^{t}\|_{\mathrm{F}}^{2}\leq 1/4 as well. Therefore, letting P_{t} denote the t-step value function from Eq. C.3, the identity P_{K,\gamma}=\sum_{n=0}^{\infty}A_{\mathrm{cl}}^{nt\top}P_{t}A_{\mathrm{cl}}^{nt} means that

tr(PK,γ)\displaystyle\mathrm{tr}(P_{K,\gamma}) =tr(n=0AclntPtAclnt)\displaystyle=\mathrm{tr}\left(\sum_{n=0}^{\infty}A_{\mathrm{cl}}^{nt\top}P_{t}A_{\mathrm{cl}}^{nt}\right)
tr(Pt)+Ptn1AclntF2\displaystyle\leq\mathrm{tr}(P_{t})+\|P_{t}\|\sum_{n\geq 1}\|A_{\mathrm{cl}}^{nt}\|_{\mathrm{F}}^{2}
tr(Pt)+Ptn1AcltF2n\displaystyle\leq\mathrm{tr}(P_{t})+\|P_{t}\|\sum_{n\geq 1}\|A_{\mathrm{cl}}^{t}\|_{\mathrm{F}}^{2n}
tr(Pt)+Ptn1(1/2)n\displaystyle\leq\mathrm{tr}(P_{t})+\|P_{t}\|\sum_{n\geq 1}(1/2)^{n}
2tr(Pt)2tr(PH),\displaystyle\leq 2\mathrm{tr}(P_{t})\leq 2\mathrm{tr}(P_{H}),

where in the last line we used tHt\leq H. This completes the proof. ∎

Lemma C.3.

Let 𝐮,𝐯i.i.d𝒮dx1\mathbf{u},\mathbf{v}\overset{\mathrm{i.i.d}}{\sim}\mathcal{S}^{d_{x}-1} then, [(𝐮𝐯)21/dx].1\operatorname{\mathbb{P}}\left[(\mathbf{u}^{\top}\mathbf{v})^{2}~{}\geq~{}1/d_{x}\right]~{}\geq~{}.1.

Proof.

The statement is clearly true for dx=1d_{x}=1, therefore we focus on the case where dx2d_{x}~{}\geq~{}2. Without loss of generality, we can take 𝐮\mathbf{u} to be the first basis vector 𝐞1\mathbf{e}_{1} and let 𝐯=Z/Z\mathbf{v}=Z/\|Z\| where Z𝒩(0,I)Z\sim\mathcal{N}(0,I). From these simplifications, we observe that (𝐯𝐮)2(\mathbf{v}^{\top}\mathbf{u})^{2} is equal in distribution to the following random variable,

Z12i=1dZi2,\frac{Z_{1}^{2}}{\sum_{i=1}^{d}Z_{i}^{2}},

where each Z_{i}^{2} is a chi-squared random variable. Using this equivalence, we have that for arbitrary M>0,

[(𝐮𝐯)21/dx]\displaystyle\operatorname{\mathbb{P}}\left[(\mathbf{u}^{\top}\mathbf{v})^{2}~{}\geq~{}1/d_{x}\right] =[Z12i=2dxZi2dx1]\displaystyle=\operatorname{\mathbb{P}}\left[Z_{1}^{2}~{}\geq~{}\frac{\sum_{i=2}^{d_{x}}Z_{i}^{2}}{d_{x}-1}\right]
=[Z12i=2dxZi2dx1|i=2dxZi2M][i=2dxZi2M]\displaystyle=\operatorname{\mathbb{P}}\left[Z_{1}^{2}~{}\geq~{}\frac{\sum_{i=2}^{d_{x}}Z_{i}^{2}}{d_{x}-1}\bigg{|}\sum_{i=2}^{d_{x}}Z_{i}^{2}~{}\leq~{}M\right]\operatorname{\mathbb{P}}\left[\sum_{i=2}^{d_{x}}Z_{i}^{2}~{}\leq~{}M\right]
+[Z12i=2dxZi2dx1|i=2dZi2>M][i=2dxZi2>M]\displaystyle+\operatorname{\mathbb{P}}\left[Z_{1}^{2}~{}\geq~{}\frac{\sum_{i=2}^{d_{x}}Z_{i}^{2}}{d_{x}-1}\bigg{|}\sum_{i=2}^{d}Z_{i}^{2}>M\right]\operatorname{\mathbb{P}}\left[\sum_{i=2}^{d_{x}}Z_{i}^{2}>M\right]
[Z12i=2dxZi2dx1|i=2dZi2M][i=2dxZi2M]\displaystyle~{}\geq~{}\operatorname{\mathbb{P}}\left[Z_{1}^{2}~{}\geq~{}\frac{\sum_{i=2}^{d_{x}}Z_{i}^{2}}{d_{x}-1}\bigg{|}\sum_{i=2}^{d}Z_{i}^{2}~{}\leq~{}M\right]\operatorname{\mathbb{P}}\left[\sum_{i=2}^{d_{x}}Z_{i}^{2}~{}\leq~{}M\right]
[Z12Mdx1][i=2dxZi2M].\displaystyle~{}\geq~{}\operatorname{\mathbb{P}}\left[Z_{1}^{2}~{}\geq~{}\frac{M}{d_{x}-1}\right]\operatorname{\mathbb{P}}\left[\sum_{i=2}^{d_{x}}Z_{i}^{2}~{}\leq~{}M\right].

Setting M=2(dx1)M=2(d_{x}-1), we get that,

[(𝐮𝐯)21/dx][Z122][i=2dxZi22(dx1)].\displaystyle\operatorname{\mathbb{P}}\left[(\mathbf{u}^{\top}\mathbf{v})^{2}~{}\geq~{}1/d_{x}\right]~{}\geq~{}\operatorname{\mathbb{P}}\left[Z_{1}^{2}~{}\geq~{}2\right]\operatorname{\mathbb{P}}\left[\sum_{i=2}^{d_{x}}Z_{i}^{2}~{}\leq~{}2(d_{x}-1)\right].

From a direct computation,

[Z122].15.\displaystyle\operatorname{\mathbb{P}}\left[Z_{1}^{2}~{}\geq~{}2\right]~{}\geq~{}.15.

To bound the last term, if YY is a chi-squared random variable with kk degrees of freedom, by Lemma 1 in Laurent and Massart [2000],

[Yk+2kx+x]exp(x).\displaystyle\operatorname{\mathbb{P}}[Y~{}\geq~{}k+2\sqrt{kx}+x]~{}\leq~{}\exp(-x).

Setting x=22(dx1)k+2dx+k2x=2\sqrt{2(d_{x}-1)k}+2d_{x}+k-2 we get that k+2kx+x=2(dx1)k+2\sqrt{kx}+x=2(d_{x}-1). Substituting in k=dx1k=d_{x}-1, we conclude that,

[i=2dxZi22(dx1)]1exp(22(dx1)3dx+3),\displaystyle\operatorname{\mathbb{P}}\left[\sum_{i=2}^{d_{x}}Z_{i}^{2}~{}\leq~{}2(d_{x}-1)\right]~{}\geq~{}1-\exp(-2\sqrt{2}(d_{x}-1)-3d_{x}+3),

which is greater than .99 for dx2d_{x}~{}\geq~{}2. ∎

Lemma C.1 (restated).

Fix a constant cc, and take H=4cH=4c. Then for any KK (possibly even with Jlin(Kγ)=J_{\mathrm{lin}}(K\mid\gamma)=\infty),

𝐱dx𝒮dx1[Jlin(H)(Kγ,𝐱)min{12Jlin(Kγ),c}dx]110.\displaystyle\operatorname{\mathbb{P}}_{\mathbf{x}\sim\sqrt{d_{x}}\cdot\mathcal{S}^{d_{x}-1}}\left[J_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x})~{}\geq~{}\frac{\min\left\{\frac{1}{2}J_{\mathrm{lin}}(K\mid\gamma),c\right\}}{d_{x}}\right]~{}\geq~{}\frac{1}{10}.
Proof.

Observe that for the finite-horizon value matrix P_{H}=\sum_{i=0}^{H-1}(\sqrt{\gamma}(A+BK))^{i\top}(Q+K^{\top}RK)(\sqrt{\gamma}(A+BK))^{i}, we have J_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x})=\mathbf{x}^{\top}P_{H}\mathbf{x}. We now observe that, since P_{H}\succeq 0, it has a (possibly non-unique) top eigenvector v_{1} for which

\displaystyle J_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x})=\mathbf{x}^{\top}P_{H}\mathbf{x}\geq\langle v_{1},\tfrac{\mathbf{x}}{\sqrt{d_{x}}}\rangle^{2}\cdot d_{x}\|P_{H}\|\geq\langle v_{1},\tfrac{\mathbf{x}}{\sqrt{d_{x}}}\rangle^{2}\cdot\mathrm{tr}(P_{H})=\underbrace{\langle v_{1},\tfrac{\mathbf{x}}{\sqrt{d_{x}}}\rangle^{2}}_{:=Z}\cdot J_{\mathrm{lin}}^{(H)}(K\mid\gamma).

Since 𝐱/dx𝒮dx1\mathbf{x}/\sqrt{d_{x}}\sim\mathcal{S}^{d_{x}-1}, Lemma C.3 ensures that [Z1/dx]1/10\operatorname{\mathbb{P}}[Z\geq 1/d_{x}]\geq 1/10. Hence,

𝐱dx𝒮dx1[Jlin(H)(Kγ,𝐱)Jlin(H)(Kγ)dx]110.\displaystyle\operatorname{\mathbb{P}}_{\mathbf{x}\sim\sqrt{d_{x}}\cdot\mathcal{S}^{d_{x}-1}}\left[J_{\mathrm{lin}}^{(H)}(K\mid\gamma,\mathbf{x})~{}\geq~{}\frac{J_{\mathrm{lin}}^{(H)}(K\mid\gamma)}{d_{x}}\right]~{}\geq~{}\frac{1}{10}.

The bound now follows from invoking Lemma C.2 to lower bound J_{\mathrm{lin}}^{(H)}(K\mid\gamma)\geq\min\left\{\frac{1}{2}J_{\mathrm{lin}}(K\mid\gamma),c\right\} provided H\geq 4c. ∎

In the following lemmas, we define PHP_{H} to be the following matrix where KK is any state-feedback controller.

\displaystyle P_{H}=\sum_{t=0}^{H-1}\left((\sqrt{\gamma}(A+BK))^{t}\right)^{\top}(Q+K^{\top}RK)(\sqrt{\gamma}(A+BK))^{t} (C.3)

Similarly, we let Jnl(H)(Kγ,𝐱0)J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0}) be the horizon HH cost of the nonlinear dynamical system:

Jnl(H)(K𝐱,γ):=t=0H1𝐱tQ𝐱t+𝐮tR𝐮t\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\mathbf{x},\gamma):=\sum_{t=0}^{H-1}\mathbf{x}_{t}^{\top}Q\mathbf{x}_{t}+\mathbf{u}_{t}^{\top}R\mathbf{u}_{t} (C.4)
s.t 𝐮t=K𝐱t,𝐱t+1=γGnl(𝐱t,𝐮t),𝐱0=𝐱.\displaystyle\text{ s.t }\mathbf{u}_{t}=K\mathbf{x}_{t},\quad\mathbf{x}_{t+1}=\sqrt{\gamma}\cdot G_{\mathrm{nl}}(\mathbf{x}_{t},\mathbf{u}_{t}),\quad\mathbf{x}_{0}=\mathbf{x}. (C.5)

Overloading notation as before, we let J_{\mathrm{nl}}^{(H)}(K\mid\gamma,r):=\mathbb{E}_{\mathbf{x}\sim r\cdot\mathcal{S}^{d_{x}-1}}\left[J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x})\right]\times\frac{d_{x}}{r^{2}}.

Lemma C.4.

Fix a horizon HH, constant α(0,dx)\alpha\in(0,d_{x}), 𝐱0dx\mathbf{x}_{0}\in\mathbb{R}^{d_{x}}, and suppose that

𝐱02α2H2βnl2(1+K2)dxrnl4.\displaystyle\|\mathbf{x}_{0}\|^{2}\cdot\frac{\alpha}{2H^{2}\beta_{\mathrm{nl}}^{2}(1+\|K\|^{2})d_{x}}\leq r_{\mathrm{nl}}^{4}.

Furthermore, define 𝒳α:={𝐱0dx:𝐱0,𝐯max(PH)2α𝐱02/dx}\mathcal{X}_{\alpha}:=\{\mathbf{x}_{0}\in\mathbb{R}^{d_{x}}:\langle\mathbf{x}_{0},\mathbf{v}_{\max}(P_{H})\rangle^{2}\geq\alpha\|\mathbf{x}_{0}\|^{2}/d_{x}\}. Then, if 𝐱0𝒳α\mathbf{x}_{0}\in\mathcal{X}_{\alpha}, it holds that

Jnl(H)(Kγ,𝐱0)min{α𝐱024dx2Jlin(H)(K𝐱0,γ),αdx𝐱02Hβnl(1+K)}.\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq\min\left\{\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{4d_{x}^{2}}\cdot J_{\mathrm{lin}}^{(H)}(K\mid\mathbf{x}_{0},\gamma),\sqrt{\frac{\alpha}{d_{x}}}\frac{\|\mathbf{x}_{0}\|}{2H\beta_{\mathrm{nl}}(1+\|K\|)}\right\}.
Proof.

Fix a constant r_{1}\leq r_{\mathrm{nl}} to be selected. Throughout, we use the shorthand \beta_{1}^{2}=\beta_{\mathrm{nl}}^{2}(1+\|K\|^{2}). We consider two cases.

Case 1:

The initial point \mathbf{x}_{0} is such that \|\mathbf{x}_{t}\|\leq r_{1} for all t\in\{0,1,\dots,H-1\} and J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\leq r_{1}^{2}. Observe that we can write the nonlinear dynamics as

𝐱t+1=γ(A+BK)𝐱t+𝐰t,\displaystyle\mathbf{x}_{t+1}=\sqrt{\gamma}(A+BK)\mathbf{x}_{t}+\mathbf{w}_{t},

where 𝐰t=γfnl(𝐱t,K𝐱t)\mathbf{w}_{t}=\sqrt{\gamma}\cdot f_{\mathrm{nl}}(\mathbf{x}_{t},K\mathbf{x}_{t}). We now write:

𝐱t\displaystyle\mathbf{x}_{t} =𝐱t;0+𝐱t;w,where\displaystyle=\mathbf{x}_{t;0}+\mathbf{x}_{t;w},\text{where }
\displaystyle\qquad\mathbf{x}_{t;0}=(\sqrt{\gamma}(A+BK))^{t}\mathbf{x}_{0}
𝐱t;w=i=0t1γi/2(A+BK)i𝐰ti.\displaystyle\qquad\mathbf{x}_{t;w}=\sum_{i=0}^{t-1}\gamma^{i/2}(A+BK)^{i}\mathbf{w}_{t-i}.

Then, setting QK=Q+KRKQ_{K}=Q+K^{\top}RK,

Jnl(H)(Kγ,𝐱0)\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0}) =t=0H1𝐱tQK𝐱t\displaystyle=\sum_{t=0}^{H-1}\mathbf{x}_{t}^{\top}Q_{K}\mathbf{x}_{t}
=t=0H1(𝐱t;0QK𝐱t;0+2𝐱t;0QK𝐱t;w+𝐱t;wQK𝐱t;w)\displaystyle=\sum_{t=0}^{H-1}\left(\mathbf{x}_{t;0}^{\top}Q_{K}\mathbf{x}_{t;0}+2\mathbf{x}_{t;0}^{\top}Q_{K}\mathbf{x}_{t;w}+\mathbf{x}_{t;w}^{\top}Q_{K}\mathbf{x}_{t;w}\right)
12t=0H1𝐱t;0QK𝐱t;0t=0H1𝐱t;wQK𝐱t;w\displaystyle\geq\frac{1}{2}\sum_{t=0}^{H-1}\mathbf{x}_{t;0}^{\top}Q_{K}\mathbf{x}_{t;0}-\sum_{t=0}^{H-1}\mathbf{x}_{t;w}^{\top}Q_{K}\mathbf{x}_{t;w}
=12𝐱0PH𝐱0t=0H1𝐱t;wQK𝐱t;w,\displaystyle=\frac{1}{2}\mathbf{x}_{0}^{\top}P_{H}\mathbf{x}_{0}-\sum_{t=0}^{H-1}\mathbf{x}_{t;w}^{\top}Q_{K}\mathbf{x}_{t;w},

where (a) the inequality uses the elementary fact that \langle\mathbf{v}_{1},\Sigma\mathbf{v}_{1}\rangle+\langle\mathbf{v}_{2},\Sigma\mathbf{v}_{2}\rangle+2\langle\mathbf{v}_{1},\Sigma\mathbf{v}_{2}\rangle\geq\frac{1}{2}\langle\mathbf{v}_{1},\Sigma\mathbf{v}_{1}\rangle-\langle\mathbf{v}_{2},\Sigma\mathbf{v}_{2}\rangle for any pair of vectors \mathbf{v}_{1},\mathbf{v}_{2} and \Sigma\succeq 0, and (b) the final equality uses the fact that

t=0H1𝐱t;0QK𝐱t;0=Jlin(H)(K𝐱0,γ)=𝐱0PH𝐱0\displaystyle\sum_{t=0}^{H-1}\mathbf{x}_{t;0}^{\top}Q_{K}\mathbf{x}_{t;0}=J_{\mathrm{lin}}^{(H)}(K\mid\mathbf{x}_{0},\gamma)=\mathbf{x}_{0}^{\top}P_{H}\mathbf{x}_{0}

for PHP_{H} defined above in Eq. C.3. Moreover, for any tt,

t=0H1𝐱t;wQK𝐱t;w\displaystyle\sum_{t=0}^{H-1}\mathbf{x}_{t;w}^{\top}Q_{K}\mathbf{x}_{t;w} =t=0H1i=0t1j=0t1𝐰tiγ(i+j)/2((A+BK)i)QK(A+BK)j𝐰tj\displaystyle=\sum_{t=0}^{H-1}\sum_{i=0}^{t-1}\sum_{j=0}^{t-1}\mathbf{w}_{t-i}^{\top}\gamma^{(i+j)/2}((A+BK)^{i})^{\top}Q_{K}(A+BK)^{j}\mathbf{w}_{t-j}
Ht=0H1i=0t1𝐰tiγi((A+BK)i)QK(A+BK)i𝐰ti\displaystyle\leq H\sum_{t=0}^{H-1}\sum_{i=0}^{t-1}\mathbf{w}_{t-i}^{\top}\gamma^{i}((A+BK)^{i})^{\top}Q_{K}(A+BK)^{i}\mathbf{w}_{t-i}
=Ht=0H2𝐰t(i=0Ht1γi((A+BK)i)QK(A+BK)i)𝐰t\displaystyle=H\sum_{t=0}^{H-2}\mathbf{w}_{t}^{\top}\left(\sum_{i=0}^{H-t-1}\gamma^{i}((A+BK)^{i})^{\top}Q_{K}(A+BK)^{i}\right)\mathbf{w}_{t}
=Ht=0H2𝐰tPHt1𝐰tH2PHopmaxt{0,,H2}𝐰t2.\displaystyle=H\sum_{t=0}^{H-2}\mathbf{w}_{t}^{\top}P_{H-t-1}\mathbf{w}_{t}\leq H^{2}\|P_{H}\|_{\mathrm{op}}\max_{t\in\{0,\dots,H-2\}}\|\mathbf{w}_{t}\|^{2}.

Now, because \|\mathbf{x}_{t}\|\leq r_{1}\leq r_{\mathrm{nl}}, Lemma 3.1 lets us bound \|\mathbf{w}_{t}\|^{2}\leq\beta_{\mathrm{nl}}^{2}(1+\|K\|^{2})\|\mathbf{x}_{t}\|^{4}\leq\beta_{1}^{2}r_{1}^{4}, where we adopt the previously defined shorthand \beta_{1}^{2}=\beta_{\mathrm{nl}}^{2}(1+\|K\|^{2}). Therefore,

Jnl(H)(Kγ,𝐱0)12𝐱0PH𝐱0H2β12r14PHop.\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq\frac{1}{2}\mathbf{x}_{0}^{\top}P_{H}\mathbf{x}_{0}-H^{2}\beta_{1}^{2}r_{1}^{4}\|P_{H}\|_{\mathrm{op}}.

Next, if 𝐱0𝒳α\mathbf{x}_{0}\in\mathcal{X}_{\alpha},

Jnl(H)(Kγ,𝐱0)\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0}) α2dx𝐱02PHH2β12r14PHop\displaystyle\geq\frac{\alpha}{2d_{x}}\|\mathbf{x}_{0}\|^{2}\|P_{H}\|-H^{2}\beta_{1}^{2}r_{1}^{4}\|P_{H}\|_{\mathrm{op}}
=PH(α2dx𝐱02H2β12r14).\displaystyle=\|P_{H}\|\left(\frac{\alpha}{2d_{x}}\|\mathbf{x}_{0}\|^{2}-H^{2}\beta_{1}^{2}r_{1}^{4}\right).

In particular, selecting r_{1}^{4}=\frac{\alpha}{4\beta_{1}^{2}d_{x}H^{2}}\|\mathbf{x}_{0}\|^{2} (which ensures r_{1}\leq r_{\mathrm{nl}} by the conditions of the lemma), it holds that,

Jnl(H)(Kγ,𝐱0)PHα4dx𝐱02tr(PH)α4dx2𝐱02=αJlin(H)(K,γ)4dx2𝐱02.\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq\frac{\|P_{H}\|\alpha}{4d_{x}}\|\mathbf{x}_{0}\|^{2}\geq\frac{\mathrm{tr}(P_{H})\alpha}{4d_{x}^{2}}\|\mathbf{x}_{0}\|^{2}=\frac{\alpha J_{\mathrm{lin}}^{(H)}(K,\gamma)}{4d_{x}^{2}}\|\mathbf{x}_{0}\|^{2}.
Case 2:

The initial point \mathbf{x}_{0} is such that either \|\mathbf{x}_{t}\|\geq r_{1} for some t\in\{0,1,\dots,H-1\} or J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq r_{1}^{2}. In either case, J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq r_{1}^{2}. For our choice of r_{1}, this gives

Jnl(H)(Kγ,𝐱0)αx024dxH2β12=αx024dxH2βnl2(1+K2)αdxx02Hβnl(1+K).\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq\sqrt{\frac{\alpha\|x_{0}\|^{2}}{4d_{x}H^{2}\beta_{1}^{2}}}=\sqrt{\frac{\alpha\|x_{0}\|^{2}}{4d_{x}H^{2}\beta_{\mathrm{nl}}^{2}(1+\|K\|^{2})}}\geq\sqrt{\frac{\alpha}{d_{x}}}\frac{\|x_{0}\|}{2H\beta_{\mathrm{nl}}(1+\|K\|)}.

Combining the cases, we have

\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq\min\left\{\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{4d_{x}^{2}}\cdot J_{\mathrm{lin}}^{(H)}(K\mid\mathbf{x}_{0},\gamma),\;\sqrt{\frac{\alpha}{d_{x}}}\frac{\|\mathbf{x}_{0}\|}{2H\beta_{\mathrm{nl}}(1+\|K\|)}\right\}.

Proposition C.5.

Let c be a given (integer) tolerance and \mathcal{X}_{\alpha}:=\{\mathbf{x}\in\mathbb{R}^{d_{x}}:\langle\mathbf{x},\mathbf{v}_{\max}(P_{H})\rangle^{2}\geq\alpha\|\mathbf{x}\|^{2}/d_{x}\} be defined as in Lemma C.4. Then, for \alpha\in(0,2], H=4c, and \mathbf{x}_{0}\in\mathcal{X}_{\alpha} satisfying,

𝐱02c2rnl4,and𝐱0dx64c2βnl(1+K),\displaystyle\|\mathbf{x}_{0}\|^{2}\leq c^{2}r_{\mathrm{nl}}^{4},\quad\text{and}\quad\|\mathbf{x}_{0}\|\leq\frac{d_{x}}{64c^{2}\beta_{\mathrm{nl}}(1+\|K\|)}, (C.6)

it holds that:

Jnl(H)(Kγ,𝐱0)𝐱02α8dx2min{Jlin(Kγ),c}.\displaystyle\frac{J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})}{\|\mathbf{x}_{0}\|^{2}}\geq\frac{\alpha}{8d_{x}^{2}}\min\left\{J_{\mathrm{lin}}(K\mid\gamma),c\right\}.

Moreover, for rmin{crnl2,dx64c2βnl(1+K)}r~{}\leq~{}\min\{cr_{\mathrm{nl}}^{2},\frac{d_{x}}{64c^{2}\beta_{\mathrm{nl}}(1+\|K\|)}\},

Jnl(H)(Kγ,r)180dx2min{Jlin(Kγ),c}.\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,r)\geq\frac{1}{80d_{x}^{2}}\min\left\{J_{\mathrm{lin}}(K\mid\gamma),c\right\}.
Proof.

From Lemma C.4 applied with $H=4c$, provided that

𝐱02α32c2βnl2(1+K2)dxrnl4\displaystyle\|\mathbf{x}_{0}\|^{2}\cdot\frac{\alpha}{32c^{2}\beta_{\mathrm{nl}}^{2}(1+\|K\|^{2})d_{x}}\leq r_{\mathrm{nl}}^{4} (C.7)

holds, we have

$\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq\min\left\{\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{4d_{x}^{2}}\cdot J_{\mathrm{lin}}^{(H)}(K\mid\gamma),\;\sqrt{\frac{\alpha}{d_{x}}}\frac{\|\mathbf{x}_{0}\|}{8c\beta_{\mathrm{nl}}(1+\|K\|)}\right\}.$

Note that, since $\alpha/d_{x}\leq 1$ and $\beta_{\mathrm{nl}}^{2}\geq 1$, Eq. C.7 holds as soon as $\|\mathbf{x}_{0}\|^{2}\leq c^{2}r_{\mathrm{nl}}^{4}$. By Lemma C.2, it then follows that:

Jnl(H)(Kγ,𝐱0)\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0}) min{α𝐱024dx2min{12Jlin(Kγ),c},αdx𝐱08cβnl(1+K)}\displaystyle\geq\min\left\{\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{4d_{x}^{2}}\cdot\min\left\{\frac{1}{2}J_{\mathrm{lin}}(K\mid\gamma),c\right\},\sqrt{\frac{\alpha}{d_{x}}}\frac{\|\mathbf{x}_{0}\|}{8c\beta_{\mathrm{nl}}(1+\|K\|)}\right\}
min{α𝐱028dx2min{Jlin(Kγ),c},αdx𝐱08cβnl(1+K)}\displaystyle\geq\min\left\{\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{8d_{x}^{2}}\min\left\{J_{\mathrm{lin}}(K\mid\gamma),c\right\},\sqrt{\frac{\alpha}{d_{x}}}\frac{\|\mathbf{x}_{0}\|}{8c\beta_{\mathrm{nl}}(1+\|K\|)}\right\}
=α𝐱028dx2min{Jlin(Kγ),c,(α𝐱028dx2)1αdx𝐱08cβnl(1+K)}.\displaystyle=\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{8d_{x}^{2}}\min\left\{J_{\mathrm{lin}}(K\mid\gamma),\,c,\,\left(\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{8d_{x}^{2}}\right)^{-1}\sqrt{\frac{\alpha}{d_{x}}}\frac{\|\mathbf{x}_{0}\|}{8c\beta_{\mathrm{nl}}(1+\|K\|)}\right\}.

Simplifying, the last term gives

$\displaystyle\left(\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{8d_{x}^{2}}\right)^{-1}\sqrt{\frac{\alpha}{d_{x}}}\frac{\|\mathbf{x}_{0}\|}{8c\beta_{\mathrm{nl}}(1+\|K\|)}=\frac{1}{c\|\mathbf{x}_{0}\|}\cdot\left(\frac{d_{x}^{3}}{\alpha}\right)^{1/2}\frac{1}{\beta_{\mathrm{nl}}(1+\|K\|)}\geq\frac{1}{c\|\mathbf{x}_{0}\|}\cdot\frac{d_{x}}{64\beta_{\mathrm{nl}}(1+\|K\|)},$

where we use dx/α1d_{x}/\alpha\geq 1. Thus, for

𝐱0dx64c2βnl(1+K),\displaystyle\|\mathbf{x}_{0}\|\leq\frac{d_{x}}{64c^{2}\beta_{\mathrm{nl}}(1+\|K\|)},

the third term in the minimum is at least $c$, so that

Jnl(H)(Kγ,𝐱0)α𝐱028dx2min{Jlin(Kγ),c}.\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})\geq\frac{\alpha\|\mathbf{x}_{0}\|^{2}}{8d_{x}^{2}}\min\left\{J_{\mathrm{lin}}(K\mid\gamma),\,c\right\}.

Lastly, using Lemma C.3,

Jnl(H)(Kγ,r)\displaystyle J_{\mathrm{nl}}^{(H)}(K\mid\gamma,r) =dxr2𝔼𝐱0r𝒮dx1Jnl(H)(Kγ,𝐱0)\displaystyle=\frac{d_{x}}{r^{2}}\cdot\mathbb{E}_{\mathbf{x}_{0}\sim r\cdot\mathcal{S}^{d_{x}-1}}J_{\mathrm{nl}}^{(H)}(K\mid\gamma,\mathbf{x}_{0})
$\displaystyle\geq\operatorname{\mathbb{P}}[\mathbf{x}_{0}\in\mathcal{X}_{1}]\cdot\frac{1}{8d_{x}^{2}}\min\left\{J_{\mathrm{lin}}(K\mid\gamma),c\right\}$
$\displaystyle\geq\frac{1}{80d_{x}^{2}}\min\left\{J_{\mathrm{lin}}(K\mid\gamma),c\right\},$
where the last inequality uses that $\operatorname{\mathbb{P}}[\mathbf{x}_{0}\in\mathcal{X}_{1}]\geq 1/10$.

C.4 Search analysis

Noisy Binary Search
Require: $\overline{f}_{1}$ and $\overline{f}_{2}$ as defined in Lemma C.6.
Initialize: $b_{0}\leftarrow 0$, $u_{0}\leftarrow 1$, $c\geq\overline{f}_{2}+2\alpha$ for $\alpha:=\min\{|f_{1}-\overline{f}_{1}|,|f_{2}-\overline{f}_{2}|\}$, $\varepsilon\in(0,\alpha/2)$.
For $t=1,\dots$
1. Query $a_{t}\leftarrow\varepsilon\text{-}\mathtt{Eval}(f,x_{t},c)$, where $x_{t}=\frac{b_{t-1}+u_{t-1}}{2}$
2. If $a_{t}>\overline{f}_{2}+\varepsilon$, update $u_{t}\leftarrow x_{t}$ and $b_{t}\leftarrow b_{t-1}$
3. Else if $a_{t}<\overline{f}_{1}+\varepsilon$, update $b_{t}\leftarrow x_{t}$ and $u_{t}\leftarrow u_{t-1}$
4. Else, break and return $x_{t}$

Noisy Random Search
Require: $\overline{f}_{1}$, $\overline{f}_{2}$ as defined in Lemma C.7.
Initialize: $c\geq\overline{f}_{2}+2\alpha$ for $\alpha:=\min\{|f_{1}-\overline{f}_{1}|,|f_{2}-\overline{f}_{2}|\}$, $\varepsilon\in(0,\alpha/2)$.
For $t=1,\dots$
1. Sample $x$ uniformly at random from $[0,1]$
2. Query $a\leftarrow\varepsilon\text{-}\mathtt{Eval}(f,x,c)$
3. If $a\in[\overline{f}_{1},\overline{f}_{2}]$, break and return $x$

Figure 2: Algorithms for one-dimensional search used as subroutines for Step 4 of the discount annealing algorithm.
Lemma C.6.

Let $f:[0,1]\rightarrow\mathbb{R}\cup\{\infty\}$ be a nondecreasing function over the unit interval. Then, given $\overline{f}_{1},\overline{f}_{2}$ such that $[\overline{f}_{1},\overline{f}_{2}]\subseteq[f_{1},f_{2}]$ and for which there exists an interval $[x_{1},x_{2}]\subseteq[0,1]$ such that for all $x^{\prime}\in[x_{1},x_{2}]$, $f(x^{\prime})\in[\overline{f}_{1},\overline{f}_{2}]$, noisy binary search as defined in Figure 2 returns a value $x_{\star}\in[0,1]$ such that $f(x_{\star})\in[f_{1},f_{2}]$ in at most $\lceil\log_{2}(1/\Delta)\rceil$ iterations, where $\Delta=x_{2}-x_{1}$.

Lemma C.7.

Let $f:[0,1]\rightarrow\mathbb{R}\cup\{\infty\}$ be a function over the unit interval. Then, given $\overline{f}_{1},\overline{f}_{2}$ such that $[\overline{f}_{1},\overline{f}_{2}]\subseteq[f_{1},f_{2}]$ and for which there exists an interval $[x_{1},x_{2}]\subseteq[0,1]$ such that for all $x^{\prime}\in[x_{1},x_{2}]$, $f(x^{\prime})\in[\overline{f}_{1},\overline{f}_{2}]$, with probability $1-\delta$, noisy random search as defined in Figure 2 returns a value $x_{\star}\in[0,1]$ such that $f(x_{\star})\in[f_{1},f_{2}]$ in at most $(1/\Delta)\log(1/\delta)$ iterations, where $\Delta=x_{2}-x_{1}$.

The analysis of the correctness and runtime of these classical one-dimensional search algorithms is standard. We omit the proofs for the sake of concision.
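To make the two subroutines in Figure 2 concrete, we include a minimal Python sketch below. The oracle eps_eval stands in for the $\varepsilon\text{-}\mathtt{Eval}$ routine, the arguments f1_bar and f2_bar for $\overline{f}_{1},\overline{f}_{2}$, and the max_iters cutoff is an assumption added for the sketch (the pseudocode in Figure 2 simply loops until it breaks); these names are illustrative and not part of our implementation.

import random

def noisy_binary_search(eps_eval, f1_bar, f2_bar, eps, max_iters=64):
    # Sketch of Noisy Binary Search from Figure 2. eps_eval(x) is assumed to
    # return an eps-accurate (clipped) evaluation of the nondecreasing target
    # function f at x; [f1_bar, f2_bar] is the bracket from Lemma C.6.
    lo, hi = 0.0, 1.0  # b_{t-1} and u_{t-1} in the notation of Figure 2
    for _ in range(max_iters):
        x = (lo + hi) / 2.0
        a = eps_eval(x)
        if a > f2_bar + eps:    # estimate too large: move the upper endpoint down
            hi = x
        elif a < f1_bar + eps:  # estimate too small: move the lower endpoint up
            lo = x
        else:                   # estimate lies in the target band: return
            return x
    return (lo + hi) / 2.0

def noisy_random_search(eps_eval, f1_bar, f2_bar, max_iters=1000):
    # Sketch of Noisy Random Search from Figure 2 (analyzed in Lemma C.7).
    for _ in range(max_iters):
        x = random.random()     # uniform draw from [0, 1]
        if f1_bar <= eps_eval(x) <= f2_bar:
            return x
    return None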

Appendix D Additional Experiments

D.1 Cart-pole dynamics

The state of the cart-pole is given by 𝐱=(x,θ,x˙,θ˙)\mathbf{x}=(x,\theta,\dot{x},\dot{\theta}), where xx and x˙\dot{x} denote the horizontal position and velocity of the cart, and θ\theta and θ˙\dot{\theta} denote the angular position and velocity of the pole, with θ=0\theta=0 corresponding to the upright equilibrium. The control input uu\in\mathbb{R} corresponds to the horizontal force applied to the cart. The continuous time dynamics are given by

[mp+mcmplcos(θ)mplcos(θ)mpl2][x¨θ¨]=[umplsin(θ)θ˙2mpglsin(θ)]\left[\begin{array}[]{cc}m_{p}+m_{c}&-m_{p}l\cos(\theta)\\ -m_{p}l\cos(\theta)&m_{p}l^{2}\end{array}\right]\left[\begin{array}[]{c}\ddot{x}\\ \ddot{\theta}\end{array}\right]=\left[\begin{array}[]{c}u-m_{p}l\sin(\theta)\dot{\theta}^{2}\\ m_{p}gl\sin(\theta)\end{array}\right]

where $m_{p}$ denotes the mass of the pole, $m_{c}$ the mass of the cart, $l$ the length of the pendulum, and $g$ the acceleration due to gravity. For our experiments, we set all parameters to unity, i.e., $m_{p}=m_{c}=l=g=1$.
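For reference, the following is a minimal Python sketch of these dynamics with the unit parameters above; the explicit Euler step and the step size dt are illustrative assumptions, not necessarily the integrator used in our simulations.

import numpy as np

# Cart-pole parameters, all set to unity as in the experiments.
mp, mc, l, g = 1.0, 1.0, 1.0, 1.0

def cartpole_dynamics(state, u):
    # Continuous-time dynamics for state = (x, theta, xdot, thetadot) and
    # horizontal force u, obtained by solving the 2x2 mass-matrix equation
    # above for the accelerations (xddot, thetaddot).
    x, theta, xdot, thetadot = state
    M = np.array([[mp + mc, -mp * l * np.cos(theta)],
                  [-mp * l * np.cos(theta), mp * l ** 2]])
    rhs = np.array([u - mp * l * np.sin(theta) * thetadot ** 2,
                    mp * g * l * np.sin(theta)])
    xddot, thetaddot = np.linalg.solve(M, rhs)
    return np.array([xdot, thetadot, xddot, thetaddot])

def euler_step(state, u, dt=0.01):
    # One explicit Euler step of the continuous-time dynamics.
    return state + dt * cartpole_dynamics(state, u)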

D.2 {\mathcal{H}_{\infty}} synthesis

Consider the following linear system

𝐱t+1=A𝐱t+B𝐮t+𝐰t,𝐳t=[Q1/2𝐱tR1/2𝐮t],\mathbf{x}_{t+1}=A\mathbf{x}_{t}+B\mathbf{u}_{t}+\mathbf{w}_{t},\quad\mathbf{z}_{t}=\left[\begin{array}[]{c}Q^{1/2}\mathbf{x}_{t}\\ R^{1/2}\mathbf{u}_{t}\end{array}\right], (D.1)

where 𝐰\mathbf{w} denotes an additive disturbance to the state transition, and 𝐳\mathbf{z} denotes the so-called performance output. Notice that 𝐳t22=𝐱tQ𝐱t+𝐮tR𝐮t\|\mathbf{z}_{t}\|_{2}^{2}=\mathbf{x}_{t}^{\top}Q\mathbf{x}_{t}+\mathbf{u}_{t}^{\top}R\mathbf{u}_{t}. The {\mathcal{H}_{\infty}} optimal controller minimizes the {\mathcal{H}_{\infty}}-norm of the closed-loop system from input 𝐰\mathbf{w} to output 𝐳\mathbf{z}, i.e. the smallest η\eta such that

$\sum_{t=0}^{T}\|\mathbf{z}_{t}\|_{2}^{2}~{}\leq~{}\eta^{2}\sum_{t=0}^{T}\|\mathbf{w}_{t}\|_{2}^{2}$

holds for all 𝐰0:T\mathbf{w}_{0:T}, all TT, and 𝐱0=0\mathbf{x}_{0}=0. In essence, the {\mathcal{H}_{\infty}} optimal controller minimizes the effect of the worst-case disturbance on the cost. In this setting, additive disturbances 𝐰\mathbf{w} serve as a crude (and unstructured) proxy for modeling error. We synthesize the {\mathcal{H}_{\infty}} optimal controller using Matlab’s hinfsyn function. For the system (D.1) with perfect state observation, we require only static state feedback 𝐮t=K𝐱t\mathbf{u}_{t}=K\mathbf{x}_{t} to implement the controller.
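As an illustration of this definition (and not of the synthesis procedure itself, which uses hinfsyn), the following Python sketch empirically lower-bounds the closed-loop gain from $\mathbf{w}$ to $\mathbf{z}$ for a given static feedback $K$ by simulating random disturbance sequences; the horizon, the number of trials, and the Gaussian disturbances are arbitrary choices made for the sketch.

import numpy as np

def empirical_gain(A, B, K, Q, R, T=200, trials=100, seed=0):
    # Monte-Carlo lower bound on the squared closed-loop H-infinity gain of
    # (D.1) under u_t = K x_t: the largest observed ratio
    # sum_t ||z_t||^2 / sum_t ||w_t||^2 over random disturbance sequences,
    # starting from x_0 = 0.
    rng = np.random.default_rng(seed)
    dx = A.shape[0]
    A_cl = A + B @ K  # closed-loop transition matrix
    worst = 0.0
    for _ in range(trials):
        x = np.zeros(dx)
        z_energy, w_energy = 0.0, 0.0
        for _ in range(T):
            w = rng.standard_normal(dx)
            u = K @ x
            z_energy += x @ Q @ x + u @ R @ u  # ||z_t||_2^2
            w_energy += w @ w
            x = A_cl @ x + w
        worst = max(worst, z_energy / w_energy)
    return worst

Since the sampled disturbances are random rather than adversarial, the returned value only lower-bounds $\eta^{2}$.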

D.3 Difference between LQR and discount annealing as a function of discount

In Figure 3 we plot the error $\|K_{\text{pg}}^{*}(\gamma)-K_{\text{lin}}^{*}(\gamma)\|_{F}$ between the policy $K_{\text{pg}}^{*}(\gamma)$ returned by policy gradient and the optimal policy $K_{\text{lin}}^{*}(\gamma)$ for the (damped) linearized system, as a function of the discount $\gamma$ during the discount annealing process. Observe that for a small radius of the ball of initial conditions ($r=0.05$), the controller returned by policy gradient remains very close to the exact optimal controller for the linearized system; for a larger radius ($r=0.7$), however, the two differ significantly.

Figure 3: Error between the policy returned by policy gradient and the optimal LQR policy for the (damped) linearized system during discount annealing, for two different radii of the ball of initial conditions, $r=0.05$ and $r=0.7$. Five independent trials are plotted for each radius; the curves from different trials overlap almost entirely.