
Nonlinear Control of Quadcopters via Approximate Dynamic Programming

Angel Romero, Paul N. Beuchat, Yvonne R. Stürz, Roy S. Smith, and John Lygeros. All authors are with the Automatic Control Laboratory at the Swiss Federal Institute of Technology in Zurich (ETHZ), Physikstrasse 3, 8092 Zurich, Switzerland. Email addresses: {beuchatp, stuerzy, rsmith, jlygeros}@ethz.ch
Abstract

While Approximate Dynamic Programming has successfully been used in many applications involving discrete states and inputs, such as playing the games of Tetris or chess, it has rarely been used in applications with continuous state and input spaces. In this paper, we combine Approximate Dynamic Programming techniques and apply them to the continuous, nonlinear and high dimensional dynamics of a quadcopter vehicle. We use a polynomial approximation of the dynamics and sum-of-squares programming techniques to compute a family of polynomial value function approximations for different tuning parameters. The resulting approximations to the optimal value function are combined in a point-wise maximum approach, which is used to compute the online policy. The success of the method is demonstrated in both simulations and experiments on a quadcopter. The control performance is compared to a linear time-varying Model Predictive Controller. The two methods are then combined to retain both the computational benefits of a short-horizon MPC and the long-term performance benefits of using the Approximate Dynamic Programming value function as the terminal cost.

I INTRODUCTION

Dynamic Programming (DP), [1], is an important method for optimal decision making, relevant in a broad range of applications including finance, health sciences and engineering. While DP has been successfully applied to many specific problems [2], it suffers from severe limitations when dealing with high dimensional systems. Approximate Dynamic Programming (ADP) aims to alleviate these limitations by finding tractable approximations to the optimal policies that result from DP. ADP methods have proven to work well for certain practical applications such as the game of Tetris [3], or the game of chess [4]. However, despite considerable progress in the theory and algorithms of ADP, [5, 6, 7], there have not been many successful applications of ADP to systems with continuous state and input spaces.

In [8], the authors suggest the so-called iterated Bellman inequality as a novel way of obtaining approximations to the value function by solving convex optimization problems. The approach was validated by applying it to the linear model of a simple mechanical system. In [9] the iterated Bellman inequality method is used to approximate the tail cost of a Model Predictive Control (MPC) problem for a power inverter with a continuous state space and a finite input space. This effectively allows for a shorter time horizon, and thus a reduced computational burden. In [10], the authors generalize the problem formulation of [8] by considering polynomials as the chosen function space, and then use sum-of-squares (SOS) techniques to relax the problem to a semi-definite program (SDP). This method is implemented and applied to control a miniature helicopter using a simplified linear model around hover.

The combination of ADP methods we apply to quadcopters is most closely connected with the method proposed in [11]. There the authors introduce a novel way of computing successively tighter under-estimators to the optimal value function in an iterative fashion. This method has proven to be computationally tractable for the nonlinear quadcopter model considered in this paper, in contrast to [8]. In [12], a family of objective functions is chosen instead of only one, with a value function approximation computed for each and then put together in a point-wise maximum (PWM) approach. This reduces the dependency of the result on the choice of the individual objectives. Parts of these techniques are used in [13] and demonstrated in simulation on a nonlinear energy storage system with a low dimensional state-by-input space. In that application, the slow time scale allows for computationally demanding online policies.

Due to their inherent instability and ease of design, unmanned aerial vehicles (UAVs) have been widely used as a test bench for control and robotics research [14]. In [15], nonlinear MPC is used to control the fast attitude dynamics of a hexacopter vehicle, and a Linear Quadratic Regulator (LQR) is used for the slower position dynamics. A nonlinear constraint on the tilt angle is considered. The maneuverability and tracking performance of the system are shown in simulation and experiment while tracking setpoints on a square. Similarly, in [16], a real-time iteration linear MPC for the control of the position dynamics is run on-board a quadcopter.

In this paper, we combine multiple ADP methods to design a controller for the continuous, nonlinear, high dimensional model of a quadcopter and demonstrate the success of the methods in simulations and experiments. The key contributions of this work are:

  • We combine the ADP techniques from [10], [11] and [12], leveraging the strengths of each.

  • We develop the necessary approximation steps for synthesizing an ADP controller by this combined method for a high-dimensional quadcopter model.

  • We demonstrate the benefits of this method compared to other linear and nonlinear control techniques in simulations and experiments.

The paper is organized as follows: In Section II, we present the deterministic control problem and introduce the ADP formulation. In Section III, we derive the nonlinear model of the quadcopter, transform it to fit our ADP formulation, and then use it to derive the approximation to the optimal value function. Section IV presents the results obtained from simulations and experiments, and Section V concludes the paper.

II Approximate Dynamic Programming Formulation

In this section, we first state the optimal control problem to be solved, and then we present the techniques used for computing approximate solutions, which will be applied to a quadcopter in the next section.

II-A Optimal Control Problem

We consider the discrete-time dynamical system of a quadcopter with continuous state and input spaces and aim to minimize a discounted cost objective. The specific states, inputs and dynamics of the quadcopter will be given in Section III. For now, let us denote the state of the system at time $k$ as $x_{k}\in\mathbb{R}^{n_{x}}$, and the control input as $u_{k}\in\mathbb{R}^{n_{u}}$. The evolution of the system is described by the dynamics function $f:\mathcal{X}\times\mathcal{U}\to\mathcal{X}$ such that

$x_{k+1}=f(x_{k},u_{k}).$ (1)

The stage cost function is the non-negative cost of taking decision $u_{k}$ when being in state $x_{k}$, denoted as $l:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}_{+}$. Given a discount factor $\xi\in[0,1)$ and a stationary control policy of the form $\pi:\mathcal{X}\rightarrow\mathcal{U}$, the discounted infinite horizon cost is thus,

$J_{\pi}(x)=\sum_{k=0}^{\infty}\xi^{k}\,l(x_{k},\pi(x_{k})),\quad\text{with } x_{0}=x.$ (2)

The aim is to find the policy that minimizes (2) for all $x\in\mathcal{X}$. The solution is characterized by the optimal value function $V^{*}:\mathcal{X}\to\mathbb{R}$ that satisfies the Bellman equation,

$V^{*}(x)=\underbrace{\min_{u\in\mathcal{U}}\; l(x,u)+\xi\,V^{*}(f(x,u))}_{\mathcal{T}V^{*}},\quad\forall x\in\mathcal{X},$ (3)

where $\mathcal{T}$ is the Bellman operator. Given a solution of (3), the optimal policy is the so-called greedy policy,

$\pi^{*}(x)=\operatorname*{arg\,min}_{u\in\mathcal{U}}\; l(x,u)+\xi\,V^{*}(f(x,u)),$ (4)

which minimizes the cost of taking the decision $u$ now, plus the cost-to-go from the next time period onwards. We require the standard assumptions on the stage cost, dynamics, and existence of feasible policies to ensure that $V^{*}$ and $\pi^{*}$ exist, are time-invariant, and attain the minimum, see for example [17, Assumptions 4.2.1, 4.2.2] or [18, §1.2]. For a general problem instance, solving (3) and implementing (4) is intractable.
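To make the problem setup concrete, the following is a minimal sketch (ours, not from the paper) of evaluating the discounted cost (2) of a given policy by forward simulation. The dynamics f, stage cost l, and policy pi are assumed to be user-supplied callables, and the infinite sum is truncated once the discount factor renders the tail negligible.

```python
# A minimal sketch of evaluating the discounted cost (2) by forward simulation.
# f, l, pi are placeholder callables supplied by the user; xi = 0.98 mirrors the
# discount factor used later in Section IV-A.
import numpy as np

def discounted_cost(f, l, pi, x0, xi=0.98, tol=1e-8, max_steps=10_000):
    """Approximate J_pi(x0) = sum_k xi^k * l(x_k, pi(x_k)) by truncation."""
    x, cost, discount = np.asarray(x0, dtype=float), 0.0, 1.0
    for _ in range(max_steps):
        u = pi(x)
        cost += discount * l(x, u)
        x = f(x, u)
        discount *= xi
        if discount < tol:   # remaining tail is negligible for bounded stage costs
            break
    return cost
```

A similar (undiscounted) roll-out of the closed loop is what underlies the cost comparison of the different policies in Section IV (Figure 8).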

II-B Linear Programming Approach to ADP with Polynomials

The monotone and contractive properties of the Bellman operator mean that any function $\hat{V}:\mathcal{X}\to\mathbb{R}$ satisfying the so-called Bellman inequality is a point-wise under-estimator of $V^{*}$, i.e.,

$\hat{V}(x)\leq\mathcal{T}\hat{V}(x),\,\forall x\in\mathcal{X}\;\Rightarrow\;\hat{V}(x)\leq V^{*}(x),\,\forall x\in\mathcal{X},$ (5)

where $\hat{V}$ is referred to as an approximate value function. This is the key property that motivates the various LP approaches to ADP proposed in [5, 8, 10, 11, 13]. Letting $\mathcal{F}(\mathcal{X})$ denote a space of functions mapping $\mathcal{X}\rightarrow\mathbb{R}$, and using the Bellman inequality, a point-wise under-estimator is found by solving the following optimization problem,

$\begin{aligned}
\underset{\hat{V}}{\text{maximize}}\quad & \int\hat{V}(x)\,c(x)\,dx && \text{(6a)}\\
\text{subject to}\quad & \hat{V}\in\mathcal{F}(\mathcal{X}) && \text{(6b)}\\
& \hat{V}(x)\leq\mathcal{T}\hat{V}(x),\;\forall x\in\mathcal{X}, && \text{(6c)}
\end{aligned}$

where the state relevance weighting function $c(x)$ is a finite measure on the state space; it allows the practitioner to select the region where a better approximation is desired. When $\mathcal{F}(\mathcal{X})$ is the space of real-valued measurable functions on $\mathcal{X}$, it is proven in [17] that the solution to (6) solves the Bellman equation for $c$-almost all $x\in\mathcal{X}$; however, the infinite-dimensional decision variable makes the problem intractable.

In this paper we use the space of polynomials for the function space $\mathcal{F}(\mathcal{X})$. This choice is motivated by [10], where the authors show that problem (6) can be reformulated as a finite SDP when the dynamics, stage costs, and spaces are described by polynomials. Letting $\mathcal{P}_{d}(\mathcal{X})$ denote the space of polynomials up to degree $d$, we replace (6b) by $\hat{V}\in\mathcal{P}_{d}(\mathcal{X})$, and hence the decision variables are the coefficients of the monomials up to degree $d$. Considering the minimization in the Bellman operator $\mathcal{T}$, the infinitely many constraints (6c) can be equivalently written as,

$0\leq\underbrace{-\hat{V}(x)+l(x,u)+\xi\,\hat{V}(f(x,u))}_{b(x,u)},\quad\forall x\in\mathcal{X},\,u\in\mathcal{U},$ (7)

where $b(x,u)$ is a polynomial when $l$ and $f$ are polynomials.

We now use the SOS S-Procedure [19] to reformulate (7) as a single Linear Matrix Inequality (LMI) constraint on the decision variable $\hat{V}\in\mathcal{P}_{d}(\mathcal{X})$. Letting $SOS$ denote the set of polynomials that are representable as a sum of polynomials squared, then from [19, Theorem 3.3] we have the following equivalence for certifying that a polynomial $v\in\mathcal{P}_{2d}(\mathcal{X})$ is an SOS polynomial,

$v(x)\in SOS\;\Leftrightarrow\;\exists M\succeq 0\;\text{s.t.}\;v(x)=z(x)^{T}Mz(x),$ (8)

where $z(x)$ is the vector of monomials of $x\in\mathbb{R}^{n_{x}}$ up to degree $d$, and $M$ is a square matrix of size $\binom{n_{x}+d}{d}$. Letting $g_{i}(x,u)\geq 0$ denote the polynomials that describe the state and input spaces ($\mathcal{X}$ and $\mathcal{U}$), then the following set of equations,

$\begin{aligned}
b(x,u)-\lambda(x,u)\sum\nolimits_{i}g_{i}(x,u) &\,\in\, SOS, && \text{(9a)}\\
\lambda(x,u) &\,\in\, SOS, && \text{(9b)}
\end{aligned}$

is a sufficient reformulation of (7) in the sense that (9) $\Rightarrow$ (7). Replacing (6c) with (9), the multiplier $\lambda\in\mathcal{P}_{d}(\mathcal{X}\times\mathcal{U})$ is an additional decision variable, with its polynomial degree as a design choice. It is reasonable to choose the largest multiplier degree for which the LMI constraints stemming from (9) are computationally tractable.

Finally, the objective (6a) is linear in the coefficients of $\hat{V}\in\mathcal{P}_{d}(\mathcal{X})$ and requires knowledge of the moments of the state relevance weighting function $c(x)$ up to order $d$. Thus, a point-wise under-estimator of $V^{*}$ can be found using a commercial solver for the SDP relaxation of (6).
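As an illustration of the certificate (8), the sketch below checks whether a given univariate quartic is SOS by searching for a PSD Gram matrix whose entries match the polynomial coefficients. This is our own toy example in CVXPY; the paper builds the corresponding, much larger, SDPs with Yalmip and Mosek as noted in Section IV.

```python
# A minimal sketch of the SOS certificate (8) for a univariate quartic.
# The Gram matrix M is constrained to be PSD and matched coefficient-wise to
# v(x) over the monomial vector z(x) = [1, x, x^2].
import cvxpy as cp

# v(x) = x^4 - 2x^3 + 3x^2 - 2x + 1 = (x^2 - x + 1)^2, coefficients c[k] of x^k
c = [1.0, -2.0, 3.0, -2.0, 1.0]

M = cp.Variable((3, 3), PSD=True)            # Gram matrix for z(x) = [1, x, x^2]
constraints = [
    M[0, 0] == c[0],                         # constant term
    2 * M[0, 1] == c[1],                     # x
    2 * M[0, 2] + M[1, 1] == c[2],           # x^2
    2 * M[1, 2] == c[3],                     # x^3
    M[2, 2] == c[4],                         # x^4
]
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print("SOS certificate found" if prob.status == cp.OPTIMAL else "no certificate found")
```

The same coefficient-matching mechanism, applied to $b(x,u)$ and the multiplier $\lambda(x,u)$ in (9), is what turns the Bellman inequality into LMI constraints on the coefficients of $\hat{V}$.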

II-C Point-Wise Maximum Approach to ADP

To improve the quality of the approximate value function, we use the approach proposed in [11] that solves a sequence of optimization problems, each with constraints of the same size as (9).

First, we solve the SDP relaxation of (6) for a particular choice of the state relevance weighting, and denote the solution as $\hat{V}_{1}^{*}$. Then we solve the SDP relaxation of the following optimization problem with $j=2$,

$\begin{aligned}
\underset{\hat{V}_{j}}{\text{maximize}}\quad & \int\hat{V}_{j}(x)\,c(x)\,dx && \text{(10a)}\\
\text{subject to}\quad & \hat{V}_{j}\in\mathcal{P}_{d}(\mathcal{X}), && \text{(10b)}\\
& \hat{V}_{j}(x)\leq\Big(\mathcal{T}\max_{k=1,\dots,j-1}\hat{V}^{*}_{k}\Big)(x),\quad\forall x\in\mathcal{X}, && \text{(10c)}
\end{aligned}$

where $\hat{V}_{1}^{*}$ is fixed problem data, and we denote the solution as $\hat{V}_{2}^{*}$. We then solve the SDP relaxation of (10) iteratively for $j\geq 3$, each time storing the solution $\hat{V}_{j}^{*}$ as fixed data to be used in the next iteration. As the cost function (10a) is non-decreasing with the iterations of (10), we terminate when the improvement in the cost function becomes less than a pre-specified threshold. The approximate value function for a particular choice of $c(x)$ is set as the solution of the final iteration.

The steps given in [11] show how to reformulate (10c) as a polynomial inequality constraint similar to (7). The SOS S-Procedure is then applied and the resulting relaxation involves one LMI constraint of the same size as (9a), and $j-1$ LMI constraints identical to (9b).
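The overall structure of the iteration can be summarized as in the skeleton below (ours). Here, solve_sdp_relaxation is a hypothetical placeholder standing in for the SOS/SDP machinery of Section II-B; it is assumed to return the new under-estimator and the attained objective value for the given weighting c and the previously fixed solutions.

```python
# Structural sketch of the iteration of Section II-C. solve_sdp_relaxation is a
# placeholder for the SDP relaxation of (10), returning (V_j, objective value).
def fit_value_function(solve_sdp_relaxation, c, tol=1e-5, max_iters=20):
    V_list, prev_obj = [], -float("inf")
    for j in range(max_iters):
        V_j, obj = solve_sdp_relaxation(c, fixed=V_list)      # SDP relaxation of (10)
        V_list.append(V_j)
        if obj - prev_obj < tol * max(abs(prev_obj), 1.0):    # relative improvement threshold
            break
        prev_obj = obj
    return V_list[-1]   # approximate value function for this choice of c
```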

II-D Online Policy

To construct an online policy, we first construct our best approximation to the value function, and we use this as a surrogate for $V^{*}$ in the greedy policy (4). For different choices of the state relevance weighting, denoted $c_{i}(x)$ for $i=1,\dots,N_{c}$, we perform the iteration described in Section II-C and denote $\hat{V}_{c_{i}}^{*}$ as the value function estimate returned by the final iteration. We take the PWM of these under-estimators as our best estimate for $V^{*}$, i.e.,

$\max_{i=1,\dots,N_{c}}\hat{V}_{c_{i}}^{*}(x)\;\leq\;V^{*}(x),\quad\forall x\in\mathcal{X}.$ (11)

The online policy, also called the approximate greedy policy, is thus,

$\hat{\pi}(x)=\operatorname*{arg\,min}_{u\in\mathcal{U}}\; l(x,u)+\xi\,\max_{i=1,\dots,N_{c}}\hat{V}_{c_{i}}^{*}(f(x,u)),$ (12)

where the maximum over the $\hat{V}_{c_{i}}^{*}$ is readily reformulated using an epigraph variable.

To construct the surrogate for $V^{*}$, it remains to choose the $c_{i}$ weightings; in Section IV we provide the weightings used for nonlinear control of a quadcopter. The surrogate for $V^{*}$ can be considered a sufficiently accurate estimate if the online performance of (12) achieves a low cost.
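For illustration, a minimal sketch (ours) of evaluating the approximate greedy policy (12) at a given state is shown below. The dynamics f, stage cost l, the fitted value functions in V_list, and the input-space constraints g_i(u) >= 0 from Section III-C are assumed to be available as callables, and SciPy's SLSQP stands in for the interior point solver used in the simulations of Section IV. Note that the point-wise maximum makes the objective nonsmooth; in an actual implementation the epigraph reformulation mentioned above is preferable.

```python
# Sketch of the approximate greedy policy (12) as a nonlinear program over u.
import numpy as np
from scipy.optimize import minimize

def greedy_policy(x, f, l, V_list, g_list, u0, xi=0.98):
    def objective(u):
        x_next = f(x, u)
        return l(x, u) + xi * max(V(x_next) for V in V_list)   # point-wise maximum (11)

    cons = [{"type": "ineq", "fun": g} for g in g_list]         # g_i(u) >= 0
    res = minimize(objective, u0, method="SLSQP", constraints=cons)
    return res.x
```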

III Control of a Quadcopter

In the following, the theory from the previous section will be used for the control of a quadcopter vehicle. The dynamics are brought into a suitable form by applying a reduction in the state space and a polynomial approximation.

III-A Dynamical Model of the Quadcopter

To describe the dynamics of the quadcopter, two frames of reference are used: the inertial (world) frame and the body frame (attached to the quadcopter). The location of the body frame with respect to the inertial frame is denoted as $\vec{p}=[p_{x},p_{y},p_{z}]^{\top}$ and shown in Figure 1. The angular rates about the body frame axes $x^{(B)}$, $y^{(B)}$ and $z^{(B)}$ are denoted $\vec{\omega}=[\omega_{x},\omega_{y},\omega_{z}]^{\top}$. The orientation of the body frame relative to the inertial frame is given by $\vec{\psi}=[\gamma,\beta,\alpha]^{\top}$, the roll, pitch, and yaw intrinsic Euler angles, respectively. Using the ZYX convention for the Euler angles, and with $s_{(\cdot)}:=\sin(\cdot)$, $c_{(\cdot)}:=\cos(\cdot)$, the equations of motion for a quadcopter of mass $m$ and with moment of inertia $J$ are readily derived as,

$\begin{aligned}
\ddot{\vec{p}}^{(I)} &= \frac{1}{m}\sum_{i=1}^{4}f_{i}\begin{bmatrix}c_{\alpha}s_{\beta}c_{\gamma}+s_{\alpha}s_{\gamma}\\ s_{\alpha}s_{\beta}c_{\gamma}-c_{\alpha}s_{\gamma}\\ c_{\beta}c_{\gamma}\end{bmatrix}-\begin{bmatrix}0\\ 0\\ g\end{bmatrix}, && \text{(13a)}\\
\begin{bmatrix}\dot{\omega}_{x}\\ \dot{\omega}_{y}\\ \dot{\omega}_{z}\end{bmatrix} &= J^{-1}\left(\begin{bmatrix}\sum_{i=1}^{4}f_{i}d_{y_{i}}\\ -\sum_{i=1}^{4}f_{i}d_{x_{i}}\\ \sum_{i=1}^{4}f_{i}d_{\tau_{i}}\end{bmatrix}-\vec{\omega}\times J\vec{\omega}\right), && \text{(13b)}
\end{aligned}$

where $f_{i}$ is the thrust produced by propeller $i$, $d_{y_{i}}$ and $d_{x_{i}}$ are the distances from the axis of propeller $i$ to the center of gravity of the quadcopter, $d_{\tau_{i}}$ is a constant of proportionality for how much torque is produced by thrust $f_{i}$, and $g$ is the acceleration due to gravity. For more details on the derivation of the dynamics refer to [20]. The system thus has 12 states, given by $[\vec{p}^{\,\top},\dot{\vec{p}}^{\,\top},\vec{\psi}^{\,\top},\vec{\omega}^{\,\top}]^{\top}$, and four inputs, given by the motor thrusts $f_{i}$, $i=1,\dots,4$.
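For reference, a sketch (ours) of the accelerations (13) in NumPy is given below. The per-propeller geometric parameters d_x, d_y, d_tau, the inertia matrix J, and the mass m are assumed given; psi = (gamma, beta, alpha) are the ZYX roll, pitch, and yaw angles and f is the vector of the four propeller thrusts.

```python
# Sketch of the translational and rotational accelerations (13).
import numpy as np

def accelerations(psi, omega, f, m, J, d_x, d_y, d_tau, g=9.81):
    gamma, beta, alpha = psi
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)

    f_T = np.sum(f)
    # Translational acceleration in the inertial frame, (13a)
    p_ddot = (f_T / m) * np.array([ca * sb * cg + sa * sg,
                                   sa * sb * cg - ca * sg,
                                   cb * cg]) - np.array([0.0, 0.0, g])
    # Body-frame torques and angular acceleration, (13b)
    torque = np.array([np.dot(f, d_y), -np.dot(f, d_x), np.dot(f, d_tau)])
    omega_dot = np.linalg.solve(J, torque - np.cross(omega, J @ omega))
    return p_ddot, omega_dot
```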

Figure 1: Coordinate frames for the quadcopter: inertial frame $(I)$ and body frame $(B)$.

III-B System Model Approximation for ADP

In order to apply the ADP method from Section II, we approximate the full dynamics (13) in three steps. First, we exploit time scale separation to reduce the state dimension of the dynamics by using the cascaded control structure shown in Figure 2. The inner controller considers the states $[\vec{\psi}^{\,\top},\vec{\omega}^{\top}]^{\top}$ and uses the thrust inputs $f_{1},\dots,f_{4}$ to track a reference attitude $\vec{\psi}_{\mathrm{ref}}$ and total thrust $f_{T,\mathrm{ref}}$. The outer controller considers the states $x=[\vec{p}^{\,\top},\dot{\vec{p}}^{\,\top}]^{\top}$ and uses the inputs $u=[\gamma,\beta,\alpha,f_{T}]^{\top}$ to stabilize the quadcopter to a specified position and yaw setpoint. These inputs are given as references to the inner controller. This cascaded structure is common practice for quadcopter control [15], and the approximation step assumes a sufficient time-scale separation so that the transients of the inner loop controller can be neglected when designing the outer loop controller. The goal is to design the outer controller using the combination of ADP methods described in Section II.

In the second approximation step, we use a Taylor expansion of (13a) to replace the trigonometric functions with a polynomial approximation that fits the SOS framework described in Section II-B. We use a third order Taylor expansion as a compromise between the accuracy of the approximated nonlinear dynamics and the size of the optimization problem (10) for fitting approximate value functions. The translational dynamics (13a) are thus approximated as

$\begin{bmatrix}\dot{\vec{p}}\\ \frac{1}{m}\big(\beta f_{T}-\tfrac{1}{2}mg\alpha^{2}\beta+f_{T}\alpha\gamma-\tfrac{1}{6}mg\beta^{3}-\tfrac{1}{2}mg\beta\gamma^{2}\big)\\ \frac{1}{m}\big(-\gamma f_{T}+f_{T}\alpha\beta+\tfrac{1}{2}mg\alpha^{2}\gamma+\tfrac{1}{6}mg\gamma^{3}\big)\\ \frac{1}{m}\big((f_{T}-mg)-\tfrac{1}{2}f_{T}\beta^{2}-\tfrac{1}{2}f_{T}\gamma^{2}\big)\end{bmatrix},$ (14)

where we have introduced $f_{T}=\sum_{i}f_{i}$ to represent the total thrust.

Finally, to bring the dynamics of (14) to discrete time, we use a forward Euler scheme. The discrete-time dynamics are required for applying the ADP methods to the design of the outer-loop controller. We choose a forward Euler approximation in order not to increase the polynomial order of the dynamics, and assume that the sampling time is fast enough to maintain the stability properties of the plant. For the stage cost we choose the quadratic form $l(x,u)=x^{\top}Qx+u^{\top}Ru$.
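The resulting discrete-time outer-loop model and stage cost can be written compactly as in the sketch below (ours). The state is x = [p, p_dot], the input is u = [gamma, beta, alpha, f_T], and the sampling time Ts, mass m, and weights Q, R are assumed given.

```python
# Sketch of the forward-Euler discretization of the third-order Taylor model (14)
# and of the quadratic stage cost l(x, u) = x'Qx + u'Ru.
import numpy as np

def f_discrete(x, u, m, Ts, g=9.81):
    x = np.asarray(x, dtype=float)
    p_dot = x[3:6]
    gamma, beta, alpha, f_T = u
    acc = np.array([
        (beta * f_T - 0.5 * m * g * alpha**2 * beta + f_T * alpha * gamma
         - m * g * beta**3 / 6 - 0.5 * m * g * beta * gamma**2) / m,
        (-gamma * f_T + f_T * alpha * beta + 0.5 * m * g * alpha**2 * gamma
         + m * g * gamma**3 / 6) / m,
        ((f_T - m * g) - 0.5 * f_T * beta**2 - 0.5 * f_T * gamma**2) / m,
    ])
    return x + Ts * np.concatenate([p_dot, acc])      # forward Euler step

def stage_cost(x, u, Q, R):
    return x @ Q @ x + u @ R @ u
```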

[Block diagram: Outer Controller → Inner Controller → Plant as per (13), with references $f_{T,\mathrm{ref}}$, $\vec{\psi}_{\mathrm{ref}}$, thrusts $f_{1},\dots,f_{4}$, and feedback of $\vec{\omega}$, $\vec{\psi}$, and $(\vec{p},\dot{\vec{p}})$.]
Figure 2: Cascaded control structure. The references $f_{T,\mathrm{ref}}$ and $\vec{\psi}_{\mathrm{ref}}$ are the inputs $u$ used in the model for synthesizing the outer controller. The inner loop consists of PID controllers that track these attitude and thrust references.

III-C Input Constraints

Figure 3: Visualization of the constraint on the tilt angle $\theta$.

We chose to constrain the tilt angle that the $z^{(I)}$ axis forms with the $z^{(B)}$ axis, denoted by $\theta$ and shown in Figure 3. The motivation behind this choice is linked to the nonlinearities in the dynamics. We choose $\theta$ large enough so that the nonlinearities are relevant for the solution of the optimal control problem, but small enough so that the third order Taylor expansion (14) is an adequate approximation of the true dynamics (13a). The angle constraint is given by

$\cos(\theta)\leq\cos(\beta)\cos(\gamma),$ (15)

which is nonlinear in the inputs $\gamma$ and $\beta$, similar to the constraint introduced in [15].

To represent the physical actuator limits of the propellers, we impose a lower and an upper bound on the total thrust $f_{T}$. Denoting the lower and upper bounds as $f_{T,lb}$ and $f_{T,ub}$, respectively, the constraint is thus,

$f_{T,lb}\leq f_{T}\leq f_{T,ub}.$ (16)

In addition to the dynamics, the constraints also need to be in polynomial form in order to apply the SOS techniques from Section II-B. We chose quadratic forms for the constraints (15) and (16). The angle constraint (15) is approximated via a least-squares fit to the closest ellipse with radii $a_{1}$ and $a_{2}$, and the total thrust constraint (16) can be transformed into a single second order polynomial to reduce the number of SOS multipliers. The constraint functions are thus given by

$\begin{aligned}
1-\frac{\gamma^{2}}{a_{1}^{2}}-\frac{\beta^{2}}{a_{2}^{2}}=:g_{\theta}(u) &\geq 0, && \text{(17a)}\\
f_{T}\left(f_{T,lb}+f_{T,ub}\right)-f_{T,lb}f_{T,ub}-f_{T}^{2}=:g_{f}(u) &\geq 0. && \text{(17b)}
\end{aligned}$

Note that $g_{f}(u)$ and $g_{\theta}(u)$ correspond to $g_{i}(x,u)$ in (9).
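The paper does not detail the least-squares fit of the ellipse in (17a); one possible way (ours) to obtain the radii $a_{1}$ and $a_{2}$, under the assumption that the boundary of (15) is sampled on a grid of roll angles, is sketched below.

```python
# Sketch of fitting the ellipse gamma^2/a1^2 + beta^2/a2^2 = 1 to the boundary
# of the tilt constraint cos(theta_max) = cos(beta)cos(gamma) in a least-squares
# sense. The sampling strategy is an illustrative choice, not the paper's.
import numpy as np

def fit_tilt_ellipse(theta_max, n=200):
    gam = np.linspace(-theta_max * 0.999, theta_max * 0.999, n)
    bet = np.arccos(np.clip(np.cos(theta_max) / np.cos(gam), -1.0, 1.0))
    A = np.column_stack([gam**2, bet**2])              # unknowns: [1/a1^2, 1/a2^2]
    w, *_ = np.linalg.lstsq(A, np.ones(n), rcond=None)
    a1, a2 = 1.0 / np.sqrt(w)
    return a1, a2

# e.g. theta_max = 45 degrees, as used in Section IV-A
a1, a2 = fit_tilt_ellipse(np.deg2rad(45.0))
```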

III-D Value Function Fitting

The main tuning parameter is the state relevance weighting $c(x)$, for which heuristics have been developed. As presented in [12], to alleviate the sensitivity to this choice, we select a family of $c_{i}(x)\sim\mathcal{N}(0,\Sigma_{i})$ as Gaussians with zero mean and different covariance matrices. Let $\mathrm{diag}(v)$ define a diagonal matrix with the entries of vector $v$ on its diagonal. Then, with $\Sigma_{0}=\mathrm{diag}([0.1,0.1,0.1,1,1,1])$, the set of covariance matrices $\Sigma_{i}$ is obtained as,

$\Sigma_{i}=K_{i}\,\Sigma_{0},\quad i=1,\dots,9,\quad\text{with}\quad K_{i}\in\{0.01,\,0.1,\,0.3,\,0.5,\,0.7,\,1,\,1.3,\,1.5,\,2\}.$ (18)

For each $c_{i}(x)$ in (18), we perform the iterative procedure described in Section II-C. As shown in Figure 4, the relative cost improvement falls below a threshold of $10^{-5}$ within 9 iterations.

Figure 4: Convergence of the objective function value of (10), $\int\hat{V}^{*}_{j}(x)c_{4}(x)dx$, over iterations $j$, for state relevance weighting $c_{4}$ in (18).

In Figure 5 (a), a cut of the value function approximation is shown for different iterations with the same $c_{i}(x)$. The cut is obtained by setting all states to zero except $\dot{p}_{z}$. Solving this iterative problem for the different $c_{i}(x)$ results in a family of value function approximations, $\hat{V}_{c_{i}}^{*}(x)$, $i=1,\dots,N_{c}$. As presented in Section II-D, the point-wise maximum (PWM) of this family, $\hat{V}^{*}_{PWM}(x):=\max_{i=1,\dots,N_{c}}\hat{V}^{*}_{c_{i}}(x)$, is the value function approximation used for the online greedy policy (12). Figure 5 (b) shows a cut in the $\dot{p}_{z}$ state of the value function approximations $\hat{V}_{c_{i}}^{*}(x)$ and of $\hat{V}^{*}_{PWM}(x)$.

Figure 5: Cuts along $\dot{p}_{z}$. (a): the approximate value functions $\hat{V}_{j}^{*}(x)$ for iterations $j=1,\dots,9$ of (10) with $c_{4}(x)$. (b): the value function approximations $\hat{V}_{c_{i}}^{*}(x)$ for the $c_{i}(x)$ in (18) and their PWM $\hat{V}_{PWM}^{*}(x)$.

IV Numerical Results

In this section, in order to evaluate the control performance of the presented ADP approach, we compare different control schemes, which are explained in detail in the following, with an overview given in Table I. The names consist of three parts: the first part encodes the approximation of the system dynamics used in the online policy, the middle part refers to the type of policy, and the last part indicates the terminal cost or value function used. The control task is a setpoint change of 2 meters in the $p_{x}$-direction, which highlights the performance benefits of the control methods using the ADP value function compared to the other control schemes. First, results from simulations are given and then, experimental results on the quadcopter are presented.

TABLE I: Control policies compared in Section IV.
Name | dynamics approximation | policy | terminal cost
NL-GP-ADP | 3rd order Taylor as in (14) | Greedy policy (12) | $\hat{V}^{\ast}_{PWM}$, §III-D
LTV-MPC-LQR | LTV as in (20) | MPC | $P$ from Riccati
LTI-MPC-LQR | linearized around hover | MPC | $P$ from Riccati
LTV-MPC-ADP | LTV as in (21) | MPC | $\hat{V}^{\ast}_{PWM}$, §III-D

IV-A Simulation Results

We consider the model of a small-sized quadcopter, the Crazyflie 2.0 [21], with a mass of 27 g. Details about the model parameters such as its inertia and dimensions can be found in [22], [23]. For the simulation, the full state dynamics of the quadcopter as in (13) are used, together with the control structure shown in Figure 2. The inner loop tracks the commanded reference Euler angles $\vec{\psi}_{\mathrm{ref}}$ with a PID controller running at 500 Hz, with the controller parameters and implementation details found in [23]. The outer control loop, and hence all the controllers listed in Table I, runs at 50 Hz.

In the ADP approach, the matrices for the stage cost $l(x,u)=x^{\top}Qx+u^{\top}Ru$ have been chosen as,

$Q=\mathrm{diag}([50,\,50,\,50,\,5,\,5,\,5]),\quad R=\mathrm{diag}([20,\,20,\,100,\,1500]),$ (19)

and the discount factor is set to $\xi=0.98$. The total thrust is constrained to $f_{T}\in[0\,\mathrm{N},\,0.56\,\mathrm{N}]$. For the angle constraint in (15) we have chosen $\theta=45^{\circ}$, which is large enough to demonstrate that the controller takes into account the nonlinearities of the system, but small enough for the constraint to become active.

At each time step, the online policy in (12) is solved and the optimal input is applied to the system. As the term $\hat{V}_{c_{i}}^{*}(f(x,u))$ is a composition of a quadratic and a third order polynomial, the resulting optimization problem is not convex. We refer to this as NL-GP-ADP, see Table I, and we solve it in simulation with an interior point method.

In the following, we compare NL-GP-ADP to a linear time-varying (LTV) MPC [24], which also accounts for the nonlinear dynamics in its predictions. At each time step, the dynamics (13a) are linearized around the predicted trajectory, $x_{\mathrm{traj},k}$, $u_{\mathrm{traj},k}$, and the linearization is denoted by $A_{k},B_{k},g_{k}$. Then, the following problem is solved,

$\begin{aligned}
\underset{\{u_{k}\}_{k=0}^{N-1}}{\text{minimize}}\quad & \sum_{k=0}^{N-1}\lVert x_{k}-x_{\mathrm{traj},k}\rVert_{Q}^{2}+\lVert u_{k}-u_{\mathrm{traj},k}\rVert_{R}^{2}+\lVert x_{N}-x_{\mathrm{traj},N}\rVert_{P}^{2} && \text{(20)}\\
\text{subject to}\quad & x_{k+1}=A_{k}x_{k}+B_{k}u_{k}+g_{k},\\
& u_{k}\in\mathcal{U},
\end{aligned}$

where the set $\mathcal{U}$ is defined as in (17), making this problem a convex Quadratically Constrained Quadratic Program (QCQP). Note that the $x_{k}$ and $u_{k}$ variables are deviations from the predicted trajectory. The cost matrices $Q$ and $R$ are the same as in (19), and $P$ comes from the solution of the Riccati equation using the linearization of the dynamics (13a) around the hover state $[\vec{p}^{\,\top},\dot{\vec{p}}^{\,\top}]^{\top}=0$. The first input, $u_{0}^{*}$, is applied, and the system evolves by one time step. The predicted trajectory for the next iteration, i.e., $x_{\mathrm{traj},k}$, $u_{\mathrm{traj},k}$, is based on the state measurement and the inputs $u_{1}^{*},\dots,u_{N-1}^{*}$. We refer to this as LTV-MPC-LQR of horizon length $N$, see Table I.
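A sketch (ours) of one LTV-MPC iteration (20) in CVXPY is given below. Following the text, the decision variables are deviations from the predicted trajectory, the linearization A_k, B_k, g_k is assumed precomputed, and the input constraints (17) are imposed on the absolute inputs u_traj[k] + du[k]; the function and variable names are our own.

```python
# Sketch of the convex QCQP (20) with deviation variables (dx, du).
import cvxpy as cp

def ltv_mpc_step(dx0, u_traj, A_list, B_list, g_list, Q, R, P, a1, a2, f_lb, f_ub):
    N, nx, nu = len(A_list), Q.shape[0], R.shape[0]
    dx = cp.Variable((N + 1, nx))
    du = cp.Variable((N, nu))                   # input order: [gamma, beta, alpha, f_T]

    cost = cp.quad_form(dx[N], P)               # terminal cost ||x_N - x_traj,N||_P^2
    cons = [dx[0] == dx0]
    for k in range(N):
        cost += cp.quad_form(dx[k], Q) + cp.quad_form(du[k], R)
        cons += [dx[k + 1] == A_list[k] @ dx[k] + B_list[k] @ du[k] + g_list[k]]
        u_abs = u_traj[k] + du[k]
        cons += [cp.square(u_abs[0] / a1) + cp.square(u_abs[1] / a2) <= 1]            # (17a)
        cons += [cp.square(u_abs[3]) - (f_lb + f_ub) * u_abs[3] + f_lb * f_ub <= 0]   # (17b)
    cp.Problem(cp.Minimize(cost), cons).solve()
    return u_traj[0] + du.value[0]              # apply the first (absolute) input
```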

Furthermore, we compare the NL-GP-ADP policy to a controller that does not account for the nonlinear dynamics. For this we use a linear time-invariant (LTI) MPC, i.e., for the dynamics we use the model linearized around hover and the same $Q$, $R$, and $P$ cost and Riccati solution matrices as in (20). Consistent with the naming convention, this policy is referred to as LTI-MPC-LQR of horizon $N=1$, and the policy is also a convex QCQP. This policy is similar to the ADP approach demonstrated on miniature helicopters in [10], and for the control task we consider, the performance was almost identical. Thus, for the sake of clarity, we do not include a policy based on [10] in the comparison here.

A comparison between the three policies is shown in Figure 6. The main difference between the approaches is the drop in $p_{z}$, due to the fact that NL-GP-ADP and LTV-MPC-LQR have more information about the nonlinearities of the system than LTI-MPC-LQR, which uses the dynamics linearized around hover. The nonlinearity of interest for this test is the change in vertical thrust when the quadcopter tilts; as seen in Figure 6 (b), LTI-MPC-LQR corrects for the drop in $p_{z}$ only after it happens, whereas NL-GP-ADP and LTV-MPC-LQR pre-compensate for this drop by increasing the thrust in advance.

(a) Evolution of $p_{x}$ (starting from 2 meters) and $p_{z}$
(b) Evolution of $f_{T}$
Figure 6: Comparison of the behavior of the system in simulation for NL-GP-ADP, LTI-MPC-LQR with $N=1$, and LTV-MPC-LQR with different horizon lengths ($N=1,2,5,10,15,20$, from lighter to darker).

IV-B Experimental Results

For the experiments, the Crazyflie quadcopter is used together with a set of infrared cameras that provide the system with position and attitude measurements, [25]. Body frame angular rate measurements come from the on-board IMU. The control loops are the same as described in Section IV-A for the simulations.

Implementation of the NL-GP-ADP in real-time (i.e., at 50 Hz) was computationally intractable. To overcome this difficulty, the ADP approach is combined with the linear time-varying MPC scheme by replacing the terminal cost in (20) with the ADP value function approximation, leading to

$\begin{aligned}
\underset{\{u_{k}\}_{k=0}^{N-1}}{\text{minimize}}\quad & \sum_{k=0}^{N-1}\lVert x_{k}-x_{\mathrm{traj},k}\rVert_{Q}^{2}+\lVert u_{k}-u_{\mathrm{traj},k}\rVert_{R}^{2}+\hat{V}^{*}_{PWM}(x_{N}-x_{\mathrm{traj},N}) && \text{(21)}\\
\text{subject to}\quad & x_{k+1}=A_{k}x_{k}+B_{k}u_{k}+g_{k},\\
& u_{k}\in\mathcal{U}.
\end{aligned}$

We call this the LTV-MPC-ADP policy of horizon $N$. If we introduce an epigraph variable for the maximum of the quadratics $\hat{V}^{*}_{c_{i}}(x)$ in $\hat{V}^{*}_{PWM}$, as in [12], the resulting optimization problem is a QCQP, making problem (21) implementable in real-time. In particular, if in (21) we choose the horizon length to be 1, we obtain the linearized greedy policy, which differs from problem (12) in that the prediction of the next state is done using the dynamics linearized around the current state, instead of the nonlinear dynamics.
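For concreteness, the epigraph construction can look like the sketch below (ours). Each fitted $\hat{V}^{*}_{c_{i}}$ is assumed to be stored as a quadratic $(P_i, p_i, r_i)$ with $V_i(e)=e^{\top}P_i e + p_i^{\top}e + r_i$, and the returned variable t replaces $\hat{V}^{*}_{PWM}(x_N-x_{\mathrm{traj},N})$ in the objective of (21).

```python
# Sketch of the epigraph reformulation of the point-wise-maximum terminal cost.
import cvxpy as cp

def pwm_terminal_cost(e_N, quadratics):
    """e_N: terminal deviation expression; quadratics: list of (P_i, p_i, r_i)."""
    t = cp.Variable()
    cons = [t >= cp.quad_form(e_N, P) + p @ e_N + r for (P, p, r) in quadratics]
    return t, cons        # add t to the objective and cons to the constraint list
```

Each under-estimator contributes one quadratic constraint, which is why using fewer of them reduces the online solve time, as discussed with Figure 9 below.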

[Plots of the $p_{x}$-position, $p_{z}$-position, and thrust $f_{T}$ over time.]
Figure 7: Experimental comparison of different policies: LTV-MPC-ADP, LTV-MPC-LQR, and LTI-MPC-LQR, all with horizon length $N=1$.

In Figure 7, we show a comparison of the experimental results when using LTV-MPC-ADP, LTV-MPC-LQR and LTI-MPC-LQR, all with a horizon length of $N=1$. As can be seen in Figure 7, the initial thrust is higher for LTV-MPC-ADP, which suggests that it is able to predict the behavior of the system better than LTV-MPC-LQR. For the same horizon length, LTV-MPC-ADP reacts faster in correcting the drop in $p_{z}$, since its initial thrust input is larger. This suggests that using the ADP value function approximation as the terminal cost yields better predictions and, therefore, better performance. The experimental comparison of LTV-MPC-ADP and LTI-MPC-LQR can be seen in the video at [26]. For details on the implementation, we refer to [23].

Figure 8: Accumulated stage cost $\sum l(x_{k},u_{k})$ for different horizon lengths, for the policies LTV-MPC-ADP, LTV-MPC-LQR and LTI-MPC-LQR.

Figure 8 shows a series of simulations using the LTV-MPC-ADP, the LTV-MPC-LQR, and the LTI-MPC-LQR policies for horizon lengths from 1 up to 11. For every horizon length, the sum of stage costs, $\sum_{k}l(x_{k},u_{k})$, is computed until all states and inputs have converged, and then plotted as a measure of the performance of the policy. As the aim is to minimize (2), the smaller this cost is, the better the controller is performing. Figure 8 supports the results of the performance comparison given in Figure 6, as for short horizon lengths the cost of LTV-MPC-ADP is lower than that of LTV-MPC-LQR. It also shows that the quality of the approximation of the terminal cost loses importance as the horizon length increases.

(a) LTV-MPC-LQR
(b) LTV-MPC-ADP
Figure 9: Comparison of the solve times over the horizon length, reported by the Gurobi solver [27] when playing the respective policy online. The black dotted line shows the 50 Hz real-time limit.

Figure 9 shows that the maximum implementable horizon length for the LTV-MPC-ADP policy at 50 Hz is 4-5 steps, while the LTV-MPC-LQR policy is tractable for up to 15 steps. The online computation time of the LTV-MPC-ADP policy can be reduced by using fewer approximate value functions in $\hat{V}_{PWM}^{*}$, which introduces fewer quadratic constraints in (21). This generally introduces a trade-off with the online performance, because the approximation quality of the cost-to-go may be reduced. Moreover, reducing the computation time allows for a higher control frequency, which might improve the forward Euler approximation used for the discrete-time dynamics.

The offline computation required for the LTV-MPC-LQR and LTI-MPC-LQR policies consists of solving the Riccati equation, which takes less than one second for this model. The offline computation required to prepare $\hat{V}^{*}_{PWM}$ for the LTV-MPC-ADP policy takes less than 10 minutes on a standard desktop computer, using Yalmip [28] and Mosek [29] for building and solving the necessary SDPs.

V Conclusion and Future Work

We applied Approximate Dynamic Programming methods to a nonlinear, high dimensional quadcopter system with continuous state and input spaces. Using polynomial approximations of the dynamics, the constraints, and the value function leads to an online policy that captures the relevant nonlinearities of the system. For computational tractability, the ADP method was combined with an MPC scheme, which leverages the computational benefits of a short-horizon MPC and the long-term advantages of capturing the effects of the nonlinearities through the ADP value function. The latter was used as the terminal cost in a linear time-varying MPC. For short time horizons, this control method outperformed the same MPC with the terminal cost given by the solution of a Riccati equation. The performance of the resulting controller was successfully demonstrated both in simulations and in experiments on a nano-quadcopter.

In future work, the control performance could be further improved by allowing for higher order polynomial approximations of the dynamics, or of relevant parts of the dynamics while keeping the complexity of other parts low, in order to ensure computational tractability. Furthermore, formulating the dynamics in terms of quaternions instead of Euler angles is promising, as quaternion dynamics are naturally polynomial. However, more investigation is needed into how to effectively approximate a value function that is only defined on the quaternion manifold.

References

  • [1] R. E. Bellman, “On the theory of dynamic programming,” Proceedings of the National Academy of Sciences, vol. 38, no. 8, pp. 716–719, 1952.
  • [2] S. Yakowitz, “Dynamic programming applications in water resources,” Water Resources Research, vol. 18, no. 4, pp. 673–696, 1982.
  • [3] V. Gabillon, M. Ghavamzadeh, and B. Scherrer, “Approximate dynamic programming finally performs well in the game of tetris,” in Neural Information Processing Systems (NIPS), pp. 1–9, 2013.
  • [4] R. E. Bellman, “On the application of dynamic programming to the determination of optimal play in chess and checkers,” Proceedings of the National Academy of Sciences, vol. 53, no. 2, pp. 244–247, 1965.
  • [5] D. P. De Farias and B. Van Roy, “The linear programming approach to approximate dynamic programming,” Operations Research, vol. 51, no. 6, pp. 850–865, 2003.
  • [6] W. B. Powell and J. Ma, “A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications,” Journal of Control Theory and Applications, vol. 9, no. 3, pp. 336–352, 2011.
  • [7] J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Handbook of learning and approximate dynamic programming, vol. 2. John Wiley & Sons, 2004.
  • [8] Y. Wang, B. O’Donoghue, and S. Boyd, “Approximate dynamic programming via iterated Bellman inequalities,” International Journal of Robust and Nonlinear Control, vol. 25, no. 10, pp. 1472–1496, 2015.
  • [9] B. Stellato, T. Geyer, and P. J. Goulart, “High-speed finite control set model predictive control for power electronics,” IEEE Transactions on Power Electronics, vol. 32, no. 5, pp. 4007 – 4020, 2017.
  • [10] T. Summers, K. Kunz, N. Kariotoglou, M. Kamgarpour, S. Summers, and J. Lygeros, “Approximate dynamic programming via sum of squares programming,” in IEEE European Control Conference (ECC), pp. 191–197, 2013.
  • [11] P. N. Beuchat, J. Warrington, and J. Lygeros, “Point-wise maximum approach to approximate dynamic programming,” in Conference on Decision and Control (CDC), pp. 3694–3701, 2017.
  • [12] P. Beuchat, A. Georghiou, and J. Lygeros, “Alleviating tuning sensitivity in approximate dynamic programming,” in European Control Conference (ECC), pp. 1616–1622, 2016.
  • [13] M. Hohmann, J. Warrington, and J. Lygeros, “A moment and sum-of-squares extension of dual dynamic programming with application to nonlinear energy storage problems,” preprint: arXiv:1807.05947, 2018.
  • [14] C. F. Liew, D. DeLatte, N. Takeishi, and T. Yairi, “Recent developments in aerial robotics: A survey and prototypes overview,” preprint arXiv:1711.10085, 2017.
  • [15] M. Kamel, K. Alexis, M. Achtelik, and R. Siegwart, “Fast nonlinear model predictive control for multicopter attitude tracking on SO(3),” in Conference on Control Applications (CCA), pp. 1160–1166, 2015.
  • [16] M. Bangura and R. Mahony, “Real-time model predictive control for quadrotors,” 19th IFAC World Congress, vol. 47, no. 3, pp. 11773 – 11780, 2014.
  • [17] O. Hernández-Lerma and J. B. Lasserre, Discrete-time Markov control processes: basic optimality criteria, vol. 30. Springer Science and Business Media, 2012.
  • [18] D. Bertsekas, Dynamic programming and optimal control, vol. 2. Athena Scientific, 4 ed., 2012.
  • [19] P. Parrilo, “Semidefinite programming relaxations for semialgebraic problems,” Mathematical Programming, vol. 96, no. 2, pp. 293–320, 2003.
  • [20] R. Mahony, V. Kumar, and P. Corke, “Multirotor aerial vehicles: Modeling, estimation, and control of quadrotor,” IEEE Robotics & Automation Magazine, vol. 19, pp. 20–32, 2012.
  • [21] “Crazyflie 2.0,” 2018. https://www.bitcraze.io/crazyflie-2/.
  • [22] A. Romero Aguilar, “Non-linear control of quadcopters via approximate dynamic programming,” 2018. https://doi.org/10.3929/ethz-b-000272622.
  • [23] Distributed Flying and Localization Lab, “Open-source code D-FaLL-System,” 2018. https://gitlab.ethz.ch/D-FaLL/PandS-System/D-FaLL-System.
  • [24] M. Diehl, H. Bock, and J. Schlöder, “A real-time iteration scheme for nonlinear optimization in optimal feedback control,” SIAM Journal on Control and Optimization, vol. 43, no. 5, pp. 1714–1736, 2005.
  • [25] “Motion capture system Vicon,” 2018. https://www.vicon.com.
  • [26] Video of experimental results, “Comparison of LTV-MPC-ADP and LTI-MPC-LQR.,” 2018. https://polybox.ethz.ch/index.php/s/XZ4nhMlaL9q6vrd.
  • [27] Gurobi Optimization, LLC, Gurobi Optimizer Reference Manual, 2018. http://www.gurobi.com.
  • [28] J. Löfberg, “YALMIP: A toolbox for modeling and optimization in MATLAB,” in International Symposium on Computer Aided Control Systems Design, pp. 284–289, IEEE, 2004.
  • [29] MOSEK ApS, The MOSEK optimization toolbox for MATLAB manual. Version 8.1, 2017. http://docs.mosek.com/8.1/toolbox/index.html.