
Nonlinear Control of Quadcopters via Approximate Dynamic Programming

Angel Romero, Paul N. Beuchat, Yvonne R. Stürz, Roy S. Smith, and John Lygeros. All authors are with the Automatic Control Laboratory at the Swiss Federal Institute of Technology in Zurich (ETHZ), Physikstrasse 3, 8092 Zurich, Switzerland. Email addresses: {beuchatp, stuerzy, rsmith, jlygeros}@ethz.ch
Abstract

While Approximate Dynamic Programming has successfully been used in many applications involving discrete states and inputs, such as playing the games of Tetris or chess, it has rarely been used in applications with continuous state and input spaces. In this paper, we combine Approximate Dynamic Programming techniques and apply them to the continuous, nonlinear and high dimensional dynamics of a quadcopter vehicle. We use a polynomial approximation of the dynamics and sum-of-squares programming techniques to compute a family of polynomial value function approximations for different tuning parameters. The resulting approximations to the optimal value function are combined in a point-wise maximum approach, which is used to compute the online policy. The success of the method is demonstrated in both simulations and experiments on a quadcopter. The control performance is compared to a linear time-varying Model Predictive Controller. The two methods are then combined to retain both the computational benefits of a short-horizon MPC and the long-term performance benefits of using the Approximate Dynamic Programming value function as the terminal cost.

I INTRODUCTION

Dynamic Programming (DP), [1], is an important method for optimal decision making, relevant in a broad range of applications including finance, health sciences and engineering. While DP has been successfully applied to many specific problems [2], it suffers from severe limitations when dealing with high dimensional systems. Approximate Dynamic Programming (ADP) aims to alleviate these limitations by finding tractable approximations to the optimal policies that result from DP. ADP methods have proven to work well for certain practical applications such as the game of Tetris [3], or the game of chess [4]. However, despite considerable progress in the theory and algorithms of ADP, [5, 6, 7], there have not been many successful applications of ADP to systems with continuous state and input spaces.

In [8], the authors suggest the so-called iterated Bellman inequality as a novel way of obtaining approximations to the value function by solving convex optimization problems. The approach was validated by applying it to the linear model of a simple mechanical system. In [9] the iterated Bellman inequality method is used to approximate the tail cost of a Model Predictive Control (MPC) problem for a power inverter with a continuous state space and a finite input space. This effectively allows for a shorter time horizon, and thus a reduced computational burden. In [10], the authors generalize the problem formulation of [8] by considering polynomials as the chosen function space, and then use sum-of-squares (SOS) techniques to relax the problem to a semi-definite program (SDP). This method is implemented and applied to control a miniature helicopter using a simplified linear model around hover.

The combination of ADP methods we apply to quadcopters is most closely connected with the method proposed in [11]. There the authors introduce a novel way of computing successively tighter under-estimators to the optimal value function in an iterative fashion. This method has proven to be computationally tractable for the nonlinear quadcopter model considered in this paper, in contrast to [8]. In [12], a family of objective functions is chosen instead of only one, with a value function approximation computed for each and then put together in a point-wise maximum (PWM) approach. This reduces the dependency of the result on the choice of the individual objectives. Parts of these techniques are used in [13] and demonstrated in simulation on a nonlinear energy storage system with a low dimensional state-by-input space. In that application, the slow time scale allows for computationally demanding online policies.

Due to their inherent instability and ease of design, unmanned aerial vehicles (UAVs) have been widely used as a test bench for control and robotics research [14]. In [15], nonlinear MPC is used to control the fast attitude dynamics of a hexacopter vehicle, and a Linear Quadratic Regulator (LQR) is used for the slower position dynamics. A nonlinear constraint on the tilt angle is considered. The maneuverability and tracking performance of the system are shown in simulation and experiment while tracking setpoints on a square. Similarly, in [16], a real-time iteration linear MPC for the control of the position dynamics is run on-board a quadcopter.

In this paper, we combine multiple ADP methods to design a controller for the continuous, nonlinear, high dimensional model of a quadcopter and demonstrate the success of the methods in simulations and experiments. The key contributions of this work are:

  • We combine the ADP techniques from [10], [11] and [12], leveraging the strengths of each.

  • We develop the necessary approximation steps for synthesizing an ADP controller by this combined method for a high-dimensional quadcopter model.

  • We demonstrate the benefits of this method compared to other linear and nonlinear control techniques in simulations and experiments.

The paper is organized as follows: In Section II, we present the deterministic control problem and introduce the ADP formulation. In Section III, we derive the nonlinear model of the quadcopter, transform it to fit our ADP formulation, and then use it to derive the approximation to the optimal value function. Section IV presents the results obtained from simulations and experiments, and Section V concludes the paper.

II Approximate Dynamic Programming Formulation

In this section, we first state the optimal control problem to be solved, and then we present the techniques used for computing approximate solutions, which will be applied to a quadcopter in the next section.

II-A Optimal Control Problem

We consider the discrete-time dynamical system of a quadcopter with continuous state and input spaces and aim to minimize a discounted cost objective. The specific states, inputs and dynamics of the quadcopter will be given in Section III. For now, let us denote the state of the system at time $k$ as $x_{k}\in\mathbb{R}^{n_{x}}$, and the control input as $u_{k}\in\mathbb{R}^{n_{u}}$. The evolution of the system is described by the dynamics function $f:\mathcal{X}\times\mathcal{U}\to\mathcal{X}$ such that

$x_{k+1}=f(x_{k},u_{k}).$ (1)

The stage cost function is the non-negative cost of taking decision $u_{k}$ when being in state $x_{k}$, denoted as $l:\mathcal{X}\times\mathcal{U}\rightarrow\mathbb{R}_{+}$. Given a discount factor $\xi\in[0,1)$ and a stationary control policy of the form $\pi:\mathcal{X}\rightarrow\mathcal{U}$, the discounted infinite horizon cost is thus,

$J_{\pi}(x)=\sum_{k=0}^{\infty}\xi^{k}\,l(x_{k},\pi(x_{k})),\quad\text{with } x_{0}=x.$ (2)

The aim is to find the policy that minimizes (2) for all $x\in\mathcal{X}$. The solution is characterized by the optimal value function $V^{*}:\mathcal{X}\to\mathbb{R}$ that satisfies the Bellman equation,

$V^{*}(x)=\underbrace{\min_{u\in\mathcal{U}}\; l(x,u)+\xi\,V^{*}(f(x,u))}_{\mathcal{T}V^{*}},\quad\forall x\in\mathcal{X},$ (3)

where $\mathcal{T}$ is the Bellman operator. Given a solution of (3), the optimal policy is the so-called greedy policy,

$\pi^{*}(x)=\operatorname*{arg\,min}_{u\in\mathcal{U}}\; l(x,u)+\xi\,V^{*}(f(x,u)),$ (4)

which minimizes the cost of taking the decision $u$ now, plus the cost-to-go from the next time period onwards. We require the standard assumptions on the stage cost, dynamics, and existence of feasible policies to ensure that $V^{*}$ and $\pi^{*}$ exist, are time-invariant, and attain the minimum, see for example [17, Assumptions 4.2.1, 4.2.2] or [18, §1.2]. For a general problem instance, solving (3) and implementing (4) is intractable.
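To make the problem setup concrete, the following is a minimal sketch (ours, not from the paper) of evaluating the discounted cost (2) of a given policy by forward simulation. The dynamics f, stage cost l, and policy pi are assumed to be user-supplied callables, and the infinite sum is truncated once the discount factor renders the tail negligible.

```python
# A minimal sketch of evaluating the discounted cost (2) by forward simulation.
# f, l, pi are placeholder callables supplied by the user; xi = 0.98 mirrors the
# discount factor used later in Section IV-A.
import numpy as np

def discounted_cost(f, l, pi, x0, xi=0.98, tol=1e-8, max_steps=10_000):
    """Approximate J_pi(x0) = sum_k xi^k * l(x_k, pi(x_k)) by truncation."""
    x, cost, discount = np.asarray(x0, dtype=float), 0.0, 1.0
    for _ in range(max_steps):
        u = pi(x)
        cost += discount * l(x, u)
        x = f(x, u)
        discount *= xi
        if discount < tol:   # remaining tail is negligible for bounded stage costs
            break
    return cost
```

A similar (undiscounted) roll-out of the closed loop is what underlies the cost comparison of the different policies in Section IV (Figure 8).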

II-B Linear Programming Approach to ADP with Polynomials

The monotone and contractive properties of the Bellman operator mean that any function $\hat{V}:\mathcal{X}\to\mathbb{R}$ satisfying the so-called Bellman inequality is a point-wise under-estimator of $V^{*}$, i.e.,

$\hat{V}(x)\leq\mathcal{T}\hat{V}(x),\,\forall x\in\mathcal{X}\;\Rightarrow\;\hat{V}(x)\leq V^{*}(x),\,\forall x\in\mathcal{X},$ (5)

where $\hat{V}$ is referred to as an approximate value function. This is the key property that motivates the various LP approaches to ADP proposed in [5, 8, 10, 11, 13]. Letting $\mathcal{F}(\mathcal{X})$ denote a space of functions mapping $\mathcal{X}\rightarrow\mathbb{R}$, and using the Bellman inequality, a point-wise under-estimator is found by solving the following optimization problem,

$\begin{aligned}
\underset{\hat{V}}{\text{maximize}}\quad & \int\hat{V}(x)\,c(x)\,dx && \text{(6a)}\\
\text{subject to}\quad & \hat{V}\in\mathcal{F}(\mathcal{X}) && \text{(6b)}\\
& \hat{V}(x)\leq\mathcal{T}\hat{V}(x),\;\forall x\in\mathcal{X}, && \text{(6c)}
\end{aligned}$

where the state relevance weighting function $c(x)$ is a finite measure on the state space; it allows the practitioner to select the region where a better approximation is desired. When $\mathcal{F}(\mathcal{X})$ is the space of real-valued measurable functions on $\mathcal{X}$, it is proven in [17] that the solution to (6) solves the Bellman equation for $c$-almost all $x\in\mathcal{X}$; however, the infinite-dimensional decision variable makes the problem intractable.

In this paper we use the space of polynomials for the function space $\mathcal{F}(\mathcal{X})$. This choice is motivated by [10], where the authors show that problem (6) can be reformulated as a finite SDP when the dynamics, stage costs, and spaces are described by polynomials. Letting $\mathcal{P}_{d}(\mathcal{X})$ denote the space of polynomials up to degree $d$, we replace (6b) by $\hat{V}\in\mathcal{P}_{d}(\mathcal{X})$, and hence the decision variables are the coefficients of the monomials up to degree $d$. Considering the minimization in the Bellman operator $\mathcal{T}$, the infinitely many constraints (6c) can be equivalently written as,

$0\leq\underbrace{-\hat{V}(x)+l(x,u)+\xi\,\hat{V}(f(x,u))}_{b(x,u)},\quad\forall x\in\mathcal{X},\,u\in\mathcal{U},$ (7)

where $b(x,u)$ is a polynomial when $l$ and $f$ are polynomials.

We now use the SOS S-Procedure [19] to reformulate (7) as a single Linear Matrix Inequality (LMI) constraint on the decision variable $\hat{V}\in\mathcal{P}_{d}(\mathcal{X})$. Letting $SOS$ denote the set of polynomials that are representable as a sum of polynomials squared, then from [19, Theorem 3.3] we have the following equivalence for certifying that a polynomial $v\in\mathcal{P}_{2d}(\mathcal{X})$ is an SOS polynomial,

$v(x)\in SOS\;\Leftrightarrow\;\exists M\succeq 0\;\text{s.t.}\;v(x)=z(x)^{T}Mz(x),$ (8)

where $z(x)$ is the vector of monomials of $x\in\mathbb{R}^{n_{x}}$ up to degree $d$, and $M$ is a square matrix of size $\binom{n_{x}+d}{d}$. Letting $g_{i}(x,u)\geq 0$ denote the polynomials that describe the state and input spaces ($\mathcal{X}$ and $\mathcal{U}$), then the following set of equations,

$\begin{aligned}
b(x,u)-\lambda(x,u)\sum\nolimits_{i}g_{i}(x,u) &\,\in\, SOS, && \text{(9a)}\\
\lambda(x,u) &\,\in\, SOS, && \text{(9b)}
\end{aligned}$

is a sufficient reformulation of (7) in the sense that (9) $\Rightarrow$ (7). Replacing (6c) with (9), the multiplier $\lambda\in\mathcal{P}_{d}(\mathcal{X}\times\mathcal{U})$ is an additional decision variable, with its polynomial degree as a design choice. It is reasonable to choose the largest multiplier degree for which the LMI constraints stemming from (9) are computationally tractable.

Finally, the objective (6a) is linear in the coefficients of $\hat{V}\in\mathcal{P}_{d}(\mathcal{X})$ and requires knowledge of the moments of the state relevance weighting function $c(x)$ up to order $d$. Thus, a point-wise under-estimator of $V^{*}$ can be found using a commercial solver for the SDP relaxation of (6).
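As an illustration of the certificate (8), the sketch below checks whether a given univariate quartic is SOS by searching for a PSD Gram matrix whose entries match the polynomial coefficients. This is our own toy example in CVXPY; the paper builds the corresponding, much larger, SDPs with Yalmip and Mosek as noted in Section IV.

```python
# A minimal sketch of the SOS certificate (8) for a univariate quartic.
# The Gram matrix M is constrained to be PSD and matched coefficient-wise to
# v(x) over the monomial vector z(x) = [1, x, x^2].
import cvxpy as cp

# v(x) = x^4 - 2x^3 + 3x^2 - 2x + 1 = (x^2 - x + 1)^2, coefficients c[k] of x^k
c = [1.0, -2.0, 3.0, -2.0, 1.0]

M = cp.Variable((3, 3), PSD=True)            # Gram matrix for z(x) = [1, x, x^2]
constraints = [
    M[0, 0] == c[0],                         # constant term
    2 * M[0, 1] == c[1],                     # x
    2 * M[0, 2] + M[1, 1] == c[2],           # x^2
    2 * M[1, 2] == c[3],                     # x^3
    M[2, 2] == c[4],                         # x^4
]
prob = cp.Problem(cp.Minimize(0), constraints)
prob.solve()
print("SOS certificate found" if prob.status == cp.OPTIMAL else "no certificate found")
```

The same coefficient-matching mechanism, applied to $b(x,u)$ and the multiplier $\lambda(x,u)$ in (9), is what turns the Bellman inequality into LMI constraints on the coefficients of $\hat{V}$.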

II-C Point-Wise Maximum Approach to ADP

To improve the quality of the approximate value function, we use the approach proposed in [11] that solves a sequence of optimization problems, each with constraints of the same size as (9).

First, we solve the SDP relaxation of (6) for a particular choice of the state relevance weighting, and denote the solution as $\hat{V}_{1}^{*}$. Then we solve the SDP relaxation of the following optimization problem with $j=2$,

$\begin{aligned}
\underset{\hat{V}_{j}}{\text{maximize}}\quad & \int\hat{V}_{j}(x)\,c(x)\,dx && \text{(10a)}\\
\text{subject to}\quad & \hat{V}_{j}\in\mathcal{P}_{d}(\mathcal{X}), && \text{(10b)}\\
& \hat{V}_{j}(x)\leq\Big(\mathcal{T}\max_{k=1,\dots,j-1}\hat{V}^{*}_{k}\Big)(x),\quad\forall x\in\mathcal{X}, && \text{(10c)}
\end{aligned}$

where $\hat{V}_{1}^{*}$ is fixed problem data, and we denote the solution as $\hat{V}_{2}^{*}$. We then solve the SDP relaxation of (10) iteratively for $j\geq 3$, each time storing the solution $\hat{V}_{j}^{*}$ as fixed data to be used in the next iteration. As the cost function (10a) is non-decreasing with the iterations of (10), we terminate when the improvement in the cost function becomes less than a pre-specified threshold. The approximate value function for a particular choice of $c(x)$ is set as the solution of the final iteration.

The steps given in [11] show how to reformulate (10c) as a polynomial inequality constraint similar to (7). The SOS S-Procedure is then applied and the resulting relaxation involves one LMI constraint of the same size as (9a), and $j-1$ LMI constraints identical to (9b).
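The overall structure of the iteration can be summarized as in the skeleton below (ours). Here, solve_sdp_relaxation is a hypothetical placeholder standing in for the SOS/SDP machinery of Section II-B; it is assumed to return the new under-estimator and the attained objective value for the given weighting c and the previously fixed solutions.

```python
# Structural sketch of the iteration of Section II-C. solve_sdp_relaxation is a
# placeholder for the SDP relaxation of (10), returning (V_j, objective value).
def fit_value_function(solve_sdp_relaxation, c, tol=1e-5, max_iters=20):
    V_list, prev_obj = [], -float("inf")
    for j in range(max_iters):
        V_j, obj = solve_sdp_relaxation(c, fixed=V_list)      # SDP relaxation of (10)
        V_list.append(V_j)
        if obj - prev_obj < tol * max(abs(prev_obj), 1.0):    # relative improvement threshold
            break
        prev_obj = obj
    return V_list[-1]   # approximate value function for this choice of c
```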

II-D Online Policy

To construct an online policy, we first construct our best approximation to the value function, and we use this as a surrogate for $V^{*}$ in the greedy policy (4). For different choices of the state relevance weighting, denoted $c_{i}(x)$ for $i=1,\dots,N_{c}$, we perform the iteration described in Section II-C and denote $\hat{V}_{c_{i}}^{*}$ as the value function estimate returned by the final iteration. We take the PWM of these under-estimators as our best estimate for $V^{*}$, i.e.,

$\max_{i=1,\dots,N_{c}}\hat{V}_{c_{i}}^{*}(x)\;\leq\;V^{*}(x),\quad\forall x\in\mathcal{X}.$ (11)

The online policy, also called the approximate greedy policy, is thus,

$\hat{\pi}(x)=\operatorname*{arg\,min}_{u\in\mathcal{U}}\; l(x,u)+\xi\,\max_{i=1,\dots,N_{c}}\hat{V}_{c_{i}}^{*}(f(x,u)),$ (12)

where the maximum over the $\hat{V}_{c_{i}}^{*}$ is readily reformulated using an epigraph variable.

To construct the surrogate for $V^{*}$, it remains to choose the $c_{i}$ weightings; in Section IV we provide the weightings used for nonlinear control of a quadcopter. The surrogate for $V^{*}$ can be considered a sufficiently accurate estimate if the online performance of (12) achieves a low cost.
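For illustration, a minimal sketch (ours) of evaluating the approximate greedy policy (12) at a given state is shown below. The dynamics f, stage cost l, the fitted value functions in V_list, and the input-space constraints g_i(u) >= 0 from Section III-C are assumed to be available as callables, and SciPy's SLSQP stands in for the interior point solver used in the simulations of Section IV. Note that the point-wise maximum makes the objective nonsmooth; in an actual implementation the epigraph reformulation mentioned above is preferable.

```python
# Sketch of the approximate greedy policy (12) as a nonlinear program over u.
import numpy as np
from scipy.optimize import minimize

def greedy_policy(x, f, l, V_list, g_list, u0, xi=0.98):
    def objective(u):
        x_next = f(x, u)
        return l(x, u) + xi * max(V(x_next) for V in V_list)   # point-wise maximum (11)

    cons = [{"type": "ineq", "fun": g} for g in g_list]         # g_i(u) >= 0
    res = minimize(objective, u0, method="SLSQP", constraints=cons)
    return res.x
```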

III Control of a Quadcopter

In the following, the theory from the previous section will be used for the control of a quadcopter vehicle. The dynamics are brought into a suitable form by applying a reduction in the state space and a polynomial approximation.

III-A Dynamical Model of the Quadcopter

To describe the dynamics of the quadcopter, two frames of reference are used: the inertial (world) frame and the body frame (attached to the quadcopter). The location of the body frame with respect to the inertial frame is denoted as $\vec{p}=[p_{x},p_{y},p_{z}]^{\top}$ and shown in Figure 1. The angular rates about the body frame axes $x^{(B)}$, $y^{(B)}$ and $z^{(B)}$ are denoted $\vec{\omega}=[\omega_{x},\omega_{y},\omega_{z}]^{\top}$. The orientation of the body frame relative to the inertial frame is given by $\vec{\psi}=[\gamma,\beta,\alpha]^{\top}$, the roll, pitch, and yaw intrinsic Euler angles, respectively. Using the ZYX convention for the Euler angles, and with $s_{(\cdot)}:=\sin(\cdot)$, $c_{(\cdot)}:=\cos(\cdot)$, the equations of motion for a quadcopter of mass $m$ and with moment of inertia $J$ are readily derived as,

$\begin{aligned}
\ddot{\vec{p}}^{(I)} &= \frac{1}{m}\sum_{i=1}^{4}f_{i}\begin{bmatrix}c_{\alpha}s_{\beta}c_{\gamma}+s_{\alpha}s_{\gamma}\\ s_{\alpha}s_{\beta}c_{\gamma}-c_{\alpha}s_{\gamma}\\ c_{\beta}c_{\gamma}\end{bmatrix}-\begin{bmatrix}0\\ 0\\ g\end{bmatrix}, && \text{(13a)}\\
\begin{bmatrix}\dot{\omega}_{x}\\ \dot{\omega}_{y}\\ \dot{\omega}_{z}\end{bmatrix} &= J^{-1}\left(\begin{bmatrix}\sum_{i=1}^{4}f_{i}d_{y_{i}}\\ -\sum_{i=1}^{4}f_{i}d_{x_{i}}\\ \sum_{i=1}^{4}f_{i}d_{\tau_{i}}\end{bmatrix}-\vec{\omega}\times J\vec{\omega}\right), && \text{(13b)}
\end{aligned}$

where $f_{i}$ is the thrust produced by propeller $i$, $d_{y_{i}}$ and $d_{x_{i}}$ are the distances from the axis of propeller $i$ to the center of gravity of the quadcopter, $d_{\tau_{i}}$ is a constant of proportionality for how much torque is produced by thrust $f_{i}$, and $g$ is the acceleration due to gravity. For more details on the derivation of the dynamics refer to [20]. The system thus has 12 states, given by $[\vec{p}^{\,\top},\dot{\vec{p}}^{\,\top},\vec{\psi}^{\,\top},\vec{\omega}^{\,\top}]^{\top}$, and four inputs, given by the motor thrusts $f_{i}$, $i=1,\dots,4$.
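For reference, a sketch (ours) of the accelerations (13) in NumPy is given below. The per-propeller geometric parameters d_x, d_y, d_tau, the inertia matrix J, and the mass m are assumed given; psi = (gamma, beta, alpha) are the ZYX roll, pitch, and yaw angles and f is the vector of the four propeller thrusts.

```python
# Sketch of the translational and rotational accelerations (13).
import numpy as np

def accelerations(psi, omega, f, m, J, d_x, d_y, d_tau, g=9.81):
    gamma, beta, alpha = psi
    ca, sa = np.cos(alpha), np.sin(alpha)
    cb, sb = np.cos(beta), np.sin(beta)
    cg, sg = np.cos(gamma), np.sin(gamma)

    f_T = np.sum(f)
    # Translational acceleration in the inertial frame, (13a)
    p_ddot = (f_T / m) * np.array([ca * sb * cg + sa * sg,
                                   sa * sb * cg - ca * sg,
                                   cb * cg]) - np.array([0.0, 0.0, g])
    # Body-frame torques and angular acceleration, (13b)
    torque = np.array([np.dot(f, d_y), -np.dot(f, d_x), np.dot(f, d_tau)])
    omega_dot = np.linalg.solve(J, torque - np.cross(omega, J @ omega))
    return p_ddot, omega_dot
```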

Figure 1: Coordinate frames for the quadcopter: inertial frame $(I)$ and body frame $(B)$.

III-B System Model Approximation for ADP

In order to apply the ADP method from Section II, we approximate the full dynamics (13) in three steps. First, we exploit time scale separation to reduce the state dimension of the dynamics by using the cascaded control structure shown in Figure 2. The inner controller considers the states $[\vec{\psi}^{\,\top},\vec{\omega}^{\top}]^{\top}$ and uses the thrust inputs $f_{1},\dots,f_{4}$ to track a reference attitude $\vec{\psi}_{\mathrm{ref}}$ and total thrust $f_{T,\mathrm{ref}}$. The outer controller considers the states $x=[\vec{p}^{\,\top},\dot{\vec{p}}^{\,\top}]^{\top}$ and uses the inputs $u=[\gamma,\beta,\alpha,f_{T}]^{\top}$ to stabilize the quadcopter to a specified position and yaw setpoint. These inputs are given as references to the inner controller. This cascaded structure is common practice for quadcopter control [15], and the approximation step assumes a sufficient time-scale separation so that the transients of the inner loop controller can be neglected when designing the outer loop controller. The goal is to design the outer controller using the combination of ADP methods described in Section II.

In the second approximation step, we use a Taylor expansion of (13a) to replace the trigonometric functions with a polynomial approximation that fits the SOS framework described in Section II-B. We use a third order Taylor expansion as a compromise between the accuracy of the approximated nonlinear dynamics and the size of the optimization problem (10) for fitting approximate value functions. The translational dynamics (13a) are thus approximated as

$\begin{bmatrix}\dot{\vec{p}}\\ \frac{1}{m}\big(\beta f_{T}-\tfrac{1}{2}mg\alpha^{2}\beta+f_{T}\alpha\gamma-\tfrac{1}{6}mg\beta^{3}-\tfrac{1}{2}mg\beta\gamma^{2}\big)\\ \frac{1}{m}\big(-\gamma f_{T}+f_{T}\alpha\beta+\tfrac{1}{2}mg\alpha^{2}\gamma+\tfrac{1}{6}mg\gamma^{3}\big)\\ \frac{1}{m}\big((f_{T}-mg)-\tfrac{1}{2}f_{T}\beta^{2}-\tfrac{1}{2}f_{T}\gamma^{2}\big)\end{bmatrix},$ (14)

where we have introduced $f_{T}=\sum_{i}f_{i}$ to represent the total thrust.

Finally, to bring the dynamics of (14) to discrete time, we use a forward Euler scheme. The discrete-time dynamics are required for applying the ADP methods to the design of the outer-loop controller. We choose a forward Euler approximation in order not to increase the polynomial order of the dynamics, and assume that the sampling time is fast enough to maintain the stability properties of the plant. For the stage cost we choose the quadratic form $l(x,u)=x^{\top}Qx+u^{\top}Ru$.
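The resulting discrete-time outer-loop model and stage cost can be written compactly as in the sketch below (ours). The state is x = [p, p_dot], the input is u = [gamma, beta, alpha, f_T], and the sampling time Ts, mass m, and weights Q, R are assumed given.

```python
# Sketch of the forward-Euler discretization of the third-order Taylor model (14)
# and of the quadratic stage cost l(x, u) = x'Qx + u'Ru.
import numpy as np

def f_discrete(x, u, m, Ts, g=9.81):
    x = np.asarray(x, dtype=float)
    p_dot = x[3:6]
    gamma, beta, alpha, f_T = u
    acc = np.array([
        (beta * f_T - 0.5 * m * g * alpha**2 * beta + f_T * alpha * gamma
         - m * g * beta**3 / 6 - 0.5 * m * g * beta * gamma**2) / m,
        (-gamma * f_T + f_T * alpha * beta + 0.5 * m * g * alpha**2 * gamma
         + m * g * gamma**3 / 6) / m,
        ((f_T - m * g) - 0.5 * f_T * beta**2 - 0.5 * f_T * gamma**2) / m,
    ])
    return x + Ts * np.concatenate([p_dot, acc])      # forward Euler step

def stage_cost(x, u, Q, R):
    return x @ Q @ x + u @ R @ u
```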

[Block diagram: Outer Controller → Inner Controller → Plant as per (13), with references $f_{T,\mathrm{ref}}$, $\vec{\psi}_{\mathrm{ref}}$, thrusts $f_{1},\dots,f_{4}$, and feedback of $\vec{\omega}$, $\vec{\psi}$, and $(\vec{p},\dot{\vec{p}})$.]
Figure 2: Cascaded control structure. The references $f_{T,\mathrm{ref}}$ and $\vec{\psi}_{\mathrm{ref}}$ are the inputs $u$ used in the model for synthesizing the outer controller. The inner loop consists of PID controllers that track these attitude and thrust references.

III-C Input Constraints

Figure 3: Visualization of the constraint on the tilt angle $\theta$.

We chose to constrain the tilt angle that the $z^{(I)}$ axis forms with the $z^{(B)}$ axis, denoted by $\theta$ and shown in Figure 3. The motivation behind this choice is linked to the nonlinearities in the dynamics. We choose $\theta$ large enough so that the nonlinearities are relevant for the solution of the optimal control problem, but small enough so that the third order Taylor expansion (14) is an adequate approximation of the true dynamics (13a). The angle constraint is given by

$\cos(\theta)\leq\cos(\beta)\cos(\gamma),$ (15)

which is nonlinear in the inputs $\gamma$ and $\beta$, similar to the constraint introduced in [15].

To represent the physical actuator limits of the propellers, we impose a lower and an upper bound on the total thrust $f_{T}$. Denoting the lower and upper bounds as $f_{T,lb}$ and $f_{T,ub}$, respectively, the constraint is thus,

$f_{T,lb}\leq f_{T}\leq f_{T,ub}.$ (16)

In addition to the dynamics, the constraints also need to be in polynomial form in order to apply the SOS techniques from Section II-B. We chose quadratic forms for the constraints (15) and (16). The angle constraint (15) is approximated via a least-squares fit to the closest ellipse with radii $a_{1}$ and $a_{2}$, and the total thrust constraint (16) can be transformed into a single second order polynomial to reduce the number of SOS multipliers. The constraint functions are thus given by

$\begin{aligned}
1-\frac{\gamma^{2}}{a_{1}^{2}}-\frac{\beta^{2}}{a_{2}^{2}}=:g_{\theta}(u) &\geq 0, && \text{(17a)}\\
f_{T}\left(f_{T,lb}+f_{T,ub}\right)-f_{T,lb}f_{T,ub}-f_{T}^{2}=:g_{f}(u) &\geq 0. && \text{(17b)}
\end{aligned}$

Note that $g_{f}(u)$ and $g_{\theta}(u)$ correspond to $g_{i}(x,u)$ in (9).
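The paper does not detail the least-squares fit of the ellipse in (17a); one possible way (ours) to obtain the radii $a_{1}$ and $a_{2}$, under the assumption that the boundary of (15) is sampled on a grid of roll angles, is sketched below.

```python
# Sketch of fitting the ellipse gamma^2/a1^2 + beta^2/a2^2 = 1 to the boundary
# of the tilt constraint cos(theta_max) = cos(beta)cos(gamma) in a least-squares
# sense. The sampling strategy is an illustrative choice, not the paper's.
import numpy as np

def fit_tilt_ellipse(theta_max, n=200):
    gam = np.linspace(-theta_max * 0.999, theta_max * 0.999, n)
    bet = np.arccos(np.clip(np.cos(theta_max) / np.cos(gam), -1.0, 1.0))
    A = np.column_stack([gam**2, bet**2])              # unknowns: [1/a1^2, 1/a2^2]
    w, *_ = np.linalg.lstsq(A, np.ones(n), rcond=None)
    a1, a2 = 1.0 / np.sqrt(w)
    return a1, a2

# e.g. theta_max = 45 degrees, as used in Section IV-A
a1, a2 = fit_tilt_ellipse(np.deg2rad(45.0))
```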

III-D Value Function Fitting

The main tuning parameter is the state relevance weighting $c(x)$, for which heuristics have been developed. As presented in [12], to alleviate the sensitivity to this choice, we select a family of $c_{i}(x)\sim\mathcal{N}(0,\Sigma_{i})$ as Gaussians with zero mean and different covariance matrices. Let $\mathrm{diag}(v)$ define a diagonal matrix with the entries of vector $v$ on its diagonal. Then, with $\Sigma_{0}=\mathrm{diag}([0.1,0.1,0.1,1,1,1])$, the set of covariance matrices $\Sigma_{i}$ is obtained as,

$\Sigma_{i}=K_{i}\,\Sigma_{0},\quad i=1,\dots,9,\quad\text{with}\quad K_{i}\in\{0.01,\,0.1,\,0.3,\,0.5,\,0.7,\,1,\,1.3,\,1.5,\,2\}.$ (18)

For each $c_{i}(x)$ in (18), we perform the iterative procedure described in Section II-C. As shown in Figure 4, the relative cost improvement falls below a threshold of $10^{-5}$ within 9 iterations.

Figure 4: Convergence of the objective function value of (10), $\int\hat{V}^{*}_{j}(x)c_{4}(x)dx$, over iterations $j$, for state relevance weighting $c_{4}$ in (18).

In Figure 5 (a), a cut of the value function approximation is shown for different iterations with the same $c_{i}(x)$. The cut is obtained by setting all states to zero except $\dot{p}_{z}$. Solving this iterative problem for the different $c_{i}(x)$ results in a family of value function approximations, $\hat{V}_{c_{i}}^{*}(x)$, $i=1,\dots,N_{c}$. As presented in Section II-D, the point-wise maximum (PWM) of this family, $\hat{V}^{*}_{PWM}(x):=\max_{i=1,\dots,N_{c}}\hat{V}^{*}_{c_{i}}(x)$, is the value function approximation used for the online greedy policy (12). Figure 5 (b) shows a cut in the $\dot{p}_{z}$ state of the value function approximations $\hat{V}_{c_{i}}^{*}(x)$ and of $\hat{V}^{*}_{PWM}(x)$.

Figure 5: Cuts along $\dot{p}_{z}$. (a): the approximate value functions $\hat{V}_{j}^{*}(x)$ for iterations $j=1,\dots,9$ of (10) with $c_{4}(x)$. (b): the value function approximations $\hat{V}_{c_{i}}^{*}(x)$ for the $c_{i}(x)$ in (18) and their PWM $\hat{V}_{PWM}^{*}(x)$.

IV Numerical Results

In this section, in order to evaluate the control performance of the presented ADP approach, we compare different control schemes, which are explained in detail in the following, with an overview given in Table I. The names consist of three parts: the first part encodes the approximation of the system dynamics used in the online policy, the middle part refers to the type of policy, and the last part indicates the terminal cost or value function used. The control task is a setpoint change of 2 meters in the $p_{x}$-direction, which highlights the performance benefits of the control methods using the ADP value function compared to the other control schemes. First, results from simulations are given and then, experimental results on the quadcopter are presented.

TABLE I: Control policies compared in Section IV.
Name | dynamics approximation | policy | terminal cost
NL-GP-ADP | 3rd order Taylor as in (14) | Greedy policy (12) | $\hat{V}^{\ast}_{PWM}$, §III-D
LTV-MPC-LQR | LTV as in (20) | MPC | $P$ from Riccati
LTI-MPC-LQR | linearized around hover | MPC | $P$ from Riccati
LTV-MPC-ADP | LTV as in (21) | MPC | $\hat{V}^{\ast}_{PWM}$, §III-D

IV-A Simulation Results

We consider the model of a small-sized quadcopter, the Crazyflie 2.0 [21], with a mass of 27 g. Details about the model parameters such as its inertia and dimensions can be found in [22], [23]. For the simulation, the full state dynamics of the quadcopter as in (13) are used, together with the control structure shown in Figure 2. The inner loop tracks the commanded reference Euler angles $\vec{\psi}_{\mathrm{ref}}$ with a PID controller running at 500 Hz, with the controller parameters and implementation details found in [23]. The outer control loop, and hence all the controllers listed in Table I, runs at 50 Hz.

In the ADP approach, the matrices for the stage cost $l(x,u)=x^{\top}Qx+u^{\top}Ru$ have been chosen as,

$Q=\mathrm{diag}([50,\,50,\,50,\,5,\,5,\,5]),\quad R=\mathrm{diag}([20,\,20,\,100,\,1500]),$ (19)

and the discount factor is set to $\xi=0.98$. The total thrust is constrained to $f_{T}\in[0\,\mathrm{N},\,0.56\,\mathrm{N}]$. For the angle constraint in (15) we have chosen $\theta=45^{\circ}$, which is large enough to demonstrate that the controller takes into account the nonlinearities of the system, but small enough for the constraint to become active.

At each time step, the online policy in (12) is solved and the optimal input is applied to the system. As the term $\hat{V}_{c_{i}}^{*}(f(x,u))$ is a composition of a quadratic and a third order polynomial, the resulting optimization problem is not convex. We refer to this as NL-GP-ADP, see Table I, and we solve it in simulation with an interior point method.

In the following, we compare NL-GP-ADP to a linear time-varying (LTV) MPC [24], which also accounts for the nonlinear dynamics in its predictions. At each time step, the dynamics (13a) are linearized around the predicted trajectory, $x_{\mathrm{traj},k}$, $u_{\mathrm{traj},k}$, and the linearization is denoted by $A_{k},B_{k},g_{k}$. Then, the following problem is solved,

$\begin{aligned}
\underset{\{u_{k}\}_{k=0}^{N-1}}{\text{minimize}}\quad & \sum_{k=0}^{N-1}\lVert x_{k}-x_{\mathrm{traj},k}\rVert_{Q}^{2}+\lVert u_{k}-u_{\mathrm{traj},k}\rVert_{R}^{2}+\lVert x_{N}-x_{\mathrm{traj},N}\rVert_{P}^{2} && \text{(20)}\\
\text{subject to}\quad & x_{k+1}=A_{k}x_{k}+B_{k}u_{k}+g_{k},\\
& u_{k}\in\mathcal{U},
\end{aligned}$

where the set $\mathcal{U}$ is defined as in (17), making this problem a convex Quadratically Constrained Quadratic Program (QCQP). Note that the $x_{k}$ and $u_{k}$ variables are deviations from the predicted trajectory. The cost matrices $Q$ and $R$ are the same as in (19), and $P$ comes from the solution of the Riccati equation using the linearization of the dynamics (13a) around the hover state $[\vec{p}^{\,\top},\dot{\vec{p}}^{\,\top}]^{\top}=0$. The first input, $u_{0}^{*}$, is applied, and the system evolves by one time step. The predicted trajectory for the next iteration, i.e., $x_{\mathrm{traj},k}$, $u_{\mathrm{traj},k}$, is based on the state measurement and the inputs $u_{1}^{*},\dots,u_{N-1}^{*}$. We refer to this as LTV-MPC-LQR of horizon length $N$, see Table I.
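A sketch (ours) of one LTV-MPC iteration (20) in CVXPY is given below. Following the text, the decision variables are deviations from the predicted trajectory, the linearization A_k, B_k, g_k is assumed precomputed, and the input constraints (17) are imposed on the absolute inputs u_traj[k] + du[k]; the function and variable names are our own.

```python
# Sketch of the convex QCQP (20) with deviation variables (dx, du).
import cvxpy as cp

def ltv_mpc_step(dx0, u_traj, A_list, B_list, g_list, Q, R, P, a1, a2, f_lb, f_ub):
    N, nx, nu = len(A_list), Q.shape[0], R.shape[0]
    dx = cp.Variable((N + 1, nx))
    du = cp.Variable((N, nu))                   # input order: [gamma, beta, alpha, f_T]

    cost = cp.quad_form(dx[N], P)               # terminal cost ||x_N - x_traj,N||_P^2
    cons = [dx[0] == dx0]
    for k in range(N):
        cost += cp.quad_form(dx[k], Q) + cp.quad_form(du[k], R)
        cons += [dx[k + 1] == A_list[k] @ dx[k] + B_list[k] @ du[k] + g_list[k]]
        u_abs = u_traj[k] + du[k]
        cons += [cp.square(u_abs[0] / a1) + cp.square(u_abs[1] / a2) <= 1]            # (17a)
        cons += [cp.square(u_abs[3]) - (f_lb + f_ub) * u_abs[3] + f_lb * f_ub <= 0]   # (17b)
    cp.Problem(cp.Minimize(cost), cons).solve()
    return u_traj[0] + du.value[0]              # apply the first (absolute) input
```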

Furthermore, we compare the NL-GP-ADP policy to a controller that does not account for the nonlinear dynamics. For this we use a linear time-invariant (LTI) MPC, i.e., for the dynamics we use the model linearized around hover and the same $Q$, $R$, and $P$ cost and Riccati solution matrices as in (20). Consistent with the naming convention, this policy is referred to as LTI-MPC-LQR of horizon $N=1$, and the policy is also a convex QCQP. This policy is similar to the ADP approach demonstrated on miniature helicopters in [10], and for the control task we consider, the performance was almost identical. Thus, for the sake of clarity, we do not include a policy based on [10] in the comparison here.

A comparison between the three policies is shown in Figure 6. The main difference between the approaches is the drop in $p_{z}$, due to the fact that NL-GP-ADP and LTV-MPC-LQR have more information about the nonlinearities of the system than LTI-MPC-LQR, which uses the dynamics linearized around hover. The nonlinearity of interest for this test is the change in vertical thrust when the quadcopter tilts; as seen in Figure 6 (b), LTI-MPC-LQR corrects for the drop in $p_{z}$ only after it happens, whereas NL-GP-ADP and LTV-MPC-LQR pre-compensate for this drop by increasing the thrust in advance.

(a) Evolution of $p_{x}$ (starting from 2 meters) and $p_{z}$
(b) Evolution of $f_{T}$
Figure 6: Comparison of the behavior of the system in simulation for NL-GP-ADP, LTI-MPC-LQR with $N=1$, and LTV-MPC-LQR with different horizon lengths ($N=1,2,5,10,15,20$, from lighter to darker).

IV-B Experimental Results

For the experiments, the Crazyflie quadcopter is used together with a set of infrared cameras that provide the system with position and attitude measurements, [25]. Body frame angular rate measurements come from the on-board IMU. The control loops are the same as described in Section IV-A for the simulations.

Implementation of the NL-GP-ADP in real-time (i.e., at 50 Hz) was computationally intractable. To overcome this difficulty, the ADP approach is combined with the linear time-varying MPC scheme by replacing the terminal cost in (20) with the ADP value function approximation, leading to

$\begin{aligned}
\underset{\{u_{k}\}_{k=0}^{N-1}}{\text{minimize}}\quad & \sum_{k=0}^{N-1}\lVert x_{k}-x_{\mathrm{traj},k}\rVert_{Q}^{2}+\lVert u_{k}-u_{\mathrm{traj},k}\rVert_{R}^{2}+\hat{V}^{*}_{PWM}(x_{N}-x_{\mathrm{traj},N}) && \text{(21)}\\
\text{subject to}\quad & x_{k+1}=A_{k}x_{k}+B_{k}u_{k}+g_{k},\\
& u_{k}\in\mathcal{U}.
\end{aligned}$

We call this the LTV-MPC-ADP policy of horizon $N$. If we introduce an epigraph variable for the maximum of the quadratics $\hat{V}^{*}_{c_{i}}(x)$ in $\hat{V}^{*}_{PWM}$, as in [12], the resulting optimization problem is a QCQP, making problem (21) implementable in real-time. In particular, if in (21) we choose the horizon length to be 1, we obtain the linearized greedy policy, which differs from problem (12) in that the prediction of the next state is done using the dynamics linearized around the current state, instead of the nonlinear dynamics.
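For concreteness, the epigraph construction can look like the sketch below (ours). Each fitted $\hat{V}^{*}_{c_{i}}$ is assumed to be stored as a quadratic $(P_i, p_i, r_i)$ with $V_i(e)=e^{\top}P_i e + p_i^{\top}e + r_i$, and the returned variable t replaces $\hat{V}^{*}_{PWM}(x_N-x_{\mathrm{traj},N})$ in the objective of (21).

```python
# Sketch of the epigraph reformulation of the point-wise-maximum terminal cost.
import cvxpy as cp

def pwm_terminal_cost(e_N, quadratics):
    """e_N: terminal deviation expression; quadratics: list of (P_i, p_i, r_i)."""
    t = cp.Variable()
    cons = [t >= cp.quad_form(e_N, P) + p @ e_N + r for (P, p, r) in quadratics]
    return t, cons        # add t to the objective and cons to the constraint list
```

Each under-estimator contributes one quadratic constraint, which is why using fewer of them reduces the online solve time, as discussed with Figure 9 below.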

[Plots of the $p_{x}$-position, $p_{z}$-position, and thrust $f_{T}$ over time.]
Figure 7: Experimental comparison of different policies: LTV-MPC-ADP, LTV-MPC-LQR, and LTI-MPC-LQR, all with horizon length $N=1$.

In Figure 7, we show a comparison of the experimental results when using LTV-MPC-ADP, LTV-MPC-LQR and LTI-MPC-LQR, all with a horizon length of $N=1$. As can be seen in Figure 7, the initial thrust is higher for LTV-MPC-ADP, which suggests that it is able to predict the behavior of the system better than LTV-MPC-LQR. For the same horizon length, LTV-MPC-ADP reacts faster in correcting the drop in $p_{z}$, since its initial thrust input is larger. This suggests that using the ADP value function approximation as the terminal cost yields better predictions and, therefore, better performance. The experimental comparison of LTV-MPC-ADP and LTI-MPC-LQR can be seen in the video at [26]. For details on the implementation, we refer to [23].

Figure 8: Accumulated stage cost $\sum l(x_{k},u_{k})$ for different horizon lengths, for the policies LTV-MPC-ADP, LTV-MPC-LQR and LTI-MPC-LQR.

Figure 8 shows a series of simulations using the LTV-MPC-ADP, the LTV-MPC-LQR, and the LTI-MPC-LQR policies for horizon lengths from 1 up to 11. For every horizon length, the sum of stage costs, $\sum_{k}l(x_{k},u_{k})$, is computed until all states and inputs have converged, and then plotted as a measure of the performance of the policy. As the aim is to minimize (2), the smaller this cost is, the better the controller is performing. Figure 8 supports the results of the performance comparison given in Figure 6, as for short horizon lengths the cost of LTV-MPC-ADP is lower than that of LTV-MPC-LQR. It also shows that the quality of the approximation of the terminal cost loses importance as the horizon length increases.

(a) LTV-MPC-LQR
(b) LTV-MPC-ADP
Figure 9: Comparison of the solve times over the horizon length, reported by the Gurobi solver [27] when playing the respective policy online. The black dotted line shows the 50 Hz real-time limit.

Figure 9 shows that the maximum implementable horizon length for the LTV-MPC-ADP policy at 50 Hz is 4-5 steps, while the LTV-MPC-LQR policy is tractable for up to 15 steps. The online computation time of the LTV-MPC-ADP policy can be reduced by using fewer approximate value functions in $\hat{V}_{PWM}^{*}$, which introduces fewer quadratic constraints in (21). This generally introduces a trade-off with the online performance, because the approximation quality of the cost-to-go may be reduced. Moreover, reducing the computation time allows for a higher control frequency, which might improve the forward Euler approximation used for the discrete-time dynamics.

The offline computation required for the LTV-MPC-LQR and LTI-MPC-LQR policies consists of solving the Riccati equation, which takes less than one second for this model. The offline computation required to prepare $\hat{V}^{*}_{PWM}$ for the LTV-MPC-ADP policy takes less than 10 minutes on a standard desktop computer, using Yalmip [28] and Mosek [29] for building and solving the necessary SDPs.

V Conclusion and Future Work

We applied Approximate Dynamic Programming methods to a nonlinear, high dimensional quadcopter system with continuous state and input spaces. Using polynomial approximations of the dynamics, the constraints, and the value function leads to an online policy that captures the relevant nonlinearities of the system. For computational tractability, the ADP method was combined with an MPC scheme, which leverages the computational benefits of a short-horizon MPC and the long-term advantages of capturing the effects of the nonlinearities through the ADP value function. The latter was used as the terminal cost in a linear time-varying MPC. For short time horizons, this control method outperformed the same MPC with the terminal cost given by the solution of a Riccati equation. The performance of the resulting controller was successfully demonstrated both in simulations and in experiments on a nano-quadcopter.

In future work, the control performance could be further improved by allowing for higher order polynomial approximations of the dynamics, or of relevant parts of the dynamics while keeping the complexity of other parts low, in order to ensure computational tractability. Furthermore, formulating the dynamics in terms of quaternions instead of Euler angles is promising, as quaternion dynamics are naturally polynomial. However, more investigation is needed into how to effectively approximate a value function that is only defined on the quaternion manifold.

References

  • [1] R. E. Bellman, “On the theory of dynamic programming,” Proceedings of the National Academy of Sciences, vol. 38, no. 8, pp. 716–719, 1952.
  • [2] S. Yakowitz, “Dynamic programming applications in water resources,” Water Resources Research, vol. 18, no. 4, pp. 673–696, 1982.
  • [3] V. Gabillon, M. Ghavamzadeh, and B. Scherrer, “Approximate dynamic programming finally performs well in the game of tetris,” in Neural Information Processing Systems (NIPS), pp. 1–9, 2013.
  • [4] R. E. Bellman, “On the application of dynamic programming to the determination of optimal play in chess and checkers,” Proceedings of the National Academy of Sciences, vol. 53, no. 2, pp. 244–247, 1965.
  • [5] D. P. De Farias and B. Van Roy, “The linear programming approach to approximate dynamic programming,” Operations Research, vol. 51, no. 6, pp. 850–865, 2003.
  • [6] W. B. Powell and J. Ma, “A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications,” Journal of Control Theory and Applications, vol. 9, no. 3, pp. 336–352, 2011.
  • [7] J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, Handbook of learning and approximate dynamic programming, vol. 2. John Wiley & Sons, 2004.
  • [8] Y. Wang, B. O’Donoghue, and S. Boyd, “Approximate dynamic programming via iterated Bellman inequalities,” International Journal of Robust and Nonlinear Control, vol. 25, no. 10, pp. 1472–1496, 2015.
  • [9] B. Stellato, T. Geyer, and P. J. Goulart, “High-speed finite control set model predictive control for power electronics,” IEEE Transactions on Power Electronics, vol. 32, no. 5, pp. 4007 – 4020, 2017.
  • [10] T. Summers, K. Kunz, N. Kariotoglou, M. Kamgarpour, S. Summers, and J. Lygeros, “Approximate dynamic programming via sum of squares programming,” in IEEE European Control Conference (ECC), pp. 191–197, 2013.
  • [11] P. N. Beuchat, J. Warrington, and J. Lygeros, “Point-wise maximum approach to approximate dynamic programming,” in Conference on Decision and Control (CDC), pp. 3694–3701, 2017.
  • [12] P. Beuchat, A. Georghiou, and J. Lygeros, “Alleviating tuning sensitivity in approximate dynamic programming,” in European Control Conference (ECC), pp. 1616–1622, 2016.
  • [13] M. Hohmann, J. Warrington, and J. Lygeros, “A moment and sum-of-squares extension of dual dynamic programming with application to nonlinear energy storage problems,” preprint: arXiv:1807.05947, 2018.
  • [14] C. F. Liew, D. DeLatte, N. Takeishi, and T. Yairi, “Recent developments in aerial robotics: A survey and prototypes overview,” preprint arXiv:1711.10085, 2017.
  • [15] M. Kamel, K. Alexis, M. Achtelik, and R. Siegwart, “Fast nonlinear model predictive control for multicopter attitude tracking on SO(3),” in Conference on Control Applications (CCA), pp. 1160–1166, 2015.
  • [16] M. Bangura and R. Mahony, “Real-time model predictive control for quadrotors,” 19th IFAC World Congress, vol. 47, no. 3, pp. 11773 – 11780, 2014.
  • [17] O. Hernández-Lerma and J. B. Lasserre, Discrete-time Markov control processes: basic optimality criteria, vol. 30. Springer Science and Business Media, 2012.
  • [18] D. Bertsekas, Dynamic programming and optimal control, vol. 2. Athena Scientific, 4 ed., 2012.
  • [19] P. Parrilo, “Semidefinite programming relaxations for semialgebraic problems,” Mathematical Programming, vol. 96, no. 2, pp. 293–320, 2003.
  • [20] R. Mahony, V. Kumar, and P. Corke, “Multirotor aerial vehicles: Modeling, estimation, and control of quadrotor,” IEEE Robotics & Automation Magazine, vol. 19, pp. 20–32, 2012.
  • [21] “Crazyflie 2.0,” 2018. https://www.bitcraze.io/crazyflie-2/.
  • [22] A. Romero Aguilar, “Non-linear control of quadcopters via approximate dynamic programming,” 2018. https://doi.org/10.3929/ethz-b-000272622.
  • [23] Distributed Flying and Localization Lab, “Open-source code D-FaLL-System,” 2018. https://gitlab.ethz.ch/D-FaLL/PandS-System/D-FaLL-System.
  • [24] M. Diehl, H. Bock, and J. Schlöder, “A real-time iteration scheme for nonlinear optimization in optimal feedback control,” SIAM Journal on Control and Optimization, vol. 43, no. 5, pp. 1714–1736, 2005.
  • [25] “Motion capture system Vicon,” 2018. https://www.vicon.com.
  • [26] Video of experimental results, “Comparison of LTV-MPC-ADP and LTI-MPC-LQR.,” 2018. https://polybox.ethz.ch/index.php/s/XZ4nhMlaL9q6vrd.
  • [27] Gurobi Optimization, LLC, Gurobi Optimizer Reference Manual, 2018. http://www.gurobi.com.
  • [28] J. Löfberg, “YALMIP: A toolbox for modeling and optimization in MATLAB,” in International Symposium on Computer Aided Control Systems Design, pp. 284–289, IEEE, 2004.
  • [29] MOSEK ApS, The MOSEK optimization toolbox for MATLAB manual. Version 8.1, 2017. http://docs.mosek.com/8.1/toolbox/index.html.