Search Methods for Policy Decompositions

Ashwin Khadke and Hartmut Geyer

Ashwin Khadke and Hartmut Geyer are with the Robotics Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh PA, 15213 USA. {akhadke, hgeyer}@andrew.cmu.edu
Abstract

Computing optimal control policies for complex dynamical systems requires approximation methods to remain computationally tractable. Several such methods have been developed, but they do not reason about the suboptimality they induce in the resulting control policies. In earlier work we introduced Policy Decomposition, an approximation method that provides a suboptimality estimate. Policy Decomposition proposes strategies to break an optimal control problem into lower-dimensional subproblems whose optimal solutions are combined to build a control policy for the original system. However, the number of possible decomposition strategies scales quickly with the complexity of a system, posing a combinatorial challenge. In this work we investigate the use of Genetic Algorithm and Monte-Carlo Tree Search to alleviate this challenge. We identify decompositions for swing-up control of a 4 degree-of-freedom manipulator, balance control of a simplified biped, and hover control of a quadcopter.

I Introduction

Computing global optimal control policies for complex systems quickly becomes intractable owing to the curse of dimensionality [1]. One approach to tackle this problem is to compute several local controllers using iLQG [2] or DDP [3] and combine them to obtain a global policy [4, 5, 6]. Another approach is to compute policies as functions of some lower-dimensional features of the state, either hand-designed [7] or obtained by minimizing some projection error [8]. As can be inferred from Tab. I, none of these methods predicts or reasons about the effect of the respective simplifications on the closed-loop behavior of the resulting policy. For linear systems or linear approximations of nonlinear systems, [9] and [10] identify subspaces within the state-space for which the optimal control of the lower-dimensional system closely resembles that of the original system. Reduction methods that assess the suboptimality of the resulting policies when applied to the full nonlinear system have not been explored.

TABLE I: Comparison of methods to solve the optimal control problem

method | output | suboptimality | run time
iLQG [2] / DDP [3] | Trajectory | unknown | $O(d^{3})$
Trajectory-based DP [4][5] | Policy | unknown | unknown
Model Reduction [7][8] | Policy | unknown | $O(n^{d_{\text{reduced}}})$
Policy Decomposition [11] | Policy | estimated | $O(n^{d_{\text{reduced}}})$

In our earlier work we introduced Policy Decomposition, an approximate method for solving optimal control problems that provides suboptimality estimates for the resulting control policies [11]. Policy Decomposition proposes strategies to decouple and cascade the process of computing control policies for different inputs to a system. Based on the strategy, the control policies depend only on a subset of the entire state of the system (Fig. 1) and can be obtained by solving lower-dimensional optimal control problems, leading to reductions in policy compute time. We further introduced the value error, i.e. the difference between the value functions of control policies obtained with and without decomposing, to assess the closed-loop performance of possible decompositions. This error cannot be obtained without knowing the true optimal control, and we estimate it based on local LQR approximations. The estimate can be computed without solving for the policies and in minimal time, allowing us to predict which decompositions sacrifice minimally on optimality while offering substantial reductions in computation.

Figure 1: Idea of Policy Decomposition shown for a fictive system. Instead of jointly computing optimal control policies for all inputs over the full state-space (top), the policies are approximated with decoupled and cascaded sub-policies for smaller sub-systems that are faster to compute (bottom).

The number of possible decompositions that can be constructed with just the two fundamental strategies of cascading and decoupling shown in Fig. 1 turns out to be substantial even for moderately complex systems. Exhaustive search by evaluating the value error estimates for all possible decompositions becomes impractical for systems with more than 12 state variables (Fig. 2). We thus explore the use of Genetic Algorithm (GA) [12] and Monte-Carlo Tree Search (MCTS) [13] to efficiently find promising decompositions. We first review the main ideas behind Policy Decomposition in Sec. II. Sec. III introduces input-trees, an abstraction to represent any decomposition, and discusses how GA and MCTS search over the space of input-trees. We find decompositions for the control of a 4 degree-of-freedom manipulator, a simplified biped and a quadcopter in Sec. IV, and conclude with some directions for future work in Sec. V.

Figure 2: Number of possible decompositions as the number of state variables and control inputs increases. Assuming a computation time of 0.1 s per value error estimate, dotted lines mark how many decompositions can be evaluated within fixed time budgets.

II Policy Decomposition

Consider the general dynamical system

$\dot{\boldsymbol{x}} = \boldsymbol{f}(\boldsymbol{x}, \boldsymbol{u})$   (1)

with state $\boldsymbol{x}$ and input $\boldsymbol{u}$. The optimal control policy $\pi^{*}_{\boldsymbol{u}}(\boldsymbol{x})$ for this system minimizes the objective function

$J = \int_{0}^{\infty} e^{-\lambda t} c(\boldsymbol{x}(t), \boldsymbol{u}(t)) \, dt,$   (2)

which describes the discounted sum of costs $c(\boldsymbol{x},\boldsymbol{u}) = (\boldsymbol{x}-\boldsymbol{x}^{d})^{T}\boldsymbol{Q}(\boldsymbol{x}-\boldsymbol{x}^{d}) + (\boldsymbol{u}-\boldsymbol{u}^{d})^{T}\boldsymbol{R}(\boldsymbol{u}-\boldsymbol{u}^{d})$ accrued over time, where $\boldsymbol{x}^{d}$ is the goal state and $\boldsymbol{u}^{d}$ the input that stabilizes the system at $\boldsymbol{x}^{d}$. The discount factor $\lambda$ characterizes the trade-off between immediate and future costs.

Figure 3: Input-tree (left) for the decomposition shown in Fig. 1 and the resulting subsystems (middle). Policies for $u_{1}$ and $u_{4}$, i.e. the inputs at the leaf nodes, are obtained first by independently solving optimal control problems for subsystems 1 and 2 respectively. Policies for $u_{2}$ and $u_{3}$ are computed jointly by solving the optimal control problem for subsystem 3. The resulting policies for the different inputs (right) are functions of different state variables.

Policy Decomposition approximates the search for one high-dimensional policy $\pi^{*}_{\boldsymbol{u}}(\boldsymbol{x})$ with a search for a combination of lower-dimensional sub-policies of smaller optimal control problems that are much faster to compute. These sub-policies are computed in either a cascaded or decoupled fashion and then reassembled into one decomposition policy $\pi^{\delta}_{\boldsymbol{u}}(\boldsymbol{x})$ for the entire system (Fig. 1). Any sub-policy $\pi_{\boldsymbol{u}_{i}}(\boldsymbol{x}_{i})$ is the optimal control policy for its corresponding subsystem,

$\dot{\boldsymbol{x}}_{i} = \boldsymbol{f}_{i}(\boldsymbol{x}_{i}, \boldsymbol{u}_{i} \mid \bar{\boldsymbol{x}}_{i} = \bar{\boldsymbol{x}}_{i}^{d}, \bar{\boldsymbol{u}}_{i}),$   (3)

where $\boldsymbol{x}_{i}$ and $\boldsymbol{u}_{i}$ are subsets of $\boldsymbol{x}$ and $\boldsymbol{u}$, $\boldsymbol{f}_{i}$ only contains the dynamics associated with $\boldsymbol{x}_{i}$, and the complement state $\bar{\boldsymbol{x}}_{i} = \boldsymbol{x} \setminus \boldsymbol{x}_{i}$ is assumed to be constant. The complement input $\bar{\boldsymbol{u}}_{i} = \boldsymbol{u} \setminus \boldsymbol{u}_{i}$ comprises inputs that are decoupled from $\boldsymbol{u}_{i}$ and those that are in cascade with $\boldsymbol{u}_{i}$. The decoupled inputs are set to zero and sub-policies for the cascaded inputs are used as is while computing $\pi_{\boldsymbol{u}_{i}}(\boldsymbol{x}_{i})$. In general, $\pi_{\bar{\boldsymbol{u}}_{i}}(\boldsymbol{x}_{i}) = [0, \ldots, 0, \pi_{\boldsymbol{u}_{j}}(\boldsymbol{x}_{j}), 0, \ldots, 0]$, where $\boldsymbol{u}_{j} \subseteq \bar{\boldsymbol{u}}_{i}$ are inputs that appear lower in the cascade than $\boldsymbol{u}_{i}$, and $\boldsymbol{x}_{j} \subseteq \boldsymbol{x}_{i}$. Note that (i) $\pi_{\bar{\boldsymbol{u}}_{i}}(\boldsymbol{x}_{i})$ can contain multiple sub-policies, (ii) these sub-policies can themselves be computed in a cascaded or decoupled fashion, and (iii) they have to be known before computing $\pi_{\boldsymbol{u}_{i}}(\boldsymbol{x}_{i})$. We represent policies with lookup tables and use Policy Iteration [1] to compute them. Dimensions of the tables are suitably reduced when using a decomposition.

The value error $\text{err}^{\delta}$, defined as the average difference over the state space $\mathcal{S}$ between the value functions $V^{\delta}$ and $V^{*}$ of the control policies obtained with and without decomposition,

$\text{err}^{\delta} = \frac{1}{|\mathcal{S}|} \int_{\mathcal{S}} V^{\delta}(\boldsymbol{x}) - V^{*}(\boldsymbol{x}) \, d\boldsymbol{x},$   (4)

directly quantifies the suboptimality of the resulting control. $\text{err}^{\delta}$ cannot be computed without knowing $V^{*}$, the optimal value function for the original intractable problem. Thus, an estimate is derived by linearizing the dynamics (Eq. 1) about the goal state and input, $\dot{\boldsymbol{x}} = \boldsymbol{A}(\boldsymbol{x}-\boldsymbol{x}^{d}) + \boldsymbol{B}(\boldsymbol{u}-\boldsymbol{u}^{d})$. As the costs are quadratic, the optimal control of this linear system is an LQR, whose value function $V_{\text{lqr}}^{*}(\boldsymbol{x})$ can be readily computed [14]. The value error estimate of decomposition $\delta$ is

$\text{err}^{\delta}_{\text{lqr}} = \frac{1}{|\mathcal{S}|} \int_{\mathcal{S}} V^{\delta}_{\text{lqr}}(\boldsymbol{x}) - V^{*}_{\text{lqr}}(\boldsymbol{x}) \, d\boldsymbol{x},$   (5)

where $V_{\text{lqr}}^{\delta}(\boldsymbol{x})$ is the value function for the equivalent decomposition of the linear system [11]. Note that $\text{err}^{\delta}_{\text{lqr}} \in [0, \infty]$.
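For illustration, a minimal sketch (not the authors' implementation) of computing $V_{\text{lqr}}^{*}$ for the linearized system is shown below. It relies on the standard equivalence between a discounted LQR and an undiscounted LQR with the system matrix shifted by half the discount rate, and uses SciPy's continuous-time Riccati solver.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def lqr_value_matrix(A, B, Q, R, lam):
    """Return P such that V*_lqr(x) = (x - x_d)^T P (x - x_d) for the
    linearized dynamics and the discounted quadratic cost of Eq. 2.

    The discount e^{-lam*t} is absorbed by shifting the system matrix,
    A -> A - (lam/2) I, leaving an ordinary algebraic Riccati equation.
    """
    A_shift = A - 0.5 * lam * np.eye(A.shape[0])
    return solve_continuous_are(A_shift, B, Q, R)
```

Computing $V_{\text{lqr}}^{\delta}$ additionally requires solving the sub-LQR problems implied by the decomposition of the linear system, as described in [11].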

III Searching For Promising Decompositions

We first introduce an intuitive abstraction for decompositions that allows us to search over the space of possibilities. A decomposition $\delta$ can be represented using an input-tree $T^{\delta} = (\mathcal{V}, \mathcal{E})$, where all nodes except the root node are tuples of disjoint subsets of inputs and state variables, $\boldsymbol{v}_{i} = (\boldsymbol{u}_{i}, \boldsymbol{x}_{\boldsymbol{u}_{i}}) \; \forall \boldsymbol{v}_{i} \in \mathcal{V} \setminus \{\boldsymbol{v}_{\text{root}}\}$. Inputs that lie on the same branch are in a cascade, where policies for inputs lower in the branch (leaf node being the lowest) influence the policies for inputs higher up. Inputs that belong to different branches are decoupled for the sake of policy computation. A sub-tree rooted at node $\boldsymbol{v}_{i}$ characterizes a subsystem with control inputs $\boldsymbol{u}_{i}$ and state $\boldsymbol{x}_{i} = \cup_{j} \boldsymbol{x}_{\boldsymbol{u}_{j}}$, where $\boldsymbol{x}_{\boldsymbol{u}_{j}}$ are state variables belonging to nodes in the sub-tree. Note that $\boldsymbol{x}_{\boldsymbol{u}_{i}} \subseteq \boldsymbol{x}_{i}$, and $\boldsymbol{x}_{\boldsymbol{u}_{i}}$ may be empty if it does not belong to a leaf node. Fig. 3 depicts the input-tree for the decomposition shown in Fig. 1 and describes the resulting subsystems. Policies are computed in a child-first order, starting from leaf nodes followed by their parents and so on. Finally, input-trees can be easily enumerated, allowing us to count the number of possible decompositions for a system (see Sec. VI-A).
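As a concrete data structure, an input-tree could be represented as below. This is a minimal sketch under our own naming, not the authors' implementation; the root holds no inputs, and each other node stores its input set $\boldsymbol{u}_{i}$ and state set $\boldsymbol{x}_{\boldsymbol{u}_{i}}$.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List

@dataclass
class InputTreeNode:
    inputs: FrozenSet[str]       # u_i, empty only for the root
    states: FrozenSet[str]       # x_{u_i}, may be empty for non-leaf nodes
    children: List["InputTreeNode"] = field(default_factory=list)

    def subsystem_states(self) -> FrozenSet[str]:
        """State of the subsystem rooted here: union of x_{u_j} over the sub-tree."""
        result = set(self.states)
        for child in self.children:
            result |= child.subsystem_states()
        return frozenset(result)

# The input-tree of Fig. 3, as we read it from the matrices in Sec. VI-B,
# would then be encoded as:
fig3_tree = InputTreeNode(frozenset(), frozenset(), [
    InputTreeNode(frozenset({"u1"}), frozenset({"x4"})),
    InputTreeNode(frozenset({"u2", "u3"}), frozenset({"x2", "x3"}),
                  [InputTreeNode(frozenset({"u4"}), frozenset({"x1"}))]),
])
```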

Searching for promising decompositions requires algorithms that can operate over the combinatorial space of possibilities. Several methods have been proposed to tackle such problems [15, 16]. Genetic Algorithm (GA) is one such method that has been used for solving a variety of combinatorial optimization problems [17, 18]. GA starts off with a randomly generated set of candidate solutions and suitably combines promising candidates to find better ones. A top-down alternative to GA is to start from the original system and decompose it step by step. The search for good decompositions can then be posed as a sequential decision making problem, for which a number of algorithms exist, especially in the context of computer games [19]. Among these, Monte-Carlo Tree Search (MCTS) [13] methods have been shown to be highly effective for games that have a large number of possible moves in each step, a challenge similar to the one we encounter when generating decompositions from the original system. Here, we describe the requisite adaptations to GA and MCTS for searching over the space of input-trees.

III-A Genetic Algorithm

GA iteratively evolves a randomly generated initial population of candidates, through selection, crossovers and mutations, to find promising ones. In our case, the candidates are input-trees. We generate the initial population by uniformly sampling from the set of all possible input-trees for a system (Sec. III-C). The quality of an input-tree is assessed using a fitness function that accounts for the value error estimate as well as an estimate of the policy compute time for the corresponding decomposition. Moreover, we add a memory component to GA using a hash table, which eliminates the need to recompute fitness values for previously seen candidates. Typical GA components include operators for crossover and mutation. We introduce mutation operators for input-trees but exclude crossovers, as we did not observe any improvement in search performance when including them.

III-A1 Fitness Function

The fitness function for a decomposition $\delta$ is a product of two components

$F(\delta) = F_{\text{err}}(\delta) \times F_{\text{comp}}(\delta)$   (6)

where $F_{\text{err}}(\delta)$ and $F_{\text{comp}}(\delta)$ quantify the suboptimality of a decomposition and the potential reduction in policy computation time respectively. $F_{\text{err}}(\delta) = 1 - \exp(-\text{err}^{\delta}_{\text{lqr}})$ is the value error estimate (Eq. 5) scaled to the range $[0, 1]$, and $F_{\text{comp}}(\delta)$ is the ratio of estimates of the floating point operations required to compute policies with and without decomposition $\delta$. We use lookup tables to represent policies and policy iteration [1] to compute them; $F_{\text{comp}}(\delta)$ can thus be derived from the size of the resulting tables, the maximum iterations for policy evaluation and update, and the number of actions sampled in every iteration of the policy update. Note that $F(\delta) \in [0, 1]$, where lower values indicate a more promising decomposition.
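A minimal sketch of this fitness computation, with the flop counts left as placeholder estimates (in practice derived from the table sizes, iteration caps, and number of sampled actions mentioned above):

```python
import math

def fitness(err_lqr: float, flops_with: float, flops_without: float) -> float:
    f_err = 1.0 - math.exp(-err_lqr)      # value error estimate scaled to [0, 1)
    f_comp = flops_with / flops_without   # < 1 when the decomposition saves compute
    return f_err * f_comp                 # lower values are better
```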

III-A2 Mutations

Valid mutations to input-trees are:

  (i) swap state variables between nodes,
  (ii) move a single state variable from one node to another,
  (iii) move a sub-tree,
  (iv) couple two nodes, and
  (v) decouple a node into two nodes.

Fig. 4 depicts the above operations applied to the input-tree of Fig. 3. In every GA iteration, a candidate selected for mutation is modified using only one operator. Operators (i), (ii) and (iii) each have a 25% chance of being applied to a candidate. If none of these three operators is applied, two distinct inputs are randomly selected and, depending on whether they are decoupled or coupled, operator (iv) or (v) is applied respectively.
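The operator-selection logic might look as follows. This is only a sketch; the helper callables (swap_states, move_state, move_subtree, couple, decouple, all_inputs, are_decoupled) are hypothetical names for the operators described above.

```python
import random

def mutate(tree, ops, all_inputs, are_decoupled):
    """ops: dict of hypothetical mutation callables; all_inputs(tree) lists the
    inputs; are_decoupled(tree, a, b) tests whether two inputs lie on different
    branches of the input-tree."""
    r = random.random()
    if r < 0.25:
        return ops["swap_states"](tree)        # operator (i)
    if r < 0.50:
        return ops["move_state"](tree)         # operator (ii)
    if r < 0.75:
        return ops["move_subtree"](tree)       # operator (iii)
    u_a, u_b = random.sample(all_inputs(tree), 2)
    if are_decoupled(tree, u_a, u_b):
        return ops["couple"](tree, u_a, u_b)   # operator (iv)
    return ops["decouple"](tree, u_a, u_b)     # operator (v)
```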

Figure 4: Mutation operators for Genetic Algorithm described in Sec. III-A2 applied to the input-tree depicted in Fig. 3.

III-A3 Hashing Input-Trees

To avoid re-evaluating the fitness for previously seen candidates, we maintain a hash table [20]. This hash table maps a unique identifier, or a key, for input-trees to their computed fitness values. See Sec. VI-B for details on computing these unique identifiers for input-trees.

III-B Monte-Carlo Tree Search

MCTS takes the top-down approach and creates a search tree of input-trees, with an input-tree representing the undecomposed system at its root. It builds the search tree by doing rollouts, i.e. sequences of node expansions starting from the root until a terminal node is reached. After each rollout a backup operation is performed, whereby a value function for every node in the explored branch is computed or updated. The value function for a node captures the quality of solutions obtained in the sub-tree rooted at that node, and is used during rollouts to select nodes for expansion. Here, we describe the strategy for expanding nodes, the role of value functions during rollouts, the backup operation performed at the end of a rollout, and some techniques to speed up the search.

III-B1 Node Expansion

Expanding a node for decomposition $\delta$ entails evaluating the fitness $F(\delta)$ (Eq. 6) and enumerating possible child nodes. These children are constructed by splitting the input and state variable sets of any of the leaves in $T^{\delta}$ into two and arranging them in a cascaded or decoupled fashion. Fig. 5 depicts an example search tree obtained from MCTS rollouts. We use the UCT strategy [21],

$T^{\delta_{\text{expand}}} = \underset{T^{\delta^{\prime}} \in \text{children}(T^{\delta})}{\text{argmin}} \; Q(T^{\delta^{\prime}}) - \sqrt{\frac{2\ln(N_{T^{\delta}}+1)}{N_{T^{\delta^{\prime}}}}}$   (7)

to pick a node to expand. Here, $Q(T^{\delta^{\prime}})$, i.e. the value function for the node in the search tree corresponding to $T^{\delta^{\prime}}$, is the best fitness value observed in the sub-tree rooted at that node. $Q(T^{\delta^{\prime}})$ is initialized to $F(\delta^{\prime})$, and $N_{T^{\delta^{\prime}}}$ denotes the number of times $T^{\delta^{\prime}}$ was visited. We break ties randomly.
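A sketch of this selection rule (our own minimal reading of Eq. 7, not the authors' code): since fitness is minimized, the exploration bonus is subtracted and the child with the smallest score is expanded.

```python
import math
import random

def select_child(children, q_value, visits, parent_visits):
    """children: candidate nodes; q_value[c]: best fitness seen below c;
    visits[c]: visit count of c (assumed >= 1); parent_visits: visits of the
    node being expanded."""
    def score(c):
        return q_value[c] - math.sqrt(2.0 * math.log(parent_visits + 1) / visits[c])
    best = min(score(c) for c in children)
    return random.choice([c for c in children if score(c) == best])  # random tie-break
```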

III-B2 Backup

At the end of each rollout the value functions for nodes visited during the rollout are updated as follows

$Q(T^{\delta}) = \underset{T^{\delta^{\prime}} \in \{T^{\delta}\} \,\cup\, \text{children}(T^{\delta})}{\min} \; Q(T^{\delta^{\prime}})$

starting from the terminal node and going towards the root of the search tree. The value function of the root node is then the fitness value of the best decomposition identified in the search.
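A sketch of this backup pass, assuming the rollout is stored as a root-to-terminal list of nodes:

```python
def backup(path, q_value, children):
    """Update node values from the terminal node back to the root: each node's
    value becomes the minimum fitness observed in its sub-tree."""
    for node in reversed(path):                       # terminal node first
        candidates = [node] + list(children[node])
        q_value[node] = min(q_value[c] for c in candidates)
```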

III-B3 Speed-up Techniques

To prevent redundant rollouts, we remove a node from consideration in the UCT strategy (Eq. 7) if all possible nodes reachable through it have been expanded. Similar to GA, we use a hash table to avoid re-evaluating fitness for previously seen input-trees.

III-C Uniform Sampling of Input-Trees

The strategy to sample input-trees is tightly linked to the process of constructing them, which entails partitioning the set of control inputs into groups, arranging the groups in a tree, and then assigning the state variables to nodes of the tree. To ensure uniform sampling, probabilities of choosing a partition and tree structure are scaled proportional to the number of valid input-trees resulting from said partition and structure. Finally, an assignment of state variables consistent with the tree structure is uniformly sampled (Sec. VI-C).

Figure 5: Example of a search tree generated from MCTS rollouts for a fictive system with four inputs and four state variables.

IV Results and Discussion

We test the efficacy of GA and MCTS in finding promising policy decompositions for three representative robotic systems: a 4 degree-of-freedom manipulator (Fig. 7(a)), a simplified model of a biped (Fig. 6(a)), and a quadcopter (Fig. 8(a)). We run every search algorithm 5 times for a period of 300, 600 and 1200 seconds for the biped, manipulator and quadcopter respectively. We report the number of unique decompositions discovered. Moreover, we derive policies for the best decomposition, i.e. the one with the lowest fitness value (Eq. 6), in each run and report the suboptimality estimates (Eq. 5) as well as the policy computation times (Tab. V). Tabs. II, III and IV list the best decompositions across all runs for the three systems.

Overall, in comparison to MCTS and random sampling, GA consistently identifies decompositions that are on par, if not better, in terms of suboptimality estimates but offer a greater reduction in policy computation time, and it discovers the largest number of unique decompositions. Furthermore, for the biped and the manipulator, GA evaluates only a small fraction of all possible decompositions but consistently identifies the one with the lowest fitness value. Here, we analyze the decompositions identified through the search methods. For details regarding policy computation see Sec. VI-D.

IV-1 Biped

Figure 6: (a) Simplified model of a biped. See Sec. VI-D1 for details. (b) Closed-loop trajectories derived from the optimal policy (magenta), heuristic policy (blue), and from policies computed using the best decompositions found by GA (yellow), MCTS (cyan) and random sampling (green). (c) Basins of attraction, estimated by simulating the system with the policies from different decompositions.

We design policies to balance the biped (Fig. 6(a)) in double stance. Sec. VI-D1 details the biped dynamics. From Tab. II, the best decompositions discovered by GA and MCTS regulate the vertical behavior of the biped using leg force $F_{l}$, the fore-aft behavior using the other leg force $F_{r}$ and a hip torque ($\tau_{l}$ or $\tau_{r}$), and regulate the torso using the remaining hip torque. These decompositions are similar to the heuristic approach (Tab. II, row 5) of regulating the centroidal behavior using leg forces and computing a torso control in cascade [22], but offer a major reduction in complexity. It takes about 1.7 hours to derive the optimal policy, whereas decompositions found by GA and MCTS reduce the computation time to seconds (Tab. V, row 1). However, the reduction comes at the cost of smaller basins of attraction (Fig. 6(c)).

TABLE II: Best decompositions found by search methods: biped

method | $\delta$ with the lowest fitness value | #policy parameters
$\pi^{*}$ | | 52,865,904
GA | $[\pi_{\tau_{r}}(\theta,\dot{\theta}),\ \pi_{\tau_{l}}(l_{r}),\ \pi_{F_{r}}(l_{r},\dot{z},\pi_{\tau_{l}}),\ \pi_{F_{l}}(\alpha_{r},\dot{x})]$ | 736
MCTS | $[\pi_{\tau_{r}}(l_{r}),\ \pi_{\tau_{l}}(\theta,\dot{\theta}),\ \pi_{F_{r}}(l_{r},\dot{z},\pi_{\tau_{r}}),\ \pi_{F_{l}}(\alpha_{r},\dot{x})]$ | 736
Random | $[\pi_{\tau_{r}}(\theta,\dot{\theta}),\ \pi_{\tau_{l}}(\alpha_{r}),\ \pi_{F_{r}}(l_{r},\dot{z}),\ \pi_{F_{l}}(\alpha_{r},\dot{x},\pi_{\tau_{l}})]$ | 736
Heuristic | $[\pi_{[F_{l},F_{r}]}(l_{r},\alpha_{r},\dot{x},\dot{z}),\ \pi_{[\tau_{l},\tau_{r}]}(l_{r},\alpha_{r},\dot{x},\dot{z},\theta,\dot{\theta},\pi_{[F_{l},F_{r}]})]$ | 26,522,860

IV-2 Manipulator

Figure 7: (a) Model of a 4 degree-of-freedom manipulator. See Sec. VI-D2 for details. (b) Closed-loop trajectories for swing-up control, derived from policies obtained using the best decompositions found by GA (yellow), MCTS (cyan) and random sampling (green). (c) Estimated basin of attraction for policies from the decomposition found by GA.

For the manipulator (Fig. 7(a)), we desire control policies that drive it to an upright position. Sec. VI-D2 details the dynamics parameters. Due to computational limits, we are unable to derive the optimal policy. However, the completely decentralized strategy identified by GA (Tab. III, row 2) requires less than 15 seconds to compute torque policies (Tab. V, row 2, column 2) and has a wide basin of attraction (Fig. 7(c)). Random sampling too identifies a decentralized strategy (Tab. III, row 4), allowing policies to be computed in minimal time, but it has a very high value error estimate, resulting in poor performance on the full system (Fig. 7(b), green trajectory diverges). The strategy discovered by MCTS couples the policy computation for the 1st-2nd and the 3rd-4th joint torques (Tab. III, row 3) and in fact has a lower value error estimate than the one found by GA. However, the resulting policies perform poorly on the full system (Fig. 7(b), cyan trajectory diverges). The LQR value error estimate (Eq. 5) only accounts for the linearized system dynamics and is agnostic to the bounds on control inputs, and thus decompositions with low suboptimality estimates need not perform well on the full system. This issue can be alleviated by filtering decompositions shortlisted through search using the DDP-based value error estimate discussed in [11]. While the DDP-based estimate is better, it too does not provide guarantees but rather serves as an improved heuristic.

TABLE III: Best decompositions found by search methods: manipulator

method | $\delta$ with the lowest fitness value | #policy parameters
$\pi^{*}$ | | $2.38\times 10^{9}$
GA | $[\pi_{\tau_{1}}(\theta_{1},\dot{\theta}_{1}),\ \pi_{\tau_{2}}(\theta_{2},\dot{\theta}_{2}),\ \pi_{\tau_{3}}(\theta_{3},\dot{\theta}_{3}),\ \pi_{\tau_{4}}(\theta_{4},\dot{\theta}_{4})]$ | 884
MCTS | $[\pi_{[\tau_{1},\tau_{2}]}(\theta_{1},\dot{\theta}_{1},\theta_{2},\dot{\theta}_{2}),\ \pi_{[\tau_{3},\tau_{4}]}(\theta_{3},\dot{\theta}_{3},\theta_{4},\dot{\theta}_{4})]$ | 97,682
Random | $[\pi_{\tau_{1}}(\theta_{1},\dot{\theta}_{3}),\ \pi_{\tau_{2}}(\theta_{4},\dot{\theta}_{2}),\ \pi_{\tau_{3}}(\theta_{2},\dot{\theta}_{4}),\ \pi_{\tau_{4}}(\theta_{3},\dot{\theta}_{1})]$ | 884

IV-3 Quadcopter

Figure 8: (a) Model of a quadcopter. See Sec. VI-D3 for details. (b) Closed-loop trajectories for hover control derived from the heuristic policy (blue) and from policies computed using the best decompositions found by GA (yellow), MCTS (cyan) and from the Pareto front (red).

The quadcopter (Fig. 8(a)) is required to stabilize its attitude and come to rest. We identify decompositions to simplify computing policies for the net thrust $T$ and the roll, pitch and yaw inducing differential thrusts $F_{\phi}$, $F_{\theta}$ and $F_{\psi}$ respectively. Sec. VI-D3 details the system modelling. Solving for the optimal policy is beyond our computational limits. However, we discovered almost optimal decompositions with just 20 minutes of search and derived policies using them in less than 3 hours (Tab. V, row 3). As depicted in Tab. IV, the best decompositions discovered by GA and MCTS both decouple the yaw ($\psi$) control. But the one found by GA further cascades the $z$ and roll ($\phi$) control with the pitch ($\theta$) control, offering an additional reduction in computation. Random sampling finds decompositions that have lower suboptimality estimates than the ones found by GA and MCTS (Tab. V, row 3), and lower fitness values as well, but they require significantly more compute than we can afford. We are thus unable to report policy compute times for these decompositions in Tab. V and also omit them from Tab. IV. In general, our fitness criterion (Eq. 6) can skew the search towards decompositions that are almost optimal but offer little or no reduction in computation, or towards those that greatly reduce computation but are highly suboptimal. A workaround is to compute a Pareto front of decompositions over the objectives $F_{\text{comp}}$ and $F_{\text{err}}$ (Eq. 6). We derived such a Pareto front using NSGA-II [23], sifted through the front in increasing order of $F_{\text{err}}$, and found a decomposition (Tab. IV, row 4) that fit our computational budget. Closed-loop trajectories for the decompositions in Tab. IV are shown in Fig. 8(b). Decompositions discovered by search are non-trivial; in fact, the trivial heuristic (Tab. IV, row 5) of independently regulating roll, pitch, yaw and height with $F_{\phi}$, $F_{\theta}$, $F_{\psi}$ and $T$ respectively fails to stabilize the attitude and bring the quadcopter to rest (Fig. 8, blue).

TABLE IV: Best decompositions found by search methods: quadcopter

method | $\delta$ with the lowest fitness value | #policy parameters
$\pi^{*}$ | | $1.74\times 10^{10}$
GA | $[\pi_{F_{\psi}}(\psi,\dot{\psi}),\ \pi_{[T,F_{\phi}]}(z,\dot{z},\phi,\dot{\phi},\dot{y},\dot{\theta}),\ \pi_{F_{\theta}}(z,\dot{z},\phi,\dot{\phi},\dot{y},\theta,\dot{\theta},\dot{x},\pi_{[T,F_{\phi}]})]$ | 14,817,796
MCTS | $[\pi_{F_{\psi}}(\psi,\dot{\psi}),\ \pi_{[T,F_{\phi},F_{\theta}]}(z,\dot{z},\phi,\dot{\phi},\dot{y},\theta,\dot{\theta},\dot{x})]$ | 42,707,812
Pareto | $[\pi_{F_{\psi}}(\psi,\dot{\psi}),\ \pi_{T}(z,\dot{z}),\ \pi_{F_{\phi}}(z,\dot{z},\phi,\dot{\phi},\dot{y},\pi_{T}),\ \pi_{F_{\theta}}(z,\dot{z},\phi,\dot{\phi},\dot{y},\theta,\dot{\theta},\dot{x},\pi_{T},\pi_{F_{\phi}})]$ | 14,263,214
Heuristic | $[\pi_{F_{\psi}}(\psi,\dot{\psi}),\ \pi_{T}(z,\dot{z}),\ \pi_{F_{\phi}}(\phi,\dot{\phi},\dot{y}),\ \pi_{F_{\theta}}(\theta,\dot{\theta},\dot{x})]$ | 2,352
TABLE V: Summary of search results. $\text{err}^{\delta}_{\text{lqr}}$ and policy compute times for the best decomposition found (in terms of fitness value, Eq. 6), as well as the number of unique decompositions discovered, averaged across 5 runs. Each algorithm was given a fixed time budget per run: 300, 600 and 1200 seconds for the biped, manipulator and quadcopter respectively.

system | method | $\text{err}^{\delta}_{\text{lqr}}$ | time (sec) | #$\delta$ found
Biped | GA | $0.1018 \pm 0$ | $7.35 \pm 0$ | $33785 \pm 591$
Biped | MCTS | $0.105 \pm 0.008$ | $23 \pm 20.7$ | $21588 \pm 448$
Biped | Random | $0.217 \pm 0.163$ | $23 \pm 20.9$ | $29625 \pm 409$
Manipulator | GA | $0.0357 \pm 0$ | $14.94 \pm 0$ | $60135 \pm 2604$
Manipulator | MCTS | $0.087 \pm 0.167$ | $821 \pm 152$ | $19993 \pm 297$
Manipulator | Random | $16.89 \pm 22.27$ | $59.5 \pm 36.6$ | $40295 \pm 545$
Quadcopter | GA | $(1.43 \pm 0) \times 10^{-16}$ | $9534.5 \pm 0$ | $169408 \pm 1475$
Quadcopter | MCTS | $(1.08 \pm 0.53) \times 10^{-16}$ | - | $38306 \pm 2832$
Quadcopter | Random | $(5.72 \pm 3.36) \times 10^{-18}$ | - | $75176 \pm 1095$

V Conclusions and Future Work

We addressed the combinatorial challenge of applying Policy Decomposition to complex systems, finding non-trivial decompositions to compute policies for otherwise intractable optimal control problems. We introduced an intuitive abstraction, viz. input-trees, to represent decompositions, and a metric to gauge their fitness. Moreover, we put forth a strategy to uniformly sample input-trees and operators to mutate them, facilitating the use of GA in the search for promising ones. We also offered another perspective on the search problem, whereby decompositions are constructed by incrementally decomposing the original complex system, reducing the search to a sequential decision making problem, and showcased how MCTS can be used to tackle it. Across three example systems, viz. a simplified biped, a manipulator and a quadcopter, GA consistently finds the best decompositions in a short period of time. Our work makes Policy Decomposition a viable approach to systematically simplify policy synthesis for complex optimal control problems.

We identified three future research directions that would further broaden the applicability of Policy Decomposition. First, extending the framework to handle optimal trajectory tracking problems. Second, weighing the effects of sensing and actuation noise when evaluating the fitness of different decompositions. Finally, the conduciveness to decomposition often depends on the choice of inputs. In the case of the quadcopter, policy computation for the rotor forces $F_{i}$ is not readily decomposable, but it is for the linearly transformed inputs $T$, $F_{\phi}$, $F_{\theta}$ and $F_{\psi}$. Simultaneously discovering mappings from the original inputs to a space favorable for decomposition would thus be very useful.

VI APPENDIX

VI-A Counting Possible Policy Decompositions

For a system with nn states and mm inputs, the number of possible decompositions resolves to

$N(n,m) = \sum_{r=2}^{m} \Bigg\{ \frac{\Delta^{m}_{r}}{r!} \sum_{k=1}^{r} \Bigg[ \binom{r}{k} \left(\Delta^{r-1}_{r-k} + \Delta^{r-1}_{r-k+1}\right) \sum_{i=0}^{n-k} \binom{n}{i} \Delta^{n-i}_{k} (r-k)^{i} \Bigg] \Bigg\},$   (8)

where $\Delta^{a}_{b} = \sum_{c=0}^{b-1} (-1)^{c} \binom{b}{c} (b-c)^{a}$. This formula results from the observations that any viable decomposition splits the $m$ inputs of a system into $r$ input groups, where $r$ can range from 2 to $m$, and there are $\Delta^{m}_{r}/r!$ ways to distribute the $m$ inputs into $r$ groups [24]; the $r$ input groups can then be arranged into input-trees with $k \in \{1,\dots,r\}$ leaf nodes, leading to $\binom{r}{k}\left(\Delta^{r-1}_{r-k} + \Delta^{r-1}_{r-k+1}\right)$ possible trees per grouping; and the $n$ states can be assigned to each such tree with $k$ leaf nodes in $\sum_{i=0}^{n-k} \binom{n}{i} \Delta^{n-i}_{k} (r-k)^{i}$ different ways.
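A direct transcription of Eq. 8, useful for reproducing the counts in Fig. 2 (a sketch under the formula as stated; math.comb requires Python 3.8+):

```python
from math import comb, factorial

def delta(a: int, b: int) -> int:
    """The quantity Delta^a_b defined below Eq. 8 (an inclusion-exclusion sum)."""
    return sum((-1) ** c * comb(b, c) * (b - c) ** a for c in range(b))

def num_decompositions(n: int, m: int) -> int:
    """Evaluate N(n, m) from Eq. 8 for n state variables and m inputs."""
    total = 0
    for r in range(2, m + 1):                 # number of input groups
        inner = 0
        for k in range(1, r + 1):             # number of leaf nodes
            trees = comb(r, k) * (delta(r - 1, r - k) + delta(r - 1, r - k + 1))
            states = sum(comb(n, i) * delta(n - i, k) * (r - k) ** i
                         for i in range(n - k + 1))
            inner += trees * states
        total += delta(m, r) // factorial(r) * inner
    return total
```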

VI-B Hash Keys For Input-Trees

For a system with $m$ inputs, we use a binary connectivity matrix $C$ of size $m \times m$ to encode the graph. If input $u_{j}$ belongs to node $\boldsymbol{v}_{j}$ in the input-tree, then the $j^{\text{th}}$ row in $C$ has entries 1 for all other inputs that belong to $\boldsymbol{v}_{j}$ as well as for inputs that belong to the parent node of $\boldsymbol{v}_{j}$. For the input-tree in Fig. 3, with rows and columns ordered $u_{1}, \ldots, u_{4}$, the connectivity matrix is

$C = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{bmatrix}.$

We define a binary state-dependence matrix $S$ of size $m \times n$ to encode the influence of the $n$ state variables on the $m$ inputs. The $j^{\text{th}}$ row in $S$ corresponds to input $u_{j}$ and has entries 1 for all state variables that belong to the node $\boldsymbol{v}_{j}$. For the input-tree in Fig. 3, with rows ordered $u_{1}, \ldots, u_{4}$ and columns ordered $x_{1}, \ldots, x_{4}$, the state-dependence matrix is

$S = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}.$

The matrices $C$ and $S$ uniquely encode an input-tree.
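A sketch of building this $(C, S)$ key, assuming an input-tree structure like the one sketched in Sec. III and fixed orderings of the inputs and state variables (not the authors' code):

```python
import numpy as np

def encode(root, inputs, states):
    """Build the connectivity matrix C and state-dependence matrix S."""
    m, n = len(inputs), len(states)
    u_idx = {u: i for i, u in enumerate(inputs)}
    x_idx = {x: i for i, x in enumerate(states)}
    C = np.zeros((m, m), dtype=np.uint8)
    S = np.zeros((m, n), dtype=np.uint8)

    def visit(node, parent_inputs):
        for u in node.inputs:
            j = u_idx[u]
            for other in node.inputs:
                if other != u:
                    C[j, u_idx[other]] = 1    # inputs sharing the node
            for p in parent_inputs:
                C[j, u_idx[p]] = 1            # inputs in the parent node
            for x in node.states:
                S[j, x_idx[x]] = 1            # states assigned to the node
        for child in node.children:
            visit(child, node.inputs)

    for child in root.children:               # the root itself holds no inputs
        visit(child, frozenset())
    return C, S

def hash_key(C, S):
    return (C.tobytes(), S.tobytes())         # usable directly as a dict key
```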

VI-C Uniform Random Sampling of Input-Trees

For a system with $m$ inputs and $n$ state variables we first sample $r \in \{2,\cdots,m\}$, i.e. the number of input groups in an input-tree, with probability proportional to the number of possible input-trees with $r$ input groups (each entry in the outermost summation in Eq. 8). Next, we generate all partitions of the inputs into $r$ groups and pick one uniformly at random. The input groups constitute the sets $\boldsymbol{u}_{i}$ of an input-tree. Subsequently, we sample the number of leaf nodes $k \in \{1,\cdots,r\}$ with probabilities proportional to the number of input-trees with $r$ input groups and $k$ leaf nodes (each entry in the second summation in Eq. 8). To sample the tree structure we use Prüfer codes [25]. Specifically, we uniformly sample a sequence of numbers of length $r-1$ with exactly $r-k$ distinct entries from $\{1,\cdots,r+1\}$. Finally, for assigning state variables to different nodes of the input-tree we uniformly sample a label for every variable from $\{1,\cdots,r\}$. Variables with the label $i$ are grouped to form the sets $\boldsymbol{x}_{\boldsymbol{u}_{i}}$. We re-sample labels if a state assignment is invalid, i.e. if no variables are assigned to leaf nodes of the input-tree.
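The last step above can be sketched as simple rejection sampling, which keeps the assignment uniform over valid labelings. We assume here that an assignment is valid when every leaf node receives at least one state variable; leaf_groups is a hypothetical list of the group labels that sit at leaf nodes.

```python
import random

def sample_state_assignment(states, r, leaf_groups):
    """Uniformly label each state variable with a group in {1, ..., r},
    re-sampling until every leaf group gets at least one variable."""
    while True:
        labels = {x: random.randrange(1, r + 1) for x in states}
        assigned = set(labels.values())
        if all(g in assigned for g in leaf_groups):
            return labels
```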

VI-D Policy Computation Details

Here, we report the dynamics parameters for each system and other relevant details. Policies are represented using lookup tables and computed using policy iteration [1]. Dimensions of the lookup tables are suitably reduced when using a decomposition versus jointly optimizing the policies. Policies are computed using Matlab on an RTX 2080 Ti. Our code is available at https://github.com/ash22194/Policy-Decomposition.

VI-D1 Balancing control for Biped

The biped (Fig. 6(a)) weighs $m = 72$ kg, has rotational inertia $I = 3$ kgm$^2$, and hip-to-COM distance $d = 0.2$ m. Legs are massless and contact the ground at fixed locations $d_{f} = 0.5$ m apart. A leg breaks contact if its length exceeds $l_{0} = 1.15$ m. In contact, legs can exert forces ($0 \leq F_{l/r} \leq 3mg$) and hip torques ($|\tau_{l/r}| \leq 0.25mg/l_{0}$), leading to the dynamics $m\ddot{x} = F_{r}\cos{\alpha_{r}} + \frac{\tau_{r}}{l_{r}}\sin{\alpha_{r}} + F_{l}\cos{\alpha_{l}} + \frac{\tau_{l}}{l_{l}}\sin{\alpha_{l}}$, $m\ddot{z} = F_{r}\sin{\alpha_{r}} - \frac{\tau_{r}}{l_{r}}\cos{\alpha_{r}} + F_{l}\sin{\alpha_{l}} - \frac{\tau_{l}}{l_{l}}\cos{\alpha_{l}} - mg$, and $I\ddot{\theta} = \tau_{r}(1 + \frac{d}{l_{r}}\sin(\alpha_{r}-\theta)) + F_{r}d\cos(\alpha_{r}-\theta) + \tau_{l}(1 + \frac{d}{l_{l}}\sin(\alpha_{l}-\theta)) + F_{l}d\cos(\alpha_{l}-\theta)$, where $l_{l} = \sqrt{l_{r}^{2} + d_{f}^{2} + 2 l_{r} d_{f}\cos{\alpha_{r}}}$ and $\alpha_{l} = \arcsin\frac{l_{r}\sin{\alpha_{r}}}{l_{l}}$. The control objective is to balance the biped midway between the footholds.

VI-D2 Swing Up control for Manipulator

We design policies to swing up a fully actuated 4-link manipulator. Dynamic parameters are $[m_{1}, m_{2}, m_{3}, m_{4}] = [5.4, 1.8, 0.6, 0.2]$ kg and $[l_{1}, l_{2}, l_{3}, l_{4}] = [0.2, 0.5, 0.25, 0.125]$ m. Torque limits are $|\tau_{1}| \leq 24$, $|\tau_{2}| \leq 15$, $|\tau_{3}| \leq 7.5$ and $|\tau_{4}| \leq 1$.

VI-D3 Hover control for a Quadcopter

We derived policies to stabilize the attitude of the quadcopter while bringing it to rest at a height of 1 m. The system is described by its height $z$, attitude (roll $\phi$, pitch $\theta$ and yaw $\psi$), translational velocities ($\dot{x}, \dot{y}, \dot{z}$), and attitude rates ($\dot{\phi}, \dot{\theta}, \dot{\psi}$). The quadcopter need not stop at a fixed $xy$ location. Inputs to the system are the net thrust $T = \sum_{i} F_{i}$ and the roll, pitch and yaw inducing differential thrusts $F_{\phi} = F_{4} - F_{2}$, $F_{\theta} = F_{3} - F_{1}$ and $F_{\psi} = F_{2} + F_{4} - F_{1} - F_{3}$, with limits $0 \leq T \leq 2mg$, $|F_{\phi/\theta}| \leq 0.25mg$ and $F_{\psi} \leq 0.125mg$. Policies for the rotor forces $F_{i}$ can be derived from the policies for $T$, $F_{\phi}$, $F_{\theta}$ and $F_{\psi}$ through a linear transformation. The quadcopter weighs 0.5 kg, has rotational inertia $I = \text{diag}([4.86\mathrm{e}{-3}, 4.86\mathrm{e}{-3}, 8.8\mathrm{e}{-3}])$ kgm$^2$, and wing length $l = 0.225$ m. Rotor moments are $M_{i} = k_{M} F_{i}$, where $k_{M} = 0.0383$. The dynamics are described in [26].

References

  • [1] D. P. Bertsekas, Dynamic programming and optimal control.   Belmont, MA: Athena Scientific, 1995, vol. 1, no. 2.
  • [2] E. Todorov and W. Li, “A generalized iterative lqg method for locally-optimal feedback control of constrained nonlinear stochastic systems,” in IEEE ACC, 2005.
  • [3] Y. Tassa, N. Mansard, and E. Todorov, “Control-limited differential dynamic programming,” in IEEE ICRA, 2014.
  • [4] C. G. Atkeson and C. Liu, “Trajectory-based dynamic programming,” in Modeling, simulation and optimization of bipedal walking.   Berlin, Heidelberg: Springer, 2013.
  • [5] R. Tedrake, I. R. Manchester, M. Tobenkin, and J. W. Roberts, “Lqr-trees: Feedback motion planning via sums-of-squares verification,” IJRR, vol. 29, no. 8, 2010.
  • [6] M. Zhong, M. Johnson, Y. Tassa, T. Erez, and E. Todorov, “Value function approximation and model predictive control,” in IEEE ADPRL, April 2013.
  • [7] M. Stilman, C. G. Atkeson, J. J. Kuffner, and G. Zeglin, “Dynamic programming in reduced dimensional spaces: Dynamic planning for robust biped locomotion,” in IEEE ICRA, 2005.
  • [8] J. Bouvrie and B. Hamzi, “Kernel methods for the approximation of nonlinear systems,” SIAM SICON, vol. 55, no. 4, 2017.
  • [9] N. Xue and A. Chakrabortty, “Optimal control of large-scale networks using clustering based projections,” arXiv:1609.05265, 2016.
  • [10] A. Alla, A. Schmidt, and B. Haasdonk, “Model order reduction approaches for infinite horizon optimal control problems via the hjb equation,” in Model Reduction of Parametrized Systems.   Berlin, Heidelberg: Springer, 2017.
  • [11] A. Khadke and H. Geyer, “Policy decomposition: Approximate optimal control with suboptimality estimates,” in IEEE-RAS Humanoids, 2020.
  • [12] M. Mitchell, An introduction to genetic algorithms.   MIT press, 1998.
  • [13] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton, “A survey of monte carlo tree search methods,” IEEE T-CIAIG, vol. 4, no. 1, 2012.
  • [14] M. Palanisamy, H. Modares, F. L. Lewis, and M. Aurangzeb, “Continuous-time q-learning for infinite-horizon discounted cost linear quadratic regulator problems,” IEEE Trans. Cybern., vol. 45, 2015.
  • [15] F. Glover and M. Laguna, Tabu Search.   Boston, MA: Springer, 1998.
  • [16] C. Blum and A. Roli, “Metaheuristics in combinatorial optimization: Overview and conceptual comparison,” ACM Comput. Surv., vol. 35, no. 3, Sept. 2003.
  • [17] C. D. Chapman, K. Saitou, and M. J. Jakiela, “Genetic Algorithms as an Approach to Configuration and Topology Design,” J. Mech. Des, vol. 116, no. 4, Dec. 1994.
  • [18] K. Stanley and R. Miikkulainen, “Evolving neural networks through augmenting topologies,” Evol. Comput., vol. 10, no. 2, 2002.
  • [19] B. Bouzy and T. Cazenave, “Computer go: An ai oriented survey,” Artificial Intelligence, vol. 132, no. 1, 2001.
  • [20] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms, Third Edition, 3rd ed.   The MIT Press, 2009.
  • [21] L. Kocsis and C. Szepesvári, “Bandit based monte-carlo planning,” in ECML.   Berlin, Heidelberg: Springer, 2006.
  • [22] W. C. Martin, A. Wu, and H. Geyer, “Experimental evaluation of deadbeat running on the atrias biped,” IEEE RAL, vol. 2, no. 2, 2017.
  • [23] K. Deb, “Multi-objective optimisation using evolutionary algorithms: an introduction,” in Multi-objective evolutionary optimisation for product design and manufacturing.   Springer, 2011.
  • [24] R. P. Stanley, “What is enumerative combinatorics?” in Enumerative combinatorics.   Boston, MA: Springer, 1986.
  • [25] N. Biggs, E. K. Lloyd, and R. J. Wilson, Graph Theory, 1736-1936.   Oxford University Press, 1986.
  • [26] T. Luukkonen, “Modelling and control of quadcopter,” Independent research project in applied mathematics, Espoo, vol. 22, 2011.