\tocauthor

Gennaro Notomista

¹¹institutetext: Department of Electrical and Computer Engineering
University of Waterloo
Waterloo, ON, Canada
¹¹email: gennaro.notomista@uwaterloo.ca

A Constrained-Optimization Approach to the
Execution of Prioritized Stacks of
Learned Multi-Robot Tasks

Gennaro Notomista

Abstract

This paper presents a constrained-optimization formulation for the prioritized execution of learned robot tasks. The framework lends itself to the execution of tasks encoded by value functions, such as tasks learned using the reinforcement learning paradigm. The tasks are encoded as constraints of a convex optimization program by using control Lyapunov functions. Moreover, an additional constraint is enforced in order to specify relative priorities between the tasks. The proposed approach is showcased in simulation using a team of mobile robots executing coordinated multi-robot tasks.

keywords:

Multi-robot motion coordination, Distributed control and planning, Learning and adaptation in teams of robots

1 Introduction

Learning complex robotic tasks can be challenging for several reasons. The nature of compound tasks, made up of several simpler subtasks, renders it difficult to simultaneously capture and combine all features of the subtasks to be learned. Another limiting factor of the learning process of compound tasks is the computational complexity of machine learning algorithms employed in the learning phase. This can make the training phase prohibitive, especially when the representation of the tasks comprises of a large number of parameters, as it is generally the case when dealing with complex tasks made up of several subtasks, or in the case of high-dimensional state space representations.

For these reasons, when there is an effective way of combining the execution of multiple subtasks, it is useful to break down complex tasks into building blocks that can be independently learned in a more efficient fashion. Besides the reduced computational complexity stemming from the simpler nature of the subtasks to be learned, this approach has the benefit of increasing the modularity of the task execution framework, by allowing for a reuse of the subtasks as building blocks for the execution of different complex tasks. Discussions and analyses of such advantages can be found, for instance, in [26, 9, 32, 16].

Along these lines, in [13], compositionality and incrementality are recognized to be two fundamental features of robot learning algorithms. Compositionality, in the context of learning to execute multiple tasks, is intended as the property of learning strategies to be in a form that allows them to be combined with previous knowledge. Incrementality, guarantees the possibility of adding new knowledge and abilities over time, by, for instance, incorporating new tasks. Several approaches have been proposed, which exhibit these two properties. Nevertheless, challenges still remain regarding tasks prioritization and stability guarantees [21, 25, 28, 34, 6]. The possibility of prioritizing tasks together with the stability guarantees allows us to characterize the behavior resulting from the composition of multiple tasks.

In fact, when dealing with redundant robotic systems—i.e. systems which possess more degrees of freedom compared to the minimum number required to execute a given task, as, for example, multi-robot systems—it is often useful to allow for the execution of multiple subtasks in a prioritized stack. Task priorities may allow robots to adapt to the different scenarios in which they are employed by exhibiting structurally different behaviors. Therefore, it is desirable that a multi-task execution framework allows for the prioritized execution of multiple tasks.

In this paper, we present a constrained-optimization robot-control framework suitable for the stable execution of multiple tasks in a prioritized fashion. This approach leverages the reinforcement learning (RL) paradigm in order to get an approximation of the value functions which will be used to encode the tasks as constraints of a convex quadratic program (QP). Owing to its convexity, the latter can be solved in polynomial time [3], and it is therefore suitable to be employed in a large variety of robotic applications, in online settings, even under real-time constraints. The proposed framework shares the optimization-based nature with the one proposed in [18] for redundant robotic manipulators, where, however, it is assumed that a representation for all tasks to be executed is known a priori. As will be discussed later in the paper, this framework indeed combines compositionality and incrementality—i.e. the abilities of combining and adding sub-tasks to build up compound tasks, respectively—with stable and prioritized task execution in a computationally efficient optimization-based algorithm.

Figure 1: Pictorial representation of the strategy adopted in this paper for the execution of prioritized stacks of learned tasks.

Figure 1 pictorially shows the strategy adopted in this paper to allow robots to execute multiple prioritized tasks learned using the RL paradigm. Once a value function is learned using the RL paradigm (using, e.g., the value iteration algorithm [2]), this learned value function is used to construct a control Lyapunov function [30] in such a way that a controller synthesized using a min-norm optimization program is equivalent to the optimal policy corresponding to the value function [20]. Then, multiple tasks encoded by constraints in a min-norm controller are combined in a prioritized stack as in [17].

To summarize, the contributions of this paper are the following: (i) We present a compositional and incremental framework for the execution of multiple tasks encoded by value functions; (ii) We show how priorities among tasks can be enforced in a constrained-optimization-based formulation; (iii) We frame the prioritized multi-task execution as a convex QP which can be efficiently solved in online settings; (iv) We demonstrate how the proposed framework can be employed to control robot teams to execute coordinated tasks.

2 Background and Related Work

2.1 Multi-Task Learning, Composition, and Execution

The prioritized execution framework for learned tasks proposed in this paper can be related to approaches devised for multi-task learning—a machine learning paradigm which aims at leveraging useful information contained in multiple related tasks to help improve the generalization performance of all the tasks [35]. The learning of multiple tasks can happen in parallel (independently) or in sequence for naturally sequential tasks [10, 29], and a number of computational frameworks have been proposed to learn multiple tasks (see, e.g., [35, 14, 24], and references therein). It is worth noticing how, owing to its constrained-optimization nature, the approach proposed in this paper is dual to multi-objective optimization frameworks, such as [27, 5] or compared to the Riemannian motion policies [23, 15, 22].

Several works have focused on the composition and hierarchy of deep reinforcement learning policies. The seminal work [33] shows compositionality for a specific class of value functions. More general value functions are considered in [12], where, however, there are no guarantees on the policy resulting from the multi-task learning process. Boolean and weighted composition of reward, (Q-)value functions, or policies are considered in [11, 19, 34]. While these works have shown their effectiveness on complex systems and tasks, our proposed approach differs from them in two main aspects: (i) It separates the task learning from the task composition; (ii) It allows for (possibly time-varying and state-dependent) task prioritization, with task stacks that are enforced at runtime.

2.2 Constraint-Based Task Execution

In this paper, we adopt a constrained-optimization approach to the prioritized execution of multiple tasks learned using the RL paradigm. In [17], a constraint-based task execution framework is presented for a robotic system with control affine dynamics

\dot{x}=f_{0}(x)+f_{1}(x)u,

(1)

where $x\in\mathscr{X}\subseteq\mathbb{R}^{n}$ and $u\in\mathscr{U}\subseteq\mathbb{R}^{m}$ denote state and control input, respectively. The $M$ tasks to be executed are encoded by continuously differentiable, positive definite cost functions $V_{i}\colon\mathscr{X}\to\mathbb{R}_{+},~{}i=1,\ldots,M$ . With the notation which will be adopted in this paper, the constraint-based task execution framework in [17] can be expressed as follows:

$\displaystyle\operatorname*{minimize~{}}_{u,\delta}$	$\displaystyle\\|u\\|^{2}+\kappa\\|\delta\\|^{2}$	(2)
$\displaystyle\operatorname*{subject\,to~{}}$	$\displaystyle L_{f_{0}}V_{i}(x)+L_{f_{1}}V_{i}(x)u\leq-\gamma(V_{i}(x))+\delta_{i}\quad i=1,\ldots,M$
	$\displaystyle K\delta\geq 0,$

where $L_{f_{0}}V_{i}(x)$ and $L_{f_{1}}V_{i}(x)$ are the Lie derivatives of $V_{i}$ along the vector fields $f_{0}$ and $f_{1}$ , respectively. The components of $\delta=[\delta_{1},\ldots,\delta_{M}]^{T}$ are used as slack variables employed to prioritize the different tasks; $\gamma\colon\mathbb{R}\to\mathbb{R}$ is a Lipschitz continuous extended class $\mathcal{K}$ function—i.e. a continuous, monotonically increasing function, with $\gamma(0)=0$ — $\kappa>0$ is an optimization parameter, and $K$ is the prioritization matrix, known a priori, which enforces relative constraints between components of $\delta$ of the following type: $\delta_{i}\leq l\delta_{j}$ , for $l\ll 1$ , which encodes the fact that task $i$ is executed at higher priority than task $j$ .

In the following, Section 2.3 will be devoted to showing the connection between dynamic programming and optimization-based controllers. In Section 3, this connection will allow us to execute tasks learned using the RL paradigm by means of a formulation akin to (2).

2.3 From Dynamic Programming to Constraint-Driven Control

To illustrate how controllers obtained using dynamic programming can be synthesized as the solution of an optimization program, consider a system with the following discrete-time dynamics:

x_{k+1}=f(x_{k},u_{k}).

(3)

These dynamics can be obtained, for instance, by (1), through a discretization process. In (3), $x_{k}$ denotes the state, $u_{k}\in\mathscr{U}_{k}(x_{k})$ the input, and the input set $\mathscr{U}_{k}(x_{k})$ may depend in general on the time $k$ and the state $x_{k}$ . The value iteration algorithm to solve a deterministic dynamic programming problem with no terminal cost can be stated as follows [2]:

J_{k+1}(x_{k})=\min_{u_{k}\in\mathscr{U}_{k}(x_{k})}\bigg{\{}g_{k}(x_{k},u_{k})+J_{k}(f_{k}(x_{k},u_{k}))\bigg{\}},

(4)

with $J_{0}(x_{0})=0$ , where $x_{0}$ is the initial state, and $g_{k}(x_{k},u_{k})$ is the cost incurred at time $k$ . The total cost accumulated along the system trajectory is given by

J(x_{0})=\lim_{N\to\infty}\sum_{k=0}^{N-1}\alpha^{k}g_{k}(x_{k},u_{k}).

(5)

In this paper, we will consider $\alpha=1$ and we will assume there exists a cost-free termination state.¹¹1Problems of this class are referred to as shortest path problems in [2].

By Proposition 4.2.1 in [2] the value iteration algorithm (4) converges to $J^{\star}$ satisfying

J^{\star}(x)=\min_{u\in\mathscr{U}(x)}\bigg{\{}g(x,u)+J^{\star}(f(x,u))\bigg{\}}.

(6)

Adopting an approximation scheme in value space, $J^{\star}$ can be replaced by its approximation $\tilde{J}^{\star}$ by solving the following approximate dynamic programming algorithm:

\tilde{J}_{k+1}(x_{k})=\min_{u_{k}\in\mathscr{U}_{k}(x_{k})}\bigg{\{}g_{k}(x_{k},u_{k})+\tilde{J}_{k}(f_{k}(x_{k},u_{k}))\bigg{\}}.

In these settings, deep RL algorithms can be leveraged to find parametric approximations, $\tilde{J}^{\star}$ , of the value function using neural networks. This will be the paradigm considered in this paper in order to approximate value functions encoding the tasks to be executed in a prioritized fashion.

The bridge between dynamic programming and constraint-driven control is optimal control. In fact, the cost in (5) is typically considered in optimal control problems, recalled, in the following, for the continuous time control affine system (1):

	$\displaystyle\operatorname*{minimize~{}}_{u(\cdot)}$	$\displaystyle\int_{0}^{\infty}\left(q(x(t))+u(t)^{T}u(t)\right)\mathrm{d}t$		(7)
	$\displaystyle\operatorname*{subject\,to~{}}$	$\displaystyle\dot{x}=f_{0}(x)+f_{1}(x)u.$		(7)

Comparing (7) with (5), we recognize that the instantaneous cost $g(x,u)$ in (5) in the context of the optimal control problem (7) corresponds to $q(x)+u^{T}u$ , where $q\colon\mathscr{X}\to\mathbb{R}$ is a continuously differentiable and positive definite function.

A dynamic programming argument on (7) leads to the following Hamilton-Jacobi-Bellman equation:

L_{f_{0}}J^{\star}(x)-\frac{1}{4}L_{f_{1}}J^{\star}(x)\left(L_{f_{1}}J^{\star}(x)\right)^{T}+q(x)=0,

where $J^{*}$ is the value function—similar to (6) for continuous-time problems—representing the minimum cost-to-go from state $x$ , defined as

J^{\star}(x)=\min_{u(\cdot)}\int_{t}^{\infty}\left(q(x(\tau))+u(\tau)^{T}u(\tau)\right)\mathrm{d}\tau.

(8)

The optimal policy corresponding to the optimal value function (8) can be evaluated as follows [4]:

u^{\star}=-\frac{1}{2}\left(L_{f_{1}}J^{\star}(x)\right)^{T}.

(9)

In order to show how the optimal policy $u^{\star}$ in (9) can be obtained using an optimization-based formulation, we now recall the concept of control Lyapunov functions.

Definition 2.1 (Control Lyapunov function [30]).

A continuously differentiable, positive definite function $V\colon\mathbb{R}^{n}\to\mathbb{R}$ is a control Lyapunov function (CLF) for the system (1) if, for all $x\neq 0$

\inf_{u}\bigg{\{}L_{f_{0}}V(x)+L_{f_{1}}V(x)u\bigg{\}}<0.

(10)

To select a control input $u$ which satisfies the inequality (10), a universal expression—known as the Sontag’s formula [31]—can be employed. With the aim of encoding the optimal control input $u^{\star}$ by means of a CLF, we will consider the following modified Sontag’s formula originally proposed in [7]:

u(x)=\begin{cases}-v(x)\left(L_{f_{1}}V(x)\right)^{T}&\mathrm{if}~{}L_{f_{1}}V(x)\neq 0\\ 0&\mathrm{otherwise},\\ \end{cases}

(11)

where $v(x)=\frac{L_{f_{0}}V(x)+\sqrt{\left(L_{f_{0}}V(x)\right)^{2}+q(x)L_{f_{1}}V(x)\left(L_{f_{1}}V(x)\right)^{T}}}{L_{f_{1}}V(x)\left(L_{f_{1}}V(x)\right)^{T}}$ .

As shown in [20], the modified Sontag’s formula (11) is equivalent to the solution of the optimal control problem (7) if the following relation between the CLF $V$ and the value function $J^{\star}$ holds:

\frac{\partial J^{\star}}{\partial x}=\lambda(x)\frac{\partial V}{\partial x},

(12)

where $\lambda(x)=2v(x)\left(L_{f_{1}}V(x)\right)^{T}$ . The relation in (12) corresponds to the fact that the level sets of the CLF $V$ and those of the value function $J^{\star}$ are parallel.

The last step towards the constrained-optimization-based approach to generate optimal control policies is to recognize the fact that, owing to its inverse optimality property [7], the modified Sontag’s formula (11) can be obtained using the following constrained-optimization formulation, also known as the pointwise min-norm controller:

	$\displaystyle\operatorname*{minimize~{}}_{u}$	$\displaystyle\\|u\\|^{2}$		(13)
	$\displaystyle\operatorname*{subject\,to~{}}$	$\displaystyle L_{f_{0}}V(x)+L_{f_{1}}V(x)u\leq-\sigma(x),$		(13)

where $\sigma(x)=\sqrt{\left(L_{f_{0}}V(x)\right)^{2}+q(x)L_{f_{1}}V(x)\left(L_{f_{1}}V(x)\right)^{T}}$ . This formulation shares the same optimization structure with the one introduced in (2) in Section 2, and in the next section we will provide a formulation which strengthens the connection with approximate dynamic programming.

In Appendix A, additional results are reported, which further illustrate the theoretical equivalence discussed in this section, by comparing the optimal controller, the optimization-based controller, and a policy learned using the RL framework for a simple dynamical system.

3 Prioritized Multi-Task Execution

When $V=\tilde{J}^{\star}$ , the min-norm controller solution of (13) is the optimal policy which would be learned using a deep RL algorithm. This is what allows us to bridge the gap between constraint-driven control and RL and it is the key to execute tasks learned using the RL paradigm in a compositional, incremental, prioritized, and computationally-efficient fashion.

Following the formulation given in (2), the multi-task prioritized execution of tasks learned using RL can be implemented executing the control input solution of the following optimization program:

$\displaystyle\operatorname*{minimize~{}}_{u,\delta}$	$\displaystyle\\|u\\|^{2}+\kappa\\|\delta\\|^{2}$	(14)
$\displaystyle\operatorname*{subject\,to~{}}$	$\displaystyle\frac{1}{\lambda_{i}(x)}\left(L_{f_{0}}\tilde{J}_{i}^{\star}(x)+L_{f_{1}}\tilde{J}_{i}^{\star}(x)u\right)\leq-\sigma_{i}(x)+\delta_{i},\quad i=1,\ldots,M$
	$\displaystyle K\delta\geq 0$

where $\tilde{J}_{1}^{\star},\ldots,\tilde{J}_{M}^{\star}$ are the approximated value functions encoding the tasks learned using the RL paradigm (e.g. value iteration). In summary, with the RL paradigm, one can get the approximate value functions $\tilde{J}_{1}^{\star},\ldots,\tilde{J}_{M}^{\star}$ ; the robotic system is then controlled using the control input solution of (14) in order to execute these tasks in a prioritized fashion.

Remark 3.1.

The Lie derivatives $L_{f_{0}}\tilde{J}_{1}^{\star}(x),\ldots,L_{f_{0}}\tilde{J}_{M}^{\star}(x)$ contain the gradients $\frac{\partial J_{1}^{\star}}{\partial x},\ldots,\frac{\partial J_{M}^{\star}}{\partial x}$ . When $\tilde{J}_{1}^{\star}(x),\ldots,\tilde{J}_{M}^{\star}(x)$ are approximated using neural networks, these gradients can be efficiently computed using back propagation.

We conclude this section with the following Proposition 3.2, which ensures the stability of the prioritized execution of multiple tasks encoded through the value functions $\tilde{J}_{i}^{\star}$ by a robotic system modeled by the dynamics (1) and controlled with control input solution of (14).

Proposition 3.2 (Stability of multiple prioritized learned tasks).

Consider executing a set of $M$ prioritized tasks encoded by approximate value functions $\tilde{J}^{\star}_{i},~{}i=1,\ldots,M$ , by solving the optimization problem in (14). Assume the following:

1.

All constraints in (14) are active
2.

The robotic system can be modeled by driftless control affine dynamical system, i.e. $f_{0}(x)=0~{}\forall x\in\mathscr{X}$
3.

The instantaneous cost function $g$ used to learn the tasks is positive for all $x\in\mathscr{X}$ and $u\in\mathscr{U}$ .

Then,

\begin{bmatrix}\tilde{J}_{1}^{\star}(x(t))\\ \vdots\\ \tilde{J}_{M}^{\star}(x(t))\end{bmatrix}\to\mathcal{N}(K),

as $t\to\infty$ , where $\mathcal{N}(K)$ denotes the null space of the prioritization matrix $K$ . That is, the tasks will be executed according to the priorities specified by the prioritization matrix $K$ in (2).

Proof 3.3.

The Lagrangian associated with the optimization problem (14) is given by $L(u,\delta)=\|u\|^{2}+\kappa\|\delta\|^{2}+\eta_{1}^{T}\left(\hat{f}_{0}(x)+\hat{f}_{1}(x)u+\sigma(x)-\delta\right)+\eta_{2}^{T}(-K\delta)$ , where $\hat{f}_{0}(x)\in\mathbb{R}^{M}$ and $\hat{f}_{1}(x)\in\mathbb{R}^{M\times m}$ defined as follows: the $i$ -th component of $\hat{f}_{0}(x)$ is equal to $\frac{1}{\lambda_{i}(x)}L_{f_{0}}\tilde{J}_{i}^{\star}(x)$ , while the $i$ -th row of $\hat{f}_{i}(x)$ is equal to $\frac{1}{\lambda_{i}(x)}L_{f_{1}}\tilde{J}_{i}^{\star}(x)$ . $\eta_{1}$ and $\eta_{2}$ are the Lagrange multipliers corresponding to the task and prioritization constraints, respectively.

From the KKT conditions, we obtain:

u=-\frac{1}{2}\begin{bmatrix}\hat{f}_{1}(x)^{T}&0\end{bmatrix}\eta\qquad\delta=\frac{1}{2\kappa}\begin{bmatrix}I&-K^{T}\end{bmatrix}\eta,

(15)

where $\eta=[\eta_{1}^{T},\eta_{2}^{T}]^{T}$ . By resorting to the Lagrange dual problem, and by using assumption 1, we get the following expression for $\eta$ :

\eta=2\underbrace{\begin{bmatrix}\frac{I}{\kappa}+\hat{f}_{1}(x)\hat{f}_{1}(x)^{T}&-K^{T}\\ K&KK^{T}\end{bmatrix}^{-1}}_{A_{1}^{-1}}\underbrace{\begin{bmatrix}\hat{f}_{0}(x)+\sigma(x)\\ 0\end{bmatrix}}_{b_{0}},

(16)

where $I$ denotes an identity matrix of appropriate size. Substituting (16) in (15), we get $u=-\begin{bmatrix}\hat{f}_{1}(x)^{T}&0\end{bmatrix}A_{1}^{-1}b_{0}$ and $\delta=\frac{1}{\kappa}\begin{bmatrix}I&-K^{T}\end{bmatrix}A_{1}^{-1}b_{0}$ .

To show the claimed stability property, we will proceed by a Lyapunov argument. Let us consider the Lyapunov function candidate $V(x)=\frac{1}{2}\tilde{J}(x)^{\star T}K^{T}K\tilde{J}^{\star}(x)$ , where $\tilde{J}^{\star}(x)=\begin{bmatrix}\tilde{J}_{1}^{\star}(x)&\ldots&\tilde{J}_{M}^{\star}(x)\end{bmatrix}^{T}$ . The time derivative of $V$ evaluates to:

	$\displaystyle\dot{V}$	$\displaystyle=\frac{\partial V}{\partial x}\dot{x}=\tilde{J}(x)^{\star T}K^{T}K\frac{\partial\tilde{J}^{\star}}{\partial x}\dot{x}$
		$\displaystyle=\tilde{J}(x)^{\star T}K^{T}K\underbrace{\frac{\partial\tilde{J}^{\star}}{\partial x}f_{1}(x)}_{\hat{f}_{1,\lambda}(x)}u\qquad\text{(by assumption 2)},$

where, notice that $\hat{f}_{1,\lambda}(x)=\Lambda(x)\hat{f}_{1}(x)\in\mathbb{R}^{M}$ and $\Lambda(x)=\mathrm{diag}\left(\left[\lambda_{1}(x),\ldots,\lambda_{M}(x)\right]\right)$ . By assumption 2, $\lambda_{i}(x)\geq 0$ for all $i$ , and therefore $\Lambda(x)\succeq 0$ , i.e. $\Lambda(x)$ is positive semidefinite. Then, $\dot{V}=\tilde{J}(x)^{\star T}K^{T}K\Lambda(x)\hat{f}_{1}(x)u=-\tilde{J}(x)^{\star T}K^{T}K\Lambda(x)\hat{A}\begin{bmatrix}\sigma(x)\\ 0\end{bmatrix}$ , where

\hat{A}=\hat{f}_{1}(x)\begin{bmatrix}\hat{f}_{1}(x)^{T}&0\end{bmatrix}\begin{bmatrix}\frac{I}{\kappa}+\hat{f}_{1}(x)\hat{f}_{1}(x)^{T}&-K^{T}\\ K&KK^{T}\end{bmatrix}^{-1}\succeq 0

(17)

as in Proposition 3 in [17], and we used assumption 2 to simplify the expression of $b_{0}$ .

By assumption 3, it follows that the value functions $\tilde{J}^{\star}_{i}$ are positive definite. Therefore, from the definition of $\sigma$ , in a neighborhood of $0\in\mathbb{R}^{M}$ , we can bound $\sigma(x)$ —defined by the gradients of $\tilde{J}^{\star}$ —by the value of $\tilde{J}^{\star}$ as $\sigma(x)=\gamma_{J}(\tilde{J}^{\star}(x))$ , where $\gamma_{J}$ is a class $\mathcal{K}$ function.

Then, proceeding similarly to Proposition 3 in [17], we can bound $\dot{V}$ as follows: $\dot{V}=-\tilde{J}(x)^{\star T}K^{T}K\Lambda(x)\hat{A}\gamma_{J}(\tilde{J}(x))\leq-\tilde{J}(x)^{\star T}K^{T}K\Lambda(x)\tilde{J}(x)\leq-\bar{\lambda}V(x)$ , where $\bar{\lambda}=\min\{\lambda_{1}(x),\ldots,\lambda_{M}(x)\}$ . Hence, $K\tilde{J}^{\star}(x(t))\to 0$ as $t\to\infty$ , and $\tilde{J}^{\star}(x(t))\to\mathscr{N}(K)$ as $t\to\infty$ .

Remark 3.4.

The proof of Proposition 3.2 can be carried out even in case of time-varying and state-dependent prioritization matrix $K$ . Under the assumption that $K$ is bounded and continuously differentiable for all $x$ and uniformly in time, the norm and the gradient of $K$ can be bounded in order to obtain an upper bound for $\dot{V}$ .

Remark 3.5.

Even when the prioritization stack specified through the matrix $K$ in (14) is not physically realizable—due to the fact that, for instance, the functions encoding the tasks cannot achieve the relative values prescribed by the prioritization matrix—the optimization program will still be feasible. Nevertheless, the tasks will not be executed with the desired priorities and even the execution of high-priority tasks might be degraded.

4 Experimental Results

In this section, the proposed framework for the execution of prioritized stacks of tasks is showcased in simulation using a team of mobile robots. Owing to the multitude of robotic units of which they are comprised, multi-robot systems are often highly redundant with respect to the tasks they have to execute. Therefore, they perfectly lend themselves to the concurrent execution of multiple prioritized tasks.

4.1 Multi-Robot Tasks

For multi-robot systems, the redundancy stems from the multiplicity of robotic units of which the system is comprised. In this section, we will showcase the execution of dependent tasks—two tasks are dependent if executing one prevents the execution of the other [1]—in different orders of priority. The multi-robot system is comprised of 6 planar robots modeled with single integrator dynamics and controlled to execute the following 4 tasks: All robots assemble an hexagonal formation (task $T_{1}$ ), robot 1 goes to goal point 1 (task $T_{2}$ ), robot 2 goes to goal point 2 (task $T_{3}$ ), robot 3 goes to goal point 3 (task $T_{4}$ ). While Task 1 is independent of each of the other tasks taken singularly, it is not independent of any pair of tasks 2, 3, and 4. This intuitively corresponds to the fact that it is possible to form a hexagonal formation in different points in space, but it might not be feasible to form a hexagonal formation while two robots are constrained to be in two pre-specified arbitrary locations.

Figure 2 reports a sequence of snapshots and the graph of the value functions encoding the four tasks recorded during the course of the experiment. Denoting by $T_{i}\prec T_{j}$ the condition under which task $T_{i}$ has priority higher than $T_{j}$ , the sequence of prioritized stacks tested in the experiment are the following:

\begin{cases}T_{2},T_{3},T_{4}\prec T_{1}&\quad 0s\leq t<15s\\ T_{1}\prec T_{2},T_{3},T_{4}&\quad 15s\leq t<30s\\ T_{1}\prec T_{2}\prec T_{3},T_{4}&\quad 30s\leq t\leq 45s.\end{cases}

(18)

The plot of the value functions in Fig. 2l shows how, for $0s\leq t<15s$ , since the hexagonal formation control algorithm has lower priority compared to the three go-to-goal tasks, its value function $\tilde{J}^{\star}_{1}$ is allowed to grow while the other three value functions are driven to 0 by the velocity control input solution of (14) supplied to the robots. For $15s\leq t<30s$ , the situation is reversed: the hexagonal formation control is executed with highest priority while the value functions encoding the three go-to-goal tasks are allowed to grow—a condition which corresponds to the non-execution of the tasks. Finally, for $30s\leq t\leq 45s$ , task $T_{2}$ , i.e. go-to-goal task for robot 1 to goal point 1 is added at higher priority with respect to tasks $T_{3}$ and $T_{4}$ . Since this is independent by task $T_{1}$ , it can be executed at the same time. As a result, as can be seen from the snapshots, the formation translates towards the red point marked with 1. Tasks $T_{1}$ and $T_{2}$ are successfully executed while tasks $T_{3}$ and $T_{4}$ are not executed since are not independent by the first two and they have lower priority.

Remark 4.1.

The optimization program responsible for the execution of multiple prioritized tasks encoded by value functions is solved at each iteration of the robot control loop. This illustrates how the convex optimization formulation of the developed framework is computationally efficient and therefore amenable to be employed in online settings. Alternative approaches for task prioritization and allocation in the context of multi-robot systems generally result in (mixed-)integer optimization programs, which are often characterized by a combinatorial nature and are not always suitable for an online implementation [8].

4.2 Discussion

The experiments of the previous section highlight several amenable properties of the framework developed in this paper for the prioritized execution of tasks encoded by a value function. First of all, its compositionality is given by the fact that tasks can easily be inserted and removed by adding and removing constraints from the optimization program (14). For the same reason the framework is incremental and modular as it allows for building a complex task using a number of subtasks which can be incrementally added to the constraints of an optimization-based controller. Moreover, it allows for seamless incorporation of priorities among tasks, and, as we showcased in Section 4.1, these priority can also be switched in an online fashion, in particular without the need of stopping and restarting the motion of the robots. Furthermore, Proposition 3.2 shows that the execution of multiple tasks using the constraint-driven control is stable and the robotic system will indeed execute the given tasks according to the specified priorities. Finally, as the developed optimization program is a convex QP, its low computational complexity allows for an efficient implementation in online settings even under real-time constraints on computationally limited robotic platforms.

5 Conclusion

In this paper, we presented an optimization-based framework for the prioritized execution of multiple tasks encoded by value functions. The approach combines control-theoretic and learning techniques in order to exhibit properties of compositionality, incrementality, stability, and low computational complexity. These properties render the proposed framework suitable for online and real-time robotic implementations. A multi-robot simulated scenario illustrated its effectiveness in the control of a redundant robotic system executing a prioritized stack of tasks.

Appendix A Comparison Between Optimal Control, Optimization-Based Control, and RL policy

To compare optimal controller, optimization-based controller, and RL policy, in this section, we consider the stabilization of a double integrator system to the origin. The system dynamics are given by: $\dot{x}=\begin{bmatrix}0&1\\ 0&0\end{bmatrix}x+\begin{bmatrix}0\\ 1\end{bmatrix}u$ , where $x=[x_{1},x_{2}]^{T}\in\mathbb{R}^{2}$ and $u\in\mathbb{R}$ . The instantaneous cost considered in the optimal control problem (7) is given by $q(x)+u^{2}$ where $q(x)=x^{T}x$ . The reward function of the value iteration algorithm employed to learn an approximate representation of the value function has been set to $g(x,u)=-q(x)-u^{2}$ , and the resulting value function $\tilde{J}^{\star}$ has been shifted so that $\tilde{J}^{\star}(0)=0$ .

The results of the comparison are reported in Fig. 3. Here, the optimization-based controller solution of (13) with $V=\tilde{J}^{\star}$ is compared to the optimal controller given in (9), and the RL policy corresponding to the approximate value function $\tilde{J}^{\star}$ . As can be seen, the optimization-based controller and the optimal controller coincide, while the RL policy becomes closer and closer as the number of training epochs increases.

Appendix B Implementation Details

The results reported in Section 4 have been obtained using a custom value function learning algorithm written in Python. The details of each multi-robot task are given in the following.

Each robot in the team of $N$ robots is modeled using single integrator dynamics $\dot{x}_{i}=u_{i}$ , where $x_{i},u_{i}\in\mathbb{R}^{2}$ are position and velocity input of robot $i$ . The ensemble state and input will be denoted by $x$ and $u$ , respectively. For the formation control task, the expression of the cost $g$ is given by $g(x,u)=1000-0.01(-\mathcal{E}(x)-10\|u\|^{2})$ , where the value of $\mathcal{E}(x)$ is the formation energy defined as $\mathcal{E}(x)=\sum_{i=1}^{N}\sum_{j\in\mathscr{N}_{i}}(\|x_{i}-x_{j}\|^{2}-W_{ij}^{2})^{2}$ , $\mathscr{N}_{i}$ being the neighborhood of robot $i$ , i.e. the set of robots with which robot $i$ shares an edge, and

W=\begin{bmatrix}0&l&\sqrt{3}l&2l&0&l\\ l&0&l&0&2l&0\\ \sqrt{3}l&l&0&l&0&2l\\ 2l&0&l&0&l&0\\ 0&2l&0&l&0&l\\ l&0&2l&0&l&0\end{bmatrix}

(19)

with $l=1$ . The entry $ij$ of the matrix $W$ corresponds to the desired distance to be maintained between robots $i$ and $j$ .

The cost function $g$ for the go-to-goal tasks is given by $g(x,u)=100-0.01(-\|x-\hat{x}\|^{2}-\|u\|^{2})$ , where $\hat{x}\in\mathbb{R}^{2}$ is the desired goal point.

Remark B.1 (Combination of single-robot and multi-robot tasks).

Single-robot tasks (e.g. the go-to-goal tasks considered in this paper) are combined with multi-robot tasks (e.g. the formation control task) by defining the task gradient required to compute $L_{f_{0}}\tilde{J}_{i}^{\star}(x)$ and $L_{f_{1}}\tilde{J}_{i}^{\star}(x)$ in the optimization program (14) in the following way: $\frac{\partial\tilde{J}_{i}^{\star}}{\partial x}=\begin{bmatrix}0&\cdots&0&\frac{\partial\tilde{J}_{ij}^{\star}}{\partial x}&0&\cdots&0\end{bmatrix}$ , where the $j$ -th entry $\tilde{J}_{ij}^{\star}$ is the approximate value function for task $i$ and robot $j$ .

References

[1] Antonelli, G.: Stability analysis for prioritized closed-loop inverse kinematic algorithms for redundant robotic systems. IEEE Transactions on Robotics 25(5), 985–994 (2009). 10.1109/TRO.2009.2017135
[2] Bertsekas, D.P.: Reinforcement learning and optimal control. Athena Scientific Belmont, MA (2019)
[3] Boyd, S., Vandenberghe, L.: Convex optimization. Cambridge University Press (2004)
[4] Bryson, A.E., Ho, Y.C.: Applied optimal control: optimization, estimation, and control. Routledge (2018)
[5] Bylard, A., Bonalli, R., Pavone, M.: Composable geometric motion policies using multi-task pullback bundle dynamical systems. arXiv preprint arXiv:2101.01297 (2021)
[6] Dulac-Arnold, G., Mankowitz, D., Hester, T.: Challenges of real-world reinforcement learning. arXiv preprint arXiv:1904.12901 (2019)
[7] Freeman, R.A., Primbs, J.A.: Control lyapunov functions: New ideas from an old source. In: Proceedings of 35th IEEE Conference on Decision and Control, vol. 4, pp. 3926–3931. IEEE (1996)
[8] Gerkey, B.P., Matarić, M.J.: A formal analysis and taxonomy of task allocation in multi-robot systems. The International journal of robotics research 23(9), 939–954 (2004)
[9] Ghosh, D., Singh, A., Rajeswaran, A., Kumar, V., Levine, S.: Divide-and-conquer reinforcement learning. arXiv preprint arXiv:1711.09874 (2017)
[10] Gupta, A., Yu, J., Zhao, T.Z., Kumar, V., Rovinsky, A., Xu, K., Devlin, T., Levine, S.: Reset-free reinforcement learning via multi-task learning: Learning dexterous manipulation behaviors without human intervention. arXiv preprint arXiv:2104.11203 (2021)
[11] Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., Levine, S.: Composable deep reinforcement learning for robotic manipulation. In: 2018 IEEE international conference on robotics and automation (ICRA), pp. 6244–6251. IEEE (2018)
[12] Haarnoja, T., Tang, H., Abbeel, P., Levine, S.: Reinforcement learning with deep energy-based policies. In: International Conference on Machine Learning, pp. 1352–1361. PMLR (2017)
[13] Kaelbling, L.P.: The foundation of efficient robot learning. Science 369(6506), 915–916 (2020)
[14] Micchelli, C.A., Pontil, M.: Kernels for multi–task learning. In: NIPS, vol. 86, p. 89. Citeseer (2004)
[15] Mukadam, M., Cheng, C.A., Fox, D., Boots, B., Ratliff, N.: Riemannian motion policy fusion through learnable lyapunov function reshaping. In: Conference on Robot Learning, pp. 204–219. PMLR (2020)
[16] Nachum, O., Gu, S., Lee, H., Levine, S.: Data-efficient hierarchical reinforcement learning. arXiv preprint arXiv:1805.08296 (2018)
[17] Notomista, G., Mayya, S., Hutchinson, S., Egerstedt, M.: An optimal task allocation strategy for heterogeneous multi-robot systems. In: 2019 18th European Control Conference (ECC), pp. 2071–2076. IEEE (2019)
[18] Notomista, G., Mayya, S., Selvaggio, M., Santos, M., Secchi, C.: A set-theoretic approach to multi-task execution and prioritization. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 9873–9879. IEEE (2020)
[19] Peng, X.B., Chang, M., Zhang, G., Abbeel, P., Levine, S.: Mcp: Learning composable hierarchical control with multiplicative compositional policies. arXiv preprint arXiv:1905.09808 (2019)
[20] Primbs, J.A., Nevistić, V., Doyle, J.C.: Nonlinear optimal control: A control lyapunov function and receding horizon perspective. Asian Journal of Control 1(1), 14–24 (1999)
[21] Qureshi, A.H., Johnson, J.J., Qin, Y., Henderson, T., Boots, B., Yip, M.C.: Composing task-agnostic policies with deep reinforcement learning. arXiv preprint arXiv:1905.10681 (2019)
[22] Rana, M.A., Li, A., Ravichandar, H., Mukadam, M., Chernova, S., Fox, D., Boots, B., Ratliff, N.: Learning reactive motion policies in multiple task spaces from human demonstrations. In: Conference on Robot Learning, pp. 1457–1468. PMLR (2020)
[23] Ratliff, N.D., Issac, J., Kappler, D., Birchfield, S., Fox, D.: Riemannian motion policies. arXiv preprint arXiv:1801.02854 (2018)
[24] Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098 (2017)
[25] Sahni, H., Kumar, S., Tejani, F., Isbell, C.: Learning to compose skills. arXiv preprint arXiv:1711.11289 (2017)
[26] Schwartz, A., Thrun, S.: Finding structure in reinforcement learning. Advances in neural information processing systems 7, 385–392 (1995)
[27] Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. arXiv preprint arXiv:1810.04650 (2018)
[28] Singh, S.P.: Transfer of learning by composing solutions of elemental sequential tasks. Machine Learning 8(3), 323–339 (1992)
[29] Smith, V., Chiang, C.K., Sanjabi, M., Talwalkar, A.: Federated multi-task learning. arXiv preprint arXiv:1705.10467 (2017)
[30] Sontag, E.D.: A lyapunov-like characterization of asymptotic controllability. SIAM journal on control and optimization 21(3), 462–471 (1983)
[31] Sontag, E.D.: A ’universal’ construction of artstein’s theorem on nonlinear stabilization. Systems & control letters 13(2), 117–123 (1989)
[32] Teh, Y.W., Bapst, V., Czarnecki, W.M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N., Pascanu, R.: Distral: Robust multitask reinforcement learning. arXiv preprint arXiv:1707.04175 (2017)
[33] Todorov, E.: Compositionality of optimal control laws. Advances in neural information processing systems 22, 1856–1864 (2009)
[34] Van Niekerk, B., James, S., Earle, A., Rosman, B.: Composing value functions in reinforcement learning. In: International Conference on Machine Learning, pp. 6401–6409. PMLR (2019)
[35] Zhang, Y., Yang, Q.: A survey on multi-task learning. IEEE Transactions on Knowledge and Data Engineering (2021)

A Constrained-Optimization Approach to the Execution of Prioritized Stacks of Learned Multi-Robot Tasks