
OGMP: Oracle Guided Multi-mode Policies for
Agile and Versatile Robot Control

Lokesh Krishna, Nikhil Sobanbabu, and Quan Nguyen. All authors are with the Dynamic Robotics and Control Laboratory, University of Southern California, Los Angeles, CA 90089, USA. lkrajan@usc.edu, ns_562@usc.edu, quann@usc.edu
Abstract

The efficacy of reinforcement learning for robot control relies on the tailored integration of task-specific priors and heuristics for effective exploration, which challenges their straightforward application to complex tasks and necessitates a unified approach. In this work, we define a general class for priors called oracles that generate state references when queried in a closed-loop manner during training. By bounding the permissible state around the oracle’s ansatz, we propose a task-agnostic oracle-guided policy optimization. To enhance modularity, we introduce task-vital modes, showing that a policy mastering a compact set of modes and transitions can handle infinite-horizon tasks. For instance, to perform parkour on an infinitely long track, the policy must learn to jump, leap, pace, and transition between these modes effectively. We validate this approach in challenging bipedal control tasks, parkour and diving, using a 16-DoF dynamic bipedal robot, HECTOR. Our method results in a single policy per task, solving parkour across diverse tracks and omnidirectional diving from varied heights up to 2 m in simulation, showcasing versatile agility. We demonstrate successful sim-to-real transfer of parkour, including leaping over gaps up to 105% of the leg length, jumping over blocks up to 20% of the robot’s nominal height, and pacing at speeds of up to 0.6 m/s, along with effective transitions between these modes on the real robot.

I Introduction

Deep reinforcement learning (RL) has shown remarkable success in synthesizing control policies for hybrid and underactuated legged robots [1], particularly in enabling inherently stable quadrupedal robots to achieve extreme parkour [2, 3, 4], agile [5], and robust [6] locomotion. Following a common philosophy: 1) define an exhaustive observation space, 2) engineer task-specific rewards and/or curriculum, 3) perform policy distillation, and 4) extensively randomize, these methods rely on task-specific tricks in each step, lacking a systematic approach for robot control. Specifically, since Deep RL methods are quasi-solvers for unconstrained optimization, they are prone to anomalous, case-specific local optima. Hence, practitioners often resort to task-specific reward shaping [2, 5, 6] and heuristics [3] for a structured exploration and to meet the intended performance. Furthermore, the established approach for robot control is privileged training and policy distillation: training teacher policies [3, 4] with privileged information, solving a pseudo-MDP with RL, and distilling it into a single policy for the true POMDP. In contrast, we aim to find an optimal policy by structured exploration in the true POMDP guided by priors.

Figure 1: Overview of OGMP: oracle-guided policy optimization and the applied tasks visualized. Trained OGMPs: \pi_{\text{parkour}} performing agile parkour in simulation and on the real robot, and \pi_{\text{dive}} performing a front-flip dive from a 2 m high platform in simulation. Accompanying video results: https://youtu.be/69SVc-43Oqg?si=w4r3i67oBaoThLN7

Guided Policy Optimization (GPO) aims to improve sample efficiency and mitigate poor local optima. GPO methods are either control-guided or reference-guided. Control-guided approaches require control trajectories: [7, 8] employ prior controllers and policy-trajectory constraints, while [9, 10] alternate trajectory optimization and minimize RL variance. [11, 12, 13] demonstrate quadrupedal locomotion but rely on pre-existing model-based controllers, limiting their applicability to complex tasks (e.g., parkour). Reference-guided methods like [14, 15, 16, 17, 18] use morphologically similar state-reference trajectories to guide RL toward the corresponding optimal actions for character control. However, with pre-generated open-loop references, policy exploration is confined to the demonstration’s scenario, hindering the emergent behaviors (like recovery) seen in from-scratch RL methods [1, 3, 2], which explore full-order dynamics and challenging randomization in simulation, crucial for real robot control. Therefore, we propose a reference-guided policy optimization using closed-loop state-reference generators (oracles) that can be queried dynamically to produce references from any state, with a novel hyperparameter to address local optima in complex tasks.

Alternatively, imitation learning (IL) has proven to be a reliable task-agnostic strategy for robot manipulation, where dynamically consistent demonstrations from proficient human teleoperation solve the intended task [19]. In contrast, for locomotion we have dynamically inconsistent demonstrations that only partially solve a task, challenging the direct application of IL. For instance, to parkour with a bipedal robot, we may have demonstrations for runs, leaps, and jumps from human motion capture, which suffer from morphological dissimilarity due to source-target mismatches [20]. Moreover, naive imitation of partial demonstrations (run, leap, etc.) does not guarantee solving the overarching task (parkour), requiring a high-level RL-trained policy for transitions and emergent behaviors [21, 20, 22]. Besides demonstrations, robot tasks have rich priors like heuristics [23], task/motion planners [24], and model-based controllers [25], which can guide learning, leading to regularized behaviors [13]. While the idea of imitating such priors has been studied, we instead propose building a “trust region” in the state space around the prior’s solution: the more we “trust” a prior, the tighter the trust region can be, and vice versa. Formally defining a general class of priors, the oracle, an oracle-guided policy optimization can be performed by bounding the policy’s permissible state space within the local neighborhood of an oracle’s ansatz. Empirically, we observe that the right choice of this bound helps escape erroneous local optima, providing a balance between emergent and regularized behaviors that is ideal for robot control, making it an effective hyperparameter in practice.

On the other hand, solving complex tasks requires behavioral multi-modality. Classical multi-mode control [26, 27] involves switching among a finite set of pre-designed controllers to address high-level tasks, creating a pseudo-hybrid system. Learning methods leave multi-modality in control to emerge implicitly [5, 3], lacking a methodical synthesis. [22] proposes encoding a dataset of demonstrations into a latent space and conditioning on the latent to train multi-skilled policies. However, the notion of skill is poorly defined: solving a task requires not only mastering discrete modes (e.g., walking, jumping) but also continuous parameter variations of the same (e.g., speed, height) and inter-skill transitions. [2] trains multiple low-level controllers and a high-level mode-switching policy for quadrupedal parkour, requiring diverse reward shaping and training routines. [28] shows that a “fixed” set of uni-mode controllers limits complex transition maneuvers, introducing a single multi-mode policy that masters a set of modes and transitions to handle new tasks zero-shot. In line with this approach, we aim for a single policy that learns a finite set of modes with “infinite” parameter variations and transitions through reference-guided policy optimization. Unlike switching fixed controllers, we hypothesize that reference-guided exploration can better accommodate emergent control modes. To this end, we first introduce task-vital multi-modality as a way to decompose tasks into their principal modes and transcribe them into our proposed OGMP framework. Thus, the major contributions of our paper are twofold:

  • Oracle Guided Multi-mode Policies: A theoretical framework for task-centered control synthesis leveraging oracle-guided optimization to effectively search through bounded exploration and task-vital multi-modality for versatile control.

  • Experimental validation on agile bipedal control tasks requiring versatility, such as parkour and diving. A single policy per task demonstrates the ability to perform diverse variants of the task-vital modes, realized in simulation and on the real 16-DoF bipedal robot HECTOR.

The remainder of the paper is structured as follows: Sec. II presents the theoretical development of our framework, Sec. III discusses applying the proposed framework to bipedal control tasks, and Sec. IV presents the experimental results, analysis, and ablation studies.

II Oracle Guided Multi-mode Policies

This section presents our theoretical framework with two synergetic ideas: oracle-guided policy optimization and task-vital multi-modality. Specifically, we aim to prune undesirable local optima by bounding exploration to the local neighborhood of an oracle and by designing the learning of multiple behavior modes and transitions to effectively solve tasks.

II-1 Oracle Guided Policy Optimization (OGPO)

Let \mathcal{T} be an infinite horizon task with a task parameter set \psi_{\mathcal{T}}\in\Psi_{\mathcal{T}}. Sufficiently solving \mathcal{T} requires maximizing a task objective J_{\mathcal{T}} over the task parameter distribution p(\psi_{\mathcal{T}}). Given the corresponding state space (or task space) of interest, x\in\mathcal{X}, let x_{t} and x[a,b] denote a state at time t and a state trajectory over t\in[a,b], respectively. We define \Xi to be a receding horizon oracle that provides a finite horizon state trajectory x^{\Xi}[t,\,t+\Delta t] from any given state x_{t} until \Delta t into the future for any task variant \psi_{\mathcal{T}}, such that x^{\Xi}[t,\,t+\Delta t] always lies within an \epsilon-neighborhood of an optimal state trajectory. Formally,

x^{\Xi}[t,\,t+\Delta t]=\Xi(x_{t},\psi_{\mathcal{T}}) (1a)
s.t. \exists\,x^{\Xi}[t,\,t+\Delta t]\quad\forall\,(x_{t},\psi_{\mathcal{T}}) (1b)
\|x^{\Xi}_{t}-x^{*}_{t}\|_{W}<\epsilon\quad\forall\,t\in[0,\,\infty) (1c)

where \epsilon is the maximum deviation bound, a constant for a given pair (\mathcal{T},\,\Xi), and W is a diagonal weight matrix. (Note that x^{*}[0,\infty] and \epsilon are unknown and are introduced only to construct the conceptual argument.) We aim to obtain the optimal policy \pi^{*} from a policy class \Pi, guided by \Xi, that sufficiently solves \mathcal{T}. Since \Xi provides a reference in the state space, we propose constraining the permissible states of \pi to lie within a \rho-neighborhood of the oracle’s guidance. Formally,

\pi^{*}:=\arg\max_{\pi\in\Pi}J_{\mathcal{T}} (2a)
s.t. \|x^{\pi}_{t}-x^{\Xi}_{t}\|_{W}<\rho\quad\forall\,t\in[0,\infty) (2b)

where x^{\pi}_{t} are the states visited while rolling out policy \pi, and \rho is the permissible state bound for oracle-guided exploration.
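In practice, the constraint in Eq. 2b is enforced during rollouts rather than inside the optimizer (see Sec. III-C). The following minimal sketch illustrates the idea; the environment, policy, and toy linear-interpolation oracle are illustrative placeholders, not our implementation.

```python
# Minimal sketch of an oracle-guided rollout (Eqs. 1-2). `env` and `policy` are
# hypothetical placeholders; the toy oracle simply interpolates to a goal state.
import numpy as np

class LinearInterpOracle:
    """Toy receding-horizon oracle: linear interpolation from x_t to a goal."""
    def __init__(self, goal, horizon):
        self.goal, self.horizon = np.asarray(goal, dtype=float), horizon

    def query(self, x_t):
        # Reference trajectory x^Xi[t, t + Delta t] generated from the current state.
        alphas = np.linspace(0.0, 1.0, self.horizon)
        return np.array([(1 - a) * x_t + a * self.goal for a in alphas])

def oracle_guided_rollout(env, policy, oracle, W, rho, max_steps=1000):
    """Roll out `policy`, re-querying the oracle every Delta t steps and stopping
    when the state leaves the rho-neighborhood of the reference (Eq. 2b)."""
    x_t = env.reset()
    ref, k = oracle.query(x_t), 0
    for step in range(max_steps):
        if k >= len(ref):                 # receding horizon: re-query the oracle
            ref, k = oracle.query(x_t), 0
        a_t = policy(x_t, ref[k])
        x_t, done = env.step(a_t)
        err = x_t - ref[k]
        if np.sqrt(err @ W @ err) > rho or done:   # permissible-state bound
            break
        k += 1
    return step
```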

Figure 2: Oracle-bounded exploration. (i) Permissible states of \pi within the \rho-neighborhood of x^{\Xi}_{t}. (ii) Different choices of \rho.

With the above setup, one can observe that the bounded set of permissible states visualized in Fig. 2i is given by

x^{\pi}_{t}\in\{x\,|\,\|x^{\Xi}_{t}-x^{*}_{t}\|-\rho\leq\|x-x^{*}_{t}\|\leq\|x^{\Xi}_{t}-x^{*}_{t}\|+\rho\} (3)
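Both bounds in Eq. 3 follow from the triangle inequality: for any permissible state x with \|x-x^{\Xi}_{t}\|\leq\rho,

\|x-x^{*}_{t}\|\leq\|x-x^{\Xi}_{t}\|+\|x^{\Xi}_{t}-x^{*}_{t}\|\leq\|x^{\Xi}_{t}-x^{*}_{t}\|+\rho,
\|x-x^{*}_{t}\|\geq\|x^{\Xi}_{t}-x^{*}_{t}\|-\|x-x^{\Xi}_{t}\|\geq\|x^{\Xi}_{t}-x^{*}_{t}\|-\rho.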

Thus, \rho should be chosen so that x^{*}_{t} is reachable by x^{\pi}_{t}, which requires the lower bound in Eq. 3 to be non-positive. Therefore, as shown in Fig. 2ii, for x^{*}_{t} to be within the permissible states of \pi, \rho must satisfy

(x^{*}_{t}\text{ is reachable by }x^{\pi}_{t})\implies\rho\geq\|x^{\Xi}_{t}-x^{*}_{t}\| (4)

By definition (Eq. 1c), since \|x^{\Xi}_{t}-x^{*}_{t}\|<\epsilon, a sufficient choice of \rho to satisfy Eq. 4 for all time is

\rho\geq\epsilon (5)

Since \epsilon is the maximum deviation of x^{\Xi}, an oracle with a low \epsilon generates references close to x^{*}; thus, exploration can be bounded to a tight neighborhood that filters out most local optima in the objective landscape. In contrast, for a “poor” oracle with a high \epsilon, there should be sufficient search space for \pi to explore around x^{\Xi} and converge to x^{*}. From Eq. 5, as \epsilon\to\infty, the optimization is unguided, thus needing \rho\to\infty (a standard RL setting). Conversely, as \epsilon\to 0, an arbitrarily small \rho satisfying Eq. 5 can be chosen to avoid local optima by localizing the search while still being able to converge to x^{*}_{t}. In practice, as \epsilon is unknown, we perform a grid search over \rho for any given pair (\mathcal{T},\,\Xi).
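This search amounts to a simple sweep, sketched below; train_and_evaluate is a hypothetical helper that runs oracle-guided training at a given \rho and returns the resulting task objective.

```python
# Illustrative grid search over the permissible state bound rho (cf. Eq. 5).
# `train_and_evaluate` is a hypothetical stand-in for OGPO training + evaluation.
def select_rho(train_and_evaluate, rho_grid=(0.05, 0.1, 0.2, 0.4, 0.6, 0.8)):
    scores = {rho: train_and_evaluate(rho) for rho in rho_grid}  # J_T per rho
    best_rho = max(scores, key=scores.get)
    return best_rho, scores
```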

II-2 Task Vital Multi-modality

A policy learning to solve a task \mathcal{T} can be seen as mastering a “bundle” of spacetime trajectories in the task space, corresponding to \psi_{\mathcal{T}}\in\Psi_{\mathcal{T}}. Simple tasks allow straightforward oracle construction satisfying Eq. 1. However, complex and infinite-horizon tasks make \Psi_{\mathcal{T}} intractable. For instance, an oracle for indefinite parkour requires knowledge of an infinite track a priori, which is impractical. To address this, we define modes as finite spacetime segments that preserve some spatial and/or temporal invariances. In parkour, for example, modes like jump and leap remain consistent regardless of location or time. Therefore, we define a finite set of modes, \mathbb{M}, each of temporal length \Delta t, vital for \mathcal{T}. Each m\in\mathbb{M} (like jump) can have continuous parameters \Psi_{m} (like jump height). The mode parameter set \Psi_{\mathbb{M}}:=\bigcup_{i=1}^{|\mathbb{M}|}\Psi_{m_{i}} is then related to the task parameter set as \Psi_{\mathcal{T}}:=\Psi_{\mathbb{M}}^{N}, where N is the number of horizons. Thus, by mastering the modes in \mathbb{M} (jump, leap, and pace) and transitions over varying \Psi_{\mathbb{M}} (speeds, distances, and heights), task \mathcal{T} (indefinite parkour, N\rightarrow\infty) can be solved, as visualized in Fig. 3a.
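As a concrete illustration, the parkour decomposition can be transcribed as a finite mode set with continuous parameter ranges; the numeric ranges below are hypothetical placeholders, not our training ranges.

```python
# Illustrative transcription of task-vital modes for parkour. The parameter
# ranges are placeholder assumptions for the sketch.
import random
from dataclasses import dataclass

@dataclass
class Mode:
    name: str
    param_ranges: dict  # continuous mode parameters Psi_m: name -> (low, high)

MODES = [
    Mode("pace", {"v": (0.0, 0.6)}),                      # heading speed (m/s)
    Mode("jump", {"w": (0.2, 0.5), "h": (0.05, 0.15)}),   # block width, height (m)
    Mode("leap", {"w": (0.2, 0.5), "d": (0.1, 0.4)}),     # gap width, depth (m)
]

def sample_task_variant(num_horizons):
    """Sample psi_T in Psi_T = Psi_M^N: a sequence of N modes with uniformly
    sampled continuous parameters, one mode per horizon of length Delta t."""
    variant = []
    for _ in range(num_horizons):
        m = random.choice(MODES)
        params = {k: random.uniform(lo, hi) for k, (lo, hi) in m.param_ranges.items()}
        variant.append((m.name, params))
    return variant
```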

III Design Methodology

This section presents the design methodology using OGMP for the bipedal control tasks, parkour and dive, as shown in Fig. 3. For a given task, we first define the task-vital modes, such as jump, leap, and pace for parkour, and design a reference generation scheme for each (Fig. 3a). Spanning the mode parameter set, we employ the oracle to generate a custom dataset of diverse behaviors and train a mode encoder to construct a compact latent space for command conditioning (Fig. 3c). Finally, we train a multi-mode policy guided by the oracle (Fig. 3d). During policy optimization, the oracle is periodically queried to generate references online, bounding the policy’s search space to a reference’s local neighborhood for effective exploration (Fig. 3b). The above approach is explained in detail as follows.

Figure 3: Overview of the design methodology: a) the breakdown of a task into its modes and mode parameter set; b) guided exploration via a bounded permissible state space around the local neighborhood of the oracle’s reference; c) mode encoder: an LSTM autoencoder trained on a custom modal dataset by minimizing reconstruction loss; d) multi-mode policy trained with oracle-guided policy optimization on a task environment.

III-A Task Description

We apply the proposed framework to two bipedal control tasks, parkour and dive, with varying objectives and extents of multi-modality, as shown in Table I. Note that the choice of task-vital modes is user-defined, and Table I simply reflects our choice. Evident from Table I, J_{\mathcal{T}} is task-dependent. Recent attempts in quadruped parkour [2, 3] and locomotion [6, 1] show some well-shaped candidates for J_{\mathcal{T}}, albeit case-specific. In general, a reasonable J_{\mathcal{T}} can be hard to design (for instance, for the dive task); a compelling unified alternative is to “track” the oracle’s \epsilon-neighborhood reference to the optimal solution. Hence, we propose minimizing the task-independent surrogate objective \tilde{J}_{\mathcal{T}}:=\sum_{t=0}^{\infty}\|x^{\pi}_{t}-x^{\Xi}_{t}\|. \tilde{J}_{\mathcal{T}}’s applicability is studied in Sec. IV-B, and a reward for an equivalent maximization objective is proposed in Sec. III-C.

| \mathcal{T}: J_{\mathcal{T}} | Mode (m\in\mathbb{M}) | Parameters (\Psi_{m}) |
| --- | --- | --- |
| dive: 360^{\circ} flip and land | settle | \{\} |
| | flip | \{(r,h)\} |
| parkour: traverse the track indefinitely | pace | \{v\} |
| | jump | \{(w,h)\} |
| | leap | \{(w,d)\} |

TABLE I: Task description and corresponding task-vital multi-modality. Parameters visualized in Fig. 4.
Figure 4: Base height reference trajectories from various oracles for different modes

III-B Oracle Design

For any locomotion task, a simple heuristic for an oracle is to linearly interpolate the relevant state variables from the initial to the desired goal states. In parkour, to advance along the track, the heading position can be linearly interpolated along the track while adapting to the terrain height, as shown in Fig. 4 (left). For dive, the oracle can linearly interpolate the base height and the corresponding rotational DoF from 0^{\circ} to 360^{\circ}. Naming this heuristic oracle \Xi_{\text{li}} (Fig. 4), its high \epsilon is obvious, as the generated ansatz does not consider the system’s inertia and gravity. Hence, oracles that capture the dominant dynamics of the hybrid system are required. To this end, we use a modified version of the simplified Single Rigid Body (SRB) model [25], whose dynamics in world coordinates are given by:

m(\ddot{p}+g)=\sum_{i=1}^{2}f_{i},\quad\frac{d}{dt}(I\omega)\approx I\dot{\omega}=\sum_{i=1}^{2}(r_{i}\times f_{i}+\tau_{i})=\sum_{i=1}^{2}\bar{\tau}_{i} (6)
x_{t+1}=Ax_{t}+Bu_{t},\quad y_{t+1}=Cx_{t},\quad u_{t}=[f_{1},f_{2},\bar{\tau}_{1},\bar{\tau}_{2}]^{T} (7)

where \ddot{p} and \omega are the robot’s COM acceleration and angular velocity, r_{i},\,f_{i},\,\tau_{i} are the position, force, and moment vectors of the i^{th} contact point, and m,\,I,\,g are the mass, moment of inertia, and gravity. Typically, r_{i} comes from a predefined contact schedule, leading to time-varying dynamics. Since, by definition, an oracle need not provide realistic control, we define an auxiliary control \bar{\tau}_{i} encompassing the overall moment, making the dynamics time-invariant. Additionally, the rotation and rotation rate matrices are made constant by considering the average reference orientation over a horizon. Applying these approximations to Eq. 6 and discretizing leads to a linear time-invariant (LTI) system over the current horizon, where x_{t}\in\mathbb{R}^{13} and y_{t},\,u_{t}\in\mathbb{R}^{12} are the gravity-augmented state, relevant output, and control vectors. Thus, oracles can be constructed considering two distinct phases: flight and contact. In flight, u_{t}=0 as there are no contacts, and during contact, an optimizer of choice can be used to compute the optimal control for a given objective, u_{t}=u^{*}_{t}. The reference state trajectory is obtained by applying the corresponding control and forward simulating Eq. 7. Using \Xi_{\text{li}} as the reference for a quadratic tracking objective on the LTI system, optimizing with preview control [29] and LQR yields the oracles \Xi_{\text{prev}} and \Xi_{\text{lqr}}, respectively, which have a smaller \epsilon than \Xi_{\text{li}}, as shown in Fig. 4.
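To make the construction concrete, the sketch below builds an LQR oracle on a 1-D vertical slice of the discretized model (state [z, \dot{z}], control f); the dimensions, time step, mass, and cost weights are illustrative assumptions, whereas the full oracle operates on the 13-dimensional gravity-augmented SRB state.

```python
# Simplified sketch of Xi_lqr: LQR tracking of a linearly interpolated height
# reference (Xi_li) on a 1-D vertical SRB model. dt, mass, and weights are
# placeholder assumptions for illustration.
import numpy as np
from scipy.linalg import solve_discrete_are

dt, mass, g = 0.02, 13.0, 9.81
A = np.array([[1.0, dt], [0.0, 1.0]])     # state: [z, z_dot]
B = np.array([[0.0], [dt / mass]])        # control: vertical contact force f
Q = np.diag([100.0, 1.0])                 # quadratic tracking weights
R = np.array([[1e-3]])

P = solve_discrete_are(A, B, Q, R)
K = np.linalg.inv(R + B.T @ P @ B) @ (B.T @ P @ A)   # discrete LQR gain

def xi_lqr(x0, z_goal, horizon):
    """Forward-simulate the LTI model under LQR tracking of the Xi_li height
    reference to produce the oracle's reference state trajectory."""
    z_ref = np.linspace(x0[0], z_goal, horizon)       # Xi_li: linear interpolation
    x, traj = np.array(x0, dtype=float), []
    gravity = np.array([0.0, -g * dt])
    for k in range(horizon):
        x_des = np.array([z_ref[k], 0.0])
        f = mass * g - (K @ (x - x_des)).item()       # gravity feedforward + feedback
        x = A @ x + (B * f).flatten() + gravity       # contact phase; flight would use f = 0
        traj.append(x.copy())
    return np.array(traj)

reference = xi_lqr(x0=[0.55, 0.0], z_goal=0.75, horizon=30)  # e.g., jump onto a block
```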

Figure 5: Keyframes of an OGMP (\pi_{\text{parkour}}) demonstrating mode transitions (left to right), with the active mode color-coded: blue for pace, red for leap, and green for jump. (i) Mode transitions: pace \rightarrow jump. (ii) Mode transitions: pace \rightarrow leap \rightarrow pace.

III-C Multi-mode Policy

Mode Encoder: We train a mode encoder, \xi, on diverse locomotion modes to obtain a compact conditioning space ideal for commanding our policy. Similar to [28], the encoder, z=\xi(x[t,\,t+\Delta t]), maps the trajectory space to a latent space (\dim(z)=2). Uniformly sampling from the mode parameter set \Psi_{\mathbb{M}} and a set of initial states \mathbb{X}_{0}, we generate a rich and balanced modal dataset by querying the oracle \Xi, as shown in Fig. 3c. Minimizing the reconstruction loss of a single-hidden-layer LSTM autoencoder with 32 neurons on this custom dataset yields a set of latent mode points with structured clustering, as visualized in Fig. 3c.
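A minimal sketch of such an encoder is shown below in PyTorch; the state dimension, dataset format, and optimizer settings are illustrative assumptions.

```python
# Sketch of the mode encoder: a single-hidden-layer LSTM autoencoder compressing
# an oracle trajectory x[t, t+Delta t] into a 2-D latent mode command z.
import torch
import torch.nn as nn

class ModeAutoencoder(nn.Module):
    def __init__(self, state_dim, hidden_dim=32, latent_dim=2):
        super().__init__()
        self.encoder = nn.LSTM(state_dim, hidden_dim, batch_first=True)
        self.to_latent = nn.Linear(hidden_dim, latent_dim)
        self.from_latent = nn.Linear(latent_dim, hidden_dim)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.to_state = nn.Linear(hidden_dim, state_dim)

    def encode(self, traj):                       # traj: (batch, horizon, state_dim)
        _, (h_n, _) = self.encoder(traj)
        return self.to_latent(h_n[-1])            # z: (batch, latent_dim)

    def forward(self, traj):
        z = self.encode(traj)
        dec_in = self.from_latent(z).unsqueeze(1).repeat(1, traj.shape[1], 1)
        out, _ = self.decoder(dec_in)
        return self.to_state(out), z

def train_mode_encoder(dataset, state_dim, epochs=100):
    """Minimize reconstruction loss over oracle-generated mode trajectories."""
    model = ModeAutoencoder(state_dim)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for traj in dataset:                      # traj: (batch, horizon, state_dim) tensor
            recon, _ = model(traj)
            loss = nn.functional.mse_loss(recon, traj)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```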

Mode Conditioned Policy: Our choice of action space for a stationary policy follows [28]. Given the inherent partial observability of the system, for the observation space o_{t}\in\mathcal{O} we choose o_{t}=[\tilde{x}_{t},\,z_{t},\,c_{t},\,h_{t}], where \tilde{x}_{t} is the robot’s proprioceptive feedback, z_{t} is the latent mode command, c_{t} is a clock signal [28], and h_{t} is optional task-based feedback (like a terrain scan for parkour). The per-step reward for the task-agnostic surrogate tracking objective is defined as

r_{t} := r_{\text{track}} + r_{\text{regulation}} (8)
r_{\text{regulation}} := 0.05\,e^{-0.01\|u_{t}\|} - 0.3\cdot\mathds{1}(\text{non-toe contact})
r_{\text{track}} := 0.475\,e^{-5\|er_{p}\|} + 0.475\,e^{-5\|er_{o}\|}

where er_{p} and er_{o} are the errors in base position and orientation. Thus, r_{\text{track}} minimizes an error in the task space, and r_{\text{regulation}} regularizes the motion to enhance sim-to-real transfer. Note that across the diverse modes of both tasks (parkour and dive), the reward weights remain the same, hinting at a degree of algorithmic robustness that arises from guided learning. The proposed permissible state constraint (Sec. II) is programmed as a termination condition: terminate episode := \mathds{1}(\|x^{\pi}_{t}-x^{\Xi}_{t}\|_{W}>\rho). W_{11} and W_{33} are set to 1 for the parkour and dive tasks, respectively, with the remaining entries zero. For solving the resulting random horizon POMDP, we use off-the-shelf PPO to train a policy, a 2-layer LSTM network with 128 nodes per layer, where each episode is an arbitrary task variant \psi_{\mathcal{T}} uniformly sampled from \Psi_{\mathcal{T}}.
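A compact sketch of this reward and termination logic is given below; the inputs (control u_{t}, tracking errors, contact flag) are assumed to be provided by the simulator, and the weighted norm follows the W convention above.

```python
# Sketch of the per-step reward (Eq. 8) and the oracle-bound termination that
# replaces the constraint of Eq. 2b. Input conventions are assumptions.
import numpy as np

def step_reward(u_t, er_p, er_o, non_toe_contact):
    r_regulation = 0.05 * np.exp(-0.01 * np.linalg.norm(u_t)) \
                   - 0.3 * float(non_toe_contact)
    r_track = 0.475 * np.exp(-5.0 * np.linalg.norm(er_p)) \
              + 0.475 * np.exp(-5.0 * np.linalg.norm(er_o))
    return r_track + r_regulation

def terminate_episode(x_pi, x_xi, W, rho):
    """True when the policy state leaves the rho-neighborhood of the oracle
    reference, i.e. ||x_pi - x_xi||_W > rho."""
    err = x_pi - x_xi
    return np.sqrt(err @ W @ err) > rho
```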

IV Results

IV-A Performance

To achieve extreme agility and mode versatility, a single multi-mode policy is trained per task: \pi_{\text{parkour}} for parkour and \pi_{\text{dive}} for diving. The supplementary video and Fig. 1 show \pi_{\text{parkour}} successfully navigating challenging parkour tracks with randomly placed blocks and gaps, demonstrating versatile agility over leap lengths and jump heights. \pi_{\text{dive}} performs omnidirectional flips from different heights and transitions smoothly to landing. Despite lacking a reference for the actuated DoFs, \pi_{\text{dive}} learns an emergent behavior of curling and extending its legs for flips and landings, modulating the torso angular velocity and the landing impact. As seen in the video, \pi_{\text{parkour}} significantly deviates from the oracle’s reference to find the optimal behavior, yet produces regularized motion due to the oracle bound. Sim-to-real transfer of \pi_{\text{parkour}}’s modes and transitions can be seen in Figs. 1 and 5 and the supplementary video.

IV-A1 Agility

For quantitative benchmarking of agility, we report in Table II the sample means of the performance metrics Maximum Heading Acceleration (M.H.A), Froude number (M.F), Maximum Heading Speed (M.H.S), and Episode Length (E.L), measured over 100 episode rollouts. We define a test environment with a track length of 10 m and an episode length of 400 steps for parkour, and an episode length of 150 steps for dive. In each case, an episode terminates only when the episode length is reached or the robot falls (terrain-relative base height < 0.3 m).

| metric, units | z_{t} | z_{t},c_{t} | h_{t} | h_{t},c_{t} | h_{t},z_{t} | h_{t},z_{t},c_{t} |
| --- | --- | --- | --- | --- | --- | --- |
| M.H.A | 4.7g | 3.5g | 3.2g | 3.6g | 3.5g | 3.1g |
| M.H.S (v, m/s) | 1.4 | 1.57 | 1.66 | 1.74 | 1.74 | 1.77 |
| M.F (v^{2}/(g\cdot ll)) | 0.48 | 0.56 | 0.64 | 0.69 | 0.70 | 0.72 |
| % E.L | 0.18 | 0.43 | 0.63 | 0.80 | 0.66 | 0.84 |

TABLE II: Estimated metrics for variants of \pi_{\text{parkour}} with different observation spaces.

On average, we find \pi_{\text{parkour}} reaching accelerations of up to 4.7g, heading speeds of up to 1.77 m/s, and Froude numbers between 0.48 and 0.72 while completing 84% of the track, as shown in Table II. \pi_{\text{parkour}} dynamically advances along the track, avoiding conservative motions, with precise foot placements for landing and take-off, leading to agile maneuvers. The measured Froude numbers and the resulting motion are consistent with [30], where a switch from energy-efficient walking to agile jumping gaits in bipeds was observed around a value of 0.5.
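For reference, the sketch below shows one way such metrics can be computed from a rollout’s heading-velocity trace; the time step, leg length, and input format are illustrative assumptions.

```python
# Illustrative computation of the agility metrics in Table II from a heading
# velocity trace; dt and leg length ll are placeholder assumptions.
import numpy as np

def agility_metrics(v_heading, dt=0.02, ll=0.55, g=9.81):
    v = np.asarray(v_heading, dtype=float)        # heading speed per step (m/s)
    acc = np.diff(v) / dt                         # finite-difference acceleration
    return {
        "M.H.S": float(v.max()),                  # max heading speed
        "M.H.A": float(np.abs(acc).max() / g),    # max heading acceleration (in g)
        "M.F": float(v.max() ** 2 / (g * ll)),    # Froude number v^2 / (g * ll)
    }
```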

IV-A2 Mode Versatility

Since the defined \Psi_{\mathbb{M}} of our training is a compact set, we leverage it to visualize the generalization of \pi_{\text{parkour}} over mode parameters. Dilating \Psi_{\mathbb{M}} and defining higher test ranges for each parameter, we test both in-domain (ID) and out-of-domain (OD) generalization. Discretizing this test set, we evaluate \pi_{\text{parkour}} and plot the undiscounted returns obtained for blocks and gaps with varying dimensions in Fig. 6ii. Different blocks and gaps require jumps and leaps of varying magnitudes, showcasing our policy’s versatility. The training sets \Psi_{\text{jump}} and \Psi_{\text{leap}} are the regions within the boundary marked in black in each plot. Thus, \pi_{\text{parkour}} shows consistent performance for variants within the black boundary (ID) while also extrapolating its skills by jumping and leaping over unseen terrain variants outside the black boundary (OD), as seen in the supplementary video and Fig. 6ii.
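A sketch of how such an evaluation grid can be constructed is given below; rollout_return is a hypothetical helper returning the undiscounted return on a track with a single obstacle of the given dimensions, and the ranges are placeholders.

```python
# Illustrative ID/OD evaluation grid over mode parameters (e.g., block width w
# and height h). `rollout_return` and the ranges are hypothetical placeholders.
import numpy as np

def versatility_grid(rollout_return, train_range, test_range, n=10):
    ws = np.linspace(*test_range[0], n)           # dilated test range for w
    hs = np.linspace(*test_range[1], n)           # dilated test range for h
    returns = np.array([[rollout_return(w, h) for h in hs] for w in ws])
    in_domain = np.array([[train_range[0][0] <= w <= train_range[0][1] and
                           train_range[1][0] <= h <= train_range[1][1]
                           for h in hs] for w in ws])
    return ws, hs, returns, in_domain             # returns: heatmap values (Fig. 6ii)
```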

IV-B Ablations and Analysis

Finally, we analyze our choice of the surrogate \tilde{J}_{\mathcal{T}} and the design choices that impact performance (measured via undiscounted returns). For parkour, a potential true objective J_{\mathcal{T}} is the displacement along the track. Thus, the validity of using \tilde{J}_{\mathcal{T}} can be quantified through its disparity with J_{\mathcal{T}}.

Choice of observation space (o_{t}): Ablating the observation space components (excluding \tilde{x}_{t}) shows that variants with c_{t} consistently outperform their counterparts (Fig. 6i.a). Variants without h_{t} are myopic and aggressive, with higher accelerations (Table II), as they rely purely on the compressed z_{t} for terrain feedback, making them sub-optimal compared to terrain-aware variants. From Fig. 6i.a, we observe that latent conditioning does not improve performance (see [h_{t},c_{t}] and [z_{t},h_{t},c_{t}]); hence it is used purely for analysis and reusability. However, a conditioned \pi_{\text{parkour}} can use the oracle as a closed-loop reactive planner during inference, driving the system to the commanded mode. From Fig. 6i.a and b, the similar trends of J_{\mathcal{T}} and \tilde{J}_{\mathcal{T}} show no disparity caused by observation space variations.

i. The estimated surrogate (\tilde{J}_{\mathcal{T}}) and true (J_{\mathcal{T}}) objectives vs. the choice of 1) observation space o_{t} (a and b), 2) oracle \Xi with varied permissible state bound \rho (c), 3) horizon length \Delta t (d and e), and 4) permissible state bound \rho (f and g). In b, e, and g, the upper bound of J_{\mathcal{T}} is marked in red.
ii. Heatmap of \pi_{\text{parkour}}’s undiscounted returns for jumps and leaps of varying magnitudes, both in and out of \Psi_{\mathbb{M}}, while traversing blocks and gaps of different dimensions.
iii. Motion traces comparing the choice of 1) oracles (a, b, c), where \Xi_{\text{li}} and \Xi_{\text{lqr}} fail while only \Xi_{\text{prev}} succeeds in a side-flip dive from a 2 m block, 2) horizon length (d, e, f), where the longest horizon (\Delta t=30dt) results in the desired agile performance, and 3) permissible state bound (g, h, i), where the tighter (\rho=0.1) and more lenient (\rho=0.8) bounds fail.
iv. Training curves for different \rho’s with a fixed set of PPO hyperparameters.
Figure 6: Ablations and analysis.

Choice of oracle (\Xi): The three oracles, \Xi_{\text{li}}, \Xi_{\text{lqr}}, and \Xi_{\text{prev}}, have a non-increasing trend in the maximum deviation bound \epsilon. As the optimal exploration bound (\rho^{*}) depends on \epsilon, we vary \rho from 0.1 to 0.8 for each oracle. For parkour, the different oracles show no significant performance difference (see video results). However, for dive, the \Xi_{\text{prev}} variants perform significantly better (Fig. 6i.c and 6iii.c). A tighter bound (\rho=0.1) works best for \Xi_{\text{prev}}, as it has the lowest \epsilon. Conversely, a higher exploratory deviation (\rho=0.4) performs best for \Xi_{\text{lqr}} and \Xi_{\text{li}}. Thus, \epsilon_{\text{li}}\geq\epsilon_{\text{lqr}}\geq\epsilon_{\text{prev}}\implies\rho^{*}_{\text{li}}\geq\rho^{*}_{\text{lqr}}>\rho^{*}_{\text{prev}}, affirming Eq. 5.

Choice of oracle’s horizon (\Delta t): We evaluate policy performance across different horizons, \Delta t=7dt, 15dt, and 30dt, for parkour. The shortest horizon, \Delta t=7dt, leads to myopic behavior, maintaining a high \tilde{J}_{\mathcal{T}} but a low J_{\mathcal{T}}, as the policy remains stationary by exploiting the high replanning frequency (Fig. 6iii.d and 6i.d, e), as also observed by [13]. Although advancing forward, \Delta t=15dt fails to anticipate farther terrain, resulting in quasi-static maneuvers (Fig. 6iii.e). Conversely, \Delta t=30dt enables the robot to leap efficiently from block to block, demonstrating agility and achieving the optimal outcome (Fig. 6iii.f). Increasing \Delta t aligns \tilde{J}_{\mathcal{T}} with the true task objective J_{\mathcal{T}} (Fig. 6i.d, e).

Choice of permissible state bound (\rho): For \Xi_{\text{prev}} in parkour, we vary \rho from 0.05 to 0.8 (Fig. 6i.f and g). We find an optimal \rho^{*}=0.5, with performance decreasing away from this value. For \rho<\rho^{*}, the optimal solution may lie outside the \rho-neighborhood of x^{\Xi} (Fig. 6iii.g and Fig. 6iv). Conversely, \rho>\rho^{*} admits more local optima within the \epsilon+\rho neighborhood, leading to sub-optimal solutions (high \tilde{J}_{\mathcal{T}}), as seen in Fig. 6iii.i for \rho=0.8, where the policy stagnates without advancing (low J_{\mathcal{T}}). Training curves in Fig. 6iv show that 0.1\leq\rho\leq 0.6 converges to the global optimum, while the rest settle at local optima. Note that \rho\rightarrow\infty corresponds to standard PPO, as it optimizes \tilde{J}_{\mathcal{T}} unguided by \Xi. Vanilla unguided PPO (\rho=10^{10}) falls into the same local optimum as \rho=0.8. Thus, oracle-guided optimization improves standard PPO by escaping local optima with the right choice of \rho.

V Conclusion

This paper introduces a framework for guided policy optimization through prior-bounded permissible states and task-vital multi-modality to tackle complex tasks. A single OGMP per task successfully solves agile bipedal parkour and diving, showcasing versatile agility. Future work will aim to extend the framework to contact-rich, open-world loco-manipulation tasks. Furthermore, by restricting the reachable states to a subset of the state space, we forgo the possibility of OGMP being a global policy; consequently, any state outside the \rho-neighborhood of the oracle’s references may result in failure. Since current deep RL methods lack global convergence guarantees, this is not restrictive in practice, but it highlights the need for stronger algorithms in future extensions.

References

  • [1] J. Siekmann, K. Green, J. Warila, A. Fern, and J. Hurst, “Blind Bipedal Stair Traversal via Sim-to-Real Reinforcement Learning,” in RSS, 2021.
  • [2] D. Hoeller, N. Rudin, D. Sako, and M. Hutter, “Anymal parkour: Learning agile navigation for quadrupedal robots,” Science Robotics, 2024.
  • [3] X. Cheng, K. Shi, A. Agarwal, and D. Pathak, “Extreme parkour with legged robots,” in RoboLetics: Workshop @CoRL 2023, 2023.
  • [4] Z. Zhuang, Z. Fu, J. Wang, C. G. Atkeson, S. Schwertfeger, C. Finn, and H. Zhao, “Robot parkour learning,” in CoRL, 2023.
  • [5] N. Rudin, D. Hoeller, M. Bjelonic, and M. Hutter, “Advanced skills by learning locomotion and local navigation end-to-end,” in IEEE IROS, 2022.
  • [6] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter, “Learning robust perceptive locomotion for quadrupedal robots in the wild,” Science Robotics, 2022.
  • [7] S. Levine and V. Koltun, “Guided policy search,” in Proceedings of the 30th International Conference on Machine Learning (S. Dasgupta and D. McAllester, eds.), vol. 28 of Proceedings of Machine Learning Research, (Atlanta, Georgia, USA), pp. 1–9, PMLR, 17–19 Jun 2013.
  • [8] S. Levine and V. Koltun, “Learning complex neural network policies with trajectory optimization,” in Proceedings of the 31st International Conference on Machine Learning (E. P. Xing and T. Jebara, eds.), vol. 32 of Proceedings of Machine Learning Research, (Bejing, China), pp. 829–837, PMLR, 22–24 Jun 2014.
  • [9] I. Mordatch and E. Todorov, “Combining the benefits of function approximation and trajectory optimization,” in Proceedings of Robotics: Science and Systems, (Berkeley, USA), July 2014.
  • [10] R. Cheng, A. Verma, G. Orosz, S. Chaudhuri, Y. Yue, and J. Burdick, “Control regularization for reduced variance reinforcement learning,” in Proceedings of the 36th International Conference on Machine Learning (K. Chaudhuri and R. Salakhutdinov, eds.), vol. 97 of Proceedings of Machine Learning Research, pp. 1141–1150, PMLR, 09–15 Jun 2019.
  • [11] J. Carius, F. Farshidian, and M. Hutter, “Mpc-net: A first principles guided policy search,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 2897–2904, 2020.
  • [12] S. Gangapurwala, A. Mitchell, and I. Havoutis, “Guided constrained policy optimization for dynamic quadrupedal robot locomotion,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3642–3649, 2020.
  • [13] F. Jenelten, J. He, F. Farshidian, and M. Hutter, “Dtc: Deep tracking control,” Science Robotics, 2024.
  • [14] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Trans. Graph., vol. 37, pp. 143:1–143:14, July 2018.
  • [15] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa, “Amp: adversarial motion priors for stylized physics-based character control,” ACM Trans. Graph., vol. 40, jul 2021.
  • [16] E. Vollenweider, M. Bjelonic, V. Klemm, N. Rudin, J. Lee, and M. Hutter, “Advanced skills through multiple adversarial motion priors in reinforcement learning,” in ICRA 2023, pp. 5120–5126.
  • [17] Z. Li, X. B. Peng, P. Abbeel, S. Levine, G. Berseth, and K. Sreenath, “Reinforcement learning for versatile, dynamic, and robust bipedal locomotion control,” 2024.
  • [18] Z. Luo, J. Wang, K. Liu, H. Zhang, C. Tessler, J. Wang, Y. Yuan, J. Cao, Z. Lin, F. Wang, J. Hodgins, and K. Kitani, “Smplolympics: Sports environments for physically simulated humanoids,” 2024.
  • [19] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song, “Diffusion policy: Visuomotor policy learning via action diffusion,” 2023.
  • [20] S. Bohez, S. Tunyasuvunakool, P. Brakel, F. Sadeghi, L. Hasenclever, Y. Tassa, E. Parisotto, J. Humplik, T. Haarnoja, R. Hafner, et al., “Imitate and repurpose: Learning reusable robot movement skills from human and animal behaviors,” arXiv preprint arXiv:2203.1713, 2022.
  • [21] T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, J. Humplik, M. Wulfmeier, S. Tunyasuvunakool, N. Y. Siegel, R. Hafner, M. Bloesch, K. Hartikainen, A. Byravan, L. Hasenclever, Y. Tassa, F. Sadeghi, N. Batchelor, F. Casarini, S. Saliceti, C. Game, N. Sreendra, K. Patel, M. Gwira, A. Huber, N. Hurley, F. Nori, R. Hadsell, and N. Heess, “Learning agile soccer skills for a bipedal robot with deep reinforcement learning,” Science Robotics, vol. 9, no. 89, p. eadi8022, 2024.
  • [22] L. Hasenclever, F. Pardo, R. Hadsell, N. Heess, and J. Merel, “CoMic: Complementary task learning & mimicry for reusable skills,” in ICML, PMLR, 2020.
  • [23] M. H. Raibert, “Legged robots,” Communications of the ACM, vol. 29, no. 6, pp. 499–514, 1986.
  • [24] J. Norby and A. M. Johnson, “Fast global motion planning for dynamic legged robots,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • [25] J. Li and Q. Nguyen, “Force-and-moment-based model predictive control for achieving highly dynamic locomotion on bipedal robots,” in IEEE CDC, 2021.
  • [26] T. Koo, G. Pappas, and S. Sastry, “Multi-modal control of systems with constraints,” in Proceedings of the 40th IEEE Conference on Decision and Control (Cat. No.01CH37228), vol. 3, pp. 2075–2080 vol.3, 2001.
  • [27] E. Asarin, O. Bournez, T. Dang, O. Maler, and A. Pnueli, “Effective synthesis of switching controllers for linear systems,” Proceedings of the IEEE, vol. 88, no. 7, pp. 1011–1025, 2000.
  • [28] L. Krishna and Q. Nguyen, “Learning multimodal bipedal locomotion and implicit transitions: A versatile policy approach,” in IEEE IROS, 2023.
  • [29] M. Murooka, M. Morisawa, and F. Kanehiro, “Centroidal trajectory generation and stabilization based on preview control for humanoid multi-contact motion,” IEEE RAL, 2022.
  • [30] R. M. Alexander, “The gaits of bipedal and quadrupedal animals,” IJRR, 1984.