
Back-stepping Experience Replay with Application to Model-free Reinforcement Learning for a Soft Snake Robot

Xinda Qi1, Dong Chen1,∗, Member, IEEE, Zhaojian Li2, Senior Member, IEEE, Xiaobo Tan1, Fellow, IEEE This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. This work was supported by the National Science Foundation (Grant CNS 2125484).1Xinda Qi, Dong Chen, and Xiaobo Tan are with the Department of Electrical and Computer Engineering, Michigan State University, Lansing, MI, 48824, USA. Email:{qixinda, chendon9, xbtan}@msu.edu.2Zhaojian Li is with the Department of Mechanical Engineering, Michigan State University, Lansing, MI, 48824, USA. Email: lizhaoj1@msu.edu. Corresponding author.
Abstract

In this paper, we propose a novel technique, Back-stepping Experience Replay (BER), that is compatible with arbitrary off-policy reinforcement learning (RL) algorithms. BER aims to enhance learning efficiency in systems with approximate reversibility, reducing the need for complex reward shaping. The method constructs reversed trajectories using back-stepping transitions to reach random or fixed targets. Interpretable as a bi-directional approach, BER addresses inaccuracies in back-stepping transitions through a distillation of the replay experience during learning. Given the intricate nature of soft robots and their complex interactions with environments, we present an application of BER in a model-free RL approach for the locomotion and navigation of a soft snake robot, which is capable of serpentine motion enabled by anisotropic friction between the body and ground. In addition, a dynamic simulator is developed to assess the effectiveness and efficiency of the BER algorithm, in which the robot demonstrates successful learning (reaching a 100% success rate) and adeptly reaches random targets, achieving an average speed 48% faster than that of the best baseline approach.

Index Terms:
Deep reinforcement learning, experience replay, soft robot, snake robot, locomotion, navigation.

I Introduction

As a promising decision-making approach, reinforcement learning (RL) has drawn increasing attention for its ability to solve complex control problems and achieve generalization in both virtual and physical tasks, as evidenced in various applications, such as chess games [1], quadrupedal locomotion [2], and autonomous driving [3, 4]. Considering the inherent infinite degrees of freedom of soft robots and their complicated interactions with environments [5], RL approaches were adopted for the control of soft robots, such as soft manipulators [6] and wheeled soft snake robots [7].

As a typical challenge for RL, especially in tasks that involve complicated behaviors, learning efficiency suffers from the relatively large search space and the inherent difficulty of the task, which usually requires delicate reward shaping [8] to guide the policy optimization and to constrain the learning directions or the behavior styles. The RL agents have to reach their goals often enough for efficient learning before getting lost in numerous inefficient failed trials. Multiple strategies have been proposed to address this hard-exploration challenge with sparse rewards, including improving exploration techniques to obtain more versatile trajectories from intrinsic motivations [9, 10, 11, 12], and exploiting the information acquired from undesired trials [13, 14, 15].

Compatible with these techniques for improving learning efficiency, the motivation of BER, proposed here for off-policy RL, is the human ability to solve problems forward (from the beginning to the goal) and backward (from the goal to the beginning) simultaneously, in contrast to standard model-free RL algorithms that rely mostly on forward exploration. For example, an effective way to prove a complicated mathematical identity is to work from both sides, using information from both the left-hand side (beginning) and the right-hand side (goal); the reasoning process and the mechanism of BER are similar.

In this paper, a BER algorithm is introduced that allows the RL agent to explore bidirectionally and is compatible with arbitrary off-policy RL algorithms. It is applicable to systems with approximate reversibility and with fixed or random goal setups. After an evaluation of BER on a toy task, it is applied to the locomotion and navigation task of a soft snake robot. The developed algorithm is validated in a physics-based dynamic simulator built on a computationally efficient serpentine locomotion model that reflects the system characteristics. Comprehensive experimental results demonstrate the effectiveness of the proposed RL framework with BER in learning the locomotion and navigation skills of the soft snake robot compared with other state-of-the-art benchmarks, indicating the potential of BER in general off-policy RL and robot control applications.

The remainder of the paper is structured as follows. Section II introduces the BER algorithm with an evaluation on a toy task. Section III details the BER application in locomotion and navigation of a soft snake robot with performance comparisons with other benchmarks, and Section IV concludes the paper.

II Back-stepping Experience Replay

II-A Background

II-A1 Reinforcement Learning

A standard RL formalism is adopted where an agent (e.g. a robot) interacts with an environment and learns a policy according to the perceptions and rewards.

In each episode, the system starts from an initial state \bm{s}_{0} drawn from a distribution p(\bm{s}_{0}), and the agent observes the current state \bm{s}_{t}\in\mathcal{S}\subseteq\mathbb{R}^{n} in the environment at each time step t. An action \bm{a}_{t}\in\mathcal{A}\subseteq\mathbb{R}^{m} is then generated to control the agent based on the current policy \pi and the observation. Afterward, the system evolves to a new state \bm{s}_{t+1} according to the action and the transition dynamics p(\cdot|\bm{s}_{t},\bm{a}_{t}), and a reward r_{t}=r(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1}) is collected by the agent for learning until the episode terminates. During training, the RL agent learns an optimal policy \pi^{*}:\mathcal{S}\rightarrow\mathcal{A} mapping states to actions that maximizes the expected return, where the return is the accumulated discounted reward R_{t}=\sum_{i=t}^{\infty}\gamma^{i-t}r_{i} and \gamma is a discount factor.
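For concreteness, a minimal Python sketch (not from the paper) of computing the discounted return R_{t} from a finite reward sequence:

import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Accumulated discounted reward R_t for every step t (illustrative only)."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
        returns[t] = running
    return returns

# Example: a sparse reward collected only at the last step
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]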

The state value function V^{\pi}(\bm{s}_{t})=\mathbb{E}(R_{t}|\bm{s}_{t}) represents the expected return starting from state \bm{s}_{t} and following the current policy \pi, and the action value function Q^{\pi}(\bm{s}_{t},\bm{a}_{t})=\mathbb{E}(R_{t}|\bm{s}_{t},\bm{a}_{t}) represents the expected return starting from state \bm{s}_{t} with an immediate action \bm{a}_{t} and thereafter following the current policy \pi. All optimal policies \pi^{*} share the same optimal Q-function Q^{*}, according to the Bellman equation [16]:

Q^{*}(\bm{s}_{t},\bm{a}_{t})=\mathbb{E}_{\bm{s}^{\prime}\sim p(\cdot|\bm{s}_{t},\bm{a}_{t})}\left[r(\bm{s}_{t},\bm{a}_{t},\bm{s}^{\prime})+\gamma\max_{\bm{a}^{\prime}\in\mathcal{A}}Q^{*}(\bm{s}^{\prime},\bm{a}^{\prime})\right] (1)

II-A2 Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG)

The Deep Q-Network (DQN) is a model-free, off-policy RL approach suitable for agents operating in discrete action spaces [16]. It typically employs a neural network Q to approximate the optimal Q-function Q^{*} and selects optimal actions as \bm{a}^{*}=\arg\max_{\bm{a}\in\mathcal{A}}Q(\bm{s}_{t},\bm{a}), with exploration often provided by the \epsilon-greedy algorithm. To stabilize training, a replay buffer stores transitions (\bm{s}_{t},\bm{a}_{t},r_{t},\bm{s}_{t+1}), which are sampled to optimize Q with the loss \mathcal{L}=\mathbb{E}\left[(Q(\bm{s}_{t},\bm{a}_{t})-y_{t})^{2}\right], where the target y_{t}=r_{t}+\gamma\max_{\bm{a}\in\mathcal{A}}Q_{\text{targ}}(\bm{s}_{t+1},\bm{a}) is computed with a periodically updated target network Q_{\text{targ}}.
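As an illustration of the target and loss computation above, a minimal sketch assuming generic callables q_net(s) and q_targ(s) that return a vector of Q-values over the discrete actions (the names and interfaces are illustrative, not the authors' implementation):

import numpy as np

def dqn_targets(batch, q_targ, gamma=0.99):
    """Compute y_t = r_t + gamma * max_a Q_targ(s_{t+1}, a) for a batch of transitions.

    batch: list of (s, a, r, s_next, done); q_targ(s) returns a vector of Q-values.
    """
    targets = []
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * np.max(q_targ(s_next))
        targets.append(r + bootstrap)
    return np.asarray(targets)

def dqn_loss(batch, q_net, q_targ, gamma=0.99):
    """Mean squared TD error, L = E[(Q(s_t, a_t) - y_t)^2]."""
    y = dqn_targets(batch, q_targ, gamma)
    q_sa = np.array([q_net(s)[a] for s, a, r, s_next, done in batch])
    return np.mean((q_sa - y) ** 2)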

Deep Deterministic Policy Gradient (DDPG) [17] is an off-policy RL algorithm for continuous action spaces that simultaneously learns a Q-function and a policy: an approximator of Q^{*} is trained jointly with a deterministic policy that selects \bm{a}^{*}, extending Q-learning to continuous action scenarios.

II-B Algorithm for BER

The classical off-policy RL algorithms above often struggle on systems with sparse rewards or on challenging tasks whose rewards are hard to reshape. In such scenarios, standard forward exploration is rarely informative because the success rate of reaching goals in complex problems without precise guidance is low [13]. To address these challenges, we propose a novel Back-stepping Experience Replay (BER) algorithm for tasks with different goals (Alg. 1), designed to enhance the learning efficiency of off-policy RL algorithms by exploring in both the forward and backward directions.

The BER algorithm requires at least approximate reversibility of the system. This means that from a standard transition (\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1}), a back-stepping transition (\bm{s}_{t+1},\widetilde{\bm{a}_{t}},\bm{s}_{t}) can be constructed that is similar to a real transition (\bm{s}_{t+1},\widetilde{\bm{a}_{t}},\bm{s}_{b,t}) in the environment, i.e., \bm{s}_{b,t}\approx\bm{s}_{t}. The action in the back-stepping transition is calculated as \widetilde{\bm{a}_{t}}=f(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1}), where the function f depends on the environment. The approximate reversibility is quantified by a small upper bound K that holds for all transitions during the back-stepping calculation:

\lVert\bm{s}_{b,t}-\bm{s}_{t}\rVert\leq K\cdot\lVert\bm{s}_{t+1}-\bm{s}_{t}\rVert,\quad K<1 (2)

Perfect reversibility corresponds to K=0, typically at the cost of a more complex function f, while approximate reversibility may be achieved with a slightly larger K and a simpler, solvable f.
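A minimal sketch of constructing a back-stepping transition and empirically checking the bound in Eq. (2); the environment helper env.step_from and the reverse-action function are assumptions for illustration, not the authors' code:

import numpy as np

def back_step_transition(s_t, a_t, s_next, reverse_action):
    """Build the back-stepping transition (s_{t+1}, a_tilde, s_t) from a forward one."""
    a_tilde = reverse_action(s_t, a_t, s_next)  # a_tilde = f(s_t, a_t, s_{t+1})
    return s_next, a_tilde, s_t

def reversibility_ratio(env, s_t, a_t, s_next, reverse_action):
    """Empirical ratio ||s_b - s_t|| / ||s_{t+1} - s_t||; should stay below K < 1 (Eq. (2))."""
    _, a_tilde, _ = back_step_transition(s_t, a_t, s_next, reverse_action)
    s_b = env.step_from(s_next, a_tilde)   # assumed helper: apply a_tilde starting from s_{t+1}
    return np.linalg.norm(s_b - s_t) / (np.linalg.norm(s_next - s_t) + 1e-12)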

The idea of BER is simple yet effective: instead of relying solely on forward explorations from the initial states to the goals (navy blue solid line in Fig. 1), which depend heavily on the randomness of the forward trajectories to reach these goals, the RL agent also explores backward from the goals toward the initial states (sky blue solid line in Fig. 1). Standard transitions are sampled from the forward and backward exploration trajectories (solid lines in Fig. 1), including their initial states. The back-stepping transitions are then computed from these standard transitions to form reversed trajectories (dashed lines in Fig. 1), whose virtual goals are set to the initial states of the corresponding standard trajectories; the reversed trajectories are therefore guaranteed to reach their virtual goals and contribute to the learning efficiency.

Figure 1: Illustration of the Back-stepping Experience Replay.

During the explorations, the standard and back-stepping transitions are collected and stored in separate replay buffers for training. A strategy \mathbb{S}_{t} samples transitions from the standard replay buffer R_{f} with probability P_{t,f} and from the back-stepping replay buffer R_{b} with probability P_{t,b}, where P_{t,f}+P_{t,b}=1. For a system with imperfect reversibility, P_{t,b} gradually drops to zero to distill the training transition set, since the back-stepping transitions are inaccurate. The details of BER are shown in Alg. 1. Note that the operator \odot between the states and the goals also indicates the corresponding modification of sequential data (e.g., history data) when the back-stepping transitions are constructed.

Given:
- An off-policy RL algorithm \mathbb{A}. \triangleright e.g. DDPG
- A probability P_{b} of triggering a backward trial.
- A strategy \mathbb{S}_{t} for sampling transitions from the replay buffers.
Require:
- Approximate reversibility of the system
1   Initialize \mathbb{A} \triangleright e.g. initialize networks
2   Initialize replay buffers R_{f} and R_{b}
3   for epoch = 1 \rightarrow M do
4       Sample a goal \bm{g} with an initial state \bm{s}_{0}
5       Forward trial starts
6       for t = 0 \rightarrow T_{end}-1 do
7           Sample an action \bm{a}_{t} using the policy of \mathbb{A}:
8           \bm{a}_{t}\leftarrow\pi(\bm{s}_{t}\odot\bm{g}) \triangleright e.g. \odot\rightarrow diff, concat
9           Execute action \bm{a}_{t}, observe new state \bm{s}_{t+1}
10      end for
11      for t = 0 \rightarrow T_{end}-1 do
12          r_{t} := r(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1},\bm{g})
13          Store transition (\bm{s}_{t}\odot\bm{g},\bm{a}_{t},r_{t},\bm{s}_{t+1}\odot\bm{g}) in R_{f}
            \triangleright standard experience replay
            Construct a back-stepping transition: \widetilde{\bm{a}_{t}} := f(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1})
14          r_{b,t} := r(\bm{s}_{t+1},\widetilde{\bm{a}_{t}},\bm{s}_{t},\bm{s}_{0})
            Store transition (\bm{s}_{t+1}\odot\bm{s}_{0},\widetilde{\bm{a}_{t}},r_{b,t},\bm{s}_{t}\odot\bm{s}_{0}) in R_{b} \triangleright BER
15      end for
16      Forward trial ends
17      Backward trial starts with probability P_{b}
18      Swap the goal \bm{g} and the initial state \bm{s}_{0}: \bm{s}_{0},\bm{g}=\bm{g},\bm{s}_{0}
19      Repeat line 6 - line 16
20      Backward trial ends
21      for t = 1 \rightarrow N do
22          Sample a mini-batch B from the replay buffers \{R_{f},R_{b}\} using \mathbb{S}_{t}
23          Perform one step of optimization using \mathbb{A} and mini-batch B
24      end for
25  end for
Algorithm 1 Back-stepping Experience Replay (BER)

BER accelerates the RL agent's estimation of the Q-functions by using the reversed successful trajectories to bootstrap the networks. One interpretation of BER is as a bi-directional search method for standard off-policy RL approaches, with a higher convergence rate and learning efficiency. The distillation strategy for the training transitions needs to be carefully tuned, and may be combined with other exploration techniques, so that an accurate policy is learned in the end and the limitations of bi-directional search (e.g., convergence to non-trivial sub-optima) are avoided.
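A minimal sketch of the sampling strategy \mathbb{S}_{t} over the two replay buffers (the buffer representation and batch size are illustrative assumptions):

import random

def sample_mixed_batch(R_f, R_b, p_t_b, batch_size=64):
    """Draw each transition from the back-stepping buffer R_b with probability P_{t,b},
    otherwise from the standard buffer R_f (P_{t,f} = 1 - P_{t,b})."""
    batch = []
    for _ in range(batch_size):
        source = R_b if (random.random() < p_t_b and len(R_b) > 0) else R_f
        batch.append(random.choice(source))
    return batch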

In practical learning tasks, the accuracy and the complexity of the function f:\widetilde{\bm{a}_{t}}=f(\bm{s}_{t},\bm{a}_{t},\bm{s}_{t+1}), which calculates the actions \widetilde{\bm{a}_{t}} in the back-stepping transitions (\bm{s}_{t+1},\widetilde{\bm{a}_{t}},\bm{s}_{t}), need to be balanced. An accurate f yields better reversibility (smaller K in Eq. (2)) with more accurate back-stepping transitions, and thus less bias and noise, but f itself could be computationally expensive or even unsolvable. On the other hand, a moderate relaxation of the accuracy of f can speed up the calculation of back-stepping transitions, provided that the larger bias and noise introduced by the approximate reversibility (larger K) are managed by the distillation mechanism in BER.

II-B1 A case study of BER

To illustrate the effectiveness and generality of BER, a binary bit-flipping game [13] with n bits was considered as the environment, where the state was the bit array \bm{s}=\{s_{i}\}_{i=1}^{n}\in\mathcal{S}, s_{i}\in\{0,1\}, and the action was the index of the bit to flip, \bm{a}\in\{1,...,n\}=\mathcal{A}. The game is fully reversible, with \widetilde{\bm{a}_{t}}=f(\bm{a}_{t})=\bm{a}_{t} for every time step and transition. The initial state \bm{s}_{0}\in\mathcal{S} and the goal \bm{g}\in\mathcal{S} were sampled uniformly at random, with a sparse non-positive reward r_{t}(\bm{s},\bm{a})=-[\bm{s}\neq\bm{g}]. An episode terminated once \bm{s}=\bm{g}.
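A minimal sketch of such a bit-flipping environment, consistent with the description above (an illustrative implementation, not the authors' code):

import numpy as np

class BitFlipEnv:
    """n-bit flipping game with a sparse reward of -1 until the goal is reached."""

    def __init__(self, n=8, rng=None):
        self.n = n
        self.rng = rng or np.random.default_rng()

    def reset(self):
        self.state = self.rng.integers(0, 2, self.n)
        self.goal = self.rng.integers(0, 2, self.n)
        return self.state.copy(), self.goal.copy()

    def step(self, a):
        self.state[a] ^= 1                       # flip the chosen bit
        done = np.array_equal(self.state, self.goal)
        reward = 0.0 if done else -1.0           # r = -[s != g]
        return self.state.copy(), reward, done

    @staticmethod
    def reverse_action(a):
        return a                                  # fully reversible: f(a) = a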

A simple ablation study was designed in which a vanilla DQN and a DQN with BER were trained for n=4,6,8. The backward exploration and the use of back-stepping transitions were fully active at the start and were switched off after 1k epochs. The experimental results (Fig. 2) showed that BER enabled effective and efficient policy learning for a standard DQN approach, and contributed more as the problem became more complex (i.e., as n grew).

Figure 2: Training experiments of the bit flip game with different algorithms and state dimensions. (A) Returns; (B) Success rates.

III BER in Model-free RL for a Soft Snake Robot

In this section, a locomotion and navigation task for a compact pneumatic soft snake robot with snake skins in our previous works [18, 19] is utilized to evaluate the effectiveness and efficiency of BER with a model-free RL approach, where the robot learns both movement skills and efficient strategies to reach different challenging targets.

III-A Soft Snake Robot and Serpentine Locomotion

Compared with soft snake robots in which each air chamber is controlled independently [20], the more compact soft snake robot with snake skins [18] considered in this paper uses only four independent air paths to generate its traveling-wave deformation, which reduces the amount of pneumatic tubing and thus lets the robot traverse complex environments more easily. The body of the robot consists of six bending actuators, and each actuator is divided into four air chambers (Figs. 3A, 3D) that connect to the four air paths (Fig. 3B). Four sinusoidal pressure references with 90-degree phase differences and the same amplitude generate the traveling-wave deformation (Fig. 3C), while biases added to the waves induce unbalanced actuation that steers the robot.

Figure 3: The overview of the soft snake robot with skins. (A) The soft snake robot with soft snakeskins; (B) The connection between air chambers and air paths; (C) The actuation pressures for air paths; (D) The structure of one bending actuator; (E) The structure of soft snakeskin; (F) The simulation (sim) and experimental (exp) results of the trajectory of the COM of the snake robot on a rough paper surface.

Serpentine locomotion is adopted for the movement of the soft snake robot, where the anisotropic friction between the snake skins and the ground propels the robot during the traveling-wave deformation [21]. The artificial snake skins are designed with a soft substrate and embedded rigid scales (Fig. 3E); see [19] for more details.

To describe the serpentine locomotion of the robot, the dynamic model in [21] is adopted, where the body of the robot is modeled as an inextensible curve in a 2D plane with total length L and constant density \rho per unit length. The position of each point on the robot at time t is defined as:

\bm{X}(s,t)=(x(s,t),y(s,t)) (3)

where s is the curve length measured from the tail of the robot.

By utilizing a mean-zero anti-derivative I_{0} [22] (I_{0}[f](s,t)=\int_{0}^{s}f(s^{\prime},t)\differential{s^{\prime}}-\frac{1}{L}\int_{0}^{L}\differential{s}\int_{0}^{s}\differential{s^{\prime}}f(s^{\prime},t)), the position \bm{X}(s,t) and the orientation \theta(s,t) (the angle between the local tangent direction and the X-axis of the inertial frame) of each point are described as functions of the position \overline{\bm{X}}(t) and orientation \overline{\theta}(t) (Fig. 4) of the center of mass (COM) of the robot:

\bm{X}(s,t)=\overline{\bm{X}}(t)+I_{0}[\bm{X}_{s}](s,t) (4)
\theta(s,t)=\overline{\theta}(t)+I_{0}[\kappa](s,t) (5)

where \bm{X}_{s}=(\cos\theta,\sin\theta), \kappa(s,t) is the local curvature, \overline{\bm{X}}(t)=\frac{1}{L}\int_{0}^{L}\bm{X}(s,t)\differential{s}, and \overline{\theta}(t)=\frac{1}{L}\int_{0}^{L}\theta(s,t)\differential{s}. The curvature \kappa(s,t) is related to the local pneumatic pressure via:

\kappa(s,t)=K_{b}\cdot\Delta p(s,t) (6)

where K_{b} is a proportional constant and \Delta p(s,t) is the pressure difference between the two opposing air chambers at point s.
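As a numerical illustration of the mean-zero anti-derivative I_{0} and of reconstructing the body shape from curvature via Eqs. (4)-(6), a sketch using a trapezoidal rule (the discretization is an assumption, not the paper's implementation):

import numpy as np

def mean_zero_antiderivative(f_vals, s):
    """I_0[f](s) = int_0^s f ds' - (1/L) int_0^L (int_0^s' f ds'') ds' (trapezoidal rule)."""
    F = np.concatenate(([0.0], np.cumsum(0.5 * (f_vals[1:] + f_vals[:-1]) * np.diff(s))))
    L = s[-1] - s[0]
    return F - np.trapz(F, s) / L

def body_shape_from_curvature(delta_p, s, K_b, x_com, theta_com):
    """Reconstruct x(s), y(s), theta(s) from the pressure difference along the body."""
    kappa = K_b * delta_p                                        # Eq. (6)
    theta = theta_com + mean_zero_antiderivative(kappa, s)       # Eq. (5)
    x = x_com[0] + mean_zero_antiderivative(np.cos(theta), s)    # Eq. (4), x-component
    y = x_com[1] + mean_zero_antiderivative(np.sin(theta), s)    # Eq. (4), y-component
    return x, y, theta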

Figure 4: The illustration of the soft snake robot with serpentine locomotion approaching a target.

The anisotropic friction \bm{f}_{fric} between the snake skins and the ground is described as a weighted combination of independent components along the local directions (forward \hat{\bm{f}}, backward \hat{\bm{b}}, transverse \hat{\bm{t}}):

\begin{cases}\bm{f}_{fric}=-\rho g(\mu_{t}(\hat{\bm{u}}\cdot\hat{\bm{t}})\hat{\bm{t}}+\mu_{l}(\hat{\bm{u}}\cdot\hat{\bm{f}})\hat{\bm{f}})\\ \mu_{l}=\mu_{f}H(\hat{\bm{u}}\cdot\hat{\bm{f}})+\mu_{b}(1-H(\hat{\bm{u}}\cdot\hat{\bm{f}}))\end{cases} (7)

where \hat{\bm{u}} is the direction of the local velocity, and \mu_{f}, \mu_{b}, and \mu_{t} are the friction coefficients of the snakeskin in the forward (\hat{\bm{f}}), backward (\hat{\bm{b}}), and transverse (\hat{\bm{t}}) directions, respectively. H(x)=(1+\operatorname{sgn}(x))/2, where \operatorname{sgn} is the signum function.
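A sketch of the friction law in Eq. (7) evaluated at a single body point (variable names are illustrative):

import numpy as np

def anisotropic_friction(u, f_hat, t_hat, rho, g, mu_f, mu_b, mu_t):
    """Friction force per unit length at one body point, following Eq. (7).

    u: local velocity vector; f_hat, t_hat: local forward and transverse unit vectors.
    """
    u_hat = u / (np.linalg.norm(u) + 1e-12)      # direction of the local velocity
    H = 0.5 * (1.0 + np.sign(u_hat @ f_hat))     # H(x) = (1 + sgn(x)) / 2
    mu_l = mu_f * H + mu_b * (1.0 - H)           # forward/backward coefficient
    return -rho * g * (mu_t * (u_hat @ t_hat) * t_hat + mu_l * (u_hat @ f_hat) * f_hat)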

The dynamics of each point of the snake robot is determined by Newton’s second law:

\rho\ddot{\bm{X}}=\bm{f}_{fric}+\bm{f}_{inte} (8)

where \bm{f}_{inte} is the internal force in the robot body, which includes the internal air pressure, the bending elastic force, etc., and satisfies \int_{0}^{L}\bm{f}_{inte}\differential{s}=0 and \int_{0}^{L}(\bm{X}(s,t)-\overline{\bm{X}}(t))\times\bm{f}_{inte}\differential{s}=0.

Finally, the dynamics of the COM of the robot are derived from Eqs. (3)-(8) together with the above constraints on \bm{f}_{inte}; see [22] for more details.

Based on this dynamic model, which reduces the dynamics of all points of the robot to a single dynamic system for its COM, a simulator is designed with proper discretizations and numerical techniques for RL training. The simulation results matched the experimental results [19] of the soft snake robot when different pressure biases were applied for the robot's steering (Fig. 3F); the wavy trajectories in the experiments are attributed to the limited number (25) of tracking markers used in the tests.
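For intuition, integrating Eq. (8) over the body, using the zero-mean property of I_{0} and the constraint \int_{0}^{L}\bm{f}_{inte}\differential{s}=0, gives \rho L\ddot{\overline{\bm{X}}}=\int_{0}^{L}\bm{f}_{fric}\differential{s} for the translational part. A minimal explicit-Euler sketch of this update (the rotational dynamics and the full reduction of [22] are omitted; this is an illustration, not the paper's simulator):

import numpy as np

def com_translation_step(x_com, v_com, f_fric_per_length, s, rho, dt):
    """One explicit-Euler update of the COM translation.

    f_fric_per_length: array of shape (N, 2), friction force per unit length at each point.
    """
    L = s[-1] - s[0]
    total_friction = np.trapz(f_fric_per_length, s, axis=0)   # integral of f_fric over the body
    a_com = total_friction / (rho * L)                        # rho * L * X_com_ddot = total friction
    v_new = v_com + a_com * dt
    x_new = x_com + v_new * dt
    return x_new, v_new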

III-B RL formulation of Locomotion and Navigation of the Robot

In this paper, the locomotion and navigation of the soft snake robot are formulated as a Markov Decision Process (MDP) \mathcal{M} and solved with model-free RL. The MDP is defined as the tuple \mathcal{M}=(\mathcal{A},\mathcal{S},\mathcal{R},\mathcal{T},\gamma):

  1.

    Action space: Compared with a random Central Pattern Generator (CPG) [7], more constrained sinusoidal waves are used to generate a smoother traveling-wave deformation of the robot for better locomotion efficiency. In addition, the learned controller is restricted to avoid high-frequency pressure changes: the RL agent can only change the waveform parameters at the beginning of each actuation period [0,T], whose length equals the period of the sinusoidal waves, and one episode consists of multiple consecutive actuation periods. The sinusoidal pressure p_{i} for the i-th channel of the robot is designed as:

    p_{i}=p_{m}\sin\left(c\cdot\frac{2\pi}{T}t_{r}+\frac{(i-1)\pi}{2}\right)+b_{i,pre}+(b_{i}-b_{i,pre})\frac{t_{r}}{T} (9)

    where t_{r}\in[0,T] is the relative time in one actuation period. p_{m} and b_{i}\in[0,b_{m}] are the fixed magnitude and the bias of the sinusoidal wave for the i-th channel, respectively, i\in\{1,2,3,4\}. b_{i,pre} is the one-step history of the wave bias b_{i} for the i-th channel, with b_{i,pre}=0 at the initial state. c\in\{-1,1\} is a variable that controls the propagation direction of the traveling-wave deformation and thus the movement direction of the robot.

    The action space \mathcal{A} of the RL agent for locomotion and navigation of the robot is designed as:

    \bm{a}=\{b_{a,1},b_{a,2},c\}\in\mathcal{A} (10)

    where the b_{i}'s are constructed from b_{a,1}\in[-b_{m},b_{m}] and b_{a,2}\in[-b_{m},b_{m}]:

    \begin{cases}b_{1},b_{3}=\max(0,b_{a,1}),\;-\min(0,b_{a,1})\\ b_{2},b_{4}=\max(0,b_{a,2}),\;-\min(0,b_{a,2})\end{cases} (11)

    At the beginning of each actuation period, the RL agent observes the state and, based on the current policy, generates an action that specifies the pressure waveform for that period to propel the snake robot; a sketch of this action-to-pressure mapping, together with the reward defined below, is given after this list. The wave design guarantees the continuity of the pressures across actuation periods, avoiding impractical sudden changes in the pressures and the robot's body shape.

  2.

    State space: A goal-conditioned state is used so that the RL agent can adapt to different random targets. Specifically, a relative representation of the snake robot's position and orientation with respect to the target is used as part of the state (Fig. 4):

    \bm{s}=\{\Delta X,\Delta Y,\Delta\theta,\bm{b}_{a,1,pre},\bm{b}_{a,2,pre}\}\in\mathcal{S} (12)

    where \Delta X=x_{g}-\overline{X} and \Delta Y=y_{g}-\overline{Y} denote the relative position of the target with respect to the COM of the snake robot, \Delta\theta=\theta_{g}-\overline{\theta}\in(-\pi,\pi] represents the relative direction of the target with respect to the main direction of the robot, \theta_{g}=\arctan(\Delta Y/\Delta X) is the angle between the X-axis and the line from the COM of the robot to the target, and \bm{b}_{a,1,pre} and \bm{b}_{a,2,pre} are two-step histories of the actions b_{a,1} and b_{a,2}, respectively, initialized as \{0,0\}.

    The velocities of the COM of the robot are not included in the state because the Froude number Fr [22] of the serpentine locomotion of the snake robot is small, indicating that frictional and gravitational effects dominate inertial effects. Two-step histories (longer than one step) are introduced to compensate for the omission of the velocity state.

  3.

    Reward function: The reward function r is pivotal for the RL agent to learn the desired behaviors. The training objective in this work is to drive the COM of the snake robot to reach a random target as soon as possible, with a preference for serpentine locomotion where the robot approaches the target along its main direction. Therefore, the reward assigned to the agent at time t is designed as:

    r_{t}=\begin{cases}w_{1}\frac{\Delta L_{t}}{\Delta L_{0}}+w_{2}\frac{2\Delta\theta_{r,t}}{\pi}+R_{g},&\Delta L_{t}\leq\epsilon\\ w_{1}\frac{\Delta L_{t}}{\Delta L_{0}}+w_{2}\frac{2\Delta\theta_{r,t}}{\pi},&\text{else}\end{cases} (13)

    where w_{1} and w_{2} are non-positive coefficients, and R_{g} is a large sparse positive success reward granted once the COM of the robot enters a neighborhood of the target with radius \epsilon. \Delta L_{t}=\sqrt{\Delta X_{t}^{2}+\Delta Y_{t}^{2}} is the distance between the COM of the robot and the target at time t, with \Delta L_{t}=\Delta L_{0} at t=0. The deflection \Delta\theta_{r,t}\in[0,\pi/2] is used in the reward so that the robot may also approach the target in the backward direction:

    \Delta\theta_{r,t}=\begin{cases}|\Delta\theta_{t}|,&-\pi/2\leq\Delta\theta_{t}\leq\pi/2\\ \pi-|\Delta\theta_{t}|,&\text{else}\end{cases} (14)
  4.

    Transition probabilities: The transition probability \mathcal{T}(\bm{s}^{\prime}|\bm{s},\bm{a}) characterizes the underlying dynamics of the robot system in the environment. In this study, we do not assume any detailed knowledge of this transition probability while developing our RL algorithm.
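As referenced above, a minimal sketch of the action-to-pressure mapping (Eqs. (9)-(11)) and of the reward (Eqs. (13)-(14)). Parameter values and names are illustrative assumptions, and the signs of w_{1}, w_{2} follow the non-positive convention stated in the text; this is not the authors' implementation:

import numpy as np

def channel_pressures(action, b_prev, t_r, T=1.0, p_m=276.0):
    """Pressures of the four channels at relative time t_r in one actuation period (Eqs. (9)-(11))."""
    b_a1, b_a2, c = action
    b = np.array([max(0.0, b_a1), max(0.0, b_a2), -min(0.0, b_a1), -min(0.0, b_a2)])  # Eq. (11)
    i = np.arange(1, 5)
    return p_m * np.sin(c * 2 * np.pi / T * t_r + (i - 1) * np.pi / 2) \
        + b_prev + (b - b_prev) * t_r / T                                              # Eq. (9)

def reward(dL_t, dL_0, dtheta_t, w1=-0.15, w2=-1.0, R_g=50.0, eps=0.03):
    """Reward of Eqs. (13)-(14); w1, w2 chosen non-positive to penalize distance and deflection."""
    dtheta_r = abs(dtheta_t) if abs(dtheta_t) <= np.pi / 2 else np.pi - abs(dtheta_t)  # Eq. (14)
    r = w1 * dL_t / dL_0 + w2 * 2 * dtheta_r / np.pi
    return r + R_g if dL_t <= eps else r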

III-C Experiments of RL algorithms

III-C1 Experimental Setups

The RL experiments for the locomotion and navigation of the snake robot were conducted in a customized dynamic simulator developed based on the aforementioned serpentine locomotion model (Sec. III-A). The soft snake robot had a length of 0.5 m and a linear density of 1.08 kg/m. The frictional anisotropy between the snake skins and the ground was set as \mu_{f}:\mu_{b}:\mu_{t}=1:1:1.5, and the maximum pressure bias b_{m} was set equal to p_{m}=276 kPa. The proportional constant K_{b} between the applied pressure difference and the curvature was set as 0.058 (kPa\cdotm)^{-1}. The period of the actuation and of the sinusoidal waves was 1 s.

The serpentine locomotion of the soft snake robot demonstrated approximate reversibility (Fig. 5A) when the function f was designed as \widetilde{\bm{a}_{t}}=f(\bm{a}_{t})=\{b_{a,1},b_{a,2},-c\} for \bm{a}_{t}=\{b_{a,1},b_{a,2},c\}. The trajectories in the simulation results (Fig. 5A) suggested a small K<1 (in Eq. (2)) for the locomotion and navigation of the soft snake robot when this function f was used.
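This reverse-action function is straightforward to implement; a minimal sketch consistent with the definition above:

def reverse_action(a_t):
    """f(a_t) for the snake robot: keep the biases, flip the traveling-wave direction c."""
    b_a1, b_a2, c = a_t
    return (b_a1, b_a2, -c)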

Figure 5: (A). The approximate reversibility of the movement of the soft snake robot with snake skins. (B). The fixed target and the sampling range of the random targets.
Figure 6: Experimental results of the training for locomotion and navigation of the soft snake robot with one fixed target (0,0.5(0,0.5 m)). (A) Returns; (B) Success rates; (C) Average distances; (D) Average deflections.
Figure 7: Evolution of the maximum Q-value at different locations during the training (from left to right: initial state, epoch 100, 500, 1k, 10k). (A) Training with BER; (B) Training with DDPG.

The soft snake robot was initialized in the simulator with a horizontal static curved shape ((\overline{X},\overline{Y})=(0,0), \overline{\theta}=0), zero-valued action histories, and a target (with neighborhood radius \epsilon=0.03 m). Control policies were learned by using BER (with DDPG) and several state-of-the-art benchmark algorithms, including DDPG, HER [13], and PPO [23]. The total number of training epochs was 10k, and the strategy for sampling transitions was P_{t,b}=0.5e^{-0.002i} when the epoch index i\leq 2500 and P_{t,b}=0 when i>2500, with P_{t,f}=1-P_{t,b} and P_{b}=P_{t,b}. The coefficients of the reward were set as w_{1}=0.15 and w_{2}=1, and one episode terminated either when the COM of the robot entered a neighborhood of the target and received a success reward (R_{g}=50) or when the exploration time exceeded 150 s.
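The distillation schedule above can be written compactly; a sketch:

import math

def backstep_sample_prob(i, cutoff=2500):
    """P_{t,b} as a function of the epoch index i: 0.5*exp(-0.002*i) until the cutoff, then 0."""
    return 0.5 * math.exp(-0.002 * i) if i <= cutoff else 0.0

# P_{t,f} = 1 - P_{t,b}, and the backward-trial probability P_b uses the same schedule.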

The return, success rate, average distance (the average of \Delta L_{t}/\Delta L_{0} over the time steps t), and average deflection (the average of \Delta\theta_{r,t} over the time steps t) were used to evaluate the algorithms during the training, with moving-window averaging (l_{window}=50 epochs) over training runs with different seeds. Three training experiments with different random seeds (for parameter initialization) were conducted to evaluate each algorithm, where the solid lines and the shaded areas show the means and the standard deviations, respectively (Figs. 6 and 8). An AMD 9820X processor with 64 GB memory and Ubuntu 18.04 was used for the training.

III-C2 Locomotion and Navigation with a Fixed Target

The performance of the algorithms was initially evaluated on the locomotion and navigation task of the robot, targeting a challenging fixed point (x_{g},y_{g})=(0, 0.5 m) (Fig. 5B). The training results showed that both DDPG and BER were able to solve the task and learn policies that reach the fixed target successfully, while HER had worse stability and PPO was unable to solve the task within the epoch limit (Fig. 6). BER also showed a faster convergence rate and better stability than the other baseline algorithms.

Figure 8: Experimental results of the training for locomotion and navigation of the soft snake robot with random targets. (A) Returns; (B) Success rates; (C) Average distances; (D) Average deflections.

The evolution of the maximum Q-value at different locations during the training process (with the same seed) revealed the underlying mechanism and the advantage of BER (Fig. 7). With BER, effective Q-values were estimated from both the start and the target locations, expediting successful explorations and the convergence of the estimation. BER learned a more informative Q-value distribution after 500 epochs than the baseline DDPG did after 1000 epochs. The final Q-value distribution of BER was also more accurate than that of the baseline DDPG, as manifested by the shapes and peak positions of the distributions.

III-C3 Locomotion and Navigation with Random Targets

A locomotion and navigation task of the soft snake robot with random targets was then explored using the different RL algorithms, where, owing to the symmetry of the system, the targets were randomly sampled from a half ring: g\in\{(d,\alpha)\,|\,d\in[0.3,1],\alpha\in[0,\pi]\} (Fig. 5B). In addition, a curriculum strategy was designed in which, for the i-th training epoch out of n total epochs, the targets were sampled uniformly from gradually expanding areas: g\in\{(d,\alpha)\,|\,d\in[0.3,1],\alpha\in[0,\psi]\cup(\frac{\pi}{2}-\psi,\frac{\pi}{2}+\psi]\cup(\pi-\psi,\pi]\}, with \psi=\frac{\pi}{4n^{2}}i^{2}.
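A sketch of this expanding target-sampling strategy (the curriculum schedule follows the expression above; choosing the sector uniformly and sampling d and \alpha uniformly within it are illustrative assumptions):

import random
import math

def sample_target(i, n, d_min=0.3, d_max=1.0):
    """Sample a target (x, y) from the gradually expanding half-ring for epoch i of n."""
    psi = math.pi / (4 * n ** 2) * i ** 2        # expanding half-width of the allowed sectors
    sectors = [(0.0, psi), (math.pi / 2 - psi, math.pi / 2 + psi), (math.pi - psi, math.pi)]
    lo, hi = random.choice(sectors)              # pick one allowed sector (an assumption)
    alpha = random.uniform(lo, hi)
    d = random.uniform(d_min, d_max)
    return d * math.cos(alpha), d * math.sin(alpha)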

The training results revealed that BER outperformed all other benchmark algorithms tested (Fig. 8). BER achieved the highest return and success rate during training, exhibiting more stable behavior and a smaller average deflection. In contrast, the baseline DDPG’s performance declined when introduced to a variety of targets, despite its strong early-stage performance. HER struggled to learn to reach targets in different areas, whereas PPO gradually learned an effective policy, a process that benefited from the random-goal training setup involving progressively changing targets.

The robot's trajectories further demonstrated BER's efficiency (Fig. 9), where the controllers with median success rates from each algorithm were used for control. A video of these experiments can be viewed at https://youtu.be/Z0da6rVu9j8. Three representative targets were tested: (-0.8 m, 0.1 m) for moving backward, (0, 0.5 m) for moving toward a lateral target, and (0.8 m, 0.1 m) for moving forward. The BER controller successfully and smoothly guided the robot to all targets. In contrast, the DDPG and HER controllers exhibited inefficient oscillations, possibly due to less accurate Q-function estimation. While the PPO controller managed to reach all targets, it also displayed inefficient oscillation and adopted a sub-optimal policy for the forward target (0.8 m, 0.1 m).

Figure 9: Trajectories of the COM of the soft snake robot by using the controllers learned by different algorithms. (A) Trajectories with a backward target where the relationships between positions and time are shown in (D), (G); (B) Trajectories with a lateral target where the relationships between positions and time are shown in (E), (H); (C) Trajectories with a forward target where the relationships between positions and time are shown in (F), (I).

The quantitative results of the algorithms (Table I) are average values obtained by testing the controllers trained with different seeds on 50 random targets sampled from the half-ring area (Fig. 5B). The average velocity (v_{avg}=\Delta L_{0}/t_{ep}, with t_{ep} the episode length) indicates the efficiency of the learned controllers. Notably, the average velocity of the robot with the BER controller (0.0169 m/s) was approximately 48% faster than that of the DDPG baseline (0.0114 m/s) and significantly higher than those of the other benchmarks.

Moreover, compared with the other algorithms, BER not only learned an efficient controller with respect to the primary reward term (smallest average deflection: 0.2920 rad), but was also able to sacrifice the secondary reward term to some extent (second smallest average distance: 0.4002 m/m) for better overall performance. The success rate of BER reached 100%, while those of the other baselines did not exceed 65%, demonstrating the advantage of BER in learning locomotion and navigation for the soft snake robot.

Table I: Testing performance comparisons of different algorithms.
Metrics | PPO | HER | DDPG | BER
Average velocity (m/s) | 0.0061 | 0.0080 | 0.0114 | 0.0169
Average distance (m/m) | 0.6241 | 0.5202 | 0.3903 | 0.4002
Average deflection (rad) | 0.4049 | 0.6702 | 0.3915 | 0.2920
Success rate (%) | 64.44 | 43.33 | 61.11 | 100

IV Conclusions and Discussions

A novel technique, Back-stepping Experience Replay, was proposed in this paper. It exploits back-stepping transitions constructed from the standard transitions of both forward and backward exploration trajectories, improving the learning efficiency of off-policy RL algorithms for approximately reversible systems. BER is compatible with arbitrary off-policy RL algorithms, as demonstrated by combining it with DQN in a bit-flipping task and with DDPG in a locomotion and navigation task for a soft snake robot.

As an application of the proposed BER, a model-free RL framework was developed for the locomotion and navigation of a soft snake robot, where a conventional locomotion model for real snakes was adopted to describe the serpentine locomotion of the soft snake robot and to build a simulator for learning. An RL formulation for locomotion and navigation was designed based on the characteristics of the robot. Extensive experiments showed that the proposed RL approach learned an efficient controller that drove the soft snake robot to fixed or even random targets using serpentine locomotion. For the tasks with random targets, the controller learned with BER achieved a 100% success rate, and the robot's average speed was 48% faster than that of the best baseline RL benchmark.

For future work, we will apply the proposed RL approach with BER to a physical soft snake robot system, where data from the physical system will be used for learning. We will also study the influence of the approximate reversibility of general systems on BER and analyze the convergence properties of BER for suitable state-of-the-art off-policy RL algorithms.

References

  • [1] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., “Mastering the game of go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
  • [2] X. Liu, R. Gasoto, Z. Jiang, C. Onal, and J. Fu, “Learning to locomote with artificial neural-network and cpg-based control in a soft snake robot,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7758–7765, IEEE, 2020.
  • [3] D. Chen, L. Jiang, Y. Wang, and Z. Li, “Autonomous driving using safe reinforcement learning by incorporating a regret-based human lane-changing decision model,” in 2020 American Control Conference (ACC), pp. 4355–4361, IEEE, 2020.
  • [4] M. Hua, D. Chen, X. Qi, K. Jiang, Z. E. Liu, Q. Zhou, and H. Xu, “Multi-agent reinforcement learning for connected and automated vehicles control: Recent advancements and future prospects,” arXiv preprint arXiv:2312.11084, 2023.
  • [5] C. Lee, M. Kim, Y. J. Kim, N. Hong, S. Ryu, H. J. Kim, and S. Kim, “Soft robot review,” International Journal of Control, Automation and Systems, vol. 15, pp. 3–15, 2017.
  • [6] A. Gupta, C. Eppner, S. Levine, and P. Abbeel, “Learning dexterous manipulation for a soft robotic hand from human demonstrations,” in 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3786–3793, IEEE, 2016.
  • [7] X. Liu, C. D. Onal, and J. Fu, “Reinforcement learning of cpg-regulated locomotion controller for a soft snake robot,” IEEE Transactions on Robotics, 2023.
  • [8] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” in International Conference on Machine Learning, vol. 99, pp. 278–287, Citeseer, 1999.
  • [9] G. Ostrovski, M. G. Bellemare, A. Oord, and R. Munos, “Count-based exploration with neural density models,” in International Conference on Machine Learning, pp. 2721–2730, PMLR, 2017.
  • [10] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell, “Curiosity-driven exploration by self-supervised prediction,” in International Conference on Machine Learning, pp. 2778–2787, PMLR, 2017.
  • [11] L. Choshen, L. Fox, and Y. Loewenstein, “Dora the explorer: Directed outreaching reinforcement action-selection,” arXiv preprint arXiv:1804.04012, 2018.
  • [12] A. P. Badia, P. Sprechmann, A. Vitvitskyi, D. Guo, B. Piot, S. Kapturowski, O. Tieleman, M. Arjovsky, A. Pritzel, A. Bolt, et al., “Never give up: Learning directed exploration strategies,” arXiv preprint arXiv:2002.06038, 2020.
  • [13] M. Andrychowicz et al., “Hindsight experience replay,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [14] M. Fang, C. Zhou, B. Shi, B. Gong, J. Xu, and T. Zhang, “Dher: Hindsight experience replay for dynamic goals,” in International Conference on Learning Representations, 2018.
  • [15] Y. Ding, C. Florensa, P. Abbeel, and M. Phielipp, “Goal-conditioned imitation learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [16] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [17] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [18] X. Qi, H. Shi, T. Pinto, and X. Tan, “A novel pneumatic soft snake robot using traveling-wave locomotion in constrained environments,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1610–1617, 2020.
  • [19] X. Qi, T. Gao, and X. Tan, “Bioinspired 3d-printed snakeskins enable effective serpentine locomotion of a soft robotic snake,” Soft Robotics, vol. 10, no. 3, pp. 568–579, 2023.
  • [20] M. Luo, M. Agheli, and C. D. Onal, “Theoretical modeling and experimental analysis of a pressure-operated soft robotic snake,” Soft Robotics, vol. 1, no. 2, pp. 136–146, 2014.
  • [21] D. L. Hu, J. Nirody, T. Scott, and M. J. Shelley, “The mechanics of slithering locomotion,” Proceedings of the National Academy of Sciences, vol. 106, no. 25, pp. 10081–10085, 2009.
  • [22] D. L. Hu and M. Shelley, “Slithering locomotion,” in Natural locomotion in fluids and on surfaces: swimming, flying, and sliding, pp. 117–135, Springer, 2012.
  • [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.