
Robot Deformable Object Manipulation via NMPC-generated Demonstrations in Deep Reinforcement Learning

Haoyuan Wang, Zihao Dong, Hongliang Lei, Zejia Zhang, Weizhuang Shi, Wei Luo, Weiwei Wan,   and Jian Huang This work was supported in part by National Natural Science Foundation of China under Grant 62333007 and U24A20280, and in part by Hubei Provincial Technology Innovation Program under Grant 2024BAA007. H. Wang and Z. Dong contributed equally to this work. (Corresponding authors: Jian Huang; Wei Luo.)H. Wang, Z. Dong, H. Lei, Z. Zhang, W. Shi, and J. Huang are with Hubei Key Laboratory of Brain-inspired Intelligent Systems and the Key Laboratory of Image Processing and Intelligent Control, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology (HUST), Wuhan 430074, Hubei, China. Email: {why427@, M202173202@, leihl@, zejiazhang@, swz@, huang_jan@mail.}hust.edu.cnW. Luo is with Department of Innovation Center, China Ship Development and Design Center, Wuhan 430064, Hubei, China. Email: csddc_weiluo@mail.163.comW. Wan is with the Graduate School of Engineering Science, Osaka University, Toyonaka 560-0043, Japan. Email: wan@sys.es.osaka-u.ac.jp
Abstract

In this work, we study robotic manipulation of deformable objects based on demonstration-enhanced reinforcement learning (RL). To improve the learning efficiency of RL, we enhanced the utilization of demonstration data from multiple aspects and proposed the HGCR-DDPG algorithm. It uses a novel high-dimensional fuzzy approach for grasping-point selection, a refined behavior-cloning method to enhance data-driven learning in Rainbow-DDPG, and a sequential policy-learning strategy. Compared to the baseline algorithm (Rainbow-DDPG), our proposed HGCR-DDPG achieved 2.01 times the global average reward and reduced the global average standard deviation to 45% of that of the baseline. To reduce the human labor cost of demonstration collection, we proposed a low-cost demonstration collection method based on Nonlinear Model Predictive Control (NMPC). Simulation results show that demonstrations collected through NMPC can be used to train HGCR-DDPG, achieving results comparable to those obtained with human demonstrations. To validate the feasibility of the proposed methods in real-world environments, we conducted physical experiments on deformable object manipulation, using fabric to perform three tasks: diagonal folding, central-axis folding, and flattening. The proposed method achieved success rates of 83.3%, 80%, and 100% on these three tasks, respectively, validating its effectiveness. Compared to current large-model approaches for robot manipulation, the proposed algorithm is lightweight, requires fewer computational resources, and offers efficient task-specific customization and adaptability.

Index Terms:
Deformable objects, robotic manipulation, reinforcement learning, demonstration, nonlinear model predictive control

I Introduction

Deformable objects play a critical role in numerous key industries and are widely used across various sectors [1]. Their manipulation is a common task in manufacturing [2, 3], medical surgery [4, 5], and service robotics [6, 7, 8]. However, manual handling of deformable objects can be time-consuming, labor-intensive, costly, and may not guarantee efficiency or accuracy. Consequently, robots are often employed to replace human operators for manipulating deformable objects, such as connecting cables on automated assembly lines [9, 10], cutting or suturing soft tissue during medical surgeries [11], and handling fabrics like towels and clothes in home service scenarios [12, 13]. Automating the manipulation of deformable objects with robots can significantly reduce labor costs while improving operational efficiency and precision. Therefore, robotic systems for manipulating deformable objects have attracted considerable attention and research [14].

Currently, the majority of robotic manipulation research focuses on rigid objects, where the deformation caused during grasping is negligible. In contrast, when dealing with deformable objects, robots face many new challenges, including high-dimensional state spaces, complex dynamics, and highly nonlinear physical properties [15]. To address these challenges, some researchers have established dynamical models for deformable objects and designed robotic manipulation strategies based on these models [16, 17]. Nonetheless, ensuring high accuracy in the dynamical model presents significant difficulties, and the derivation of model gradients can be highly complex [15, 18]. To avoid the complexity of dynamical model derivation, some researchers have turned to learning-based methods, particularly reinforcement learning (RL) and imitation learning (IL) [15]. These methods learn control policies from data using learning algorithms, without requiring explicit dynamical modeling. RL involves the agent continuously exploring the action space through trial and error, collecting interaction data with the environment to facilitate learning. Still, in real-world scenarios, the intricacy involved in handling deformable objects frequently results in inefficient learning processes that require extensive time and data, yielding less-than-optimal results. Therefore, it is crucial to set more effective states and action spaces to reduce task complexity based on domain-specific knowledge [19, 20]. With the advancement of deep learning technology, Deep Reinforcement Learning (DRL) is being used to tackle deformable object manipulation problems. Matas et al. [21] trained agents using DRL methods in simulation environments to fold clothes or place them on hangers. Researchers incorporated seven commonly used techniques in the DRL field into the Deep Deterministic Policy Gradient (DDPG) to develop the Rainbow-DDPG algorithm and validated the effectiveness of these techniques through ablation experiments. Additionally, they conducted deformable object manipulation experiments in real scenes through domain randomization. Jangir et al. [22] treated the coordinates of a series of key points on the fabric as the state space, introducing Behavioral Cloning (BC) and Hindsight Experience Replay (HER) to improve the DDPG algorithm for handling tasks involving dynamic manipulation of fabrics by robots. They also studied the impact of key point selection on RL performance. Despite making some progress in learning effective strategies for deformable object manipulation, DRL still faces challenges in terms of learning efficiency due to the inherent complexity of such manipulation tasks, requiring substantial amounts of data and computational resources for training.

Collecting human demonstration data and extracting operational skills using IL algorithms from these demonstrations has also received extensive research attention. Unlike the trial-and-error mechanism of RL, IL completes tasks through observation and imitation of expert behavior. This method has unique advantages in handling tasks that are too complex for RL or where clear rewards are difficult to define [23]. With the continuous development of deep learning technology and hardware infrastructure, recent research has been able to collect a large amount of human demonstration data and utilize deep learning techniques to extract manipulation skills from it [24, 25, 26]. Although extracting manipulation skills from a large amount of human demonstrations can yield decent results, the high manpower cost associated with this approach is often challenging and unsustainable. Some studies combine RL and IL, leveraging human demonstrations to enhance the learning efficiency of RL while also benefiting from RL’s ability for autonomous exploration [21, 22, 27]. Balaguer et al. [27] used the K-means clustering algorithm to categorize human demonstration actions into M classes, testing the feasibility of each type of human demonstration action on the robot and selecting the feasible one with the highest reward as the starting point for agent exploration, thus streamlining the search space of RL algorithms. It is worth mentioning that the Rainbow-DDPG algorithm mentioned earlier [21] and the work by Jangir et al. [22] also incorporate human demonstration data to improve the learning efficiency of RL. Undeniably, the morphological diversity of deformable objects imposes higher requirements on the range of operational scenarios covered by demonstration data. To cover as wide a range of operational scenarios as possible, researchers typically need to collect a large amount of demonstration data. Existing studies [24, 25] use manually collected large-scale demonstration data for training purposes, which inevitably incurs significant human resource costs.

To address the challenges of low learning efficiency in RL and the difficulty in collecting IL demonstration data mentioned above, this article adopts a learning-based approach to tackle the problem of deformable object manipulation by robots. The aim is to improve the learning efficiency of algorithms and reduce learning costs. The research optimizes existing RL algorithms from two perspectives. First, by integrating IL, we innovatively design the HGCR-DDPG algorithm. It leverages a novel high-dimensional fuzzy-based approach to select grasping points, a refined behavior cloning-inspired method to boost data-driven learning in Rainbow-DDPG, and a sequential policy-learning strategy. This holistic design enhances RL learning efficiency. Second, we develop a low-cost demonstration data collection method using NMPC. It is built upon a spring-mass model, enabling automated data generation and effective mapping to robot actions, thus reducing data collection costs. Through these methods, robots can learn manipulation skills with higher efficiency and lower cost, and thus operate deformable objects more effectively in practical applications. Specifically, the main contributions of this article are as follows:

  • A demonstration-enhanced RL method, named HGCR-DDPG, that uses human demonstration data to increase the learning efficiency of RL.

  • A demonstration data collection method in the simulation environment based on Nonlinear Model Predictive Control (NMPC) that reduces the cost of demonstration data collection.

  • Experimental validation of the feasibility of the proposed methods in both simulation and real-world environments.

The article is divided into 6 sections, with the main research content and their relationships shown in Fig. 1. The specific content arrangement is as follows: Section I is the introduction. Section II formulates the problem of robotic deformable object manipulation. Section III addresses the issue of low learning efficiency in RL by proposing the HGCR-DDPG algorithm, which combines a High-Dimensional Takagi-Sugeno-Kang (HTSK) fuzzy system, Generative Adversarial Behavior Cloning (GABC) techniques, Rainbow-DDPG, and Conditional Policy Learning (CPL), and addresses the high cost of collecting demonstration data by proposing a low-cost demonstration collection method based on NMPC. Section IV presents the simulation and physical experiment settings used to validate the proposed methods. Section V presents the experimental results. Section VI summarizes the article and outlines future work.

Figure 1: The diagram of this work.

II Problem Formulation

II-A Introduction to the Simulation Environment and the Physical Experiment Platform

A complex deformable object manipulation simulation environment was established using PyBullet [34]. Specifically, a UR5e robot model and a square cloth model with a side length of 0.24 meters were constructed. The cloth model consisted of detailed triangular meshes, with its mass uniformly distributed across the mesh vertices (i.e., nodes). The process of the robot grasping the cloth was simulated by establishing precise anchor connections between the corresponding nodes of the cloth model and the robot end-effector. During the simulation, the cloth model was affected only by gravity, friction with the table surface, and the traction force exerted by the robot. The simulation experiments were conducted on a high-performance server equipped with 48GB of RAM, an Intel E5-2660 v4 processor, and an NVIDIA 3090 graphics card.
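For concreteness, the listing below is a minimal PyBullet sketch of the grasp-by-anchor mechanism described above: a mass-spring cloth is loaded into a deformable world and one of its mesh nodes is anchored to a robot link, which is how grasping is emulated in simulation. The mesh file, the stand-in Panda arm (the experiments use a UR5e model, which is not bundled with pybullet_data), the stiffness values, and the node index are illustrative assumptions only.

import pybullet as p
import pybullet_data

# Minimal sketch: cloth as a mass-spring soft body, "grasped" via a node anchor.
p.connect(p.DIRECT)
p.resetSimulation(p.RESET_USE_DEFORMABLE_WORLD)      # enable deformable dynamics
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.81)
p.loadURDF("plane.urdf")
cloth = p.loadSoftBody("cloth_z_up.obj", basePosition=[0, 0, 0.01], scale=0.24,
                       mass=0.05, useMassSpring=1, useBendingSprings=1,
                       springElasticStiffness=40, springDampingStiffness=0.1,
                       frictionCoeff=0.5, useSelfCollision=1)
arm = p.loadURDF("franka_panda/panda.urdf", useFixedBase=True)  # stand-in for the UR5e
ee_link = 11                                          # end-effector link of the stand-in arm
# Grasp: anchor cloth mesh node 0 to the end-effector link; release by removing it.
anchor = p.createSoftBodyAnchor(cloth, 0, arm, ee_link)
for _ in range(240):
    p.stepSimulation()
p.removeConstraint(anchor)                            # "open the gripper"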

The deformable object manipulation system in this article consists of sensor subsystems, decision and control subsystems, and robot motion subsystems, as illustrated in Fig. S1(a). The target object for the experiment is a red square fabric with a side length of 0.24 meters. The sensor subsystem is responsible for capturing image information of the fabric. The robot motion subsystem executes precise grasping and placing actions. The decision and control subsystem extracts key features from visual information, utilizes the algorithm to generate decision commands, and communicates with the robot motion subsystem via ROS. As depicted in Fig. S1(b), the Intel RealSense D435i camera is fixed in an “eye-to-hand” manner, the UR5e robotic arm is mounted on a dedicated workstation, and the RG2 gripper, serving as the execution end, is installed at the end of the robot.

II-B Problem Formulation for Robotic Manipulation of Deformable Objects under the DRL Framework

This section provides a detailed introduction to the DRL model of this study from five aspects: task setting, state space, action space, state transition, and reward setting.

II-B1 Task Setting

Taking a square piece of fabric as a representative deformable object, three tasks are designed: folding along the diagonal, folding along the central axis, and flattening. In each task, the robot is allowed a maximum number of operations, denoted as t_{\text{m}}, which varies among different tasks. At the beginning of each experimental round, the robot returns to its default initial pose, while the initial position of the fabric is set according to the requirements of the specific task and is subject to a certain degree of random noise. The details are as follows:

  • Folding along the diagonal: The specific objective of the operation is to achieve perfect alignment of one pair of diagonal endpoints of the fabric, while maintaining the distance between the other pair of diagonal endpoints exactly equal to the length of the fabric diagonal, and ensuring that the area of the fabric is equal to half of its area when fully unfolded.

  • Folding along the central axis: Before folding, the two sets of fabric endpoints should be symmetrically arranged relative to the folding axis. After folding, these two sets of endpoints should coincide, while the distance between the endpoints on the same side of the folding axis remains the same as before folding, and the area of the fabric equals half of its area when fully unfolded.

  • Flattening: When faced with heavily wrinkled fabric, the robot’s task is to flatten it to its maximum area. At the beginning of each experiment, the fabric is initialized by the robot applying random actions within the first 10 time steps. The robot moves a point on the fabric from its initial position to a placement point within a distance of 0.1 m to 0.2 m during each random step, ensuring sufficient disturbance to generate random wrinkles.

II-B2 State Space

Previous studies [21, 30] often directly fed the visual information of the scene as state inputs to DRL, which is intuitive but results in an overly complex state space. Some research [22] simplifies the state space in simulation by using the coordinates of deformable object feature points as state inputs, which is simple but difficult to directly transfer to the real world. This article adopts a compromise solution. Algorithms from OpenCV are utilized to preprocess the visual information of the scene, and the processed results are then used as state inputs for DRL. The state spaces of three different tasks are introduced as follows.

In both the along-diagonal and along-axis folding tasks, using Canny edge detection [31] and the Douglas-Peucker algorithm in OpenCV, the four right-angle corners of the fabric can be identified. During robot manipulation, considering the relationship between the fabric’s motion speed and the robot’s operation speed, this article, under the premise of relatively slow robot operation, employs the pyramid Lucas-Kanade optical flow tracking method to track the four corners. This article selects the positions of the four corners of the square fabric and the proportion f_{t} of the fabric’s current area to its area when fully flattened as the state representation in these two folding tasks. This results in a 13-dimensional state space, as shown in Fig. 2(a). The symbols defining the state variables for the folding tasks are as follows:

\boldsymbol{s}_{t}=(\boldsymbol{p}_{1_{t}},\boldsymbol{p}_{2_{t}},\boldsymbol{p}_{3_{t}},\boldsymbol{p}_{4_{t}},f_{t}), (1)

where \boldsymbol{p}_{1_{t}},\boldsymbol{p}_{2_{t}},\boldsymbol{p}_{3_{t}},\boldsymbol{p}_{4_{t}} respectively represent the three-dimensional coordinates of the four corners of the fabric at time t. All coordinates and vectors mentioned in this article are described with respect to the base coordinate system of the robot they are associated with.

In the flattening task, the fabric’s initial state is heavily wrinkled, making it extremely difficult to detect the right-angle corners of the fabric. We use Canny edge detection and the Douglas-Peucker algorithm in OpenCV to fit the contour of the fabric into an octagon, representing the eight points on the contour that best characterize the shape of the fabric. The coordinates of these eight endpoints, the coordinates of the center point of the fabric contour, and the proportion of the fabric’s current area to its area when fully flattened constitute the state representation. Ultimately, the dimensionality of the state space used in the flattening task is 28, as shown in Fig. 2(b). The symbols defining the state variables for the flattening task are as follows:

\boldsymbol{s}_{t}=(\boldsymbol{p}_{1_{t}},\boldsymbol{p}_{2_{t}},\cdots,\boldsymbol{p}_{8_{t}},\boldsymbol{p}_{\text{c}_{t}},f_{t}), (2)

where \boldsymbol{p}_{1_{t}},\boldsymbol{p}_{2_{t}},\cdots,\boldsymbol{p}_{8_{t}} respectively represent the three-dimensional coordinates of the eight fitted endpoints of the fabric at time t, and \boldsymbol{p}_{\text{c}_{t}} represents the three-dimensional coordinates of the center point of the fabric contour at time t.

Figure 2: State Spaces for Different Tasks.
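As an illustration of the preprocessing pipeline described above, the following sketch extracts the flattening-task state of (2) from a color image: the cloth mask is edge-detected with Canny, its outer contour is simplified with the Douglas-Peucker algorithm until roughly eight vertices remain, and the contour center and area ratio f_{t} are appended. The HSV thresholds for the red fabric and the pixel-to-robot-frame mapping depth_to_xyz are assumptions, not values taken from the paper.

import cv2
import numpy as np

def flattening_state(bgr, depth_to_xyz, full_area_px):
    """Approximate the 28-D flattening state of Eq. (2) from one camera frame."""
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    mask = cv2.inRange(hsv, (0, 120, 70), (10, 255, 255))    # red-fabric mask (assumed thresholds)
    edges = cv2.Canny(mask, 50, 150)
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnt = max(contours, key=cv2.contourArea)                 # outer cloth contour
    eps = 0.01 * cv2.arcLength(cnt, True)                    # Douglas-Peucker tolerance
    approx = cv2.approxPolyDP(cnt, eps, True)
    while len(approx) > 8:                                   # loosen until an octagon (or fewer) remains
        eps *= 1.2
        approx = cv2.approxPolyDP(cnt, eps, True)
    corners = approx.reshape(-1, 2)
    center_uv = corners.mean(axis=0)
    f_t = cv2.contourArea(cnt) / full_area_px                # area ratio f_t
    pts_3d = [depth_to_xyz(u, v) for (u, v) in corners]      # map pixels to the robot base frame
    return np.concatenate([np.ravel(pts_3d), depth_to_xyz(*center_uv), [f_t]])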

II-B3 Action Space

During the process of collecting human demonstrations, we found that humans only need to manipulate the four endpoints of the fabric to complete all folding tasks. In contrast, a strategy solely based on manipulating the endpoints of the fabric outline has minimal effect in flattening tasks. Fig. S2 explains this phenomenon: the fabric in a folded state is divided into the upper layer (green), intermediate connecting parts (purple), and lower layer (orange), as shown in Fig. S2(a). Effective relative displacement between the upper and lower layers occurs only when manipulating points on the upper layer, as depicted in Fig. S2(b). Conversely, manipulating the lower layer or the connecting parts, as shown in Fig. S2(c), mostly results in overall movement of the fabric, which is not substantially helpful for flattening tasks.

This article designs a motion vector, as illustrated in Fig. 3. An offset vector \boldsymbol{\delta}_{t} is introduced based on the fabric’s edge endpoints to enable the robot to grasp points on the upper layer of the fabric. By adjusting \boldsymbol{\delta}_{t}, the robot can grasp any part of the fabric to manipulate it.

Figure 3: Illustration of Offset Vectors.

This article refines the operation process into three key steps: firstly, selecting one endpoint from the state variables as the base grasping point; secondly, generating an offset vector \boldsymbol{\delta}_{t} to accurately adjust the position of the grasping point; thirdly, determining the coordinates of the placement point to guide the robot to complete the entire action from grasping to placing. The representation of the action space is described in (3):

\boldsymbol{a}_{t}=(p_{\text{g}_{t}},\boldsymbol{\delta}_{t},\boldsymbol{p}_{\text{p}_{t}}), (3)

where p_{\text{g}_{t}} represents the index of the endpoint selected at time t in the state variables, \boldsymbol{\delta}_{t} represents the offset vector for time t, and \boldsymbol{p}_{\text{p}_{t}} represents the coordinates of the placement point at time t.

II-B4 State Transition

The state transition from time t to time t+1 is controlled by \boldsymbol{a}_{t} as expressed in (3). Initially, the robot determines the coordinates of the corresponding endpoint \boldsymbol{p}_{\text{g}_{t}} in \boldsymbol{s}_{t} based on p_{\text{g}_{t}}. Subsequently, by combining \boldsymbol{\delta}_{t}, the actual grasping coordinates \boldsymbol{g}_{t}=\boldsymbol{p}_{\text{g}_{t}}+\boldsymbol{\delta}_{t} are calculated, guiding the end effector to execute the grasping action at the \boldsymbol{g}_{t} position. Finally, the robot moves the end effector to the \boldsymbol{p}_{\text{p}_{t}} position, opens the gripper, and completes the placing operation.
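The state transition above reduces to a single pick-and-place primitive, sketched below. The robot interface with pick() and place() methods is hypothetical; in the experiments these motions are realized by the UR5e with the RG2 gripper (and by anchor constraints in simulation).

import numpy as np

def execute_action(state_points, action, robot):
    """One state transition: a_t = (p_g, delta, p_p) -> grasp at g_t, release at p_p."""
    p_g_idx, delta, p_place = action
    g_t = np.asarray(state_points[p_g_idx]) + np.asarray(delta)  # g_t = p_g + delta_t
    robot.pick(g_t)                   # move above g_t, descend, close the gripper (hypothetical API)
    robot.place(np.asarray(p_place))  # move to p_p, open the gripper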

II-B5 Reward Setting

In the diagonal folding task, as shown in Fig. 2, the target state of the fabric is when endpoint 1 and endpoint 3 coincide, and the distance between endpoint 2 and endpoint 4 equals the length of the diagonal, with the unfolded area of the fabric equal to half of its fully unfolded area. Assuming the side length of the square fabric is l_{s}, we define the difference e_{t} between the fabric state at time t and the target state in the diagonal folding task as follows:

e_{t}=\big\|\boldsymbol{p}_{1_{t}}-\boldsymbol{p}_{3_{t}}\big\|_{2}+\Big|\sqrt{2}l_{s}-\big\|\boldsymbol{p}_{2_{t}}-\boldsymbol{p}_{4_{t}}\big\|_{2}\Big|+\big|f_{t}-0.5\big|. (4)

The objective of folding along the central axis is to align endpoint 1 with endpoint 2 and endpoint 3 with endpoint 4, ensure that the distance between endpoint 1 and endpoint 3 equals the distance between endpoint 2 and endpoint 4, both equal to l_{s}, and ensure that the fabric’s unfolded area equals half of its fully unfolded area. We define the gap e_{t} between the fabric state at time t and the target state in the folding along the central axis task as follows:

e_{t}=\big\|\boldsymbol{p}_{1_{t}}-\boldsymbol{p}_{2_{t}}\big\|_{2}+\big\|\boldsymbol{p}_{3_{t}}-\boldsymbol{p}_{4_{t}}\big\|_{2}+\Big|l_{s}-\big\|\boldsymbol{p}_{1_{t}}-\boldsymbol{p}_{3_{t}}\big\|_{2}\Big|+\Big|l_{s}-\big\|\boldsymbol{p}_{2_{t}}-\boldsymbol{p}_{4_{t}}\big\|_{2}\Big|+\big|f_{t}-0.5\big|. (5)

For e_{t} in the different folding tasks, we define their reward functions as follows:

r(\boldsymbol{s}_{t},\boldsymbol{a}_{t})=\begin{cases}-200e_{t}+100,&done\\ 3,&\text{not }done\text{ and }e_{t-1}-e_{t}>t_{\text{z}}\\ -3,&\text{not }done\text{ and }e_{t}-e_{t-1}>t_{\text{z}}\\ 0,&\text{otherwise}\end{cases}, (6)

where done represents the completion status of the task, which becomes True when the maximum number of operations t_{\text{m}} is reached, and t_{\text{z}} is the threshold used to determine whether the fabric state has changed significantly. According to (6), the reward mechanism r(\boldsymbol{s}_{t},\boldsymbol{a}_{t}) assigns rewards or penalties to the agent based on its immediate actions and states, following these guidelines:

  • At the end of a round, a decisive reward of -200e_{t}+100 is given based on the error e_{t}. The greater the error, the lower the decisive reward. The decisive reward is 100 when e_{t} is 0.

  • If the round is not over and the error significantly decreases, i.e., e_{t-1}-e_{t}>t_{\text{z}}, indicating the agent is approaching the target, a positive reward of 3 is given to encourage similar behavior.

  • Conversely, if the round is not over and the error significantly increases, i.e., e_{t}-e_{t-1}>t_{\text{z}}, indicating the agent deviates from the target, a penalty of -3 is applied to suppress this behavior.

In the flattening task, we directly define the reward function based on the ratio f_{t} of the fabric’s unfolded area at time t to its fully unfolded area:

r(\boldsymbol{s}_{t},\boldsymbol{a}_{t})=\begin{cases}200f_{t}-100,&done\\ 3,&\text{not }done\text{ and }f_{t}-f_{t-1}>t_{\text{z}}\\ -3,&\text{not }done\text{ and }f_{t-1}-f_{t}>t_{\text{z}}\\ 0,&\text{otherwise}\end{cases}. (7)

The criteria followed here are similar to those described in (6), and will not be repeated here.
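A direct translation of the reward rules (6) and (7) is sketched below; e_{t}, f_{t}, and the threshold t_{\text{z}} are computed as defined above.

def folding_reward(e_t, e_prev, done, t_z):
    """Eq. (6): decisive reward at episode end, +/-3 shaping for significant progress/regress."""
    if done:
        return -200.0 * e_t + 100.0
    if e_prev - e_t > t_z:          # error decreased significantly
        return 3.0
    if e_t - e_prev > t_z:          # error increased significantly
        return -3.0
    return 0.0

def flattening_reward(f_t, f_prev, done, t_z):
    """Eq. (7): the same scheme driven by the unfolded-area ratio f_t."""
    if done:
        return 200.0 * f_t - 100.0
    if f_t - f_prev > t_z:
        return 3.0
    if f_prev - f_t > t_z:
        return -3.0
    return 0.0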

II-C Establishment and Analysis of the Spring-Mass Model

We introduce a low-cost demonstration collection method based on NMPC. An NMPC problem is built on the spring-mass particle model, and optimal control strategies are obtained by solving this problem to accomplish specific tasks. This achieves automated generation of demonstration data and significantly cuts data collection costs. However, in complex states of deformable objects (e.g., heavily wrinkled fabric), extracting the state of all particles in real environments poses significant challenges, limiting the feasibility of NMPC in real environments. Therefore, the purpose of the NMPC method is to automatically collect demonstration data in simulation to assist RL training.

The spring-mass particle model adopted in this article is illustrated in Fig. 4(a). This model can be viewed as a system of particles connected by multiple springs, with the mass of the cloth evenly distributed among the particles. The spring-mass particle model established in this article consists of a set of N_{\text{sp}} particles denoted by \mathcal{L}=(1,2,\cdots,N_{\text{sp}}) and a set of M springs denoted by \mathcal{S}. The set of neighbors of the i-th particle, i.e., the set of particles connected to particle i by springs, is defined as \mathcal{N}_{i}, and it is assumed that the neighbor set of each particle is fixed.

Figure 4: Illustration of the Spring-Mass Model and Node Numbering.

We use a square cloth, so N_{\text{sp}}=n^{2}, where n is the number of particles on one edge of the cloth. The mass of each particle i is denoted by m_{i}, with m_{i}=m,\forall i. The position of particle i at time t is represented by the three-dimensional coordinate vector \boldsymbol{x}_{t}^{i}=(x_{t}^{i},y_{t}^{i},z_{t}^{i}). The stiffness coefficient of the springs usually depends on the physical properties of the material, while the natural length depends on the initial state of the cloth. We assume that all spring stiffness coefficients and natural lengths are equal, denoted as k and l, respectively. The spring connecting particle i and particle j is denoted by (i,j) (or (j,i), which is equivalent). Thus, the neighbor set \mathcal{N}_{i} can be expressed as:

\mathcal{N}_{i}=\{j\,|\,(i,j)\in\mathcal{S}\}. (8)

The particles are numbered from left to right and from top to bottom. Specifically, the particles in the first row are numbered from 1 to n, the particles in the second row are numbered from n+1 to 2n, and so on. Fig. 4(b) illustrates the case when n=6. Particles are initially subjected to the force exerted by the springs, which depends on the relative positions of the particles. For any two particles i and j connected by a spring, the spring force \boldsymbol{s}_{t}^{i,j} at time t can be expressed as:

\boldsymbol{s}_{t}^{i,j}=k(l_{t}^{i,j}-l)\frac{\boldsymbol{x}_{t}^{i}-\boldsymbol{x}_{t}^{j}}{l_{t}^{i,j}}, (9)

where l_{t}^{i,j} represents the actual distance between particles i and j at time t, which can be calculated using the following equation:

l_{t}^{i,j}=\big\|\boldsymbol{x}_{t}^{i}-\boldsymbol{x}_{t}^{j}\big\|. (10)

The damping force can be expressed as follows:

\boldsymbol{d}_{t}^{i,j}=-c(\boldsymbol{v}_{t}^{i}-\boldsymbol{v}_{t}^{j}), (11)

where c is the damping coefficient and \boldsymbol{v}_{t}^{i}=(v_{t,x}^{i},v_{t,y}^{i},v_{t,z}^{i}) is the velocity vector of particle i at time t, with v_{t,x}^{i},v_{t,y}^{i},v_{t,z}^{i} being the velocity components of particle i in the x,y,z directions, respectively. Each particle is also influenced by gravity \boldsymbol{G}=mg, the external force \boldsymbol{u}_{t}^{i}, and the damping force \boldsymbol{d}_{t}^{i,j}. In summary, the total force \boldsymbol{F}_{t}^{i} acting on particle i at time t can be expressed as:

\boldsymbol{F}_{t}^{i}=\sum_{j\in\mathcal{N}_{i}}\boldsymbol{s}_{t}^{i,j}+\boldsymbol{G}+\boldsymbol{u}_{t}^{i}+\sum_{j\in\mathcal{N}_{i}}\boldsymbol{d}_{t}^{i,j}. (12)

We express the acceleration \boldsymbol{a}_{t}^{i} of particle i at time t as:

\boldsymbol{a}_{t}^{i}=\frac{\boldsymbol{F}_{t}^{i}}{m}. (13)

Furthermore, the particle’s velocity can be updated as:

\boldsymbol{v}_{t+\Delta t}^{i}=\boldsymbol{v}_{t}^{i}+\Delta t\cdot\boldsymbol{a}_{t}^{i}. (14)

Next, we treat \boldsymbol{x}_{t}^{i} as a function of time t. Then, the position \boldsymbol{x}_{t_{0}+\Delta t}^{i} of particle i at time t_{0}+\Delta t can be obtained by a second-order Taylor expansion at t=t_{0}:

\boldsymbol{x}_{t_{0}+\Delta t}^{i}=\boldsymbol{x}_{t_{0}}^{i}+\Delta t\cdot\boldsymbol{v}_{t_{0}}^{i}+\frac{1}{2}\Delta t^{2}\cdot\boldsymbol{a}_{t_{0}}^{i}. (15)

This article uses a small time step \Delta t and sets the damping coefficient c to a large positive value to ensure the stability of the model. In the following text, t+n\cdot\Delta t is abbreviated as t+n to simplify the subscript. The update formula for the particle position is shown in (16):

\boldsymbol{x}_{t+1}^{i}=\boldsymbol{x}_{t}^{i}+\Delta t\cdot\boldsymbol{v}_{t}^{i}+\frac{1}{2}\Delta t^{2}\cdot\boldsymbol{a}_{t}^{i}. (16)
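The sketch below integrates one step of the spring-mass model, combining (9)-(16); the spring term is applied with the usual restoring sign so that a stretched spring pulls its two particles together. Array shapes and parameter values are illustrative.

import numpy as np

def spring_mass_step(x, v, u, springs, m, k, l0, c, dt, g=9.81):
    """One explicit step: x, v, u are (N_sp, 3) positions, velocities, external forces;
    springs is the list of connected index pairs (i, j)."""
    f = u.copy()
    f[:, 2] -= m * g                          # gravity on every particle
    for i, j in springs:
        d = x[i] - x[j]
        length = np.linalg.norm(d)            # l_t^{i,j}, Eq. (10)
        s = k * (length - l0) * d / length    # spring force magnitude/direction, Eq. (9)
        f[i] -= s                             # restoring convention: a stretched spring pulls i, j together
        f[j] += s
        damp = -c * (v[i] - v[j])             # damping force, Eq. (11)
        f[i] += damp
        f[j] -= damp
    a = f / m                                 # Eq. (13)
    x_next = x + dt * v + 0.5 * dt**2 * a     # Eq. (16)
    v_next = v + dt * a                       # Eq. (14)
    return x_next, v_next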

III Methodology

III-A HGCR-DDPG Algorithm

III-A1 Algorithm for Selecting Benchmark Grasping Points Based on the HTSK Fuzzy System

In this article, as long as a suitable final \boldsymbol{p}_{\text{p}_{t}} is chosen, selecting any reference p_{\text{g}_{t}} can promote the task to some extent. Therefore, the selection of the fabric grasping point p_{\text{g}_{t}} is well suited to a fuzzy-set treatment. We use the state-action pairs (\boldsymbol{s}_{t},\boldsymbol{a}_{t}), with the state \boldsymbol{s}_{t} as input and p_{\text{g}_{t}} from the action \boldsymbol{a}_{t} as output, to construct an HTSK fuzzy system, denoted as H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}), where \boldsymbol{\theta}^{h} represents its parameters. The input-output relationship of H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}) is represented as follows:

p_{\text{g}_{t}}=H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}). (17)

The training data for the selection strategy of the reference grasping point is sourced from the human demonstration dataset \mathcal{D}_{\text{demo}}. Additionally, data with significant contributions to task progress are continuously supplemented during the interaction between the robot and the environment, denoted as \mathcal{D}_{\text{grasp}}=(\boldsymbol{s}_{n},p_{\text{g}_{n}})^{N}_{n=1}, where N is the size of the training dataset, \boldsymbol{s}_{n}=(s_{1}^{(n)},s_{2}^{(n)},\ldots,s_{M}^{(n)}) represents the M-dimensional state variables of the n-th sample, p_{\text{g}_{n}}\in(1,2,\cdots,k) is the index of the reference grasping point for the n-th sample, which is also the label of the training dataset, and k is the number of candidate reference grasping points. The system learns the mapping relationship between the states \boldsymbol{s}_{n} and the corresponding reference grasping points p_{\text{g}_{n}} in the demonstration dataset, enabling it to predict appropriate reference grasping points under different states.

The primary improvement of HTSK over TSK concerns the saturation issue related to the dimensionality of the input data. In TSK fuzzy systems, the traditional softmax function is used to calculate the normalized firing levels \overline{\omega}_{r} of each fuzzy rule, as follows:

\overline{\omega}_{r}=\frac{\exp(H_{r})}{\sum_{r=1}^{R}\exp(H_{r})}. (18)

Here, H_{r} decreases as the dimension M of the input data increases, leading to the saturation of the softmax function [28]. In conventional TSK fuzzy systems, typically only the fuzzy rule with the highest H_{r} receives a non-zero firing level \overline{\omega}_{r}. Consequently, as the data dimension M increases, the distinctiveness of all H_{r} values diminishes, and the classification performance of the TSK fuzzy system declines. To address this saturation issue, HTSK substitutes H_{r} in the normalized firing levels \overline{\omega}_{r} of each fuzzy rule within the TSK system with its mean value H_{r}^{*}, thereby allowing the normalization process to better accommodate high-dimensional data inputs. In the sixth and seventh layers of the HTSK net [29], based on the softmax function and the probability distribution, the final p_{\text{g}_{t}} is selected as the output.

This article adopts the k-means clustering method to initialize c_{r,m}, the center parameter of the Gaussian membership functions; the cross-entropy loss function to measure the difference between the output of the fuzzy system and the true labels; and the Adam optimizer for gradient descent, with a learning rate of 0.04, a batch size of 64, and a weight decay of 1e-8.
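A minimal numerical sketch of HTSK inference for grasp-point selection is given below: the rule activation averages the per-dimension Gaussian log-memberships over the M input dimensions (rather than summing them, as in TSK) before the softmax normalization, and simple per-rule class scores stand in for the consequent layers of the HTSK net; the zeroth-order consequents are our simplification.

import numpy as np

def htsk_select_grasp_point(x, centers, sigmas, consequents):
    """x: (M,) state; centers, sigmas: (R, M) Gaussian membership parameters
    (initialized by k-means in the paper); consequents: (R, k) per-rule class scores."""
    # TSK uses H_r = sum_m -(x_m - c_rm)^2 / (2 sigma_rm^2); HTSK averages over M
    # so the softmax does not saturate as the input dimension grows.
    h_star = np.mean(-(x - centers) ** 2 / (2.0 * sigmas ** 2), axis=1)   # (R,)
    w = np.exp(h_star - h_star.max())
    w_bar = w / w.sum()                        # normalized firing levels, Eq. (18) with H_r*
    scores = w_bar @ consequents               # aggregate rule outputs, one score per endpoint
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                       # class probabilities over the k candidate points
    return int(np.argmax(probs)), probs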

III-A2 GABC-Improved Rainbow-DDPG Algorithm

In Rainbow-DDPG, BC is typically implemented by adding L_{\text{bc}} to the loss function of the Actor network, as shown in (19):

L_{\text{bc}}=\begin{cases}(\mu(\boldsymbol{s}_{i};\boldsymbol{\theta})-\boldsymbol{a}_{i})^{2},&Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i};\boldsymbol{\varphi})>Q(\boldsymbol{s}_{i},\mu(\boldsymbol{s}_{i};\boldsymbol{\theta});\boldsymbol{\varphi})\text{ and }(\boldsymbol{s}_{i},\boldsymbol{a}_{i})\in\mathcal{D}_{\text{demo}}\\ 0,&\text{otherwise}\end{cases}. (19)

The definition of L_{\text{bc}} only takes effect when Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i};\boldsymbol{\varphi})>Q(\boldsymbol{s}_{i},\mu(\boldsymbol{s}_{i};\boldsymbol{\theta});\boldsymbol{\varphi}). However, training the Critic network to output accurate Q-values is a time-consuming process, which renders L_{\text{bc}} ineffective in the early stages of training. Additionally, when the Critic network is fully trained, the replay buffer mainly contains real-time interaction data rather than demonstration data, reducing the probability of sampling demonstration data for training. Therefore, L_{\text{bc}} may not have a significant impact, and RL still requires considerable training time to achieve good policies.

We propose GABC to improve Rainbow-DDPG for generating \boldsymbol{\delta}_{t} and \boldsymbol{p}_{\text{p}_{t}}. We denote the current Actor network as \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}), where \boldsymbol{\theta}^{\mu} represents its parameters. In each state \boldsymbol{s}_{t}, this network combines with environmental noise \mathcal{N}(0,\sigma^{2}) to output \boldsymbol{\delta}_{t} and \boldsymbol{p}_{\text{p}_{t}}, guiding the robot to perform fabric manipulation tasks. The input-output relationship of \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}) is represented as follows:

(\boldsymbol{\delta}_{t},\boldsymbol{p}_{\text{p}_{t}})=\mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu})+\mathcal{N}(0,\sigma^{2}). (20)

The current Critic network is denoted as Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q}), where \boldsymbol{\theta}^{q} represents its parameters. In each state \boldsymbol{s}_{t}, this network outputs a Q-value Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q}) through \boldsymbol{\theta}^{q} to evaluate the quality of action \boldsymbol{a}_{t}. Since the quality of \boldsymbol{\delta}_{t} and \boldsymbol{p}_{\text{p}_{t}} is closely related to the selection of p_{\text{g}_{t}}, the input \boldsymbol{a}_{t} of Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q}) not only includes \boldsymbol{\delta}_{t} and \boldsymbol{p}_{\text{p}_{t}} output by the Actor network but also includes p_{\text{g}_{t}} obtained from demonstration data or from H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}).

Assume that during training the sampled state-action data pairs are (\boldsymbol{s}_{i},\boldsymbol{a}_{i}), where \boldsymbol{a}_{i}=(p_{\text{g}_{i}},\boldsymbol{\delta}_{i},\boldsymbol{p}_{\text{p}_{i}}). This study denotes (\boldsymbol{\delta}_{i},\boldsymbol{p}_{\text{p}_{i}}) as \boldsymbol{b}_{i} and sets \boldsymbol{w}_{i}=(p_{\text{g}_{i}},\mu(\boldsymbol{s}_{i},p_{\text{g}_{i}};\boldsymbol{\theta}^{\mu})). Then, in the framework of this article, L_{\text{bc}} can be redefined as follows:

L_{\text{bc}}=\begin{cases}(\mu(\boldsymbol{s}_{i},p_{\text{g}_{i}};\boldsymbol{\theta}^{\mu})-\boldsymbol{b}_{i})^{2},&Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i};\boldsymbol{\theta}^{q})>Q(\boldsymbol{s}_{i},\boldsymbol{w}_{i};\boldsymbol{\theta}^{q})\text{ and }(\boldsymbol{s}_{i},\boldsymbol{a}_{i})\in\mathcal{D}_{\text{demo}}\\ 0,&\text{otherwise}\end{cases}. (21)
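In implementation terms, (21) amounts to a masked behavior-cloning loss over the sampled batch, as sketched below with PyTorch tensors; the tensor names are ours.

import torch

def gabc_bc_loss(actor_out, demo_b, q_demo, q_actor, is_demo):
    """Eq. (21): actor_out = mu(s_i, p_g_i), demo_b = b_i = (delta_i, p_p_i);
    q_demo = Q(s_i, a_i), q_actor = Q(s_i, w_i); is_demo marks demonstration samples."""
    mask = ((q_demo > q_actor) & is_demo).float()      # BC only where the demo still beats the actor
    per_sample = ((actor_out - demo_b) ** 2).sum(dim=-1)
    return (per_sample * mask).mean()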

To expedite the training of the Critic, GABC introduces a loss term called Q_{\text{diff}} into the current Critic network’s loss function. Assuming (\boldsymbol{s}_{i},\boldsymbol{a}_{i})\in\mathcal{D}_{\text{demo}}, and based on the fact that the offset vector \boldsymbol{\delta}_{i} and the placement point \boldsymbol{p}_{\text{p}_{i}} in the human demonstration action \boldsymbol{a}_{i} are significantly superior to the offset vector and placement point output by the current Actor network \mu(\boldsymbol{s}_{i},p_{\text{g}_{i}};\boldsymbol{\theta}^{\mu}), Q_{\text{diff}} guides the training of the Critic network by measuring, under the same state \boldsymbol{s}_{i}, the difference between the Q-value output by the current Critic network for the demonstrated action \boldsymbol{a}_{i} and the Q-value output for the action \boldsymbol{w}_{i}=(p_{\text{g}_{i}},\mu(\boldsymbol{s}_{i},p_{\text{g}_{i}};\boldsymbol{\theta}^{\mu})) produced by the current Actor network. Specifically, this article sets a pre-training stage, during which, when Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i};\boldsymbol{\theta}^{q})-Q(\boldsymbol{s}_{i},\boldsymbol{w}_{i};\boldsymbol{\theta}^{q})\geq 100, Q_{\text{diff}} is set to 0; otherwise, Q_{\text{diff}}=100-(Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i};\boldsymbol{\theta}^{q})-Q(\boldsymbol{s}_{i},\boldsymbol{w}_{i};\boldsymbol{\theta}^{q})), as shown in (22):

Q_{\text{diff}}=\max(0,100-(Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i};\boldsymbol{\theta}^{q})-Q(\boldsymbol{s}_{i},\boldsymbol{w}_{i};\boldsymbol{\theta}^{q}))). (22)

After pre-training is completed, the Actor network has already acquired a certain policy, and its output actions \mu(\boldsymbol{s}_{i},p_{\text{g}_{i}};\boldsymbol{\theta}^{\mu}) may no longer be significantly inferior to the actions \boldsymbol{a}_{i} in the human demonstration data. Retaining Q_{\text{diff}} could then cause the training of the Actor network to get stuck in local optima, so Q_{\text{diff}} is removed.

The loss function L_{\text{Critic}} for the improved Critic network Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q}) is defined as follows:

L_{\text{Critic}}=\lambda_{\text{1step}}L_{\text{1s}}+\lambda_{n\text{step}}L_{n\text{s}}+\lambda_{\text{diff}}Q_{\text{diff}}, (23)

where \lambda_{\text{1step}} and \lambda_{n\text{step}} are the weights of the 1-step and n-step TD loss functions, respectively, and \lambda_{\text{diff}} is the weight of Q_{\text{diff}}, set to 1 during the pre-training phase and 0 in subsequent phases. L_{\text{1s}} and L_{n\text{s}} are similar to the 1-step and n-step TD loss functions [21], and the target functions y_{\text{1s}} and y_{n\text{s}} follow the TD3 model [32]. Here, Q^{\prime}_{1}, Q^{\prime}_{2}, and \mu^{\prime} denote the two target Critic networks and the target Actor network, with \boldsymbol{w}^{\prime}_{i+k}=(H(\boldsymbol{s}_{i+k};\boldsymbol{\theta}^{h}),\mu^{\prime}(\boldsymbol{s}_{i+k},H(\boldsymbol{s}_{i+k};\boldsymbol{\theta}^{h});\boldsymbol{\theta}^{\mu^{\prime}})),k=1,2,\cdots,n, where n is the step length of the n-step TD loss function, N is the batch size, r_{i} is the reward, and \gamma is the discount factor. Consequently, L_{\text{1s}} and L_{n\text{s}} are defined as follows:

L_{\text{1s}}=\frac{1}{N}\sum_{i=1}^{N}(Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i};\boldsymbol{\theta}^{q})-y_{\text{1s}})^{2}, (24)
y_{\text{1s}}=r_{i}+\gamma\min_{j=1,2}Q^{\prime}_{j}(\boldsymbol{s}_{i+1},\boldsymbol{w}^{\prime}_{i+1};\boldsymbol{\theta}^{q^{\prime}}), (25)
L_{n\text{s}}=\frac{1}{N}\sum_{i=1}^{N}(Q(\boldsymbol{s}_{i},\boldsymbol{a}_{i};\boldsymbol{\theta}^{q})-y_{n\text{s}})^{2}, (26)
y_{n\text{s}}=\sum_{t=0}^{n-1}\gamma^{t}r_{i+t+1}+\gamma^{n}\min_{j=1,2}Q^{\prime}_{j}(\boldsymbol{s}_{i+n},\boldsymbol{w}^{\prime}_{i+n};\boldsymbol{\theta}^{q^{\prime}}). (27)

During training, the introduction of Q_{\text{diff}} causes the current Critic network to initially assign larger Q-values to the actions \boldsymbol{a}_{i} from human demonstrations; as a result, the effect of L_{\text{bc}} in (21) becomes more pronounced, leading to a more thorough utilization of human demonstrations and thus accelerating the training of the Actor network. Additionally, the training of the current Actor network \mu(\boldsymbol{s}_{i},p_{\text{g}_{i}};\boldsymbol{\theta}^{\mu}) also depends on the Q-values output by the current Critic network Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q}). Consequently, a current Critic network capable of providing more precise Q-values will further accelerate the training of the current Actor network.
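Putting (22)-(27) together, the critic update can be sketched as below, again with PyTorch tensors; q_sa and q_sw are the critic's values for the sampled action and for the actor's action under the same grasp point, y_1s and y_ns are the TD targets built with the minimum over the two target critics, and the loss weights are placeholders.

import torch

def td_targets(r, r_n, q1_next, q2_next, q1_n, q2_n, gamma, n):
    """Eqs. (25) and (27): r_n is the discounted sum of the next n rewards."""
    y_1s = r + gamma * torch.min(q1_next, q2_next)
    y_ns = r_n + gamma ** n * torch.min(q1_n, q2_n)
    return y_1s, y_ns

def critic_loss(q_sa, q_sw, y_1s, y_ns, lam_1s=1.0, lam_ns=0.5, pretraining=True):
    """Eq. (23) with the Q_diff term of Eq. (22), averaged over the batch."""
    l_1s = torch.mean((q_sa - y_1s.detach()) ** 2)          # Eq. (24)
    l_ns = torch.mean((q_sa - y_ns.detach()) ** 2)          # Eq. (26)
    q_diff = torch.clamp(100.0 - (q_sa - q_sw), min=0.0).mean()
    lam_diff = 1.0 if pretraining else 0.0                   # Q_diff active only during pre-training
    return lam_1s * l_1s + lam_ns * l_ns + lam_diff * q_diff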

III-A3 CPL Learning Method

This article adopts the CPL training method illustrated in Fig. 5. At each state \boldsymbol{s}_{t}, CPL first utilizes H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}) to select p_{\text{g}_{t}}, and then, based on p_{\text{g}_{t}}, selects \boldsymbol{\delta}_{t} and \boldsymbol{p}_{\text{p}_{t}} through \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}).

Figure 5: Illustration of Conditional Policy Learning.

CPL often faces the challenge of loss allocation [30]: it is difficult to determine whether the high reward obtained for an action is attributable to the grasping policy H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}) or to the offset vector and placement point selection policy \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}). To address this, the following training approach is adopted. Firstly, each state-action pair (\boldsymbol{s}_{t},\boldsymbol{a}_{t}) is extracted from \mathcal{D}_{\text{demo}} to form an initial training dataset \mathcal{D}_{\text{grasp}} tailored for the HTSK fuzzy system. Then, \mathcal{D}_{\text{grasp}} is used to train H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}) to obtain \boldsymbol{\theta}^{h}. Subsequently, with \boldsymbol{\theta}^{h} fixed and p_{\text{g}_{t}} selected by H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}), the GABC-improved Rainbow-DDPG algorithm is employed to train the offset vector and placement point selection policy \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}), obtaining \boldsymbol{\theta}^{\mu}.

During this process, data is also collected to supplement the training dataset \mathcal{D}_{\text{grasp}} and continuously train the grasping policy H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}). Specifically, when the action \boldsymbol{a}_{t} executed by the robot at time t significantly advances the progress of the task (in folding tasks, e_{t-1}-e_{t}>t_{\text{z}}; in the flattening task, f_{t}-f_{t-1}>t_{\text{z}}), (\boldsymbol{s}_{t},\boldsymbol{a}_{t}) is added to \mathcal{D}_{\text{grasp}}. After a certain amount of new data has been collected, \mathcal{D}_{\text{grasp}} is used to retrain the grasping policy H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}) and obtain new parameters \boldsymbol{\theta}^{h}.

Combining all the improvements introduced above, we propose the HGCR-DDPG algorithm. The relevant pseudocode is detailed in Algorithm 1.

Algorithm 1 HGCR-DDPG
0:  Input: Demonstration dataset \mathcal{D}_{\text{demo}}, total number of rounds M, number of pre-training rounds M_{p}, maximum number of operations per round t_{\text{m}}, number of strategy updates per interaction t_{\text{n}}, batch size N, random environmental noise \mathcal{N}.
0:  Output: Trained HTSK fuzzy system H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}), Critic network Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q}), and offset vector & placement point selection strategy network \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}).
1:  Extract (\boldsymbol{s}_{i},\boldsymbol{a}_{i}) from \mathcal{D}_{\text{demo}} to form \mathcal{D}_{\text{grasp}}. Add \mathcal{D}_{\text{demo}} to the replay buffer R. Initialize H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}), Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q}), \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}), Q^{\prime}_{1}(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q^{\prime}}), Q^{\prime}_{2}(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q^{\prime}}), \mu^{\prime}(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu^{\prime}}).
2:  while H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}) has not converged do
3:     Use \mathcal{D}_{\text{grasp}} as the training dataset, and update the parameters \boldsymbol{\theta}^{h} of H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}) according to the method described in Section III-A1.
4:  end while
5:  for e=1,M do
6:     Initialize the environment and receive the initial observation state \boldsymbol{s}_{1}.
7:     for t=1,t_{\text{m}} do
8:        Referring to Fig. 5, generate \boldsymbol{a}_{t} by combining H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}), \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}), and noise \mathcal{N}.
9:        Execute action \boldsymbol{a}_{t} to interact with the environment, and add (\boldsymbol{s}_{t},\boldsymbol{a}_{t},r_{t},\boldsymbol{s}_{t+1}) to the replay buffer R.
10:        for b=1,t_{\text{n}} do
11:           if e\leq M_{p} then
12:              Set \lambda_{\text{diff}} to 1, and sample N data points (\boldsymbol{s}_{i},\boldsymbol{a}_{i},r_{i},\boldsymbol{s}_{i+1}) from \mathcal{D}_{\text{demo}}.
13:           else
14:              Set \lambda_{\text{diff}} to 0, and sample N data points (\boldsymbol{s}_{i},\boldsymbol{a}_{i},r_{i},\boldsymbol{s}_{i+1}) from R.
15:           end if
16:           Update Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q}) by minimizing the loss L_{\text{Critic}} defined in (23).
17:           Update \mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu}) using the method in Rainbow-DDPG.
18:           If \boldsymbol{a}_{i} significantly advances the task, add (\boldsymbol{s}_{i},\boldsymbol{a}_{i}) to \mathcal{D}_{\text{grasp}}. If more than 50 new data points have been added to \mathcal{D}_{\text{grasp}}, retrain H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h}).
19:        end for
20:        Soft-update the parameters of the target networks Q^{\prime}(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q^{\prime}}) and \mu^{\prime}(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu^{\prime}}).
21:     end for
22:  end for

III-B Low-Cost Demonstration Collection Based on NMPC

In this article, the objective of NMPC control is to find an optimal control sequence \boldsymbol{U}_{list}^{*}=(\boldsymbol{U}_{t}^{*},\boldsymbol{U}_{t+1}^{*},\cdots,\boldsymbol{U}_{t+H_{p}-1}^{*}) within the prediction horizon H_{p} that minimizes the objective function J. The system’s state consists of the coordinates of each particle, \boldsymbol{X}_{t}=(\boldsymbol{x}_{t}^{1},\boldsymbol{x}_{t}^{2},\cdots,\boldsymbol{x}_{t}^{N_{\text{sp}}}), and the control inputs are the external forces applied to each particle, \boldsymbol{U}_{t}=(\boldsymbol{u}_{t}^{1},\boldsymbol{u}_{t}^{2},\cdots,\boldsymbol{u}_{t}^{N_{\text{sp}}}), where \boldsymbol{u}_{t}^{i}=(u_{t,x}^{i},u_{t,y}^{i},u_{t,z}^{i}) and u_{t,x}^{i},u_{t,y}^{i},u_{t,z}^{i} are the components of the external force in the x,y,z directions, respectively. Consequently, the state transition equation for the spring-mass model can be expressed as:

\boldsymbol{X}_{t+1}=\boldsymbol{f}(\boldsymbol{X}_{t},\boldsymbol{U}_{t})=\boldsymbol{X}_{t}+\Delta t\cdot\boldsymbol{V}_{t}+\frac{1}{2m}\Delta t^{2}\cdot(\boldsymbol{S}_{t}+\boldsymbol{G}+\boldsymbol{U}_{t}+\boldsymbol{D}_{t}), (28)

where \boldsymbol{V}_{t}=(\boldsymbol{v}_{t}^{1},\boldsymbol{v}_{t}^{2},\cdots,\boldsymbol{v}_{t}^{N_{\text{sp}}}) is the velocity of each particle, \boldsymbol{S}_{t}=(\boldsymbol{s}_{t}^{1},\boldsymbol{s}_{t}^{2},\cdots,\boldsymbol{s}_{t}^{N_{\text{sp}}}) is the spring force applied to each particle, \boldsymbol{G} is the gravity acting on each particle, \boldsymbol{D}_{t}=(\boldsymbol{d}_{t}^{1},\boldsymbol{d}_{t}^{2},\cdots,\boldsymbol{d}_{t}^{N_{\text{sp}}}) is the damping force applied to each particle, and m is the mass of each particle.

The design of the loss function is based on the distances between particles. Specifically, each of the three task objectives is specified by the distances l^{i,j}_{t} between particles at time t. Taking the particle ordering in Fig. 4 as an example, the target state \boldsymbol{X}_{\text{ref}} is defined as follows:

1. Folding along the diagonal: For any pair of particles i and j symmetrically positioned about the specified diagonal, in the target state, the distance l^{i,j}_{t} between them should satisfy:

l^{i,j}_{t}=0,\text{ such as }l^{1,36}_{t}=l^{8,29}_{t}=\ldots=0. (29)

2. Folding along the central axis: For any pair of particles i and j symmetrically positioned about the specified central axis, in the target state, the distance l^{i,j}_{t} between them should satisfy:

l^{i,j}_{t}=0,\text{ such as }l^{1,6}_{t}=l^{8,11}_{t}=\ldots=0. (30)

3. Flattening: For the two pairs of particles (i,j) and (a,b) at the ends of the two diagonals of the cloth, in the target state, the distances between them should satisfy:

l^{i,j}_{t}=l^{a,b}_{t}=\sqrt{2}l_{s},\text{ such as }l^{1,36}_{t}=l^{6,31}_{t}=\sqrt{2}l_{s}, (31)

where l_{s} is the side length of the cloth in the fully flattened state.

For a given task objective, the loss function L(\boldsymbol{X}_{t}) can be defined as:

L(\boldsymbol{X}_{t})=\sum_{i,j\in\mathcal{P}}w_{i,j}(l^{i,j}_{t}-l^{i,j}_{\text{ref}})^{2}, (32)

where \mathcal{P} is the set of all pairs of particles to be considered, l^{i,j}_{\text{ref}} is the desired distance between particles i and j in the target state \boldsymbol{X}_{\text{ref}}, and w_{i,j} is the weight factor used to adjust the relative importance of the differences in distances between different pairs of particles. For a specific task objective, the value of l^{i,j}_{\text{ref}} can be chosen based on (29) to (31).
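The stage loss (32) reduces to a weighted sum of squared distance errors over selected particle pairs, as in the sketch below; the example pairs encode the diagonal-folding targets of (29) for the 6 x 6 mesh of Fig. 4(b) (0-based indices).

import numpy as np

def nmpc_stage_loss(x, pairs, l_ref, weights):
    """Eq. (32): x is the (N_sp, 3) array of particle positions."""
    loss = 0.0
    for (i, j), l_target, w in zip(pairs, l_ref, weights):
        loss += w * (np.linalg.norm(x[i] - x[j]) - l_target) ** 2
    return loss

# Diagonal folding targets from Eq. (29): mirrored particles should coincide.
pairs_diag = [(0, 35), (7, 28)]      # particles 1 & 36 and 8 & 29, 0-based
l_ref_diag = [0.0, 0.0]
w_diag = [1.0, 1.0]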

The objective of NMPC is to minimize the cumulative loss within the prediction horizon. (28) is used to predict future states based on the initial state and control inputs:

\boldsymbol{X}_{k+1|t}=\boldsymbol{f}(\boldsymbol{X}_{k|t},\boldsymbol{U}_{k}),\quad k=t,\ldots,t+H_{p}-1, (33)

where \boldsymbol{X}_{k|t} and \boldsymbol{X}_{k+1|t} represent the states at time steps k and k+1 predicted based on the current state \boldsymbol{X}_{t} and a series of control inputs \boldsymbol{U}_{list}. Specifically, when k=t, \boldsymbol{X}_{k|t}=\boldsymbol{X}_{t|t}=\boldsymbol{X}_{t}.

In summary, the NMPC in this article can be implemented by solving the following optimization problem in (34):

\begin{aligned}
\underset{\boldsymbol{U}_{list}}{\text{minimize}}\quad & J(\boldsymbol{X}_{t},\boldsymbol{U}_{list})=\sum_{k=t+1}^{t+H_{p}}L(\boldsymbol{X}_{k|t})=\sum_{k=t+1}^{t+H_{p}}\sum_{i,j\in\mathcal{P}}w_{i,j}(l^{i,j}_{k|t}-l^{i,j}_{\text{ref}})^{2}\\
\text{subject to}\quad & \boldsymbol{X}_{k+1|t}=\boldsymbol{f}(\boldsymbol{X}_{k|t},\boldsymbol{U}_{k}),\quad k=t,\ldots,t+H_{p}-1,\\
& \boldsymbol{X}_{t|t}=\boldsymbol{X}_{t},\\
& -10\,\text{N}\leq u_{k,x}^{i},u_{k,y}^{i},u_{k,z}^{i}\leq 10\,\text{N},\quad (i=1,\ldots,N_{\text{sp}},\;k=t,\ldots,t+H_{p}-1),\\
& \boldsymbol{x}_{k|t}^{i}\in\mathcal{X}_{\text{workspace}},\quad (i=1,\ldots,N_{\text{sp}},\;k=t+1,\ldots,t+H_{p}),
\end{aligned} (34)

where J(\boldsymbol{X}_{t},\boldsymbol{U}_{list}) represents the cumulative loss function over the entire prediction horizon H_{p}, the term l^{i,j}_{k|t} denotes the distance between the i-th and j-th particles calculated from \boldsymbol{X}_{k|t} at the k-th time step, and \mathcal{X}_{\text{workspace}} denotes the set of state constraints imposed by the robot’s workspace. We utilized the Interior Point OPTimizer (Ipopt), which is based on interior-point methods for nonlinear programming [33]. By solving the optimization problem described above, we can obtain the optimal control sequence \boldsymbol{U}_{list}^{*} that minimizes the loss function within the prediction horizon.
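The listing below is a compact single-shooting sketch of (34) using CasADi's interface to Ipopt; CasADi, the mesh size, the physical parameters, and the folding target are our assumptions (the paper only specifies Ipopt), and the workspace constraint on \mathcal{X}_{\text{workspace}} is omitted for brevity.

import casadi as ca
import numpy as np

n, Hp, dt = 4, 5, 0.05                       # illustrative mesh size and horizon
Nsp, m, k, l0, c, g = n * n, 0.01, 40.0, 0.08, 0.6, 9.81
springs = [(r * n + j, r * n + j + 1) for r in range(n) for j in range(n - 1)] \
        + [(r * n + j, (r + 1) * n + j) for r in range(n - 1) for j in range(n)]
pairs, l_ref, w_ref = [(0, Nsp - 1)], [0.0], [1.0]   # e.g. fold: opposite corners coincide

def step(X, V, U):
    """One spring-mass step (Eqs. 14-16) written with CasADi expressions; X, V, U are 3 x Nsp."""
    F = [U[:, i] + ca.DM([0.0, 0.0, -m * g]) for i in range(Nsp)]
    for i, j in springs:
        d = X[:, i] - X[:, j]
        L = ca.norm_2(d)
        s = k * (L - l0) * d / L             # restoring spring force
        F[i] -= s; F[j] += s
        dmp = -c * (V[:, i] - V[:, j])       # damping
        F[i] += dmp; F[j] -= dmp
    A = ca.horzcat(*F) / m
    return X + dt * V + 0.5 * dt ** 2 * A, V + dt * A

def stage_loss(X):
    """Eq. (32) evaluated on a predicted state."""
    return sum(w * (ca.norm_2(X[:, i] - X[:, j]) - l) ** 2
               for (i, j), l, w in zip(pairs, l_ref, w_ref))

X0 = np.array([[(i % n) * l0, (i // n) * l0, 0.0] for i in range(Nsp)]).T  # flat cloth
U = ca.MX.sym('U', 3 * Nsp * Hp)             # stacked controls over the horizon
X, V, J = ca.DM(X0), ca.DM.zeros(3, Nsp), 0
for t in range(Hp):
    Ut = ca.reshape(U[t * 3 * Nsp:(t + 1) * 3 * Nsp], 3, Nsp)
    X, V = step(X, V, Ut)
    J = J + stage_loss(X)                    # cumulative loss over the horizon, Eq. (34)

solver = ca.nlpsol('nmpc', 'ipopt', {'x': U, 'f': J})
bound = 10.0 * np.ones(3 * Nsp * Hp)         # |u| <= 10 N force bounds
sol = solver(x0=np.zeros(3 * Nsp * Hp), lbx=-bound, ubx=bound)
U_star = np.array(sol['x'][:3 * Nsp]).reshape(3, Nsp, order='F')   # first control U_t*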

The optimal control input $\boldsymbol{U}_{t}^{*}$ obtained from solving the NMPC problem defines the ideal external forces applied to each particle of the system at time $t$. However, the robot can only apply force to a single particle at any given time. Therefore, to translate NMPC into a practically executable robot control strategy, $\boldsymbol{U}_{t}^{*}$ must be mapped to the robot's action space. The specific process is illustrated in Fig. 6 and detailed below.

Figure 6: Process for Generating the Robot's Motion Space Based on NMPC.

Firstly, the optimal control input $\boldsymbol{U}_{t}^{*}$ is applied to the dynamic equations of the spring-mass particle model to predict the system's state $\boldsymbol{X}_{t+1}^{*}$ at the next time step:

\boldsymbol{X}_{t+1}^{*}=\boldsymbol{f}(\boldsymbol{X}_{t},\boldsymbol{U}_{t}^{*}). \qquad (35)

Next, utilizing the PyBullet simulation environment, we obtain the current state vector of the system and extract from it the coordinates of the model's endpoints, denoted as $\boldsymbol{P}_{\text{vertex}}=(\boldsymbol{p}_{\text{vertex}}^{1},\boldsymbol{p}_{\text{vertex}}^{2},\cdots,\boldsymbol{p}_{\text{vertex}}^{k})$, where $k$ is the number of endpoints in the state vector.

Subsequently, for each endpoint $\boldsymbol{p}_{\text{vertex}}^{i}$ we identify the 10 nearest particles together with their neighboring particles to form the set $\boldsymbol{p}_{\text{particle}}^{i}$. Over all endpoints, this yields the broader set $\boldsymbol{P}_{\text{particle}}=(\boldsymbol{p}_{\text{particle}}^{1},\cdots,\boldsymbol{p}_{\text{particle}}^{k})$.

By analyzing $\boldsymbol{U}_{t}^{*}$, we identify the maximum external force $\boldsymbol{u}_{t}^{*,i_{\text{max}}}$ among all particles in $\boldsymbol{P}_{\text{particle}}$, and take the index $p_{\text{g}_{t}}$ of the endpoint nearest to particle $i_{\text{max}}$ in the state vector $\boldsymbol{s}_{t}$ as the reference grasping point for the robot. The displacement between particle $i_{\text{max}}$ and endpoint $p_{\text{g}_{t}}$ gives the grasping offset vector $\boldsymbol{\delta}_{t}$.

Finally, based on the predicted state $\boldsymbol{X}_{t+1}^{*}$, we take the ideal position $\boldsymbol{x}_{t+1}^{*,i_{\text{max}}}$ of particle $i_{\text{max}}$ at the next time step as the placement point coordinate $\boldsymbol{p}_{\text{p}_{t}}$, completing the conversion from theoretical control quantities to actual robot motion commands.

In summary, the optimal control input $\boldsymbol{U}_{t}^{*}$ computed by NMPC is transformed into the action space as:

\boldsymbol{a}_{t}=\bigl(p_{\text{g}_{t}},\ (\boldsymbol{x}_{t}^{i_{\text{max}}}-\boldsymbol{p}_{\text{vertex}}^{p_{\text{g}_{t}}}),\ \boldsymbol{x}_{t+1}^{*,i_{\text{max}}}\bigr). \qquad (36)

Additionally, because the spring-mass particle model is subject to simulation error, we use the PyBullet simulation environment to correct the model at each time step, as illustrated in Fig. S3. After obtaining $\boldsymbol{a}_{t}$ from NMPC as in (36), it is executed in PyBullet to obtain the new state of the cloth, which is then used as the initial state $\boldsymbol{X}_{t+1}$ for the next control cycle, thereby keeping the cloth state accurately updated.
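The mapping from $\boldsymbol{U}_{t}^{*}$ to $\boldsymbol{a}_{t}$ in (36) can be summarized in a few lines of array code. The sketch below is an illustration only; the array shapes and the `neighbors` lookup are assumptions, not the exact data structures used in this work.

# Illustrative sketch of the U*_t -> a_t mapping in (36).
# Shapes and the `neighbors` lookup are assumed for illustration.
import numpy as np

def nmpc_to_action(U_t_star, X_t, X_t1_star, vertices, neighbors):
    """U_t_star: (N, 3) forces; X_t, X_t1_star: (N, 3) particle positions;
    vertices: (k, 3) endpoint coordinates; neighbors: dict particle -> neighbor ids."""
    # 1. Candidate particles: the 10 particles nearest each endpoint, plus their neighbors.
    candidates = set()
    for v in vertices:
        nearest = np.argsort(np.linalg.norm(X_t - v, axis=1))[:10]
        for p in nearest:
            candidates.add(int(p))
            candidates.update(neighbors.get(int(p), []))

    # 2. Particle in the candidate set with the largest commanded force -> i_max.
    cand = np.array(sorted(candidates))
    i_max = int(cand[np.argmax(np.linalg.norm(U_t_star[cand], axis=1))])

    # 3. Reference grasping point: endpoint nearest to particle i_max; offset delta_t.
    p_g = int(np.argmin(np.linalg.norm(vertices - X_t[i_max], axis=1)))
    delta_t = X_t[i_max] - vertices[p_g]

    # 4. Placement point: predicted next position of particle i_max.
    p_place = X_t1_star[i_max]
    return p_g, delta_t, p_place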

We use $dn$ to denote whether a demonstration episode has ended. The termination condition for an episode is $|e_{t}-e_{t-1}|\leq 0.01$ for folding tasks or $|f_{t}-f_{t-1}|\leq 0.01$ for flattening tasks. We collect episodes with relatively high rewards as demonstration data, yielding the NMPC demonstration dataset $\mathcal{D}_{\text{NMPC}}$, as shown in Algorithm 2.

Algorithm 2 NMPC Demonstration Data Collection
Input: System model $\boldsymbol{f}(\boldsymbol{X}_{t},\boldsymbol{U}_{t})$, initial state $\boldsymbol{X}_{0}$, prediction horizon $H_{p}$, cost function $J(\boldsymbol{X}_{t},\boldsymbol{U}_{list})$, reward threshold $r_{\text{ts}}$ for each task.
Output: NMPC demonstration dataset $\mathcal{D}_{\text{NMPC}}$.
1:  Initialize temporary dataset $\mathcal{D}_{\text{temp}}$, set $dn$ to False.
2:  while sufficient demonstration data has not been collected do
3:     Formulate the NMPC optimization problem according to (34).
4:     while not $dn$ do
5:        Solve the optimization problem using Ipopt to obtain the control sequence $\boldsymbol{U}_{list}^{*}=(\boldsymbol{U}_{t}^{*},\boldsymbol{U}_{t+1}^{*},\cdots,\boldsymbol{U}_{t+H_{p}-1}^{*})$.
6:        Extract $\boldsymbol{U}_{t}^{*}$ and generate the action vector $\boldsymbol{a}_{t}$ according to (36).
7:        Apply $\boldsymbol{a}_{t}$ to the PyBullet simulation environment to control the robot and update the fabric state.
8:        Retrieve the new fabric state $\boldsymbol{X}_{t+1}$, as well as the state vector $\boldsymbol{s}_{t+1}$, reward $r_{t}$, and done flag $dn$ from the PyBullet simulation environment.
9:        Store $(\boldsymbol{s}_{t},\boldsymbol{a}_{t},r_{t},\boldsymbol{s}_{t+1})$ in $\mathcal{D}_{\text{temp}}$.
10:       Set $\boldsymbol{X}_{t+1}$ as the initial state for the next control cycle.
11:    end while
12:    if $r_{t}\geq r_{\text{ts}}$ then
13:       Add $\mathcal{D}_{\text{temp}}$ to $\mathcal{D}_{\text{NMPC}}$.
14:    end if
15:    Clear $\mathcal{D}_{\text{temp}}$ and reset the simulation environment for the next control task.
16: end while
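To make the collection protocol concrete, the following is a minimal Python sketch of the loop in Algorithm 2. The environment methods (`reset`, `step`, `vertices`, `neighbors`) and `solve_nmpc` are hypothetical placeholders standing in for the PyBullet setup and the Ipopt-based solver sketched above; they are not APIs defined by this paper or by PyBullet itself.

# Sketch of the collection loop in Algorithm 2. `env` and `solve_nmpc` are
# hypothetical wrappers, not APIs from the paper or from PyBullet.
def collect_nmpc_demos(env, solve_nmpc, nmpc_to_action, r_ts, n_episodes=100):
    demos = []                                   # D_NMPC
    for _ in range(n_episodes):
        temp, done, r = [], False, 0.0           # D_temp, dn, last reward
        s, X = env.reset()                       # state vector s_t and cloth state X_t
        while not done:
            U_list_star, X_pred = solve_nmpc(X)  # solve (34) with Ipopt
            a = nmpc_to_action(U_list_star[0], X, X_pred,
                               env.vertices(), env.neighbors())
            s_next, X, r, done = env.step(a)     # execute a_t in PyBullet
            temp.append((s, a, r, s_next))
            s = s_next
        if r >= r_ts:                            # keep only high-reward episodes
            demos.extend(temp)
    return demos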

IV Experimental Setup

IV-A Simulation Experiment Settings of the Improved HGCR-DDPG Algorithm with Human Demonstrations

As shown in Fig. S4, in simulation, a human guides the robot to perform grasping and placing actions by clicking the grasp point $\boldsymbol{p}_{\text{gr}_{t}}$ and the placement point $\boldsymbol{p}_{\text{p}_{t}}$ with a mouse. During this process, we identify the endpoint closest to $\boldsymbol{p}_{\text{gr}_{t}}$ in state $\boldsymbol{s}_{t}$, denote its coordinates as $\boldsymbol{p}_{\text{g}_{t}}$ and its index as $p_{\text{g}_{t}}$, and calculate the offset vector $\boldsymbol{\delta}_{t}=\boldsymbol{p}_{\text{gr}_{t}}-\boldsymbol{p}_{\text{g}_{t}}$. We then obtain the action $\boldsymbol{a}_{t}=(p_{\text{g}_{t}},\boldsymbol{\delta}_{t},\boldsymbol{p}_{\text{p}_{t}})$ and organize it, together with the state, reward, and other information, into a tuple $(\boldsymbol{s}_{t},\boldsymbol{a}_{t},r_{t},\boldsymbol{s}_{t+1})$. At the end of each demonstration, all tuples of the round are stored in a dedicated dataset $\mathcal{D}_{\text{demo}}$ to assist in training the DRL agent. During data collection, the episode end condition is $e_{t}\leq 0.1$ for both folding tasks and $f_{t}\geq 0.9$ for the flattening task. A minimal sketch of this conversion from mouse clicks to an action tuple is given below.
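The sketch below illustrates the click-to-action conversion just described; the array shapes are assumptions made for illustration.

# Sketch of turning a human click pair into the action tuple a_t = (p_g, delta, p_place).
import numpy as np

def clicks_to_action(p_click_grasp, p_click_place, endpoints):
    """p_click_*: (3,) clicked points; endpoints: (k, 3) endpoint coordinates in s_t."""
    p_g = int(np.argmin(np.linalg.norm(endpoints - p_click_grasp, axis=1)))  # nearest endpoint index
    delta = p_click_grasp - endpoints[p_g]                                   # grasping offset delta_t
    return p_g, delta, p_click_place                                         # placement point p_p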

Table I presents the analysis of the human demonstration datasets for the three tasks (diagonal folding, axial folding, and flattening). For the diagonal folding task, the demonstrators achieve the highest average reward (93.662), the most stable performance (a reward standard deviation of only 3.105), and the fewest steps (1.000) to complete the task. For the axial folding task, the average reward is 87.741, the reward standard deviation is 3.624, and the average number of steps needed to complete the task is 2.868. These metrics indicate that although the quality of task execution remains relatively high, consistency and efficiency decrease slightly compared to diagonal folding. In the flattening task, although the average reward (87.688) is comparable to that of axial folding, the reward standard deviation increases markedly to 5.572 and the number of steps required surges to 8.291, reflecting the higher complexity of the flattening task.

TABLE I: Human Demonstration Dataset
Task | Average Reward | Reward Standard Deviation | Average Steps
Folding Along the Diagonal | 93.662 | 3.105 | 1.000
Folding Along the Central Axis | 87.741 | 3.624 | 2.868
Flattening | 87.688 | 5.572 | 8.291

Human demonstration data was collected to construct $\mathcal{D}_{\text{demo}}$, which was used to enhance the training of HGCR-DDPG. This section systematically evaluates the effectiveness of the three key technical improvements (HTSK, GABC, CPL) introduced in the HGCR-DDPG algorithm and their contributions to model performance through comparative experiments with the following algorithms:

1. Rainbow-DDPG: As the baseline model for the experiment, Rainbow-DDPG uses the same neural network structure to implement $\mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu})$, $Q(\boldsymbol{s}_{t},\boldsymbol{a}_{t};\boldsymbol{\theta}^{q})$, and the target networks $\mu^{\prime}$, $Q^{\prime}_{1}$, and $Q^{\prime}_{2}$. These networks consist of an input layer, 50 hidden fully connected layers (each with 16 neurons), and an output layer. In this structure, the loss function of $\mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu})$ is defined as in (21).

The original Rainbow-DDPG does not use the HTSK fuzzy system but instead employs a neural network as $H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h})$. During training, a cross-entropy loss is used as the BC loss $L_{\text{bc}}$ for $H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h})$, defined as follows:

L_{\text{bc}}=\begin{cases}-\sum_{p=1}^{k}\delta(p,p_{\text{g}_{i}})\log(H_{p}),&(\boldsymbol{s}_{i},\boldsymbol{a}_{i})\in\mathcal{D}_{\text{grasp}}\\ 0,&\text{otherwise},\end{cases} \qquad (37)

where $k$ is the number of candidate endpoints, $\delta(p,p_{\text{g}_{i}})$ is the Kronecker delta, which equals 1 when $p=p_{\text{g}_{i}}$ and 0 otherwise, and $H_{p}$ is the probability of the $p$-th endpoint in the output distribution $H(\boldsymbol{s}_{i};\boldsymbol{\theta}^{h})$. It is assumed that the grasp point selected from $\mathcal{D}_{\text{grasp}}$ is consistently superior to the one proposed by $H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h})$. The cross-entropy loss is applied whenever the sampled data $(\boldsymbol{s}_{i},\boldsymbol{a}_{i})$ comes from $\mathcal{D}_{\text{grasp}}$; otherwise, $L_{\text{bc}}$ is set to 0. Note that the training strategy of Rainbow-DDPG does not include CPL: $H(\boldsymbol{s}_{t};\boldsymbol{\theta}^{h})$ is trained in the same way as the current Actor, and its output is not used as part of the input of $\mu(\boldsymbol{s}_{t},p_{\text{g}_{t}};\boldsymbol{\theta}^{\mu})$. A sketch of this BC loss is given after the list of comparative algorithms below.

2. Rainbow-DDPG + GABC: Integrates only the GABC into Rainbow-DDPG.

3. Rainbow-DDPG + CPL + HTSK: Incorporates the CPL into Rainbow-DDPG, using HTSK to select the grasp point.

4. Rainbow-DDPG + CPL + Random: Integrates the CPL into Rainbow-DDPG, employing a random sampling strategy to select the grasp point.

5. Rainbow-DDPG + CPL + Uniform: Combines the CPL with Rainbow-DDPG, utilizing a uniform sampling strategy to select the grasp point.

6. Rainbow-DDPG + GABC + CPL + Random: Merges both GABC and CPL into Rainbow-DDPG, employing a random sampling strategy to select the grasp point.

7. Rainbow-DDPG + GABC + CPL + Uniform: Extends Rainbow-DDPG by incorporating both GABC and CPL, selecting the grasp point using a uniform sampling strategy.
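For concreteness, the masked cross-entropy BC loss in (37) could be computed as in the following sketch; PyTorch and the tensor layout are assumptions for illustration, not the paper's exact code.

# Sketch of the behavior-cloning loss in (37): cross-entropy on the grasp-point
# head, applied only to samples drawn from the demonstration grasp set.
import torch
import torch.nn.functional as F

def bc_loss(h_logits, pg_demo, from_demo):
    """h_logits: (B, k) endpoint scores from H(s; theta_h);
    pg_demo: (B,) demonstrated grasp-point indices;
    from_demo: (B,) bool mask, True where (s_i, a_i) comes from D_grasp."""
    per_sample = F.cross_entropy(h_logits, pg_demo, reduction="none")  # -log H_{p_g}
    per_sample = per_sample * from_demo.float()                        # 0 for non-demo samples
    return per_sample.mean()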

To evaluate performance under different levels of difficulty, we designed a simple and a challenging mode for each of the three tasks. In the simple mode, the maximum number of operations per task is set to $t_{\text{m}}$; in the challenging mode, this limit is halved to $\frac{t_{\text{m}}}{2}$ ($t_{\text{m}}$ is 2 for folding along the diagonal, 4 for folding along the central axis, and 10 for flattening). Additionally, to investigate the impact of the amount of human demonstration data, the models were trained with human demonstration data from 5, 20, and 100 rounds, respectively.

During the training phase, we first performed 20 rounds of pre-training to optimize the Critic network using the GABC technique. Training then proceeded to the regular phase, which consisted of 30 training epochs, each comprising 20 rounds of training with a batch size ($N$) of 64. After executing a set of actions (grasping and manipulation), the policy was updated $t_{\text{n}}$ times. The single-interaction update counts ($t_{\text{n}}$) for the folding-along-the-diagonal, folding-along-the-central-axis, and flattening tasks were 80, 40, and 20, respectively.

At the end of each training epoch, we conducted a testing phase of 10 rounds. Specifically, we first tested the performance of the initial policy upon completing policy initialization and increased the testing frequency during the pre-training phase (testing every 5 rounds of training). In total, we conducted 35 testing epochs. Furthermore, we ran experiments with three different random seeds. At the end of the $t$-th testing epoch ($t=0,\ldots,34$), for each seed $i$ ($i=1,2,3$), we recorded the total reward $R_{j}^{i,t}$, $j=1,\ldots,10$, obtained by the agent in each round. The average reward $R_{\text{avg}}^{i,t}$ for each seed was then computed across the ten testing rounds.

Based on the aforementioned processing steps, we define several key performance metrics:

1. The average reward per testing epoch $R_{\text{avg}}^{t}$ is defined as the average of all $R_{\text{avg}}^{i,t}$ values within the $t$-th testing epoch. All reward curves presented in this article are plotted based on $R_{\text{avg}}^{i,t}$. The specific calculation formula is:

R_{\text{avg}}^{t}=\frac{1}{3}\sum_{i=1}^{3}R_{\text{avg}}^{i,t}. \qquad (38)

2. The average reward $R_{\text{avg}}$, which is the average of all $R_{\text{avg}}^{t}$ values for $t=0,\ldots,34$.

3. The average standard deviation $\sigma_{\text{avg}}$. First, the standard deviation $\sigma^{t}$ of $R_{\text{avg}}^{i,t}$ over $i=1,2,3$ is calculated; the average of all $\sigma^{t}$ is then taken as $\sigma_{\text{avg}}$.

4. The average reward ranking $Rank_{\text{avg}}$, obtained by ranking all algorithms based on $R_{\text{avg}}$.

5. The average standard deviation ranking $Rank_{\sigma}$, obtained by ranking all algorithms based on $\sigma_{\text{avg}}$; a lower rank indicates higher stability. A computational sketch of these metrics is given after this list.
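The following sketch shows how $R_{\text{avg}}^{t}$, $R_{\text{avg}}$, and $\sigma_{\text{avg}}$ can be computed from per-round test rewards; the array layout (seeds x epochs x test rounds) is an assumption made for illustration.

# Sketch of computing R_avg^t, R_avg, and sigma_avg from per-round test rewards.
import numpy as np

R = np.random.rand(3, 35, 10) * 100      # R_j^{i,t}: 3 seeds, 35 testing epochs, 10 rounds
R_avg_it = R.mean(axis=2)                # R_avg^{i,t}: per-seed average over the 10 rounds
R_avg_t = R_avg_it.mean(axis=0)          # (38): average over the 3 seeds at each epoch
R_avg = R_avg_t.mean()                   # average reward over t = 0..34
sigma_t = R_avg_it.std(axis=0)           # std of R_avg^{i,t} across seeds at each epoch
sigma_avg = sigma_t.mean()               # average standard deviation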

IV-B Experiment Settings for Verifying the Effectiveness of the NMPC Demonstration Dataset

We constructed a $6\times 6$ point-mass spring-damper model, collected demonstration data, and selected the 100 rounds with the highest final rewards for each task to form $\mathcal{D}_{\text{NMPC}}$. Table II provides a detailed analysis of the NMPC demonstration dataset for the three tasks. Compared to Table I, the NMPC dataset exhibits similar average rewards and standard deviations but requires more steps on average, indicating that the NMPC strategy accomplishes the tasks with lower operational efficiency.

TABLE II: NMPC Demonstration Dataset
Task | Average Reward | Reward Standard Deviation | Average Steps
Folding Along the Diagonal | 95.1 | 2.4 | 4.4
Folding Along the Central Axis | 87.1 | 3.5 | 6.1
Flattening | 80.9 | 6.7 | 9.2

In this experiment, we adopt the same experimental settings, evaluation metrics, and numbering system as in Section IV-A. The demonstration dataset $\mathcal{D}_{\text{NMPC}}$ generated by the NMPC algorithm is used as the demonstration training set for the HGCR-DDPG model. Three statistical metrics are employed to compare the effectiveness of HGCR-DDPG models trained with assistance from $\mathcal{D}_{\text{NMPC}}$ and $\mathcal{D}_{\text{demo}}$: Cosine Similarity (CSS), Dynamic Time Warping (DTW), and the Pearson Correlation Coefficient (PCC). We compute these three similarity measures for both the reward and its standard deviation; the specific metrics, with a computational sketch after the list, are as follows:

1. Reward Cosine Similarity (RCS): cosine similarity of the per-testing-epoch average reward sequences $R_{\text{avg}}^{t}$.

2. Reward Dynamic Time Warping (RDT): DTW similarity of the per-testing-epoch average reward sequences $R_{\text{avg}}^{t}$.

3. Reward Pearson Correlation (RPC): Pearson correlation coefficient of the per-testing-epoch average reward sequences $R_{\text{avg}}^{t}$.

4. Standard Deviation Cosine Similarity (SCS): Cosine similarity of the standard deviation of reward sequences obtained with different random seeds.

5. Standard Deviation Dynamic Time Warping (SDT): DTW similarity of the standard deviation of reward sequences obtained with different random seeds.

6. Standard Deviation Pearson Correlation (SPC): Pearson correlation coefficient of the standard deviation of reward sequences obtained with different random seeds.
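The three underlying similarity measures can be computed as in the following sketch; the pure-Python DTW is an illustrative implementation, not necessarily the routine used in this work.

# Sketch of the three similarity measures used to compare reward curves:
# cosine similarity, a basic dynamic-time-warping distance, and the Pearson
# correlation coefficient.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pearson_correlation(a, b):
    return float(np.corrcoef(a, b)[0, 1])

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) DTW with absolute-difference cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Example: compare the 35-epoch reward curves of two trained models.
r_nmpc, r_demo = np.random.rand(35) * 100, np.random.rand(35) * 100
print(cosine_similarity(r_nmpc, r_demo), dtw_distance(r_nmpc, r_demo),
      pearson_correlation(r_nmpc, r_demo))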

IV-C Physical Experiment for Deformable Object Robot Manipulation

Hand-eye calibration is used to determine the spatial relationship between the camera coordinate system and the robot coordinate system. Endpoint detection is utilized to extract key endpoints from the fabric’s image. Optical flow tracking is employed for real-time tracking of the fabric endpoints’ positions. The specific processes of these three parts are as follows:

IV-C1 Hand-eye calibration

In this study, the easy-eye-hand software package is utilized for calibration.

IV-C2 Endpoint Detection

During the initial stages of the tasks, this study employs the endpoint detection algorithm to extract the pixel coordinates of the fabric endpoints in the image. For non-initial states in folding tasks, we use the previous placement point $\boldsymbol{p}_{\text{p}_{t-1}}$ as a reference to determine the current position of the baseline grasp point. Simultaneously, the optical flow tracking algorithm is employed to track the movement of the remaining endpoints. After coordinate transformation and hand-eye calibration, the three-dimensional positions of the endpoints in the robot coordinate system are obtained. Combined with observable visual information such as the fabric's area, a comprehensive state vector is formed.

We designed an endpoint recognition algorithm to extract key endpoints from images of the fabric. The algorithm is based on Canny edge detection and Douglas-Peucker polygon approximation, which accurately extract the fabric's edge information and identify the several furthest endpoints. First, the input RGB image is converted into a grayscale image. Then, Gaussian blur is applied to the grayscale image to reduce noise. Next, the Canny edge detection algorithm is employed to extract the edges of the image. We further refine the result by contour detection to find the largest contour in the image and apply the Douglas-Peucker algorithm [35] for polygon approximation to simplify the contour. Finally, we obtain a simplified contour containing a series of points, denoted as $\boldsymbol{C}=(\boldsymbol{p}_{1},\boldsymbol{p}_{2},\ldots,\boldsymbol{p}_{n})$.

To select the $k$ furthest endpoints from the simplified contour $\boldsymbol{C}$ (where $k=4$ for folding tasks and $k=8$ for flattening tasks), this study devises a heuristic method called Maximum Minimum Distance Vertices Selection (MMDVS). For any $k$ points $\boldsymbol{p}_{r_{1}},\boldsymbol{p}_{r_{2}},\cdots,\boldsymbol{p}_{r_{k}}$ on the contour, we first define a set $\boldsymbol{S}$ containing all possible pairs of these points. We then compute the Euclidean distance between each pair of points in $\boldsymbol{S}$ and find the minimum distance $d_{\text{min}}$. By traversing all possible combinations of $k$ points in $\boldsymbol{C}$, we find the combination with the maximum minimum distance $d_{\text{min}}$; these are the desired $k$ endpoints $\boldsymbol{P}=(\boldsymbol{p}_{1},\boldsymbol{p}_{2},\cdots,\boldsymbol{p}_{k})$. Additionally, if the number of points in the simplified contour $\boldsymbol{C}$ obtained by the Douglas-Peucker algorithm is less than $k$, we adjust the parameters of the Douglas-Peucker algorithm to add new points to $\boldsymbol{C}$ until a sufficient number of endpoints is obtained. The fabric area required by the deep RL agent is obtained by computing the contour area, and the centroid is obtained from the contour moments, using the relevant OpenCV functions. A sketch of this pipeline is given below.
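The following sketch illustrates the endpoint-detection pipeline and MMDVS selection just described. The parameter values (blur kernel, Canny thresholds, approximation epsilon) are illustrative assumptions, not the tuned values used in this work.

# Sketch of endpoint detection (Canny + Douglas-Peucker) and MMDVS selection.
import itertools
import cv2
import numpy as np

def detect_endpoints(image_bgr, k=4):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)
    edges = cv2.Canny(blurred, 50, 150)

    # Largest contour, simplified with Douglas-Peucker (cv2.approxPolyDP).
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)
    eps = 0.01 * cv2.arcLength(contour, True)
    C = cv2.approxPolyDP(contour, eps, True).reshape(-1, 2)
    while len(C) < k:                       # loosen the approximation until >= k points remain
        eps *= 0.5
        C = cv2.approxPolyDP(contour, eps, True).reshape(-1, 2)

    # MMDVS: among all k-point subsets, keep the one whose closest pair is farthest apart.
    best, best_dmin = None, -1.0
    for subset in itertools.combinations(range(len(C)), k):
        pts = C[list(subset)].astype(float)
        dmin = min(np.linalg.norm(pts[a] - pts[b])
                   for a, b in itertools.combinations(range(k), 2))
        if dmin > best_dmin:
            best, best_dmin = pts, dmin

    area = cv2.contourArea(contour)
    M = cv2.moments(contour)
    centroid = (M["m10"] / M["m00"], M["m01"] / M["m00"])
    return best, area, centroid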

IV-C3 Optical Flow Tracking

To achieve real-time tracking of the fabric's shape, this study designs an optical flow-based tracking algorithm. Tracking failures are mainly caused by occlusion from the end effector, and the endpoints that cannot be successfully tracked fall into two categories: non-baseline grasp points and baseline grasp points. For non-baseline grasp points, their positions are set to the positions from the last frame before the tracking failure, $\boldsymbol{p}_{i_{t-1}},\ i=1,2,\cdots,k$, and tracking continues from these positions. For the baseline grasp point, after the placement action is completed, the placement point $\boldsymbol{p}_{\text{p}_{t-1}}$ is taken as its new position. A sketch of this tracking loop is given below.
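The sketch below shows one way to combine pyramidal Lucas-Kanade optical flow with the occlusion fallback described above; the window size and fallback bookkeeping are illustrative assumptions.

# Sketch of Lucas-Kanade tracking of fabric endpoints with the occlusion fallback.
import cv2
import numpy as np

lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def track_endpoints(prev_gray, curr_gray, prev_pts, baseline_idx, placement_pt):
    """prev_pts: (k, 1, 2) float32 endpoint pixels; baseline_idx: index of the
    baseline grasp point; placement_pt: (2,) last placement point p_{p, t-1}."""
    curr_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray,
                                                   prev_pts, None, **lk_params)
    for i in range(len(prev_pts)):
        if status[i, 0] == 1:
            continue                                   # tracked successfully
        if i == baseline_idx:
            curr_pts[i, 0] = placement_pt              # baseline point -> placement point
        else:
            curr_pts[i, 0] = prev_pts[i, 0]            # keep last known position
    return curr_pts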

The comprehensive experimental procedure is shown in Fig. S5. At the beginning of the experiment, the robot and fabric are placed in their initial states. The system then operates in a loop according to the following steps until the fabric is manipulated from its initial state to the target state. First, the current state of the fabric is captured and analyzed to form a comprehensive state vector. This state vector is then input into the HGCR-DDPG model pre-trained with $\mathcal{D}_{\text{NMPC}}$ to generate instructions. Next, the robot executes actions according to the generated instructions, pushing the fabric toward the desired next state. Finally, after the operation is completed, the system collects and updates the state vector of the fabric in preparation for the next operation.

V Experiment Results

V-A Simulation Results of the Improved HGCR-DDPG Algorithm with Human Demonstrations

Tables S1 and S2 respectively list the algorithm numbers and experiment numbers involved in this article. Table S3 presents the design of 8 control groups in this study. Each control group evaluates the effectiveness of three technical improvements introduced in the HGCR-DDPG algorithm (HTSK, GABC, CPL) by comparing the performance of multiple algorithms. Table S3 also lists the legend styles and expected results of each algorithm in Fig. 7 and Fig. S6.

Figure 7: Reward Curve. (a) Folding along the diagonal. (b) Folding along the central axis. (c) Flattening.

Fig. 7 and Fig. S6 display the dynamic changes of the per-testing-epoch average reward $R_{\text{avg}}^{t}$ obtained by the various algorithms in different simulation experiments under three task settings: folding along the diagonal, folding along the central axis, and flattening. We smoothed the reward curves using a window of length 3. Tables S4, S6, and S8 respectively list the average rewards $R_{\text{avg}}$ and average reward rankings $Rank_{\text{avg}}$ achieved by the algorithms in the different experiments conducted for the three tasks. Tables S2, S7, and S9 then respectively display the average standard deviations $\sigma_{\text{avg}}$ and average standard deviation rankings $Rank_{\sigma}$ for the same experiments. The last column of each of these tables shows the average value of the respective metric ($R_{\text{avg}}/Rank_{\text{avg}}$ or $\sigma_{\text{avg}}/Rank_{\sigma}$) for each algorithm across all experiments for that task. Table III presents the global average reward (average of $R_{\text{avg}}$), the global average standard deviation ($\sigma_{\text{avg}}$), the average reward ranking (average of $Rank_{\text{avg}}$), and the average standard deviation ranking (average of $Rank_{\sigma}$) for each algorithm across all experiments. In this section, the optimal indicators are highlighted in bold, and the HGCR-DDPG algorithm proposed in this article is underlined.

TABLE III: Global Performance Metrics and Rankings of Algorithms
Algorithm Code | Global Average Reward | Global Average Standard Deviation | Average Reward Rank | Average Standard Deviation Rank
1 | 76.3 | 4.8 | 1.1 | 7.1
2 | 60.1 | 11.2 | 3.6 | 3.9
3 | 57.1 | 11.9 | 4.8 | 3.3
4 | 57.8 | 8.1 | 4.7 | 5.2
5 | 67.4 | 6.7 | 2.7 | 6.4
6 | 52.2 | 14.0 | 5.7 | 2.7
7 | 52.4 | 12.7 | 5.9 | 2.7
8 | 37.9 | 10.6 | 7.5 | 4.8

From these figures and tables, we can draw the following key conclusions. First, Algorithm 1 (HGCR-DDPG) demonstrates outstanding performance across all numerical simulation experiments for all tasks. Second, HTSK significantly enhances algorithm performance: under the same marker style, the red curves (algorithms using HTSK) generally exhibit clear advantages. In the majority of experimental scenarios, algorithms using HTSK for benchmark grasping-point selection outperform those employing random selection strategies, particularly in setups with stringent constraints on the number of operations. Furthermore, the positive impact of GABC is also significant: among curves of the same color, those marked with circles (algorithms using GABC) generally outperform those marked with squares. Although algorithms incorporating GABC are not particularly remarkable in terms of standard deviation in the flattening task, they surpass algorithms without GABC in all other experiments. Additionally, the effectiveness of CPL has been validated. It is worth noting that the performance of CPL is influenced by both the benchmark grasping-point selection strategy it adopts and the operational constraints of the experiments: the looser the constraints on the number of operations and the more stable the benchmark grasping-point selection strategy, the more significant the effect of CPL. Finally, Table III shows that Algorithm 1 (HGCR-DDPG) achieved the best performance across all metrics. Compared to the selected baseline, Algorithm 8 (Rainbow-DDPG), HGCR-DDPG achieved 2.01 times the global average reward and reduced the global average standard deviation to 45% of that of the baseline algorithm, demonstrating a significant performance advantage.

V-B Results of the Experiment for Verifying the Effectiveness of the NMPC Demonstration Dataset

Fig. 8 and Fig. S7 depict the curves of the per-testing-epoch average reward $R_{\text{avg}}^{t}$ of the HGCR-DDPG model trained with assistance from $\mathcal{D}_{\text{NMPC}}$ and $\mathcal{D}_{\text{demo}}$ for the tasks of folding along the diagonal, folding along the central axis, and flattening. Tables S10, S11, and S12 respectively show the performance of the HGCR-DDPG model assisted by $\mathcal{D}_{\text{NMPC}}$ in terms of average reward $R_{\text{avg}}$ and average standard deviation $\sigma_{\text{avg}}$ for the three tasks, as well as the ratio of the performance achieved by models assisted by $\mathcal{D}_{\text{NMPC}}$ to that achieved by models assisted by $\mathcal{D}_{\text{demo}}$.

Figure 8: Performance Comparison of HGCR-DDPG Trained with Assistance from $\mathcal{D}_{\text{NMPC}}$ and $\mathcal{D}_{\text{demo}}$. (a) Folding along the diagonal. (b) Folding along the central axis. (c) Flattening.

From these curves and tables, it is evident that in the task of folding along the diagonal, HGCR-DDPG quickly learns effective strategies regardless of whether $\mathcal{D}_{\text{NMPC}}$ or $\mathcal{D}_{\text{demo}}$ is used. However, in the tasks of folding along the central axis and flattening, as the difficulty increases, the performance difference between HGCR-DDPG assisted by the two demonstration datasets gradually becomes significant. In simplified task settings (Experiments 2.2, 2.4, 2.6, 3.2, 3.4), HGCR-DDPG assisted by $\mathcal{D}_{\text{NMPC}}$ learns rapidly, with performance reaching or slightly exceeding that of models assisted by $\mathcal{D}_{\text{demo}}$. This may be because the NMPC strategy itself performs well when a generous number of steps is allowed, enabling HGCR-DDPG to effectively extract strategies from its demonstrations. Conversely, under more stringent task settings (Experiments 2.1, 2.3, 2.5, 3.3, 3.5), the performance of HGCR-DDPG assisted by $\mathcal{D}_{\text{NMPC}}$ is generally lower than that of models assisted by $\mathcal{D}_{\text{demo}}$. This can be attributed to the difficulty the NMPC strategy has in completing tasks within a limited number of steps, which affects the performance of HGCR-DDPG under these conditions. This pattern also aligns with the higher average number of steps observed for the NMPC dataset in Table II. The results of Experiments 3.1 and 3.6 reflect the inherent stochastic factors in the experimental process.

TABLE IV: Comparison of Global Performance Metrics for HGCR-DDPG Trained with $\mathcal{D}_{\text{NMPC}}$ and $\mathcal{D}_{\text{demo}}$
Demonstration Dataset | Global Average Reward | Global Average Standard Deviation
$\mathcal{D}_{\text{demo}}$ | 76.3 | 4.8
$\mathcal{D}_{\text{NMPC}}$ | 76.1 | 4.0

Table IV presents a comparison of the overall performance metrics of the HGCR-DDPG models assisted by $\mathcal{D}_{\text{NMPC}}$ and $\mathcal{D}_{\text{demo}}$. The global average reward achieved by the model assisted by $\mathcal{D}_{\text{NMPC}}$ is 99.7% of that achieved by the model assisted by $\mathcal{D}_{\text{demo}}$, while its global average standard deviation of rewards across different random seeds is 83.3% of that of the $\mathcal{D}_{\text{demo}}$-assisted model. This indicates that HGCR-DDPG assisted by $\mathcal{D}_{\text{NMPC}}$ exhibits a performance level similar to that of HGCR-DDPG assisted by $\mathcal{D}_{\text{demo}}$.

From Table S13, it can be observed that the HGCR-DDPG models trained with $\mathcal{D}_{\text{NMPC}}$ and $\mathcal{D}_{\text{demo}}$ exhibit significant similarity at the level of the $R_{\text{avg}}^{t}$ sequence, particularly in RCS and RPC. This emphasizes a strong consistency between the $\mathcal{D}_{\text{NMPC}}$-assisted and $\mathcal{D}_{\text{demo}}$-assisted models with respect to the $R_{\text{avg}}^{t}$ sequence. However, in terms of the standard deviation of the reward sequences obtained with different random seeds, the similarity metrics show significant differences, especially in SPC. This difference may stem from two factors: first, the inherent randomness of the experiment may lead to fluctuations in the reward curves under different random seeds; second, the stylistic differences between NMPC and human-operated strategies may cause the model to adopt different action strategies in specific contexts, thus affecting certain performance metrics.

V-C Results of Physical Experiments on Visual Processing and Robot Manipulation

Figure 9: Optical Flow Tracking Results.

The endpoint recognition algorithm mainly targets two situations: when the fabric is completely flattened and when it is fully wrinkled. In the case of complete flattening, the algorithm accurately identifies the four endpoints of the fabric, as shown in the first image of Fig. 9. In the case of complete wrinkling, it also accurately identifies the eight representative endpoints of the fabric, as shown in Fig. S8. This indicates that the endpoint recognition algorithm can accurately identify the endpoints of the fabric under different fabric states, providing accurate initial positions for subsequent optical flow tracking. The optical flow tracking algorithm is primarily used to track the endpoints of the fabric in folding tasks, as shown in Fig. 9. This indicates that the optical flow tracking algorithm can accurately track the endpoints of the fabric as its shape changes, providing precise target positions for subsequent robot operations. It is worth noting that in the second-to-last image of Fig. 9, tracking fails for the two endpoints in the top left corner of the fabric. This is caused by occlusion from the end effector; in such cases, we applied the supplementation method introduced in Section IV-C3. After the operation is completed, the placement point is treated as the new position of the reference grasping point, ensuring the smoothness of robot operations.

The experimental operation process is illustrated in Fig. S9. In physical experiments, measuring the distance between different endpoints is inconvenient, so several indicators directly computable from visual information were set as follows:

1. Task Completion Rate: For folding along the diagonal, the target shape of the fabric was set as an isosceles right triangle with a side length of 0.24 meters. For folding along the central axis, the target shape of the fabric was set as a rectangle measuring 0.24 meters by 0.12 meters. For flattening, the target shape of the fabric was set as a square with a side length of 0.24 meters. Subsequently, the similarity between the target shape of the fabric and the actual shape was calculated using the ‘cv2.matchShapes()’ function in OpenCV, and this was used as the task completion rate.

2. Success Rate: An experiment was considered successful when the final task completion rate exceeded 0.9.

3. Average Steps: The average number of actions required for the robot to complete a specific task measured the efficiency of the robot’s operations. In this study, due to the thinness of the fabric used, the positioning accuracy of the sensor subsystem in the z-axis direction was extremely strict, with a tolerance of only 2 mm, which greatly increased the likelihood of gripping failure. To address this challenge, a heuristic strategy was adopted in the experiment: first attempt gripping based on the positioning information provided by the sensor subsystem. If the first gripping attempt was unsuccessful (i.e., no improvement in task completion rate), the gripping point was lowered by 2 mm in the z-axis direction and another attempt was made, repeating this process until successful gripping was achieved.

In the process of counting operation steps, this study only included each successful placement action in the total steps, without counting repeated attempts due to gripping failures. 30 experiments were conducted for each of the three tasks, and the aforementioned indicators were recorded. The experimental results are shown in Table V. For the folding along the diagonal task, 93.3% of the trials achieved a task completion score of no less than 0.6, while 90.0% of the trials achieved a task completion score of no less than 0.8. The overall success rate for this task is 83.3%, with an average of 1.1 steps required, indicating a relatively high success rate and fewer required steps. For the folding along the central axis task, 90.0% of the trials reached the standard of a task completion score of at least 0.6, and 86.7% of the trials reached the standard of a task completion score of at least 0.8. The success rate is 80.0%, with an average of 3.9 steps required. Compared to folding along the diagonal, this task requires more steps but still maintains a relatively high success rate. For the flattening task, all trials reached the standards of a task completion score of at least 0.6 and 0.8, with a high success rate of 96.7%. However, the average number of steps required is 13.5 steps, indicating that although the flattening task has the highest success rate, it is also the most time-consuming of the three tasks. This phenomenon can be attributed to the high tolerance for errors in the flattening task. Specifically, even if a certain operation leads to a decrease in task completion score, the robot can still flatten the fabric through subsequent operations. This characteristic leads to a high success rate for the flattening task but also results in an increase in the number of required steps. In summary, the success rates of all three tasks are relatively high, indicating that the experimental setup and methods used perform well in physical operations.

TABLE V: Physical Experiment Results
Task | Task Completion $\geq$ 0.6 | Task Completion $\geq$ 0.8 | Success Rate | Average Steps
Diagonal Folding | 93.3% | 90.0% | 83.3% | 1.1
Central Axis Folding | 90.0% | 86.7% | 80.0% | 3.9
Flattening | 100.0% | 100.0% | 96.7% | 13.5

V-D Discussion

In the context of robotic manipulation tasks for deformable objects, this article addresses the inefficiency of traditional RL methods by proposing the HGCR-DDPG algorithm. To tackle the issue of high costs associated with traditional human teaching methods, a low-cost demonstration collection method based on NMPC is introduced. The effectiveness of the proposed methods is validated through three experimental scenarios involving folding fabric diagonally, along the midline, and flattening it, both in simulation and real-world experiments. Extensive ablation studies are conducted to substantiate the rationality and efficacy of the algorithms.

Compared to similar research, Matas et al. [21] required nearly 80,000 interactions between the robot and the environment to complete the learning process; Jangir et al. [22] needed approximately 260,000 rounds of interaction data to train their agent; Yang et al. [24] used 28,000 image-action pairs collected via teleoperation to train a DNN as an end-to-end policy for folding a single towel. This study simplifies the data acquisition process and achieves comparable or even higher success rates than the aforementioned studies, providing novel insights and contributions for future tasks of a similar nature. Recently, more and more research has adopted Vision-Language-Action (VLA) models for robotic manipulation. However, such approaches often require significant computational resources and are excessive for specific tasks. For example, OpenVLA is a 7B-parameter VLA trained on 64 A100 GPUs for 14 days; during inference, it requires 15 GB of video memory and runs at approximately 6 Hz on an NVIDIA RTX 4090 GPU [36]. The largest RT-2 model uses 55B parameters, and it is infeasible to run such a model directly on standard desktop machines or the on-robot GPUs commonly used for real-time robot control [37]. Even TinyVLA requires 1.3B parameters [38]. In contrast, our proposed algorithm shows significant advantages in learning efficiency. Trained on an Intel i5-12400F CPU and an NVIDIA RTX 3050 GPU, our algorithm converges within dozens of epochs, the entire training process takes about 4 hours, and the model has at most 14,183 parameters. Compared with the currently popular large-model approaches for robot manipulation, the algorithm proposed in this paper is lightweight, requires few computational resources, and provides task-specific customization and efficient adaptability for specific tasks.

VI Conclusion

This article presents a study on deformable object robot manipulation based on demonstration-enhanced RL. To improve the learning efficiency of RL, this article enhances the utilization of demonstration data from multiple aspects, proposing the HGCR-DDPG algorithm and collecting $\mathcal{D}_{\text{demo}}$ for training. It first uses demonstration data to train the HTSK fuzzy system to select appropriate grasp points, then proposes GABC to improve the utilization of demonstration data in Rainbow-DDPG, and finally uses CPL to combine HTSK with the GABC-improved Rainbow-DDPG, forming a complete control algorithm for deformable object robot manipulation, namely HGCR-DDPG. The effectiveness of the proposed methods is verified through comprehensive simulation experiments. Compared to the baseline algorithm (Rainbow-DDPG), the proposed HGCR-DDPG algorithm achieves 2.01 times the global average reward and reduces the global average standard deviation to 45% of that of the baseline algorithm. To reduce the labor cost of demonstration collection, this article proposes a low-cost demonstration collection method based on NMPC. Based on the established spring-mass model, it uses the NMPC algorithm to control the robot to perform deformable object manipulation tasks in a simulation environment and uses the trajectories of rounds with higher rewards as demonstration data. Simulation results show that the global average reward obtained by the HGCR-DDPG model trained with $\mathcal{D}_{\text{NMPC}}$ is 99.7% of that of the model trained with $\mathcal{D}_{\text{demo}}$, and the global average standard deviation of rewards obtained under different random seeds is 83.3% of that of the model trained with $\mathcal{D}_{\text{demo}}$. This indicates that demonstration data collected through NMPC can be used to train HGCR-DDPG with effectiveness comparable to human demonstration data. To verify the feasibility of the proposed methods in a real environment, this article conducts physical experiments on deformable object robot manipulation. Using hardware including a UR5e robot, an OnRobot RG2 gripper, and a RealSense D435i camera, this article builds a physical experimental platform for deformable object robot manipulation and uses the HGCR-DDPG algorithm trained with assistance from $\mathcal{D}_{\text{NMPC}}$ to control the robot to manipulate fabric and perform folding along the diagonal, folding along the central axis, and flattening tasks. The experimental results show that the proposed methods achieve success rates of 83.3%, 80%, and 100%, respectively, in these three tasks, verifying the effectiveness of the method.

There are still many areas for improvement due to time constraints. Specifically, future work could expand on aspects such as multimodal perception inputs for RL state vectors, refinement of deformable object dynamics models, and small-sample learning for manipulating a wider variety of deformable objects.

References

  • [1] J. Zhu et al., “Challenges and Outlook in Robotic Manipulation of Deformable Objects,” in IEEE Robotics & Automation Magazine, vol. 29, no. 3, pp. 67-77, Sept. 2022.
  • [2] P. Long, W. Khalil and P. Martinet, “Modeling & control of a meat-cutting robotic cell,” 2013 16th International Conference on Advanced Robotics (ICAR), Montevideo, Uruguay, 2013, pp. 1-6.
  • [3] M. C. Gemici and A. Saxena, “Learning haptic representation for manipulating deformable food objects,” 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, Chicago, IL, USA, 2014, pp. 638-645.
  • [4] I. Leizea, A. Mendizabal, H. Alvarez, I. Aguinaga, D. Borro and E. Sanchez, “Real-Time Visual Tracking of Deformable Objects in Robot-Assisted Surgery,” in IEEE Computer Graphics and Applications, vol. 37, no. 1, pp. 56-68, Jan.-Feb. 2017.
  • [5] B. Thananjeyan, A. Garg, S. Krishnan, C. Chen, L. Miller and K. Goldberg, “Multilateral surgical pattern cutting in 2D orthotropic gauze with deep reinforcement learning policies for tensioning,” 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 2017, pp. 2371-2378.
  • [6] Y. Gao, H. J. Chang and Y. Demiris, “Iterative path optimisation for personalised dressing assistance using vision and force information,” 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon, Korea (South), 2016, pp. 4398-4403.
  • [7] F. Zhang and Y. Demiris, “Learning garment manipulation policies toward robot-assisted dressing,” in Science Robotics, vol.7, no.65, eabm6010, 2022.
  • [8] I. G. Ramirez-Alpizar, M. Higashimori, M. Kaneko, C. -H. D. Tsai and I. Kao, “Dynamic Nonprehensile Manipulation for Rotating a Thin Deformable Object: An Analogy to Bipedal Gaits,” in IEEE Transactions on Robotics, vol. 28, no. 3, pp. 607-618, June 2012.
  • [9] J. Huang, T. Fukuda and T. Matsuno, “Model-Based Intelligent Fault Detection and Diagnosis for Mating Electric Connectors in Robotic Wiring Harness Assembly Systems,” in IEEE/ASME Transactions on Mechatronics, vol. 13, no. 1, pp. 86-94, Feb. 2008.
  • [10] J. Zhu, B. Navarro, P. Fraisse, A. Crosnier and A. Cherubini, “Dual-arm robotic manipulation of flexible cables,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018, pp. 479-484.
  • [11] A. Shademan, R. S. Decker, J. D. Opfermann, S. Leonard, A. Krieger and P. C. W. Kim, “Supervised autonomous robotic soft tissue surgery,” in Science Translational Medicine, vol. 8, no. 337, pp. 337ra64, 2016.
  • [12] A. Jevtić et al., “Personalized Robot Assistant for Support in Dressing,” in IEEE Transactions on Cognitive and Developmental Systems, vol. 11, no. 3, pp. 363-374, Sept. 2019.
  • [13] J. Zhu, M. Gienger, G. Franzese and J. Kober, “Do You Need a Hand? – A Bimanual Robotic Dressing Assistance Scheme,” in IEEE Transactions on Robotics, vol. 40, pp. 1906-1919, 2024.
  • [14] J. Sanchez, J-A. Corrales, B-C. Bouzgarrou and Y. Mezouar, “Robotic manipulation and sensing of deformable objects in domestic and industrial applications: a survey,” in The International Journal of Robotics Research, vol. 37, no. 7, pp. 688-716, 2018.
  • [15] H. Yin, A. Varava and D. Kragi, “Modeling, learning, perception, and control methods for deformable object manipulation,” in Science Robotics, vol.6, no.54, eabd8803, 2021.
  • [16] Y. Li, Y. Yue, D. Xu, E. Grinspun and P. K. Allen, “Folding deformable objects using predictive simulation and trajectory optimization,” 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, Germany, 2015, pp. 6000-6006.
  • [17] H. Lin, F. Guo, F. Wang and Y-B. Jia, “Picking up a soft 3D object by “feeling” the grip,” in The International Journal of Robotics Research, vol. 34, no. 11, pp. 1361-1384, 2015.
  • [18] D. Navarro-Alarcon and Y. -H. Liu, “Fourier-Based Shape Servoing: A New Feedback Method to Actively Deform Soft Objects into Desired 2-D Image Contours,” in IEEE Transactions on Robotics, vol. 34, no. 1, pp. 272-279, Feb. 2018.
  • [19] T. Tamei, T. Matsubara, A. Rai and T. Shibata, “Reinforcement learning of clothing assistance with a dual-arm robot,” 2011 11th IEEE-RAS International Conference on Humanoid Robots, Bled, Slovenia, 2011, pp. 733-738.
  • [20] A. Colomé and C. Torras, “Dimensionality Reduction for Dynamic Movement Primitives and Application to Bimanual Manipulation of Clothes,” in IEEE Transactions on Robotics, vol. 34, no. 3, pp. 602-615, June 2018.
  • [21] J. Matas, S. James and A. J. Davison. “Sim-to-real reinforcement learning for deformable object manipulation,” 2018 Conference on Robot Learning (CoRL), Zürich, Switzerland, PMLR, 2018, pp. 734–743.
  • [22] R. Jangir, G. Alenyà and C. Torras, “Dynamic Cloth Manipulation with Deep Reinforcement Learning,” 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 2020, pp. 4630-4636.
  • [23] E. Pignat and S. Calinon, “Learning adaptive dressing assistance from human demonstration,” Robotics and Autonomous Systems, vol. 93, pp. 61-75, 2017.
  • [24] P. -C. Yang, K. Sasaki, K. Suzuki, K. Kase, S. Sugano and T. Ogata, “Repeatable Folding Task by Humanoid Robot Worker Using Deep Learning,” in IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 397-403, April 2017.
  • [25] A. Cherubini, J. Leitner, V. Ortenzi and P. Corke, “Towards vision-based manipulation of plastic materials,” 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 2018, pp. 485-490.
  • [26] J. Zhu, M. Gienger and J. Kober, “Learning Task-Parameterized Skills From Few Demonstrations,” in IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4063-4070, April 2022.
  • [27] B. Balaguer and S. Carpin, “Combining imitation and reinforcement learning to fold deformable planar objects,” 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco, CA, USA, 2011, pp. 1405-1412.
  • [28] H. Wang, J. Huang, H. Ru, Z. Fu, H. Lei, D. Wu and H. Wu, “Grasping State Analysis of Soft Manipulator Based on Flexible Tactile Sensor and High-Dimensional Fuzzy System,” in IEEE/ASME Transactions on Mechatronics, doi: 10.1109/TMECH.2024.3445504.
  • [29] Y. Cui, Y. Xu, R. Peng, and D. Wu, “Layer normalization for TSK fuzzy system optimization in regression problems,” in IEEE Trans. Fuzzy Syst., vol. 31, no. 1, pp. 254–264, Jan. 2023.
  • [30] Y. Wu, W. Yan, T. Kurutach, et al., “Learning to manipulate deformable objects without demonstrations,” Proc. 16th Robot.: Sci. Syst., 2020, [online] Available: https://roboticsproceedings.org/rss16/p065.html.
  • [31] J. Canny, “A Computational Approach to Edge Detection,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-8, no. 6, pp. 679-698, Nov. 1986.
  • [32] S. Fujimoto, H. Hoof and D. Meger, “Addressing function approximation error in actor-critic methods,” 2018 International Conference on Machine Learning (ICML), Stockholm, Sweden, PMLR, 2018, pp. 1587-1596.
  • [33] A. Wächter and L. T. Biegler, “On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming,” in Mathematical Programming, vol. 106, pp. 25–57, 2006.
  • [34] E. Coumans and Y. Bai, “PyBullet, a Python module for physics simulation for games, robotics and machine learning,” GitHub Repository, 2016.
  • [35] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of the number of points required to represent a digitized line or its caricature,” in Cartographica: The International Journal for Geographic Information and Geovisualization, vol. 10, no. 2, pp. 112–122, 1973.
  • [36] M. J. Kim, K. Pertsch, S. Karamcheti, et al., “OpenVLA: An Open-Source Vision-Language-Action Model,” in arXiv preprint, arXiv:2406.09246, 2024.
  • [37] B. Zitkovich, T. Yu, S. Xu, et al., “RT-2: Vision-language-action models transfer web knowledge to robotic control,” in Conference on Robot Learning, PMLR, Atlanta, GA, USA, 2023, pp. 2165-2183.
  • [38] J. Wen, Y. Zhu, J. Li, et al., “TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation,” in arXiv preprint, arXiv:2409.12514, 2024.