
SDSRA: A Skill-Driven Skill-Recombination Algorithm for Efficient Policy Learning

Eric H. Jiang and Andrew Lizarraga
Department of Statistics and Data Science
University of California, Los Angeles
Los Angeles, CA 90095-1554
{ericjiang0318, andrewlizarraga}@g.ucla.edu
Abstract

In this paper we introduce the Skill-Driven Skill Recombination Algorithm (SDSRA), a novel framework that significantly enhances the efficiency of achieving maximum entropy in reinforcement learning tasks. We find that SDSRA converges faster than the traditional Soft Actor-Critic (SAC) algorithm and produces improved policies. By integrating skill-based strategies within the robust Actor-Critic framework, SDSRA demonstrates remarkable adaptability and performance across a wide array of complex and diverse benchmarks.
Code: https://github.com/ericjiang18/SDSRA/.

1 Introduction

Reinforcement Learning (RL) has advanced significantly, with the Soft Actor-Critic (SAC) algorithm, introduced by Haarnoja et al. (2018), standing out for efficient exploration in complex tasks. Despite its strengths, SAC, like other RL methods, faces challenges in more intricate environments. To address these issues, recent research, such as goal-conditioned hierarchical learning (Chane-Sane et al., 2021) and intrinsically motivated RL with skill selection (Singh et al., 2004), focuses on enhancing RL frameworks. In this paper we address these issues and make the following contributions:

  • Innovative Framework: We introduce SDSRA, a novel skill-based approach that outperforms standard SAC.

  • Integration of Intrinsic Motivation: SDSRA incorporates intrinsically motivated learning within a hierarchical structure, enhancing self-directed exploration and skill development, capabilities that standard SAC lacks.

  • Enhanced Skill Acquisition and Dynamic Selection: Our method excels in acquiring and dynamically selecting a wide range of skills suitable for varying environmental conditions, offering greater adaptability.

  • Superior Performance and Learning Rate: We demonstrate faster learning and higher rewards than conventional SAC across several benchmarks.

1.1 Related Work

Reinforcement learning research is expanding, particularly in hierarchical structures and intrinsic motivation. Tang et al. (2021) developed a hierarchical SAC variant with sub-goals, yet their work lacks public code and detailed results. Ma et al. (2022) proposed ELIGN for predicting agent cooperation using intrinsic rewards, while Aubret et al. (2019) surveyed RL algorithms with intrinsic motivation. Other notable works include Laskin et al. (2022)’s skill learning algorithm combining intrinsic rewards and representation learning, Sharma et al. (2019)’s skill discovery algorithm, Bagaria & Konidaris (2020)’s option-discovery method based on deep skill chaining, and Zheng et al. (2018)’s intrinsic reward mechanism for policy gradient and PPO algorithms. Despite progress, a gap persists in skill-driven recombination algorithms using intrinsic rewards in Actor-Critic frameworks, particularly in physical environments such as MuJoCo Gym. Our SDSRA addresses this gap by blending skill-driven learning with Actor-Critic methods and proves effective in complex simulations.

2 Motivation for SDSRA

The SDSRA algorithm adapts the SAC framework, retaining its integration of rewards and entropy maximization and its use of actor and critic networks for action selection and evaluation. While SAC emphasizes entropy for diverse exploration, SDSRA introduces a novel skill-selection scheme for enhanced performance in complex environments. SDSRA defines a set of Gaussian policy skills $S=\{\pi_{1},\pi_{2},\ldots,\pi_{N}\}$ with parameters $\theta_{i}$ representing the mean $\mu_{\theta_{i}}(s)$ and covariance $\Sigma_{\theta_{i}}(s)$. Each skill $\pi_{i}$ is formulated as $\pi_{i}(\theta_{i})=\mathcal{N}(\mu_{\theta_{i}}(s),\Sigma_{\theta_{i}}(s))$. Skills are initialized with a relevance score $r_{i}=c$, and skill selection is probabilistic, based on a softmax distribution over relevance scores: $P(i|s)=\frac{e^{r_{i}}}{\sum_{j=1}^{N}e^{r_{j}}}$. Skill optimization in SDSRA minimizes the loss $\text{loss}_{i}=\varepsilon_{i}+\beta\cdot\mathcal{H}(\pi_{i})$, combining the prediction error $\varepsilon_{i}=\frac{1}{M}\sum_{m=1}^{M}(\hat{a}_{i,m}-a_{m})^{2}$ and the policy entropy $\mathcal{H}(\pi_{i})=-\int\pi_{i}(a|s)\log(\pi_{i}(a|s))\,da$. Precise parameter updates and implementation details are discussed in Appendix B. SDSRA’s decision-making selects and executes actions based on skill selection and continuous skill refinement, enabling adaptive and effective behavior in diverse environments. In the integrated framework, the SAC objective function is modified to incorporate the dynamic skill selection process: the new objective maximizes not just the expected return, but also the entropy across the diverse set of skills. The modified objective function is expressed as:

J_{\text{integrated}}(\pi)=\sum_{i=1}^{N}P(i|s)\left(\mathbb{E}_{(s_{t},a_{t})\sim\pi_{i}}\left[Q(s_{t},a_{t})+\alpha\mathcal{H}(\pi_{i}(\cdot|s_{t}))\right]\right) (1)

where $Q(s_{t},a_{t})$ is the action-value function estimated by SAC’s critic networks, and $\alpha$ scales the importance of the entropy term $\mathcal{H}(\pi_{i}(\cdot|s_{t}))$ for each skill $\pi_{i}$. Under this formulation we find that SDSRA converges to an improved policy; see Appendix A.1 and Appendix A.2. Moreover, experiments on benchmarks commonly used to evaluate SAC demonstrate significant improvements of SDSRA over SAC.
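To make the skill-selection scheme and the integrated objective of Eq. (1) concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions rather than the released implementation: the linear-Gaussian skill parameterization, the quadratic stand-in for the critic's Q, and the Monte-Carlo sample sizes are all placeholders.

import numpy as np

rng = np.random.default_rng(0)

class Skill:
    """One Gaussian policy skill pi_i(a|s) = N(mu_i(s), Sigma_i(s)).
    Here the mean is a linear function of s and Sigma is diagonal (an assumption)."""
    def __init__(self, action_dim, state_dim):
        self.W = rng.normal(scale=0.1, size=(action_dim, state_dim))  # mean = W s
        self.log_std = np.zeros(action_dim)                           # diagonal Sigma

    def sample(self, s):
        mu = self.W @ s
        return mu + np.exp(self.log_std) * rng.normal(size=mu.shape)

    def entropy(self):
        # Differential entropy of a diagonal Gaussian.
        return 0.5 * np.sum(np.log(2 * np.pi * np.e * np.exp(2 * self.log_std)))

def skill_probabilities(relevance):
    """P(i|s): softmax over relevance scores."""
    z = relevance - relevance.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def integrated_objective(skills, relevance, q_fn, s, alpha=0.2, n_samples=32):
    """Monte-Carlo estimate of Eq. (1): sum_i P(i|s) E_{a~pi_i}[Q(s,a) + alpha H(pi_i)]."""
    probs = skill_probabilities(relevance)
    total = 0.0
    for p, skill in zip(probs, skills):
        q_est = np.mean([q_fn(s, skill.sample(s)) for _ in range(n_samples)])
        total += p * (q_est + alpha * skill.entropy())
    return total

# Toy usage with a hypothetical quadratic critic Q(s, a) = -||a||^2.
state_dim, action_dim, n_skills = 4, 2, 5
skills = [Skill(action_dim, state_dim) for _ in range(n_skills)]
relevance = np.full(n_skills, 1.0)           # r_i = c initially
q_fn = lambda s, a: -np.sum(a ** 2)
s = rng.normal(size=state_dim)
print(integrated_objective(skills, relevance, q_fn, s))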

3 Experiments

Figure 1: Performance comparison of SDSRA and SAC in the MuJoCo Ant-v2 (a), Half-Cheetah-v2 (b), and Hopper-v2 (c) environments.

We assessed the Skill-Driven Skill Recombination Algorithm (SDSRA) on MuJoCo gym locomotion tasks (Brockman et al., 2016), comparing it to the Soft Actor-Critic (SAC) algorithm. Our tests in challenging environments such as Ant-v2, HalfCheetah-v2, and Hopper-v2 showed that SDSRA outperformed SAC, achieving reward convergence in fewer steps. This highlights SDSRA’s efficiency and potential for complex reinforcement learning tasks.

4 Conclusion

In this paper, we introduced the Skill-Driven Skill Recombination Algorithm (SDSRA), which outperforms the traditional Soft Actor-Critic in reinforcement learning, particularly in MuJoCo environments. Its skill-based approach leads to faster convergence and higher rewards, showing strong potential for complex tasks requiring quick adaptability and learning efficiency.

References

  • Aubret et al. (2019) A. Aubret, L. Matignon, and S. Hassas. A survey on intrinsic motivation in reinforcement learning. arXiv preprint arXiv:1908.06976, 2019. URL https://arxiv.org/abs/1908.06976.
  • Bagaria & Konidaris (2020) Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/pdf?id=B1gqipNYwH.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, J. Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016. URL https://arxiv.org/abs/1606.01540.
  • Chane-Sane et al. (2021) Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Goal-conditioned reinforcement learning with imagined subgoals. In Proceedings of the 38th International Conference on Machine Learning (PMLR 139), 2021. URL http://proceedings.mlr.press/v139/chane-sane21a/chane-sane21a.pdf.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018. URL https://arxiv.org/abs/1801.01290.
  • Laskin et al. (2022) M. Laskin, Hao Liu, X. B. Peng, Denis Yarats, A. Rajeswaran, and P. Abbeel. Cic: Contrastive intrinsic control for unsupervised skill discovery. arXiv preprint arXiv:2202.00161, 2022. URL https://arxiv.org/abs/2202.00161.
  • Ma et al. (2022) Zixian Ma, Rose Wang, Li Fei-Fei, Michael Bernstein, and Ranjay Krishna. Elign: Expectation alignment as a multi-agent intrinsic reward. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/pdf/2210.04365.pdf.
  • Sharma et al. (2019) Archit Sharma, Shixiang Shane Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019. URL https://api.semanticscholar.org/CorpusID:195791369.
  • Singh et al. (2004) Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2004. URL https://www.cs.cornell.edu/~helou/IMRL.pdf.
  • Tang et al. (2021) Hengliang Tang, Anqi Wang, Fei Xue, Jiaxin Yang, and Yang Cao. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation. IEEE Access, 9:42568–42582, 2021. doi: 10.1109/ACCESS.2021.3062457.
  • Zheng et al. (2018) Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems (NeurIPS), 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/51de85ddd068f0bc787691d356176df9-Paper.pdf.

URM Statement

The authors acknowledge that at least one key author of this work meets the URM criteria of ICLR 2024 Tiny Papers Track.

Appendix A Proofs of Main Results

A.1 Policy Improvement Guarantee

Lemma A.1 (Policy Improvement Guarantee).

Given a policy $\pi$, if the soft Q-values $Q^{\pi}$ are updated according to the soft Bellman backup operator, then the policy $\pi^{\prime}$, which acts greedily with respect to $Q^{\pi}$, achieves an equal or greater expected return than $\pi$.

Proof.

Let $\pi$ be any policy and $\pi^{\prime}$ be the policy that is greedy with respect to the soft Q-values $Q^{\pi}$. By definition of the greedy policy, for all states $s\in\mathcal{S}$, we have:

\pi^{\prime}(a|s)=\arg\max_{a^{\prime}}\left(Q^{\pi}(s,a^{\prime})+\alpha\mathcal{H}(\pi(\cdot|s))\right). (2)

Now consider the soft value function $V^{\pi}$, which is given by:

V^{\pi}(s)=\mathbb{E}_{a\sim\pi}\left[Q^{\pi}(s,a)-\alpha\log\pi(a|s)\right]. (3)

Using the soft Bellman optimality equation for $Q^{\pi^{\prime}}$, we get:

Q^{\pi^{\prime}}(s,a)=\mathbb{E}_{s^{\prime}\sim P}\left[r(s,a)+\gamma V^{\pi^{\prime}}(s^{\prime})\right]. (4)

Substituting the expression for $V^{\pi^{\prime}}$ into the above, we have:

Q^{\pi^{\prime}}(s,a)=\mathbb{E}_{s^{\prime}\sim P,\,a^{\prime}\sim\pi^{\prime}}\left[r(s,a)+\gamma\left(Q^{\pi^{\prime}}(s^{\prime},a^{\prime})-\alpha\log\pi^{\prime}(a^{\prime}|s^{\prime})\right)\right]. (5)

Since $\pi^{\prime}$ is greedy with respect to $Q^{\pi}$, it follows that $Q^{\pi^{\prime}}(s,a)\geq Q^{\pi}(s,a)$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$.

Thus, we have shown that acting greedily with respect to the soft Q-values under policy $\pi$ results in a policy $\pi^{\prime}$ that has greater or equal Q-value for all state-action pairs, which completes the proof. ∎
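To make the improvement step concrete, the following minimal NumPy sketch (an illustration, not from the paper) checks the guarantee on a randomly generated finite MDP: it evaluates the soft Q-values of an arbitrary policy, forms the soft-greedy (Boltzmann) policy that maximizes $\mathbb{E}_{a}[Q^{\pi}(s,a)]+\alpha\mathcal{H}(\pi(\cdot|s))$, and verifies that the soft value does not decrease in any state. The MDP, the iteration counts, and the Boltzmann form of the improved policy are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 5, 3, 0.9, 0.5

# Random toy MDP: P[s, a, s'] transition probabilities, R[s, a] rewards.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.normal(size=(nS, nA))

def soft_policy_evaluation(pi, iters=2000):
    """Iterate the soft Bellman backup T^pi to an (approximate) fixed point Q^pi."""
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)   # soft value, cf. Eq. (3)
        Q = R + gamma * P @ V                                # soft backup, cf. Eq. (6)
    return Q

def soft_greedy(Q):
    """Policy maximizing E_a[Q(s,a)] + alpha*H(pi(.|s)): a Boltzmann policy in Q/alpha."""
    logits = Q / alpha
    logits -= logits.max(axis=1, keepdims=True)
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

pi = rng.dirichlet(np.ones(nA), size=nS)      # arbitrary initial stochastic policy
Q_pi = soft_policy_evaluation(pi)
pi_new = soft_greedy(Q_pi)
Q_new = soft_policy_evaluation(pi_new)

V_old = np.sum(pi * (Q_pi - alpha * np.log(pi)), axis=1)
V_new = np.sum(pi_new * (Q_new - alpha * np.log(pi_new)), axis=1)
print("soft value improved in every state:", np.all(V_new >= V_old - 1e-8))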

A.2 Convergence to Optimal Policy

Theorem A.2 (Convergence to Optimal Policy).

Repeated application of soft policy evaluation and soft policy improvement from any initial policy $\pi\in\Pi$ converges to a policy $\pi^{*}$ such that $Q^{\pi^{*}}(s_{t},a_{t})\geq Q^{\pi}(s_{t},a_{t})$ for all $\pi\in\Pi$ and $(s_{t},a_{t})\in\mathcal{S}\times\mathcal{A}$, assuming $|\mathcal{A}|<\infty$.

Proof.

The soft Bellman backup operator for policy evaluation under policy $\pi$ is given by:

T^{\pi}Q(s,a)=\mathbb{E}_{s^{\prime}\sim P,\,a^{\prime}\sim\pi}\left[r(s,a)+\gamma\left(Q(s^{\prime},a^{\prime})-\alpha\log\pi(a^{\prime}|s^{\prime})\right)\right]. (6)

This operator is a contraction mapping in the supremum norm, which ensures that repeated application of $T^{\pi}$ to any initial Q-function $Q_{0}$ converges to a unique fixed point $Q^{\pi}$ that satisfies the soft Bellman equation for policy $\pi$.

Now, let us define the soft Bellman optimality operator $T^{*}$ as:

T^{*}Q(s,a)=\max_{\pi}T^{\pi}Q(s,a). (7)

The soft policy improvement step involves updating the policy $\pi$ to a new policy $\pi^{\prime}$ by choosing actions that maximize the current soft Q-values plus the entropy term:

\pi^{\prime}=\arg\max_{\pi}\mathbb{E}_{a\sim\pi}\left[Q^{\pi}(s,a)-\alpha\log\pi(a|s)\right]. (8)

By the policy improvement theorem, this new policy $\pi^{\prime}$ achieves a Q-value that is greater than or equal to that of $\pi$, i.e., $Q^{\pi^{\prime}}(s,a)\geq Q^{\pi}(s,a)$ for all $(s,a)$.

Since the action space $\mathcal{A}$ is finite, there are a finite number of deterministic policies in $\Pi$. Thus, the sequence of policies $\{\pi_{k}\}$ obtained by alternating soft policy evaluation and soft policy improvement must eventually converge to a policy $\pi^{*}$ that cannot be improved further, which means it is the optimal policy with respect to the soft Bellman optimality equation. Therefore, we have:

Q^{\pi^{*}}(s,a)=T^{*}Q^{\pi^{*}}(s,a), (9)

for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. Hence, $Q^{\pi^{*}}(s,a)\geq Q^{\pi}(s,a)$ for all $\pi\in\Pi$, which concludes the proof that the sequence of policies converges to an optimal policy $\pi^{*}$. ∎
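As a numerical sanity check on the contraction argument, the following minimal NumPy sketch (an illustration, not part of the paper) builds a small random MDP, fixes a stochastic policy $\pi$, and verifies that one application of the soft Bellman backup $T^{\pi}$ of Eq. (6) shrinks the supremum-norm distance between two arbitrary Q-functions by at least a factor of $\gamma$. The MDP sizes and hyperparameters are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 5, 3, 0.9, 0.5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
R = rng.normal(size=(nS, nA))
pi = rng.dirichlet(np.ones(nA), size=nS)        # fixed stochastic policy pi(a|s)

def soft_backup(Q):
    """One application of T^pi (Eq. 6) for the fixed policy pi."""
    V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)
    return R + gamma * P @ V

Q1, Q2 = rng.normal(size=(nS, nA)), rng.normal(size=(nS, nA))
before = np.max(np.abs(Q1 - Q2))
after = np.max(np.abs(soft_backup(Q1) - soft_backup(Q2)))
print(f"||TQ1 - TQ2||_inf = {after:.4f} <= gamma * ||Q1 - Q2||_inf = {gamma * before:.4f}")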

A.3 Entropy Maximization Efficiency of SDSRA

Theorem A.3 (Entropy Maximization Efficiency of SDSRA).

Let $\pi^{\text{SAC}}$ and $\pi^{\text{SDSRA}}$ be the policies obtained from the SAC and SDSRA algorithms, respectively, when trained under identical conditions. Assume that both algorithms achieve convergence. Then, for any state $s\in\mathcal{S}$, the expected entropy of $\pi^{\text{SDSRA}}$ is greater than or equal to that of $\pi^{\text{SAC}}$:

\mathbb{E}_{a\sim\pi^{\text{SDSRA}}}\left[-\log\pi^{\text{SDSRA}}(a|s)\right]\geq\mathbb{E}_{a\sim\pi^{\text{SAC}}}\left[-\log\pi^{\text{SAC}}(a|s)\right], (10)

or the time to reach an $\epsilon$-optimal policy entropy for SDSRA is less than that for SAC:

t_{\text{SDSRA}}(\epsilon)\leq t_{\text{SAC}}(\epsilon), (11)

where $t_{\text{SDSRA}}(\epsilon)$ and $t_{\text{SAC}}(\epsilon)$ denote the time to reach a policy entropy within $\epsilon$ of the maximum entropy for SDSRA and SAC, respectively.

Proof.

Assume that both $\pi^{\text{SAC}}$ and $\pi^{\text{SDSRA}}$ have converged to their respective policy distributions for all states $s\in\mathcal{S}$. By the definition of convergence, the policies are stationary and hence the expected entropy under each policy is constant over time.

Consider the skill-based decision-making process inherent in SDSRA. At each decision step, SDSRA selects a skill from a diversified set, which is represented as a policy over actions. This process is formalized by a softmax function over the skills’ relevance scores, which in turn are updated based on the performance and diversity of actions taken. As a consequence, the SDSRA policy is encouraged to explore a wider range of actions, leading to an increase in the expected entropy of the policy.

Formally, let $S$ be the set of all skills in SDSRA, and let $r_{i}$ be the relevance score of skill $i$. Then the probability of selecting an action $a$ given state $s$ under policy $\pi^{\text{SDSRA}}$ is given by a mixture of policies corresponding to each skill:

\pi^{\text{SDSRA}}(a|s)=\sum_{i=1}^{N}P(i|s)\,\pi_{\text{skill}_{i}}(a|s), (12)

where $P(i|s)$ is the softmax probability of selecting skill $i$.

The entropy of a mixture of policies is generally higher than the entropy of any individual policy in the mixture. Therefore, the expected entropy of $\pi^{\text{SDSRA}}$ is greater than the expected entropy of any individual skill policy, and by extension, greater than or equal to the entropy of $\pi^{\text{SAC}}$, which does not utilize a mixture of policies:

\mathbb{E}_{a\sim\pi^{\text{SDSRA}}}\left[-\log\pi^{\text{SDSRA}}(a|s)\right]\geq\mathbb{E}_{a\sim\pi^{\text{SAC}}}\left[-\log\pi^{\text{SAC}}(a|s)\right]. (13)
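To illustrate the mixture-entropy argument numerically, the following minimal NumPy sketch (not part of the paper) compares the differential entropy of a two-component 1-D Gaussian mixture, standing in for Eq. (12), against the entropies of its components. The means, variances, and weights are arbitrary; the inequality shown here holds in this instance because the components are well separated, and in general a mixture's entropy is at least the weighted average of its components' entropies.

import numpy as np

# Two 1-D Gaussian "skill" policies and an equal-weight mixture (cf. Eq. 12).
mus, sigmas, weights = np.array([-1.0, 1.5]), np.array([0.5, 0.8]), np.array([0.5, 0.5])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

component_entropies = 0.5 * np.log(2 * np.pi * np.e * sigmas ** 2)    # closed form
p_mix = sum(w * gauss(x, m, s) for w, m, s in zip(weights, mus, sigmas))
mixture_entropy = -np.sum(p_mix * np.log(p_mix + 1e-300)) * dx         # numerical integral

print("component entropies:", component_entropies)
print("mixture entropy    :", mixture_entropy)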

Furthermore, due to the dynamic and adaptive nature of skill selection in SDSRA, the algorithm rapidly explores high-entropy policies, thus reaching a policy with entropy within $\epsilon$ of the maximum entropy faster than SAC, which optimizes a single policy without such a mechanism. This leads to:

t_{\text{SDSRA}}(\epsilon)\leq t_{\text{SAC}}(\epsilon), (14)

completing the proof. ∎

Appendix B SDSRA Algorithm

Algorithm 1 Soft Actor-Critic with Skill-Driven Skill Recombination Algorithm (SDSRA)
1:Initialize action-value functions $Q_{\theta_{1}},Q_{\theta_{2}}$ with parameters $\theta_{1},\theta_{2}$
2:Initialize the policy $\pi_{\phi}$ with parameters $\phi$
3:Initialize target value parameters $\theta^{\prime}_{1}\leftarrow\theta_{1},\theta^{\prime}_{2}\leftarrow\theta_{2}$
4:Initialize skill set $S=\{\pi_{\theta_{\text{skill}_{i}}}\}_{i=1}^{N}$ with parameters $\theta_{\text{skill}_{i}}$
5:Initialize relevance scores $r_{i}\leftarrow c,\forall i\in\{1,\ldots,N\}$
6:Initialize replay buffer $D$
7:for each iteration do
8:     for each environment step do
9:         Sample skill index $i$ using probabilities $P(i|s)=\frac{e^{r_{i}}}{\sum_{j=1}^{N}e^{r_{j}}}$
10:         Select action $a_{t}\sim\pi_{\theta_{\text{skill}_{i}}}(s_{t})$
11:         Execute $a_{t}$ and observe reward $r_{t}$ and new state $s_{t+1}$
12:         Store transition tuple $(s_{t},a_{t},r_{t},s_{t+1},i)$ in buffer $D$
13:     end for
14:     for each gradient step do
15:         Randomly sample a batch of transitions from $D$
16:         Compute target values using the Bellman equation
17:         Update $Q_{\theta_{1}},Q_{\theta_{2}}$ by minimizing the loss:
L(\theta_{i})=\mathbb{E}_{(s,a,r,s^{\prime})\sim D}\left[\left(Q_{\theta_{i}}(s,a)-\left(r+\gamma\left(\min_{j=1,2}Q_{\theta^{\prime}_{j}}(s^{\prime},a^{\prime})-\alpha\log\pi_{\phi}(a^{\prime}|s^{\prime})\right)\right)\right)^{2}\right],\quad a^{\prime}\sim\pi_{\phi}(\cdot|s^{\prime})
18:         Update policy $\pi_{\phi}$ using the policy gradient:
\nabla_{\phi}J(\pi_{\phi})=\mathbb{E}_{s\sim D,\,a\sim\pi_{\phi}}\left[\nabla_{\phi}\log\pi_{\phi}(a|s)\,Q_{\theta_{1}}(s,a)\right]
19:         Update target networks: $\theta^{\prime}_{i}\leftarrow\tau\theta_{i}+(1-\tau)\theta^{\prime}_{i}$
20:     end for
21:     for each skill update interval do
22:         Evaluate and update the performance of each skill $\pi_{\theta_{\text{skill}_{i}}}$
23:         Update relevance scores $r_{i}$ based on the performance
24:     end for
25:end for
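As a concrete companion to the skill-specific parts of Algorithm 1 (lines 5, 9, and 21-24), here is a minimal Python sketch of the relevance-score bookkeeping. It is an illustration only: the softmax sampling mirrors line 9, while the exponential-moving-average update from per-skill returns is an assumption, since the precise relevance update rule is left to the implementation.

import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

class SkillManager:
    """Relevance-score bookkeeping for Algorithm 1 (lines 5, 9, 21-24).
    The EMA-of-returns update below is an assumption made for illustration."""
    def __init__(self, n_skills, init_score=1.0, ema=0.1):
        self.scores = np.full(n_skills, init_score)   # r_i <- c (line 5)
        self.ema = ema

    def sample_skill(self):
        # P(i|s) = softmax over relevance scores (line 9)
        return rng.choice(len(self.scores), p=softmax(self.scores))

    def update(self, skill_returns):
        # skill_returns[i]: average return collected while skill i was active (lines 22-23)
        for i, g in skill_returns.items():
            self.scores[i] = (1 - self.ema) * self.scores[i] + self.ema * g

# Usage with hypothetical per-skill returns.
mgr = SkillManager(n_skills=4)
mgr.update({0: 5.0, 2: -1.0})
print(mgr.sample_skill(), softmax(mgr.scores))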