
SDSRA: A Skill-Driven Skill-Recombination Algorithm for Efficient Policy Learning

Eric H. Jiang and Andrew Lizarraga
Department of Statistics and Data Science
University of California, Los Angeles
Los Angeles, CA 90095-1554
{ericjiang0318, andrewlizarraga}@g.ucla.edu
Abstract

In this paper we introduce the Skill-Driven Skill Recombination Algorithm (SDSRA), a novel framework that significantly enhances the efficiency of achieving maximum entropy in reinforcement learning tasks. We find that SDSRA converges faster than the traditional Soft Actor-Critic (SAC) algorithm and produces improved policies. By integrating skill-based strategies within the robust Actor-Critic framework, SDSRA demonstrates remarkable adaptability and performance across a wide array of complex and diverse benchmarks.
Code: https://github.com/ericjiang18/SDSRA/.

1 Introduction

Reinforcement Learning (RL) has advanced significantly, with the Soft Actor-Critic (SAC) algorithm, introduced by Haarnoja et al. (2018), standing out for efficient exploration in complex tasks. Despite its strengths, SAC, like other RL methods, faces challenges in more intricate environments. To address these issues, recent research, such as goal-conditioned hierarchical learning (Chane-Sane et al., 2021) and intrinsically motivated RL with skill selection (Singh et al., 2004), focuses on enhancing RL frameworks. In this paper we address these issues and make the following contributions:

  • Innovative Framework: We introduce SDSRA, a novel skill-based approach that outperforms standard SAC.

  • Integration of Intrinsic Motivation: SDSRA incorporates intrinsically motivated learning within a hierarchical structure, enhancing self-directed exploration and skill development, capabilities that standard SAC lacks.

  • Enhanced Skill Acquisition and Dynamic Selection: Our method excels in acquiring and dynamically selecting a wide range of skills suitable for varying environmental conditions, offering greater adaptability.

  • Superior Performance and Learning Rate: We demonstrate faster learning and higher rewards than conventional SAC across several benchmarks.

1.1 Related Work

Reinforcement learning research is expanding, particularly in hierarchical structures and intrinsic motivation. Tang et al. (2021) developed a hierarchical SAC variant with sub-goals, yet their work lacks public code and detailed results. Ma et al. (2022) proposed ELIGN for predicting agent cooperation using intrinsic rewards, while Aubret et al. (2019) surveyed RL algorithms with intrinsic motivation. Other notable works include Laskin et al. (2022)’s skill learning algorithm combining intrinsic rewards and representation learning, Sharma et al. (2019)’s skill discovery algorithm, Bagaria & Konidaris (2020)’s option-discovery method based on deep skill chaining, and Zheng et al. (2018)’s intrinsic reward mechanism for policy gradient and PPO algorithms. Despite progress, a gap persists in skill-driven recombination algorithms using intrinsic rewards in Actor-Critic frameworks, particularly in physical environments such as MuJoCo Gym. Our SDSRA addresses this gap by blending skill-driven learning with Actor-Critic methods and proves effective in complex simulations.

2 Motivation for SDSRA

The SDSRA algorithm adapts the SAC framework, retaining its integration of rewards and entropy maximization and its use of actor and critic networks for action selection and evaluation. While SAC emphasizes entropy for diverse exploration, SDSRA introduces a novel skill-selection scheme for enhanced performance in complex environments. SDSRA defines a set of Gaussian policy skills $S=\{\pi_{1},\pi_{2},\ldots,\pi_{N}\}$ with parameters $\theta_{i}$ representing the mean $\mu_{\theta_{i}}(s)$ and covariance $\Sigma_{\theta_{i}}(s)$. Each skill $\pi_{i}$ is formulated as $\pi_{i}(\theta_{i})=\mathcal{N}(\mu_{\theta_{i}}(s),\Sigma_{\theta_{i}}(s))$. Skills are initialized with a relevance score $r_{i}=c$, and skill selection is probabilistic, based on a softmax distribution over relevance scores: $P(i|s)=\frac{e^{r_{i}}}{\sum_{j=1}^{N}e^{r_{j}}}$. Skill optimization in SDSRA minimizes the loss $\text{loss}_{i}=\varepsilon_{i}+\beta\cdot\mathcal{H}(\pi_{i})$, combining the prediction error $\varepsilon_{i}=\frac{1}{M}\sum_{m=1}^{M}(\hat{a}_{i,m}-a_{m})^{2}$ and the policy entropy $\mathcal{H}(\pi_{i})=-\int\pi_{i}(a|s)\log(\pi_{i}(a|s))\,da$. Precise parameter updates and implementation details are discussed in Appendix B. SDSRA’s decision-making selects and executes actions based on skill selection and continuous skill refinement, enabling adaptive and effective behavior in diverse environments. In the integrated framework, the SAC objective function is modified to incorporate the dynamic skill selection process: the new objective maximizes not just the expected return, but also the entropy across the diverse set of skills. The modified objective function is expressed as:

J_{\text{integrated}}(\pi)=\sum_{i=1}^{N}P(i|s)\left(\mathbb{E}_{(s_{t},a_{t})\sim\pi_{i}}\left[Q(s_{t},a_{t})+\alpha\mathcal{H}(\pi_{i}(\cdot|s_{t}))\right]\right) (1)

where $Q(s_{t},a_{t})$ is the action-value function estimated by SAC’s critic networks, and $\alpha$ scales the importance of the entropy term $\mathcal{H}(\pi_{i}(\cdot|s_{t}))$ for each skill $\pi_{i}$. Under this formulation we find that SDSRA converges to an improved policy; see Appendix A.1 and Appendix A.2. Moreover, experiments on benchmarks commonly used to evaluate SAC demonstrate significant improvements of SDSRA over SAC.
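To make the skill-selection scheme and the integrated objective of Eq. (1) concrete, the following is a minimal NumPy sketch. It is an illustration under stated assumptions rather than the released implementation: the linear-Gaussian skill parameterization, the quadratic stand-in for the critic's Q, and the Monte-Carlo sample sizes are all placeholders.

import numpy as np

rng = np.random.default_rng(0)

class Skill:
    """One Gaussian policy skill pi_i(a|s) = N(mu_i(s), Sigma_i(s)).
    Here the mean is a linear function of s and Sigma is diagonal (an assumption)."""
    def __init__(self, action_dim, state_dim):
        self.W = rng.normal(scale=0.1, size=(action_dim, state_dim))  # mean = W s
        self.log_std = np.zeros(action_dim)                           # diagonal Sigma

    def sample(self, s):
        mu = self.W @ s
        return mu + np.exp(self.log_std) * rng.normal(size=mu.shape)

    def entropy(self):
        # Differential entropy of a diagonal Gaussian.
        return 0.5 * np.sum(np.log(2 * np.pi * np.e * np.exp(2 * self.log_std)))

def skill_probabilities(relevance):
    """P(i|s): softmax over relevance scores."""
    z = relevance - relevance.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def integrated_objective(skills, relevance, q_fn, s, alpha=0.2, n_samples=32):
    """Monte-Carlo estimate of Eq. (1): sum_i P(i|s) E_{a~pi_i}[Q(s,a) + alpha H(pi_i)]."""
    probs = skill_probabilities(relevance)
    total = 0.0
    for p, skill in zip(probs, skills):
        q_est = np.mean([q_fn(s, skill.sample(s)) for _ in range(n_samples)])
        total += p * (q_est + alpha * skill.entropy())
    return total

# Toy usage with a hypothetical quadratic critic Q(s, a) = -||a||^2.
state_dim, action_dim, n_skills = 4, 2, 5
skills = [Skill(action_dim, state_dim) for _ in range(n_skills)]
relevance = np.full(n_skills, 1.0)           # r_i = c initially
q_fn = lambda s, a: -np.sum(a ** 2)
s = rng.normal(size=state_dim)
print(integrated_objective(skills, relevance, q_fn, s))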

3 Experiments

Figure 1: Performance comparison of SDSRA and SAC in the MuJoCo Ant-v2 (a), Half-Cheetah-v2 (b), and Hopper-v2 (c) environments.

We assessed the Skill-Driven Skill Recombination Algorithm (SDSRA) on MuJoCo gym locomotion tasks (Brockman et al., 2016), comparing it to the Soft Actor-Critic (SAC) algorithm. Our tests in challenging environments such as Ant-v2, HalfCheetah-v2, and Hopper-v2 showed that SDSRA outperformed SAC, achieving reward convergence in fewer steps. This highlights SDSRA’s efficiency and potential for complex reinforcement learning tasks.

4 Conclusion

In this paper, we introduced the Skill-Driven Skill Recombination Algorithm (SDSRA), which outperforms the traditional Soft Actor-Critic in reinforcement learning, particularly in MuJoCo environments. Its skill-based approach leads to faster convergence and higher rewards, showing strong potential for complex tasks requiring quick adaptability and learning efficiency.

References

  • Aubret et al. (2019) A. Aubret, L. Matignon, and S. Hassas. A survey on intrinsic motivation in reinforcement learning. arXiv preprint arXiv:1908.06976, 2019. URL https://arxiv.org/abs/1908.06976.
  • Bagaria & Konidaris (2020) Akhil Bagaria and George Konidaris. Option discovery using deep skill chaining. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/pdf?id=B1gqipNYwH.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, J. Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016. URL https://arxiv.org/abs/1606.01540.
  • Chane-Sane et al. (2021) Elliot Chane-Sane, Cordelia Schmid, and Ivan Laptev. Goal-conditioned reinforcement learning with imagined subgoals. In Proceedings of the 38th International Conference on Machine Learning (PMLR 139), 2021. URL http://proceedings.mlr.press/v139/chane-sane21a/chane-sane21a.pdf.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290, 2018. URL https://arxiv.org/abs/1801.01290.
  • Laskin et al. (2022) M. Laskin, Hao Liu, X. B. Peng, Denis Yarats, A. Rajeswaran, and P. Abbeel. Cic: Contrastive intrinsic control for unsupervised skill discovery. arXiv preprint arXiv:2202.00161, 2022. URL https://arxiv.org/abs/2202.00161.
  • Ma et al. (2022) Zixian Ma, Rose Wang, Li Fei-Fei, Michael Bernstein, and Ranjay Krishna. Elign: Expectation alignment as a multi-agent intrinsic reward. In Advances in Neural Information Processing Systems (NeurIPS), 2022. URL https://arxiv.org/pdf/2210.04365.pdf.
  • Sharma et al. (2019) Archit Sharma, Shixiang Shane Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019. URL https://api.semanticscholar.org/CorpusID:195791369.
  • Singh et al. (2004) Satinder Singh, Andrew G. Barto, and Nuttapong Chentanez. Intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 2004. URL https://www.cs.cornell.edu/~helou/IMRL.pdf.
  • Tang et al. (2021) Hengliang Tang, Anqi Wang, Fei Xue, Jiaxin Yang, and Yang Cao. A novel hierarchical soft actor-critic algorithm for multi-logistics robots task allocation. IEEE Access, 9:42568–42582, 2021. doi: 10.1109/ACCESS.2021.3062457.
  • Zheng et al. (2018) Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient methods. In Advances in Neural Information Processing Systems (NeurIPS), 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/file/51de85ddd068f0bc787691d356176df9-Paper.pdf.

URM Statement

The authors acknowledge that at least one key author of this work meets the URM criteria of ICLR 2024 Tiny Papers Track.

Appendix A Proofs of Main Results

A.1 Policy Improvement Guarantee

Lemma A.1 (Policy Improvement Guarantee).

Given a policy $\pi$, if the soft Q-values $Q^{\pi}$ are updated according to the soft Bellman backup operator, then the policy $\pi^{\prime}$, which acts greedily with respect to $Q^{\pi}$, achieves an equal or greater expected return than $\pi$.

Proof.

Let $\pi$ be any policy and $\pi^{\prime}$ be the policy that is greedy with respect to the soft Q-values $Q^{\pi}$. By definition of the greedy policy, for all states $s\in\mathcal{S}$, we have:

\pi^{\prime}(a|s)=\arg\max_{a^{\prime}}\left(Q^{\pi}(s,a^{\prime})+\alpha\mathcal{H}(\pi(\cdot|s))\right). (2)

Now consider the soft value function $V^{\pi}$, which is given by:

V^{\pi}(s)=\mathbb{E}_{a\sim\pi}\left[Q^{\pi}(s,a)-\alpha\log\pi(a|s)\right]. (3)

Using the soft Bellman optimality equation for $Q^{\pi^{\prime}}$, we get:

Q^{\pi^{\prime}}(s,a)=\mathbb{E}_{s^{\prime}\sim P}\left[r(s,a)+\gamma V^{\pi^{\prime}}(s^{\prime})\right]. (4)

Substituting the expression for $V^{\pi^{\prime}}$ into the above, we have:

Q^{\pi^{\prime}}(s,a)=\mathbb{E}_{s^{\prime}\sim P,\,a^{\prime}\sim\pi^{\prime}}\left[r(s,a)+\gamma\left(Q^{\pi^{\prime}}(s^{\prime},a^{\prime})-\alpha\log\pi^{\prime}(a^{\prime}|s^{\prime})\right)\right]. (5)

Since $\pi^{\prime}$ is greedy with respect to $Q^{\pi}$, it follows that $Q^{\pi^{\prime}}(s,a)\geq Q^{\pi}(s,a)$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$.

Thus, we have shown that acting greedily with respect to the soft Q-values under policy $\pi$ results in a policy $\pi^{\prime}$ that has greater or equal Q-value for all state-action pairs, which completes the proof. ∎
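To make the improvement step concrete, the following minimal NumPy sketch (an illustration, not from the paper) checks the guarantee on a randomly generated finite MDP: it evaluates the soft Q-values of an arbitrary policy, forms the soft-greedy (Boltzmann) policy that maximizes $\mathbb{E}_{a}[Q^{\pi}(s,a)]+\alpha\mathcal{H}(\pi(\cdot|s))$, and verifies that the soft value does not decrease in any state. The MDP, the iteration counts, and the Boltzmann form of the improved policy are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 5, 3, 0.9, 0.5

# Random toy MDP: P[s, a, s'] transition probabilities, R[s, a] rewards.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))
R = rng.normal(size=(nS, nA))

def soft_policy_evaluation(pi, iters=2000):
    """Iterate the soft Bellman backup T^pi to an (approximate) fixed point Q^pi."""
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)   # soft value, cf. Eq. (3)
        Q = R + gamma * P @ V                                # soft backup, cf. Eq. (6)
    return Q

def soft_greedy(Q):
    """Policy maximizing E_a[Q(s,a)] + alpha*H(pi(.|s)): a Boltzmann policy in Q/alpha."""
    logits = Q / alpha
    logits -= logits.max(axis=1, keepdims=True)
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)

pi = rng.dirichlet(np.ones(nA), size=nS)      # arbitrary initial stochastic policy
Q_pi = soft_policy_evaluation(pi)
pi_new = soft_greedy(Q_pi)
Q_new = soft_policy_evaluation(pi_new)

V_old = np.sum(pi * (Q_pi - alpha * np.log(pi)), axis=1)
V_new = np.sum(pi_new * (Q_new - alpha * np.log(pi_new)), axis=1)
print("soft value improved in every state:", np.all(V_new >= V_old - 1e-8))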

A.2 Convergence to Optimal Policy

Theorem A.2 (Convergence to Optimal Policy).

Repeated application of soft policy evaluation and soft policy improvement from any initial policy $\pi\in\Pi$ converges to a policy $\pi^{*}$ such that $Q^{\pi^{*}}(s_{t},a_{t})\geq Q^{\pi}(s_{t},a_{t})$ for all $\pi\in\Pi$ and $(s_{t},a_{t})\in\mathcal{S}\times\mathcal{A}$, assuming $|\mathcal{A}|<\infty$.

Proof.

The soft Bellman backup operator for policy evaluation under policy $\pi$ is given by:

T^{\pi}Q(s,a)=\mathbb{E}_{s^{\prime}\sim P,\,a^{\prime}\sim\pi}\left[r(s,a)+\gamma\left(Q(s^{\prime},a^{\prime})-\alpha\log\pi(a^{\prime}|s^{\prime})\right)\right]. (6)

This operator is a contraction mapping in the supremum norm, which ensures that repeated application of $T^{\pi}$ to any initial Q-function $Q_{0}$ converges to a unique fixed point $Q^{\pi}$ that satisfies the soft Bellman equation for policy $\pi$.

Now, let us define the soft Bellman optimality operator $T^{*}$ as:

T^{*}Q(s,a)=\max_{\pi}T^{\pi}Q(s,a). (7)

The soft policy improvement step involves updating the policy $\pi$ to a new policy $\pi^{\prime}$ by choosing actions that maximize the current soft Q-values plus the entropy term:

\pi^{\prime}=\arg\max_{\pi}\mathbb{E}_{a\sim\pi}\left[Q^{\pi}(s,a)-\alpha\log\pi(a|s)\right]. (8)

By the policy improvement theorem, this new policy $\pi^{\prime}$ achieves a Q-value that is greater than or equal to that of $\pi$, i.e., $Q^{\pi^{\prime}}(s,a)\geq Q^{\pi}(s,a)$ for all $(s,a)$.

Since the action space $\mathcal{A}$ is finite, there are a finite number of deterministic policies in $\Pi$. Thus, the sequence of policies $\{\pi_{k}\}$ obtained by alternating soft policy evaluation and soft policy improvement must eventually converge to a policy $\pi^{*}$ that cannot be improved further, which means it is the optimal policy with respect to the soft Bellman optimality equation. Therefore, we have:

Q^{\pi^{*}}(s,a)=T^{*}Q^{\pi^{*}}(s,a), (9)

for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. Hence, $Q^{\pi^{*}}(s,a)\geq Q^{\pi}(s,a)$ for all $\pi\in\Pi$, which concludes the proof that the sequence of policies converges to an optimal policy $\pi^{*}$. ∎
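As a numerical sanity check on the contraction argument, the following minimal NumPy sketch (an illustration, not part of the paper) builds a small random MDP, fixes a stochastic policy $\pi$, and verifies that one application of the soft Bellman backup $T^{\pi}$ of Eq. (6) shrinks the supremum-norm distance between two arbitrary Q-functions by at least a factor of $\gamma$. The MDP sizes and hyperparameters are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, alpha = 5, 3, 0.9, 0.5
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a, s']
R = rng.normal(size=(nS, nA))
pi = rng.dirichlet(np.ones(nA), size=nS)        # fixed stochastic policy pi(a|s)

def soft_backup(Q):
    """One application of T^pi (Eq. 6) for the fixed policy pi."""
    V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)
    return R + gamma * P @ V

Q1, Q2 = rng.normal(size=(nS, nA)), rng.normal(size=(nS, nA))
before = np.max(np.abs(Q1 - Q2))
after = np.max(np.abs(soft_backup(Q1) - soft_backup(Q2)))
print(f"||TQ1 - TQ2||_inf = {after:.4f} <= gamma * ||Q1 - Q2||_inf = {gamma * before:.4f}")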

A.3 Entropy Maximization Efficiency of SDSRA

Theorem A.3 (Entropy Maximization Efficiency of SDSRA).

Let $\pi^{\text{SAC}}$ and $\pi^{\text{SDSRA}}$ be the policies obtained from the SAC and SDSRA algorithms, respectively, when trained under identical conditions. Assume that both algorithms achieve convergence. Then, for any state $s\in\mathcal{S}$, the expected entropy of $\pi^{\text{SDSRA}}$ is greater than or equal to that of $\pi^{\text{SAC}}$:

\mathbb{E}_{a\sim\pi^{\text{SDSRA}}}\left[-\log\pi^{\text{SDSRA}}(a|s)\right]\geq\mathbb{E}_{a\sim\pi^{\text{SAC}}}\left[-\log\pi^{\text{SAC}}(a|s)\right], (10)

or the time to reach an $\epsilon$-optimal policy entropy for SDSRA is less than that for SAC:

t_{\text{SDSRA}}(\epsilon)\leq t_{\text{SAC}}(\epsilon), (11)

where $t_{\text{SDSRA}}(\epsilon)$ and $t_{\text{SAC}}(\epsilon)$ denote the time to reach a policy entropy within $\epsilon$ of the maximum entropy for SDSRA and SAC, respectively.

Proof.

Assume that both $\pi^{\text{SAC}}$ and $\pi^{\text{SDSRA}}$ have converged to their respective policy distributions for all states $s\in\mathcal{S}$. By the definition of convergence, the policies are stationary and hence the expected entropy under each policy is constant over time.

Consider the skill-based decision-making process inherent in SDSRA. At each decision step, SDSRA selects a skill from a diversified set, which is represented as a policy over actions. This process is formalized by a softmax function over the skills’ relevance scores, which in turn are updated based on the performance and diversity of actions taken. As a consequence, the SDSRA policy is encouraged to explore a wider range of actions, leading to an increase in the expected entropy of the policy.

Formally, let $S$ be the set of all skills in SDSRA, and let $r_{i}$ be the relevance score of skill $i$. Then the probability of selecting an action $a$ given state $s$ under policy $\pi^{\text{SDSRA}}$ is given by a mixture of policies corresponding to each skill:

\pi^{\text{SDSRA}}(a|s)=\sum_{i=1}^{N}P(i|s)\,\pi_{\text{skill}_{i}}(a|s), (12)

where $P(i|s)$ is the softmax probability of selecting skill $i$.

The entropy of a mixture of policies is generally higher than the entropy of any individual policy in the mixture. Therefore, the expected entropy of $\pi^{\text{SDSRA}}$ is greater than the expected entropy of any individual skill policy, and by extension, greater than or equal to the entropy of $\pi^{\text{SAC}}$, which does not utilize a mixture of policies:

\mathbb{E}_{a\sim\pi^{\text{SDSRA}}}\left[-\log\pi^{\text{SDSRA}}(a|s)\right]\geq\mathbb{E}_{a\sim\pi^{\text{SAC}}}\left[-\log\pi^{\text{SAC}}(a|s)\right]. (13)
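To illustrate the mixture-entropy argument numerically, the following minimal NumPy sketch (not part of the paper) compares the differential entropy of a two-component 1-D Gaussian mixture, standing in for Eq. (12), against the entropies of its components. The means, variances, and weights are arbitrary; the inequality shown here holds in this instance because the components are well separated, and in general a mixture's entropy is at least the weighted average of its components' entropies.

import numpy as np

# Two 1-D Gaussian "skill" policies and an equal-weight mixture (cf. Eq. 12).
mus, sigmas, weights = np.array([-1.0, 1.5]), np.array([0.5, 0.8]), np.array([0.5, 0.5])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-10, 10, 20001)
dx = x[1] - x[0]

component_entropies = 0.5 * np.log(2 * np.pi * np.e * sigmas ** 2)    # closed form
p_mix = sum(w * gauss(x, m, s) for w, m, s in zip(weights, mus, sigmas))
mixture_entropy = -np.sum(p_mix * np.log(p_mix + 1e-300)) * dx         # numerical integral

print("component entropies:", component_entropies)
print("mixture entropy    :", mixture_entropy)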

Furthermore, due to the dynamic and adaptive nature of skill selection in SDSRA, the algorithm rapidly explores high-entropy policies, thus reaching a policy with entropy within $\epsilon$ of the maximum entropy faster than SAC, which optimizes a single policy without such a mechanism. This leads to:

t_{\text{SDSRA}}(\epsilon)\leq t_{\text{SAC}}(\epsilon), (14)

completing the proof. ∎

Appendix B SDSRA Algorithm

Algorithm 1 Soft Actor-Critic with Skill-Driven Skill Recombination Algorithm (SDSRA)
1:Initialize action-value functions $Q_{\theta_{1}},Q_{\theta_{2}}$ with parameters $\theta_{1},\theta_{2}$
2:Initialize the policy $\pi_{\phi}$ with parameters $\phi$
3:Initialize target value parameters $\theta^{\prime}_{1}\leftarrow\theta_{1},\theta^{\prime}_{2}\leftarrow\theta_{2}$
4:Initialize skill set $S=\{\pi_{\theta_{\text{skill}_{i}}}\}_{i=1}^{N}$ with parameters $\theta_{\text{skill}_{i}}$
5:Initialize relevance scores $r_{i}\leftarrow c,\forall i\in\{1,\ldots,N\}$
6:Initialize replay buffer $D$
7:for each iteration do
8:     for each environment step do
9:         Sample skill index $i$ using probabilities $P(i|s)=\frac{e^{r_{i}}}{\sum_{j=1}^{N}e^{r_{j}}}$
10:         Select action $a_{t}\sim\pi_{\theta_{\text{skill}_{i}}}(s_{t})$
11:         Execute $a_{t}$ and observe reward $r_{t}$ and new state $s_{t+1}$
12:         Store transition tuple $(s_{t},a_{t},r_{t},s_{t+1},i)$ in buffer $D$
13:     end for
14:     for each gradient step do
15:         Randomly sample a batch of transitions from $D$
16:         Compute target values using the Bellman equation
17:         Update $Q_{\theta_{1}},Q_{\theta_{2}}$ by minimizing the loss:
L(\theta_{i})=\mathbb{E}_{(s,a,r,s^{\prime})\sim D}\left[\left(Q_{\theta_{i}}(s,a)-\left(r+\gamma\left(\min_{j=1,2}Q_{\theta^{\prime}_{j}}(s^{\prime},a^{\prime})-\alpha\log\pi_{\phi}(a^{\prime}|s^{\prime})\right)\right)\right)^{2}\right],\quad a^{\prime}\sim\pi_{\phi}(\cdot|s^{\prime})
18:         Update policy $\pi_{\phi}$ using the policy gradient:
\nabla_{\phi}J(\pi_{\phi})=\mathbb{E}_{s\sim D,\,a\sim\pi_{\phi}}\left[\nabla_{\phi}\log\pi_{\phi}(a|s)\,Q_{\theta_{1}}(s,a)\right]
19:         Update target networks: $\theta^{\prime}_{i}\leftarrow\tau\theta_{i}+(1-\tau)\theta^{\prime}_{i}$
20:     end for
21:     for each skill update interval do
22:         Evaluate and update the performance of each skill $\pi_{\theta_{\text{skill}_{i}}}$
23:         Update relevance scores $r_{i}$ based on the performance
24:     end for
25:end for
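As a concrete companion to the skill-specific parts of Algorithm 1 (lines 5, 9, and 21-24), here is a minimal Python sketch of the relevance-score bookkeeping. It is an illustration only: the softmax sampling mirrors line 9, while the exponential-moving-average update from per-skill returns is an assumption, since the precise relevance update rule is left to the implementation.

import numpy as np

rng = np.random.default_rng(1)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

class SkillManager:
    """Relevance-score bookkeeping for Algorithm 1 (lines 5, 9, 21-24).
    The EMA-of-returns update below is an assumption made for illustration."""
    def __init__(self, n_skills, init_score=1.0, ema=0.1):
        self.scores = np.full(n_skills, init_score)   # r_i <- c (line 5)
        self.ema = ema

    def sample_skill(self):
        # P(i|s) = softmax over relevance scores (line 9)
        return rng.choice(len(self.scores), p=softmax(self.scores))

    def update(self, skill_returns):
        # skill_returns[i]: average return collected while skill i was active (lines 22-23)
        for i, g in skill_returns.items():
            self.scores[i] = (1 - self.ema) * self.scores[i] + self.ema * g

# Usage with hypothetical per-skill returns.
mgr = SkillManager(n_skills=4)
mgr.update({0: 5.0, 2: -1.0})
print(mgr.sample_skill(), softmax(mgr.scores))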