
Sample Algo

Baskin Burak Senbaslar
(October 2020)

1 Introduction

Algorithm 1 Phase 1: Meta-learning the dynamics model and policy

Require: distribution over tasks $p(\mathcal{T})$; learning rates $\alpha$ and $\beta$
1: Randomly initialize the dynamics model $P_{\theta}$ and the policy $\pi_{\Phi}$
2: Initialize the data buffer $D_{f} = \emptyset$
3: while not done do
4:     Sample a batch of tasks $\mathcal{T}^{i}$ from $p(\mathcal{T})$
5:     for all $\mathcal{T}^{i}$ do
6:         Sample trajectories $D^{i}$ using policy $\pi_{\Phi}$, set $D_{f} = D_{f} \cup D^{i}$, and split $D^{i}$ into $D^{i}_{tr}$ and $D^{i}_{ts}$
7:         Update $\Phi = \Phi - \beta\,\nabla_{\Phi} L(\Phi, D^{i}_{tr})$
8:         Update $\theta = \theta - \alpha\,\nabla_{\theta} L(\theta, D^{i}_{tr})$
9:     end for
10:     $\Phi \leftarrow \Phi - \beta\,\nabla_{\Phi} \frac{1}{|\mathcal{T}|} \sum_{\mathcal{T}^{i}} L(\Phi, D^{i}_{ts})$
11:     $\theta \leftarrow \theta - \alpha\,\nabla_{\theta} \frac{1}{|\mathcal{T}|} \sum_{\mathcal{T}^{i}} L(\theta, D^{i}_{ts})$
12: end while
Output: meta-learned dynamics model $P_{\theta}$, meta-policy parameters $\Phi$, and data buffer $D_{f}$
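
To make the update structure of Algorithm 1 concrete, the following is a minimal Python/NumPy sketch rather than the paper's implementation: sample_task, collect_trajectories, and the quadratic loss_and_grad are hypothetical stand-ins for the real task distribution, trajectory collection, and losses, and the inner updates are applied directly to the shared parameters in the first-order style the pseudocode is written in.

import numpy as np

rng = np.random.default_rng(0)
DIM = 4                    # shared parameter dimension for theta and Phi (assumption)
ALPHA, BETA = 0.01, 0.01   # learning rates alpha (model) and beta (policy)

def sample_task():
    # Hypothetical task: a target vector that the model and policy should match.
    return rng.normal(size=DIM)

def collect_trajectories(phi, task, n=32):
    # Stand-in for rolling out pi_Phi on the task; the dependence on phi is elided here.
    return task + 0.1 * rng.normal(size=(n, DIM))

def loss_and_grad(params, data):
    # Quadratic placeholder loss L(params, D) = mean_i ||params - d_i||^2 and its gradient.
    diff = params - data
    return np.mean(np.sum(diff ** 2, axis=1)), 2.0 * np.mean(diff, axis=0)

theta = rng.normal(size=DIM)   # dynamics-model parameters
phi = rng.normal(size=DIM)     # policy parameters Phi
D_f = []                       # data buffer D_f

for _ in range(100):                               # "while not done"
    tasks = [sample_task() for _ in range(5)]      # sample a batch of tasks T^i
    test_sets = []
    for task in tasks:
        D_i = collect_trajectories(phi, task)      # sample trajectories with pi_Phi
        D_f.append(D_i)                            # D_f = D_f U D^i
        D_tr, D_ts = D_i[:16], D_i[16:]            # split D^i into D^i_tr and D^i_ts
        test_sets.append(D_ts)
        _, g_phi = loss_and_grad(phi, D_tr)        # inner update of Phi on the training split
        phi = phi - BETA * g_phi
        _, g_theta = loss_and_grad(theta, D_tr)    # inner update of theta on the training split
        theta = theta - ALPHA * g_theta
    # Outer meta-updates: average the test-split gradients over the task batch
    g_phi_meta = np.mean([loss_and_grad(phi, D)[1] for D in test_sets], axis=0)
    g_theta_meta = np.mean([loss_and_grad(theta, D)[1] for D in test_sets], axis=0)
    phi = phi - BETA * g_phi_meta
    theta = theta - ALPHA * g_theta_meta

print("meta-learned theta:", theta)
print("meta-learned Phi:  ", phi)

The train/test split per task mirrors the usual MAML-style separation: the training split drives per-task adaptation, while the test split supplies the gradients that are averaged for the meta-update.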
Algorithm 2 Phase 2: Learning the meta-policy from simulated data

Require: meta-learned dynamics model $P_{\theta}$ and learning rate $\alpha$; meta-policy parameters $\Phi$ and learning rate $\beta$; data buffer $D_{f}$ (all produced by Phase 1)
1: while not done do
2:     Randomly choose a batch of task datasets $D^{i}$ from $D_{f}$
3:     for all $D^{i}$ do
4:         while not done do
5:             Use $D^{i}$ to adapt the model to task $\mathcal{T}^{i}$: $\theta^{i} \leftarrow \theta - \alpha\,\nabla_{\theta} L(\theta, D^{i})$
6:         end while
7:         Sample a batch of simulated data $\bar{D}^{i}$ using the dynamics model $P_{\theta^{i}}$ and the policy $\pi_{\Phi}$
8:         Update $\Phi = \Phi - \beta\,\nabla_{\Phi} L(\Phi, \bar{D}^{i})$
9:     end for
10:     Update $\Phi \leftarrow \Phi - \beta\,\nabla_{\Phi} \frac{1}{|\mathcal{T}|} \sum_{\mathcal{T}^{i}} L(\Phi, \bar{D}^{i})$
11: end while
Output: meta-policy $\pi_{\Phi}$
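
Under the same stand-in assumptions, the following is a minimal sketch of Algorithm 2: simulate_rollouts is a hypothetical placeholder for drawing simulated data $\bar{D}^{i}$ from the adapted model $P_{\theta^{i}}$ under $\pi_{\Phi}$, the inner adaptation loop is truncated to a fixed number of steps, and $\theta$, $\Phi$, and $D_{f}$ would in practice be the outputs of Phase 1 rather than the random placeholders used here.

import numpy as np

rng = np.random.default_rng(1)
DIM = 4
ALPHA, BETA = 0.01, 0.01

def loss_and_grad(params, data):
    # Same quadratic placeholder loss as in the Phase 1 sketch.
    diff = params - data
    return np.mean(np.sum(diff ** 2, axis=1)), 2.0 * np.mean(diff, axis=0)

def simulate_rollouts(theta_i, phi, n=32):
    # Hypothetical stand-in for sampling D_bar^i from the adapted model P_{theta^i}
    # under policy pi_Phi; the dependence on phi is elided in this toy version.
    return theta_i + 0.1 * rng.normal(size=(n, DIM))

# In practice theta, phi, and D_f come from Phase 1; random placeholders are used here
# so the sketch runs on its own.
theta = rng.normal(size=DIM)
phi = rng.normal(size=DIM)
D_f = [rng.normal(size=(32, DIM)) for _ in range(20)]

for _ in range(100):                                     # "while not done"
    batch = rng.choice(len(D_f), size=5, replace=False)  # randomly choose a batch of D^i from D_f
    sim_sets = []
    for i in batch:
        D_i = D_f[i]
        theta_i = theta.copy()
        for _ in range(3):                               # inner "while not done", truncated to 3 steps
            _, g = loss_and_grad(theta_i, D_i)           # theta^i <- theta^i - alpha * grad L(theta^i, D^i)
            theta_i = theta_i - ALPHA * g
        D_bar = simulate_rollouts(theta_i, phi)          # sample simulated data D_bar^i from P_{theta^i}
        sim_sets.append(D_bar)
        _, g_phi = loss_and_grad(phi, D_bar)             # per-task policy update on simulated data
        phi = phi - BETA * g_phi
    # Meta-update of Phi averaged over the simulated datasets of the task batch
    g_meta = np.mean([loss_and_grad(phi, D)[1] for D in sim_sets], axis=0)
    phi = phi - BETA * g_meta

print("meta-policy parameters Phi:", phi)

Because every policy update in this phase uses data generated by the adapted models rather than the environment, the meta-policy can keep improving without collecting new real trajectories beyond the buffer filled in Phase 1.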