
Sample Algo

Baskin Burak Senbaslar
(October 2020)

1 Introduction

Algorithm 1 Phase 1: Meta-learning the dynamics model and policy

Require: distribution over tasks $p(\mathcal{T})$; learning rates $\alpha$ and $\beta$
1: Randomly initialize the dynamics model $P_{\theta}$ and the policy $\pi_{\Phi}$
2: Initialize the data buffer $D_{f} = \emptyset$
3: while not done do
4:     Sample a batch of tasks $\mathcal{T}^{i}$ from $p(\mathcal{T})$
5:     for all $\mathcal{T}^{i}$ do
6:         Sample trajectories $D^{i}$ using policy $\pi_{\Phi}$, set $D_{f} = D_{f} \cup D^{i}$, and split $D^{i}$ into $D^{i}_{tr}$ and $D^{i}_{ts}$
7:         Update $\Phi = \Phi - \beta\,\nabla_{\Phi} L(\Phi, D^{i}_{tr})$
8:         Update $\theta = \theta - \alpha\,\nabla_{\theta} L(\theta, D^{i}_{tr})$
9:     end for
10:     $\Phi \leftarrow \Phi - \beta\,\nabla_{\Phi} \frac{1}{|\mathcal{T}|} \sum_{\mathcal{T}^{i}} L(\Phi, D^{i}_{ts})$
11:     $\theta \leftarrow \theta - \alpha\,\nabla_{\theta} \frac{1}{|\mathcal{T}|} \sum_{\mathcal{T}^{i}} L(\theta, D^{i}_{ts})$
12: end while
Output: meta-learned dynamics model $P_{\theta}$, meta-policy parameters $\Phi$, and data buffer $D_{f}$
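
To make the update structure of Algorithm 1 concrete, the following is a minimal Python/NumPy sketch rather than the paper's implementation: sample_task, collect_trajectories, and the quadratic loss_and_grad are hypothetical stand-ins for the real task distribution, trajectory collection, and losses, and the inner updates are applied directly to the shared parameters in the first-order style the pseudocode is written in.

import numpy as np

rng = np.random.default_rng(0)
DIM = 4                    # shared parameter dimension for theta and Phi (assumption)
ALPHA, BETA = 0.01, 0.01   # learning rates alpha (model) and beta (policy)

def sample_task():
    # Hypothetical task: a target vector that the model and policy should match.
    return rng.normal(size=DIM)

def collect_trajectories(phi, task, n=32):
    # Stand-in for rolling out pi_Phi on the task; the dependence on phi is elided here.
    return task + 0.1 * rng.normal(size=(n, DIM))

def loss_and_grad(params, data):
    # Quadratic placeholder loss L(params, D) = mean_i ||params - d_i||^2 and its gradient.
    diff = params - data
    return np.mean(np.sum(diff ** 2, axis=1)), 2.0 * np.mean(diff, axis=0)

theta = rng.normal(size=DIM)   # dynamics-model parameters
phi = rng.normal(size=DIM)     # policy parameters Phi
D_f = []                       # data buffer D_f

for _ in range(100):                               # "while not done"
    tasks = [sample_task() for _ in range(5)]      # sample a batch of tasks T^i
    test_sets = []
    for task in tasks:
        D_i = collect_trajectories(phi, task)      # sample trajectories with pi_Phi
        D_f.append(D_i)                            # D_f = D_f U D^i
        D_tr, D_ts = D_i[:16], D_i[16:]            # split D^i into D^i_tr and D^i_ts
        test_sets.append(D_ts)
        _, g_phi = loss_and_grad(phi, D_tr)        # inner update of Phi on the training split
        phi = phi - BETA * g_phi
        _, g_theta = loss_and_grad(theta, D_tr)    # inner update of theta on the training split
        theta = theta - ALPHA * g_theta
    # Outer meta-updates: average the test-split gradients over the task batch
    g_phi_meta = np.mean([loss_and_grad(phi, D)[1] for D in test_sets], axis=0)
    g_theta_meta = np.mean([loss_and_grad(theta, D)[1] for D in test_sets], axis=0)
    phi = phi - BETA * g_phi_meta
    theta = theta - ALPHA * g_theta_meta

print("meta-learned theta:", theta)
print("meta-learned Phi:  ", phi)

The train/test split per task mirrors the usual MAML-style separation: the training split drives per-task adaptation, while the test split supplies the gradients that are averaged for the meta-update.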
Algorithm 2 Phase 2: Learning the meta-policy from simulated data

Require: meta-learned dynamics model $P_{\theta}$ and learning rate $\alpha$; meta-policy parameters $\Phi$ and learning rate $\beta$; data buffer $D_{f}$ (all produced by Phase 1)
1: while not done do
2:     Randomly choose a batch of task datasets $D^{i}$ from $D_{f}$
3:     for all $D^{i}$ do
4:         while not done do
5:             Use $D^{i}$ to adapt the model to task $\mathcal{T}^{i}$: $\theta^{i} \leftarrow \theta - \alpha\,\nabla_{\theta} L(\theta, D^{i})$
6:         end while
7:         Sample a batch of simulated data $\bar{D}^{i}$ using the dynamics model $P_{\theta^{i}}$ and the policy $\pi_{\Phi}$
8:         Update $\Phi = \Phi - \beta\,\nabla_{\Phi} L(\Phi, \bar{D}^{i})$
9:     end for
10:     Update $\Phi \leftarrow \Phi - \beta\,\nabla_{\Phi} \frac{1}{|\mathcal{T}|} \sum_{\mathcal{T}^{i}} L(\Phi, \bar{D}^{i})$
11: end while
Output: meta-policy $\pi_{\Phi}$
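
Under the same stand-in assumptions, the following is a minimal sketch of Algorithm 2: simulate_rollouts is a hypothetical placeholder for drawing simulated data $\bar{D}^{i}$ from the adapted model $P_{\theta^{i}}$ under $\pi_{\Phi}$, the inner adaptation loop is truncated to a fixed number of steps, and $\theta$, $\Phi$, and $D_{f}$ would in practice be the outputs of Phase 1 rather than the random placeholders used here.

import numpy as np

rng = np.random.default_rng(1)
DIM = 4
ALPHA, BETA = 0.01, 0.01

def loss_and_grad(params, data):
    # Same quadratic placeholder loss as in the Phase 1 sketch.
    diff = params - data
    return np.mean(np.sum(diff ** 2, axis=1)), 2.0 * np.mean(diff, axis=0)

def simulate_rollouts(theta_i, phi, n=32):
    # Hypothetical stand-in for sampling D_bar^i from the adapted model P_{theta^i}
    # under policy pi_Phi; the dependence on phi is elided in this toy version.
    return theta_i + 0.1 * rng.normal(size=(n, DIM))

# In practice theta, phi, and D_f come from Phase 1; random placeholders are used here
# so the sketch runs on its own.
theta = rng.normal(size=DIM)
phi = rng.normal(size=DIM)
D_f = [rng.normal(size=(32, DIM)) for _ in range(20)]

for _ in range(100):                                     # "while not done"
    batch = rng.choice(len(D_f), size=5, replace=False)  # randomly choose a batch of D^i from D_f
    sim_sets = []
    for i in batch:
        D_i = D_f[i]
        theta_i = theta.copy()
        for _ in range(3):                               # inner "while not done", truncated to 3 steps
            _, g = loss_and_grad(theta_i, D_i)           # theta^i <- theta^i - alpha * grad L(theta^i, D^i)
            theta_i = theta_i - ALPHA * g
        D_bar = simulate_rollouts(theta_i, phi)          # sample simulated data D_bar^i from P_{theta^i}
        sim_sets.append(D_bar)
        _, g_phi = loss_and_grad(phi, D_bar)             # per-task policy update on simulated data
        phi = phi - BETA * g_phi
    # Meta-update of Phi averaged over the simulated datasets of the task batch
    g_meta = np.mean([loss_and_grad(phi, D)[1] for D in sim_sets], axis=0)
    phi = phi - BETA * g_meta

print("meta-policy parameters Phi:", phi)

Because every policy update in this phase uses data generated by the adapted models rather than the environment, the meta-policy can keep improving without collecting new real trajectories beyond the buffer filled in Phase 1.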