
Dreaming: Model-based Reinforcement Learning
by Latent Imagination without Reconstruction

Masashi Okada1,⋆ and Tadahiro Taniguchi1,2
1 Masashi Okada and Tadahiro Taniguchi are with the Digital & AI Technology Center, Technology Division, Panasonic Corporation, Japan.
2 Tadahiro Taniguchi is also with Ritsumeikan University, College of Information Science and Engineering, Japan.
⋆ okada.masashi001@jp.panasonic.com
Abstract

In the present paper, we propose a decoder-free extension of Dreamer, a leading model-based reinforcement learning (MBRL) method from pixels. Dreamer is a sample- and cost-efficient solution to robot learning, as it trains latent state-space models based on a variational autoencoder and conducts policy optimization by latent trajectory imagination. However, this autoencoding-based approach often causes object vanishing, in which the autoencoder fails to perceive key objects for solving control tasks, significantly limiting Dreamer's potential. This work aims to relieve this bottleneck and enhance Dreamer's performance by removing the decoder. For this purpose, we first derive a likelihood-free and InfoMax objective of contrastive learning from the evidence lower bound of Dreamer. Second, we incorporate two components, (i) independent linear dynamics and (ii) the random crop data augmentation, into the learning scheme to improve the training performance. In comparison to Dreamer and other recent model-free reinforcement learning methods, our newly devised Dreamer with InfoMax and without generative decoder (Dreaming) achieves the best scores on 5 difficult simulated robotics tasks, in which Dreamer suffers from object vanishing.

I Introduction

In the present paper, we focus on model-based reinforcement learning (MBRL) from pixels without complex reconstruction. MBRL is a promising technique to build controllers in a sample-efficient manner: it trains forward dynamics models to predict future states and rewards for the purpose of planning and/or policy optimization. Recent studies of MBRL in fully observable environments [1, 2, 3, 4] have achieved both sample efficiency and performance competitive with state-of-the-art model-free reinforcement learning (MFRL) methods like soft actor-critic (SAC) [5]. Although real-robot learning has been achieved with fully observable MBRL [6, 7, 8, 9, 10, 11], there has been an increasing demand for robot learning in partially observable environments in which only incomplete information (especially vision) is available. MBRL from pixels can be realized by introducing deep generative models based on autoencoding variational Bayes [12].

Object vanishing is a critical problem of autoencoding-based MBRL from pixels. Previous studies in this field [13, 14, 15, 16, 17, 18, 19] train autoencoding models along with latent dynamics models that generate imagined trajectories used for planning or policy optimization. However, the autoencoder often fails to perceive small objects in pixel space. The top part of Fig. 1 exemplifies this kind of failure, where small (or thin) and important objects are not reconstructed in their correct positions. This indicates a failure to embed their information into the latent space, which significantly limits the training performance. The problem stems from the log-likelihood objective (reconstruction loss) being defined in pixel space. Since the reconstruction errors of small objects are insignificant compared to the errors of other objects and uninformative textures that occupy most of the image region, it is hard to train the encoder to perceive the small objects from such weak error signals. In addition, we have to train a decoder that requires a high model capacity with massive convolutional neural network (CNN) parameters, even though the trained decoder is exploited in neither planning nor policy optimization.

Figure 1: Overview of the motivation and concept of this work. (Top) The concept of Dreamer's autoencoding-based representation learning [16], which often causes object vanishing, as shown with the dashed-line ovals. The left three tasks are from the DeepMind Control Suite [20], and the remaining two tasks are our original tasks intended to represent industrial applications. (Bottom) The concept of Dreaming's representation learning, which trains a discriminator instead of the decoder so that different samples are embedded apart from each other. The learning scheme is characterized by two key components: (i) linear dynamics constrains temporally consecutive samples so that they are not embedded too far apart; (ii) image augmentation encourages only the features key for control to be embedded into the latent space.

To avoid this complex reconstruction, some previous MBRL studies have proposed decoder-free representation learning [16, 21] based on contrastive learning [22, 23], which trains a discriminator instead of the decoder. The discriminator is trained by categorical cross-entropy optimization, which encourages latent embeddings to be sufficiently distinguishable from each other. Nevertheless, to the best of our knowledge, no MBRL method has achieved state-of-the-art results on the difficult benchmark tasks of the DeepMind Control Suite (DMC) [20] without reconstruction.

Motivated by these observations, this paper aims to achieve state-of-the-art results with MBRL from pixels without reconstruction. We mainly focus on the latest autoencoding-based MBRL method, Dreamer, considering its success on a variety of control tasks (i.e., DMC and Atari games [24]), and extend it in a decoder-free fashion. We adopt Dreamer's policy optimization without any modification. We call our extended Dreamer "Dreamer with InfoMax and without generative decoder" (Dreaming). The concept of the proposed method is illustrated in the bottom part of Fig. 1. Our primary contributions are summarized as follows.

  • We derive a likelihood-free (decoder-free) and InfoMax objective for contrastive learning by reformulating the variational evidence lower bound (ELBO) of the graphical model of the partially observable Markov decision process.

  • We show that two key components, (i) an independent, linear forward dynamics model, which is used only for contrastive learning, and (ii) appropriate data augmentation (i.e., random crop), are indispensable to achieve state-of-the-art results.

In comparison to Dreamer and recent cutting-edge MFRL methods, Dreaming achieves state-of-the-art results on the difficult simulated robotics tasks exhibited in Fig. 1, in which Dreamer suffers from object vanishing. The remainder of this paper is organized as follows. In Sec. II, key differences from related work are discussed. In Sec. III, we provide a brief review of Dreamer and contrastive learning. In Sec. IV, we first describe the proposed contrastive learning scheme in detail and then introduce Dreaming. In Sec. V, the effectiveness of Dreaming is demonstrated through simulated evaluations. Finally, Sec. VI concludes this paper.

II Related Work

The most closely related works are contrastive predictive coding (CPC) [22] and the contrastive forward model (CFM) [21]. Our work is highly inspired by CPC, and our contrastive learning scheme shares similar components with CPC, e.g., a recurrent neural network and a bilinear similarity function. However, CPC has no action-conditioned dynamics model. Since CPC alone cannot generate imagined trajectories from arbitrary actions, it is only used as an auxiliary objective of MFRL. CFM heuristically introduces a decoder-free objective function similar to ours. A primary difference between ours and CFM is that CFM exploits a shared, non-linear forward model for both contrastive and behavior learning. In addition, the relation to the ELBO of time-series variational inference is not discussed in either of these works. Meanwhile, the original Dreamer paper [16] also derived a contrastive objective from the ELBO; however, dynamics models and the temporal correlation of observations are not involved in that objective. Furthermore, CPC, CFM, and Dreamer do not introduce data augmentation.

MFRL methods with representation learning: A state-of-the-art MFRL method, contrastive unsupervised representation for reinforcement learning (CURL) [25], also makes use of contrastive learning with the random crop data augmentation. Deep bisimulation for control (DBC) [26] and discriminative particle filter reinforcement learning (DPFRL) [27] are other types of cutting edge MFRL methods, which utilize different concepts of representation learning without reconstruction.

MFRL methods without representation learning: Recently proposed MFRL methods, including reinforcement learning with augmented data (RAD) [28], data-regularized Q (DrQ) [29], and the simple unified framework for reinforcement learning using ensembles (SUNRISE) [30], have achieved state-of-the-art results without representation learning. All of these works employ the random crop data augmentation as an important component of their methods.

Figure 2: Graphical model of the partially observable Markov decision process.

III Preliminary

III-A Autoencoding Variational Bayes for Time-series

Let us begin by considering the graphical model illustrated in Fig. 2, whose joint distribution is defined as follows:

p(\boldsymbol{z}_{\leq T}, \boldsymbol{a}_{<T}, \boldsymbol{x}_{\leq T}) = \prod_{t} p(\boldsymbol{z}_{t+1} | \boldsymbol{z}_{t}, \boldsymbol{a}_{t})\, p(\boldsymbol{x}_{t} | \boldsymbol{z}_{t}),   (1)

where $\boldsymbol{z}$, $\boldsymbol{x}$, and $\boldsymbol{a}$ denote the latent state, observation, and action, respectively. As in the case of well-known variational autoencoders (VAEs) [12], the generative models $p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_{t},\boldsymbol{a}_{t})$, $p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t})$ and the inference model $q(\boldsymbol{z}_{t}|\boldsymbol{x}_{\leq t},\boldsymbol{a}_{<t})$ can be trained by maximizing the evidence lower bound [15]:

\log p(\boldsymbol{x}_{\leq T}|\boldsymbol{a}_{\leq T}) = \log \int p(\boldsymbol{z}_{\leq T}, \boldsymbol{a}_{<T}, \boldsymbol{x}_{\leq T})\, d\boldsymbol{z}_{\leq T} \geq \sum_{t} \Big( \underbrace{\mathbb{E}_{q(\boldsymbol{z}_{t}|\boldsymbol{x}_{\leq t},\boldsymbol{a}_{<t})}\left[\log p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t})\right]}_{\coloneqq \mathcal{J}^{\mathrm{likelihood}}} \underbrace{-\, \mathbb{E}_{q(\boldsymbol{z}_{t}|\cdot)}\left[\operatorname{KL}\left[q(\boldsymbol{z}_{t+1}|\boldsymbol{x}_{\leq t+1},\boldsymbol{a}_{<t+1})\,\|\,p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_{t},\boldsymbol{a}_{t})\right]\right]}_{\coloneqq \mathcal{J}^{\mathrm{KL}}} \Big).   (2)

If the models are defined to be differentiable and trainable, this objective can be maximized by the stochastic gradient ascent via backpropagation.
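For a concrete sense of the KL term in Eq. (2): when both $q$ and $p$ are diagonal Gaussians, as is typical for latent state-space models, the term has a closed form. Below is a minimal NumPy sketch of that closed form; the latent size and the toy distribution parameters are illustrative and not values from the paper.

```python
import numpy as np

def diag_gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL[ N(mu_q, diag(std_q^2)) || N(mu_p, diag(std_p^2)) ], summed over latent dimensions."""
    var_q, var_p = std_q ** 2, std_p ** 2
    kl = 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

# Toy usage: posterior q(z_{t+1} | x_{<=t+1}, a_{<=t}) vs. prior p(z_{t+1} | z_t, a_t).
mu_q, std_q = np.zeros(30), np.ones(30)
mu_p, std_p = 0.1 * np.ones(30), 1.2 * np.ones(30)
print(diag_gaussian_kl(mu_q, std_q, mu_p, std_p))  # one step's contribution to J^KL (up to sign)
```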

Multi-step variational inference was proposed in [15] to improve long-term predictions. This inference, named latent overshooting, involves the multi-step objective $\mathcal{J}^{\mathrm{KL}}_{k}$ defined as:

\mathcal{J}^{\mathrm{KL}} \geq \mathcal{J}^{\mathrm{KL}}_{k} \coloneqq \mathbb{E}_{p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t})\, q(\boldsymbol{z}_{t-k}|\boldsymbol{x}_{\leq t-k},\boldsymbol{a}_{<t-k})}\left[\operatorname{KL}\left[q(\boldsymbol{z}_{t+1}|\boldsymbol{x}_{\leq t+1},\boldsymbol{a}_{<t+1})\,\|\,p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_{t},\boldsymbol{a}_{t})\right]\right],   (3)

where

p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t}) \coloneqq \mathbb{E}_{p(\boldsymbol{z}_{t-1}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t-1})}\left[p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})\right]

is the multi-step prediction model.
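In practice, the nested expectation above is commonly approximated by simply rolling the one-step model forward for $k$ steps with sampled latents. A minimal sketch, assuming a toy Gaussian one-step model that stands in for the learned transition:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_step(z, a):
    """Placeholder Gaussian one-step model p(z_t | z_{t-1}, a_{t-1}); a learned network in practice."""
    mean = 0.9 * z + 0.1 * a                      # toy linear transition
    return mean + 0.05 * rng.standard_normal(z.shape)

def sample_multi_step(z, actions):
    """Draw a sample from p(z_t | z_{t-k}, a_{<t}) by iterating the one-step model for k steps."""
    for a in actions:
        z = one_step(z, a)
    return z

z_tmk = np.zeros(30)                              # latent at time t - k
actions = [rng.normal(size=30) for _ in range(3)] # k = 3 placeholder actions
print(sample_multi_step(z_tmk, actions)[:5])
```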

For the purpose of planning or policy optimization, not only the dynamics model $p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_{t},\boldsymbol{a}_{t})$ but also the reward function $p(r_{t}|\boldsymbol{z}_{t})$ is required. To obtain it, we can simply regard the rewards as observations and learn the reward function $p(r_{t}|\boldsymbol{z}_{t})$ along with the decoder $p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t})$. For readability, we omit the reward function from the following discussion. Although we remove the decoder later, the reward function and its likelihood objective are kept untouched.

III-B Recurrent State Space Model and Dreamer

The recurrent state space model (RSSM) is a latent dynamics model equipped with an expressive recurrent neural network, enabling accurate long-term prediction. RSSM is an essential component of various MBRL methods from pixels [15, 16, 17, 19, 31], including Dreamer. RSSM assumes that the latent state comprises $\boldsymbol{z}_{t}=(\boldsymbol{s}_{t},\boldsymbol{h}_{t})$, where $\boldsymbol{s}_{t}$ and $\boldsymbol{h}_{t}$ are probabilistic and deterministic variables, respectively. RSSM's generative and inference models are defined as:

\mathrm{Generative\ models}: \begin{cases} \boldsymbol{h}_{t} = f^{\mathrm{GRU}}(\boldsymbol{h}_{t-1}, \boldsymbol{s}_{t-1}, \boldsymbol{a}_{t-1}) \\ \boldsymbol{s}_{t} \sim p(\boldsymbol{s}_{t}|\boldsymbol{h}_{t}) \\ \boldsymbol{x}_{t} \sim p(\boldsymbol{x}_{t}|\boldsymbol{h}_{t},\boldsymbol{s}_{t}) \end{cases}   (4)
\mathrm{Inference\ model}: \boldsymbol{s}_{t} \sim q(\boldsymbol{s}_{t}|\boldsymbol{h}_{t},\boldsymbol{x}_{t}),

where the deterministic $\boldsymbol{h}_{t}$ is considered to be the hidden state of a gated recurrent unit (GRU) $f^{\mathrm{GRU}}(\cdot)$ [32], so that historical information can be embedded into $\boldsymbol{h}_{t}$.
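To make the structure of Eq. (4) concrete, the following sketch performs a single RSSM transition with a hand-written GRU cell and a Gaussian stochastic state. The layer sizes, the randomly initialized weights, and the linear mean/std heads are illustrative assumptions, not the architecture used by Dreamer.

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 200, 30, 6                           # deterministic, stochastic, and action sizes (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialized parameters stand in for trained weights.
Wz, Wr, Wh = (rng.normal(0, 0.05, (H, S + A + H)) for _ in range(3))
W_prior = rng.normal(0, 0.05, (2 * S, H))      # head for the prior p(s_t | h_t)
W_post = rng.normal(0, 0.05, (2 * S, H + 64))  # head for q(s_t | h_t, x_t), given a 64-d image embedding

def gru(h, inp):
    """Minimal GRU cell: update gate z, reset gate r, candidate state, convex combination."""
    x = np.concatenate([inp, h])
    z = sigmoid(Wz @ x)
    r = sigmoid(Wr @ x)
    h_tilde = np.tanh(Wh @ np.concatenate([inp, r * h]))
    return (1 - z) * h + z * h_tilde

def rssm_step(h, s, a, embed=None):
    """h_t = f_GRU(h_{t-1}, s_{t-1}, a_{t-1}); then sample s_t from the prior or the posterior."""
    h_next = gru(h, np.concatenate([s, a]))
    head = W_prior @ h_next if embed is None else W_post @ np.concatenate([h_next, embed])
    mean, std = head[:S], np.exp(head[S:]) + 1e-4   # exp keeps the standard deviation positive
    s_next = mean + std * rng.standard_normal(S)
    return h_next, s_next

h, s = np.zeros(H), np.zeros(S)
h, s = rssm_step(h, s, a=np.zeros(A), embed=rng.normal(size=64))  # inference (posterior) step
h, s = rssm_step(h, s, a=np.zeros(A))                             # imagination (prior) step
```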

Dreamer [16] makes use of RSSM as a differentiable dynamics model and efficiently learns behaviors via backpropagation of Bellman errors estimated from imagined trajectories. Dreamer's training procedure is summarized as follows: (1) Train RSSM on a given dataset by optimizing Eq. (2). (2) Train a policy from latent imagination. (3) Execute the trained policy in the real environment and augment the dataset with the observed results. These steps are iterated until the policy performs as expected; the sketch below outlines the loop.
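The loop can be outlined as below. The classes and functions are placeholder stubs introduced only to show the control flow; they are not Dreamer's actual API.

```python
import random

# Minimal placeholder stubs: in Dreamer these correspond to RSSM training,
# latent-imagination policy optimization, and environment interaction.
class Dataset:
    def __init__(self): self.episodes = []
    def add(self, episode): self.episodes.append(episode)
    def sample(self): return random.choice(self.episodes)

def update_world_model(batch): pass          # (1) maximize Eq. (2) on the sampled batch
def imagine_and_update_policy(batch): pass   # (2) backprop Bellman errors through imagined trajectories
def collect_episode(): return [("obs", "act", 0.0)]  # (3) act in the real environment

dataset = Dataset()
dataset.add(collect_episode())               # seed the dataset
for _ in range(5):                           # iterate until the policy performs as expected
    batch = dataset.sample()
    update_world_model(batch)
    imagine_and_update_policy(batch)
    dataset.add(collect_episode())
```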

III-C Contrastive Learning of RSSM

The original Dreamer paper [16] also introduced a likelihood-free objective by reformulating $\mathcal{J}^{\mathrm{likelihood}}$ of Eq. (2). By adding a constant $\log p(\boldsymbol{x}_{t})$ and applying Bayes' theorem, we get a decoder-free objective:

\mathcal{J}^{\mathrm{likelihood}} \stackrel{+}{=} \mathbb{E}_{q(\boldsymbol{z}_{t}|\cdot)}\left[\log p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t}) - \log p(\boldsymbol{x}_{t})\right]
= \mathbb{E}_{q(\boldsymbol{z}_{t}|\cdot)}\left[\log p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) - \log p(\boldsymbol{z}_{t})\right]
\geq \mathbb{E}_{q(\boldsymbol{z}_{t}|\cdot)}\left[\log p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) - \log \sum_{\boldsymbol{x}^{\prime}\in\mathcal{D}} p(\boldsymbol{z}_{t}|\boldsymbol{x}^{\prime})\right] \coloneqq \mathcal{J}^{\mathrm{NCE}},   (5)

where $\mathcal{D}$ denotes the mini-batch, and the lower bound follows from the InfoNCE (noise-contrastive estimation) mini-batch bound [33]. Letting $B$ be the batch size of $\mathcal{D}$, $\mathcal{J}^{\mathrm{NCE}}$ can be viewed as a $B$-class categorical cross-entropy objective that discriminates the positive pair $(\boldsymbol{z}_{t},\boldsymbol{x}_{t})$ from the negative pairs $(\boldsymbol{z}_{t},\boldsymbol{x}^{\prime}(\neq\boldsymbol{x}_{t}))$. In this interpretation, $p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t})$ acts as a discriminator that discerns the positive pairs. Representation learning with this type of objective is known as contrastive learning [22, 23], which encourages the embeddings to be sufficiently separated from each other in the latent space. However, the experiment in [16] demonstrated that this objective significantly degrades performance compared to the original objective $\mathcal{J}^{\mathrm{likelihood}}$.
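Concretely, with a mini-batch of $B$ latent states and $B$ observation embeddings, $\mathcal{J}^{\mathrm{NCE}}$ reduces to a $B$-way softmax cross-entropy whose positive pairs lie on the diagonal of a similarity matrix. A minimal NumPy sketch, with a plain dot-product similarity and random toy data standing in for learned encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 8, 30                                   # batch and latent sizes (illustrative)
z = rng.normal(size=(B, D))                    # latents z_t from q(z_t | .)
e = z + 0.1 * rng.normal(size=(B, D))          # observation embeddings; row i is the positive for z[i]

logits = z @ e.T                               # unnormalized log p(z_t | x'); any similarity works here
m = logits.max(axis=1, keepdims=True)          # numerical stabilization for the softmax
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
j_nce = log_probs[np.arange(B), np.arange(B)].mean()  # log p(z|x_pos) - log sum_x' p(z|x'), averaged
print(j_nce)                                   # maximized when each z_t is closest to its own observation
```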

IV Proposed Contrastive Learning and MBRL Method

IV-A Deriving Another Contrastive Objective

We propose to further reformulate $\mathcal{J}^{\mathrm{NCE}}$ of Eq. (5) by introducing a multi-step prediction model: $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t}) \coloneqq \mathbb{E}_{\tilde{p}(\boldsymbol{z}_{t-1}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t-1})}\left[\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})\right]$. The tilde on $\tilde{p}$ indicates that a dynamics model independent of $p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})$ in Eq. (2) can be employed here. By multiplying by a constant $\mathbb{E}_{q(\boldsymbol{z}_{t-k}|\cdot)}[\tilde{p}(\cdot)/\tilde{p}(\cdot)]=1$, we obtain an importance sampling form of $\mathcal{J}^{\mathrm{NCE}}$:

\mathcal{J}^{\mathrm{NCE}} = \mathbb{E}_{\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t})\, q(\boldsymbol{z}_{t-k}|\cdot)}\left[\frac{q(\boldsymbol{z}_{t}|\cdot)}{\tilde{p}(\boldsymbol{z}_{t}|\cdot)}\left(\log p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) - \log \sum_{\boldsymbol{x}^{\prime}} p(\boldsymbol{z}_{t}|\boldsymbol{x}^{\prime})\right)\right].   (6)

For computational simplicity, we approximate the likelihood ratio $q(\boldsymbol{z}_{t}|\cdot)/\tilde{p}(\boldsymbol{z}_{t}|\cdot)$ as a constant and assume that the summation of $\mathcal{J}^{\mathrm{NCE}}$ across the batch and time dimensions is approximately proportional to $\sum \mathcal{J}^{\mathrm{NCE}}_{k}$, where

\mathcal{J}^{\mathrm{NCE}}_{k} \coloneqq \mathbb{E}_{\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t})\, q(\boldsymbol{z}_{t-k}|\cdot)}\left[\log p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) - \log \sum_{\boldsymbol{x}^{\prime}} p(\boldsymbol{z}_{t}|\boldsymbol{x}^{\prime})\right].   (7)

We further import the concept of overshooting and optimize $\mathcal{J}^{\mathrm{NCE}}_{k}$ along with $\mathcal{J}^{\mathrm{KL}}_{k}$ on multi-step predictions of varying $k$. Finally, the objective we use to train RSSM is:

\mathcal{J} \coloneqq \sum^{K}_{k=0}\left(\mathcal{J}^{\mathrm{NCE}}_{k} + \mathcal{J}^{\mathrm{KL}}_{k}\right).   (8)

Note that $\mathcal{J}^{\mathrm{NCE}}_{k}$ and $\mathcal{J}^{\mathrm{KL}}_{k}$ use different dynamics models (i.e., $\tilde{p}(\boldsymbol{z}_{t}|\cdot)$ and $p(\boldsymbol{z}_{t}|\cdot)$, respectively).

IV-B Relation among the Objectives

As shown in Appx. A, $\mathcal{J}^{\mathrm{NCE}}$ is a lower bound of the mutual information $I(\boldsymbol{x}_{t};\boldsymbol{z}_{t})$, while $\mathcal{J}^{\mathrm{NCE}}_{k}$ is a bound of $I(\boldsymbol{x}_{t};\boldsymbol{z}_{t-k})$. Since the latent state sequence is Markovian, the data processing inequality gives $I(\boldsymbol{x}_{t};\boldsymbol{z}_{t}) \geq I(\boldsymbol{x}_{t};\boldsymbol{z}_{t-k})$. In other words, $\mathcal{J}^{\mathrm{NCE}}$ and the approximately derived $\mathcal{J}^{\mathrm{NCE}}_{k}$ share the same InfoMax upper bound. An intuitive motivation for introducing $\mathcal{J}^{\mathrm{NCE}}_{k}$ instead of $\mathcal{J}^{\mathrm{NCE}}$ is that it allows us to incorporate the temporal correlation between $t$ and $t-k$. Another motivation is that we can increase the model capacity of the discriminator $p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t})$ by incorporating the independent dynamics model $\tilde{p}(\boldsymbol{z}_{t}|\cdot)$.

IV-C Model Definitions

This section discusses how we define the discriminator components $p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t})$ and $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})$. Ref. [34] empirically showed that the inductive bias from model architectures is a significant factor in contrastive learning. Following the recommendation of that work, we define $p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t})$ as an exponentiated bilinear similarity function parameterized by $W_{\boldsymbol{z}|\boldsymbol{x}}$:

p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) \propto \exp(\boldsymbol{z}_{t}^{\top} W_{\boldsymbol{z}|\boldsymbol{x}} \boldsymbol{e}_{t}),   (9)

where $\boldsymbol{e}_{t} \coloneqq f^{\mathrm{CNN}}(\boldsymbol{x}_{t})$ and $f^{\mathrm{CNN}}(\cdot)$ denotes feature extraction by a CNN. With this definition, $\mathcal{J}^{\mathrm{NCE}}_{k}$ is simply a softmax cross-entropy objective with logits $\boldsymbol{z}^{\top} W \boldsymbol{e}$. In contrast to the previous contrastive learning literature [22, 34], the newly introduced $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})$ must also be defined. Here, we apply linear modeling and define the model deterministically as:

\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1}) \coloneqq \delta(\boldsymbol{z}_{t} - \boldsymbol{z}^{\prime}_{t}), \quad \mathrm{where}\ \boldsymbol{z}^{\prime}_{t} \coloneqq W_{\boldsymbol{z}} \boldsymbol{z}_{t-1} + W_{\boldsymbol{a}} \boldsymbol{a}_{t-1},   (10)

$\delta$ is the Dirac delta function, and $W_{\boldsymbol{z}}, W_{\boldsymbol{a}}$ are linear parameters.

This linear modeling of $\tilde{p}(\boldsymbol{z}_{t}|\cdot)$ successfully regularizes $\mathcal{J}^{\mathrm{NCE}}_{k}$ and contributes to constructing a smooth latent space. We could alternatively define $\tilde{p}(\boldsymbol{z}_{t}|\cdot) \coloneqq p(\boldsymbol{z}_{t}|\cdot)$, where $p(\boldsymbol{z}_{t}|\cdot)$ is generally an expressive model aiming at precise prediction. However, its high model capacity allows temporally consecutive samples to be embedded too far apart from each other when optimizing $\mathcal{J}^{\mathrm{NCE}}_{k}$, yielding an unsmooth latent space.
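Putting Eqs. (9) and (10) together, a single k-step contrastive term can be sketched as follows: roll the latent forward with the linear model, score the prediction against CNN features via the bilinear form, and apply a softmax cross-entropy over the batch. All matrices and feature vectors below are random placeholders standing in for trained parameters and real CNN outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, A, E, K = 8, 30, 6, 64, 3                  # batch, latent, action, embedding sizes and overshoot K

W_z = rng.normal(0, 0.1, (D, D))                 # linear dynamics of Eq. (10)
W_a = rng.normal(0, 0.1, (D, A))
W_zx = rng.normal(0, 0.1, (D, E))                # bilinear similarity of Eq. (9)

def linear_rollout(z, actions):
    """Deterministic K-step prediction z'_{t+K} with the independent linear dynamics."""
    for a in actions:
        z = z @ W_z.T + a @ W_a.T
    return z

z_t = rng.normal(size=(B, D))                    # latents inferred from x_{<=t}
actions = rng.normal(size=(K, B, A))             # actions a_{t:t+K-1}
e_tk = rng.normal(size=(B, E))                   # CNN features f_CNN(x_{t+K}); placeholder values

z_pred = linear_rollout(z_t, actions)            # predicted latent at t + K
logits = z_pred @ W_zx @ e_tk.T                  # (B, B); entry (i, i) is the positive pair
m = logits.max(axis=1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
j_nce_k = log_probs[np.arange(B), np.arange(B)].mean()
print(j_nce_k)
```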

IV-D Instantiation with RSSM

Figure 3 illustrates the architecture used to compute $\mathcal{J}^{\mathrm{NCE}}_{k}$. We describe the two paramount components that characterize our proposed contrastive learning scheme as follows:

(i) Independent linear forward dynamics: As proposed in Sec. IV-C, we employ a simple linear forward dynamics model $\tilde{p}$, which is used only for contrastive learning. During the policy optimization phase, the expressive GRU-based model is used instead to make the most of its long-term prediction accuracy.

(ii) Data augmentation: We append two independent image preprocessors that process the two sets of input images (i.e., $\boldsymbol{x}_{\leq t}$ and $\boldsymbol{x}_{t+1:t+K}$). Considering the empirical success reported in previous literature [23, 25, 28, 30, 29], we adopt the random crop of images; a crop sketch follows below. In our implementation, the original image of shape $(72, 72)$ is cropped to $(64, 64)$. The origin of the crop rectangle is determined randomly and independently at each preprocessor. This makes it more difficult for the contrastive learner to discriminate the correct positive pairs, encouraging only features informative for control to be extracted.
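A minimal sketch of the independent random crop performed by the two preprocessors (72x72 to 64x64), assuming channel-last image arrays; the batch contents are toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(images, size=64):
    """Crop each (72, 72, C) image to (size, size, C) at an independently random origin."""
    out = []
    for img in images:
        h, w = img.shape[:2]
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        out.append(img[top:top + size, left:left + size])
    return np.stack(out)

batch = rng.random((4, 72, 72, 3))           # a toy batch of 72x72 RGB frames
crops_a = random_crop(batch)                 # preprocessor for x_{<=t}
crops_b = random_crop(batch)                 # preprocessor for x_{t+1:t+K}, cropped independently
print(crops_a.shape, crops_b.shape)          # (4, 64, 64, 3) each
```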

Figure 3: The RSSM-based architecture used to compute $\mathcal{J}^{\mathrm{NCE}}_{k}$. CNN, GRU, and FC represent a convolutional neural network, a GRU cell, and a fully-connected layer, respectively. In module (a), latent states $\boldsymbol{s}_{t}$ are recurrently inferred given $\boldsymbol{x}_{\leq t}$. In module (b), $\boldsymbol{s}_{t+1:t+K}$ are sequentially predicted by the linear model $W_{\boldsymbol{z}}, W_{\boldsymbol{a}}$ and then compared with the observations $\boldsymbol{x}_{t+1:t+K}$ to compute logits. For readability, we only illustrate positive logits; however, negative logits are also computed by pairing samples from different time-steps or frames, yielding $(B\times K)^{2}$ logits. To compute $\mathcal{J}^{\mathrm{NCE}}_{k}$ for a certain positive logit, the remaining $(B\times K)^{2}-1$ logits are used as negatives.

We propose a decoder-free variant of Dreamer, which we call Dreamer with InfoMax and without generative decoder (Dreaming). Dreaming trains a policy in almost the same way as the original Dreamer; the only difference is that we use the contrastive learning scheme introduced in the previous section to train RSSM. We implement Dreaming in TensorFlow [35] by modifying the official source code of Dreamer (https://github.com/google-research/dreamer). We keep all hyperparameters and experimental conditions similar to the original ones. The newly introduced hyperparameter $K$ in Eq. (8) (overshooting distance) is set to $K=3$ based on the ablation study in Appx. C.

V Experiments

V-A Comparison to State-of-the-art Methods

TABLE I: Performance on 15 benchmark tasks at around 500K environment steps (100K only for Cup-catch).

Task | Dreaming (ours) | Dreamer w/ $\mathcal{J}^{\mathrm{likelihood}}$ [16] | Dreamer w/ $\mathcal{J}^{\mathrm{NCE}}$ [16] | CURL [25] | DrQ [29] | RAD [28]
(A) Manipulation tasks where object vanishing is critical
Cup-catch (100K) | 925 ± 48 | 698 ± 350 | 609 ± 404 | 693 ± 334 | 882 ± 174 | 792 ± 315
Reacher-hard | 868 ± 272 | 8 ± 33 | 115 ± 298 | 431 ± 435 | 616 ± 464 | 783 ± 370
Finger-turn-hard | 752 ± 325 | 264 ± 368 | 222 ± 379 | 339 ± 443 | 270 ± 427 | 303 ± 443
UR5-reach | 845 ± 147 | 652 ± 230 | 592 ± 271 | 729 ± 201 | 633 ± 312 | 642 ± 274
Connector-insert | 629 ± 391 | 169 ± 348 | 304 ± 399 | 297 ± 384 | 183 ± 361 | 367 ± 387
(B) Manipulation tasks where object vanishing is NOT critical
Reacher-easy | 905 ± 210 | 947 ± 145 | 183 ± 325 | 834 ± 286 | - | -
Finger-turn-easy | 661 ± 394 | 689 ± 394 | 232 ± 398 | 576 ± 464 | - | -
Finger-spin | 762 ± 113 | 763 ± 188 | 886 ± 169 | 922 ± 55 | - | -
(C) Pole-swingup tasks
Pendulum-swingup | 811 ± 98 | 432 ± 408 | 825 ± 106 | 46 ± 207 | - | -
Acrobot-swingup | 267 ± 177 | 98 ± 119 | 48 ± 54 | 4 ± 14 | - | -
Cartpole-swingup-sparse | 465 ± 328 | 317 ± 345 | 197 ± 79 | 17 ± 17 | - | -
(D) Locomotion tasks
Quadruped-walk | 719 ± 193 | 441 ± 219 | 201 ± 272 | 188 ± 174 | - | -
Walker-walk | 469 ± 123 | 955 ± 19 | 483 ± 111 | 914 ± 33 | - | -
Cheetah-run | 566 ± 118 | 781 ± 132 | 303 ± 174 | 580 ± 56 | - | -
Hopper-hop | 78 ± 55 | 172 ± 114 | 25 ± 29 | 10 ± 17 | - | -

The main objective of this experiment is to demonstrate that Dreaming has advantages over the baseline method Dreamer [16] on the 5 difficult manipulation tasks exhibited in Fig. 1, in which Dreamer suffers from object vanishing. We also prepare the likelihood-free variant of Dreamer introduced in Sec. III-C, which utilizes the vanilla contrastive objective $\mathcal{J}^{\mathrm{NCE}}$ instead of $\mathcal{J}^{\mathrm{NCE}}_{k}$ and $\mathcal{J}^{\mathrm{likelihood}}$. The specifications of the two original tasks, UR5-reach and Connector-insert, are described in Appx. B. On the 5 difficult tasks, we also compare against the latest cutting-edge MFRL methods: CURL [25], DrQ [29], and RAD [28]. In addition, another 10 DMC tasks are evaluated, categorized into three classes: manipulation, pole-swingup, and locomotion. For these additional 10 tasks, only CURL is selected as an MFRL representative.

Table I summarizes the training results benchmarked at the stated environment steps. The results show the mean and standard deviation over 4 seeds and 10 consecutive trajectories. The table reproduces the finding of [16] that decoder-free Dreamer with the vanilla contrastive objective $\mathcal{J}^{\mathrm{NCE}}$ degrades performance on most tasks compared to decoder-based Dreamer with $\mathcal{J}^{\mathrm{likelihood}}$. In the following discussion, we use the decoder-based Dreamer as the primary baseline. (A) On these difficult tasks, which are our main focus, Dreaming consistently achieves better performance than Dreamer. This indicates that the decoder-free nature of the proposed method successfully surmounts the object vanishing problem. In addition, Dreaming outperforms the leading MFRL methods. (B) On the other manipulation tasks, there is no significant difference between Dreaming and Dreamer because the key objects are large enough. (C) Since the pole-swingup tasks also cause vanishing of the thin poles, Dreaming performs better than Dreamer. (D) Dreaming lags behind Dreamer on 3 of the 4 locomotion tasks, i.e., the planar locomotion tasks (Walker-walk, Cheetah-run, and Hopper-hop). On these tasks, the cameras always track the center of the locomoting robots, so the key control information (i.e., velocity) must be extracted from the background texture. We suppose that this robot-centric nature makes it difficult for the contrastive learner to extract such information, because the robots' attitudes alone provide enough information to discriminate different samples.

Figure 4 shows video predictions by Dreaming, in which the principal features for control (e.g., positions and orientations) are successfully reconstructed from embeddings learned without a likelihood objective. However, on Cheetah-run, another kind of object vanishing arises: the checkered floor pattern, which is required to extract the velocity information, vanishes.

Figure 4: Open-loop video predictions. The left 5 consecutive images show reconstructed context frames and the remaining images are generated open-loop. The decoder is trained independently without backpropagating reconstruction errors to other models.

V-B Ablation Study

TABLE II: Ablation study: the effects of (i) linear forward dynamics and (ii) data augmentation.

(i) Forward dynamics used for contrastive learning
Task | $\tilde{p}$: linear as in Eq. (10) | $\tilde{p}\coloneqq p$ as in [21]
Cup-catch (100K) | 925 ± 48 | 575 ± 449
Reacher-hard | 868 ± 272 | 232 ± 370
Finger-turn-hard | 752 ± 325 | 263 ± 369

(ii) Data augmentation
Task | none | random crop | color jitter | crop + jitter
Cup-catch (100K) | 866 ± 133 | 925 ± 48 | 846 ± 192 | 866 ± 121
Reacher-hard | 11 ± 32 | 868 ± 272 | 121 ± 292 | 733 ± 388
Finger-turn-hard | 114 ± 283 | 752 ± 325 | 191 ± 357 | 399 ± 440

This experiment analyzes how the major components of the proposed representation learning, introduced in Sec. IV-D, contribute to the overall performance. For this purpose, we prepared variants of the proposed method: (i) the effect of the independent linear dynamics is demonstrated with a variant that shares the dynamics, $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1}) \coloneqq p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})$ (this variant can be considered a special case of the contrastive forward model (CFM) [21] discussed in Sec. II); (ii) the effect of data augmentation is demonstrated by removing the image preprocessors shown in Fig. 3. We also evaluate another data augmentation, color jittering [23, 29, 28], for reference. Only three tasks, Cup-catch, Reacher-hard, and Finger-turn-hard, are used in this experiment. Table II summarizes the results of the ablation study, which reveals that both of the proposed components are essential to achieve state-of-the-art results.

VI Conclusion

In the present paper, we proposed Dreaming, a decoder-free extension of the state-of-the-art MBRL method from pixels, Dreamer. A likelihood-free contrastive objective was derived by reformulating the original ELBO of Dreamer. We incorporated two indispensable components into the contrastive learning scheme: (i) an independent, linear forward dynamics model and (ii) the random crop data augmentation. By making the most of the decoder-free nature and these two components, Dreaming outperformed the baseline methods on difficult tasks, especially where Dreamer suffers from object vanishing.

A disadvantage we observed in the experiments is that Dreaming degrades training performance on planar locomotion tasks (e.g., Walker-walk), where the contrastive learner has to focus not only on the robots but also on the background texture. This weak point should be resolved in future work, as it may affect industrial manipulation tasks where the first-person view from the robot changes dynamically. Another future research direction is to incorporate the uncertainty-aware concepts proposed in recent MBRL studies [1, 2, 19, 30]. Although we have achieved state-of-the-art results on some difficult tasks, we often observed overfitted behaviors during the early training phase. We believe that this model-bias problem [36] can be alleviated by the above uncertainty-aware strategies.

Appendix A Derivation

In this section, we show that $\mathcal{J}^{\mathrm{NCE}}_{k}$ is a lower bound of $I(\boldsymbol{x}_{t};\boldsymbol{z}_{t-k})$. $\mathcal{J}^{\mathrm{NCE}}_{k}$ can be rewritten as:

\mathcal{J}^{\mathrm{NCE}}_{k} = \mathbb{E}_{q(\boldsymbol{z}_{t-k}|\cdot)}\left[\log f(\boldsymbol{x}_{t},\boldsymbol{z}_{t-k}) - \log \sum_{\boldsymbol{x}^{\prime}\in\mathcal{D}} f(\boldsymbol{x}^{\prime},\boldsymbol{z}_{t-k})\right],   (11)

where $f(\boldsymbol{x}_{t},\boldsymbol{z}_{t-k})$ includes the deterministic multi-step prediction with $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t})$ and the computation of the bilinear similarity in Eq. (9). For ease of notation, the actions $\boldsymbol{a}_{<t}$ in the conditioning set are omitted from $f(\cdot)$. As already shown in [22], the optimal value of $f(\cdot)$ is given by:

f(\boldsymbol{x}_{t},\boldsymbol{z}_{t-k}) \propto p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t-k}) / p(\boldsymbol{x}_{t}).   (12)

By applying Bayes' theorem, $f(\boldsymbol{x}_{t},\boldsymbol{z}_{t-k}) \propto p(\boldsymbol{z}_{t-k}|\boldsymbol{x}_{t}) / p(\boldsymbol{z}_{t-k})$, and inserting this into Eq. (11), we get:

\mathcal{J}^{\mathrm{NCE}}_{k} \propto \mathbb{E}_{q(\boldsymbol{z}_{t-k}|\cdot)}\left[\log p(\boldsymbol{z}_{t-k}|\boldsymbol{x}_{t}) - \log \sum_{\boldsymbol{x}^{\prime}} p(\boldsymbol{z}_{t-k}|\boldsymbol{x}^{\prime})\right] \leq \mathbb{E}_{q(\boldsymbol{z}_{t-k}|\cdot)}\left[\log p(\boldsymbol{z}_{t-k}|\boldsymbol{x}_{t}) - \log p(\boldsymbol{z}_{t-k})\right].   (13)

By marginalizing Eq. (13) with respect to the data distribution, we finally obtain $\mathbb{E}[\mathcal{J}^{\mathrm{NCE}}_{k}] \leq I(\boldsymbol{x}_{t};\boldsymbol{z}_{t-k})$. Note that setting $k=0$ gives $\mathbb{E}[\mathcal{J}^{\mathrm{NCE}}] \leq I(\boldsymbol{x}_{t};\boldsymbol{z}_{t})$.

Appendix B Specifications of the Original Tasks

Figure 5 exhibits the specifications of newly introduced robotics tasks.

Figure 5: UR5-reach (left) requires bringing the robot end effector to goal positions. The observation is a blended image of two different views, implicitly providing depth information. Connector-insert (right) requires inserting a millimeter-sized connector gripped by a robot into a socket. This task was originally introduced in [37]. Since the gap between the connector and the socket is very tight, pixel-wise precise control is required. In both tasks, the goal positions are initialized at random.

Appendix C Ablation Study of Overshooting Distance

Table III summarizes the ablation study of the overshooting distance $K$, which demonstrates that incorporating the temporal correlation over an appropriate number of steps ($K=3$) is effective.

TABLE III: Ablation study: the effect of the overshooting distance $K$.

Task | $K=1$ | $K=3$ | $K=5$ | $K=7$
Cup-catch (100K) | 280 ± 437 | 925 ± 48 | 734 ± 378 | 736 ± 378
Reacher-hard | 234 ± 364 | 868 ± 272 | 561 ± 447 | 471 ± 433
Finger-turn-hard | 354 ± 438 | 752 ± 325 | 468 ± 432 | 715 ± 375

References

  • [1] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in NeurIPS, 2018.
  • [2] M. Okada and T. Taniguchi, “Variational inference MPC for Bayesian model-based reinforcement learning,” in CoRL, 2019.
  • [3] E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, “Benchmarking model-based reinforcement learning,” arXiv:1907.02057, 2019.
  • [4] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, et al., “Model-based reinforcement learning for Atari,” in ICLR, 2020.
  • [5] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in ICML, 2018.
  • [6] S. Bechtle, Y. Lin, A. Rai, L. Righetti, and F. Meier, “Curious iLQR: Resolving uncertainty in model-based RL,” in CoRL, 2019.
  • [7] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” in CoRL, 2019.
  • [8] Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani, “Data efficient reinforcement learning for legged robots,” in CoRL, 2019.
  • [9] Y. Zhang, I. Clavera, B. Tsai, and P. Abbeel, “Asynchronous methods for model-based reinforcement learning,” in CoRL, 2019.
  • [10] G. R. Williams, B. Goldfain, K. Lee, J. Gibson, J. M. Rehg, and E. A. Theodorou, “Locally weighted regression pseudo-rehearsal for adaptive model predictive control,” in CoRL, 2019.
  • [11] K. Fang, Y. Zhu, A. Garg, S. Savarese, and L. Fei-Fei, “Dynamics learning with cascaded variational inference for multi-step manipulation,” in CoRL, 2019.
  • [12] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [13] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in NeurIPS, 2018.
  • [14] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, “Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model,” arXiv:1907.00953, 2019.
  • [15] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels,” in ICML, 2019.
  • [16] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” ICLR, 2020.
  • [17] D. Han, K. Doya, and J. Tani, “Variational recurrent models for solving partially observable control tasks,” in ICLR, 2020.
  • [18] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus, “Improving sample efficiency in model-free reinforcement learning from images,” arXiv:1910.01741, 2019.
  • [19] M. Okada, N. Kosaka, and T. Taniguchi, “PlaNet of the Bayesians: Reconsidering and improving deep planning network by incorporating Bayesian inference,” in IROS, 2020.
  • [20] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, et al., “DeepMind control suite,” arXiv:1801.00690, 2018.
  • [21] W. Yan, A. Vangipuram, P. Abbeel, and L. Pinto, “Learning predictive representations for deformable objects using contrastive estimation,” arXiv:2003.05436, 2020.
  • [22] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv:1807.03748, 2018.
  • [23] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICLR, 2020.
  • [24] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.
  • [25] A. Srinivas, M. Laskin, and P. Abbeel, “CURL: Contrastive unsupervised representations for reinforcement learning,” in ICML, 2020.
  • [26] A. Zhang, R. McAllister, R. Calandra, Y. Gal, and S. Levine, “Learning invariant representations for reinforcement learning without reconstruction,” arXiv:2006.10742, 2020.
  • [27] X. Ma, P. Karkus, D. Hsu, W. S. Lee, and N. Ye, “Discriminative particle filter reinforcement learning for complex partial observations,” in ICLR, 2020.
  • [28] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas, “Reinforcement learning with augmented data,” arXiv:2004.14990, 2020.
  • [29] I. Kostrikov, D. Yarats, and R. Fergus, “Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,” arXiv:2004.13649, 2020.
  • [30] K. Lee, M. Laskin, A. Srinivas, and P. Abbeel, “SUNRISE: A simple unified framework for ensemble learning in deep reinforcement learning,” arXiv:2007.04938, 2020.
  • [31] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak, “Planning to explore via self-supervised world models,” arXiv:2005.05960, 2020.
  • [32] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv:1406.1078, 2014.
  • [33] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker, “On variational bounds of mutual information,” in ICML, 2019.
  • [34] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, “On mutual information maximization for representation learning,” ICLR, 2020.
  • [35] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015.
  • [36] M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in ICML, 2011.
  • [37] R. Okumura, M. Okada, and T. Taniguchi, “Domain-adversarial and -conditional state space model for imitation learning,” in IROS, 2020.