
Dreaming: Model-based Reinforcement Learning
by Latent Imagination without Reconstruction

Masashi Okada1,⋆ and Tadahiro Taniguchi1,2
1 Masashi Okada and Tadahiro Taniguchi are with the Digital & AI Technology Center, Technology Division, Panasonic Corporation, Japan.
2 Tadahiro Taniguchi is also with Ritsumeikan University, College of Information Science and Engineering, Japan.
⋆ okada.masashi001@jp.panasonic.com
Abstract

In the present paper, we propose a decoder-free extension of Dreamer, a leading model-based reinforcement learning (MBRL) method from pixels. Dreamer is a sample- and cost-efficient solution to robot learning, as it trains latent state-space models based on a variational autoencoder and conducts policy optimization by latent trajectory imagination. However, this autoencoding-based approach often causes object vanishing, in which the autoencoder fails to perceive key objects for solving control tasks, significantly limiting Dreamer's potential. This work aims to relieve this bottleneck and enhance Dreamer's performance by removing the decoder. For this purpose, we first derive a likelihood-free and InfoMax objective of contrastive learning from the evidence lower bound of Dreamer. Second, we incorporate two components, (i) independent linear dynamics and (ii) the random crop data augmentation, into the learning scheme to improve the training performance. In comparison to Dreamer and other recent model-free reinforcement learning methods, our newly devised Dreamer with InfoMax and without generative decoder (Dreaming) achieves the best scores on 5 difficult simulated robotics tasks, in which Dreamer suffers from object vanishing.

I Introduction

In the present paper, we focus on model-based reinforcement learning (MBRL) from pixels without complex reconstruction. MBRL is a promising technique to build controllers in a sample-efficient manner: it trains forward dynamics models to predict future states and rewards for the purpose of planning and/or policy optimization. Recent studies of MBRL in fully observable environments [1, 2, 3, 4] have achieved both sample efficiency and performance competitive with state-of-the-art model-free reinforcement learning (MFRL) methods like soft actor-critic (SAC) [5]. Although real-robot learning has been achieved with fully observable MBRL [6, 7, 8, 9, 10, 11], there has been an increasing demand for robot learning in partially observable environments in which only incomplete information (especially vision) is available. MBRL from pixels can be realized by introducing deep generative models based on autoencoding variational Bayes [12].

Object vanishing is a critical problem of autoencoding-based MBRL from pixels. Previous studies in this field [13, 14, 15, 16, 17, 18, 19] train autoencoding models along with latent dynamics models that generate imagined trajectories used for planning or policy optimization. However, the autoencoder often fails to perceive small objects in pixel space. The top part of Fig. 1 exemplifies this kind of failure, where small (or thin) and important objects are not reconstructed in their correct positions. This indicates a failure to embed their information into the latent space, which significantly limits the training performance. The problem stems from the log-likelihood objective (reconstruction loss) being defined in pixel space. Since the reconstruction errors of small objects are insignificant compared to the errors of other objects and uninformative textures that occupy most of the image region, it is hard to train the encoder to perceive the small objects from such weak error signals. In addition, we have to train a decoder that requires a high model capacity with massive convolutional neural network (CNN) parameters, even though the trained decoder is exploited in neither planning nor policy optimization.

Figure 1: Overview of the motivation and concept of this work. (Top) The concept of Dreamer's autoencoding-based representation learning [16], which often causes object vanishing, as shown with the dashed-line ovals. The left three tasks are from the DeepMind Control Suite [20], and the remaining two tasks are our original tasks intended to represent industrial applications. (Bottom) The concept of Dreaming's representation learning, which trains a discriminator instead of the decoder so that different samples are embedded apart from each other. The learning scheme is characterized by two key components: (i) linear dynamics constrains temporally consecutive samples so that they are not embedded too far apart; (ii) image augmentation encourages only the features key for control to be embedded into the latent space.

To avoid this complex reconstruction, some previous MBRL studies have proposed decoder-free representation learning [16, 21] based on contrastive learning [22, 23], which trains a discriminator instead of the decoder. The discriminator is trained by categorical cross-entropy optimization, which encourages latent embeddings to be sufficiently distinguishable from each other. Nevertheless, to the best of our knowledge, no MBRL method has achieved state-of-the-art results on the difficult benchmark tasks of the DeepMind Control Suite (DMC) [20] without reconstruction.

Motivated by these observations, this paper aims to achieve state-of-the-art results with MBRL from pixels without reconstruction. We mainly focus on the latest autoencoding-based MBRL method, Dreamer, considering its success on a variety of control tasks (i.e., DMC and Atari games [24]), and extend it in a decoder-free fashion. We adopt Dreamer's policy optimization without any modification. We call our extended Dreamer "Dreamer with InfoMax and without generative decoder" (Dreaming). The concept of the proposed method is illustrated in the bottom part of Fig. 1. Our primary contributions are summarized as follows.

  • We derive a likelihood-free (decoder-free) and InfoMax objective for contrastive learning by reformulating the variational evidence lower bound (ELBO) of the graphical model of the partially observable Markov decision process.

  • We show that two key components, (i) an independent, linear forward dynamics model, which is used only for contrastive learning, and (ii) appropriate data augmentation (i.e., random crop), are indispensable to achieve state-of-the-art results.

In comparison to Dreamer and recent cutting-edge MFRL methods, Dreaming achieves state-of-the-art results on the difficult simulated robotics tasks exhibited in Fig. 1, in which Dreamer suffers from object vanishing. The remainder of this paper is organized as follows. In Sec. II, key differences from related work are discussed. In Sec. III, we provide a brief review of Dreamer and contrastive learning. In Sec. IV, we first describe the proposed contrastive learning scheme in detail and then introduce Dreaming. In Sec. V, the effectiveness of Dreaming is demonstrated through simulated evaluations. Finally, Sec. VI concludes this paper.

II Related Work

The most closely related works are contrastive predictive coding (CPC) [22] and the contrastive forward model (CFM) [21]. Our work is highly inspired by CPC, and our contrastive learning scheme shares similar components with CPC, e.g., a recurrent neural network and a bilinear similarity function. However, CPC has no action-conditioned dynamics model. Since CPC alone cannot generate imagined trajectories from arbitrary actions, it is only used as an auxiliary objective of MFRL. CFM heuristically introduces a decoder-free objective function similar to ours. A primary difference between ours and CFM is that CFM exploits a shared, non-linear forward model for both contrastive and behavior learning. In addition, the relation to the ELBO of time-series variational inference is not discussed in either of these works. Meanwhile, the original Dreamer paper [16] also derived a contrastive objective from the ELBO; however, dynamics models and the temporal correlation of observations are not involved in that objective. Furthermore, CPC, CFM, and Dreamer do not introduce data augmentation.

MFRL methods with representation learning: A state-of-the-art MFRL method, contrastive unsupervised representation for reinforcement learning (CURL) [25], also makes use of contrastive learning with the random crop data augmentation. Deep bisimulation for control (DBC) [26] and discriminative particle filter reinforcement learning (DPFRL) [27] are other types of cutting edge MFRL methods, which utilize different concepts of representation learning without reconstruction.

MFRL methods without representation learning: Recently proposed MFRL methods, including reinforcement learning with augmented data (RAD) [28], data-regularized Q (DrQ) [29], and the simple unified framework for reinforcement learning using ensembles (SUNRISE) [30], have achieved state-of-the-art results without representation learning. All of these works employ the random crop data augmentation as an important component of their methods.

Figure 2: Graphical model of the partially observable Markov decision process.

III Preliminary

III-A Autoencoding Variational Bayes for Time-series

Let us begin by considering the graphical model illustrated in Fig. 2, whose joint distribution is defined as follows:

p(\boldsymbol{z}_{\leq T}, \boldsymbol{a}_{<T}, \boldsymbol{x}_{\leq T}) = \prod_{t} p(\boldsymbol{z}_{t+1} | \boldsymbol{z}_{t}, \boldsymbol{a}_{t})\, p(\boldsymbol{x}_{t} | \boldsymbol{z}_{t}),   (1)

where $\boldsymbol{z}$, $\boldsymbol{x}$, and $\boldsymbol{a}$ denote the latent state, observation, and action, respectively. As in the case of well-known variational autoencoders (VAEs) [12], the generative models $p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_{t},\boldsymbol{a}_{t})$, $p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t})$ and the inference model $q(\boldsymbol{z}_{t}|\boldsymbol{x}_{\leq t},\boldsymbol{a}_{<t})$ can be trained by maximizing the evidence lower bound [15]:

\log p(\boldsymbol{x}_{\leq T}|\boldsymbol{a}_{\leq T}) = \log \int p(\boldsymbol{z}_{\leq T}, \boldsymbol{a}_{<T}, \boldsymbol{x}_{\leq T})\, d\boldsymbol{z}_{\leq T} \geq \sum_{t} \Big( \underbrace{\mathbb{E}_{q(\boldsymbol{z}_{t}|\boldsymbol{x}_{\leq t},\boldsymbol{a}_{<t})}\left[\log p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t})\right]}_{\coloneqq \mathcal{J}^{\mathrm{likelihood}}} \underbrace{-\, \mathbb{E}_{q(\boldsymbol{z}_{t}|\cdot)}\left[\operatorname{KL}\left[q(\boldsymbol{z}_{t+1}|\boldsymbol{x}_{\leq t+1},\boldsymbol{a}_{<t+1})\,\|\,p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_{t},\boldsymbol{a}_{t})\right]\right]}_{\coloneqq \mathcal{J}^{\mathrm{KL}}} \Big).   (2)

If the models are defined to be differentiable and trainable, this objective can be maximized by the stochastic gradient ascent via backpropagation.
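For a concrete sense of the KL term in Eq. (2): when both $q$ and $p$ are diagonal Gaussians, as is typical for latent state-space models, the term has a closed form. Below is a minimal NumPy sketch of that closed form; the latent size and the toy distribution parameters are illustrative and not values from the paper.

```python
import numpy as np

def diag_gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL[ N(mu_q, diag(std_q^2)) || N(mu_p, diag(std_p^2)) ], summed over latent dimensions."""
    var_q, var_p = std_q ** 2, std_p ** 2
    kl = 0.5 * (np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

# Toy usage: posterior q(z_{t+1} | x_{<=t+1}, a_{<=t}) vs. prior p(z_{t+1} | z_t, a_t).
mu_q, std_q = np.zeros(30), np.ones(30)
mu_p, std_p = 0.1 * np.ones(30), 1.2 * np.ones(30)
print(diag_gaussian_kl(mu_q, std_q, mu_p, std_p))  # one step's contribution to J^KL (up to sign)
```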

Multi-step variational inference was proposed in [15] to improve long-term predictions. This inference, named latent overshooting, involves the multi-step objective $\mathcal{J}^{\mathrm{KL}}_{k}$ defined as:

\mathcal{J}^{\mathrm{KL}} \geq \mathcal{J}^{\mathrm{KL}}_{k} \coloneqq \mathbb{E}_{p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t})\, q(\boldsymbol{z}_{t-k}|\boldsymbol{x}_{\leq t-k},\boldsymbol{a}_{<t-k})}\left[\operatorname{KL}\left[q(\boldsymbol{z}_{t+1}|\boldsymbol{x}_{\leq t+1},\boldsymbol{a}_{<t+1})\,\|\,p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_{t},\boldsymbol{a}_{t})\right]\right],   (3)

where

p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t}) \coloneqq \mathbb{E}_{p(\boldsymbol{z}_{t-1}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t-1})}\left[p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})\right]

is the multi-step prediction model.
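In practice, the nested expectation above is commonly approximated by simply rolling the one-step model forward for $k$ steps with sampled latents. A minimal sketch, assuming a toy Gaussian one-step model that stands in for the learned transition:

```python
import numpy as np

rng = np.random.default_rng(0)

def one_step(z, a):
    """Placeholder Gaussian one-step model p(z_t | z_{t-1}, a_{t-1}); a learned network in practice."""
    mean = 0.9 * z + 0.1 * a                      # toy linear transition
    return mean + 0.05 * rng.standard_normal(z.shape)

def sample_multi_step(z, actions):
    """Draw a sample from p(z_t | z_{t-k}, a_{<t}) by iterating the one-step model for k steps."""
    for a in actions:
        z = one_step(z, a)
    return z

z_tmk = np.zeros(30)                              # latent at time t - k
actions = [rng.normal(size=30) for _ in range(3)] # k = 3 placeholder actions
print(sample_multi_step(z_tmk, actions)[:5])
```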

For the purpose of planning or policy optimization, not only the dynamics model $p(\boldsymbol{z}_{t+1}|\boldsymbol{z}_{t},\boldsymbol{a}_{t})$ but also the reward function $p(r_{t}|\boldsymbol{z}_{t})$ is required. To obtain it, we can simply regard the rewards as observations and learn the reward function $p(r_{t}|\boldsymbol{z}_{t})$ along with the decoder $p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t})$. For readability, we omit the reward function from the following discussion. Although we remove the decoder later, the reward function and its likelihood objective are kept untouched.

III-B Recurrent State Space Model and Dreamer

The recurrent state space model (RSSM) is a latent dynamics model equipped with an expressive recurrent neural network, enabling accurate long-term prediction. RSSM is an essential component of various MBRL methods from pixels [15, 16, 17, 19, 31], including Dreamer. RSSM assumes that the latent state comprises $\boldsymbol{z}_{t}=(\boldsymbol{s}_{t},\boldsymbol{h}_{t})$, where $\boldsymbol{s}_{t}$ and $\boldsymbol{h}_{t}$ are probabilistic and deterministic variables, respectively. RSSM's generative and inference models are defined as:

\mathrm{Generative\ models}: \begin{cases} \boldsymbol{h}_{t} = f^{\mathrm{GRU}}(\boldsymbol{h}_{t-1}, \boldsymbol{s}_{t-1}, \boldsymbol{a}_{t-1}) \\ \boldsymbol{s}_{t} \sim p(\boldsymbol{s}_{t}|\boldsymbol{h}_{t}) \\ \boldsymbol{x}_{t} \sim p(\boldsymbol{x}_{t}|\boldsymbol{h}_{t},\boldsymbol{s}_{t}) \end{cases}   (4)
\mathrm{Inference\ model}: \boldsymbol{s}_{t} \sim q(\boldsymbol{s}_{t}|\boldsymbol{h}_{t},\boldsymbol{x}_{t}),

where the deterministic $\boldsymbol{h}_{t}$ is considered to be the hidden state of a gated recurrent unit (GRU) $f^{\mathrm{GRU}}(\cdot)$ [32], so that historical information can be embedded into $\boldsymbol{h}_{t}$.
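To make the structure of Eq. (4) concrete, the following sketch performs a single RSSM transition with a hand-written GRU cell and a Gaussian stochastic state. The layer sizes, the randomly initialized weights, and the linear mean/std heads are illustrative assumptions, not the architecture used by Dreamer.

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 200, 30, 6                           # deterministic, stochastic, and action sizes (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Randomly initialized parameters stand in for trained weights.
Wz, Wr, Wh = (rng.normal(0, 0.05, (H, S + A + H)) for _ in range(3))
W_prior = rng.normal(0, 0.05, (2 * S, H))      # head for the prior p(s_t | h_t)
W_post = rng.normal(0, 0.05, (2 * S, H + 64))  # head for q(s_t | h_t, x_t), given a 64-d image embedding

def gru(h, inp):
    """Minimal GRU cell: update gate z, reset gate r, candidate state, convex combination."""
    x = np.concatenate([inp, h])
    z = sigmoid(Wz @ x)
    r = sigmoid(Wr @ x)
    h_tilde = np.tanh(Wh @ np.concatenate([inp, r * h]))
    return (1 - z) * h + z * h_tilde

def rssm_step(h, s, a, embed=None):
    """h_t = f_GRU(h_{t-1}, s_{t-1}, a_{t-1}); then sample s_t from the prior or the posterior."""
    h_next = gru(h, np.concatenate([s, a]))
    head = W_prior @ h_next if embed is None else W_post @ np.concatenate([h_next, embed])
    mean, std = head[:S], np.exp(head[S:]) + 1e-4   # exp keeps the standard deviation positive
    s_next = mean + std * rng.standard_normal(S)
    return h_next, s_next

h, s = np.zeros(H), np.zeros(S)
h, s = rssm_step(h, s, a=np.zeros(A), embed=rng.normal(size=64))  # inference (posterior) step
h, s = rssm_step(h, s, a=np.zeros(A))                             # imagination (prior) step
```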

Dreamer [16] makes use of RSSM as a differentiable dynamics model and efficiently learns behaviors via backpropagation of Bellman errors estimated from imagined trajectories. Dreamer's training procedure is summarized as follows: (1) Train RSSM on a given dataset by optimizing Eq. (2). (2) Train a policy from latent imagination. (3) Execute the trained policy in the real environment and augment the dataset with the observed results. These steps are iterated until the policy performs as expected; the sketch below outlines the loop.
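The loop can be outlined as below. The classes and functions are placeholder stubs introduced only to show the control flow; they are not Dreamer's actual API.

```python
import random

# Minimal placeholder stubs: in Dreamer these correspond to RSSM training,
# latent-imagination policy optimization, and environment interaction.
class Dataset:
    def __init__(self): self.episodes = []
    def add(self, episode): self.episodes.append(episode)
    def sample(self): return random.choice(self.episodes)

def update_world_model(batch): pass          # (1) maximize Eq. (2) on the sampled batch
def imagine_and_update_policy(batch): pass   # (2) backprop Bellman errors through imagined trajectories
def collect_episode(): return [("obs", "act", 0.0)]  # (3) act in the real environment

dataset = Dataset()
dataset.add(collect_episode())               # seed the dataset
for _ in range(5):                           # iterate until the policy performs as expected
    batch = dataset.sample()
    update_world_model(batch)
    imagine_and_update_policy(batch)
    dataset.add(collect_episode())
```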

III-C Contrastive Learning of RSSM

The original Dreamer paper [16] also introduced a likelihood-free objective by reformulating $\mathcal{J}^{\mathrm{likelihood}}$ of Eq. (2). By adding a constant $\log p(\boldsymbol{x}_{t})$ and applying Bayes' theorem, we get a decoder-free objective:

\mathcal{J}^{\mathrm{likelihood}} \stackrel{+}{=} \mathbb{E}_{q(\boldsymbol{z}_{t}|\cdot)}\left[\log p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t}) - \log p(\boldsymbol{x}_{t})\right]
= \mathbb{E}_{q(\boldsymbol{z}_{t}|\cdot)}\left[\log p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) - \log p(\boldsymbol{z}_{t})\right]
\geq \mathbb{E}_{q(\boldsymbol{z}_{t}|\cdot)}\left[\log p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) - \log \sum_{\boldsymbol{x}^{\prime}\in\mathcal{D}} p(\boldsymbol{z}_{t}|\boldsymbol{x}^{\prime})\right] \coloneqq \mathcal{J}^{\mathrm{NCE}},   (5)

where $\mathcal{D}$ denotes the mini-batch, and the lower bound follows from the InfoNCE (noise-contrastive estimation) mini-batch bound [33]. Letting $B$ be the batch size of $\mathcal{D}$, $\mathcal{J}^{\mathrm{NCE}}$ can be viewed as a $B$-class categorical cross-entropy objective that discriminates the positive pair $(\boldsymbol{z}_{t},\boldsymbol{x}_{t})$ from the negative pairs $(\boldsymbol{z}_{t},\boldsymbol{x}^{\prime}(\neq\boldsymbol{x}_{t}))$. In this interpretation, $p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t})$ acts as a discriminator that discerns the positive pairs. Representation learning with this type of objective is known as contrastive learning [22, 23], which encourages the embeddings to be sufficiently separated from each other in the latent space. However, the experiment in [16] demonstrated that this objective significantly degrades performance compared to the original objective $\mathcal{J}^{\mathrm{likelihood}}$.
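Concretely, with a mini-batch of $B$ latent states and $B$ observation embeddings, $\mathcal{J}^{\mathrm{NCE}}$ reduces to a $B$-way softmax cross-entropy whose positive pairs lie on the diagonal of a similarity matrix. A minimal NumPy sketch, with a plain dot-product similarity and random toy data standing in for learned encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
B, D = 8, 30                                   # batch and latent sizes (illustrative)
z = rng.normal(size=(B, D))                    # latents z_t from q(z_t | .)
e = z + 0.1 * rng.normal(size=(B, D))          # observation embeddings; row i is the positive for z[i]

logits = z @ e.T                               # unnormalized log p(z_t | x'); any similarity works here
m = logits.max(axis=1, keepdims=True)          # numerical stabilization for the softmax
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
j_nce = log_probs[np.arange(B), np.arange(B)].mean()  # log p(z|x_pos) - log sum_x' p(z|x'), averaged
print(j_nce)                                   # maximized when each z_t is closest to its own observation
```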

IV Proposed Contrastive Learning and MBRL Method

IV-A Deriving Another Contrastive Objective

We propose to further reformulate $\mathcal{J}^{\mathrm{NCE}}$ of Eq. (5) by introducing a multi-step prediction model: $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t}) \coloneqq \mathbb{E}_{\tilde{p}(\boldsymbol{z}_{t-1}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t-1})}\left[\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})\right]$. The tilde on $\tilde{p}$ indicates that a dynamics model independent of $p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})$ in Eq. (2) can be employed here. By multiplying by a constant $\mathbb{E}_{q(\boldsymbol{z}_{t-k}|\cdot)}[\tilde{p}(\cdot)/\tilde{p}(\cdot)]=1$, we obtain an importance sampling form of $\mathcal{J}^{\mathrm{NCE}}$:

\mathcal{J}^{\mathrm{NCE}} = \mathbb{E}_{\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t})\, q(\boldsymbol{z}_{t-k}|\cdot)}\left[\frac{q(\boldsymbol{z}_{t}|\cdot)}{\tilde{p}(\boldsymbol{z}_{t}|\cdot)}\left(\log p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) - \log \sum_{\boldsymbol{x}^{\prime}} p(\boldsymbol{z}_{t}|\boldsymbol{x}^{\prime})\right)\right].   (6)

For computational simplicity, we approximate the likelihood ratio $q(\boldsymbol{z}_{t}|\cdot)/\tilde{p}(\boldsymbol{z}_{t}|\cdot)$ as a constant and assume that the summation of $\mathcal{J}^{\mathrm{NCE}}$ across the batch and time dimensions is approximately proportional to $\sum \mathcal{J}^{\mathrm{NCE}}_{k}$, where

\mathcal{J}^{\mathrm{NCE}}_{k} \coloneqq \mathbb{E}_{\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t})\, q(\boldsymbol{z}_{t-k}|\cdot)}\left[\log p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) - \log \sum_{\boldsymbol{x}^{\prime}} p(\boldsymbol{z}_{t}|\boldsymbol{x}^{\prime})\right].   (7)

We further import the concept of overshooting and optimize $\mathcal{J}^{\mathrm{NCE}}_{k}$ along with $\mathcal{J}^{\mathrm{KL}}_{k}$ on multi-step predictions of varying $k$. Finally, the objective we use to train RSSM is:

\mathcal{J} \coloneqq \sum^{K}_{k=0}\left(\mathcal{J}^{\mathrm{NCE}}_{k} + \mathcal{J}^{\mathrm{KL}}_{k}\right).   (8)

Note that $\mathcal{J}^{\mathrm{NCE}}_{k}$ and $\mathcal{J}^{\mathrm{KL}}_{k}$ use different dynamics models (i.e., $\tilde{p}(\boldsymbol{z}_{t}|\cdot)$ and $p(\boldsymbol{z}_{t}|\cdot)$, respectively).

IV-B Relation among the Objectives

As shown in Appx. A, $\mathcal{J}^{\mathrm{NCE}}$ is a lower bound of the mutual information $I(\boldsymbol{x}_{t};\boldsymbol{z}_{t})$, while $\mathcal{J}^{\mathrm{NCE}}_{k}$ is a bound of $I(\boldsymbol{x}_{t};\boldsymbol{z}_{t-k})$. Since the latent state sequence is Markovian, the data processing inequality gives $I(\boldsymbol{x}_{t};\boldsymbol{z}_{t}) \geq I(\boldsymbol{x}_{t};\boldsymbol{z}_{t-k})$. In other words, $\mathcal{J}^{\mathrm{NCE}}$ and the approximately derived $\mathcal{J}^{\mathrm{NCE}}_{k}$ share the same InfoMax upper bound. An intuitive motivation for introducing $\mathcal{J}^{\mathrm{NCE}}_{k}$ instead of $\mathcal{J}^{\mathrm{NCE}}$ is that it allows us to incorporate the temporal correlation between $t$ and $t-k$. Another motivation is that we can increase the model capacity of the discriminator $p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t})$ by incorporating the independent dynamics model $\tilde{p}(\boldsymbol{z}_{t}|\cdot)$.

IV-C Model Definitions

This section discusses how we define the discriminator components $p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t})$ and $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})$. Ref. [34] empirically showed that the inductive bias from model architectures is a significant factor in contrastive learning. Following the recommendation of that work, we define $p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t})$ as an exponentiated bilinear similarity function parameterized by $W_{\boldsymbol{z}|\boldsymbol{x}}$:

p(\boldsymbol{z}_{t}|\boldsymbol{x}_{t}) \propto \exp(\boldsymbol{z}_{t}^{\top} W_{\boldsymbol{z}|\boldsymbol{x}} \boldsymbol{e}_{t}),   (9)

where $\boldsymbol{e}_{t} \coloneqq f^{\mathrm{CNN}}(\boldsymbol{x}_{t})$ and $f^{\mathrm{CNN}}(\cdot)$ denotes feature extraction by a CNN. With this definition, $\mathcal{J}^{\mathrm{NCE}}_{k}$ is simply a softmax cross-entropy objective with logits $\boldsymbol{z}^{\top} W \boldsymbol{e}$. In contrast to the previous contrastive learning literature [22, 34], the newly introduced $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})$ must also be defined. Here, we apply linear modeling and define the model deterministically as:

\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1}) \coloneqq \delta(\boldsymbol{z}_{t} - \boldsymbol{z}^{\prime}_{t}), \quad \mathrm{where}\ \boldsymbol{z}^{\prime}_{t} \coloneqq W_{\boldsymbol{z}} \boldsymbol{z}_{t-1} + W_{\boldsymbol{a}} \boldsymbol{a}_{t-1},   (10)

$\delta$ is the Dirac delta function, and $W_{\boldsymbol{z}}, W_{\boldsymbol{a}}$ are linear parameters.

This linear modeling of $\tilde{p}(\boldsymbol{z}_{t}|\cdot)$ successfully regularizes $\mathcal{J}^{\mathrm{NCE}}_{k}$ and contributes to constructing a smooth latent space. We could alternatively define $\tilde{p}(\boldsymbol{z}_{t}|\cdot) \coloneqq p(\boldsymbol{z}_{t}|\cdot)$, where $p(\boldsymbol{z}_{t}|\cdot)$ is generally an expressive model aiming at precise prediction. However, its high model capacity allows temporally consecutive samples to be embedded too far apart from each other when optimizing $\mathcal{J}^{\mathrm{NCE}}_{k}$, yielding an unsmooth latent space.
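Putting Eqs. (9) and (10) together, a single k-step contrastive term can be sketched as follows: roll the latent forward with the linear model, score the prediction against CNN features via the bilinear form, and apply a softmax cross-entropy over the batch. All matrices and feature vectors below are random placeholders standing in for trained parameters and real CNN outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
B, D, A, E, K = 8, 30, 6, 64, 3                  # batch, latent, action, embedding sizes and overshoot K

W_z = rng.normal(0, 0.1, (D, D))                 # linear dynamics of Eq. (10)
W_a = rng.normal(0, 0.1, (D, A))
W_zx = rng.normal(0, 0.1, (D, E))                # bilinear similarity of Eq. (9)

def linear_rollout(z, actions):
    """Deterministic K-step prediction z'_{t+K} with the independent linear dynamics."""
    for a in actions:
        z = z @ W_z.T + a @ W_a.T
    return z

z_t = rng.normal(size=(B, D))                    # latents inferred from x_{<=t}
actions = rng.normal(size=(K, B, A))             # actions a_{t:t+K-1}
e_tk = rng.normal(size=(B, E))                   # CNN features f_CNN(x_{t+K}); placeholder values

z_pred = linear_rollout(z_t, actions)            # predicted latent at t + K
logits = z_pred @ W_zx @ e_tk.T                  # (B, B); entry (i, i) is the positive pair
m = logits.max(axis=1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
j_nce_k = log_probs[np.arange(B), np.arange(B)].mean()
print(j_nce_k)
```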

IV-D Instantiation with RSSM

Figure 3 illustrates the architecture used to compute $\mathcal{J}^{\mathrm{NCE}}_{k}$. We describe the two paramount components that characterize our proposed contrastive learning scheme as follows:

(i) Independent linear forward dynamics: As proposed in Sec. IV-C, we employ a simple linear forward dynamics model $\tilde{p}$, which is used only for contrastive learning. During the policy optimization phase, the expressive GRU-based model is used instead to make the most of its long-term prediction accuracy.

(ii) Data augmentation: We append two independent image preprocessors that process the two sets of input images (i.e., $\boldsymbol{x}_{\leq t}$ and $\boldsymbol{x}_{t+1:t+K}$). Considering the empirical success reported in previous literature [23, 25, 28, 30, 29], we adopt the random crop of images; a crop sketch follows below. In our implementation, the original image of shape $(72, 72)$ is cropped to $(64, 64)$. The origin of the crop rectangle is determined randomly and independently at each preprocessor. This makes it more difficult for the contrastive learner to discriminate the correct positive pairs, encouraging only features informative for control to be extracted.
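A minimal sketch of the independent random crop performed by the two preprocessors (72x72 to 64x64), assuming channel-last image arrays; the batch contents are toy data:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_crop(images, size=64):
    """Crop each (72, 72, C) image to (size, size, C) at an independently random origin."""
    out = []
    for img in images:
        h, w = img.shape[:2]
        top = rng.integers(0, h - size + 1)
        left = rng.integers(0, w - size + 1)
        out.append(img[top:top + size, left:left + size])
    return np.stack(out)

batch = rng.random((4, 72, 72, 3))           # a toy batch of 72x72 RGB frames
crops_a = random_crop(batch)                 # preprocessor for x_{<=t}
crops_b = random_crop(batch)                 # preprocessor for x_{t+1:t+K}, cropped independently
print(crops_a.shape, crops_b.shape)          # (4, 64, 64, 3) each
```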

Figure 3: The RSSM-based architecture used to compute $\mathcal{J}^{\mathrm{NCE}}_{k}$. CNN, GRU, and FC represent a convolutional neural network, a GRU cell, and a fully-connected layer, respectively. In module (a), latent states $\boldsymbol{s}_{t}$ are recurrently inferred given $\boldsymbol{x}_{\leq t}$. In module (b), $\boldsymbol{s}_{t+1:t+K}$ are sequentially predicted by the linear model $W_{\boldsymbol{z}}, W_{\boldsymbol{a}}$ and then compared with the observations $\boldsymbol{x}_{t+1:t+K}$ to compute logits. For readability, we only illustrate positive logits; however, negative logits are also computed by pairing samples from different time-steps or frames, yielding $(B\times K)^{2}$ logits. To compute $\mathcal{J}^{\mathrm{NCE}}_{k}$ for a certain positive logit, the remaining $(B\times K)^{2}-1$ logits are used as negatives.

We propose a decoder-free variant of Dreamer, which we call Dreamer with InfoMax and without generative decoder (Dreaming). Dreaming trains a policy in almost the same way as the original Dreamer; the only difference is that we use the contrastive learning scheme introduced in the previous section to train RSSM. We implement Dreaming in TensorFlow [35] by modifying the official source code of Dreamer (https://github.com/google-research/dreamer). We keep all hyperparameters and experimental conditions similar to the original ones. The newly introduced hyperparameter $K$ in Eq. (8) (overshooting distance) is set to $K=3$ based on the ablation study in Appx. C.

V Experiments

V-A Comparison to State-of-the-art Methods

TABLE I: Performance on 15 benchmark tasks at around 500K environment steps (100K only for Cup-catch).

Task | Dreaming (ours) | Dreamer w/ $\mathcal{J}^{\mathrm{likelihood}}$ [16] | Dreamer w/ $\mathcal{J}^{\mathrm{NCE}}$ [16] | CURL [25] | DrQ [29] | RAD [28]
(A) Manipulation tasks where object vanishing is critical
Cup-catch (100K) | 925 ± 48 | 698 ± 350 | 609 ± 404 | 693 ± 334 | 882 ± 174 | 792 ± 315
Reacher-hard | 868 ± 272 | 8 ± 33 | 115 ± 298 | 431 ± 435 | 616 ± 464 | 783 ± 370
Finger-turn-hard | 752 ± 325 | 264 ± 368 | 222 ± 379 | 339 ± 443 | 270 ± 427 | 303 ± 443
UR5-reach | 845 ± 147 | 652 ± 230 | 592 ± 271 | 729 ± 201 | 633 ± 312 | 642 ± 274
Connector-insert | 629 ± 391 | 169 ± 348 | 304 ± 399 | 297 ± 384 | 183 ± 361 | 367 ± 387
(B) Manipulation tasks where object vanishing is NOT critical
Reacher-easy | 905 ± 210 | 947 ± 145 | 183 ± 325 | 834 ± 286 | - | -
Finger-turn-easy | 661 ± 394 | 689 ± 394 | 232 ± 398 | 576 ± 464 | - | -
Finger-spin | 762 ± 113 | 763 ± 188 | 886 ± 169 | 922 ± 55 | - | -
(C) Pole-swingup tasks
Pendulum-swingup | 811 ± 98 | 432 ± 408 | 825 ± 106 | 46 ± 207 | - | -
Acrobot-swingup | 267 ± 177 | 98 ± 119 | 48 ± 54 | 4 ± 14 | - | -
Cartpole-swingup-sparse | 465 ± 328 | 317 ± 345 | 197 ± 79 | 17 ± 17 | - | -
(D) Locomotion tasks
Quadruped-walk | 719 ± 193 | 441 ± 219 | 201 ± 272 | 188 ± 174 | - | -
Walker-walk | 469 ± 123 | 955 ± 19 | 483 ± 111 | 914 ± 33 | - | -
Cheetah-run | 566 ± 118 | 781 ± 132 | 303 ± 174 | 580 ± 56 | - | -
Hopper-hop | 78 ± 55 | 172 ± 114 | 25 ± 29 | 10 ± 17 | - | -

The main objective of this experiment is to demonstrate that Dreaming has advantages over the baseline method Dreamer [16] on the 5 difficult manipulation tasks exhibited in Fig. 1, in which Dreamer suffers from object vanishing. We also prepare the likelihood-free variant of Dreamer introduced in Sec. III-C, which utilizes the vanilla contrastive objective $\mathcal{J}^{\mathrm{NCE}}$ instead of $\mathcal{J}^{\mathrm{NCE}}_{k}$ and $\mathcal{J}^{\mathrm{likelihood}}$. The specifications of the two original tasks, UR5-reach and Connector-insert, are described in Appx. B. On the 5 difficult tasks, we also compare against the latest cutting-edge MFRL methods: CURL [25], DrQ [29], and RAD [28]. In addition, another 10 DMC tasks are evaluated, categorized into three classes: manipulation, pole-swingup, and locomotion. For these additional 10 tasks, only CURL is selected as an MFRL representative.

Table I summarizes the training results benchmarked at the stated environment steps. The results show the mean and standard deviation over 4 seeds and 10 consecutive trajectories. The table reproduces the finding of [16] that decoder-free Dreamer with the vanilla contrastive objective $\mathcal{J}^{\mathrm{NCE}}$ degrades performance on most tasks compared to decoder-based Dreamer with $\mathcal{J}^{\mathrm{likelihood}}$. In the following discussion, we use the decoder-based Dreamer as the primary baseline. (A) On these difficult tasks, which are our main focus, Dreaming consistently achieves better performance than Dreamer. This indicates that the decoder-free nature of the proposed method successfully surmounts the object vanishing problem. In addition, Dreaming outperforms the leading MFRL methods. (B) On the other manipulation tasks, there is no significant difference between Dreaming and Dreamer because the key objects are large enough. (C) Since the pole-swingup tasks also cause vanishing of the thin poles, Dreaming performs better than Dreamer. (D) Dreaming lags behind Dreamer on 3 of the 4 locomotion tasks, i.e., the planar locomotion tasks (Walker-walk, Cheetah-run, and Hopper-hop). On these tasks, the cameras always track the center of the locomoting robots, so the key control information (i.e., velocity) must be extracted from the background texture. We suppose that this robot-centric nature makes it difficult for the contrastive learner to extract such information, because the robots' attitudes alone provide enough information to discriminate different samples.

Figure 4 shows video predictions by Dreaming, in which the principal features for control (e.g., positions and orientations) are successfully reconstructed from embeddings learned without a likelihood objective. However, on Cheetah-run, another kind of object vanishing arises: the checkered floor pattern, which is required to extract the velocity information, vanishes.

Figure 4: Open-loop video predictions. The left 5 consecutive images show reconstructed context frames and the remaining images are generated open-loop. The decoder is trained independently without backpropagating reconstruction errors to other models.

V-B Ablation Study

TABLE II: Ablation study: the effects of (i) linear forward dynamics and (ii) data augmentation.

(i) Forward dynamics used for contrastive learning
Task | $\tilde{p}$: linear as in Eq. (10) | $\tilde{p}\coloneqq p$ as in [21]
Cup-catch (100K) | 925 ± 48 | 575 ± 449
Reacher-hard | 868 ± 272 | 232 ± 370
Finger-turn-hard | 752 ± 325 | 263 ± 369

(ii) Data augmentation
Task | none | random crop | color jitter | crop + jitter
Cup-catch (100K) | 866 ± 133 | 925 ± 48 | 846 ± 192 | 866 ± 121
Reacher-hard | 11 ± 32 | 868 ± 272 | 121 ± 292 | 733 ± 388
Finger-turn-hard | 114 ± 283 | 752 ± 325 | 191 ± 357 | 399 ± 440

This experiment analyzes how the major components of the proposed representation learning, introduced in Sec. IV-D, contribute to the overall performance. For this purpose, we prepared variants of the proposed method: (i) the effect of the independent linear dynamics is demonstrated with a variant that shares the dynamics, $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1}) \coloneqq p(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-1},\boldsymbol{a}_{t-1})$ (this variant can be considered a special case of the contrastive forward model (CFM) [21] discussed in Sec. II); (ii) the effect of data augmentation is demonstrated by removing the image preprocessors shown in Fig. 3. We also evaluate another data augmentation, color jittering [23, 29, 28], for reference. Only three tasks, Cup-catch, Reacher-hard, and Finger-turn-hard, are used in this experiment. Table II summarizes the results of the ablation study, which reveals that both of the proposed components are essential to achieve state-of-the-art results.

VI Conclusion

In the present paper, we proposed Dreaming, a decoder-free extension of the state-of-the-art MBRL method from pixels, Dreamer. A likelihood-free contrastive objective was derived by reformulating the original ELBO of Dreamer. We incorporated two indispensable components into the contrastive learning scheme: (i) an independent, linear forward dynamics model and (ii) the random crop data augmentation. By making the most of the decoder-free nature and these two components, Dreaming outperformed the baseline methods on difficult tasks, especially where Dreamer suffers from object vanishing.

A disadvantage we observed in the experiments is that Dreaming degrades training performance on planar locomotion tasks (e.g., Walker-walk), where the contrastive learner has to focus not only on the robots but also on the background texture. This weak point should be resolved in future work, as it may affect industrial manipulation tasks where the first-person view from the robot changes dynamically. Another future research direction is to incorporate the uncertainty-aware concepts proposed in recent MBRL studies [1, 2, 19, 30]. Although we have achieved state-of-the-art results on some difficult tasks, we often observed overfitted behaviors during the early training phase. We believe that this model-bias problem [36] can be alleviated by the above uncertainty-aware strategies.

Appendix A Derivation

In this section, we show that $\mathcal{J}^{\mathrm{NCE}}_{k}$ is a lower bound of $I(\boldsymbol{x}_{t};\boldsymbol{z}_{t-k})$. $\mathcal{J}^{\mathrm{NCE}}_{k}$ can be rewritten as:

\mathcal{J}^{\mathrm{NCE}}_{k} = \mathbb{E}_{q(\boldsymbol{z}_{t-k}|\cdot)}\left[\log f(\boldsymbol{x}_{t},\boldsymbol{z}_{t-k}) - \log \sum_{\boldsymbol{x}^{\prime}\in\mathcal{D}} f(\boldsymbol{x}^{\prime},\boldsymbol{z}_{t-k})\right],   (11)

where $f(\boldsymbol{x}_{t},\boldsymbol{z}_{t-k})$ includes the deterministic multi-step prediction with $\tilde{p}(\boldsymbol{z}_{t}|\boldsymbol{z}_{t-k},\boldsymbol{a}_{<t})$ and the computation of the bilinear similarity in Eq. (9). For ease of notation, the actions $\boldsymbol{a}_{<t}$ in the conditioning set are omitted from $f(\cdot)$. As already shown in [22], the optimal value of $f(\cdot)$ is given by:

f(\boldsymbol{x}_{t},\boldsymbol{z}_{t-k}) \propto p(\boldsymbol{x}_{t}|\boldsymbol{z}_{t-k}) / p(\boldsymbol{x}_{t}).   (12)

By applying Bayes' theorem, $f(\boldsymbol{x}_{t},\boldsymbol{z}_{t-k}) \propto p(\boldsymbol{z}_{t-k}|\boldsymbol{x}_{t}) / p(\boldsymbol{z}_{t-k})$, and inserting this into Eq. (11), we get:

\mathcal{J}^{\mathrm{NCE}}_{k} \propto \mathbb{E}_{q(\boldsymbol{z}_{t-k}|\cdot)}\left[\log p(\boldsymbol{z}_{t-k}|\boldsymbol{x}_{t}) - \log \sum_{\boldsymbol{x}^{\prime}} p(\boldsymbol{z}_{t-k}|\boldsymbol{x}^{\prime})\right] \leq \mathbb{E}_{q(\boldsymbol{z}_{t-k}|\cdot)}\left[\log p(\boldsymbol{z}_{t-k}|\boldsymbol{x}_{t}) - \log p(\boldsymbol{z}_{t-k})\right].   (13)

By marginalizing Eq. (13) with respect to the data distribution, we finally obtain $\mathbb{E}[\mathcal{J}^{\mathrm{NCE}}_{k}] \leq I(\boldsymbol{x}_{t};\boldsymbol{z}_{t-k})$. Note that setting $k=0$ gives $\mathbb{E}[\mathcal{J}^{\mathrm{NCE}}] \leq I(\boldsymbol{x}_{t};\boldsymbol{z}_{t})$.

Appendix B Specifications of the Original Tasks

Figure 5 exhibits the specifications of newly introduced robotics tasks.

Figure 5: UR5-reach (left) requires bringing the robot end effector to goal positions. The observation is a blended image of two different views, implicitly providing depth information. Connector-insert (right) requires inserting a millimeter-sized connector gripped by a robot into a socket. This task was originally introduced in [37]. Since the gap between the connector and the socket is very tight, pixel-wise precise control is required. In both tasks, the goal positions are initialized at random.

Appendix C Ablation Study of Overshooting Distance

Table III summarizes the ablation study of the overshooting distance $K$, which demonstrates that incorporating the temporal correlation over an appropriate number of steps ($K=3$) is effective.

TABLE III: Ablation study: the effect of the overshooting distance $K$.

Task | $K=1$ | $K=3$ | $K=5$ | $K=7$
Cup-catch (100K) | 280 ± 437 | 925 ± 48 | 734 ± 378 | 736 ± 378
Reacher-hard | 234 ± 364 | 868 ± 272 | 561 ± 447 | 471 ± 433
Finger-turn-hard | 354 ± 438 | 752 ± 325 | 468 ± 432 | 715 ± 375

References

  • [1] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in NeurIPS, 2018.
  • [2] M. Okada and T. Taniguchi, “Variational inference MPC for Bayesian model-based reinforcement learning,” in CoRL, 2019.
  • [3] E. Langlois, S. Zhang, G. Zhang, P. Abbeel, and J. Ba, “Benchmarking model-based reinforcement learning,” arXiv:1907.02057, 2019.
  • [4] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, et al., “Model-based reinforcement learning for Atari,” in ICLR, 2020.
  • [5] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in ICML, 2018.
  • [6] S. Bechtle, Y. Lin, A. Rai, L. Righetti, and F. Meier, “Curious iLQR: Resolving uncertainty in model-based RL,” in CoRL, 2019.
  • [7] A. Nagabandi, K. Konolige, S. Levine, and V. Kumar, “Deep dynamics models for learning dexterous manipulation,” in CoRL, 2019.
  • [8] Y. Yang, K. Caluwaerts, A. Iscen, T. Zhang, J. Tan, and V. Sindhwani, “Data efficient reinforcement learning for legged robots,” in CoRL, 2019.
  • [9] Y. Zhang, I. Clavera, B. Tsai, and P. Abbeel, “Asynchronous methods for model-based reinforcement learning,” in CoRL, 2019.
  • [10] G. R. Williams, B. Goldfain, K. Lee, J. Gibson, J. M. Rehg, and E. A. Theodorou, “Locally weighted regression pseudo-rehearsal for adaptive model predictive control,” in CoRL, 2019.
  • [11] K. Fang, Y. Zhu, A. Garg, S. Savarese, and L. Fei-Fei, “Dynamics learning with cascaded variational inference for multi-step manipulation,” in CoRL, 2019.
  • [12] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in ICLR, 2014.
  • [13] D. Ha and J. Schmidhuber, “Recurrent world models facilitate policy evolution,” in NeurIPS, 2018.
  • [14] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine, “Stochastic latent actor-critic: Deep reinforcement learning with a latent variable model,” arXiv:1907.00953, 2019.
  • [15] D. Hafner, T. Lillicrap, I. Fischer, R. Villegas, D. Ha, H. Lee, and J. Davidson, “Learning latent dynamics for planning from pixels,” in ICML, 2019.
  • [16] D. Hafner, T. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” ICLR, 2020.
  • [17] D. Han, K. Doya, and J. Tani, “Variational recurrent models for solving partially observable control tasks,” in ICLR, 2020.
  • [18] D. Yarats, A. Zhang, I. Kostrikov, B. Amos, J. Pineau, and R. Fergus, “Improving sample efficiency in model-free reinforcement learning from images,” arXiv:1910.01741, 2019.
  • [19] M. Okada, N. Kosaka, and T. Taniguchi, “PlaNet of the Bayesians: Reconsidering and improving deep planning network by incorporating Bayesian inference,” in IROS, 2020.
  • [20] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, et al., “DeepMind control suite,” arXiv:1801.00690, 2018.
  • [21] W. Yan, A. Vangipuram, P. Abbeel, and L. Pinto, “Learning predictive representations for deformable objects using contrastive estimation,” arXiv:2003.05436, 2020.
  • [22] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv:1807.03748, 2018.
  • [23] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICLR, 2020.
  • [24] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, 2013.
  • [25] A. Srinivas, M. Laskin, and P. Abbeel, “CURL: Contrastive unsupervised representations for reinforcement learning,” in ICML, 2020.
  • [26] A. Zhang, R. McAllister, R. Calandra, Y. Gal, and S. Levine, “Learning invariant representations for reinforcement learning without reconstruction,” arXiv:2006.10742, 2020.
  • [27] X. Ma, P. Karkus, D. Hsu, W. S. Lee, and N. Ye, “Discriminative particle filter reinforcement learning for complex partial observations,” in ICLR, 2020.
  • [28] M. Laskin, K. Lee, A. Stooke, L. Pinto, P. Abbeel, and A. Srinivas, “Reinforcement learning with augmented data,” arXiv:2004.14990, 2020.
  • [29] I. Kostrikov, D. Yarats, and R. Fergus, “Image augmentation is all you need: Regularizing deep reinforcement learning from pixels,” arXiv:2004.13649, 2020.
  • [30] K. Lee, M. Laskin, A. Srinivas, and P. Abbeel, “SUNRISE: A simple unified framework for ensemble learning in deep reinforcement learning,” arXiv:2007.04938, 2020.
  • [31] R. Sekar, O. Rybkin, K. Daniilidis, P. Abbeel, D. Hafner, and D. Pathak, “Planning to explore via self-supervised world models,” arXiv:2005.05960, 2020.
  • [32] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, et al., “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv:1406.1078, 2014.
  • [33] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker, “On variational bounds of mutual information,” in ICML, 2019.
  • [34] M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic, “On mutual information maximization for representation learning,” ICLR, 2020.
  • [35] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015.
  • [36] M. Deisenroth and C. E. Rasmussen, “PILCO: A model-based and data-efficient approach to policy search,” in ICML, 2011.
  • [37] R. Okumura, M. Okada, and T. Taniguchi, “Domain-adversarial and -conditional state space model for imitation learning,” in IROS, 2020.