Sparse Representation Learning with Modified q-VAE towards Minimal Realization of World Model
Abstract
Extraction of a low-dimensional latent space from high-dimensional observation data is essential to construct a real-time robot controller with a world model on the extracted latent space. However, there is no established method for automatically tuning the dimension size of the latent space, making it difficult to find the necessary and sufficient dimension size, i.e. the minimal realization of the world model. In this study, we analyze and improve the Tsallis-based variational autoencoder (q-VAE), and reveal that, under an appropriate configuration, it always facilitates making the latent space sparse. Even if the pre-specified dimension size of the latent space is redundant compared to the minimal realization, this sparsification collapses unnecessary dimensions, allowing them to be easily removed. We experimentally verified this benefit of the proposed method: it easily found the necessary and sufficient six dimensions for a reaching task with a mobile manipulator that requires a six-dimensional state space. Moreover, by planning with such a minimal-realization world model learned on the extracted dimensions, the proposed method was able to generate a more optimal action sequence in real time, reducing the time to accomplish reaching by around 20%.
keywords:
Variational autoencoder; World model; Model predictive control
1 Introduction
Expectations for robots are increasing along with the rapid development of robot and AI technologies; coupled with the shortage of labor force, robots are beginning to be required to accomplish tasks more complex than those in, e.g., factory automation. Examples include manipulation of flexible objects [1, 2], (physical) human-robot interaction [3, 4], and autonomous driving based on high-dimensional observation data from cameras and LiDAR [5, 6]. In these tasks, modeling is a major obstacle to the use of conventional model-based control, in which the whole behavior in the task is mathematically modeled in advance for planning the optimal action sequence of the robot [7]. This is because the state that can adequately represent the whole behavior is unknown and must somehow be extracted from observations.
Recently, the so-called world model, which simulates prediction and evaluation of the whole behavior at each time step, has been attracting attention [8, 9, 10, 11, 12]. The world model is acquired from the experienced data under a state whose extraction is also learned from the data, in most cases simultaneously, mainly by a variant of the variational autoencoder (VAE) [13, 14, 15]. VAE compresses high-dimensional observation data into a low-dimensional latent space, with each axis of the obtained latent space as the state. A low-dimensional latent space with appropriately compressed observation data can capture the behavior while eliminating unnecessary computation; therefore, the world model constructed on this space is suitable for control applications because it enables accurate future predictions with low computational cost.
For optimal control using the acquired world model, sampling-based nonlinear model predictive control (MPC) [6, 7, 16] is often employed for its generality. This methodology randomly generates candidates of the optimal action sequence and finds the better candidates based on their evaluations simulated by the world model. While it can be applied to arbitrary world models because it does not require gradient information, its optimization process relies on re-evaluating numerous candidates many times and requires a very large computational cost. In particular, this computational cost is correlated with the dimensionality of the state of the world model, and it is intractable to generate the (near) optimal action sequence in real time if a sufficiently low-dimensional state is not extracted. On the other hand, of course, if the state dimension is set too small, the whole behavior cannot be simulated by the world model, and the accuracy of the planning itself would be greatly reduced.
Thus, in order to construct a world model that can accomplish the task in real time, it is essential to keep the dimension size of the extracted latent space at a necessary and sufficient level. In other words, it is desirable to achieve the minimal realization [17] of the world model. Most developers to date have adjusted this manually, changing the dimension size of the latent space little by little and re-learning to find the minimum size that leaves enough information in the state to recover the observation. Unfortunately, this fine-tuning is highly time-consuming and should be automated.
For this automation, disentangled representation learning [18, 19] (or, more directly, the independence and sparsification of the latent space) may play an important role. In this concept, the latent space should be divided into independent state dimensions and unnecessary state dimensions. To this end, it is necessary to eliminate as much as possible the dependencies among dimensions, and in addition, to collapse the state dimensions that can only be dependent so that they are always zero. If these requirements are satisfied, we can hypothesize that the uncollapsed state dimensions correspond to the minimal realization.
In this paper, we focus on one of the latest disentangled representation learning methods, q-VAE [19], for promoting such independence and sparsity. q-VAE is derived by replacing the log-likelihood maximization of the observed data, which is the starting point of the conventional VAE, with the $q$-log-likelihood maximization given in Tsallis statistics [20, 21]. As a characteristic of q-VAE, adaptive learning is performed to balance the term that improves the reconstruction accuracy of the observed data and the term that refines the latent space, and experimental results have reported that the independence of the latent space is increased. However, although the cause of this independence could be understood qualitatively, it was not clear whether it always holds mathematically. In addition, numerical stability needed to be guaranteed by ad-hoc constraints.

Therefore, we deepen the analysis of q-VAE to establish a new formulation that increases numerical stability and implementation flexibility by eliminating the ad-hoc constraints. To this end, we exclude the common factor that causes instability over all terms, found by further decomposing q-VAE. We also consider a further lower bound to prevent numerical divergence. After these modifications, we reveal the conditions that always facilitate sparsification, based on the inter-axis dependence and the finite lower bound of the $q$-logarithm.
Using the modified q-VAE, the pre-specified dimension size of the latent space can be increased to ensure the reconstruction accuracy of the observed data, and the unnecessary state dimensions, which can be easily discriminated thanks to sparsification, can be masked. By constructing a world model based on the masked state that may satisfy the minimal realization, we can expect to accomplish the task in real time using MPC with the trained world model. The proposed framework for the above processes is illustrated in Fig. 1. Note that, unlike conventional methods [9, 10, 11, 12], it is not possible to train all neural networks at the same time; instead, by dividing the optimization problems as in [8], the advantages of step-by-step performance analysis and verification can be obtained.
The proposed framework is empirically validated in an autonomous driving simulation and in a reaching task to a target object by a mobile manipulator. In both tasks, we show that the modified q-VAE improves the sparsity over a conventional method while ensuring the reconstruction accuracy. We also confirm that the modified q-VAE can make the latent space sparse down to six dimensions, achieving almost the minimal realization in the reaching task. With the world model constructed after masking the unnecessary dimensions, the prediction accuracy can be maintained at the level before masking. We finally report that the world model with masking contributes to the improvement of control performance in real time.
2 Preliminaries
2.1 Model predictive control with world model

Before describing the method of extracting the latent space, which is the main topic of this paper, we briefly introduce the world model for the given state and the use of MPC with it [7]. Here, we first define the state as $s \in \mathcal{S}$, the robot action as $a \in \mathcal{A}$, and the reward (or cost) as $r \in \mathbb{R}$ (with the state and action spaces $\mathcal{S}$ and $\mathcal{A}$, respectively). Note that $|\mathcal{X}|$ with space $\mathcal{X}$ denotes the size of dimensions of the given space. In addition, a discrete-time system is often supposed in the world model and MPC. Therefore, the time step is given as $t \in \mathbb{N}$, and it can be noted as a subscript to the above variables to clarify their time.
Following the above definitions, we set the world model with a set of parameters $\psi$ as follows:
$s_{t+1}, r_t \sim p_\psi(s_{t+1}, r_t \mid s_t, a_t)$ (1)
where $p_\psi$ denotes the conditional probability. That is, with the current state-action pair $(s_t, a_t)$, the world model predicts the future state $s_{t+1}$ and evaluates the current situation as $r_t$. With this structure (i.e. a Markov decision process), a transited state $s^\prime$ and an evaluated reward $r$, obtained by an action $a$ given to the "actual" environment in a state $s$, are combined into a tuple $(s, a, s^\prime, r)$, and a dataset with $N$ tuples, $\mathcal{D} = \{(s_n, a_n, s^\prime_n, r_n)\}_{n=1}^{N}$, can be constructed and used to train the world model. Specifically, we can find $\psi$ that achieves the following negative log-likelihood minimization problem.
$\psi = \mathop{\mathrm{argmin}}_{\psi} \mathbb{E}_{(s, a, s^\prime, r) \sim \mathcal{D}} \left[ -\ln p_\psi(s^\prime, r \mid s, a) \right]$ (2)
where $\mathbb{E}_{\mathcal{D}}$ denotes the expectation operation by randomly sampling tuples from $\mathcal{D}$.
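To make eq. (2) concrete, the following is a minimal PyTorch sketch of world-model training by negative log-likelihood minimization. It assumes diagonal Gaussian heads for both the state transition and the reward, and merges them into one network with two heads for brevity; the implementation in this paper trains separate dynamics and reward networks (see Table 5).

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """p(s', r | s, a) with diagonal Gaussian heads (an assumed design for illustration)."""
    def __init__(self, state_dim, action_dim, hidden=50):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.state_head = nn.Linear(hidden, 2 * state_dim)  # mean and log-std of s'
        self.reward_head = nn.Linear(hidden, 2)             # mean and log-std of r

    def forward(self, s, a):
        h = self.trunk(torch.cat([s, a], dim=-1))
        s_mean, s_logstd = self.state_head(h).chunk(2, dim=-1)
        r_mean, r_logstd = self.reward_head(h).chunk(2, dim=-1)
        return (torch.distributions.Normal(s_mean, s_logstd.exp()),
                torch.distributions.Normal(r_mean, r_logstd.exp()))

def nll_loss(model, s, a, s_next, r):
    """Monte-Carlo estimate of eq. (2) over a minibatch of tuples (s, a, s', r)."""
    p_state, p_reward = model(s, a)
    log_lik = p_state.log_prob(s_next).sum(-1) + p_reward.log_prob(r.unsqueeze(-1)).sum(-1)
    return -log_lik.mean()
```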
It is important to note that the world model includes the action only as one of the conditions; namely, if the robot freely plans and generates the action sequence $a_{t:t+H-1}$ with horizon $H$ steps, its value can be evaluated via simulating the world model to improve the way to generate it. This mechanism is utilized in the sampling-based nonlinear MPC used in this paper, the so-called cross-entropy method (CEM) [7] (see Fig. 2). Based on the evaluation of $a_{t:t+H-1}$, the optimal sequence is eventually obtained by repeatedly modifying and re-evaluating it in the direction of improving the evaluation. Note that although MPC optimizes the whole action sequence, only $a_t$ is actually used, since this optimization is conducted at every time step.
Specifically, CEM samples $K$ candidates of $a_{t:t+H-1}$, $\{a^k_{t:t+H-1}\}_{k=1}^{K}$, from a proposal distribution (or policy) $\pi$ at each iteration (i.e. the evaluation and improvement), and evaluates all of them using the world model. The score of each candidate is given as the sum of rewards $\sum_{h=0}^{H-1} r_{t+h}$. With this score, the candidates are sorted in descending (or ascending if cost is used instead of reward) order, and then the top $\lceil \rho K \rceil$ ($\rho$ denotes the elite ratio) candidates are extracted as the elites. Since these elites should be actively sampled, a new policy is obtained through the following maximum likelihood estimation.
$\pi = \mathop{\mathrm{argmax}}_{\pi} \mathbb{E}_{a_{t:t+H-1} \sim \pi^{\mathrm{old}}} \left[ \mathbb{1}\!\left( \sum_{h=0}^{H-1} r_{t+h} \geq R_{\min} \right) \ln \pi(a_{t:t+H-1}) \right]$ (3)
where $R_{\min}$ denotes the minimum score in the elites. $\mathbb{1}(\cdot)$ is defined as the indicator function, which returns one if the condition in the bracket is satisfied and zero otherwise. If $\pi$ is modeled as a normal distribution with location and scale, this problem can be analytically solved by the mean and standard deviation of the elites, respectively.
Note that this improvement is largely sample-dependent; hence, if the samples are biased, $\pi$ would overfit to one of the local optima. To mitigate this issue, the following smooth update is often employed.
$\theta_\pi \leftarrow \alpha \theta_\pi^{\mathrm{old}} + (1 - \alpha) \theta_\pi^{\mathrm{new}}$ (4)
where $\theta_\pi$ denotes the set of parameters in $\pi$ (in the case of a normal distribution, the location and scale). A larger $\alpha$ makes the update smoother. The above process (with sampling, evaluation, and improvement) is iterated until the specified number of times or the specified time is exceeded, and the mean of the finally updated $\pi$, or the candidate with the highest score among those sampled so far, is returned as the optimal action sequence.
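As a reference, eqs. (3) and (4) lead to the following CEM sketch, with the simulation hyperparameters of Table 7 as defaults. The helper `world_model.rollout_return`, which rolls the world model out over the horizon and returns the summed predicted reward, is an assumption for illustration.

```python
import numpy as np

def cem_plan(world_model, s0, horizon=10, n_candidates=10000,
             elite_ratio=0.01, n_iters=10, alpha=0.4, action_dim=2):
    """Sampling-based MPC by CEM with a Gaussian proposal distribution."""
    mu = np.zeros((horizon, action_dim))    # location of the policy pi
    sigma = np.ones((horizon, action_dim))  # scale of the policy pi
    n_elites = max(1, int(elite_ratio * n_candidates))
    for _ in range(n_iters):
        # sample candidates of the action sequence from pi
        cand = mu + sigma * np.random.randn(n_candidates, horizon, action_dim)
        # score each candidate by the sum of rewards predicted by the world model
        scores = np.array([world_model.rollout_return(s0, a_seq) for a_seq in cand])
        # maximum likelihood estimation over the elites, eq. (3)
        elites = cand[np.argsort(scores)[-n_elites:]]
        # smooth update of the policy parameters, eq. (4)
        mu = alpha * mu + (1.0 - alpha) * elites.mean(axis=0)
        sigma = alpha * sigma + (1.0 - alpha) * elites.std(axis=0)
    return mu[0]  # only the first action is actually executed
```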
2.2 Tsallis statistics

Let us briefly introduce several important properties of Tsallis statistics [20, 21], which are utilized in this paper. First of all, the well-known natural logarithm is extended to the $q$-logarithm with a real parameter $q > 0$ in Tsallis statistics.
$\ln_q x = \dfrac{x^{1-q} - 1}{1 - q}$ (5)
where $x > 0$ (with $q \to 1$, $\ln_q x \to \ln x$). As illustrated in Fig. 3, the $q$-logarithm with $q > 0$ is concave. While the natural logarithm has infinite upper and lower bounds $(\pm \infty)$, in the $q$-logarithm, either bound is finite according to $q$.
$\lim_{x \to 0} \ln_q x = -\dfrac{1}{1 - q} \quad (q < 1)$ (6)
$\lim_{x \to \infty} \ln_q x = \dfrac{1}{q - 1} \quad (q > 1)$ (7)
In addition, the following inequality holds for $q_1 \leq q_2$.
$\ln_{q_1} x \geq \ln_{q_2} x \quad (q_1 \leq q_2)$ (8)
The equality is satisfied only when $x = 1$.
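The following snippet numerically illustrates eqs. (5)-(8); the function name `q_log` is our own.

```python
import numpy as np

def q_log(x, q):
    """q-logarithm of eq. (5); reverts to the natural logarithm as q -> 1."""
    if np.isclose(q, 1.0):
        return np.log(x)
    return (np.asarray(x, dtype=float) ** (1.0 - q) - 1.0) / (1.0 - q)

# finite bounds of eqs. (6) and (7)
print(q_log(1e-12, 0.5), -1 / (1 - 0.5))   # both close to -2.0 (lower bound, q < 1)
print(q_log(1e+12, 1.5), 1 / (1.5 - 1))    # both close to +2.0 (upper bound, q > 1)
# monotonicity of eq. (8): the q-logarithm shrinks as q grows (equality only at x = 1)
print(q_log(2.0, 0.5) >= q_log(2.0, 1.5))  # True
```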
In the derivation of q-VAE, several important formulas are used, as described below. First, the $q$-logarithm satisfies pseudo-additivity instead of additivity.
$\ln_q (xy) = \ln_q x + \ln_q y + (1 - q) \ln_q x \ln_q y$ (9)
For the reciprocal, the following formula holds.
$\ln_q \dfrac{1}{x} = -x^{q-1} \ln_q x$ (10)
Finally, the $q$-deformed Kullback-Leibler (KL) divergence (or Tsallis divergence) is given as follows:
$\mathrm{KL}_q(p \,\|\, p^\prime) = -\int p(x) \ln_q \dfrac{p^\prime(x)}{p(x)} \, \mathrm{d}x \geq 0$ (11)
Note that some probability distribution models, such as exponential families, have closed-form solutions even for the Tsallis divergence [22].
2.3 Original VAE and q-VAE
As a comparison to q-VAE, we first derive the original VAE [13]. For VAE, a dataset $\mathcal{D}_x = \{x_n\}_{n=1}^{N_x}$ is prepared, where $x$ is the data observed by sensors and $N_x$ of them are collected. Here, $\mathcal{D}_x$ is distinguished from the dataset $\mathcal{D}$ for the world model by definition. However, for practical use, the dataset can be reused for both of them by extracting $x$ from $\mathcal{D}$ to train VAE, and by converting $\mathcal{D}$ into the latent representation by mapping $x$ with VAE.
Anyway, for $x$, in order to obtain a generative distribution $p(x)$, the problem of maximizing its log-likelihood is considered. In variational inference, $x$ is supposed to be generated stochastically depending on the corresponding latent variable $z$ (in general, $|z| \ll |x|$). In that case, $p(x)$ can be represented as $\int p_\theta(x \mid z) p(z) \mathrm{d}z$ with a pre-designed prior distribution $p(z)$ and a decoder $p_\theta(x \mid z)$ with the set of parameters $\theta$. From this relation, the variational lower bound is derived as follows:
$\ln p(x) \geq \mathbb{E}_{p_\phi(z \mid x)} \left[ \ln p_\theta(x \mid z) \right] - \mathrm{KL}(p_\phi(z \mid x) \,\|\, p(z)) =: -\mathcal{L}(\theta, \phi; x)$ (12)
where $p_\phi(z \mid x)$ denotes the variational posterior distribution (or encoder) with the set of parameters $\phi$. Note that $q$ is generally used instead of $p$ to denote the variational distribution, but since $q$ appears in Tsallis statistics, it is unified with $p$ to avoid confusion. The inequality in the above derivation is given by Jensen's inequality using the fact that the natural logarithm is a concave function. In order to minimize $\mathcal{L}$, the computational graph for $z$ is constructed using the reparameterization trick [13], etc., and one of the stochastic gradient descent methods [23, 24] is used to optimize $\theta$ and $\phi$. Furthermore, by considering the minimization problem of $\mathcal{L}$ as a constrained optimization problem with the KL divergence, β-VAE [14], which multiplies the KL divergence by a weight β, is derived via Lagrange's method of undetermined multipliers.
For convenience, the first term is called the reconstruction term, which increases the accuracy of reconstructing the observed data from the encoded latent variable, and the second term is called the regularization term, which attempts to match the encoder to the prior. The regularization term shapes the latent space according to the prior, and the design of the prior promotes disentangled representation (i.e. independence and sparsification). For implementation, for many reasons (e.g. the closed-form solution of the KL divergence can be obtained, the reparameterization trick is well established, and the computational cost is small), $p(z)$ is frequently given by the standard normal distribution $\mathcal{N}(0, I)$, and $p_\phi(z \mid x)$ is accordingly modeled by a diagonal normal distribution. Note that the model of $p_\theta(x \mid z)$ depends on $x$: for real data such as robot coordinates, a diagonal normal distribution (with fixed variance in some cases) or another real-space distribution like the student-t distribution [25] is used; and for image data (normalized to $[0, 1]$ for each pixel), the Bernoulli distribution (recently, the continuous Bernoulli distribution [26]) is adopted.
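As a reference, a minimal PyTorch sketch of the negative variational lower bound in eq. (12), including the β weight of β-VAE, is given below; `encoder` and `decoder` are assumed callables, with a Bernoulli-type image decoder and a diagonal Gaussian encoder.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, encoder, decoder, beta=1.0):
    """Negative ELBO of eq. (12); beta = 1 recovers the standard VAE."""
    mu, logvar = encoder(x)  # diagonal Gaussian posterior p(z|x)
    # reparameterization trick: z = mu + std * eps keeps the graph differentiable
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    # reconstruction term with a Bernoulli decoder over pixels in [0, 1]
    recon = F.binary_cross_entropy(decoder(z), x, reduction="sum")
    # closed-form KL divergence from the posterior to the prior N(0, I)
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```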
Finally, we introduce the original q-VAE [19], which uses the same variables and probability distributions, but replaces the starting point for the maximization problem with the $q$-log-likelihood $\ln_q p(x)$. By restricting $q > 0$ to make the $q$-logarithm concave, the variational lower bound for this problem can be derived as in the usual VAE (note the pseudo-additivity).
$\ln_q p(x) \geq \mathbb{E}_{p_\phi(z \mid x)} \left[ \kappa(x, z) \ln_q p_\theta(x \mid z) \right] - \mathrm{KL}_q(p_\phi(z \mid x) \,\|\, p(z)) =: -\mathcal{L}_q(\theta, \phi; x)$ (13)
where $\kappa(x, z) = (p(z) / p_\phi(z \mid x))^{1-q}$. Under the condition on $q$, when $\kappa$ is small (i.e. the encoder likelihood dominates the prior), the influence of the reconstruction term is suppressed and the regularization term dominates, thus promoting the match between the encoder and the prior. Otherwise, the reconstruction term becomes dominant and tries to extract the information needed for the reconstruction. In the original paper, this behavior is regarded as an adaptive β in β-VAE, automatically adjusting the trade-off between the disentangled representation promoted by a large β and the reconstruction accuracy impaired by it. Indeed, experimental results in that paper showed that the reconstruction accuracy can be retained while increasing the independence among latent variables compared to β-VAE. Note that, with $q \to 1$, the above problem reverts to the standard VAE.
3 Modified q-VAE
3.1 Stability issues
In the original q-VAE, numerical stability issues were found, and two ad-hoc tricks were employed to address them. The first is the removal of the computational graph leading to $\kappa$, making $\kappa$ a merely adaptive coefficient. In this way, $\kappa$ can be regarded as a part of the objective in the original version, but it should naturally be updated through the computational graph to $\phi$. This trick may cause large biases in the behavior during training and in the obtained latent space.
The other is the limitation of the decoder model. In many cases, the observed data handled by VAE are of very high dimension, and the following alternative representation of the $q$-logarithm reveals that it tends to have relatively large values inside the exponential function, causing numerical divergence.
$\ln_q x = \dfrac{\exp\{(1 - q) \ln x\} - 1}{1 - q}$ (14)
That is, if $p_\theta(x \mid z)$ is given as a probability density function (i.e. $x$ is in a continuous space), its value can be over one, resulting in a positive log-likelihood. Even if the log-likelihood for each dimension is only slightly positive, the value accumulated over tens of thousands of dimensions would easily make the above exponential function diverge numerically. To avoid this issue, the original q-VAE limits the decoder model to ones that cancel the $q$-logarithm, such as the $q$-Gaussian distribution [21]. Of course, this is not desirable, because various studies have reported performance gains by utilizing different decoder models [25, 26], as mentioned above.
For these two open issues, this paper deepens the analysis of the original q-VAE and decomposes it into a new surrogate variational lower bound. In addition, as a part of the flexibility of the decoder model, we also derive a formulation that takes into account the case of mixed observations that should be represented by different models. Note that we take care to keep the modified q-VAE a general form by guaranteeing that it reverts to the standard VAE when $q \to 1$ (and other hyperparameters are appropriately given).
3.1.1 Alternative to removal of computational graph
First, Tsallis divergence is decomposed as follows, making full use of its definition in eq. (11), pseudo-additivity in eq. (9), and the formula for the reciprocal in eq. (10).
$\mathrm{KL}_q(p_\phi(z \mid x) \,\|\, p(z)) = \mathbb{E}_{p_\phi(z \mid x)} \left[ p_\phi(z \mid x)^{q-1} \left\{ \ln_q p_\phi(z \mid x) - \ln_q p(z) \right\} \right]$ (15)
By applying this to eq. (13) and decomposing $\kappa$ as $\kappa = p(z)^{1-q} p_\phi(z \mid x)^{q-1}$, we see that $p_\phi(z \mid x)^{q-1}$ is multiplied to every term.
$-\mathcal{L}_q(\theta, \phi; x) = \mathbb{E}_{p_\phi(z \mid x)} \left[ p_\phi(z \mid x)^{q-1} \left\{ p(z)^{1-q} \ln_q p_\theta(x \mid z) + \ln_q p(z) - \ln_q p_\phi(z \mid x) \right\} \right]$ (16)
Here, with the facts that $p_\phi(z \mid x)^{q-1} > 0$ and that $p_\phi(z \mid x)^{q-1} \to 1$ when $q \to 1$, this common factor can be ignored, yielding a slightly biased but consistent surrogate problem.
$-\tilde{\mathcal{L}}_q(\theta, \phi; x) = \mathbb{E}_{p_\phi(z \mid x)} \left[ p(z)^{1-q} \ln_q p_\theta(x \mid z) + \ln_q p(z) - \ln_q p_\phi(z \mid x) \right]$ (17)
In this case, the first term can be regarded as the reconstruction term, the second as the regularization term that brings the encoder closer to the prior, and the third as an entropy term that maximizes the entropy of the encoder. Although the surrogate problem induces only a small bias, the elimination of $p_\phi(z \mid x)^{q-1}$ simplifies the gradient considerably, making it numerically much more stable. Note that the reconstruction term is not yet sufficiently stable numerically at this stage, since the direction to be updated is not unique unless the conditions described later hold.
3.1.2 Alternative to limitation of decoder model
Based on eq. (17), the decoder model is first decomposed for representing several types of observations. To this end, $x$ is classified into $C$ classes, i.e. $x = (x_1, x_2, \ldots, x_C)$ ($c = 1, 2, \ldots, C$). Note that, at this stage, each class is unordered. Supposing that each class is independent, the decoder can be decomposed as follows:
$p_\theta(x \mid z) = \prod_{c=1}^{C} p_\theta(x_c \mid z)$ (18)
where $p_\theta(x_c \mid z)$ is modeled by an appropriate distribution, such as a diagonal normal distribution or a continuous Bernoulli distribution. Although the product of likelihoods can be converted into the sum of them through the natural logarithm, the $q$-logarithm requires the pseudo-additivity defined in eq. (9). By applying the pseudo-additivity iteratively, the $q$-logarithm of the decomposed decoders is derived as follows:
$\ln_q p_\theta(x \mid z) = \sum_{c=1}^{C} w_c \ln_q p_\theta(x_c \mid z)$ (19)
where
$w_c = \left( \prod_{c^\prime = 1}^{c-1} p_\theta(x_{c^\prime} \mid z) \right)^{1-q}, \quad w_1 = 1$ (20)
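A small numerical check of eqs. (9) and (19): iterating the pseudo-additivity over a list of likelihood factors reproduces the $q$-logarithm of their product (function names are our own).

```python
import numpy as np

def q_log(x, q):
    """q-logarithm of eq. (5)."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_log_product(likelihoods, q):
    """ln_q of a product via iterated pseudo-additivity, eq. (9):
    ln_q(xy) = ln_q(x) + ln_q(y) + (1 - q) * ln_q(x) * ln_q(y)."""
    total = q_log(likelihoods[0], q)
    for y in likelihoods[1:]:
        ly = q_log(y, q)
        total = total + ly + (1.0 - q) * total * ly
    return total

# sanity check against the direct definition
ps, q = [0.5, 1.2, 0.8], 0.99
assert np.isclose(q_log_product(ps, q), q_log(np.prod(ps), q))
```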
To avoid numerical divergence of the decomposed decoders, we pay attention to $1 - q$, which appears inside the exponential function of eq. (14) as a coefficient. That is, the larger $q$ is, the smaller the scale of values in the exponential function becomes. Therefore, the condition for no numerical divergence can be found by increasing $q$. In addition, as introduced in eq. (8), the $q$-logarithm becomes smaller for a larger $q$. From these facts, the following lower bound is gained by introducing class-wise parameters $q_c \geq q$ ($c = 1, 2, \ldots, C$).
(21)
where
(22)
Note that although the classes were assumed to be unordered, their order can be determined by finding the non-divergent $q_c$ for each class and arranging them in ascending order. As a continuous space with a larger dimension size tends to diverge more easily, this fact can be a guide for roughly adjusting $q_c$.
By substituting this lower bound into eq. (17), a modified q-VAE, which can be numerically stable under the appropriate conditions, is obtained. Here, for convenience, all terms are weighted respectively, as in β-VAE [14] and its variants like [15]. Specifically, the modified q-VAE aims to find $\theta$ and $\phi$ that maximize the following objective (i.e. minimize its negative), weighted by $\beta_1$, $\beta_2$, and $\beta_3$.
$J(\theta, \phi; x) = \mathbb{E}_{p_\phi(z \mid x)} \left[ \beta_1 p(z)^{1-q} \sum_{c=1}^{C} w_c \ln_{q_c} p_\theta(x_c \mid z) + \beta_2 \ln_q p(z) - \beta_3 \ln_q p_\phi(z \mid x) \right]$ (23)
3.2 Analysis for sparsification
In the original q-VAE, the computational graph of $\kappa$ is removed to stabilize the computation, and it becomes a merely adaptive coefficient. Thus, overall, the original q-VAE solves the multi-objective optimization problem of the reconstruction and regularization terms by scalarizing them with an (adaptive) linear weighted sum. On the other hand, in the modified q-VAE (with the surrogate problem in eq. (17)), the likelihood of the encoder, which was the denominator of $\kappa$, is eliminated, but the regularization by the prior distribution in the numerator is retained as $p(z)^{1-q}$, preserving the computational graph. This means that the multi-objective optimization problem with the reconstruction and regularization terms is scalarized and solved in the form of a product. This product is important: both terms must be satisfied simultaneously to obtain a high value.
However, in a maximization problem involving such a product, each term must always be non-negative. That is, if the sign of one term is reversed to negative, the other term acts with a negative coefficient, converting its maximization problem into a minimization problem. Of course, this switching is also a factor that makes learning unstable. Since the regularization term is non-negative by definition, we need to reveal the conditions for a non-negative reconstruction term.
Now, we first consider the case with $C = 1$ for simplicity. The reconstruction term becomes negative with $p_\theta(x \mid z) < 1$ and non-negative with $p_\theta(x \mid z) \geq 1$. Since sign reversal may occur depending on the performance of the decoder, we substitute the definition of the $q$-logarithm with $q < 1$ (see eq. (5)) and reform eq. (23) as follows:
(24)
Using this form and the fact that the $q$-logarithm with $q < 1$ has a finite lower bound, as shown in eq. (6), the condition for the values in the brace to always be non-negative is revealed.
(25)
Similarly, the conditions required when $C \geq 2$ are also considered. The terms inside the brace of eq. (24) are derived as follows:
(26)
where auxiliary notations are introduced for simplifying the description. The first term is always non-negative, so the required conditions are obtained from the second term.
(27)
When the above conditions are satisfied, the improvement in the reconstruction accuracy of the observed data (i.e. compression of important information into the latent space) and the regularization of the encoder (i.e. organization of the latent space) should be promoted simultaneously. It is also important that this regularization is given in terms of likelihood, rather than log-likelihood, to the prior. That is, even if the prior is assumed to be a diagonal distribution such as $\mathcal{N}(0, I)$, the likelihood is given by the product of the likelihoods for the respective dimensions. As in the discussion above, if the regularization for each dimension is taken as a multi-objective problem, the regularization for all dimensions should be established simultaneously. This suggests that each dimension of the latent space is sparsified as much as possible to match the prior (basically, centered at the origin), while still extracting sufficient information into the latent space to reconstruct the observed data. In addition, the regularization statuses of the respective dimensions influence each other, thus preventing duplication of information and encouraging independence. As a result, the latent variable $z$ on the latent space constructed by the modified q-VAE is expected to coincide with the state $s$, which is the minimal realization for the whole behavior of the system.
4 Simulation
4.1 Task


As a statistical validation in simulation, we use CEM to control highway-env [27]. Specifically, as shown in Fig. 4, this task is to avoid colliding with the other blue car(s) and going out of the lane by operating the accelerator and steering wheel (i.e. a two-dimensional continuous action space) of the yellow car. Usually, geometric information between cars can be used as the observation, but in this verification, the 300×150×3 RGB image in Fig. 4 is resized to 64×64×3 and used as the observation. This modification requires extracting the latent state from the high-dimensional observation. In addition, to show that multiple types of sensors can be integrated as described above, the yellow car's velocity is given as another observation.
The control by CEM is real-time oriented and outputs the currently optimal action at each control period even if the maximum number of iterations has not been completed. In addition, the network architecture of VAE implemented in PyTorch [28] is illustrated in Fig. 5. These details and the way to collect the dataset are described in Appendix A.
4.2 Results
Method | |||||||
---|---|---|---|---|---|---|---|
VAE | 1 | – | – | 50 | 1 | 1 | – |
β-VAE | 1 | – | – | 50 | 1 | 0.3 | –
q-VAE (proposal) | 0.99 | 0.99 | 0.999 | 10 | 1 | 10 | 3 |
Under the common settings described above, the three methods shown in Table 1 are compared. Note that these values were adjusted by trial and error to increase the accuracy of observation reconstruction as much as possible. In addition, q-VAE is set to satisfy the sparsity condition indicated by eq. (27) while confirming the stability of the numerical computation.
4.3 Sparse extraction of latent state





The learning curves of the statistical reconstruction performance (for image and velocity, respectively) over 25 trials are depicted in Fig. 6. With the well-tuned parameters, β-VAE and q-VAE eventually achieved approximately the same level of reconstruction performance, while the standard VAE performed poorly in reconstructing images. This is probably due to the strong regularization to the prior distribution $\mathcal{N}(0, I)$. In fact, β-VAE succeeded in image reconstruction by setting β = 0.3 (see Table 1).
Another feature of q-VAE is that its learning speed tends to be faster than the others. Although a rigorous comparison is difficult because the parameter settings differ greatly among the methods, we can consider that the weight $\beta_3$ over the entropy term of the encoder can be set separately from $\beta_2$, which mitigates the loss of information from the encoder (and the latent space). In fact, previous studies have pointed out the negative effects of the entropy term found in the decomposition of the regularization term in VAE [15], and this is consistent with those reports.
Next, five samples are selected to illustrate the post-learning reconstruction accuracy in Fig. 7. As can be seen, while all methods succeeded in reconstructing the velocity with good accuracy, in the image reconstruction, the standard VAE in the second row failed to visualize the other blue car(s). In other words, it can be said that the encoder of the standard VAE did not properly incorporate the information of the other blue cars into the latent state. In contrast, β-VAE and q-VAE properly embedded the important information needed for the reconstruction into the latent space.
Finally, the sparsity of the acquired latent space is evaluated. As sparsity, we use the following definition in the literature [29].
$\mathrm{Sparsity} = \dfrac{\sqrt{|z|} - \left( \sum_{i=1}^{|z|} |\mu_i| \right) \Big/ \sqrt{\sum_{i=1}^{|z|} \mu_i^2}}{\sqrt{|z|} - 1}$ (28)
where $\mu$ denotes the location of the encoder over $x$ in the dataset. If each component of $\mu$ has the same magnitude, this definition returns zero; in contrast, if only one component has a non-zero value and the others are zero, it converges to one.
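As a reference implementation, the measure of eq. (28) can be computed per sample as follows; in our evaluation, the per-sample values are averaged over the dataset.

```python
import numpy as np

def hoyer_sparsity(mu):
    """Sparsity of eq. (28): 0 if all components share the same magnitude,
    converging to 1 if only one component is non-zero."""
    mu = np.asarray(mu, dtype=float)
    n = mu.size
    l1 = np.abs(mu).sum()
    l2 = np.sqrt((mu ** 2).sum())
    return (np.sqrt(n) - l1 / l2) / (np.sqrt(n) - 1.0)

print(hoyer_sparsity([1, 1, 1, 1]))  # 0.0: completely dense
print(hoyer_sparsity([1, 0, 0, 0]))  # 1.0: maximally sparse
```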
With this definition of sparsity, we evaluated each method as shown in Fig. 8. As a result, only β-VAE gained low sparsity. This is due to the fact that β < 1 is used to improve the reconstruction accuracy. The standard VAE achieved the same level of sparsity as the proposed q-VAE, but at the expense of the reconstruction accuracy as mentioned above, extracting a meaningless latent space. This trend is also consistent with the previous study [19]. From the above, it can be concluded that q-VAE mitigates the trade-off between the reconstruction accuracy and sparsity and increases both of them sufficiently; namely, it enables acquiring the important information contained in the observation with the smallest dimension size of the latent space (i.e. the minimal realization).
4.4 Control performance




#Inputs and outputs for all layers | #Parameters | |||
---|---|---|---|---|
Model | w/o masking | w/ masking | w/o masking | w/ masking |
Dynamics | 302 | 92 | 5,810 | 1,526 |
Reward | 155 | 71 | 3,752 | 1,232 |
Total | 457 | 163 | 9,562 | 2,758 |


The control performance by CEM is compared between β-VAE and q-VAE, both of which obtained a sufficiently meaningful latent space. Here, $|z|$ was set to be 50, which cannot reduce the computational cost sufficiently. Therefore, in order to confirm the benefit of sparsification, the unnecessary latent dimensions are excluded by masking, and the state space with the minimal realization is extracted. As a criterion for judging unnecessary dimensions, the sample standard deviation of the locations of the encoder is evaluated (see Fig. 9). A dimension with a small sample standard deviation takes almost zero for most of the data, and can be eliminated as an unnecessary dimension. As can be seen in the figure, it is easy to expect that the top eight dimensions (with a standard deviation over 0.15) are important in the case of q-VAE. In line with this, β-VAE also extracts its top eight dimensions, but there is concern that the necessary information may be truncated.
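The masking criterion described above can be sketched as follows, assuming `encoder_locations` is an $(N, |z|)$ array of the posterior locations over the dataset; the threshold of 0.15 follows the text.

```python
import numpy as np

def select_active_dims(encoder_locations, threshold=0.15):
    """Return the indices of latent dimensions whose sample standard deviation
    exceeds the threshold; the rest mostly output zero and can be masked."""
    std = encoder_locations.std(axis=0)
    return np.where(std > threshold)[0]

# usage sketch: mus = encode_dataset(...)         # shape (N, 50), an assumed helper
#               active = select_active_dims(mus)  # e.g. the top eight dimensions
#               # the world model is then trained only on z[:, active]
```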
The world models constructed on the latent space before and after masking are used to implement control with CEM. Each method was tested with different random seeds for 300 episodes, and an episode was regarded as successfully completed after 200 steps if no failure occurred during it. The statistics of the number of steps and the number of CEM iterations in each episode are depicted in Fig. 10. It is remarkable that masking reduced the computational cost and thus increased the number of iterations. In fact, the numbers of inputs/outputs and parameters of the world model are reduced by masking, as shown in Table 2.
The number of steps reveals the performance difference between the methods. First, masking clearly reduced the maximum performance of β-VAE. This is because the necessary information was lost by masking, and the optimization by CEM did not work properly. On the other hand, for q-VAE, masking improved the maximum performance more than in the other cases. One of the reasons for this is simply that the necessary information is retained even after masking, facilitating the optimization by increasing the number of iterations.
In addition, it was confirmed that q-VAE without masking requires a larger learning rate than the others for making learning of the world model progress. This may be due to the fact that most of the inputs and outputs in the training dataset were zero, resulting in over-learning that generates zeros. Therefore, the performance of q-VAE without masking may have been insufficient. In fact, the negative log-likelihood of the world model for the test dataset, as shown in Fig. 11, revealed that q-VAE without masking increased the worst-case loss of the dynamics model. Although the learning rate could be adjusted to make learning progress, over-learning still occurred in favor of the majority zeros, overlooking the important features. Note that masking reduced the accuracy of the dynamics model for β-VAE, as expected above.
5 Experiment
5.1 Task


As a demonstration, we conduct a reaching task, named stretch-reach, with a Stretch RE1 developed by Hello Robot [30]. Specifically, as shown in Fig. 12(a), Stretch RE1 is a kind of mobile manipulator with a camera on its top. For simplicity, the motion of this robot is limited to two-axis arm movement within given ranges in the vertical and horizontal directions. The target position is 2 cm above an object randomly placed within a given range in one direction, and the task is to move the arm to that position (see Fig. 12(b)). Similar to the above simulation, one of the observations is an RGB image (originally 424×240×3), which is pre-processed to 64×64×3. In addition, since the arm position and velocity can be easily measured from the encoders of the respective actuators, this 4D information is also added as an observation.
The task is considered successful when the arm stops near the target position. Specifically, the following reward function is provided; a reward above a threshold is regarded as success, and 20 steps without success is regarded as failure.
(29)
where the reward penalizes the distance between the current arm position and the target position together with the arm velocity. From the above setup, although the task itself is simple, the random target position has to be estimated from the RGB image to compute the reward and to obtain the world model.
Note that the details of other configurations are described in Appendix A, similar to that for the simulation.
5.2 Results
Method | |||||||
---|---|---|---|---|---|---|---|
β-VAE | 1 | – | – | 50 | 1 | 0.3 | –
q-VAE (proposal) | 0.95 | 0.95 | 0.999 | 50 | 1 | 50 | 3 |


Since the standard VAE is insufficient to extract the important information, only two methods, β-VAE and q-VAE, are tested as described in Table 3. Note again that q-VAE is set to satisfy the sparsity condition indicated by eq. (27) while confirming the stability of the numerical computation.
As in the simulation, after confirming that the reconstruction accuracies were comparable to each other, the importance of the latent dimensions was evaluated in terms of the sample standard deviation (see Fig. 13). From the figure, q-VAE can select six dimensions (with the same threshold of 0.15). In theory, this task may need only five dimensions for the minimal realization (i.e. the 2D arm position, the 2D arm velocity, and the target position along the placement direction), but in reality, other environmental noise (e.g. misalignment of the target object in the other direction) may occur. Therefore, a total of six dimensions can be reasonably extracted: five dimensions for the theoretical minimal realization and one dimension to handle other noise in practice. On the other hand, with β-VAE, it is difficult to find the important dimensions.
In practical use, the number of dimensions extracted by q-VAE would be unknown when using β-VAE alone, so masking for β-VAE is omitted in this experiment. In addition, the simulation results described before indicated that q-VAE without masking is prone to make learning of the world model unstable due to the unnecessary axes, so we omit it as well. Therefore, the performance comparison in this experiment is limited to β-VAE without masking and q-VAE with masking.
5.3 Accuracy of world model



First, we confirm that the performances of the world models are comparable to each other. The dynamics predicted up to the horizon ahead, as handled in CEM, is illustrated in Fig. 14. It can be seen that the differences between the predictions and the true observations for β-VAE and q-VAE are comparable for both the image and the arm state. The predictions and true values of the rewards are also compared in Fig. 15. Similarly, the predictions are mostly consistent with the true values, suggesting that both methods achieved good accuracy. From the above, it can be expected that the difference in control performance between the two methods (to be confirmed in the next section) would be only due to the masking (from 30 to six dimensions) obtained by the sparsity of q-VAE.
5.4 Control performance










Using the learned world model, five trials of reaching were performed for each of the three target object positions (i.e. 0.2, 0.3, and 0.4 m, respectively). The number of steps until task termination (20 steps at maximum), the average reward, and the average number of iterations are listed in Fig. 16. Examples of the trials are shown in Fig. 17 and the attached video.
When the object was placed at 0.2 m, q-VAE succeeded in reaching the target earlier than β-VAE, although there was little difference in reward due to the overall smaller distance penalty. Even in the case of the 0.3 m target object position, q-VAE accomplished the task earlier than β-VAE, and also increased the reward due to smoother acceleration/deceleration. When the target object was placed as far as 0.4 m, β-VAE failed in all trials, while q-VAE succeeded in most cases.
As noted above, these performance gains are not due to differences in the prediction accuracy of the world model, but rather due to the reduced computational cost of the compact world model with almost the minimal realization. In fact, the number of iterations of q-VAE was more than double that of β-VAE under all conditions. Thus, we confirmed that q-VAE facilitates the minimal realization of the world model through sparsification and contributes to improving the control performance of computationally expensive optimal control, such as CEM, in real time.
6 Conclusion and discussion
6.1 Conclusion
In this paper, we improved and analyzed q-VAE, a deep learning technique for extracting a sparse latent space, aiming at the minimal realization of the world model. In particular, we clarified the hyperparameter condition under which the modified q-VAE always sparsifies the latent space. In both the simulations and the experiments, the modified q-VAE trained according to this condition successfully collapsed many latent dimensions to zero while maintaining the same level of reconstruction accuracy as the baseline method, β-VAE. The world model with almost the minimal realization was obtained by masking the unnecessary latent dimensions and was utilized in CEM, a sampling-based optimal control method. Consequently, the optimization of CEM was facilitated by the reduction in computational cost, resulting in better control performance.
6.2 Discussion for future work
Two major open issues can be raised from our investigation. One is how to adjust the hyperparameters. The hyperparameters of the modified q-VAE have increased due to the decomposition from the conventional q-VAE, and it is clear that their optimal values are task-specific, although the sparsification condition limits the degrees of freedom in their design. In particular, a small $q$ is desirable to strongly promote sparsification, but as the literature [31] reported, a small $q$ would exclude much data from training as outliers. In fact, in the highway-env simulations, we observed cases where the other blue cars could not be reconstructed, as in the standard VAE, depending on the value of $q$. A framework that is adaptive or robust to this trade-off, such as meta-optimization [32] or ensemble learning with multiple combinations of hyperparameters [33], would be useful.
Another open issue is the simultaneous learning of the latent space and the world model. In this paper, the latent space extraction phase and the world model acquisition phase were conducted independently, giving priority to ease of analysis. However, in order to obtain the state holding the minimal realization, it is desirable to extract the latent space while considering the world model (and the controller). Since simultaneous learning of multiple modules tends to be difficult, it would be important to introduce curriculum learning [34], etc.
Anyway, we will apply our method to more practical tasks in the near future and enhance its practicality by resolving the above-mentioned open issues.
Acknowledgement
This work was supported by JSPS KAKENHI, Grant-in-Aid for Scientific Research (B), Grant Number JP20H04265 and JST, PRESTO Grant Number JPMJPR20C3, Japan.
References
- [1] Sanchez J, Corrales JA, Bouzgarrou BC, et al. Robotic manipulation and sensing of deformable objects in domestic and industrial applications: a survey. The International Journal of Robotics Research. 2018;37(7):688–716.
- [2] Tsurumine Y, Cui Y, Uchibe E, et al. Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation. Robotics and Autonomous Systems. 2019;112:72–83.
- [3] Modares H, Ranatunga I, Lewis FL, et al. Optimized assistive human–robot interaction using reinforcement learning. IEEE transactions on cybernetics. 2015;46(3):655–667.
- [4] Kobayashi T, Dean-Leon E, Guadarrama-Olvera JR, et al. Whole-body multicontact haptic human–humanoid interaction based on leader–follower switching: A robot dance of the “box step”. Advanced Intelligent Systems. 2022;4(2):2100038.
- [5] Paden B, Čáp M, Yong SZ, et al. A survey of motion planning and control techniques for self-driving urban vehicles. IEEE Transactions on intelligent vehicles. 2016;1(1):33–55.
- [6] Williams G, Drews P, Goldfain B, et al. Information-theoretic model predictive control: Theory and applications to autonomous driving. IEEE Transactions on Robotics. 2018;34(6):1603–1622.
- [7] Botev ZI, Kroese DP, Rubinstein RY, et al. The cross-entropy method for optimization. In: Handbook of statistics. Vol. 31. Elsevier; 2013. p. 35–59.
- [8] Ha D, Schmidhuber J. World models. arXiv preprint arXiv:1803.10122. 2018.
- [9] Hafner D, Lillicrap T, Ba J, et al. Dream to control: Learning behaviors by latent imagination. In: International Conference on Learning Representations; 2020.
- [10] Hafner D, Lillicrap TP, Norouzi M, et al. Mastering atari with discrete world models. In: International Conference on Learning Representations; 2021.
- [11] Okada M, Kosaka N, Taniguchi T. Planet of the bayesians: Reconsidering and improving deep planning network by incorporating bayesian inference. In: IEEE/RSJ International Conference on Intelligent Robots and Systems; IEEE; 2020. p. 5611–5618.
- [12] Okada M, Taniguchi T. Dreaming: Model-based reinforcement learning by latent imagination without reconstruction. In: IEEE International Conference on Robotics and Automation; IEEE; 2021. p. 4209–4215.
- [13] Kingma DP, Welling M. Auto-encoding variational bayes. In: International Conference on Learning Representations; 2014.
- [14] Higgins I, Matthey L, Pal A, et al. beta-vae: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations; 2017.
- [15] Mathieu E, Rainforth T, Siddharth N, et al. Disentangling disentanglement in variational autoencoders. In: International Conference on Machine Learning; PMLR; 2019. p. 4402–4412.
- [16] Okada M, Taniguchi T. Variational inference mpc for bayesian model-based reinforcement learning. In: Conference on robot learning; PMLR; 2020. p. 258–272.
- [17] Williams RL, Lawrence DA, et al. Linear state-space control systems. John Wiley & Sons; 2007.
- [18] Higgins I, Amos D, Pfau D, et al. Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230. 2018.
- [19] Kobayashi T. q-vae for disentangled representation learning and latent dynamical systems. IEEE Robotics and Automation Letters. 2020;5(4):5669–5676.
- [20] Tsallis C. Possible generalization of boltzmann-gibbs statistics. Journal of statistical physics. 1988;52(1-2):479–487.
- [21] Suyari H, Tsukada M. Law of error in tsallis statistics. IEEE Transactions on Information Theory. 2005;51(2):753–757.
- [22] Gil M, Alajaji F, Linder T. Rényi divergence measures for commonly used univariate continuous distributions. Information Sciences. 2013;249:124–131.
- [23] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
- [24] Ilboudo WEL, Kobayashi T, Sugimoto K. Robust stochastic gradient descent with student-t distribution based first-order momentum. IEEE Transactions on Neural Networks and Learning Systems. 2020;.
- [25] Takahashi H, Iwata T, Yamanaka Y, et al. Student-t variational autoencoder for robust density estimation. In: International Joint Conference on Artificial Intelligence; 2018. p. 2696–2702.
- [26] Loaiza-Ganem G, Cunningham JP. The continuous bernoulli: fixing a pervasive error in variational autoencoders. Advances in Neural Information Processing Systems. 2019;32.
- [27] Leurent E. An environment for autonomous driving decision-making [https://github.com/eleurent/highway-env]; 2018.
- [28] Paszke A, Gross S, Chintala S, et al. Automatic differentiation in pytorch. In: Advances in Neural Information Processing Systems Workshop; 2017.
- [29] Hoyer PO. Non-negative matrix factorization with sparseness constraints. Journal of machine learning research. 2004;5(9).
- [30] Kemp CC, Edsinger A, Clever HM, et al. The design of stretch: A compact, lightweight mobile manipulator for indoor human environments. In: International Conference on Robotics and Automation; IEEE; 2022. p. 3150–3157.
- [31] Kobayashi T, Enomoto T. Towards autonomous driving of personal mobility with small and noisy dataset using tsallis-statistics-based behavioral cloning. arXiv preprint arXiv:2111.14294. 2021.
- [32] Aotani T, Kobayashi T, Sugimoto K. Meta-optimization of bias-variance trade-off in stochastic model learning. IEEE Access. 2021;9:148783–148799.
- [33] Sagi O, Rokach L. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2018;8(4):e1249.
- [34] Graves A, Bellemare MG, Menick J, et al. Automated curriculum learning for neural networks. In: International conference on machine learning; PMLR; 2017. p. 1311–1320.
- [35] Elfwing S, Uchibe E, Doya K. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks. 2018;107:3–11.
- [36] Ba JL, Kiros JR, Hinton GE. Layer normalization. arXiv preprint arXiv:1607.06450. 2016.
Appendix A Learning configurations
Meaning | Setting |
---|---|
Kernel sizes for convolutional layers | [4, 3, 3, 3, 3] |
Channel sizes for convolutional layers | [8, 16, 32, 64, 128] |
Unit sizes for FC layers ① | [8, 32, 128] |
Unit sizes for FC layers ② | [100, 80, 60] |
Kernel sizes for deconvolutional layers | [3, 3, 3, 3, 4] |
Channel sizes for deconvolutional layers | [128, 64, 32, 16, 8] |
Unit sizes for FC layers ③ | [128, 32, 8] |
Unit sizes for FC layers ④ | [60, 80, 100] |
The size of latent space | {50, 30} |
Activation function for all layers | Swish [35] |
+ Layer normalization [36] |
Meaning | Setting in simulation | Setting in experiment |
---|---|---|
Unit sizes for dynamics | [50, 50] | [10, 10] |
Unit sizes for reward | [20, 10] | [10, 10] |
Activation function for all layers | Tanh + Layer normalization [36] |
Meaning | Setting for VAE | Setting for dynamics | Setting for reward |
---|---|---|---|
Optimizer | t-Adam [24] | Adam [23] | Adam [23] |
Learning rate | |||
Batch size | 256 | 512 | 512 |
In VAE, a continuous Bernoulli distribution [26] is employed for the image decoder, and a diagonal Gaussian distribution is employed for the velocity decoder. Since the velocity as continuous data is more prone to numerical divergence than the image, $q_1$ in eq. (23) is for the image and $q_2$ corresponds to the velocity. The encoder is given by a diagonal Gaussian distribution.
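A sketch of this mixed decoder likelihood is given below, assuming `img_probs` and `vel_mean`/`vel_std` come from the decoder networks and the image is flattened per sample.

```python
import torch
from torch.distributions import ContinuousBernoulli, Normal

def decoder_log_likelihoods(x_img, x_vel, img_probs, vel_mean, vel_std):
    """Per-class log-likelihoods ln p(x_c | z) for the image (continuous Bernoulli)
    and the velocity (diagonal Gaussian), as used in the class decomposition."""
    ll_img = ContinuousBernoulli(probs=img_probs).log_prob(x_img).sum(-1)
    ll_vel = Normal(vel_mean, vel_std).log_prob(x_vel).sum(-1)
    return ll_img, ll_vel  # combined through pseudo-additivity, eqs. (19)-(20)
```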
The details of each module in the figure are summarized in Table 4. In addition, the network architecture for the world model is also summarized in Table 5. The conditions related to the learning of the respective modules are summarized in Table 6. In this way, after training multiple models of VAE with different random seeds, the world model is trained by selecting the median model among them.
A.1 Configurations for simulation
Symbol | Meaning | Value |
The size of action space | 2 | |
The size of image observation | 64×64×3
The size of velocity observation | 2 | |
– | Control frequency | 10 Hz |
Horizon step | 10 | |
The number of candidates | 10,000 | |
– | Maximum iteration | 10 |
Elite ratio | 0.01 | |
Smooth update | 0.4 |
The network architecture of VAE implemented by PyTorch [28] is illustrated in Fig. 5. Since the simulation task is more complex than the experimental one, the latent dimension size is given a larger value of 50, and the world model is also designed slightly larger than in the experiment (although still small enough).
The control frequency is set to 10 Hz. Because it is necessary to predict some time ahead for safe driving, the horizon is set to be $H = 10$. Other configurations are summarized in Table 7.
For collecting a dataset, CEM under the true world model with the true state (i.e. the position and velocity of each car) is utilized. Using it, 52,365 tuples for training, 971 tuples for validation, and 2,863 tuples for testing are collected. Note that noise is injected into the actions from CEM to generate diverse data.
A.2 Configurations for experiment
Symbol | Meaning | Value |
The size of action space | 2 | |
The size of image observation | 64×64×3
The size of arm observation | 4 | |
– | Control frequency | 1 Hz |
Horizon step | 5 | |
The number of candidates | 10,000 | |
– | Maximum iteration | 20 |
Elite ratio | 0.01 | |
Smooth update | 0.4 |
The network architecture for VAE is almost identical to Fig. 5 and Table 4, except for replacing the velocity observation with the 4D arm observation and reducing the latent dimension size to 30. In contrast, the network architecture for the world model is lightened due to the simplicity of the task and the importance of real-time control, as shown in Table 5.
For the control by CEM, the control frequency is set to 1 Hz. However, in consideration of other processes, such as encoding the observation, the allowable computational time for CEM is limited to 0.56 s. Since long-horizon future predictions do not play a significant role in control performance in this task, $H = 5$ is set to reduce the computational cost. Other configurations are summarized in Table 8.
For collecting a dataset, a simple feedback controller with the explicit target position is employed. Using it, 2,983 tuples for training, 63 tuples for validation, and 107 tuples for testing are collected. Note that the number of data is far less than that for the simulation due to the lack of diversity. Therefore, the noise injected into the actions is also reduced.