¹ Mila, University of Montreal, ² Microsoft Research, New York, NY, ³ Google DeepMind, ⁴ Google Research
Corresponding authors: adidolkar123@gmail.com

Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning

Aniket Didolkar ¹, Kshitij Gupta ¹, Anirudh Goyal ¹, Nitesh B. Gundavarapu ⁴
Alex Lamb ², Nan Rosemary Ke ³, Yoshua Bengio ¹

Abstract

Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations, as the entire history of a sequence is represented by a single vector. By contrast, Transformers have little inductive bias towards learning temporally compressed representations, as they allow for attention over all previously computed elements in a sequence. Having a more compressed representation of a sequence may be beneficial for generalization, as a high-level representation may be more easily re-used and re-purposed and will contain fewer irrelevant details. At the same time, excessive compression of representations comes at the cost of expressiveness. We propose a solution which divides computation into two streams. A slow stream that is recurrent in nature aims to learn a specialized and compressed representation, by forcing chunks of $K$ time steps into a single representation which is divided into multiple vectors. At the same time, a fast stream is parameterized as a Transformer to process chunks consisting of $K$ time-steps conditioned on the information in the slow-stream. In the proposed approach we hope to gain the expressiveness of the Transformer, while encouraging better compression and structuring of representations in the slow stream. We show the benefits of the proposed method in terms of improved sample efficiency and generalization performance as compared to various competitive baselines for visual perception and sequential decision making tasks.

1 Introduction

The interplay between fast and slow mechanisms for information processing and perception has been studied in both cognitive science and machine learning DBLP:conf/nips/BaHMLI16; Hinton87usingfast. In the brain, short-term and long-term memory have developed in a specialized way. Short-term memory is allowed to change very quickly to react to immediate sensory inputs and perception. It also tends towards high capacity storage of all pieces of information which may be relevant for future reasoning jonides2008mind; atkinson1971control; averbach1961short. By contrast, long-term memory changes slowly kolodner1983maintaining; jeneson2012working, is highly selective and involves repeated consolidation. It contains a set of memories that summarize the entire past, only storing details about observations which are most relevant goelet1986long; baddeley1984attention.

Deep learning has seen a variety of architectures for processing sequential data (hochrieter1997long; schuster1997bidirection; cho2014gru). Sequence models which only rely on a short fixed context window may be seen as a form of fast-processing mechanisms, because there is no dependence on the longer-term history. Early research in deep learning found that robustness is substantially improved by using a recurrent neural network to represent the entire history of the sequence, with direct evidence found in the domain of handwriting synthesis (graves2013generating). These recurrent neural networks, which force the entire history to be compressed into a single hidden state, became highly successful across speech recognition, natural language processing, and other sequence domains (graves2013speech; graves2008offline; graves2005framewise; sutskever2014sequence). Recurrent neural networks also saw significant advances in architectural inductive biases, such as a bias towards making small and selective changes to the hidden state on each step, which helped improve their generalization performance DBLP:conf/iclr/KruegerMKPBKGBC17; DBLP:conf/iclr/HenaffWSBL17; goyal2019recurrent.

Figure 1: Perceptual module + Temporal Latent Bottleneck Model. ${\mathcal{F}}$ denotes the perceptual module or the fast stream, which is a Transformer. ${\mathcal{I}}$ represents the temporal latent bottleneck state (consisting of a set of vectors), which is updated using a recurrent function denoted by ${\mathcal{G}}$ . The given sequence is first divided into chunks of size $K$ and each chunk $X_{l}$ is processed by ${\mathcal{F}}$ , which consists of interleaved Self Attention + FFN (denoted in blue) and Cross Attention + FFN (denoted in green) layers. The Cross Attention + FFN layers allow the representation of ${\mathcal{F}}$ to be conditioned on top-down information from ${\mathcal{I}}$ . The temporal latent bottleneck state is updated using the outputs of ${\mathcal{F}}$ by a recurrent function ${\mathcal{G}}$ , which consists of a Cross Attention + FFN layer as shown in the circle.

Despite their wide applicability, recurrent neural networks have struggled to model long sequences, especially in the presence of long-range dependencies. To help address this, attention was added to recurrent neural networks to allow for information processing which can selectively bypass the recurrent state (Bahdanau2015NeuralMT). In later works the recurrent state was removed entirely in favor of the Transformer architecture, which shares information between positions using attention (vaswani2017attention). Research on scaling laws has shown that the expressiveness advantage of Transformers over recurrent neural networks grows with the length of the sequences being modeled (devlin2018bert; Radford2018ImprovingLU; brown2020language).

Transformers have become the dominant architecture across a wide range of domains including vision (dosovitskiy2020vit), natural language (devlin2018bert; Radford2018ImprovingLU; brown2020language; zhang2022opt; chowdhery2022palm; rae2022scaling), and reinforcement learning (chen2021decision; janner2021reinforcement). Despite their success, it is well known that Transformers are extremely data hungry and work well mainly at scale (dosovitskiy2020vit; seung2021vision; liu2021efficient). This may be attributed to the inductive bias in Transformers to model all possible pairwise interactions between the inputs which also results in a very high computational complexity that scales quadratically with the input size. The lack of selectivity in interactions in the Transformer makes them extremely memory inefficient and may bias them towards modeling unnecessary, irrelevant, or noisy interactions.

This motivates the use of a Temporal Latent Bottleneck, with the goal of improving Transformers while using less data and memory. The proposed model aims at introducing selectivity in the interactions by dividing computation into two streams: (1) a high-level slow stream consisting of a set of vectors updated in a recurrent manner (also referred to as the temporal latent bottleneck), and (2) a low-level attention-based fast stream that processes the input. The fast stream processes locally neighbouring information within chunks, while the slow stream contains information about distant tokens across chunks. The slow stream and the fast stream interact using a multi-head attention mechanism to achieve selectivity in how local and distant information is mixed. We show that the resulting model substantially outperforms Transformers and demonstrates improved generalization, especially when the test data includes novel challenges for the model that were not encountered during training.

2 Methodology

We now present the proposed approach in detail. Our model jointly leverages the strengths of Transformers (vaswani2017attention) and recurrent neural networks (cho2014gru; hochrieter1997long). The fast stream operates on raw inputs and is instantiated using a Transformer. We denote this Transformer by $\mathcal{F}$ . The slow stream operates at a higher level and updates at a lower frequency than the fast stream. Any recurrent function can be used to instantiate the slow stream.

2.1 Desiderata for Fast and Slow Streams of Processing

We give the detailed description of the proposed model in the next section. Here, we give an overview of our architecture and discuss some of its key properties. We present a detailed diagram of our model in Figure 1. Given an input sequence, it is first divided into chunks of size $K$ . Each chunk is processed by a Transformer (denoted as ${\mathcal{F}}$ ), which is also known as the perceptual module (since it processes the raw input) or the fast stream. While processing each chunk, ${\mathcal{F}}$ is also conditioned on information from the slow stream ${\mathcal{G}}$ , which is also called the temporal latent bottleneck module. The slow stream is a recurrent stream which has its own state consisting of a set of $N$ vectors, also called the temporal latent bottleneck state and denoted as ${\mathcal{I}}$ in Figure 1. In the following sections, we use the term temporal latent bottleneck to refer to the temporal latent bottleneck state ${\mathcal{I}}$ . This state is updated once per chunk using information from the perceptual module through a cross attention mechanism.

The perceptual module operates within each chunk, while the temporal latent bottleneck operates across chunks, slowly updating itself after each chunk has been processed by the perceptual module. Thus, the only way the perceptual module gets information about inputs beyond its own chunk is through the temporal latent bottleneck. One advantage of this is that the computational complexity of the attention mechanism in the proposed model is ${\mathcal{O}}(K^{2}+KN)$ per chunk, while that of a Transformer is ${\mathcal{O}}(T^{2})$ , where $T$ is the length of the sequence, $K$ is the chunk size, and $N$ is the number of temporal latent bottleneck state vectors. Since $K\ll T$ and $N\ll T$ , we have $K^{2}+KN\ll T^{2}$ ; summed over the $\lfloor T/K\rfloor$ chunks, the total cost is ${\mathcal{O}}(T(K+N))$ , which is still far below ${\mathcal{O}}(T^{2})$ . Therefore the proposed model has a much lower computational complexity than a Transformer. Furthermore, the capacity of the temporal latent bottleneck is limited and much smaller than that of the perceptual module. This encourages the temporal latent bottleneck to represent the most salient information about the past, while the perceptual module represents only local information. This creates an information asymmetry between the two streams, which gives the perceptual module a fine grained view of the nearby inputs but a very coarse grained view of the distant past. This is very different from the usual self-attention, which attends to all tokens in the sequence at the same level of granularity.
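As a rough illustration of the complexity comparison above, the following sketch counts the query-key pairs scored by attention, used here as a proxy for cost. The function name and the values of $T$, $K$, and $N$ are hypothetical and are not taken from the experiments; only the two terms stated in the text ($K^{2}$ and $KN$) are counted.

```python
def tlb_attention_cost(T: int, K: int, N: int) -> dict:
    """Count query-key pairs scored by attention (a proxy for cost, not a benchmark)."""
    num_chunks = T // K
    # K*K pairs for self attention inside a chunk, K*N pairs for cross attention to the state.
    per_chunk = K * K + K * N
    return {
        "per_chunk": per_chunk,
        "all_chunks": num_chunks * per_chunk,
        "full_transformer": T * T,
    }


# Hypothetical sizes, chosen only for illustration.
cost = tlb_attention_cost(T=1024, K=32, N=8)
print(cost)
```

With these illustrative sizes the chunked model scores roughly 25 times fewer pairs than full self-attention over the same sequence.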

One advantage of having a bottlenecked view of the past is that it allows the model to forget irrelevant information. For example, if an agent is navigating in a large maze, it does not need to have fine grained knowledge of its actions from the distant past. In the case of a Transformer, it would attend to every step from the past (including steps from the distant past) which may be irrelevant in the present context thus wasting its capacity in modeling irrelevant details.

Introducing specialized fast and slow processing mechanisms in machine learning is not completely new. A number of works (zhang2021multiscale; lu2021swin; wu2021cvt; yuan2021incorporating; wang2021pyramid; yang2021focal) build on vision Transformers (dosovitskiy2020vit) by introducing convolution-like hierarchies in Transformers. The main idea behind these models is to progressively reduce the size of the feature maps, similar to convolutions, so that some parts of the network can specialize at a coarser granularity. Our way of specializing fast and slow processing is different. We do not progressively reduce the size of the feature maps in deeper layers; rather, information flows both ways between the fast processing stream and the slow processing stream. This allows the higher level (i.e. the slow stream) to convey information about the long-term context to the lower levels (i.e. the fast stream) as top-down conditioning information. In contrast, previous works on introducing hierarchies in vision Transformers do not allow the higher levels to convey any information to the lower levels. Past works have shown the usefulness of top-down information sharing between specialized processing streams. mittal2020learning explored an architecture in which multiple recurrent streams are active at every time-step, but encouraged one stream to be more active by providing it direct access to the input. This selective sharing of information between the streams using attention led to better generalization. fan2021addressing showed that top-down sequential feedback in Transformers, although computationally expensive, improves the learning of long term dependencies.

Through our experiments we show that the proposed model substantially outperforms Transformers on tasks where generalization to variations unseen during training is critical. We verify this improvement in imitation learning (behavior cloning), where the training and evaluation distributions can diverge as a result of the agent visiting states which are not present in the dataset. Next, we describe the detailed implementation of our model.

2.2 Computational Steps

We first describe our notation. We denote the input ${\bm{X}}$ as a sequence of ${\bm{T}}$ tokens: ${\bm{X}}=[{\bm{x}}_{0},{\bm{x}}_{1},{\bm{x}}_{2},\ldots,{\bm{x}}_{T-1}]$ . We chunk this input into chunks of size ${\bm{K}}$ , resulting in $\lfloor{\bm{T}}/{\bm{K}}\rfloor$ chunks. We refer to the $l^{th}$ chunk as $X_{l}$ . We represent the state of the temporal latent bottleneck $\mathcal{I}$ (i.e. the slow stream) as a set of $N$ $d$ -dimensional vectors. As mentioned previously, we denote the temporal latent bottleneck module as ${\mathcal{G}}$ and the perceptual module as ${\mathcal{F}}$ . ${\mathcal{G}}$ updates the temporal latent bottleneck state while ${\mathcal{F}}$ processes chunks $X_{l}$ to form the latent representation $\bar{X}_{l}$ :

	$\displaystyle\text{{Perceptual Module}}\quad\bar{X}_{l}$	$\displaystyle={\mathcal{F}}(X_{l},{\mathcal{I}}_{l})$		(1)
	$\displaystyle\text{{Temporal Latent Bottleneck Module}}\quad{\mathcal{I}}_{l+1}$	$\displaystyle={\mathcal{G}}({\mathcal{I}}_{l},\bar{X}_{l})$		(2)
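The chunking and the recurrence in Equations 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: `split_into_chunks` and `run_two_streams` are hypothetical names, dropping an incomplete trailing chunk is an assumption, and the toy `F` and `G` passed in at the end are arbitrary stand-ins for the real modules.

```python
def split_into_chunks(tokens, K):
    """Split T tokens into floor(T / K) chunks of size K (assumption: the remainder is dropped)."""
    return [tokens[l * K:(l + 1) * K] for l in range(len(tokens) // K)]


def run_two_streams(tokens, K, F, G, I0):
    """Apply Equations 1 and 2 chunk by chunk; F and G are caller-supplied stand-ins."""
    state, outputs = I0, []
    for X_l in split_into_chunks(tokens, K):
        X_bar = F(X_l, state)    # Equation 1: the fast stream sees the chunk and the current state
        state = G(state, X_bar)  # Equation 2: the slow stream is updated once per chunk
        outputs.append(X_bar)
    return outputs, state


# Toy stand-ins (hypothetical): F adds the state to each token, G accumulates the chunk sum.
outputs, final_state = run_two_streams(
    tokens=[1, 2, 3, 4, 5, 6, 7],
    K=3,
    F=lambda X, I: [x + I for x in X],
    G=lambda I, X_bar: I + sum(X_bar),
    I0=0,
)
print(outputs, final_state)
```

The second chunk's output depends on the first chunk only through the state, which is the only channel between chunks in the proposed model.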

Next, we go into details of the implementation of these modules.

Preliminaries. The central component of our model is the key value attention mechanism (Bahdanau2015NeuralMT; vaswani2017attention). The attention mechanism allows the model to dynamically select information from a set of ${\mathcal{N}}$ read vectors ${\mathcal{R}}$ to update a set of ${\mathcal{M}}$ write vectors ${\mathcal{W}}$ . The write vectors are first projected to a set of queries $Q=W^{q}{\mathcal{W}}$ , where $Q\in\mathbb{R}^{{\mathcal{M}}\times D}$ . The read vectors are projected into keys $K=W^{k}{\mathcal{R}}$ and values $V=W^{v}{\mathcal{R}}$ , where $K\in\mathbb{R}^{{\mathcal{N}}\times D}$ and $V\in\mathbb{R}^{{\mathcal{N}}\times D}$ . The updated write vectors $\bar{{\mathcal{W}}}$ are obtained as:

	$\displaystyle\bar{{\mathcal{W}}}$	$\displaystyle=\textsc{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{D}}\right)V$		(3)

The first term $\frac{QK^{T}}{\sqrt{D}}$ is a scaled dot product that determines the attention score given to each query-key pair. The softmax normalizes the scores across the key dimension. Pairs with high similarity will be assigned a high score. The final update to the write vector is obtained by taking the convex combination of values weighted by the attention scores. The convex combination ensures that the dynamic selection of information is differentiable and can be trained by backpropagation. The updated vectors $\bar{{\mathcal{W}}}$ are utilized by residually adding them to the write vectors ( ${\mathcal{W}}={\mathcal{W}}+\bar{{\mathcal{W}}}$ ). If the read and write vectors refer to the same set of vectors, the attention mechanism is called a self-attention mechanism (vaswani2017attention). If the read and write vectors refer to separate sets of vectors, the attention mechanism is called a cross attention mechanism (goyal2019recurrent; goyal2021coordination; jaegle2021perceiver). In this paper, we use both kinds of attention mechanisms.
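The attention computation described above can be sketched in plain Python. This is a single-head illustration under the assumption that the queries, keys, and values have already been projected; the learned projections $W^{q}$ , $W^{k}$ , $W^{v}$ and the residual addition are left out, and the example vectors are made up.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(D)) V for lists of row vectors."""
    D = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) for k in K]
        weights = softmax(scores)  # non-negative and summing to 1 over the keys
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out


# Two write vectors (queries) read from three read vectors (keys and values).
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_bar = attention(Q, K, V)
print(W_bar)
```

Because the weights form a convex combination, each output row stays inside the convex hull of the value vectors; here every value row sums to 1, so every output row does as well.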

The Transformer architecture, which stacks a series of self attention and FFN layers, is another central component of our model. Each FFN layer is a two-layer MLP that projects the input to a larger dimension and projects it back to the input dimension. We use the GELU activation (hendrycks2016bridging) for the FFN.

	$\displaystyle\textsc{FFN}(x)$	$\displaystyle=W_{2}\cdot\textsc{GELU}(W_{1}x+B_{1})+B_{2}$		(4)
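A minimal sketch of the FFN applied to a single vector is given below. The exact (erf-based) form of GELU is an assumption, since the text does not say whether the exact or the tanh-approximate variant is used, and the weights and sizes are made up for illustration.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF (assumed variant)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def ffn(x, W1, B1, W2, B2):
    """W2 . GELU(W1 x + B1) + B2: expand to the hidden size, then project back."""
    hidden = [gelu(h + b) for h, b in zip(matvec(W1, x), B1)]
    return [o + b for o, b in zip(matvec(W2, hidden), B2)]


# Hypothetical sizes: input dimension 2, hidden dimension 4.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
B2 = [0.5, -0.5]
y = ffn([0.0, 0.0], W1, B1, W2, B2)
print(y)
```

For a zero input with zero hidden bias, every hidden unit is GELU(0) = 0, so the output reduces to the output bias.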

Given the attention and FFN layers, a single Transformer layer can be written as -

	$\displaystyle x$	$\displaystyle=\textsc{Attention}(\text{LN}(x),\text{LN}(x),\text{LN}(x))+x$
	$\displaystyle x$	$\displaystyle=\textsc{FFN}(\text{LN}(x))+x$		(5)

where LN refers to layer normalization layers (ba2016layer). In general, multiple such Transformer layers are stacked to solve any task.
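The pre-normalization residual pattern of Equation 5 can be sketched as below. This is a simplified illustration: the layer normalization here has no learned gain or bias and uses an assumed epsilon of 1e-5, and the attention and FFN sublayers are passed in as callables (the zero-map stand-ins at the end are hypothetical and only show that the residual path preserves the input).

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def pre_ln_residual(tokens, sublayer):
    """Compute sublayer(LN(tokens)) + tokens, the pattern shared by both lines of Equation 5."""
    normed = [layer_norm(t) for t in tokens]
    update = sublayer(normed)
    return [[u + t for u, t in zip(u_row, t_row)] for u_row, t_row in zip(update, tokens)]


def transformer_layer(tokens, self_attention, ffn):
    """One layer: an attention sublayer followed by an FFN sublayer, each with a residual."""
    tokens = pre_ln_residual(tokens, self_attention)
    return pre_ln_residual(tokens, ffn)


def zero_sublayer(rows):
    """Stand-in sublayer (hypothetical): contributes nothing to the residual stream."""
    return [[0.0] * len(r) for r in rows]


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
y = transformer_layer(x, zero_sublayer, zero_sublayer)
print(y)
```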

We augment the Transformer, which has only self attentions, to include cross attentions too so that the temporal latent bottleneck module and perceptual module can share information with each other. We now describe the specifics of the perceptual module and the temporal latent bottleneck module.

Perceptual Module ${\mathcal{F}}$ . As mentioned previously, the perceptual module refers to the fast stream that acts directly on the input. The perceptual module operates on each chunk separately. Therefore, at any time the inputs to the perceptual module are the tokens corresponding to a particular chunk ${\bm{X}}_{l}=[{\bm{x}}_{l\times K},{\bm{x}}_{l\times K+1},\ldots,{\bm{x}}_{l\times K+K-1}]$ . The perceptual module is a Transformer with self attention layers, cross attention layers, and FFNs. It has two kinds of layers: (1) self attention + FFN; (2) cross attention + FFN. The self attention + FFN layers process the input tokens as described in Equation 5, and the cross attention + FFN layers integrate top-down information from the temporal latent bottleneck state ${\mathcal{I}}$ as follows:

	$\displaystyle{\bm{X}}_{l}$	$\displaystyle=\textsc{Attention}(\text{LN}({\bm{X}}_{l}),\text{LN}({\mathcal{I}}),\text{LN}({\mathcal{I}}))+{\bm{X}}_{l}$
	$\displaystyle{\bm{X}}_{l}$	$\displaystyle=\textsc{FFN}(\text{LN}({\bm{X}}_{l}))+{\bm{X}}_{l}$		(6)

We include one cross attention + FFN layer per $R$ self attention + FFN layers. The diagrammatic representation of the perceptual module is presented in Figure 1 (in the processing of chunk $X_{l}$ ). In the figure, we set $R=1$ .
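The interleaving of the two layer types can be sketched as a simple schedule. The function name and the string labels are hypothetical, and placing the cross attention layer after each group of $R$ self attention layers (rather than before it) is an assumption about the ordering.

```python
def perceptual_layer_schedule(num_self_layers, R):
    """List layer types with one cross attention + FFN layer per R self attention + FFN layers."""
    schedule = []
    for i in range(1, num_self_layers + 1):
        schedule.append("self_attention+ffn")
        if i % R == 0:
            # Assumed placement: the cross attention layer closes each group of R layers.
            schedule.append("cross_attention+ffn")
    return schedule


print(perceptual_layer_schedule(num_self_layers=4, R=2))
```

With $R=1$ , as in Figure 1, the schedule strictly alternates between the two layer types.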

Temporal Latent Bottleneck Module ${\mathcal{G}}$ . The temporal latent bottleneck (TLB) module represents the slow stream that operates on the temporal latent bottleneck state ${\mathcal{I}}$ . ${\mathcal{I}}$ is updated using information from a particular chunk processed by the perceptual module. This update happens once for each chunk of the perceptual module, resulting in $\lfloor{\bm{T}}/{\bm{K}}\rfloor$ updates for ${\mathcal{I}}$ . Since the temporal latent bottleneck state ${\mathcal{I}}$ updates at a lower frequency than the perceptual module, it is expected to capture more stable and slowly changing features, while the perceptual module captures faster changing features, resulting in multiple scales of information representation. An update to the temporal latent bottleneck state ${\mathcal{I}}$ consists of a cross attention operation where the queries come from ${\mathcal{I}}$ and the keys and values come from the output of the perceptual module. This cross attention operation is followed by an FFN update to ${\mathcal{I}}$ . Consider the perceptual module outputs for a chunk $l$ to be $\bar{{\bm{X}}}_{l}=[\bar{{\bm{x}}}_{l\times K},\ldots,\bar{{\bm{x}}}_{l\times K+K-1}]$ . The update operation is implemented as follows:

	$\displaystyle\bar{{\mathcal{I}}}$	$\displaystyle=\textsc{Attention}(\text{LN}({\mathcal{I}}_{l}),\text{LN}(\bar{{\bm{X}}}_{l}),\text{LN}(\bar{{\bm{X}}}_{l}))+{\mathcal{I}}_{l}$
	$\displaystyle{\mathcal{I}}_{l+1}$	$\displaystyle=\textsc{FFN}(\text{LN}(\bar{{\mathcal{I}}}))+\bar{{\mathcal{I}}}$		(7)

The temporal latent bottleneck module introduces the notion of recurrence in our model. We show the details of this module in Figure 1 (inside the circle).
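The attention half of the state update in Equation 7 can be sketched as follows. This is a simplified, single-head illustration: the layer normalization, the learned projections, and the FFN step are omitted, and the example state and chunk outputs are made up. The point it shows is that the state keeps its $N$ vectors no matter how many chunk outputs it reads from.

```python
import math


def cross_attention(queries, reads):
    """Each query takes a softmax-weighted average of the read vectors (scaled dot product)."""
    D = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, r)) / math.sqrt(D) for r in reads]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        out.append([sum(e / z * r[j] for e, r in zip(exps, reads)) for j in range(D)])
    return out


def update_state(I_l, X_bar_l):
    """Queries come from the state, keys and values from the chunk output, plus a residual."""
    attended = cross_attention(I_l, X_bar_l)
    return [[a + i for a, i in zip(a_row, i_row)] for a_row, i_row in zip(attended, I_l)]


I_l = [[0.0, 0.0], [1.0, 0.0]]                               # N = 2 state vectors
X_bar_l = [[1.0, 1.0], [3.0, 1.0], [5.0, 1.0], [7.0, 1.0]]   # K = 4 chunk outputs
I_bar = update_state(I_l, X_bar_l)
print(I_bar)
```

The all-zero state vector attends uniformly and receives the mean of the chunk outputs, while the second state vector concentrates on the outputs with the largest first coordinate.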

Perceptual Module + Temporal Latent Bottleneck Model. We now present our complete architecture integrating both the perceptual module and the temporal latent bottleneck together. Given a sequence of tokens ${\bm{X}}=[{\bm{x}}_{0},{\bm{x}}_{1},{\bm{x}}_{2},\ldots,{\bm{x}}_{T-1}]$ , we chunk this input into chunks of size ${\bm{K}}$ , resulting in $\lfloor{\bm{T}}/{\bm{K}}\rfloor$ chunks. The chunks are processed sequentially one after the other. Each chunk $l$ is first processed using the perceptual module conditioned on information from the temporal latent bottleneck state. The outputs of the chunk are used to update the temporal latent bottleneck state ${\mathcal{I}}$ . The resultant temporal latent bottleneck state is then used to process the next chunk. The full model is presented in Figure 1. We use a Transformer as the perceptual module in our experiments. Thus our main contribution is introducing a temporal latent bottleneck into Transformers and showing its advantages through a variety of experiments. We also present the detailed algorithm for the proposed approach in the Appendix.

3 Related Work

Hierarchical or Multiscale Recurrent neural networks. This work takes inspiration from a wide array of work on introducing multiple scales of processing into recurrent neural networks (chung2016hierarchical; hihi1995hierarchical; mozer1991induction; Schmidhuber91neuralsequence; jan2014clockwork). These works divide the processing into multiple streams each operating at a different temporal granularity. While these works mainly focus on recurrent neural networks and their application is mainly on natural language tasks, we focus on introducing multiple streams of processing and a hierarchical structure into Transformers while also focusing on a broader range of domains beyond natural language.

Transformers. Some of the components we describe in the proposed model have been used previously in various Transformer models. Transformer XL (dai2019transformer) also divides the input into segments. Each segment considers the tokens from the current segment and the previous segment for attention without passing gradients into the previous segments. A number of previous works (zhang2021multiscale; lu2021swin; wu2021cvt; yuan2021incorporating; wang2021pyramid; yang2021focal) have worked on introducing a hierarchical structure in Transformers mainly in the domain of vision. The main goal of these works has been to introduce convolution-like hierarchies into Vision Transformers (dosovitskiy2020vit). While these works progressively reduce the spatial resolution of the inputs in order to introduce hierarchies, we introduce hierarchies by adding another slow stream of information processing and without reducing the spatial resolution of the inputs. We also provision for the higher level of the hierarchy (i.e. the slow stream) to provide information to the lower levels as top-down conditioning which is not possible in any of the previous works.

Top-Down Conditioning. Top-down information is information propagated from higher to lower levels of the network. It represents the model's beliefs about the world and provides context for interpreting perceptual information. mittal2020learning and fan2021addressing have shown the advantages of top-down conditioning in recurrent neural networks and Transformers respectively. These works focus on different streams of processing operating at the same temporal granularity, and the top-down conditioning is provided by higher level streams to the lower level streams. In our case, the top-down conditioning for the perceptual module is provided by the high-level slow stream, which operates at a slower temporal granularity. This allows the perceptual module to be affected by much more long term high level information, as compared to just short-term high level information as in the case of mittal2020learning and fan2021addressing.

The proposed model is similar to a parallel work called block recurrent Transformers (delesley2022block). There are a few differences between our work and theirs. First, they use a sliding window attention, while we divide the input into chunks. Second, they perform cross attention and self attention in parallel, while we find that doing them sequentially and performing cross attention once per $R$ self attention steps yields better results. They also use special tricks to deal with some instabilities in their case, while we find no such instabilities in our model. Finally, while their focus is mainly on natural language tasks, we focus on a broader variety of tasks.

4 Experiments

Table 1: Image Classification. Here we compare the performance of the proposed ViT + TLB model against ViT and SwinV2 on CIFAR10 and CIFAR100 datasets for $64\times 64$ images and $128\times 128$ images. Note that the model is trained only on the $64\times 64$ sized images and then transferred to $128\times 128$ sized images. Results averaged across 3 seeds.

	CIFAR10		CIFAR100
Model	$\bm{64\times 64}$	$\bm{128\times 128}$	$\bm{64\times 64}$	$\bm{128\times 128}$
ViT	93.75	73.18	69.53	47.4
Swin V2	97.66	84.9	79.95	58.59
ViT + TLB	94.79	84.38	79.17	59.19

In this section, we outline the tasks in which we applied the temporal latent bottleneck and direct the reader to the appendix for more details on the experiments. Our goal is to show the wide applicability and benefits offered by the temporal latent bottleneck, which we refer to as TLB. We demonstrate that the proposed model outperforms competitive baselines across many domains including vision, reinforcement learning, and natural language. Our main goal is to show that the proposed approach has high expressive power like Transformers while also being sample efficient, unlike Transformers. Thus our main baselines are based on the original Transformer architecture. For example, we compare against ViT (dosovitskiy2020vit) in image classification, Decision Transformer (chen2021decision) in reinforcement learning, and a vanilla Transformer in the rest of the tasks. We also compare against some representative baselines that offer some of the key properties that our model offers. For example, we compare against the state-of-the-art Swin Transformer V2 (swinv2), which is a strong baseline for image classification and is also hierarchical, similar to the proposed model. We also compare against Transformer LS (zhu2021long), which also processes long-term and short-term information using different attention streams. Furthermore, we compare against Transformer-XL, which also divides the input into segments similar to the proposed model. Another key point of the proposed model is that no position can attend to information from the future beyond its chunk, since the temporal latent bottleneck only summarizes the past, not the future. Meanwhile, all the baselines we consider have bidirectional context, i.e. they can attend to all of the past and the future. We observe that despite this limitation, the proposed model outperforms all the considered baselines.

4.1 Temporal Latent Bottleneck For Perception

Table 2: Here we show the performance of the proposed ViT + TLB model without top-down conditioning. We can see that the model suffers a significant drop in performance without top-down conditioning thus showing the importance of top-down conditioning. Results averaged across 3 seeds.

	CIFAR100
Model	$\bm{64\times 64}$	$\bm{128\times 128}$
ViT + TLB	79.17	59.19
w/o Top-Down Condn.	76.04	49.22

Image Classification. Recently, Transformers have been widely applied for visual perception and have shown strong performance improvements over CNNs in tasks such as image classification, semantic segmentation, and instance segmentation. In this work we focus on image classification using Transformers. For a model to do well on image classification, it should learn to focus only on the relevant information and ignore other details (e.g. background information). Self attention does not inherently have this inductive bias of ignoring irrelevant information, since it models all pairwise interactions between the inputs. We posit that adding a limited bandwidth temporal latent bottleneck into the Transformer will allow the model to focus only on the most important information in the image, which should enable the model to perform well.

We test our hypothesis on the CIFAR10 and CIFAR100 (Krizhevsky09learningmultiple) image classification datasets. We also test the generalization abilities of the models by comparing their performance on images of higher resolution ( $128\times 128$ ) than seen during training ( $64\times 64$ ). We use ViT (dosovitskiy2020vit) and Swin Transformer V2 (denoted as Swin V2) (swinv2) as our baselines. Swin Transformer V2 has a key strength of generalizing to higher resolution images than those seen during training, making it a strong baseline. The input image is split into patches of size $4\times 4$ and fed in raster order to all the models. For the proposed model we use ViT as the perceptual module and add a temporal latent bottleneck module to it. We call this model ViT + TLB. To predict the classification scores, we take the mean across the final temporal latent bottleneck state vectors and pass the resulting representation through an MLP. We present the results for this experiment in Table 1. We can see that ViT + TLB outperforms ViT in all cases and performs competitively with Swin Transformer V2. For further hyperparameter details, we refer the reader to the Appendix.
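The input pipeline and the pooling step described above can be sketched as follows. This is an illustration under simplifying assumptions: the image is a single-channel list of rows, the function names are hypothetical, and the MLP head that follows the pooling is not shown.

```python
def raster_patches(image, patch):
    """Split an H x W image (list of rows) into patch x patch blocks, left to right, top to bottom."""
    H, W = len(image), len(image[0])
    patches = []
    for top in range(0, H - patch + 1, patch):
        for left in range(0, W - patch + 1, patch):
            patches.append([row[left:left + patch] for row in image[top:top + patch]])
    return patches


def num_patches(side, patch=4):
    """Number of patches for a square image, i.e. the sequence length seen by the model."""
    return (side // patch) ** 2


def mean_pool(state):
    """Average the N state vectors into one vector (the input to the classifier MLP)."""
    N = len(state)
    return [sum(col) / N for col in zip(*state)]


image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(raster_patches(image, patch=2))
print(num_patches(64), num_patches(128))
print(mean_pool([[1.0, 2.0], [3.0, 4.0]]))
```

With $4\times 4$ patches, a $64\times 64$ image yields 256 tokens and a $128\times 128$ image yields 1024 tokens, so the transfer setting quadruples the sequence length while the pooled state keeps the same size.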

One essential component of our model is top-down conditioning. Top-down information helps integrate information from the past into the perceptual module. We hypothesize that top-down conditioning enables the perceptual module to pay attention to the most important information in the input. We conjecture that without this, the perceptual module would have no prior knowledge of which input patches to pay more attention to, hence degrading the performance of the model. To test this, we run an experiment where we omit top-down conditioning. We show the results for this ablation in Table 2. We can see that the performance drops significantly without top-down conditioning, thus showing its importance.

Self Supervised Learning. Many recent works have used vision Transformers for self-supervised learning (bao2021beit; ahmed2021sit; he2021masked; caron2021emerging; li2021mst; li2021efficient). Here we show a proof-of-concept that introducing a temporal latent bottleneck in Vision Transformers results in better self-supervised representations. We consider the SiT model from ahmed2021sit for this experiment. They use 3 objectives to pretrain their model: (1) the Reconstruction Objective, which reconstructs the input image, (2) the Rotation Prediction Objective, which predicts the rotation angle from [ $0^{\circ}$ , $90^{\circ}$ , $180^{\circ}$ , $270^{\circ}$ ], and (3) the Contrastive Objective (similar to SimCLR (simclr)). For the proposed approach, we introduce a temporal latent bottleneck into SiT, resulting in the SiT + TLB model. SiT also uses additional trainable contrastive and rotation tokens as input for calculating the contrastive and rotation objectives respectively. For SiT + TLB, we take the mean across the temporal latent bottleneck state vectors and use the resulting representation for computing the rotation and contrastive objectives. We use a chunk length of 20 for the SiT + TLB model. We use the CIFAR10 dataset (Krizhevsky09learningmultiple) for our experiments. We pretrain the model for 400 epochs and evaluate the pretrained model at different epochs using linear probing. We use the CIFAR10 dataset for linear probing as well. We present the results for this experiment in Figure 2. We can see that the proposed approach outperforms SiT in all cases, thus showing the effectiveness of the proposed architecture for self-supervised learning. For additional details regarding the setup, we refer the reader to the Appendix.

4.2 Temporal Latent Bottleneck for Sequential Decision Making

Transformers have recently been used for sequential decision making in reinforcement learning tasks such as Atari and BabyAI (chen2021decision; iii2022improving). These works deploy Transformers in the offline RL setting where a large number of trajectories are available either through another trained agent or an expert agent. The Transformer is trained as an autoregressive generative model that predicts actions conditioned on the past context. We incorporate the temporal latent bottleneck module into the Transformer and explore its benefits in the RL setting. We test the proposed model in the BabyAI (babyai-env) and Atari (atari-env) benchmarks. We describe our setups in detail below.

Instruction Based Decision Making: BabyAI. BabyAI (babyai-env) provides a suite of environments where the agent has to carry out a given instruction in a partially-observable maze. These instructions test competencies such as going to an object in the maze, placing an object beside another object in the maze, opening a door with a key, etc. Some environments in the benchmark contain instructions that combine multiple competencies sequentially. For example, "pick up a red ball and open the door in front of you after you pick up the grey ball on your left and pick up a red box". Each environment in the BabyAI benchmark has a different type of instruction that tests a different competency. BossLevel is the most complicated environment and contains instructions from all competencies. For more details regarding the various environments from the BabyAI benchmark, we refer the reader to Appendix section LABEL:appendix:babyai.

We train our models with behavior cloning using expert trajectories from an oracle. For evaluation, we test the model by directly deploying it in the environment. We report the success rate which measures whether the agent successfully carried out the given instruction or not. We use a Transformer (vaswani2017attention) as the baseline in these experiments. For the proposed model, we introduce a temporal latent bottleneck into the Transformer-based perceptual module. For the baseline Transformer model, we append the language instruction to the sequence of states allowing the model to attend to the language instruction at each layer. For the proposed model, the language instruction is appended to each chunk, allowing each chunk to attend to it.
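The difference between the two input layouts can be sketched as follows. This is a minimal illustration rather than the authors' implementation: tokens are represented as plain strings, and the function name `chunk_with_instruction` and the chunk length of 2 are assumptions made for the example.

```python
from typing import List, Sequence


def chunk_with_instruction(
    states: Sequence[str], instruction: Sequence[str], chunk_len: int
) -> List[List[str]]:
    """Split the state sequence into chunks and append the instruction to each."""
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    chunks = []
    for start in range(0, len(states), chunk_len):
        # Every chunk carries its own copy of the instruction tokens,
        # so attention within the chunk can reach the instruction.
        chunks.append(list(states[start:start + chunk_len]) + list(instruction))
    return chunks


states = ["s1", "s2", "s3", "s4", "s5"]
instruction = ["go", "to", "door"]

# Baseline layout: one long sequence with the instruction appended once.
baseline_input = states + instruction
# Chunked layout: the instruction is appended to every chunk.
tlb_input = chunk_with_instruction(states, instruction, chunk_len=2)
```

Here the last chunk is shorter than the others because the sequence length is not a multiple of the chunk length; how such a remainder is handled in practice is not specified above.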

We consider two settings - Single task and Multi task. In the single task setting, we evaluate the proposed approach on individual environments from the BabyAI benchmark while in the multi-task setting we train a single model on 8 different environments.

Single Task. We present the results for BossLevel in Figure 3 (left) and present the results for the other tasks in Appendix Figure LABEL:fig:babyai_single_task. While Transformer and Transformer + TLB achieve nearly the same performance at convergence, Transformer + TLB is much more sample efficient and converges much faster. We posit that the temporal latent bottleneck module prevents the model from attending to unnecessary information, which allows it to converge faster.

Multi Task. We present the results for the multi task setting in Figure 3 (right). We train the model on 8 environments: PutNext, Unlock, Synth, GoToSeq, SynthLoc, GoToImpUnlock, BossLevel. We evaluate the model on the same 8 environments and report the average success rate across them. The Transformer + TLB model converges faster and also outperforms the Transformer. We refer the reader to the appendix for more details regarding the model and training.

Table 3: Atari. Here we show that adding a temporal latent bottleneck to the Decision Transformer improves performance across various Atari games. Results are averaged across 10 seeds.

Game	DT	DT + TLB
Breakout	71.51_{$\pm$ 20.58}	87.63_{$\pm$ 16.24}
Pong	13.68_{$\pm$ 2.00}	14.71_{$\pm$ 1.78}
Qbert	3268_{$\pm$ 1773.07}	5019.75_{$\pm$ 1647.13}
Seaquest	1039.11_{$\pm$ 122.90}	1248.22_{$\pm$ 86.62}

Atari. (chen2021decision) recently introduced the Decision Transformer (DT) which learns to play various games in the Atari benchmark from suboptimal trajectories of a learned agent. Decision Transformer models the offline RL problem as a conditional sequence modelling task. The model uses a causal mask and supervised training to match the actions in the offline dataset conditioned on the future expected returns and the past history. This is done by feeding into the model the states, actions, and the return-to-go $\hat{R}_{c}=\sum_{c^{\prime}=c}^{C}r_{c^{\prime}}$ , where $c$ indexes the timesteps and $r_{c^{\prime}}$ is the reward at timestep $c^{\prime}$ . This results in the following trajectory representation: $\tau=\big(\hat{R}_{1},s_{1},a_{1},\hat{R}_{2},s_{2},a_{2},\hat{R}_{3},s_{3},a_{3},\ldots\big)$ , where $a_{c}$ denotes the actions and $s_{c}$ denotes the states. At test time, the start state $s_{1}$ and desired return $\hat{R}_{1}$ are fed into the model and it autoregressively generates the rest of the trajectory. Experimental results show that DT can leverage the strong generalization capabilities of Transformers and achieve the desired returns in a wide variety of tasks in Atari and OpenAI Gym, outperforming previous approaches in offline RL.
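As a small worked illustration of the return-to-go and the interleaved trajectory, the sketch below computes $\hat{R}_{c}$ as a suffix sum of rewards and builds $\tau$ from toy data. The function names and the tagged-tuple token format are assumptions made for the example, not part of the Decision Transformer code.

```python
from typing import Any, List, Sequence, Tuple


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """R_hat[c] is the sum of rewards from timestep c to the end of the episode."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last timestep backwards (suffix sum).
    for c in range(len(rewards) - 1, -1, -1):
        running += rewards[c]
        rtg[c] = running
    return rtg


def interleave(
    rtg: Sequence[float], states: Sequence[Any], actions: Sequence[Any]
) -> List[Tuple[str, Any]]:
    """Build tau = (R_1, s_1, a_1, R_2, s_2, a_2, ...) as tagged tokens."""
    tokens: List[Tuple[str, Any]] = []
    for r, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", r), ("state", s), ("action", a)])
    return tokens


# Toy episode with 3 timesteps (illustrative values only).
rewards = [1.0, 0.0, 2.0]
rtg = returns_to_go(rewards)
tau = interleave(rtg, ["s1", "s2", "s3"], ["a1", "a2", "a3"])
```

For this toy episode the returns-to-go are 3.0, 2.0 and 2.0, and the trajectory contains 9 tokens.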

We use the same setup as (chen2021decision) for our experiments. We set the context length to a fixed number ${\bm{C}}$ . During training, ${\bm{C}}$ timesteps from an episode are sampled and fed into the model, resulting in a trajectory of length $3{\bm{C}}$ (considering 3 modalities: returns-to-go, states, and actions). Each modality is processed into an embedding of size ${\bm{d}}$ ; the state in particular is embedded using a convolutional encoder. The resulting trajectory is fed into the Decision Transformer. The outputs corresponding to the states $s_{c}$ are fed into a linear layer to predict the action $a_{c}$ to be taken at timestep $c$ . For the proposed model, we incorporate a temporal latent bottleneck module into the Decision Transformer.
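The indexing implied by this setup can be sketched as below. The sketch assumes that the sampled timesteps form one contiguous window and that tokens are ordered as (return-to-go, state, action) within each timestep; both function names are hypothetical.

```python
import random
from typing import List, Tuple


def sample_window(
    episode_len: int, context_len: int, rng: random.Random
) -> Tuple[int, int]:
    """Pick a contiguous window [start, end) of `context_len` timesteps."""
    if context_len > episode_len:
        raise ValueError("context length exceeds episode length")
    # randint is inclusive on both ends, so every valid start is reachable.
    start = rng.randint(0, episode_len - context_len)
    return start, start + context_len


def state_token_positions(context_len: int) -> List[int]:
    """Indices of state tokens in the interleaved sequence of length 3 * C."""
    # Each timestep contributes (rtg, state, action), so states sit at 3c + 1.
    return [3 * c + 1 for c in range(context_len)]


rng = random.Random(0)
start, end = sample_window(episode_len=10, context_len=4, rng=rng)
positions = state_token_positions(context_len=4)
```

The model outputs at `positions` are the ones that would be passed to the linear action head.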

We present our results in Table 3. The model is trained on 1% of the Atari DQN-replay dataset (agarwal2019striving) (500K transitions for each game). We use the same 4 games used in (chen2021decision): Pong, Seaquest, Qbert, and Breakout. The proposed model outperforms the Decision Transformer in all the considered games, showing its effectiveness. More details regarding the model and training can be found in Appendix section LABEL:appendix:atari.

4.3 Temporal Latent Bottleneck for Language Modeling

We present preliminary results of applying the Transformer + Temporal Latent Bottleneck model to language modelling on the enwik8 dataset (mahoney2011enwik). We use Transformer XL (dai2019transformer) as our baseline. We find that Transformer XL achieves 0.99 bits per byte while the proposed model achieves 0.97 bits per byte (lower is better), which suggests that the proposed model is useful for language modelling as well. For more details, we refer the reader to Appendix section LABEL:appendix:language_modelling.
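For readers less familiar with the metric, the sketch below shows how bits per byte relates to a cross-entropy loss, and what the reported gap amounts to in relative terms. It assumes a byte-level model whose loss is the mean negative log-likelihood per byte measured in nats; the function names are hypothetical.

```python
import math


def nats_to_bits_per_byte(mean_nll_nats: float) -> float:
    """Convert a per-byte negative log-likelihood from nats to bits."""
    return mean_nll_nats / math.log(2)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Fractional reduction of `proposed` relative to `baseline`."""
    return (baseline - proposed) / baseline


# A loss of ln(2) nats per byte corresponds to exactly 1 bit per byte.
one_bit = nats_to_bits_per_byte(math.log(2))
# Relative gap between the two reported numbers (0.99 vs 0.97).
gap = relative_reduction(0.99, 0.97)
```

Under these assumptions the reported difference corresponds to a relative reduction of roughly 2%.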

Table 4: Long Range Dependencies. Here we compare the performance of the proposed model against the recently proposed long-short Transformer model (zhu2021long) and the vanilla Transformer model (vaswani2017attention). The proposed model outperforms both baselines, indicating its strength in modelling long-range and hierarchical dependencies. Results are averaged across 3 seeds.

Model	ListOps	Text Classification
Transformer	37.64_{$\pm$ 0.0001}	64.0_{$\pm$ 0.0001}
Transformer LS	37.5_{$\pm$ 0.0002}	65.5_{$\pm$ 0.0003}
Transformer + TLB	38.2_{$\pm$ 0.0001}	82.08_{$\pm$ 0.44}

Temporal Latent Bottleneck For Long Range Dependencies. Here, we test the effectiveness of the proposed model in modelling long range dependencies. We apply the proposed model to the ListOps and text classification tasks from the Long Range Arena (LRA) benchmark (yi2020long). Both these tasks have very long sequences ranging from 1K to 4K tokens. Thus, for a model to do well, it has to learn to capture dependencies across very long time scales. Additionally, both tasks have an inherent hierarchical structure. For example, ListOps consists of nested list operations, which makes it hierarchical. For text classification, the inputs consist of text in the form of bytes. Therefore, the model has to learn to compose bytes into characters and characters into words. We hypothesize that the multi-scale nature of the proposed model will be useful in modelling such hierarchical information. The temporal latent bottleneck, which operates at a slow scale, can behave as a composer that combines low-level information in a relevant way to solve the desired task.

For this experiment, we use the same setup as (zhu2021long). For the proposed model, we use a Transformer as the perceptual model and implement the temporal latent bottleneck as described in Section 2.2. We take the mean across the temporal latent bottleneck state vectors and use the resulting representation for classification. We compare the model against the long-short Transformer (LS) model (zhu2021long), which is a recently proposed model for the Long Range Arena benchmark, and the vanilla Transformer model (vaswani2017attention). We present the results in Table 4. The proposed model outperforms both baselines on both tasks, showing its usefulness in modelling long range dependencies. For further details, we refer the reader to Appendix section LABEL:appendix:long_range_arena.

5 Conclusion

We have developed an approach aimed at introducing selectivity in the interactions across time-steps in a Transformer by splitting processing into two streams: (a) a slow stream that is updated in a recurrent manner and (b) a fast stream that processes the visual input. The two streams are parameterized independently and interact with each other via an attentional bottleneck. The information processed by the fast stream is used to change the state of the slow stream, and the information in the slow stream is used by the fast stream as contextual information to process the input. Through our experiments we show that the proposed approach works well across a wide range of domains and problems. One limitation of the proposed model is that the chunk size is fixed and treated as a hyperparameter, which requires some domain knowledge. Future work should explore methods for dynamic chunking.