
Adversarial Memory Networks for Action Prediction

Zhiqiang Tao (equal contribution), Santa Clara University, email: ztao@scu.edu    Yue Bai (equal contribution), Northeastern University, email: bai.yue@northeastern.edu    Handong Zhao, Adobe Research, email: hazhao@adobe.com    Sheng Li, University of Georgia, email: sheng.li@uga.edu    Yu Kong, Rochester Institute of Technology, email: yu.kong@rit.edu    Yun Fu, Northeastern University, email: yunfu@ece.neu.edu
Abstract

Action prediction aims to infer the forthcoming human action from partially-observed videos, which is a challenging task due to the limited information underlying early observations. Existing methods mainly adopt a reconstruction strategy to handle this task, expecting to learn a single mapping function from partial observations to full videos to facilitate the prediction process. In this study, we propose adversarial memory networks (AMemNet) to generate the “full video” feature conditioned on a partial video query from two new aspects. Firstly, a key-value structured memory generator is designed to memorize different partial videos as key memories and dynamically write full videos into value memories with a gating mechanism and querying attention. Secondly, we develop a class-aware discriminator to guide the memory generator to deliver not only realistic but also discriminative full video features via adversarial training. The final prediction result of AMemNet is given by late fusion over the RGB and optical flow streams. Extensive experimental results on two benchmark video datasets, UCF-101 and HMDB51, are provided to demonstrate the effectiveness of the proposed AMemNet model over state-of-the-art methods.

1 Introduction

Action prediction is a highly practical research topic that can be used in many real-world applications such as video surveillance, autonomous navigation, and human-computer interaction. Different from action recognition, which recognizes the human action category from a complete video, action prediction aims to understand the human activity at an early stage, i.e., using a partially-observed video before the entire action has been executed. Typically, action prediction methods [32, 2, 22, 4, 46, 50] assume that a portion of consecutive frames from the beginning of a video is given, referred to as a partial video. The challenges mainly arise from the limited information in the early progress of videos, leading to an incomplete temporal context and a lack of discriminative cues for recognizing actions. Thus, the key problem in action prediction is: how can we enhance the discriminative information of a partial video?

In recent years, many research efforts centering on the above question have been devoted to the action prediction task. Pioneering works [32, 2, 21] mainly handle partial videos by relying on hand-crafted features, dictionary learning, and temporally-structured classifiers. More recently, deep convolutional neural networks (CNNs), especially those pre-trained on large-scale video benchmarks (e.g., Sports-1M [18] and Kinetics [19]), have been widely adopted to predict actions. The pre-trained CNNs, to some extent, compensate for the incomplete temporal context and make it possible to reconstruct full video representations from partial ones. Along this line, existing methods [22, 34, 46, 23] focus on designing models to continuously improve the reconstruction performance, yet without considering the “malnutrition” nature (i.e., the limited temporal cues) of incomplete videos. In particular, it would be more straightforward to learn what “nutrients” (e.g., the missing temporal cues or reconstruction bases) a partial video needs for recognizing actions, compared with mapping it to an entire video. Moreover, it is also challenging to handle various partial videos with a single model.

In this study, we propose a novel adversarial memory networks (AMemNet) model to address the above challenges. The proposed AMemNet leverages an augmented memory network to explicitly learn and store full video features to enrich incomplete ones. Specifically, we treat a partial video as a query and the corresponding full video as its memory. The “full video” is generated from relevant memory slots fetched by the partial-video query. We summarize the contributions of this work in two aspects.

Firstly, a memory-augmented generator model is designed for generating full-video features conditioned on partial-video queries. We adopt a key-value memory network architecture [28, 49] for action prediction, where the key memory slots capture similar partial videos and the value memory slots are extracted from full training videos. The memory writing process is implemented with a gating mechanism and attention weights. The input/forget gates enable AMemNet to dynamically update video memories attended by different queries and thus memorize the variation across different video progress levels.

Secondly, a class-aware discriminator model is developed to guide the memory generator with adversarial training, which not only employs an adversarial loss to encourage generating realistic full video features, but also imposes a classification loss on training the network. By this means, the discriminator network could further push the generator to deliver discriminative full-video features.

The proposed AMemNet obtains prediction results by employing a late fusion strategy over two streams (i.e., RGB and optical flow) following [35, 43]. Extensive experiments on two benchmark datasets, UCF101 and HMDB51, are conducted to show the effectiveness of AMemNet compared with state-of-the-art methods, where our approach surprisingly achieves over 90% accuracy by only observing 10% of the beginning video frames on the UCF101 dataset. A detailed ablation study compared with several competitive baselines is also presented.

2 Related Work

Action Recognition targets recognizing the label of a human action in a given video, which is one of the core tasks of video understanding. Previous works have extensively studied this research problem from several aspects, including hand-crafted features (e.g., spatio-temporal interest points [8, 31], poselet key-frames [30, 27], and dense trajectories [42]), 3D convolutional neural networks [17, 39, 14], recurrent neural network (RNN) based methods [29, 9], and many recent deep CNN based methods such as temporal linear encoding networks [7] and non-local neural networks [45]. Among existing methods, the two-stream architecture [35, 10, 43] forms a landmark [3], which mainly employs deep CNNs on the RGB and optical flow streams to exploit the spatial-temporal information inside videos. In this work, we also adopt the two-stream structure as it naturally provides complementary information for the action prediction task: the RGB stream contributes more in the early observation and the optical flow stream leads in the later progress.

Action Prediction has attracted considerable research effort [22, 34, 46, 23, 4, 50] in recent years. It aims to predict action labels from the early progress of videos and is thus a special case of video-based action recognition. Previous works [32, 2, 26, 21] solve this task via hand-crafted features, while recent works [22, 34, 46, 23, 5, 20, 13, 4, 50] mainly rely on pre-trained deep CNN models for encoding videos, such as the 3D convolutional networks in [22, 23, 46], deep CNNs in [34, 4], and two-stream CNNs in [5, 20, 13, 50]. Among these works, the most common way of predicting actions is to design deep neural network models that reconstruct full videos from partial ones, such as deep sequential context networks [22], the RBF kernelized RNN [34], progressive teacher-student learning networks [46], and adversarial action prediction networks [23]. Other interesting methods include LSTM-based models [33, 20, 47], part-activated deep reinforcement learning [4], residual learning [13, 50], motion prediction [1], and asynchronous activity anticipation [51].

The memory-augmented LSTM (Mem-LSTM) [20] model and adversarial action prediction networks (AAPNet) [23] share some ideas with our approach. However, several essential differences between the proposed AMemNet and Mem-LSTM/AAPNet can be summarized. First, memory networks play distinct roles in Mem-LSTM [20] and AMemNet. Mem-LSTM formulates action labels as video memories and adopts the memory network as a nearest-neighbor classifier. Differently, the proposed AMemNet develops a key-value memory architecture as a generator model and learns value memory slots from full videos as reconstruction bases for the generation purpose. Second, the generator models used in AAPNet [23] and AMemNet are different. AAPNet [23] employs a variational-GAN model, whereas AMemNet develops a memory-augmented generator to explicitly provide auxiliary information for generating full-video features of testing videos.

Memory Networks, i.e., memory-augmented neural networks [48, 12], generally consist of two components: 1) a memory matrix and 2) a neural network controller, where the memory matrix stores information as memory slots and the neural network controller is generally designed for addressing, reading, and writing memories. Representative memory network architectures include end-to-end memory networks [37], key-value memory networks [28, 49], neural Turing machines [12, 44], and recurrent memory networks [40, 38]. Memory networks work well in practice owing to their flexibility in storing auxiliary knowledge and their ability to memorize long-term temporal information. The proposed AMemNet shares the same memory architecture as [28, 49] and employs the memory writing method provided in [12], but it is designed for a different purpose: the memory module in AMemNet is tailored to the action prediction problem.

Figure 1: Illustration of training the proposed AMemNet on the RGB stream. The attention weight $\alpha$ is first given by a query embedding $\mathbf{q}$ and a key memory matrix $\mathbf{M}^{k}$ in the memory addressing. Then, $\alpha$ is used for updating the value memory $\mathbf{M}^{v}_{t-1}\rightarrow\mathbf{M}^{v}_{t}$ with real full video features $\mathbf{v}$ governed by the input/forget gates $\mathbf{W}_{e/a}$. The generated full video feature $\mathbf{\hat{v}}$ is obtained by a memory reading operation with $\alpha$ and $\mathbf{M}^{v}_{t}$, and guided by a class-aware discriminator network $D=\{D_{cls},D_{adv}\}$ under adversarial training. The memory update (dashed lines) is disabled during testing.

3 Methodology

3.1 Problem Setup

Given a set of training videos $\{(x,y)\}$, where $x\in\mathcal{X}$ denotes one video sample and $y\in\mathcal{Y}$ refers to its action category label, action prediction aims to infer $y$ by only observing the beginning sequential frames of $x$ instead of using the entire video. Let $\tau\in(0,1]$ be the observation ratio and $L$ be the length (i.e., the total number of frames) of $x$; a partial video is defined as $x_{1:\lfloor\tau L\rfloor}$, i.e., the subsequence of the full video $x$ from the first frame to the $\lfloor\tau L\rfloor$-th frame. We employ a set of observation ratios $\{\tau_{p}\}_{1}^{P}$ to mimic partial observations at different progress levels and define $x_{p}=x_{1:\lfloor\tau_{p}L\rfloor}$ as the $p$-th progress-level observation of $x$, $p\in\{1,\dots,P\}$. By this means, the training set is augmented to $P$ times the original one, i.e., $\{(x_{p},y)\}$. Following existing work [22, 4, 50], we set $P=10$ and increase $\tau_{p}$ from $0.1$ to $1.0$ with a fixed step of $0.1$.

We propose to solve the action prediction problem with memory networks. The partial video $x_{p}$ is formulated as a query to “retrieve” its lost information from a memory block learned from all the full training videos. To encode video memories, we build a training set as $\{(x_{p},v,y)\}$, where $v$ denotes the full video of $x_{p}$, i.e., $v:=x_{P}$. Different from previous works [22, 20, 23], which require the progress level $p$ during training, the proposed AMemNet is “progress-free”. Hence, for convenience, we omit the subscript $p$ of the partial video $x_{p}$ when no confusion arises, and always denote $(x,v,y)$ as a triplet of the partial observation, full observation, and action category of the same video sample throughout the paper.
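
As a concrete illustration, the progress-level augmentation above can be sketched as follows; a video is assumed to be available as a frame sequence (e.g., a list of frame paths or a tensor of shape [L, C, H, W]), and the function name is purely illustrative:

```python
import math

def make_progress_clips(frames, num_levels=10):
    """Return the P progress-level observations x_p of one video.

    The p-th clip keeps the first floor(tau_p * L) frames, with
    tau_p = p / num_levels (0.1, 0.2, ..., 1.0 for P = 10).
    """
    L = len(frames)
    clips = []
    for p in range(1, num_levels + 1):
        tau = p / num_levels
        end = max(1, math.floor(tau * L))  # keep at least one frame
        clips.append(frames[:end])
    return clips

# Training triplets (x_p, v, y): every partial clip is paired with the full
# video v = clips[-1] and the action label y of the same sample.
```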

Following [35, 43], we train the proposed AMemNet model on the RGB frames and optical flows, respectively, and employ a late fusion mechanism to exploit the spatial-temporal information in a two-stream framework. Each stream employs the same network architecture with its own trainable weights. We refer to $(x^{rgb},v^{rgb},y)$ / $(x^{flow},v^{flow},y)$ as the two modalities, and omit the superscripts ($rgb$/$flow$) when unnecessary.

3.2 Adversarial Memory Networks (AMemNet)

Fig. 1 shows the network architecture of the proposed AMemNet model. Overall, our model consists of three components: 1) a query/memory encoder, 2) a memory generator, and 3) a discriminator. The encoder network vectorizes the given partial/full video into feature representations, the memory generator learns to generate a full video representation conditioned on the partial video query, and the discriminator is trained to distinguish between fake and real full video representations and also to deliver prediction scores over all the categories. During training, the value memories are continuously updated with full videos via erase/add vectors that work as input/forget gates. We detail each component in the following.

3.2.1 Query/Memory Encoder

Given a partial video $x$ and its corresponding full video $v$, we employ deep convolutional neural networks (CNNs) as an encoding model to obtain feature representations as follows: $\mathbf{x}=f_{cnn}(x;\theta_{cnn})$ and $\mathbf{v}=f_{cnn}(v;\theta_{cnn})$, where $\mathbf{x}\in\mathbb{R}^{d}$ and $\mathbf{v}\in\mathbb{R}^{d}$ are the encoded representations for $x$ and $v$, respectively, $d$ is the feature dimension, and $\theta_{cnn}$ parameterizes the CNN model. Following [5, 50], we instantiate $f_{cnn}(\cdot;\theta_{cnn})$ with the pre-trained TSN model [43] for its robust and competitive performance on action recognition.

The proposed AMemNet model utilizes the partial video feature $\mathbf{x}$ as a query to fetch relevant memories, which are learned from full training videos, to generate its full video feature $\mathbf{v}$. Hence, it is natural to directly utilize $f_{cnn}(\cdot;\theta_{cnn})$ as the memory encoder for learning memory representations of full videos. On the other hand, to facilitate the querying process, we further encode the partial video representation $\mathbf{x}$ in a lower-dimensional embedding space by

\mathbf{q}=f_{q}(\mathbf{x};\theta_{q}), \quad (1)

where $\mathbf{q}\in\mathbb{R}^{h}$ denotes the query vector, $h<d$ refers to the dimension of the query embedding, and $f_{q}(\cdot;\theta_{q})$ is given by a fully-connected network. By using Eq. (1), the query encoder is formulated by stacking $f_{q}$ on top of $f_{cnn}$. In this work, the memory and query encoders share the same CNN weights, and we freeze $\theta_{cnn}$ with the pre-trained TSN model to avoid overfitting. Thus, the encoding component of AMemNet is mainly parameterized by $\theta_{enc}=\{\theta_{q}\}$.
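
For concreteness, a minimal PyTorch sketch of the query encoder $f_{q}$ is given below, using $d=1024$ TSN features as input and the layer sizes reported later in Section 4.1; the exact ordering of batch normalization and LeakyReLU is our assumption:

```python
import torch.nn as nn

class QueryEncoder(nn.Module):
    """f_q: map a partial-video feature x in R^d to a query q in R^h."""

    def __init__(self, d=1024, hidden=512, h=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden),
            nn.BatchNorm1d(hidden),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(hidden, h),
        )

    def forward(self, x):      # x: [batch, d]
        return self.net(x)     # q: [batch, h]
```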

3.2.2 Memory Generator

We adopt the key-value memory network architecture [28, 49] and develop it as a generator model by

\mathbf{\hat{v}}=G_{mem}(\mathbf{q};\theta_{mem}), \quad (2)

where $G_{mem}(\cdot;\theta_{mem})$ denotes the memory generator and $\mathbf{\hat{v}}\in\mathbb{R}^{d}$ represents the generated full video representation. Particularly, $G_{mem}(\cdot;\theta_{mem})$ includes two memory blocks, termed as a key memory matrix $\mathbf{M}^{k}\in\mathbb{R}^{N\times h}$ and a value memory matrix $\mathbf{M}^{v}\in\mathbb{R}^{N\times d}$, where $N$ is the number of memory slots in each memory block. A memory slot, in essence, is one row of $\mathbf{M}^{k}$/$\mathbf{M}^{v}$, learned with the query $\mathbf{q}$ and full video memory $\mathbf{v}$ during training. The benefit of using such a key-value structure lies in separating the learning process for different purposes: $\mathbf{M}^{k}$ focuses on memorizing different queries of partial videos, while $\mathbf{M}^{v}$ is trained to distill useful information from full videos for generation. To generate $\mathbf{\hat{v}}$, our memory generator $G_{mem}$ conducts the following three steps in sequence.

1) Memory Addressing. The key memory matrix $\mathbf{M}^{k}$ of $G_{mem}$ provides sufficient flexibility to store similar queries (partial videos) for addressing the relevant value memory slots in $\mathbf{M}^{v}$ with querying attentions. The addressing process is computed by

\alpha[i]=\textup{softmax}(\phi(\mathbf{q},\mathbf{M}^{k}[i]))=\frac{\exp(\phi(\mathbf{q},\mathbf{M}^{k}[i]))}{\sum_{j=1}^{N}\exp(\phi(\mathbf{q},\mathbf{M}^{k}[j]))}, \quad (3)

where $\alpha\in\mathbb{R}^{N}$ denotes the soft attention weights over all the memory slots, $\mathbf{M}^{k}[i]$ refers to its $i$-th row, and $\phi(\cdot,\cdot)$ is a similarity score function, which could be given by the dot product $\phi(\mathbf{a},\mathbf{b})=\mathbf{a}^{\mathrm{T}}\mathbf{b}$ (i.e., cosine similarity for normalized embeddings) or the negative $\ell_{2}$ distance $\phi(\mathbf{a},\mathbf{b})=-\|\mathbf{a}-\mathbf{b}\|_{2}$. Notably, Eq. (3) keeps our memory networks end-to-end differentiable [37, 28], so the key slots are optimized by backpropagation.
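
A short sketch of this addressing step with the dot-product score is shown below; `M_k` would be a learnable parameter of shape $N\times h$ in practice:

```python
import torch.nn.functional as F

def address_memory(q, M_k):
    """Eq. (3): soft attention of queries over the key memory.

    q:   [batch, h] query embeddings.
    M_k: [N, h] key memory matrix.
    Returns alpha: [batch, N], each row summing to one.
    """
    scores = q @ M_k.t()              # phi(q, M^k[i]) = q^T M^k[i]
    return F.softmax(scores, dim=1)
```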

2) Memory Writing. The value memory matrix $\mathbf{M}^{v}$ of $G_{mem}$ memorizes full videos for the generation purpose, where the memory slots attended by a partial video query $\mathbf{q}$ are written with its full video representation $\mathbf{v}$. Specifically, $G_{mem}$ updates the value memory matrix with a gating mechanism and attention weights following [12, 28]. Let $t$ be the current training step and $\mathbf{M}_{t-1}^{v}$ be the value memory matrix from the last step; the update $\mathbf{M}_{t}^{v}\leftarrow\mathbf{M}_{t-1}^{v}$ is obtained by

\mathbf{e}_{t} = \textup{sigmoid}(\mathbf{W}_{e}\mathbf{v}), \quad (4)
\mathbf{a}_{t} = \textup{tanh}(\mathbf{W}_{a}\mathbf{v}), \quad (5)
\mathbf{\tilde{M}}^{v}_{t}[i] = \mathbf{M}^{v}_{t-1}[i]\odot(\mathbf{1}-\alpha_{t}[i]\,\mathbf{e}_{t}), \quad (6)
\mathbf{M}^{v}_{t}[i] = \mathbf{\tilde{M}}^{v}_{t}[i]+\alpha_{t}[i]\,\mathbf{a}_{t}, \quad (7)

where $\mathbf{e}_{t}\in\mathbb{R}^{d}$ and $\mathbf{a}_{t}\in\mathbb{R}^{d}$ represent the erase vector and add vector, respectively, $\odot$ denotes element-wise multiplication, and $\alpha_{t}$ is computed by Eq. (3) with the pair $(\mathbf{q},\mathbf{v})$ arriving at the $t$-th training step. In Eqs. (4) and (5), the erase vector $\mathbf{e}_{t}$ and the add vector $\mathbf{a}_{t}$ act like the forget and input gates in the LSTM model [15], implemented by two linear projection matrices (we omit all bias vectors to simplify notation) $\mathbf{W}_{e}\in\mathbb{R}^{d\times d}$ and $\mathbf{W}_{a}\in\mathbb{R}^{d\times d}$, respectively. The vector $\mathbf{e}_{t}$ decides the forgetting degree of the memory slots in $\mathbf{M}^{v}_{t-1}$, while $\mathbf{a}_{t}$ provides the new content written into $\mathbf{M}^{v}_{t}$. By using the query attentions $\alpha_{t}$, Eqs. (6) and (7) mainly update the most attended memory slots ($\alpha_{t}[i]\rightarrow 1$) and leave the slots that are irrelevant to the query $\mathbf{q}$ ($\alpha_{t}[i]\rightarrow 0$) nearly unchanged.
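
The gated write of Eqs. (4)-(7) can be sketched as follows (batch size one for clarity; how a mini-batch of updates is aggregated is an implementation choice not specified here):

```python
import torch

def write_memory(M_v, alpha, v, W_e, W_a):
    """Eqs. (4)-(7): update the value memory with a real full-video feature.

    M_v:      [N, d] value memory from the previous step (M^v_{t-1}).
    alpha:    [N] attention weights of the current query.
    v:        [d] full-video feature.
    W_e, W_a: [d, d] erase/add projections (bias terms omitted).
    """
    e = torch.sigmoid(W_e @ v)                        # erase vector, Eq. (4)
    a = torch.tanh(W_a @ v)                           # add vector,   Eq. (5)
    M_tilde = M_v * (1.0 - alpha.unsqueeze(1) * e)    # slot-wise erase, Eq. (6)
    return M_tilde + alpha.unsqueeze(1) * a           # slot-wise add,   Eq. (7)
```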

3) Memory Reading. After updating the value memory matrix $\mathbf{M}^{v}$, $G_{mem}$ generates the full video representation $\mathbf{\hat{v}}$ by reading memory slots from $\mathbf{M}^{v}$ in the following way:

\mathbf{\hat{v}}=\mathbf{x}+\sum_{i}\alpha[i]\,\mathbf{M}^{v}[i], \quad (8)

which adds a skip-connection between the partial video feature $\mathbf{x}$ and the memory output. Notably, Eq. (8) enables $G_{mem}$ to memorize the residual between a partial video and its corresponding entire one.
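
The reading step then reduces to an attention-weighted sum over value slots plus the skip connection, e.g.:

```python
def read_memory(x, alpha, M_v):
    """Eq. (8): generate the full-video feature v_hat.

    x:     [batch, d] partial-video features.
    alpha: [batch, N] attention weights from memory addressing.
    M_v:   [N, d] value memory matrix.
    """
    return x + alpha @ M_v   # skip connection + weighted sum of value slots
```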

In summary, the memory generator $G_{mem}(\cdot;\theta_{mem})$ defined in Eq. (2) is implemented through Eqs. (3)-(8), where $\theta_{mem}$ includes the key/value memory matrices and all the learnable gate parameters, i.e., $\theta_{mem}=\{\mathbf{M}^{k},\mathbf{M}^{v},\mathbf{W}_{e},\mathbf{W}_{a}\}$.

3.2.3 Discriminator

The discriminator network is designed with two purposes: 1) predicting the true action category given the real/generated ($\mathbf{v}$/$\mathbf{\hat{v}}$) full video representation, and 2) distinguishing the real full video representation $\mathbf{v}$ from the fake one $\mathbf{\hat{v}}$. Inspired by [6, 23], we build the discriminator in a compositional way: $D(\cdot;\theta_{D}):=\{D_{cls}(\cdot;\theta_{cls}),D_{adv}(\cdot;\theta_{adv})\}$, where $D_{cls}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{|\mathcal{Y}|}$ works as a classifier to predict probability scores over the $|\mathcal{Y}|$ action classes, and $D_{adv}:\mathbb{R}^{d}\rightarrow[0,1]$ follows the same definition as in the GAN model [11] to infer the probability of a given sample being real. The discriminator $D$ in our model is formulated as fully-connected networks parameterized by $\theta_{D}=\{\theta_{cls},\theta_{adv}\}$.
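
A minimal sketch of the class-aware discriminator is given below, assuming one linear layer per head as described in Section 4.1; the head names are illustrative:

```python
import torch
import torch.nn as nn

class ClassAwareDiscriminator(nn.Module):
    """D = {D_cls, D_adv} over full-video features of dimension d."""

    def __init__(self, d=1024, num_classes=101):
        super().__init__()
        self.cls_head = nn.Linear(d, num_classes)  # D_cls: class logits
        self.adv_head = nn.Linear(d, 1)            # D_adv: real/fake score

    def forward(self, feat):                       # feat: [batch, d]
        logits = self.cls_head(feat)               # softmax applied in the loss
        real_prob = torch.sigmoid(self.adv_head(feat)).squeeze(1)
        return logits, real_prob
```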

3.3 Objective Function

The main goal of this work is to deliver realistic full-video representations for partial videos to predict their correct action classes. To this end, three loss functions are jointly employed for training the proposed AMemNet model.

Adversarial Loss. Given a partial video feature $\mathbf{x}$ and its real full video representation $\mathbf{v}$, we compute the adversarial loss $\mathcal{L}_{adv}$ by

\mathcal{L}_{adv}=\mathbb{E}_{\mathbf{v}}[\log D_{adv}(\mathbf{v})]+\mathbb{E}_{\mathbf{q}}[\log(1-D_{adv}(G_{mem}(\mathbf{q})))]. \quad (9)

The discriminator $D_{adv}$ tries to differentiate $\mathbf{\hat{v}}=G_{mem}(\mathbf{q})$ from the real one $\mathbf{v}$ by maximizing $\mathcal{L}_{adv}$, while, on the contrary, the memory generator $G_{mem}$ aims to fool $D_{adv}$ by minimizing $\mathcal{L}_{adv}$. By using Eq. (9), we could employ $D_{adv}$ to push $G_{mem}$ towards generating realistic full video features.
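
In practice, Eq. (9) can be optimized with the standard binary cross-entropy. The sketch below uses the common non-saturating generator objective in place of literally minimizing $\log(1-D_{adv}(\mathbf{\hat{v}}))$, which is an implementation choice rather than the formulation stated above:

```python
import torch
import torch.nn.functional as F

def adversarial_losses(p_real, p_fake):
    """p_real = D_adv(v), p_fake = D_adv(G_mem(q)); both in [0, 1].

    p_fake should be computed from a detached v_hat when updating D.
    """
    d_loss = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) + \
             F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
    g_loss = F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    return d_loss, g_loss   # D minimizes d_loss, G minimizes g_loss
```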

Reconstruction Loss. The adversarial loss $\mathcal{L}_{adv}$ encourages our model to generate video features approaching the real feature distribution of full videos, yet without considering the reconstruction error at an instance level, which might miss some useful information for recovering $\mathbf{v}$ from $\mathbf{x}$. In light of this, we define a reconstruction loss as

\mathcal{L}_{rec}=\mathbb{E}_{(\mathbf{x},\mathbf{v})}\|G_{mem}(f_{q}(\mathbf{x}))-\mathbf{v}\|_{2}^{2}, \quad (10)

which calculates the squared Euclidean distance between the generated feature $\mathbf{\hat{v}}=G_{mem}(f_{q}(\mathbf{x}))$ and its corresponding full video feature $\mathbf{v}$. Eq. (10) further guides the memory generator by bridging the gap between $\mathbf{x}$ and $\mathbf{v}$.

Classification Loss. It is important for $G_{mem}$ to generate discriminative representations $\mathbf{\hat{v}}$ for predicting different action classes. Thus, it is natural to impose a classification loss $\mathcal{L}_{cls}$ on training the memory generator as follows:

\mathcal{L}_{cls}^{v} = \mathbb{E}_{(\mathbf{v},\mathbf{y})}H(\mathbf{y},D_{cls}(\mathbf{v})), \quad (11)
\mathcal{L}_{cls}^{x} = \mathbb{E}_{(\mathbf{x},\mathbf{y})}H(\mathbf{y},D_{cls}(G_{mem}(f_{q}(\mathbf{x})))), \quad (12)

where $\mathbf{y}\in\mathbb{R}^{|\mathcal{Y}|}$ indicates the one-hot vector of an action label $y$ over $|\mathcal{Y}|$ classes and $H(\cdot,\cdot)$ computes the cross-entropy between two probability distributions. Let $\mathbf{\hat{y}}\in\mathbb{R}^{|\mathcal{Y}|}$ be the output of $D_{cls}$; then $H(\mathbf{y},\mathbf{\hat{y}})=-\sum_{i=1}^{|\mathcal{Y}|}\mathbf{y}[i]\log\mathbf{\hat{y}}[i]$.

Different from [23], which only employs the classification loss $\mathcal{L}_{cls}^{v}$ for training the discriminator model with full videos, we employ Eq. (11) and Eq. (12) to train the discriminator $D_{cls}$ and the memory generator $G_{mem}$ alternately. The benefit is that a high-quality classifier $D_{cls}$ is first obtained by minimizing $\mathcal{L}_{cls}^{v}$ on real full videos, and then $D_{cls}$ is leveraged to “teach” $G_{mem}$ to generate representations $\mathbf{\hat{v}}$ that lower $\mathcal{L}_{cls}^{x}$. By this means, $G_{mem}$ learns discriminative information from $D_{cls}$.
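
Both classification terms are plain cross-entropy losses; a minimal sketch with integer action labels is:

```python
import torch.nn.functional as F

def classification_losses(logits_real, logits_fake, y):
    """Eqs. (11)-(12).

    logits_real: D_cls(v) on real full-video features    -> trains D_cls.
    logits_fake: D_cls on the generated features          -> teaches G_mem.
    y:           [batch] integer action labels.
    """
    loss_cls_v = F.cross_entropy(logits_real, y)   # Eq. (11)
    loss_cls_x = F.cross_entropy(logits_fake, y)   # Eq. (12)
    return loss_cls_v, loss_cls_x
```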

Final Objective. By summarizing Eq. (9) to Eq. (12), the final objective function of the proposed AMemNet model is given by

\max_{\theta_{D}} ~ \mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}^{v}, \quad (13)
\min_{\theta_{G}} ~ \mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}^{x}+\lambda_{rec}\mathcal{L}_{rec}, \quad (14)

where $\theta_{G}=\{\theta_{enc},\theta_{mem}\}$ includes all the trainable parameters for generating $\mathbf{\hat{v}}$ from $\mathbf{x}$, $\theta_{D}=\{\theta_{cls},\theta_{adv}\}$ parameterizes the discriminator, and $\lambda_{cls}$ and $\lambda_{rec}$ are the trade-off parameters balancing the different loss functions. During training, we optimize $\theta_{D}$ and $\theta_{G}$ by alternately solving Eq. (13) and Eq. (14) while fixing the other.
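
Putting Eqs. (13)-(14) together, one alternating update could look like the sketch below, following the optimizer and trade-off settings reported in Section 4.1; `encoder`, `generator`, and `disc` denote hypothetical modules wrapping $f_{q}$, $G_{mem}$ (addressing, writing, and reading), and $D$, and the cross-entropy term of Eq. (13) is minimized as usual:

```python
import torch
import torch.nn.functional as F

lambda_cls, lambda_rec = 1.0, 0.1   # trade-off weights reported in Section 4.1

def train_step(x, v, y, encoder, generator, disc, opt_d, opt_g):
    """x, v: pre-extracted TSN features of the partial/full video; y: labels."""
    # --- update the discriminator, Eq. (13) (run twice per step in the paper) ---
    q = encoder(x)
    v_hat = generator(q, v)                  # addressing + writing + reading
    logits_v, p_real = disc(v)
    _, p_fake = disc(v_hat.detach())         # no gradient into G here
    d_loss = (F.binary_cross_entropy(p_real, torch.ones_like(p_real))
              + F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
              + lambda_cls * F.cross_entropy(logits_v, y))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- update the encoder + memory generator, Eq. (14) ---
    q = encoder(x)
    v_hat = generator(q, v)
    logits_fake, p_fake = disc(v_hat)
    g_loss = (F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
              + lambda_cls * F.cross_entropy(logits_fake, y)
              + lambda_rec * F.mse_loss(v_hat, v))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```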

Figure 2: Illustration of applying AMemNet over the RGB and optical flow streams.

3.4 Two-Stream Fusion for Action Prediction

After training the proposed AMemNet model via Eq. (13) and Eq. (14), we freeze the model weights $\theta=\{\theta_{G},\theta_{D}\}$ and suppress the memory writing operations in Eqs. (4)-(7) for testing AMemNet. Particularly, given a partial video feature $\mathbf{x}$, we predict its action label by

\mathbf{\hat{y}}=D_{cls}(G_{mem}(f_{q}(\mathbf{x}))), \quad (15)

where $\mathbf{\hat{y}}\in\mathbb{R}^{|\mathcal{Y}|}$ denotes the probability distribution over the $|\mathcal{Y}|$ action classes for $\mathbf{x}$.

As shown in Fig. 2, we adopt a two-stream framework [35, 43] to exploit the spatial and temporal information of given videos, where we first test AMemNet on each stream (i.e., RGB frames and optical flow) individually, and then fuse the prediction scores to obtain the final result. Given $\mathbf{x}^{rgb}$ and $\mathbf{x}^{flow}$, we obtain the prediction results $\mathbf{\hat{y}}^{rgb}$ and $\mathbf{\hat{y}}^{flow}$ by using Eq. (15) with $\theta^{rgb}$ and $\theta^{flow}$, respectively. The final prediction result is given by

\mathbf{\hat{y}}^{fusion}=\mathbf{\hat{y}}^{rgb}+\beta\,\mathbf{\hat{y}}^{flow}, \quad (16)

where $\beta$ is the fusion weight for integrating the scores given by the stream of spatial RGB frames and the stream of temporal optical flow images.
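
At test time, Eqs. (15)-(16) amount to running each stream's trained pipeline and fusing the two score vectors. A sketch with $\beta=1.5$ as in our experiments is shown below; `model_rgb`/`model_flow` are hypothetical wrappers of $D_{cls}(G_{mem}(f_{q}(\cdot)))$ for each stream:

```python
import torch

@torch.no_grad()
def predict_fused(x_rgb, x_flow, model_rgb, model_flow, beta=1.5):
    """Two-stream late fusion for action prediction."""
    y_rgb = torch.softmax(model_rgb(x_rgb), dim=1)     # \hat{y}^{rgb},  Eq. (15)
    y_flow = torch.softmax(model_flow(x_flow), dim=1)  # \hat{y}^{flow}, Eq. (15)
    y_fused = y_rgb + beta * y_flow                    # Eq. (16)
    return y_fused.argmax(dim=1)                       # predicted action labels
```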

Table 1: Action prediction accuracy (%) under 10 observation ratios on the UCF101 dataset.
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
Single-stream:
IBoW [32] | 36.29 | 65.69 | 71.69 | 74.25 | 74.39 | 75.23 | 75.36 | 75.57 | 75.79 | 75.79
MSSC [2] | 34.05 | 53.31 | 58.55 | 57.94 | 61.79 | 60.86 | 63.17 | 63.64 | 61.63 | 61.63
MTSSVM [21] | 40.05 | 72.83 | 80.02 | 82.18 | 82.39 | 83.21 | 83.37 | 83.51 | 83.69 | 82.82
DeepSCN [22] | 45.02 | 77.64 | 82.95 | 85.36 | 85.75 | 86.70 | 87.10 | 87.42 | 87.50 | 87.63
PA-DRL [4] | 81.36 | 82.63 | 82.90 | 83.51 | 84.01 | 84.38 | 85.09 | 85.41 | 85.81 | 86.15
PTSL [46] | 83.32 | 87.13 | 88.92 | 89.82 | 90.85 | 91.04 | 91.28 | 91.23 | 91.31 | 91.47
Two-stream:
Mem-LSTM [20] | 51.02 | 80.97 | 85.73 | 87.76 | 88.37 | 88.58 | 89.09 | 89.38 | 89.67 | 90.49
TSL [5] | 82.20 | 86.70 | 88.50 | 89.50 | 90.10 | 91.00 | 91.50 | 91.90 | 92.40 | 92.50
RGN-KF [50] | 83.12 | 85.16 | 88.44 | 90.78 | 91.42 | 92.03 | 92.00 | 93.19 | 93.13 | 93.14
AAPNet [23] | 90.25 | 93.10 | 94.46 | 95.41 | 95.89 | 96.09 | 96.27 | 96.35 | 96.47 | 96.36
Baselines:
TSN [43] | 86.76 | 89.29 | 90.64 | 91.81 | 91.73 | 92.47 | 92.97 | 93.15 | 93.31 | 93.42
TSN+finetune | 88.88 | 91.52 | 93.01 | 94.05 | 94.66 | 95.34 | 95.64 | 95.92 | 95.90 | 96.00
TSN+KNN | 85.69 | 88.71 | 90.34 | 91.29 | 91.78 | 92.33 | 92.41 | 92.75 | 92.96 | 93.11
Our model:
AMemNet-RGB | 85.95 | 87.47 | 88.21 | 88.57 | 89.26 | 89.51 | 89.81 | 89.99 | 90.06 | 90.17
AMemNet-Flow | 83.64 | 88.32 | 90.74 | 92.18 | 93.20 | 93.78 | 94.40 | 94.75 | 94.88 | 94.96
AMemNet | 92.45 | 94.60 | 95.55 | 96.00 | 96.45 | 96.67 | 96.97 | 96.95 | 97.07 | 97.03

4 Experiments

4.1 Experimental Setting

Datasets. Two benchmark video datasets, UCF101 [36] and HMDB51 [24], are used in the experiments. The UCF101 dataset consists of 13,320 videos from 101 human action classes covering a wide range of human activities, and the HMDB51 dataset collects 6,766 video clips from movies and web videos over 51 action categories. We adopt the standard training/testing splits of both datasets following [43, 23]. We test the proposed AMemNet model over the three splits and report the average prediction result for each dataset. We employ the preprocessed RGB frames and optical flow images provided in [10].

Implementation Details. The proposed AMemNet is built on top of temporal segment networks (TSN) [43], where we adopt the BN-Inception network [16] as the backbone and employ the model pre-trained on the Kinetics dataset [19]. The same data augmentation strategy (e.g., cropping and jittering) as provided in [43] is employed for encoding all the partial and full videos into $d=1024$ dimensional feature representations. We formulate $f_{q}$ as a fully-connected network with two layers, where the middle layer has 512 hidden units and the final query embedding size is set to $h=256$. Batch normalization and LeakyReLU are both used in $f_{q}$. We employ $N=512$ memory slots for the key and value memory matrices, and hence $\mathbf{M}^{k}\in\mathbb{R}^{512\times 256}$ and $\mathbf{M}^{v}\in\mathbb{R}^{512\times 1024}$. All the memory matrices and gating parameters in $\theta_{mem}$ are randomly initialized. We implement the discriminator network with one fully-connected layer, where the softmax and sigmoid activation functions are used for $D_{cls}$ and $D_{adv}$, respectively.

For each training step, we first employ the Adam optimizer with a learning rate of 0.0001 to update $\theta_{D}$ with Eq. (13) twice, and then optimize $\theta_{G}$ once by solving Eq. (14) with the SGD optimizer using a learning rate of 0.0001 and a momentum of 0.9. We set the batch size to 64. For all the datasets, we set $\lambda_{cls}=1$ to strengthen the impact of $D_{cls}$ on the memory generator $G_{mem}$ for encouraging discriminative representations, and set $\lambda_{rec}=0.1$ to avoid overemphasizing the reconstruction of each individual video sample, which could lead to overfitting. The fusion weight $\beta$ is fixed at 1.5 for all the datasets following [43]. All code in this work was implemented in PyTorch and run on Titan X GPUs.

Compared Methods. We compare our approach with three kinds of methods as follows. 1) Single-stream methods: Integral BoW (IBoW) [32], mixture segments sparse coding (MSSC) [2], multiple temporal scales SVM (MTSSVM) [21], deep sequential context networks (DeepSCN) [22], part-activated deep reinforcement learning (PA-DRL) [4], and progressive teacher-student learning (PTSL) [46]. 2) Two-stream methods: memory augmented LSTM (Mem-LSTM) [20], temporal sequence learning (TSL) [5], residual generator network with Kalman filter (RGN-KF) [50], and adversarial action prediction networks (AAPNet) [23]. We implemented AAPNet with the same pre-trained TSN features as our approach and report the authors' published results for the other single/two-stream methods. 3) Baselines: We also compare AMemNet with temporal segment networks (TSN) [43]. Specifically, we test the TSN model pre-trained on the UCF101/HMDB51 dataset as a baseline, and finetune the TSN model pre-trained on the Kinetics dataset for UCF101 and HMDB51, respectively (TSN+finetune). Moreover, we train a k-nearest neighbors (KNN) classifier on the Kinetics pre-trained TSN features, termed TSN+KNN, and report its best performance by selecting $k$ from $\{5,10,20,30,50,100,500\}$. For a fair comparison, we follow the same testing setting as in previous works [22, 4, 50] by evenly dividing all the videos into 10 progress levels, i.e., $P=10$ as described in Section 3.1. However, it is worth noting that the proposed AMemNet does not require any progress label in either training or testing.

Table 2: Action prediction accuracy (%) under 10 observation ratios on the HMDB51 dataset.
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
Two-stream:
Global-Local [25] | 38.80 | 43.80 | 49.10 | 50.40 | 52.60 | 54.70 | 56.30 | 56.90 | 57.30 | 57.30
TSL [5] | 38.80 | 51.60 | 57.60 | 60.50 | 62.90 | 64.60 | 65.60 | 66.20 | 66.30 | 66.30
AAPNet [23] | 56.03 | 60.11 | 64.87 | 67.99 | 70.76 | 72.55 | 74.00 | 74.81 | 75.59 | 75.56
Baselines:
TSN [43] | 47.12 | 52.81 | 59.35 | 62.55 | 64.77 | 67.52 | 68.95 | 69.87 | 70.07 | 70.13
TSN+finetune | 55.13 | 59.82 | 63.88 | 67.02 | 69.74 | 71.72 | 72.98 | 73.43 | 74.08 | 73.55
TSN+KNN | 48.77 | 53.59 | 57.83 | 60.33 | 62.84 | 65.18 | 66.53 | 67.00 | 67.53 | 66.65
Our model:
AMemNet-RGB | 52.55 | 55.52 | 58.27 | 60.55 | 62.53 | 63.87 | 64.41 | 64.61 | 64.99 | 64.86
AMemNet-Flow | 47.41 | 54.43 | 60.26 | 64.51 | 68.03 | 70.53 | 72.10 | 73.05 | 73.39 | 73.52
AMemNet | 57.74 | 62.10 | 66.28 | 70.17 | 72.66 | 74.55 | 75.22 | 75.78 | 76.08 | 76.14
Figure 3: Ablation study for the proposed AMemNet on the UCF101 dataset in terms of (a) RGB, (b) Flow, and (c) Fusion, respectively.
Figure 4: Visual analysis of the proposed AMemNet between early predictable and late predictable action categories. (a) Prediction results of TSN and the proposed AMemNet over 10 categories at 10% video progress on UCF101. (b) and (c) t-SNE embedding results given by TSN and AMemNet, respectively.

4.2 Prediction Performance

UCF101 Dataset. Table 1 summarizes the prediction accuracy of the proposed AMemNet and 13 compared methods on the UCF101 dataset. Overall, AMemNet consistently outperforms all the competitors over different observation ratios with a significant improvement. Impressively, the proposed AMemNet achieves around 92% accuracy when only 10% of a video is observed, which fully validates the effectiveness of applying AMemNet to early action prediction. This benefit mainly comes from the rich key-value structured memories learned from full-video features under the guidance of adversarial training.

The single-stream methods mainly explore the temporal information by using hand-crafted features (e.g., spatio-temporal interest points (STIP) [8] and dense trajectories), as in IBoW [32], MSSC [2], and MTSSVM [21], or by utilizing 3D convolutional networks (e.g., C3D [39]), as in DeepSCN [22] and PTSL [46]. Differently, the two-stream methods deploy convolutional neural networks on two pathways to capture the spatial information of RGB images and the temporal characteristics of optical flow, respectively. On the one hand, the two-stream methods exploit the spatial-temporal information inside videos better than a single stream. The proposed AMemNet inherits the merits of this two-stream architecture, and thus outperforms single-stream methods even when they employ a more powerful CNN encoder, e.g., the 3D ResNeXt-101 [14] used in PTSL [46] is much deeper than the BN-Inception used in AMemNet. On the other hand, compared with the two-stream methods, especially AAPNet [23] implemented with the same backbone as our model, the consistent improvement of AMemNet over AAPNet shows the effectiveness of using the memory generator to deliver “full” video features in the early progress.

In Table 1, AMemNet-RGB and AMemNet-Flow refer to the single-stream results of applying AMemNet on RGB frames and flow images, respectively. Two interesting observations can be drawn: 1) The RGB stream contributes more than the flow stream at the beginning, as still images encapsulating scenes and objects provide key clues for recognizing actions from only a few frames. 2) Late fusion naturally fits action prediction by integrating the complementary information between the two streams over time.

HMDB51 Dataset. Table 2 reports the prediction results on the HMDB51 dataset, which, compared with UCF101, is a more challenging dataset for predicting actions due to its large motion variations rather than static cues across different categories [24]. As can be seen, the flow stream of AMemNet exceeds AMemNet-RGB by around 8% in accuracy once more progress is observed (e.g., $\tau_{p}\geq 0.5$). Even in this case, the proposed AMemNet still consistently improves over AMemNet-Flow by incorporating RGB results across different progress levels. Moreover, the clear improvements of AMemNet over TSN indicate that the full video memories learned by our memory generator enhance the discriminability of video representations in the early progress.

4.3 Model Discussion

Ablation Study. Fig. 3 shows the ablation study of the proposed AMemNet model on the UCF101 dataset in terms of RGB, Flow, and Fusion, respectively, where we test all the methods under different observation ratios (more ablation study results and parameter analyses on the UCF101 and HMDB51 datasets are provided in the supplementary material). We adopt TSN + finetune as a sanity check for our approach and implement two strong ablated models to examine the impact of the two main components of AMemNet as follows. 1) AMemNet w/o Mem refers to our model with the memory generator discarded, i.e., $\theta\backslash\theta_{mem}$; instead, we use the same generator network as in AAPNet [23] to generate full video features. 2) AMemNet w/o GAN is developed without adversarial training and is trained using only a classification loss.

As shown in Fig. 3, AMemNet improves over all the above methods by a clear margin in different cases, which strongly supports the motivation of this work. It is worth noting that, in the early progress (i.e., observation ratio $\tau_{p}\leq 0.3$), AMemNet w/o GAN clearly boosts the performance over AMemNet w/o Mem. This demonstrates the effectiveness of using memory networks to compensate for the limited information in incomplete videos. As more progress is observed, the GAN model leads the generating process since sufficient information is given by the partial videos, and AMemNet w/o Mem surpasses AMemNet w/o GAN after $\tau_{p}>0.7$ on the UCF101 dataset.

Early Predictable vs. Late Predictable. In Fig. 4, we examine the performance of AMemNet on action categories with different properties, e.g., predictability (the progress level required for recognizing an action), on the UCF101 dataset. We compare AMemNet and TSN [43] on the 10% progress-level videos of 10 categories in Fig. 4(a) and show the corresponding t-SNE [41] embeddings of TSN features and the full video features generated by AMemNet in Fig. 4(b) and 4(c), respectively. Inspired by [22], we select 10 action categories from UCF101 and divide them into two groups: 1) the early predictable group, including Billiards, IceDancing, RockClimbingIndoor, PlayingPiano, and PommelHorse, and 2) the late predictable group, including VolleyballSpiking, CliffDiving, HeadMassage, PoleVault, and ThrowDiscus, where the early group can usually be predicted from 10% progress and the late group contains the non-early ones.

As expected, the proposed AMemNet mainly improves over the TSN baseline on late predictable actions in Fig. 4(a), which again demonstrates the realism of the full video features given by our memory generator. Moreover, as shown in Fig. 4(b) and 4(c), while TSN exhibits well-structured feature embeddings for early predictable classes, e.g., IceDancing and PommelHorse, its embeddings mix up for late predictable ones like PoleVault and CliffDiving. In contrast, AMemNet generates full video features that encourage a good cluster structure in the embedding space.

5 Conclusion

In this paper, we presented a novel two-stream adversarial memory networks (AMemNet) model for the action prediction task. A key-value structured memory generator was proposed to generate the full video feature conditioned on the partial video query, and a class-aware discriminator was developed to supervise the generator in delivering realistic and discriminative representations of full videos through adversarial training. The proposed AMemNet adopts input and forget gates for updating the full video memories attended by different queries, which captures the long-term temporal variation across different video progress levels. Experimental results on two benchmark datasets were provided to demonstrate the effectiveness of AMemNet for the action prediction problem compared with state-of-the-art methods.

References

  • [1] Y. Cai, L. Huang, Y. Wang, T.-J. Cham, J. Cai, J. Yuan, J. Liu, X. Yang, Y. Zhu, X. Shen, D. Liu, J. Liu, and N. M. Thalmann. Learning progressive joint propagation for human motion prediction. In Computer Vision – ECCV 2020, pages 226–242, 2020.
  • [2] Y. Cao, D. Barrett, A. Barbu, S. Narayanaswamy, H. Yu, A. Michaux, Y. Lin, S. Dickinson, J. Siskind, and S. Wang. Recognizing human activities from partially observed videos. In CVPR, 2013.
  • [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and kinetics dataset. In CVPR, 2017.
  • [4] L. Chen, J. Lu, Z. Song, and J. Zhou. Part-activated deep reinforcement learning for action prediction. In ECCV, September 2018.
  • [5] S. Cho and H. Foroosh. A temporal sequence learning for action recognition and prediction. In WACV, pages 352–361, 2018.
  • [6] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • [7] A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. In CVPR, July 2017.
  • [8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
  • [9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, June 2015.
  • [10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, June 2016.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680. 2014.
  • [12] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
  • [13] S. Guo, L. Qing, J. Miao, and L. Duan. Deep residual feature learning for action prediction. In IEEE International Conference on Multimedia Big Data, pages 1–6, 2018.
  • [14] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, June 2018.
  • [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37, pages 448–456, 2015.
  • [17] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. TPAMI, 2013.
  • [18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [19] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • [20] Y. Kong, S. Gao, B. Sun, and Y. Fu. Action prediction from videos via memorizing hard-to-predict samples. In AAAI, 2018.
  • [21] Y. Kong, D. Kit, and Y. Fu. A discriminative model with multiple temporal scales for action prediction. In ECCV, 2014.
  • [22] Y. Kong, Z. Tao, and Y. Fu. Deep sequential context networks for action prediction. In CVPR, 2017.
  • [23] Y. Kong, Z. Tao, and Y. Fu. Adversarial action prediction networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):539–553, 2020.
  • [24] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
  • [25] S. Lai, W. Zheng, J. Hu, and J. Zhang. Global-local temporal saliency action prediction. IEEE Transactions on Image Processing, 27(5):2272–2285, 2018.
  • [26] T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
  • [27] I. Laptev. On space-time interest points. IJCV, 64(2):107–123, 2005.
  • [28] A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston. Key-value memory networks for directly reading documents. In EMNLP, pages 1400–1409, 2016.
  • [29] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [30] M. Raptis and L. Sigal. Poselet key-framing: A model for human activity recognition. In CVPR, 2013.
  • [31] M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, pages 1593–1600, 2009.
  • [32] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
  • [33] M. Sadegh Aliakbarian, F. Sadat Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson. Encouraging lstms to anticipate actions very early. In ICCV, Oct 2017.
  • [34] Y. Shi, B. Fernando, and R. Hartley. Action anticipation with rbf kernelized feature mapping rnn. In ECCV, September 2018.
  • [35] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [36] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. Technical report, CRCV-TR-12-01, 2012.
  • [37] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. In NeurIPS, pages 2440–2448, 2015.
  • [38] Z. Tao, S. Li, Z. Wang, C. Fang, L. Yang, H. Zhao, and Y. Fu. Log2intent: Towards interpretable user modeling via recurrent semantics memory unit. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1055–1063, 2019.
  • [39] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [40] K. M. Tran, A. Bisazza, and C. Monz. Recurrent memory networks for language modeling. In NAACL HLT, pages 321–331, 2016.
  • [41] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [42] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
  • [43] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [44] Q. Wang, H. Yin, Z. Hu, D. Lian, H. Wang, and Z. Huang. Neural memory streaming recommender networks with adversarial training. In SIGKDD, pages 2467–2475, 2018.
  • [45] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, June 2018.
  • [46] X. Wang, J.-F. Hu, J.-H. Lai, J. Zhang, and W.-S. Zheng. Progressive teacher-student learning for early action prediction. In CVPR, June 2019.
  • [47] Y. Wang, L. Jiang, M.-H. Yang, L.-J. Li, M. Long, and L. Fei-Fei. Eidetic 3d lstm: A model for video prediction and beyond. In ICLR, 2019.
  • [48] J. Weston, S. Chopra, and A. Bordes. Memory networks. CoRR, abs/1410.3916, 2014.
  • [49] J. Zhang, X. Shi, I. King, and D. Yeung. Dynamic key-value memory networks for knowledge tracing. In WWW, pages 765–774, 2017.
  • [50] H. Zhao and R. P. Wildes. Spatiotemporal feature residual propagation for action prediction. In ICCV, October 2019.
  • [51] H. Zhao and R. P. Wildes. On diverse asynchronous activity anticipation. In Computer Vision – ECCV 2020, pages 781–799, 2020.
Table 3: Overall comparison between ablated models. We provide two-stream fusion results at the $p=0.1$ observation ratio and the results averaged over 10 observation ratios. Generally, $p=0.1$ is the most important observation ratio for action prediction.
Methods | UCF101 (0.1) | UCF101 (avg.) | HMDB51 (0.1) | HMDB51 (avg.)
TSN + finetune | 88.88 | 94.09 | 55.13 | 68.13
AMemNet w/o Mem | 90.17 | 95.12 | 55.49 | 68.68
AMemNet w/o GAN | 91.47 | 95.45 | 56.36 | 70.30
AMemNet w/o Res | 92.26 | 95.62 | 56.00 | 70.33
AMemNet (ours) | 92.45 | 95.97 | 57.74 | 70.67

Appendix A Supplementary Material

We provide 1) more ablation study results (Table 3-5 and Fig. 6) and 2) parameter analyses (Fig. 5(a) and Fig. 5(b)) in the supplementary material. Particularly, we report the averaged results over three splits of the UCF101 and HMDB51 datasets for ablation study, respectively, and mainly conduct the parameter analysis on split 1 of the HMDB51 dataset.

Ablation Study. The final objective function of the proposed AMemNet model is given by

\max_{\theta_{D}} ~ \mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}^{v},
\min_{\theta_{G}} ~ \mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}^{x}+\lambda_{rec}\mathcal{L}_{rec},

where $\theta_{G}=\{\theta_{enc},\theta_{mem}\}$ includes all the trainable parameters for generating $\mathbf{\hat{v}}$ from $\mathbf{x}$, $\theta_{D}=\{\theta_{cls},\theta_{adv}\}$ parameterizes the discriminator, and $\lambda_{cls},\lambda_{rec}$ are two trade-off parameters for balancing the different terms.

Table 3 shows the overall comparison between the proposed AMemNet, three ablated models, and one baseline. Table 4 and Table 5 provide the per-stream comparison results on the UCF101 and HMDB51 datasets, respectively. In the proposed AMemNet, the memory-augmented generator $\theta_{G}$ and the adversarial training loss $\mathcal{L}_{adv}$ play the key roles in generating full-video-like features, where $\theta_{G}$ contributes more in the early progress and $\mathcal{L}_{adv}$ leads in the late progress. The reconstruction loss $\mathcal{L}_{rec}$ works as a “regularization” term for adversarial training, and thus AMemNet w/o Res presents a similar overall performance to our full model. As shown in Table 5, $\mathcal{L}_{rec}$ exhibits a more significant impact on the earlier progress (e.g., $p=0.1$ or $0.2$), since the videos in HMDB51 usually have a larger variance than those in UCF101. However, overemphasizing $\mathcal{L}_{rec}$ may lower the performance (see $\lambda_{rec}=10$ in Fig. 5(b)).

Figure 5: (a) Parameter analysis on the number of memory slots ($N$), with $N=$ 64, 128, 256, 512, and 1024. (b) Parameter analysis on the trade-off parameter $\lambda_{rec}$ for the reconstruction loss $\mathcal{L}_{rec}$, with $\lambda_{rec}=$ 0, 0.01, 0.1, 1, and 10.

Parameter Analysis. In the experiments, we set the number of memory slots used in $G_{mem}$ to $N=512$ by default. We study the impact of $N$ in Fig. 5(a), which suggests that $N\geq 512$ leads to stable prediction performance. We fix $\lambda_{cls}=1$ since classification is the main goal of action prediction, and mainly tune $\lambda_{rec}$ in Fig. 5(b), which indicates that a relatively small $\lambda_{rec}\leq 0.1$ is useful for recovering the partial videos. We set $\lambda_{rec}=0.1$ by default.

Table 4: Action prediction accuracy (%) under 10 observation ratios on the UCF101 dataset.
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
RGB:
TSN + finetune | 83.81 | 85.20 | 86.06 | 86.81 | 87.34 | 88.02 | 88.40 | 88.81 | 88.95 | 89.36
AMemNet w/o Mem | 84.92 | 86.30 | 87.38 | 87.80 | 88.40 | 88.97 | 89.47 | 89.82 | 90.03 | 90.04
AMemNet w/o GAN | 85.33 | 86.70 | 87.41 | 87.98 | 88.62 | 88.94 | 89.58 | 89.67 | 89.78 | 89.86
AMemNet w/o Res | 86.03 | 87.27 | 87.97 | 88.34 | 88.90 | 89.39 | 89.73 | 89.93 | 90.01 | 90.12
AMemNet-RGB | 85.95 | 87.47 | 88.21 | 88.57 | 89.26 | 89.51 | 89.81 | 89.99 | 90.06 | 90.17
Flow:
TSN + finetune | 78.85 | 84.17 | 87.14 | 89.11 | 90.85 | 91.94 | 92.82 | 93.61 | 93.70 | 93.93
AMemNet w/o Mem | 79.68 | 84.93 | 88.15 | 90.16 | 91.63 | 92.80 | 93.70 | 94.32 | 94.65 | 94.76
AMemNet w/o GAN | 81.98 | 86.87 | 89.70 | 90.94 | 91.91 | 93.08 | 93.75 | 94.20 | 94.32 | 94.35
AMemNet w/o Res | 83.57 | 88.00 | 90.39 | 91.86 | 92.79 | 93.25 | 93.87 | 94.34 | 94.53 | 94.51
AMemNet-Flow | 83.64 | 88.32 | 90.74 | 92.18 | 93.20 | 93.78 | 94.40 | 94.75 | 94.88 | 94.96
Fusion:
TSN + finetune | 88.88 | 91.52 | 93.01 | 94.05 | 94.66 | 95.34 | 95.64 | 95.92 | 95.90 | 96.00
AMemNet w/o Mem | 90.17 | 92.77 | 94.23 | 95.03 | 95.80 | 96.27 | 96.52 | 96.80 | 96.90 | 96.76
AMemNet w/o GAN | 91.47 | 93.84 | 94.77 | 95.45 | 96.01 | 96.15 | 96.55 | 96.74 | 96.76 | 96.72
AMemNet w/o Res | 92.26 | 94.15 | 94.93 | 95.72 | 96.16 | 96.34 | 96.62 | 96.63 | 96.65 | 96.72
AMemNet (ours) | 92.45 | 94.60 | 95.55 | 96.00 | 96.45 | 96.67 | 96.97 | 96.95 | 97.07 | 97.03
Table 5: Action prediction accuracy (%) under 10 observation ratios on the HMDB51 dataset.
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
RGB:
TSN + finetune | 50.82 | 54.61 | 57.03 | 58.97 | 60.30 | 60.98 | 61.54 | 61.94 | 62.23 | 62.68
AMemNet w/o Mem | 51.24 | 54.81 | 57.63 | 59.79 | 61.39 | 62.67 | 63.34 | 63.83 | 63.92 | 64.11
AMemNet w/o GAN | 51.77 | 55.34 | 57.98 | 60.13 | 61.53 | 62.56 | 63.13 | 63.69 | 64.28 | 64.05
AMemNet w/o Res | 51.51 | 54.68 | 57.23 | 59.63 | 61.13 | 61.85 | 62.32 | 62.86 | 63.32 | 63.34
AMemNet-RGB | 52.55 | 55.52 | 58.27 | 60.55 | 62.53 | 63.87 | 64.41 | 64.61 | 64.99 | 64.86
Flow:
TSN + finetune | 45.54 | 53.13 | 57.69 | 62.44 | 66.02 | 68.72 | 70.41 | 71.52 | 71.68 | 71.54
AMemNet w/o Mem | 45.13 | 52.22 | 57.84 | 62.15 | 65.53 | 68.67 | 70.10 | 71.31 | 72.16 | 72.05
AMemNet w/o GAN | 46.84 | 54.58 | 59.68 | 63.60 | 66.63 | 69.62 | 71.21 | 72.13 | 72.88 | 72.26
AMemNet w/o Res | 46.65 | 54.66 | 59.76 | 64.33 | 67.71 | 70.20 | 72.02 | 72.39 | 72.94 | 72.57
AMemNet-Flow | 47.41 | 54.43 | 60.26 | 64.51 | 68.03 | 70.53 | 72.10 | 73.05 | 73.39 | 73.52
Fusion:
TSN + finetune | 55.13 | 59.82 | 63.88 | 67.02 | 69.74 | 71.72 | 72.98 | 73.43 | 74.08 | 73.55
AMemNet w/o Mem | 55.49 | 59.93 | 64.14 | 67.48 | 70.02 | 72.01 | 73.88 | 74.46 | 74.74 | 74.64
AMemNet w/o GAN | 56.36 | 61.57 | 65.72 | 69.38 | 72.29 | 74.22 | 75.46 | 75.79 | 76.08 | 76.11
AMemNet w/o Res | 56.00 | 60.81 | 65.78 | 69.49 | 72.44 | 74.58 | 75.49 | 75.62 | 76.73 | 76.38
AMemNet (ours) | 57.74 | 62.10 | 66.28 | 70.17 | 72.66 | 74.55 | 75.22 | 75.78 | 76.08 | 76.14
Figure 6: Ablation study for the proposed AMemNet on the HMDB51 dataset in terms of (a) RGB, (b) Flow, and (c) Fusion, respectively.