
Adversarial Memory Networks for Action Prediction

Zhiqiang Tao (equal contribution), Santa Clara University, email: ztao@scu.edu    Yue Bai (equal contribution), Northeastern University, email: bai.yue@northeastern.edu    Handong Zhao, Adobe Research, email: hazhao@adobe.com    Sheng Li, University of Georgia, email: sheng.li@uga.edu    Yu Kong, Rochester Institute of Technology, email: yu.kong@rit.edu    Yun Fu, Northeastern University, email: yunfu@ece.neu.edu
Abstract

Action prediction aims to infer the forthcoming human action from partially-observed videos, which is a challenging task due to the limited information underlying early observations. Existing methods mainly adopt a reconstruction strategy to handle this task, expecting to learn a single mapping function from partial observations to full videos to facilitate the prediction process. In this study, we propose adversarial memory networks (AMemNet) to generate the “full video” feature conditioned on a partial video query from two new aspects. Firstly, a key-value structured memory generator is designed to memorize different partial videos as key memories and dynamically write full videos into value memories with a gating mechanism and querying attention. Secondly, we develop a class-aware discriminator to guide the memory generator to deliver not only realistic but also discriminative full video features via adversarial training. The final prediction result of AMemNet is given by late fusion over the RGB and optical flow streams. Extensive experimental results on two benchmark video datasets, UCF-101 and HMDB51, are provided to demonstrate the effectiveness of the proposed AMemNet model over state-of-the-art methods.

1 Introduction

Action prediction is a highly practical research topic that can be used in many real-world applications such as video surveillance, autonomous navigation, and human-computer interaction. Different from action recognition, which recognizes the human action category from a complete video, action prediction aims to understand the human activity at an early stage, i.e., using a partially-observed video before the entire action has been executed. Typically, action prediction methods [32, 2, 22, 4, 46, 50] assume that a portion of consecutive frames from the beginning of a video is given, referred to as a partial video. The challenges mainly arise from the limited information in the early progress of videos, leading to an incomplete temporal context and a lack of discriminative cues for recognizing actions. Thus, the key problem in action prediction is: how can we enhance the discriminative information of a partial video?

In recent years, many research efforts centering on the above question have been devoted to the action prediction task. Pioneering works [32, 2, 21] mainly handle partial videos by relying on hand-crafted features, dictionary learning, and temporally-structured classifiers. More recently, deep convolutional neural networks (CNNs), especially those pre-trained on large-scale video benchmarks (e.g., Sports-1M [18] and Kinetics [19]), have been widely adopted to predict actions. The pre-trained CNNs, to some extent, compensate for the incomplete temporal context and make it possible to reconstruct full video representations from partial ones. Along this line, existing methods [22, 34, 46, 23] focus on designing models to continuously improve the reconstruction performance, yet without considering the “malnutrition” nature (i.e., the limited temporal cues) of incomplete videos. In particular, it would be more straightforward to learn what “nutrients” (e.g., the missing temporal cues or reconstruction bases) a partial video needs for recognizing actions, compared with mapping it to an entire video. Moreover, it is also challenging to handle various partial videos with a single model.

In this study, we propose a novel adversarial memory networks (AMemNet) model to address the above challenges. The proposed AMemNet leverages an augmented memory network to explicitly learn and store full video features to enrich incomplete ones. Specifically, we treat a partial video as a query and the corresponding full video as its memory. The “full video” is generated from relevant memory slots fetched by the partial-video query. We summarize the contributions of this work in two aspects.

Firstly, a memory-augmented generator model is designed for generating full-video features conditioned on partial-video queries. We adopt a key-value memory network architecture [28, 49] for action prediction, where the key memory slots capture similar partial videos and the value memory slots are extracted from full training videos. The memory writing process is implemented with a gating mechanism and attention weights. The input/forget gates enable AMemNet to dynamically update video memories attended by different queries and thus memorize the variation across different video progress levels.

Secondly, a class-aware discriminator model is developed to guide the memory generator with adversarial training, which not only employs an adversarial loss to encourage generating realistic full video features, but also imposes a classification loss on training the network. By this means, the discriminator network could further push the generator to deliver discriminative full-video features.

The proposed AMemNet obtains prediction results by employing a late fusion strategy over two streams (i.e., RGB and optical flow) following [35, 43]. Extensive experiments on two benchmark datasets, UCF101 and HMDB51, are conducted to show the effectiveness of AMemNet compared with state-of-the-art methods, where our approach surprisingly achieves over 90% accuracy by only observing 10% of the beginning video frames on the UCF101 dataset. A detailed ablation study compared with several competitive baselines is also presented.

2 Related Work

Action Recognition targets recognizing the label of a human action in a given video, which is one of the core tasks of video understanding. Previous works have extensively studied this research problem from several aspects, including hand-crafted features (e.g., spatio-temporal interest points [8, 31], poselet key-frames [30, 27], and dense trajectories [42]), 3D convolutional neural networks [17, 39, 14], recurrent neural network (RNN) based methods [29, 9], and many recent deep CNN based methods such as temporal linear encoding networks [7] and non-local neural networks [45]. Among existing methods, the two-stream architecture [35, 10, 43] forms a landmark [3], which mainly employs deep CNNs on the RGB and optical flow streams to exploit the spatial-temporal information inside videos. In this work, we also adopt the two-stream structure as it naturally provides complementary information for the action prediction task: the RGB stream contributes more in the early observation and the optical flow stream leads in the later progress.

Action Prediction has attracted considerable research effort [22, 34, 46, 23, 4, 50] in recent years. It aims to predict action labels from the early progress of videos and is thus a special case of video-based action recognition. Previous works [32, 2, 26, 21] solve this task via hand-crafted features, while recent works [22, 34, 46, 23, 5, 20, 13, 4, 50] mainly rely on pre-trained deep CNN models for encoding videos, such as the 3D convolutional networks in [22, 23, 46], deep CNNs in [34, 4], and two-stream CNNs in [5, 20, 13, 50]. Among these works, the most common way of predicting actions is to design deep neural network models that reconstruct full videos from partial ones, such as deep sequential context networks [22], the RBF kernelized RNN [34], progressive teacher-student learning networks [46], and adversarial action prediction networks [23]. Other interesting methods include LSTM-based models [33, 20, 47], part-activated deep reinforcement learning [4], residual learning [13, 50], motion prediction [1], and asynchronous activity anticipation [51].

The memory-augmented LSTM (Mem-LSTM) [20] model and adversarial action prediction networks (AAPNet) [23] share some ideas with our approach. However, several essential differences between the proposed AMemNet and Mem-LSTM/AAPNet can be summarized. First, memory networks play distinct roles in Mem-LSTM [20] and AMemNet. Mem-LSTM formulates action labels as video memories and adopts the memory network as a nearest-neighbor classifier. Differently, the proposed AMemNet develops a key-value memory architecture as a generator model and learns value memory slots from full videos as reconstruction bases for the generation purpose. Second, the generator models used in AAPNet [23] and AMemNet are different. AAPNet [23] employs a variational-GAN model, whereas AMemNet develops a memory-augmented generator to explicitly provide auxiliary information for generating full-video features of testing videos.

Memory Networks, i.e., memory-augmented neural networks [48, 12], generally consist of two components: 1) a memory matrix and 2) a neural network controller, where the memory matrix stores information as memory slots and the neural network controller is generally designed for addressing, reading, and writing memories. Representative memory network architectures include end-to-end memory networks [37], key-value memory networks [28, 49], neural Turing machines [12, 44], and recurrent memory networks [40, 38]. Memory networks work well in practice owing to their flexibility in storing auxiliary knowledge and their ability to memorize long-term temporal information. The proposed AMemNet shares the same memory architecture as [28, 49] and employs the memory writing method provided in [12], but it is designed for a different purpose: the memory module in AMemNet is tailored to the action prediction problem.

Figure 1: Illustration of training the proposed AMemNet on the RGB stream. The attention weight $\alpha$ is first given by a query embedding $\mathbf{q}$ and a key memory matrix $\mathbf{M}^{k}$ in the memory addressing. Then, $\alpha$ is used for updating the value memory $\mathbf{M}^{v}_{t-1}\rightarrow\mathbf{M}^{v}_{t}$ with real full video features $\mathbf{v}$ governed by the input/forget gates $\mathbf{W}_{e/a}$. The generated full video feature $\mathbf{\hat{v}}$ is obtained by a memory reading operation with $\alpha$ and $\mathbf{M}^{v}_{t}$, and guided by a class-aware discriminator network $D=\{D_{cls},D_{adv}\}$ under adversarial training. The memory update (dashed lines) is disabled during testing.

3 Methodology

3.1 Problem Setup

Given a set of training videos $\{(x,y)\}$, where $x\in\mathcal{X}$ denotes one video sample and $y\in\mathcal{Y}$ refers to its action category label, action prediction aims to infer $y$ by only observing the beginning sequential frames of $x$ instead of using the entire video. Let $\tau\in(0,1]$ be the observation ratio and $L$ be the length (i.e., the total number of frames) of $x$; a partial video is defined as $x_{1:\lfloor\tau L\rfloor}$, i.e., the subsequence of the full video $x$ from the first frame to the $\lfloor\tau L\rfloor$-th frame. We employ a set of observation ratios $\{\tau_{p}\}_{1}^{P}$ to mimic partial observations at different progress levels and define $x_{p}=x_{1:\lfloor\tau_{p}L\rfloor}$ as the $p$-th progress-level observation of $x$, $p\in\{1,\dots,P\}$. By this means, the training set is augmented to $P$ times the original one, i.e., $\{(x_{p},y)\}$. Following existing work [22, 4, 50], we set $P=10$ and increase $\tau_{p}$ from $0.1$ to $1.0$ with a fixed step of $0.1$.

We propose to solve the action prediction problem with memory networks. The partial video $x_{p}$ is formulated as a query to “retrieve” its lost information from a memory block learned from all the full training videos. To encode video memories, we build a training set as $\{(x_{p},v,y)\}$, where $v$ denotes the full video of $x_{p}$, i.e., $v:=x_{P}$. Different from previous works [22, 20, 23], which require the progress level $p$ during training, the proposed AMemNet is “progress-free”. Hence, for convenience, we omit the subscript $p$ of the partial video $x_{p}$ when no confusion arises, and always denote $(x,v,y)$ as a triplet of the partial observation, full observation, and action category of the same video sample throughout the paper.
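
As a concrete illustration, the progress-level augmentation above can be sketched as follows; a video is assumed to be available as a frame sequence (e.g., a list of frame paths or a tensor of shape [L, C, H, W]), and the function name is purely illustrative:

```python
import math

def make_progress_clips(frames, num_levels=10):
    """Return the P progress-level observations x_p of one video.

    The p-th clip keeps the first floor(tau_p * L) frames, with
    tau_p = p / num_levels (0.1, 0.2, ..., 1.0 for P = 10).
    """
    L = len(frames)
    clips = []
    for p in range(1, num_levels + 1):
        tau = p / num_levels
        end = max(1, math.floor(tau * L))  # keep at least one frame
        clips.append(frames[:end])
    return clips

# Training triplets (x_p, v, y): every partial clip is paired with the full
# video v = clips[-1] and the action label y of the same sample.
```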

Following [35, 43], we train the proposed AMemNet model on the RGB frames and optical flows, respectively, and employ a late fusion mechanism to exploit the spatial-temporal information in a two-stream framework. Each stream employs the same network architecture with its own trainable weights. We refer to $(x^{rgb},v^{rgb},y)$ / $(x^{flow},v^{flow},y)$ as the two modalities, and omit the superscripts ($rgb$/$flow$) when unnecessary.

3.2 Adversarial Memory Networks (AMemNet)

Fig. 1 shows the network architecture of the proposed AMemNet model. Overall, our model consists of three components: 1) a query/memory encoder, 2) a memory generator, and 3) a discriminator. The encoder network vectorizes the given partial/full video into feature representations, the memory generator learns to generate a full video representation conditioned on the partial video query, and the discriminator is trained to distinguish between fake and real full video representations and also to deliver prediction scores over all the categories. During training, the value memories are continuously updated with full videos via erase/add vectors that work as input/forget gates. We detail each component in the following.

3.2.1 Query/Memory Encoder

Given a partial video $x$ and its corresponding full video $v$, we employ deep convolutional neural networks (CNNs) as an encoding model to obtain feature representations as follows: $\mathbf{x}=f_{cnn}(x;\theta_{cnn})$ and $\mathbf{v}=f_{cnn}(v;\theta_{cnn})$, where $\mathbf{x}\in\mathbb{R}^{d}$ and $\mathbf{v}\in\mathbb{R}^{d}$ are the encoded representations for $x$ and $v$, respectively, $d$ is the feature dimension, and $\theta_{cnn}$ parameterizes the CNN model. Following [5, 50], we instantiate $f_{cnn}(\cdot;\theta_{cnn})$ with the pre-trained TSN model [43] for its robust and competitive performance on action recognition.

The proposed AMemNet model utilizes the partial video feature $\mathbf{x}$ as a query to fetch relevant memories, which are learned from full training videos, to generate its full video feature $\mathbf{v}$. Hence, it is natural to directly utilize $f_{cnn}(\cdot;\theta_{cnn})$ as the memory encoder for learning memory representations of full videos. On the other hand, to facilitate the querying process, we further encode the partial video representation $\mathbf{x}$ in a lower-dimensional embedding space by

\mathbf{q}=f_{q}(\mathbf{x};\theta_{q}), \quad (1)

where $\mathbf{q}\in\mathbb{R}^{h}$ denotes the query vector, $h<d$ refers to the dimension of the query embedding, and $f_{q}(\cdot;\theta_{q})$ is given by a fully-connected network. By using Eq. (1), the query encoder is formulated by stacking $f_{q}$ on top of $f_{cnn}$. In this work, the memory and query encoders share the same CNN weights, and we freeze $\theta_{cnn}$ with the pre-trained TSN model to avoid overfitting. Thus, the encoding component of AMemNet is mainly parameterized by $\theta_{enc}=\{\theta_{q}\}$.
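
For concreteness, a minimal PyTorch sketch of the query encoder $f_{q}$ is given below, using $d=1024$ TSN features as input and the layer sizes reported later in Section 4.1; the exact ordering of batch normalization and LeakyReLU is our assumption:

```python
import torch.nn as nn

class QueryEncoder(nn.Module):
    """f_q: map a partial-video feature x in R^d to a query q in R^h."""

    def __init__(self, d=1024, hidden=512, h=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden),
            nn.BatchNorm1d(hidden),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(hidden, h),
        )

    def forward(self, x):      # x: [batch, d]
        return self.net(x)     # q: [batch, h]
```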

3.2.2 Memory Generator

We adopt the key-value memory network architecture [28, 49] and develop it as a generator model by

\mathbf{\hat{v}}=G_{mem}(\mathbf{q};\theta_{mem}), \quad (2)

where $G_{mem}(\cdot;\theta_{mem})$ denotes the memory generator and $\mathbf{\hat{v}}\in\mathbb{R}^{d}$ represents the generated full video representation. Particularly, $G_{mem}(\cdot;\theta_{mem})$ includes two memory blocks, termed as a key memory matrix $\mathbf{M}^{k}\in\mathbb{R}^{N\times h}$ and a value memory matrix $\mathbf{M}^{v}\in\mathbb{R}^{N\times d}$, where $N$ is the number of memory slots in each memory block. A memory slot, in essence, is one row of $\mathbf{M}^{k}$/$\mathbf{M}^{v}$, learned with the query $\mathbf{q}$ and full video memory $\mathbf{v}$ during training. The benefit of using such a key-value structure lies in separating the learning process for different purposes: $\mathbf{M}^{k}$ focuses on memorizing different queries of partial videos, while $\mathbf{M}^{v}$ is trained to distill useful information from full videos for generation. To generate $\mathbf{\hat{v}}$, our memory generator $G_{mem}$ conducts the following three steps in sequence.

1) Memory Addressing. The key memory matrix $\mathbf{M}^{k}$ of $G_{mem}$ provides sufficient flexibility to store similar queries (partial videos) for addressing the relevant value memory slots in $\mathbf{M}^{v}$ with querying attentions. The addressing process is computed by

\alpha[i]=\textup{softmax}(\phi(\mathbf{q},\mathbf{M}^{k}[i]))=\frac{\exp(\phi(\mathbf{q},\mathbf{M}^{k}[i]))}{\sum_{j=1}^{N}\exp(\phi(\mathbf{q},\mathbf{M}^{k}[j]))}, \quad (3)

where $\alpha\in\mathbb{R}^{N}$ denotes the soft attention weights over all the memory slots, $\mathbf{M}^{k}[i]$ refers to its $i$-th row, and $\phi(\cdot,\cdot)$ is a similarity score function, which could be given by the dot product $\phi(\mathbf{a},\mathbf{b})=\mathbf{a}^{\mathrm{T}}\mathbf{b}$ (i.e., cosine similarity for normalized embeddings) or the negative $\ell_{2}$ distance $\phi(\mathbf{a},\mathbf{b})=-\|\mathbf{a}-\mathbf{b}\|_{2}$. Notably, Eq. (3) keeps our memory networks end-to-end differentiable [37, 28], so the key slots are optimized by backpropagation.
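
A short sketch of this addressing step with the dot-product score is shown below; `M_k` would be a learnable parameter of shape $N\times h$ in practice:

```python
import torch.nn.functional as F

def address_memory(q, M_k):
    """Eq. (3): soft attention of queries over the key memory.

    q:   [batch, h] query embeddings.
    M_k: [N, h] key memory matrix.
    Returns alpha: [batch, N], each row summing to one.
    """
    scores = q @ M_k.t()              # phi(q, M^k[i]) = q^T M^k[i]
    return F.softmax(scores, dim=1)
```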

2) Memory Writing. The value memory matrix $\mathbf{M}^{v}$ of $G_{mem}$ memorizes full videos for the generation purpose, where the memory slots attended by a partial video query $\mathbf{q}$ are written with its full video representation $\mathbf{v}$. Specifically, $G_{mem}$ updates the value memory matrix with a gating mechanism and attention weights following [12, 28]. Let $t$ be the current training step and $\mathbf{M}_{t-1}^{v}$ be the value memory matrix from the last step; the update $\mathbf{M}_{t}^{v}\leftarrow\mathbf{M}_{t-1}^{v}$ is obtained by

\mathbf{e}_{t} = \textup{sigmoid}(\mathbf{W}_{e}\mathbf{v}), \quad (4)
\mathbf{a}_{t} = \textup{tanh}(\mathbf{W}_{a}\mathbf{v}), \quad (5)
\mathbf{\tilde{M}}^{v}_{t}[i] = \mathbf{M}^{v}_{t-1}[i]\odot(\mathbf{1}-\alpha_{t}[i]\,\mathbf{e}_{t}), \quad (6)
\mathbf{M}^{v}_{t}[i] = \mathbf{\tilde{M}}^{v}_{t}[i]+\alpha_{t}[i]\,\mathbf{a}_{t}, \quad (7)

where $\mathbf{e}_{t}\in\mathbb{R}^{d}$ and $\mathbf{a}_{t}\in\mathbb{R}^{d}$ represent the erase vector and add vector, respectively, $\odot$ denotes element-wise multiplication, and $\alpha_{t}$ is computed by Eq. (3) with the pair $(\mathbf{q},\mathbf{v})$ arriving at the $t$-th training step. In Eqs. (4) and (5), the erase vector $\mathbf{e}_{t}$ and the add vector $\mathbf{a}_{t}$ act like the forget and input gates in the LSTM model [15], implemented by two linear projection matrices (we omit all bias vectors to simplify notation) $\mathbf{W}_{e}\in\mathbb{R}^{d\times d}$ and $\mathbf{W}_{a}\in\mathbb{R}^{d\times d}$, respectively. The vector $\mathbf{e}_{t}$ decides the forgetting degree of the memory slots in $\mathbf{M}^{v}_{t-1}$, while $\mathbf{a}_{t}$ provides the new content written into $\mathbf{M}^{v}_{t}$. By using the query attentions $\alpha_{t}$, Eqs. (6) and (7) mainly update the most attended memory slots ($\alpha_{t}[i]\rightarrow 1$) and leave the slots that are irrelevant to the query $\mathbf{q}$ ($\alpha_{t}[i]\rightarrow 0$) nearly unchanged.
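
The gated write of Eqs. (4)-(7) can be sketched as follows (batch size one for clarity; how a mini-batch of updates is aggregated is an implementation choice not specified here):

```python
import torch

def write_memory(M_v, alpha, v, W_e, W_a):
    """Eqs. (4)-(7): update the value memory with a real full-video feature.

    M_v:      [N, d] value memory from the previous step (M^v_{t-1}).
    alpha:    [N] attention weights of the current query.
    v:        [d] full-video feature.
    W_e, W_a: [d, d] erase/add projections (bias terms omitted).
    """
    e = torch.sigmoid(W_e @ v)                        # erase vector, Eq. (4)
    a = torch.tanh(W_a @ v)                           # add vector,   Eq. (5)
    M_tilde = M_v * (1.0 - alpha.unsqueeze(1) * e)    # slot-wise erase, Eq. (6)
    return M_tilde + alpha.unsqueeze(1) * a           # slot-wise add,   Eq. (7)
```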

3) Memory Reading. After updating the value memory matrix $\mathbf{M}^{v}$, $G_{mem}$ generates the full video representation $\mathbf{\hat{v}}$ by reading memory slots from $\mathbf{M}^{v}$ in the following way:

\mathbf{\hat{v}}=\mathbf{x}+\sum_{i}\alpha[i]\,\mathbf{M}^{v}[i], \quad (8)

which adds a skip-connection between the partial video feature $\mathbf{x}$ and the memory output. Notably, Eq. (8) enables $G_{mem}$ to memorize the residual between a partial video and its corresponding entire one.
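
The reading step then reduces to an attention-weighted sum over value slots plus the skip connection, e.g.:

```python
def read_memory(x, alpha, M_v):
    """Eq. (8): generate the full-video feature v_hat.

    x:     [batch, d] partial-video features.
    alpha: [batch, N] attention weights from memory addressing.
    M_v:   [N, d] value memory matrix.
    """
    return x + alpha @ M_v   # skip connection + weighted sum of value slots
```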

In summary, the memory generator $G_{mem}(\cdot;\theta_{mem})$ defined in Eq. (2) is implemented through Eqs. (3)-(8), where $\theta_{mem}$ includes the key/value memory matrices and all the learnable gate parameters, i.e., $\theta_{mem}=\{\mathbf{M}^{k},\mathbf{M}^{v},\mathbf{W}_{e},\mathbf{W}_{a}\}$.

3.2.3 Discriminator

The discriminator network is designed with two purposes: 1) predicting the true action category given the real/generated ($\mathbf{v}$/$\mathbf{\hat{v}}$) full video representation, and 2) distinguishing the real full video representation $\mathbf{v}$ from the fake one $\mathbf{\hat{v}}$. Inspired by [6, 23], we build the discriminator in a compositional way: $D(\cdot;\theta_{D}):=\{D_{cls}(\cdot;\theta_{cls}),D_{adv}(\cdot;\theta_{adv})\}$, where $D_{cls}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{|\mathcal{Y}|}$ works as a classifier to predict probability scores over the $|\mathcal{Y}|$ action classes, and $D_{adv}:\mathbb{R}^{d}\rightarrow[0,1]$ follows the same definition as in the GAN model [11] to infer the probability of a given sample being real. The discriminator $D$ in our model is formulated as fully-connected networks parameterized by $\theta_{D}=\{\theta_{cls},\theta_{adv}\}$.
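
A minimal sketch of the class-aware discriminator is given below, assuming one linear layer per head as described in Section 4.1; the head names are illustrative:

```python
import torch
import torch.nn as nn

class ClassAwareDiscriminator(nn.Module):
    """D = {D_cls, D_adv} over full-video features of dimension d."""

    def __init__(self, d=1024, num_classes=101):
        super().__init__()
        self.cls_head = nn.Linear(d, num_classes)  # D_cls: class logits
        self.adv_head = nn.Linear(d, 1)            # D_adv: real/fake score

    def forward(self, feat):                       # feat: [batch, d]
        logits = self.cls_head(feat)               # softmax applied in the loss
        real_prob = torch.sigmoid(self.adv_head(feat)).squeeze(1)
        return logits, real_prob
```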

3.3 Objective Function

The main goal of this work is to deliver realistic full-video representations for partial videos to predict their correct action classes. To this end, three loss functions are jointly employed for training the proposed AMemNet model.

Adversarial Loss. Given a partial video feature $\mathbf{x}$ and its real full video representation $\mathbf{v}$, we compute the adversarial loss $\mathcal{L}_{adv}$ by

\mathcal{L}_{adv}=\mathbb{E}_{\mathbf{v}}[\log D_{adv}(\mathbf{v})]+\mathbb{E}_{\mathbf{q}}[\log(1-D_{adv}(G_{mem}(\mathbf{q})))]. \quad (9)

The discriminator $D_{adv}$ tries to differentiate $\mathbf{\hat{v}}=G_{mem}(\mathbf{q})$ from the real one $\mathbf{v}$ by maximizing $\mathcal{L}_{adv}$, while, on the contrary, the memory generator $G_{mem}$ aims to fool $D_{adv}$ by minimizing $\mathcal{L}_{adv}$. By using Eq. (9), we could employ $D_{adv}$ to push $G_{mem}$ towards generating realistic full video features.
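
In practice, Eq. (9) can be optimized with the standard binary cross-entropy. The sketch below uses the common non-saturating generator objective in place of literally minimizing $\log(1-D_{adv}(\mathbf{\hat{v}}))$, which is an implementation choice rather than the formulation stated above:

```python
import torch
import torch.nn.functional as F

def adversarial_losses(p_real, p_fake):
    """p_real = D_adv(v), p_fake = D_adv(G_mem(q)); both in [0, 1].

    p_fake should be computed from a detached v_hat when updating D.
    """
    d_loss = F.binary_cross_entropy(p_real, torch.ones_like(p_real)) + \
             F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
    g_loss = F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
    return d_loss, g_loss   # D minimizes d_loss, G minimizes g_loss
```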

Reconstruction Loss. The adversarial loss $\mathcal{L}_{adv}$ encourages our model to generate video features approaching the real feature distribution of full videos, yet without considering the reconstruction error at an instance level, which might miss some useful information for recovering $\mathbf{v}$ from $\mathbf{x}$. In light of this, we define a reconstruction loss as

\mathcal{L}_{rec}=\mathbb{E}_{(\mathbf{x},\mathbf{v})}\|G_{mem}(f_{q}(\mathbf{x}))-\mathbf{v}\|_{2}^{2}, \quad (10)

which calculates the squared Euclidean distance between the generated feature $\mathbf{\hat{v}}=G_{mem}(f_{q}(\mathbf{x}))$ and its corresponding full video feature $\mathbf{v}$. Eq. (10) further guides the memory generator by bridging the gap between $\mathbf{x}$ and $\mathbf{v}$.

Classification Loss. It is important for $G_{mem}$ to generate discriminative representations $\mathbf{\hat{v}}$ for predicting different action classes. Thus, it is natural to impose a classification loss $\mathcal{L}_{cls}$ on training the memory generator as follows:

\mathcal{L}_{cls}^{v} = \mathbb{E}_{(\mathbf{v},\mathbf{y})}H(\mathbf{y},D_{cls}(\mathbf{v})), \quad (11)
\mathcal{L}_{cls}^{x} = \mathbb{E}_{(\mathbf{x},\mathbf{y})}H(\mathbf{y},D_{cls}(G_{mem}(f_{q}(\mathbf{x})))), \quad (12)

where $\mathbf{y}\in\mathbb{R}^{|\mathcal{Y}|}$ indicates the one-hot vector of an action label $y$ over $|\mathcal{Y}|$ classes and $H(\cdot,\cdot)$ computes the cross-entropy between two probability distributions. Let $\mathbf{\hat{y}}\in\mathbb{R}^{|\mathcal{Y}|}$ be the output of $D_{cls}$; then $H(\mathbf{y},\mathbf{\hat{y}})=-\sum_{i=1}^{|\mathcal{Y}|}\mathbf{y}[i]\log\mathbf{\hat{y}}[i]$.

Different from [23], which only employs the classification loss $\mathcal{L}_{cls}^{v}$ for training the discriminator model with full videos, we employ Eq. (11) and Eq. (12) to train the discriminator $D_{cls}$ and the memory generator $G_{mem}$ alternately. The benefit is that a high-quality classifier $D_{cls}$ is first obtained by minimizing $\mathcal{L}_{cls}^{v}$ on real full videos, and then $D_{cls}$ is leveraged to “teach” $G_{mem}$ to generate representations $\mathbf{\hat{v}}$ that lower $\mathcal{L}_{cls}^{x}$. By this means, $G_{mem}$ learns discriminative information from $D_{cls}$.
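
Both classification terms are plain cross-entropy losses; a minimal sketch with integer action labels is:

```python
import torch.nn.functional as F

def classification_losses(logits_real, logits_fake, y):
    """Eqs. (11)-(12).

    logits_real: D_cls(v) on real full-video features    -> trains D_cls.
    logits_fake: D_cls on the generated features          -> teaches G_mem.
    y:           [batch] integer action labels.
    """
    loss_cls_v = F.cross_entropy(logits_real, y)   # Eq. (11)
    loss_cls_x = F.cross_entropy(logits_fake, y)   # Eq. (12)
    return loss_cls_v, loss_cls_x
```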

Final Objective. By summarizing Eq. (9) to Eq. (12), the final objective function of the proposed AMemNet model is given by

\max_{\theta_{D}} ~ \mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}^{v}, \quad (13)
\min_{\theta_{G}} ~ \mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}^{x}+\lambda_{rec}\mathcal{L}_{rec}, \quad (14)

where $\theta_{G}=\{\theta_{enc},\theta_{mem}\}$ includes all the trainable parameters for generating $\mathbf{\hat{v}}$ from $\mathbf{x}$, $\theta_{D}=\{\theta_{cls},\theta_{adv}\}$ parameterizes the discriminator, and $\lambda_{cls}$ and $\lambda_{rec}$ are the trade-off parameters balancing the different loss functions. During training, we optimize $\theta_{D}$ and $\theta_{G}$ by alternately solving Eq. (13) and Eq. (14) while fixing the other.
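
Putting Eqs. (13)-(14) together, one alternating update could look like the sketch below, following the optimizer and trade-off settings reported in Section 4.1; `encoder`, `generator`, and `disc` denote hypothetical modules wrapping $f_{q}$, $G_{mem}$ (addressing, writing, and reading), and $D$, and the cross-entropy term of Eq. (13) is minimized as usual:

```python
import torch
import torch.nn.functional as F

lambda_cls, lambda_rec = 1.0, 0.1   # trade-off weights reported in Section 4.1

def train_step(x, v, y, encoder, generator, disc, opt_d, opt_g):
    """x, v: pre-extracted TSN features of the partial/full video; y: labels."""
    # --- update the discriminator, Eq. (13) (run twice per step in the paper) ---
    q = encoder(x)
    v_hat = generator(q, v)                  # addressing + writing + reading
    logits_v, p_real = disc(v)
    _, p_fake = disc(v_hat.detach())         # no gradient into G here
    d_loss = (F.binary_cross_entropy(p_real, torch.ones_like(p_real))
              + F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
              + lambda_cls * F.cross_entropy(logits_v, y))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- update the encoder + memory generator, Eq. (14) ---
    q = encoder(x)
    v_hat = generator(q, v)
    logits_fake, p_fake = disc(v_hat)
    g_loss = (F.binary_cross_entropy(p_fake, torch.ones_like(p_fake))
              + lambda_cls * F.cross_entropy(logits_fake, y)
              + lambda_rec * F.mse_loss(v_hat, v))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```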

Figure 2: Illustration of applying AMemNet over the RGB and optical flow streams.

3.4 Two-Stream Fusion for Action Prediction

After training the proposed AMemNet model via Eq. (13) and Eq. (14), we freeze the model weights $\theta=\{\theta_{G},\theta_{D}\}$ and suppress the memory writing operations in Eqs. (4)-(7) for testing AMemNet. Particularly, given a partial video feature $\mathbf{x}$, we predict its action label by

\mathbf{\hat{y}}=D_{cls}(G_{mem}(f_{q}(\mathbf{x}))), \quad (15)

where $\mathbf{\hat{y}}\in\mathbb{R}^{|\mathcal{Y}|}$ denotes the probability distribution over the $|\mathcal{Y}|$ action classes for $\mathbf{x}$.

As shown in Fig. 2, we adopt a two-stream framework [35, 43] to exploit the spatial and temporal information of given videos, where we first test AMemNet on each stream (i.e., RGB frames and optical flow) individually, and then fuse the prediction scores to obtain the final result. Given $\mathbf{x}^{rgb}$ and $\mathbf{x}^{flow}$, we obtain the prediction results $\mathbf{\hat{y}}^{rgb}$ and $\mathbf{\hat{y}}^{flow}$ by using Eq. (15) with $\theta^{rgb}$ and $\theta^{flow}$, respectively. The final prediction result is given by

\mathbf{\hat{y}}^{fusion}=\mathbf{\hat{y}}^{rgb}+\beta\,\mathbf{\hat{y}}^{flow}, \quad (16)

where $\beta$ is the fusion weight for integrating the scores given by the stream of spatial RGB frames and the stream of temporal optical flow images.
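
At test time, Eqs. (15)-(16) amount to running each stream's trained pipeline and fusing the two score vectors. A sketch with $\beta=1.5$ as in our experiments is shown below; `model_rgb`/`model_flow` are hypothetical wrappers of $D_{cls}(G_{mem}(f_{q}(\cdot)))$ for each stream:

```python
import torch

@torch.no_grad()
def predict_fused(x_rgb, x_flow, model_rgb, model_flow, beta=1.5):
    """Two-stream late fusion for action prediction."""
    y_rgb = torch.softmax(model_rgb(x_rgb), dim=1)     # \hat{y}^{rgb},  Eq. (15)
    y_flow = torch.softmax(model_flow(x_flow), dim=1)  # \hat{y}^{flow}, Eq. (15)
    y_fused = y_rgb + beta * y_flow                    # Eq. (16)
    return y_fused.argmax(dim=1)                       # predicted action labels
```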

Table 1: Action prediction accuracy (%) under 10 observation ratios on the UCF101 dataset.
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
Single-stream:
IBoW [32] | 36.29 | 65.69 | 71.69 | 74.25 | 74.39 | 75.23 | 75.36 | 75.57 | 75.79 | 75.79
MSSC [2] | 34.05 | 53.31 | 58.55 | 57.94 | 61.79 | 60.86 | 63.17 | 63.64 | 61.63 | 61.63
MTSSVM [21] | 40.05 | 72.83 | 80.02 | 82.18 | 82.39 | 83.21 | 83.37 | 83.51 | 83.69 | 82.82
DeepSCN [22] | 45.02 | 77.64 | 82.95 | 85.36 | 85.75 | 86.70 | 87.10 | 87.42 | 87.50 | 87.63
PA-DRL [4] | 81.36 | 82.63 | 82.90 | 83.51 | 84.01 | 84.38 | 85.09 | 85.41 | 85.81 | 86.15
PTSL [46] | 83.32 | 87.13 | 88.92 | 89.82 | 90.85 | 91.04 | 91.28 | 91.23 | 91.31 | 91.47
Two-stream:
Mem-LSTM [20] | 51.02 | 80.97 | 85.73 | 87.76 | 88.37 | 88.58 | 89.09 | 89.38 | 89.67 | 90.49
TSL [5] | 82.20 | 86.70 | 88.50 | 89.50 | 90.10 | 91.00 | 91.50 | 91.90 | 92.40 | 92.50
RGN-KF [50] | 83.12 | 85.16 | 88.44 | 90.78 | 91.42 | 92.03 | 92.00 | 93.19 | 93.13 | 93.14
AAPNet [23] | 90.25 | 93.10 | 94.46 | 95.41 | 95.89 | 96.09 | 96.27 | 96.35 | 96.47 | 96.36
Baselines:
TSN [43] | 86.76 | 89.29 | 90.64 | 91.81 | 91.73 | 92.47 | 92.97 | 93.15 | 93.31 | 93.42
TSN+finetune | 88.88 | 91.52 | 93.01 | 94.05 | 94.66 | 95.34 | 95.64 | 95.92 | 95.90 | 96.00
TSN+KNN | 85.69 | 88.71 | 90.34 | 91.29 | 91.78 | 92.33 | 92.41 | 92.75 | 92.96 | 93.11
Our model:
AMemNet-RGB | 85.95 | 87.47 | 88.21 | 88.57 | 89.26 | 89.51 | 89.81 | 89.99 | 90.06 | 90.17
AMemNet-Flow | 83.64 | 88.32 | 90.74 | 92.18 | 93.20 | 93.78 | 94.40 | 94.75 | 94.88 | 94.96
AMemNet | 92.45 | 94.60 | 95.55 | 96.00 | 96.45 | 96.67 | 96.97 | 96.95 | 97.07 | 97.03

4 Experiments

4.1 Experimental Setting

Datasets. Two benchmark video datasets, UCF101 [36] and HMDB51 [24], are used in the experiments. The UCF101 dataset consists of 13,320 videos from 101 human action classes covering a wide range of human activities, and the HMDB51 dataset collects 6,766 video clips from movies and web videos over 51 action categories. We adopt the standard training/testing splits of both datasets following [43, 23]. We test the proposed AMemNet model over the three splits and report the average prediction result for each dataset. We employ the preprocessed RGB frames and optical flow images provided in [10].

Implementation Details. The proposed AMemNet is built on top of temporal segment networks (TSN) [43], where we adopt the BN-Inception network [16] as the backbone and employ the model pre-trained on the Kinetics dataset [19]. The same data augmentation strategy (e.g., cropping and jittering) as provided in [43] is employed for encoding all the partial and full videos into $d=1024$ dimensional feature representations. We formulate $f_{q}$ as a fully-connected network with two layers, where the middle layer has 512 hidden units and the final query embedding size is set to $h=256$. Batch normalization and LeakyReLU are both used in $f_{q}$. We employ $N=512$ memory slots for the key and value memory matrices, and hence $\mathbf{M}^{k}\in\mathbb{R}^{512\times 256}$ and $\mathbf{M}^{v}\in\mathbb{R}^{512\times 1024}$. All the memory matrices and gating parameters in $\theta_{mem}$ are randomly initialized. We implement the discriminator network with one fully-connected layer, where the softmax and sigmoid activation functions are used for $D_{cls}$ and $D_{adv}$, respectively.

For each training step, we first employ the Adam optimizer with a learning rate of 0.0001 to update $\theta_{D}$ with Eq. (13) twice, and then optimize $\theta_{G}$ once by solving Eq. (14) with the SGD optimizer using a learning rate of 0.0001 and a momentum of 0.9. We set the batch size to 64. For all the datasets, we set $\lambda_{cls}=1$ to strengthen the impact of $D_{cls}$ on the memory generator $G_{mem}$ for encouraging discriminative representations, and set $\lambda_{rec}=0.1$ to avoid overemphasizing the reconstruction of each individual video sample, which could lead to overfitting. The fusion weight $\beta$ is fixed at 1.5 for all the datasets following [43]. All code in this work was implemented in PyTorch and run on Titan X GPUs.

Compared Methods. We compare our approach with three kinds of methods as follows. 1) Single-stream methods: Integral BoW (IBoW) [32], mixture segments sparse coding (MSSC) [2], multiple temporal scales SVM (MTSSVM) [21], deep sequential context networks (DeepSCN) [22], part-activated deep reinforcement learning (PA-DRL) [4], and progressive teacher-student learning (PTSL) [46]. 2) Two-stream methods: memory augmented LSTM (Mem-LSTM) [20], temporal sequence learning (TSL) [5], residual generator network with Kalman filter (RGN-KF) [50], and adversarial action prediction networks (AAPNet) [23]. We implemented AAPNet with the same pre-trained TSN features as our approach and report the authors' published results for the other single/two-stream methods. 3) Baselines: We also compare AMemNet with temporal segment networks (TSN) [43]. Specifically, we test the TSN model pre-trained on the UCF101/HMDB51 dataset as a baseline, and finetune the TSN model pre-trained on the Kinetics dataset for UCF101 and HMDB51, respectively (TSN+finetune). Moreover, we train a k-nearest neighbors (KNN) classifier on the Kinetics pre-trained TSN features, termed TSN+KNN, and report its best performance by selecting $k$ from $\{5,10,20,30,50,100,500\}$. For a fair comparison, we follow the same testing setting as in previous works [22, 4, 50] by evenly dividing all the videos into 10 progress levels, i.e., $P=10$ as described in Section 3.1. However, it is worth noting that the proposed AMemNet does not require any progress label in either training or testing.

Table 2: Action prediction accuracy (%) under 10 observation ratios on the HMDB51 dataset.
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
Two-stream:
Global-Local [25] | 38.80 | 43.80 | 49.10 | 50.40 | 52.60 | 54.70 | 56.30 | 56.90 | 57.30 | 57.30
TSL [5] | 38.80 | 51.60 | 57.60 | 60.50 | 62.90 | 64.60 | 65.60 | 66.20 | 66.30 | 66.30
AAPNet [23] | 56.03 | 60.11 | 64.87 | 67.99 | 70.76 | 72.55 | 74.00 | 74.81 | 75.59 | 75.56
Baselines:
TSN [43] | 47.12 | 52.81 | 59.35 | 62.55 | 64.77 | 67.52 | 68.95 | 69.87 | 70.07 | 70.13
TSN+finetune | 55.13 | 59.82 | 63.88 | 67.02 | 69.74 | 71.72 | 72.98 | 73.43 | 74.08 | 73.55
TSN+KNN | 48.77 | 53.59 | 57.83 | 60.33 | 62.84 | 65.18 | 66.53 | 67.00 | 67.53 | 66.65
Our model:
AMemNet-RGB | 52.55 | 55.52 | 58.27 | 60.55 | 62.53 | 63.87 | 64.41 | 64.61 | 64.99 | 64.86
AMemNet-Flow | 47.41 | 54.43 | 60.26 | 64.51 | 68.03 | 70.53 | 72.10 | 73.05 | 73.39 | 73.52
AMemNet | 57.74 | 62.10 | 66.28 | 70.17 | 72.66 | 74.55 | 75.22 | 75.78 | 76.08 | 76.14
Figure 3: Ablation study for the proposed AMemNet on the UCF101 dataset in terms of (a) RGB, (b) Flow, and (c) Fusion, respectively.
Figure 4: Visual analysis of the proposed AMemNet between early predictable and late predictable action categories. (a) Prediction results of TSN and the proposed AMemNet over 10 categories at 10% video progress on UCF101. (b) and (c) t-SNE embedding results given by TSN and AMemNet, respectively.

4.2 Prediction Performance

UCF101 Dataset. Table 1 summarizes the prediction accuracy of the proposed AMemNet and 13 compared methods on the UCF101 dataset. Overall, AMemNet consistently outperforms all the competitors over different observation ratios with a significant improvement. Impressively, the proposed AMemNet achieves around 92% accuracy when only 10% of a video is observed, which fully validates the effectiveness of applying AMemNet to early action prediction. This benefit mainly comes from the rich key-value structured memories learned from full-video features under the guidance of adversarial training.

The single-stream methods mainly explore the temporal information by using hand-crafted features (e.g., spatio-temporal interest points (STIP) [8] and dense trajectories), as in IBoW [32], MSSC [2], and MTSSVM [21], or by utilizing 3D convolutional networks (e.g., C3D [39]), as in DeepSCN [22] and PTSL [46]. Differently, the two-stream methods deploy convolutional neural networks on two pathways to capture the spatial information of RGB images and the temporal characteristics of optical flow, respectively. On the one hand, the two-stream methods exploit the spatial-temporal information inside videos better than a single stream. The proposed AMemNet inherits the merits of this two-stream architecture, and thus outperforms single-stream methods even when they employ a more powerful CNN encoder, e.g., the 3D ResNeXt-101 [14] used in PTSL [46] is much deeper than the BN-Inception used in AMemNet. On the other hand, compared with the two-stream methods, especially AAPNet [23] implemented with the same backbone as our model, the consistent improvement of AMemNet over AAPNet shows the effectiveness of using the memory generator to deliver “full” video features in the early progress.

In Table 1, AMemNet-RGB and AMemNet-Flow refer to the single-stream results of applying AMemNet on RGB frames and flow images, respectively. Two interesting observations can be drawn: 1) The RGB stream contributes more than the flow stream at the beginning, as still images encapsulating scenes and objects provide key clues for recognizing actions from only a few frames. 2) Late fusion naturally fits action prediction by integrating the complementary information between the two streams over time.

HMDB51 Dataset. Table 2 reports the prediction results on the HMDB51 dataset, which, compared with UCF101, is a more challenging dataset for predicting actions due to its large motion variations rather than static cues across different categories [24]. As can be seen, the flow stream of AMemNet exceeds AMemNet-RGB by around 8% in accuracy once more progress is observed (e.g., $\tau_{p}\geq 0.5$). Even in this case, the proposed AMemNet still consistently improves over AMemNet-Flow by incorporating RGB results across different progress levels. Moreover, the clear improvements of AMemNet over TSN indicate that the full video memories learned by our memory generator enhance the discriminability of video representations in the early progress.

4.3 Model Discussion

Ablation Study. Fig. 3 shows the ablation study of the proposed AMemNet model on the UCF101 dataset in terms of RGB, Flow, and Fusion, respectively, where we test all the methods under different observation ratios (more ablation study results and parameter analyses on the UCF101 and HMDB51 datasets are provided in the supplementary material). We adopt TSN + finetune as a sanity check for our approach and implement two strong ablated models to examine the impact of the two main components of AMemNet as follows. 1) AMemNet w/o Mem refers to our model with the memory generator discarded, i.e., $\theta\backslash\theta_{mem}$; instead, we use the same generator network as in AAPNet [23] to generate full video features. 2) AMemNet w/o GAN is developed without adversarial training and is trained using only a classification loss.

As shown in Fig. 3, AMemNet improves over all the above methods by a clear margin in different cases, which strongly supports the motivation of this work. It is worth noting that, in the early progress (i.e., observation ratio $\tau_{p}\leq 0.3$), AMemNet w/o GAN clearly boosts the performance over AMemNet w/o Mem. This demonstrates the effectiveness of using memory networks to compensate for the limited information in incomplete videos. As more progress is observed, the GAN model leads the generating process since sufficient information is given by the partial videos, and AMemNet w/o Mem surpasses AMemNet w/o GAN after $\tau_{p}>0.7$ on the UCF101 dataset.

Early Predictable vs. Late Predictable. In Fig. 4, we examine the performance of AMemNet on action categories with different properties, e.g., predictability (the progress level required for recognizing an action), on the UCF101 dataset. We compare AMemNet and TSN [43] on the 10% progress-level videos of 10 categories in Fig. 4(a) and show the corresponding t-SNE [41] embeddings of TSN features and the full video features generated by AMemNet in Fig. 4(b) and 4(c), respectively. Inspired by [22], we select 10 action categories from UCF101 and divide them into two groups: 1) the early predictable group, including Billiards, IceDancing, RockClimbingIndoor, PlayingPiano, and PommelHorse, and 2) the late predictable group, including VolleyballSpiking, CliffDiving, HeadMassage, PoleVault, and ThrowDiscus, where the early group can usually be predicted from 10% progress and the late group contains the non-early ones.

As expected, the proposed AMemNet mainly improves over the TSN baseline on late predictable actions in Fig. 4(a), which again demonstrates the realism of the full video features given by our memory generator. Moreover, as shown in Fig. 4(b) and 4(c), while TSN exhibits well-structured feature embeddings for early predictable classes, e.g., IceDancing and PommelHorse, its embeddings mix up for late predictable ones like PoleVault and CliffDiving. In contrast, AMemNet generates full video features that encourage a good cluster structure in the embedding space.

5 Conclusion

In this paper, we presented a novel two-stream adversarial memory networks (AMemNet) model for the action prediction task. A key-value structured memory generator was proposed to generate the full video feature conditioned on the partial video query, and a class-aware discriminator was developed to supervise the generator in delivering realistic and discriminative representations of full videos through adversarial training. The proposed AMemNet adopts input and forget gates for updating the full video memories attended by different queries, which captures the long-term temporal variation across different video progress levels. Experimental results on two benchmark datasets were provided to demonstrate the effectiveness of AMemNet for the action prediction problem compared with state-of-the-art methods.

References

  • [1] Y. Cai, L. Huang, Y. Wang, T.-J. Cham, J. Cai, J. Yuan, J. Liu, X. Yang, Y. Zhu, X. Shen, D. Liu, J. Liu, and N. M. Thalmann. Learning progressive joint propagation for human motion prediction. In Computer Vision – ECCV 2020, pages 226–242, 2020.
  • [2] Y. Cao, D. Barrett, A. Barbu, S. Narayanaswamy, H. Yu, A. Michaux, Y. Lin, S. Dickinson, J. Siskind, and S. Wang. Recognizing human activities from partially observed videos. In CVPR, 2013.
  • [3] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and kinetics dataset. In CVPR, 2017.
  • [4] L. Chen, J. Lu, Z. Song, and J. Zhou. Part-activated deep reinforcement learning for action prediction. In ECCV, September 2018.
  • [5] S. Cho and H. Foroosh. A temporal sequence learning for action recognition and prediction. In WACV, pages 352–361, 2018.
  • [6] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • [7] A. Diba, V. Sharma, and L. Van Gool. Deep temporal linear encoding networks. In CVPR, July 2017.
  • [8] P. Dollar, V. Rabaud, G. Cottrell, and S. Belongie. Behavior recognition via sparse spatio-temporal features. In VS-PETS, 2005.
  • [9] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In CVPR, June 2015.
  • [10] C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, June 2016.
  • [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680. 2014.
  • [12] A. Graves, G. Wayne, and I. Danihelka. Neural turing machines. CoRR, abs/1410.5401, 2014.
  • [13] S. Guo, L. Qing, J. Miao, and L. Duan. Deep residual feature learning for action prediction. In IEEE International Conference on Multimedia Big Data, pages 1–6, 2018.
  • [14] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, June 2018.
  • [15] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, volume 37, pages 448–456, 2015.
  • [17] S. Ji, W. Xu, M. Yang, and K. Yu. 3d convolutional neural networks for human action recognition. TPAMI, 2013.
  • [18] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
  • [19] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, M. Suleyman, and A. Zisserman. The kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
  • [20] Y. Kong, S. Gao, B. Sun, and Y. Fu. Action prediction from videos via memorizing hard-to-predict samples. In AAAI, 2018.
  • [21] Y. Kong, D. Kit, and Y. Fu. A discriminative model with multiple temporal scales for action prediction. In ECCV, 2014.
  • [22] Y. Kong, Z. Tao, and Y. Fu. Deep sequential context networks for action prediction. In CVPR, 2017.
  • [23] Y. Kong, Z. Tao, and Y. Fu. Adversarial action prediction networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(3):539–553, 2020.
  • [24] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.
  • [25] S. Lai, W. Zheng, J. Hu, and J. Zhang. Global-local temporal saliency action prediction. IEEE Transactions on Image Processing, 27(5):2272–2285, 2018.
  • [26] T. Lan, T.-C. Chen, and S. Savarese. A hierarchical representation for future action prediction. In ECCV, 2014.
  • [27] I. Laptev. On space-time interest points. IJCV, 64(2):107–123, 2005.
  • [28] A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, and J. Weston. Key-value memory networks for directly reading documents. In EMNLP, pages 1400–1409, 2016.
  • [29] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In CVPR, 2015.
  • [30] M. Raptis and L. Sigal. Poselet key-framing: A model for human activity recognition. In CVPR, 2013.
  • [31] M. Ryoo and J. Aggarwal. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In ICCV, pages 1593–1600, 2009.
  • [32] M. S. Ryoo. Human activity prediction: Early recognition of ongoing activities from streaming videos. In ICCV, 2011.
  • [33] M. Sadegh Aliakbarian, F. Sadat Saleh, M. Salzmann, B. Fernando, L. Petersson, and L. Andersson. Encouraging lstms to anticipate actions very early. In ICCV, Oct 2017.
  • [34] Y. Shi, B. Fernando, and R. Hartley. Action anticipation with rbf kernelized feature mapping rnn. In ECCV, September 2018.
  • [35] K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • [36] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A Dataset of 101 Human Action Classes From Videos in The Wild. Technical report, CRCV-TR-12-01, 2012.
  • [37] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus. End-to-end memory networks. In NeurIPS, pages 2440–2448, 2015.
  • [38] Z. Tao, S. Li, Z. Wang, C. Fang, L. Yang, H. Zhao, and Y. Fu. Log2intent: Towards interpretable user modeling via recurrent semantics memory unit. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1055–1063, 2019.
  • [39] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • [40] K. M. Tran, A. Bisazza, and C. Monz. Recurrent memory networks for language modeling. In NAACL HLT, pages 321–331, 2016.
  • [41] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
  • [42] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Dense trajectories and motion boundary descriptors for action recognition. IJCV, 103(1):60–79, 2013.
  • [43] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. V. Gool. Temporal segment networks: Towards good practices for deep action recognition. In ECCV, 2016.
  • [44] Q. Wang, H. Yin, Z. Hu, D. Lian, H. Wang, and Z. Huang. Neural memory streaming recommender networks with adversarial training. In SIGKDD, pages 2467–2475, 2018.
  • [45] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In CVPR, June 2018.
  • [46] X. Wang, J.-F. Hu, J.-H. Lai, J. Zhang, and W.-S. Zheng. Progressive teacher-student learning for early action prediction. In CVPR, June 2019.
  • [47] Y. Wang, L. Jiang, M.-H. Yang, L.-J. Li, M. Long, and L. Fei-Fei. Eidetic 3d lstm: A model for video prediction and beyond. In ICLR, 2019.
  • [48] J. Weston, S. Chopra, and A. Bordes. Memory networks. CoRR, abs/1410.3916, 2014.
  • [49] J. Zhang, X. Shi, I. King, and D. Yeung. Dynamic key-value memory networks for knowledge tracing. In WWW, pages 765–774, 2017.
  • [50] H. Zhao and R. P. Wildes. Spatiotemporal feature residual propagation for action prediction. In ICCV, October 2019.
  • [51] H. Zhao and R. P. Wildes. On diverse asynchronous activity anticipation. In Computer Vision – ECCV 2020, pages 781–799, 2020.
Table 3: Overall comparison between ablated models. We provide two-stream fusion results at the $p=0.1$ observation ratio and the results averaged over 10 observation ratios. Generally, $p=0.1$ is the most important observation ratio for action prediction.
Methods | UCF101 (0.1) | UCF101 (avg.) | HMDB51 (0.1) | HMDB51 (avg.)
TSN + finetune | 88.88 | 94.09 | 55.13 | 68.13
AMemNet w/o Mem | 90.17 | 95.12 | 55.49 | 68.68
AMemNet w/o GAN | 91.47 | 95.45 | 56.36 | 70.30
AMemNet w/o Res | 92.26 | 95.62 | 56.00 | 70.33
AMemNet (ours) | 92.45 | 95.97 | 57.74 | 70.67

Appendix A Supplementary Material

We provide 1) more ablation study results (Table 3-5 and Fig. 6) and 2) parameter analyses (Fig. 5(a) and Fig. 5(b)) in the supplementary material. Particularly, we report the averaged results over three splits of the UCF101 and HMDB51 datasets for ablation study, respectively, and mainly conduct the parameter analysis on split 1 of the HMDB51 dataset.

Ablation Study. The final objective function of the proposed AMemNet model is given by

\max_{\theta_{D}} ~ \mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}^{v},
\min_{\theta_{G}} ~ \mathcal{L}_{adv}+\lambda_{cls}\mathcal{L}_{cls}^{x}+\lambda_{rec}\mathcal{L}_{rec},

where $\theta_{G}=\{\theta_{enc},\theta_{mem}\}$ includes all the trainable parameters for generating $\mathbf{\hat{v}}$ from $\mathbf{x}$, $\theta_{D}=\{\theta_{cls},\theta_{adv}\}$ parameterizes the discriminator, and $\lambda_{cls},\lambda_{rec}$ are two trade-off parameters for balancing the different terms.

Table 3 shows the overall comparison between the proposed AMemNet, three ablated models, and one baseline. Table 4 and Table 5 provide the per-stream comparison results on the UCF101 and HMDB51 datasets, respectively. In the proposed AMemNet, the memory-augmented generator $\theta_{G}$ and the adversarial training loss $\mathcal{L}_{adv}$ play the key roles in generating full-video-like features, where $\theta_{G}$ contributes more in the early progress and $\mathcal{L}_{adv}$ leads in the late progress. The reconstruction loss $\mathcal{L}_{rec}$ works as a “regularization” term for adversarial training, and thus AMemNet w/o Res presents a similar overall performance to our full model. As shown in Table 5, $\mathcal{L}_{rec}$ exhibits a more significant impact on the earlier progress (e.g., $p=0.1$ or $0.2$), since the videos in HMDB51 usually have a larger variance than those in UCF101. However, overemphasizing $\mathcal{L}_{rec}$ may lower the performance (see $\lambda_{rec}=10$ in Fig. 5(b)).

Figure 5: (a) Parameter analysis on the number of memory slots ($N$), with $N=$ 64, 128, 256, 512, and 1024. (b) Parameter analysis on the trade-off parameter $\lambda_{rec}$ for the reconstruction loss $\mathcal{L}_{rec}$, with $\lambda_{rec}=$ 0, 0.01, 0.1, 1, and 10.

Parameter Analysis. In the experiments, we set the number of memory slots used in $G_{mem}$ to $N=512$ by default. We study the impact of $N$ in Fig. 5(a), which suggests that $N\geq 512$ leads to stable prediction performance. We fix $\lambda_{cls}=1$ since classification is the main goal of action prediction, and mainly tune $\lambda_{rec}$ in Fig. 5(b), which indicates that a relatively small $\lambda_{rec}\leq 0.1$ is useful for recovering the partial videos. We set $\lambda_{rec}=0.1$ by default.

Table 4: Action prediction accuracy (%) under 10 observation ratios on the UCF101 dataset.
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
RGB:
TSN + finetune | 83.81 | 85.20 | 86.06 | 86.81 | 87.34 | 88.02 | 88.40 | 88.81 | 88.95 | 89.36
AMemNet w/o Mem | 84.92 | 86.30 | 87.38 | 87.80 | 88.40 | 88.97 | 89.47 | 89.82 | 90.03 | 90.04
AMemNet w/o GAN | 85.33 | 86.70 | 87.41 | 87.98 | 88.62 | 88.94 | 89.58 | 89.67 | 89.78 | 89.86
AMemNet w/o Res | 86.03 | 87.27 | 87.97 | 88.34 | 88.90 | 89.39 | 89.73 | 89.93 | 90.01 | 90.12
AMemNet-RGB | 85.95 | 87.47 | 88.21 | 88.57 | 89.26 | 89.51 | 89.81 | 89.99 | 90.06 | 90.17
Flow:
TSN + finetune | 78.85 | 84.17 | 87.14 | 89.11 | 90.85 | 91.94 | 92.82 | 93.61 | 93.70 | 93.93
AMemNet w/o Mem | 79.68 | 84.93 | 88.15 | 90.16 | 91.63 | 92.80 | 93.70 | 94.32 | 94.65 | 94.76
AMemNet w/o GAN | 81.98 | 86.87 | 89.70 | 90.94 | 91.91 | 93.08 | 93.75 | 94.20 | 94.32 | 94.35
AMemNet w/o Res | 83.57 | 88.00 | 90.39 | 91.86 | 92.79 | 93.25 | 93.87 | 94.34 | 94.53 | 94.51
AMemNet-Flow | 83.64 | 88.32 | 90.74 | 92.18 | 93.20 | 93.78 | 94.40 | 94.75 | 94.88 | 94.96
Fusion:
TSN + finetune | 88.88 | 91.52 | 93.01 | 94.05 | 94.66 | 95.34 | 95.64 | 95.92 | 95.90 | 96.00
AMemNet w/o Mem | 90.17 | 92.77 | 94.23 | 95.03 | 95.80 | 96.27 | 96.52 | 96.80 | 96.90 | 96.76
AMemNet w/o GAN | 91.47 | 93.84 | 94.77 | 95.45 | 96.01 | 96.15 | 96.55 | 96.74 | 96.76 | 96.72
AMemNet w/o Res | 92.26 | 94.15 | 94.93 | 95.72 | 96.16 | 96.34 | 96.62 | 96.63 | 96.65 | 96.72
AMemNet (ours) | 92.45 | 94.60 | 95.55 | 96.00 | 96.45 | 96.67 | 96.97 | 96.95 | 97.07 | 97.03
Table 5: Action prediction accuracy (%) under 10 observation ratios on the HMDB51 dataset.
Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0
RGB:
TSN + finetune | 50.82 | 54.61 | 57.03 | 58.97 | 60.30 | 60.98 | 61.54 | 61.94 | 62.23 | 62.68
AMemNet w/o Mem | 51.24 | 54.81 | 57.63 | 59.79 | 61.39 | 62.67 | 63.34 | 63.83 | 63.92 | 64.11
AMemNet w/o GAN | 51.77 | 55.34 | 57.98 | 60.13 | 61.53 | 62.56 | 63.13 | 63.69 | 64.28 | 64.05
AMemNet w/o Res | 51.51 | 54.68 | 57.23 | 59.63 | 61.13 | 61.85 | 62.32 | 62.86 | 63.32 | 63.34
AMemNet-RGB | 52.55 | 55.52 | 58.27 | 60.55 | 62.53 | 63.87 | 64.41 | 64.61 | 64.99 | 64.86
Flow:
TSN + finetune | 45.54 | 53.13 | 57.69 | 62.44 | 66.02 | 68.72 | 70.41 | 71.52 | 71.68 | 71.54
AMemNet w/o Mem | 45.13 | 52.22 | 57.84 | 62.15 | 65.53 | 68.67 | 70.10 | 71.31 | 72.16 | 72.05
AMemNet w/o GAN | 46.84 | 54.58 | 59.68 | 63.60 | 66.63 | 69.62 | 71.21 | 72.13 | 72.88 | 72.26
AMemNet w/o Res | 46.65 | 54.66 | 59.76 | 64.33 | 67.71 | 70.20 | 72.02 | 72.39 | 72.94 | 72.57
AMemNet-Flow | 47.41 | 54.43 | 60.26 | 64.51 | 68.03 | 70.53 | 72.10 | 73.05 | 73.39 | 73.52
Fusion:
TSN + finetune | 55.13 | 59.82 | 63.88 | 67.02 | 69.74 | 71.72 | 72.98 | 73.43 | 74.08 | 73.55
AMemNet w/o Mem | 55.49 | 59.93 | 64.14 | 67.48 | 70.02 | 72.01 | 73.88 | 74.46 | 74.74 | 74.64
AMemNet w/o GAN | 56.36 | 61.57 | 65.72 | 69.38 | 72.29 | 74.22 | 75.46 | 75.79 | 76.08 | 76.11
AMemNet w/o Res | 56.00 | 60.81 | 65.78 | 69.49 | 72.44 | 74.58 | 75.49 | 75.62 | 76.73 | 76.38
AMemNet (ours) | 57.74 | 62.10 | 66.28 | 70.17 | 72.66 | 74.55 | 75.22 | 75.78 | 76.08 | 76.14
Figure 6: Ablation study for the proposed AMemNet on the HMDB51 dataset in terms of (a) RGB, (b) Flow, and (c) Fusion, respectively.