
Unsupervised Video Anomaly Detection
via Normalizing Flows with Implicit Latent Features

MyeongAh Cho1     Taeoh Kim2     Woo Jin Kim1     Suhwan Cho1     Sangyoun Lee1

1Yonsei University, Korea
2NAVER CLOVA Video, Korea
Published in Pattern Recognition, Elsevier
Abstract

In contemporary society, surveillance anomaly detection, i.e., spotting anomalous events such as crimes or accidents in surveillance videos, is a critical task. As anomalies occur rarely, most training data consists of unlabeled videos without anomalous events, which makes the task challenging. Most existing methods use an autoencoder (AE) to learn to reconstruct normal videos; they then detect anomalies based on their failure to reconstruct the appearance of abnormal scenes. However, because anomalies are distinguished by appearance as well as motion, many previous approaches have explicitly separated appearance and motion information—for example, using a pre-trained optical flow model. This explicit separation restricts reciprocal representation capabilities between two types of information. In contrast, we propose an implicit two-path AE (ITAE), a structure in which two encoders implicitly model appearance and motion features, along with a single decoder that combines them to learn normal video patterns. For the complex distribution of normal scenes, we suggest normal density estimation of ITAE features through normalizing flow (NF)-based generative models to learn the tractable likelihoods and identify anomalies using out-of-distribution detection. NF models intensify ITAE performance by learning normality through implicitly learned features. Finally, we demonstrate the effectiveness of ITAE and its feature distribution modeling on six benchmarks, including databases that contain various anomalies in real-world scenarios.

1 Introduction

Figure 1: Reconstruction error of (a) an abnormal input from (b) the AE and (c) the proposed ITAE, and (d) the log-likelihood histogram of a video clip. For a scene in which a person rides a skateboard, so the motion factor is heavily abnormal, an AE with only a static encoder reconstructs the frame accurately and therefore fails to detect the anomaly. In contrast, ITAE with static and dynamic encoders produces large errors, and the normality distribution histogram of each latent feature makes the anomaly distinguishable.

Anomaly detection, also called outlier detection, seeks to identify unusual, unseen, or undefined abnormal data among normal data. It has practical applications in various fields, such as surveillance anomaly detection, defect detection in factories, X-ray security screening, and medical diagnostics. With CCTV cameras now installed in most places, video anomaly detection, such as spotting accidents and crimes among petabytes of surveillance footage, has become critical. Furthermore, human monitoring of unpredictable anomalous events tends to be time-consuming, laborious, and error-prone, and should be replaced with an automated intelligent system.

However, there are several challenges. First, real-world anomalous events such as robberies and car accidents occur very infrequently compared with normal events, resulting in a class imbalance problem between normal and abnormal data. Therefore, the training sets of most surveillance databases only contain normal videos, while anomalous events only exist in the test set. This makes it challenging to train models in a general supervised manner that uses manually labeled data. Second, because anomalies are unbounded, it is impossible to define and collect all existing abnormal events, and the task of labeling is extremely laborious. Therefore, detecting unseen and undefined anomalous events requires the system to learn normality through abundant and easily obtained normal videos.

Since the advent of deep learning, studies on surveillance anomaly detection with large numbers of normal training videos have grown substantially. Frame reconstruction- or prediction-based methods are used predominantly in unsupervised learning approaches [47, 29]. Autoencoder (AE)-structured networks that learn reconstruction (or prediction) only on normal scenes cannot reconstruct abnormal scenes properly at test time, so the large reconstruction error between input and output serves as the anomaly signal. This approach enables training without labeled data and has achieved notable improvements in performance.

In a surveillance system, anomalous events can be distinguished from normal events based on appearance, motion, or both. For example, non-pedestrian objects such as cars traversing a sidewalk differ from a normal scene in appearance; people fighting or chasing each other differ in motion; and people throwing abnormal objects differ in both. It is therefore essential to extract features that capture the appearance as well as the motion of the input video for anomaly detection. Many AE-based methods use explicit motion information from a pre-trained network such as an optical flow network [48, 29] or a pose estimator [36]. However, explicitly separating the two kinds of information forces an inductive bias onto the network and makes it dependent on the pre-trained model. This can degrade network capacity due to the strong prior and prevents fully exploiting end-to-end spatio-temporal representations.

Therefore, we propose an implicit two-path AE (ITAE) that implicitly focuses on appearance and motion information. As it is difficult to capture motion information with a single AE, we suggest a structure with two encoders and a single decoder, in which the two encoders capture relatively static and dynamic features and the decoder learns to combine them and reconstruct the original inputs. In contrast to other two-encoder/two-decoder structures or frameworks with pre-trained feature extractors, we simply add one encoder path and design ITAE with a few shallow layers to make it suitable for video anomaly detection. Inspired by the SlowFast network [12], which achieves satisfactory performance on action recognition tasks, the static and dynamic encoders of ITAE have different temporal and channel sizes so that each focuses on either appearance or motion information. In Fig. 1, we visualize the output when the appearance of the input frame looks normal (a person) but the motion (riding a skateboard) is abnormal. Compared to a one-path AE, the proposed ITAE gives a larger reconstruction error because the motion of the input frame differs from the learned normal frames. This large error leads to anomaly detection, which demonstrates that the dynamic encoder of ITAE is critical.

For complex and diverse scenes, an AE struggles to reconstruct normal patterns and thus has limitations in detecting anomalies. We suggest compensating for this drawback with distribution learning of the latent features extracted from ITAE using a normalizing flow (NF)-based generative model. Classifying reconstructed samples as real or fake with a discriminator through adversarial learning [43] offers limited support of the data distribution and is usually unstable because of its min-max formulation. In contrast, the NF model [8] focuses on density estimation of high-dimensional data using a tractable exact log-likelihood. By directly maximizing the likelihood of the NF model, it is possible to learn high-dimensional normal video features. For distribution modeling, instead of the original input frames, we use latent features from the proposed ITAE that represent normal patterns and show satisfactory results, especially on the ST database [29] with the most diverse anomalies. After training, abnormal events are found by out-of-distribution detection on the ITAE feature likelihood (Fig. 1 (d)). The contributions of this paper are as follows:

  • A new approach, ITAE, is proposed, in which two encoders implicitly focus on static and dynamic features and a decoder learns to combine them to detect an event showing abnormal appearance or motion in surveillance videos.

  • For complex distributions of normal scenes, NF models estimate the density of normality using the features from ITAE, demonstrating the effectiveness of distribution modeling on the learned ITAE embeddings.

  • Unsupervised learning without external datasets or models is conducted, and competitive performance on six surveillance anomaly detection benchmarks is achieved.

2 Related Works

2.1 Convolutional Networks for Video.

As a basic network for video data, 3D convolution-based networks have been proposed and used for feature extraction in video anomaly detection. Recently, transformer-based models for video representation learning have been studied [3, 2, 11, 30]. TimeSformer [3] and the Video Vision Transformer (ViViT) [2] divide space and time information to perform self-attention over a sequence of spatio-temporal tokens; Multiscale Vision Transformers (MViT) [11] contain several channel-resolution scale stages to create a multiscale pyramid of feature activations; Swin-T [30] extends the Swin Transformer to the spatio-temporal domain to exploit the spatio-temporal locality of videos. These methods usually utilize a ResNet backbone because of their low throughput. For video recognition, two-stream networks have also been proposed to model motion features explicitly. However, these require the explicit extraction of temporal information, such as temporal differences or optical flow. A Siamese network structure with contrastive learning [18], which maximizes the similarity of positive pairs while minimizing that of negative ones, has been studied as an unsupervised paradigm for representation learning. Furthermore, Multi-view Contrastive Learning with Noise-robust loss (MvCLN) [55] has been proposed to learn consistent representations from multi-view/modal data. Recently, for video recognition, Feichtenhofer et al. [12] proposed slow- and fast-pathway networks that operate at different temporal and spatial resolutions to extract stationary and motion information. The slow pathway operates on a sparse temporal window, while the fast pathway uses a higher temporal rate. In this paper, to learn normal appearance and motion patterns by reconstructing frames, we propose ITAE, which has two encoders and a single decoder composed of shallow layers without residual blocks.

2.2 Video Anomaly Detection.

Many anomaly detection algorithms based on frame reconstruction have been proposed that harness the powerful representation ability of deep convolutional networks. These algorithms exploit the structure of convolutional AEs [20], recurrent neural networks [34], or 3D convolutions [56]. Other algorithms learn reconstruction with additional objectives [39], use memory modules [15, 42], reconstruct optical flow from frames [48], or encode normal patterns with sparse dictionary learning [57, 35]. Zhou et al. [57] suggested a sparse long short-term memory (SLSTM) unit with adaptive ISTA $l_{1}$-solvers to encapsulate historical information and obtain sparse codes in an unsupervised manner for anomaly detection; Luo et al. [35] proposed a Temporally-coherent Sparse Coding (TSC) framework, a special type of sRNN that preserves the similarities between neighboring frames. Frame prediction-based approaches have been proposed to increase the unpredictability of abnormal samples [29, 38, 41]. However, prediction-based approaches generally require heavier structures or optical flow, and bi-directional models rely heavily on future frames.

As another approach, anomaly detection methods that learn the compactness of normal clusters have been proposed. Clusters are formed, after removing small clusters of normal samples, from features learned with a reconstruction objective [54] or extracted from a pre-trained object detector [22] or pose estimator [36].

Figure 2: Overall framework of the proposed approach. Different frame rates are input into the static and dynamic encoders, which focus on appearance and motion information, respectively. The latent features are concatenated and reconstructed into the input frames by a single decoder. Furthermore, the max-pooling and average-pooling of each latent feature are concatenated and passed through the NF models to estimate the likelihood of normal static and dynamic features.

2.3 Generative Models.

Compared with discriminative models, generative models require neither normal annotations nor proxy tasks such as frame generation. Approaches based on a Gaussian model or non-parametric density estimation have been applied to either latent features or extracted features. Deep generative models can be categorized into implicit and explicit density estimation models [16]. Implicit density models, such as GAN-based anomaly detectors [49], do not define the data likelihood and cannot be used for in-distribution estimation without modifying the discriminator into a likelihood estimator, whereas explicit models first define likelihoods and try to maximize them. Approximation-based models (e.g., VAEs), which approximate the likelihood through its evidence lower bound (ELBO), have been used for anomaly detection. Expectation Maximization (EM) has also been used to estimate density for anomaly detection [40] by clustering features and iteratively applying the E-step to obtain posterior likelihoods and the M-step to update parameters. Auto-regressive models have shown promising results in estimating the density more precisely and have been adapted for anomaly detection [1]. However, these autoregressive models are not efficient for high-dimensional data, cannot be parallelized, and are sensitive to the choice of sequence order.

Tractable density estimators using NFs have been proposed to alleviate these issues [8, 23]. Dinh et al. [8] proposed invertible networks using an affine coupling layer and calculated the tractable likelihood using the change of variables theorem. Kingma and Dhariwal proposed Glow [23], which further improves this design with activation normalization and invertible $1\times 1$ convolutions. In this paper, we suggest distribution learning with NF models using the static and dynamic features obtained from ITAE, which helps to deal with the complex distribution of normal scenes. Although there have been studies using likelihood for anomaly detection, the proposed method differs from them in the following aspects: (1) we compute the exact tractable likelihood through normalizing flows, and (2) we perform density estimation of features learned with two-path encoders and show their effectiveness through experimental analysis, especially on the complex, multi-scene ShanghaiTech dataset, without using any pre-trained external models. To the best of our knowledge, this is the first attempt at learning appearance and motion normality through NFs in video anomaly detection, and it demonstrates clear effectiveness.

3 Proposed Method

3.1 Overview

In surveillance anomaly detection, normal and anomalous scenes have differences in appearance (e.g., a car driving down a sidewalk), motion (e.g., jumping), or both (e.g., chasing a person with an abnormal object). Therefore, to learn the appearance and motion normal patterns of both scene types, we propose an ITAE that implicitly focuses on static and dynamic features. With these learned features from ITAE, we suggest distribution modeling through NF models to learn complex and diverse normality.

Table 1: Instantiation of ITAE. Numbers in parentheses denote kernel size (temporal $\times$ spatial). Output size is in the order {channel $\times$ temporal $\times$ spatial}; S and D denote the static and dynamic paths.

Encoder
Layer | Static kernel (stride) | Dynamic kernel (stride) | Output size
Conv1 | $(1,3^{2})$, stride $1,2^{2}$ | $(5,3^{2})$, stride $1,2^{2}$ | S: $96\times4\times128^{2}$, D: $12\times16\times128^{2}$
Conv2 | $(1,3^{2})$, stride $1,2^{2}$ | $(3,3^{2})$, stride $1,2^{2}$ | S: $128\times4\times64^{2}$, D: $16\times16\times64^{2}$
Conv3 | $(3,3^{2})$, stride $1,1^{2}$ | $(3,3^{2})$, stride $1,1^{2}$ | S: $256\times4\times64^{2}$, D: $32\times16\times64^{2}$
Conv4 | $(3,3^{2})$, stride $1,1^{2}$ | $(3,3^{2})$, stride $1,1^{2}$ | S: $256\times4\times64^{2}$, D: $32\times16\times64^{2}$

Decoder
Layer | Kernel (stride) | Output size
DeConv1 | $(3,3^{2})$, stride $1,1^{2}$ | $256\times4\times64^{2}$
DeConv2 | $(3,3^{2})$, stride $2,2^{2}$ | $128\times8\times128^{2}$
DeConv3 | $(3,3^{2})$, stride $2,2^{2}$ | $96\times16\times256^{2}$
DeConv4 | $(3,3^{2})$, stride $1,1^{2}$ | $3\times16\times256^{2}$

The framework is trained in two steps using only normal training videos. In the first step, ITAE learns to reconstruct the normal frames. As depicted in Fig. 2, the sequence of frames is embedded through the ITAE with the static encoder ($E_{static}$) and the dynamic encoder ($E_{dynamic}$). The embedding features of the two encoders are combined and reconstructed into the original frames through a single decoder. In the second step, to estimate the density of the normal appearance and motion patterns, the NF models ($F_{static}$ and $F_{dynamic}$) learn the distribution of the two embedding features from the ITAE. In this step, ITAE is frozen; otherwise, updating the ITAE weights would shift the feature distribution and make the NF models difficult to optimize.

During testing, the abnormality score is calculated using the reconstruction error of the ITAE and the estimated likelihood of the NF models. When an abnormal scene is input, the ITAE learned using normal frames outputs a large error because of its poor reconstruction. Furthermore, with NF models, the static and dynamic embedding features obtained from ITAE differ from the features of the normal training set, which entails a low likelihood value.

3.2 Implicit two-path AE (ITAE)

Two-path encoder. We design the AE as a two-path encoder to capture appearance and motion information implicitly and mutually. Surveillance videos tend to contain static information that changes minimally between frames, such as the background, whereas dynamic information such as walking or running changes relatively quickly. The input sequences of the two encoders are sampled at different frame rates, and information is transferred between the two encoders through a lateral connection. We input $T/\tau$ frames at a sampling rate of $\tau$ to the static encoder and $T$ frames to the dynamic encoder (we use $\tau=4$ in this study). A lateral connection concatenates the static and dynamic features by matching the temporal size with a $(5,1,1)$ kernel.

As the AE gets deeper and the number of parameters increases, its capacity becomes large enough to reconstruct even anomalies accurately, so we compose each path of four shallow layers. Table 1 presents each encoder, composed of kernels and output features with different spatial and temporal sizes. Furthermore, instead of a framework consisting of two encoder-decoder networks, ITAE fuses the features of the two encoders into one latent feature, which the decoder uses to generate the original frames. These two encoders produce a higher reconstruction error than a one-path encoder for scenes with abnormal motion or appearance and therefore perform better anomaly detection (visualized error maps are depicted in Fig. 4 in Section 5.1).

Decoder. The two embedding features obtained from the encoders are reconstructed into the original input frames through a decoder. Similar to the lateral connection between the encoders, the two final features are channel-wise concatenated and reconstructed (Fig. 2). So that the decoder generates the output by grasping the relationship between the static and dynamic features, we do not use skip connections between the encoder and decoder, which are used primarily in U-Net-style AE structures. Through encoders that focus on the static and dynamic changes of the input scene and a decoder that fuses and reconstructs the two embedded features, the network gains the capacity to examine both sources of information implicitly. Despite its simple structure, the ITAE, which consists of four layers in each encoder and the decoder, is more powerful than other proposed AEs, even those using inception blocks or a ConvLSTM structure with a pre-trained network.
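To make this layout concrete, the following PyTorch sketch wires two 3D-convolutional encoders with different temporal sampling rates, a $(5,1,1)$ lateral connection, and a single decoder, loosely following the channel widths in Table 1. It is a simplified illustration under our own assumptions (fewer layers, chosen paddings, and a strided lateral convolution), not the released implementation.

```python
# Minimal sketch of the implicit two-path AE (ITAE). Channel widths loosely follow
# Table 1; the number of layers, paddings, and the lateral fusion are assumptions.
import torch
import torch.nn as nn

class ITAE(nn.Module):
    def __init__(self, tau=4):
        super().__init__()
        self.tau = tau  # temporal sampling rate of the static path
        # Static encoder: few frames, wide channels (appearance).
        self.static_enc = nn.Sequential(
            nn.Conv3d(3, 96, (1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)), nn.ReLU(),
            nn.Conv3d(96, 128, (1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)), nn.ReLU(),
            nn.Conv3d(128, 256, (3, 3, 3), stride=1, padding=1), nn.ReLU())
        # Dynamic encoder: all frames, narrow channels (motion).
        self.dynamic_enc = nn.Sequential(
            nn.Conv3d(3, 12, (5, 3, 3), stride=(1, 2, 2), padding=(2, 1, 1)), nn.ReLU(),
            nn.Conv3d(12, 16, (3, 3, 3), stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, (3, 3, 3), stride=1, padding=1), nn.ReLU())
        # Lateral connection: (5,1,1) kernel that compresses the dynamic temporal
        # axis by tau so the two features can be concatenated channel-wise.
        self.lateral = nn.Conv3d(32, 32, (5, 1, 1), stride=(tau, 1, 1), padding=(2, 0, 0))
        # Single decoder reconstructs all T input frames from the fused feature.
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(256 + 32, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose3d(128, 96, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.Conv3d(96, 3, 3, padding=1))

    def forward(self, frames):                             # frames: (B, 3, T, H, W)
        f_s = self.static_enc(frames[:, :, ::self.tau])    # T/tau frames -> static feature
        f_d = self.dynamic_enc(frames)                     # T frames    -> dynamic feature
        fused = torch.cat([f_s, self.lateral(f_d)], dim=1)
        return self.decoder(fused), f_s, f_d

recon, f_s, f_d = ITAE()(torch.randn(1, 3, 16, 64, 64))    # 16-frame clip, downscaled demo
```

In this sketch the decoder doubles the temporal and spatial resolutions twice, so a 16-frame $64\times64$ demo clip is reconstructed at its original size; the actual model operates on larger frames as described in Section 4.2.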

3.3 Learning Normality Distribution

We can learn normality from unlabeled normal training data by unsupervised density estimation. Using explicit-likelihood generative models, it is possible to compute the likelihood of input data. An NF-based generative model calculates a tractable likelihood via a change of variables toward a simple distribution (e.g., a multivariate Gaussian). The likelihood is computed by passing the data through an invertible parametric function composed of multiple layers that maps the complex data distribution onto the simple prior distribution, which can in principle be any distribution. In this study, we set the prior $p_{z}$ to be an isotropic unit-norm Gaussian. With an input variable $\boldsymbol{x}\in X$ whose distribution is unknown, a simple distribution $\boldsymbol{z}\sim p_{z}$, and a parametric function $\boldsymbol{f}_{\boldsymbol{\theta}}:X\rightarrow Z$, the integral of the probability density function is

\begin{split}\int_{\boldsymbol{z}}p_{z}(\boldsymbol{z})\,\textrm{d}\boldsymbol{z}&=\int_{\boldsymbol{x}}p_{x}(\boldsymbol{x})\,\textrm{d}\boldsymbol{x}\\&=\int_{\boldsymbol{x}}p_{z}(\boldsymbol{f}_{\boldsymbol{\theta}}(\boldsymbol{x}))\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{\boldsymbol{\theta}}}{\partial\boldsymbol{x}}\right)\right|\,\textrm{d}\boldsymbol{x},\end{split}   (1)

where $\textrm{det}\left(\frac{\partial\boldsymbol{f}_{\boldsymbol{\theta}}}{\partial\boldsymbol{x}}\right)$ is the Jacobian determinant of the function $\boldsymbol{f}_{\boldsymbol{\theta}}$ under the change of variables theorem. When the generative model $\boldsymbol{f}$ with parameters $\boldsymbol{\theta}$ is $\boldsymbol{f}_{\boldsymbol{\theta}}=\boldsymbol{f}_{\boldsymbol{\theta_{M}}}\circ\boldsymbol{f}_{\boldsymbol{\theta_{M-1}}}\circ\cdots\circ\boldsymbol{f}_{\boldsymbol{\theta_{1}}}$ and $\boldsymbol{h}_{i}=\boldsymbol{f}_{\theta_{i}}(\boldsymbol{h}_{i-1})$ with $\boldsymbol{h}_{i-1}\sim p_{i-1}(\boldsymbol{h}_{i-1})$, where $\boldsymbol{h}_{0}=\boldsymbol{x}$ and $\boldsymbol{h}_{M}=\boldsymbol{z}$, the probability density function $p_{i}$ is as follows (for brevity, we omit $\boldsymbol{\theta}$ from $\boldsymbol{f}_{\boldsymbol{\theta}}$):

p_{i}(\boldsymbol{h}_{i}) = p_{i-1}(\boldsymbol{f}^{-1}_{i}(\boldsymbol{h}_{i}))\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}^{-1}_{i}}{\partial\boldsymbol{h}_{i}}\right)\right|   (2)
= p_{i-1}(\boldsymbol{h}_{i-1})\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{i}}{\partial\boldsymbol{h}_{i-1}}\right)^{-1}\right|   (3)
= p_{i-1}(\boldsymbol{h}_{i-1})\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{i}}{\partial\boldsymbol{h}_{i-1}}\right)\right|^{-1}   (4)

Equation 2 is rewritten as Eq. 3 according to the inverse function theorem: if $y=f(x)$ and $x=f^{-1}(y)$, then $\frac{\mathrm{d}f^{-1}(y)}{\mathrm{d}y}=\frac{\mathrm{d}x}{\mathrm{d}y}=\left(\frac{\mathrm{d}y}{\mathrm{d}x}\right)^{-1}=\left(\frac{\mathrm{d}f(x)}{\mathrm{d}x}\right)^{-1}$. Equation 4 follows from the Jacobian of an invertible function: $\textrm{det}(M^{-1})=(\textrm{det}(M))^{-1}$. Then,

\log p_{i}(\boldsymbol{h}_{i})=\log p_{i-1}(\boldsymbol{h}_{i-1})-\log\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{i}}{\partial\boldsymbol{h}_{i-1}}\right)\right|   (5)

By repeatedly applying the rule for change of variables, the expanded equations are as follows:

\log p_{z}(\boldsymbol{z})=\log p_{M-1}(\boldsymbol{h}_{M-1})-\log\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{M}}{\partial\boldsymbol{h}_{M-1}}\right)\right|
=\log p_{M-2}(\boldsymbol{h}_{M-2})-\log\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{M-1}}{\partial\boldsymbol{h}_{M-2}}\right)\right|-\log\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{M}}{\partial\boldsymbol{h}_{M-1}}\right)\right|
\vdots
=\log p_{x}(\boldsymbol{x})-\log\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{1}}{\partial\boldsymbol{x}}\right)\right|-\cdots-\log\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{M}}{\partial\boldsymbol{h}_{M-1}}\right)\right|
=\log p_{x}(\boldsymbol{x})-\sum^{M}_{i=1}\log\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{i}}{\partial\boldsymbol{h}_{i-1}}\right)\right|   (6)

With these steps, the probability density function of $\boldsymbol{x}$ is as given in Eq. 7.

\log p_{x}(\boldsymbol{x})=\log p_{z}(\boldsymbol{z})+\sum^{M}_{i=1}\log\left|\textrm{det}\left(\frac{\partial\boldsymbol{f}_{i}}{\partial\boldsymbol{h}_{i-1}}\right)\right|   (7)

If the density function of the latent variable $\boldsymbol{z}$ is tractable, like a Gaussian distribution, and the Jacobian matrix $\frac{\partial\boldsymbol{f}_{\boldsymbol{\theta}}}{\partial\boldsymbol{x}}$ is triangular, the likelihood of the input variable $\boldsymbol{x}$ can be obtained simply. For density estimation, we use the Glow model with an $L$-level multi-scale architecture and $K$ flow steps, each consisting of an actnorm layer, an invertible $1\times 1$ convolution, and an affine coupling layer [23].
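To make Eqs. 1-7 concrete, the toy sketch below stacks a few affine coupling layers over a unit-Gaussian prior and sums the per-layer log-determinants to obtain the exact log-likelihood. It is a minimal stand-in written under our own assumptions: the actnorm layers, invertible $1\times 1$ convolutions, and multi-scale structure of the Glow model actually used are omitted, and the coupling-network design is illustrative only.

```python
# Toy normalizing flow illustrating Eq. (7):
#   log p_x(x) = log p_z(z) + sum_i log |det(df_i / dh_{i-1})|.
# Glow's actnorm / invertible 1x1 convolutions / multi-scale levels are omitted.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.ReLU(),
                                 nn.Linear(64, dim))          # predicts log-scale and shift

    def forward(self, h):
        h1, h2 = h.chunk(2, dim=1)                            # keep h1, transform h2
        log_s, t = self.net(h1).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                             # keep the scales well-behaved
        h2 = h2 * log_s.exp() + t
        # Triangular Jacobian -> log-determinant is the sum of log-scales.
        return torch.cat([h1, h2], dim=1), log_s.sum(dim=1)

class TinyFlow(nn.Module):
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))

    def log_prob(self, x):
        h, log_det = x, torch.zeros(x.size(0))
        for layer in self.layers:
            h = torch.flip(h, dims=[1])                       # alternate the transformed half
            h, ld = layer(h)
            log_det = log_det + ld
        prior = torch.distributions.Normal(0.0, 1.0)          # isotropic unit Gaussian p_z
        return prior.log_prob(h).sum(dim=1) + log_det         # Eq. (7)

flow = TinyFlow(dim=8)
nll = -flow.log_prob(torch.randn(5, 8)).mean()                # training objective: minimize NLL
```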

By maximizing the likelihood in Eq. 7, an NF model estimates the density of high-dimensional data through multiple layers of convolutional networks. As the likelihood of a generative model depends heavily on image complexity [44], unlike Glow, which begins from the image space, the intermediate latent features of ITAE are used for complex density modeling of normal videos. After training the ITAE with frame reconstruction, we estimate the density of the static and dynamic embedding features obtained from each encoder. Max-pooling and average-pooling along the channel axis of the feature from each path are applied to reduce the dimensions; the pooled maps are concatenated and input into each NF model, i.e., $F_{static}$ and $F_{dynamic}$ (Fig. 2). We also concatenate the resized intensity of the input frame to the $F_{static}$ input, which provides additional sparse appearance information for the feature map. For abnormal scenes, the static and dynamic NF models, trained on latent features that embed the appearance and motion patterns of normal scenes, yield low likelihoods.
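A possible sketch of how the NF inputs could be assembled from the ITAE features, assuming channel-axis max- and average-pooling followed by concatenation as described above; the trilinear resizing of the intensity map appended to the static input is our assumption about an otherwise unspecified detail.

```python
# Sketch of preparing the NF inputs from ITAE features: channel-axis max/avg pooling,
# concatenation, and (for F_static) an appended resized intensity map of the input clip.
import torch
import torch.nn.functional as F

def nf_inputs(f_s, f_d, frames):
    # f_s: (B, C_s, T_s, H, W) static feature; f_d: (B, C_d, T_d, H, W) dynamic feature.
    def pool(f):
        return torch.cat([f.max(dim=1, keepdim=True).values,
                          f.mean(dim=1, keepdim=True)], dim=1)       # (B, 2, T, H, W)
    x_s, x_d = pool(f_s), pool(f_d)
    # Append a resized intensity (grayscale) map of the clip to the static NF input.
    gray = frames.mean(dim=1, keepdim=True)                           # (B, 1, T, H, W)
    gray = F.interpolate(gray, size=tuple(x_s.shape[2:]),
                         mode="trilinear", align_corners=False)
    return torch.cat([x_s, gray], dim=1), x_d

x_static, x_dynamic = nf_inputs(torch.randn(1, 256, 4, 16, 16),       # sizes follow the ITAE sketch
                                torch.randn(1, 32, 16, 16, 16),
                                torch.randn(1, 3, 16, 64, 64))
```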

3.4 Training and Testing

Reconstruction loss function. For reconstruction, the model is trained by minimizing the L2 loss between the input sequence of frames $\boldsymbol{I}$ (the ground truth) and the output frames $\boldsymbol{\hat{I}}$ to make all pixels of the RGB or gray channels similar (Eq. 8). We also add a multi-scale SSIM loss $L_{ms-ssim}$ and a gradient loss $L_{grad}$, which computes the difference of the gradient at each pixel between the input and output frames, to maintain the sharpness of the frames. The total loss is the sum of $L_{2}$, $L_{ms-ssim}$, and $L_{grad}$, as given in Eq. 9. Just as [29, 38, 47] used the sum of the intensity loss and gradient loss with equal importance following [37], we also add the three loss terms with equal importance following [32].

L_{2}=\left\|\boldsymbol{I}-\boldsymbol{\hat{I}}\right\|^{2}_{2}   (8)
L_{recon}=L_{2}+L_{ms-ssim}+L_{grad}   (9)
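A sketch of the reconstruction objective in Eqs. 8 and 9 with the three terms weighted equally; the gradient loss uses simple first differences, and the MS-SSIM term is delegated to an external implementation (e.g., the pytorch-msssim package), both of which are assumptions about details not spelled out here.

```python
# Sketch of L_recon = L_2 + L_ms-ssim + L_grad (Eqs. 8-9), with equal weighting.
import torch

def gradient_loss(pred, target):
    # Absolute difference of spatial first-difference gradients between prediction and target.
    dy_p, dx_p = pred[..., 1:, :] - pred[..., :-1, :], pred[..., :, 1:] - pred[..., :, :-1]
    dy_t, dx_t = target[..., 1:, :] - target[..., :-1, :], target[..., :, 1:] - target[..., :, :-1]
    return (dy_p - dy_t).abs().mean() + (dx_p - dx_t).abs().mean()

def reconstruction_loss(recon, frames, ms_ssim_fn=None):
    l2 = ((frames - recon) ** 2).mean()                      # Eq. (8)
    l_grad = gradient_loss(recon, frames)
    # MS-SSIM is a similarity in [0, 1]; use 1 - MS-SSIM as a loss when a callable is given.
    l_msssim = 1.0 - ms_ssim_fn(recon, frames) if ms_ssim_fn is not None else 0.0
    return l2 + l_msssim + l_grad                            # Eq. (9), equal weighting
```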

Log-likelihood loss function. After training ITAE, the generative models $F_{static}$ and $F_{dynamic}$ are trained with the negative log-likelihood (NLL) loss $L_{nll}$ of the static and dynamic embedding features $\boldsymbol{x}^{s}$ and $\boldsymbol{x}^{d}$ in Eq. 10. As in Eq. 7, the exact log-likelihood $\log p_{x}(\boldsymbol{x};\boldsymbol{\theta})$ of the input feature is calculated through the generative models, and the parameters $\boldsymbol{\theta}$ are updated to minimize the NLL in Eq. 11.

L_{nll}=NLL(\boldsymbol{x}^{s})+NLL(\boldsymbol{x}^{d})   (10)
\boldsymbol{\theta}^{*}=\operatorname*{\arg\min}_{\boldsymbol{\theta}}\,-\log p_{x}(\boldsymbol{x};\boldsymbol{\theta}),\qquad NLL(\boldsymbol{x})=-\log p_{X}(\boldsymbol{x})   (11)

Anomaly score. The anomaly score based on the reconstruction error $R(\boldsymbol{I}_{t},\boldsymbol{\hat{I}}_{t})$ of the $t$-th frame is the difference between the input and the output frame of the ITAE within a sliding patch (in Eq. 12, $P$ indicates an $N\times N$ image patch and $|P|$ is the number of pixels in the patch). For each frame, we compute the mean of the error values over all segments in which it appears. The score $L(\boldsymbol{x}_{t}^{s},\boldsymbol{x}_{t}^{d})$ from the generative models is calculated by adding the NLL values of the static and dynamic features and normalizing the sum in Eq. 13, where $\textrm{norm}(\cdot)$ denotes normalization within a video clip, as in previous studies [29, 1, 42].

R(\boldsymbol{I}_{t},\boldsymbol{\hat{I}}_{t})=\max_{\textrm{sliding}\,P}\frac{1}{|P|}\sum_{i,j\in P}|\boldsymbol{I}_{t}^{i,j}-\boldsymbol{\hat{I}}_{t}^{i,j}|   (12)
L(\boldsymbol{x}_{t}^{s},\boldsymbol{x}_{t}^{d})=\textrm{norm}(NLL(\boldsymbol{x}_{t}^{s})+NLL(\boldsymbol{x}_{t}^{d}))   (13)

The total anomaly score $S_{t}$ is computed by summing the reconstruction error and the NLL score with a scaling factor $\lambda_{L}$ in Eq. 14. $\lambda_{L}$ is an experimentally chosen value from the set [0.1, 0.3, 0.5, 0.7, 0.9, 1.0] (see Section 4.3.4).

S_{t}=R(\boldsymbol{I}_{t},\boldsymbol{\hat{I}}_{t})+\lambda_{L}L(\boldsymbol{x}_{t}^{s},\boldsymbol{x}_{t}^{d})   (14)
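A sketch of how Eqs. 12-14 can be evaluated per frame, assuming the $N\times N$ sliding patch is implemented with average pooling and the clip-level normalization is min-max; the patch size used here is an assumption.

```python
# Sketch of the anomaly score (Eqs. 12-14): maximum mean absolute error over sliding
# N x N patches, plus the clip-normalized NLL of the static/dynamic features scaled by lambda_L.
import torch
import torch.nn.functional as F

def patch_error(frame, recon, n=16):
    err = (frame - recon).abs().mean(dim=0)                   # (H, W) per-pixel error
    patch_means = F.avg_pool2d(err[None, None], kernel_size=n, stride=1)
    return patch_means.max()                                  # Eq. (12)

def anomaly_scores(frames, recons, nll_static, nll_dynamic, lam=0.3):
    r = torch.stack([patch_error(f, g) for f, g in zip(frames, recons)])
    l = nll_static + nll_dynamic
    l = (l - l.min()) / (l.max() - l.min() + 1e-8)            # Eq. (13): normalize within the clip
    return r + lam * l                                        # Eq. (14)

scores = anomaly_scores(torch.rand(8, 3, 64, 64), torch.rand(8, 3, 64, 64),
                        torch.rand(8), torch.rand(8))         # one score per frame
```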

4 Experiments

For the evaluation, we compute the average area under the curve (AUC) and the equal error rate (EER) from the receiver operating characteristic (ROC) curve by gradually changing the threshold on the anomaly score for frame-level annotated databases (the ROC curves are reported in the supplementary material). For a fair comparison, we report micro-AUC performance, as in [14, 22, 13].
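For reference, a minimal sketch of this frame-level evaluation using scikit-learn: AUC from the ROC curve over per-frame anomaly scores, and EER taken where the false-positive and false-negative rates cross. The exact EER interpolation used by individual papers may differ, so this is an assumption.

```python
# Sketch of frame-level AUC/EER computation from per-frame anomaly scores
# (labels: 1 for abnormal frames, 0 for normal frames).
import numpy as np
from sklearn.metrics import roc_curve, auc

def frame_level_metrics(scores, labels):
    fpr, tpr, _ = roc_curve(labels, scores)
    frame_auc = auc(fpr, tpr)
    eer = fpr[np.nanargmin(np.abs(fpr - (1.0 - tpr)))]        # point where FPR ~= FNR
    return frame_auc, eer

auc_val, eer_val = frame_level_metrics(np.random.rand(1000), np.random.randint(0, 2, 1000))
```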

4.1 Databases

UCSD [28]. UCSD consists of two subsets, Ped1 and Ped2, composed of overhead walkway scenes captured by a mounted camera. The foreground object sizes and movement changes are small, and the videos are grayscale and low-resolution. Ped1 has a very low frame resolution and an evaluation inconsistency issue; for example, some studies [43, 54] tested on only 16 videos, while others [20, 56] tested on all videos. For these reasons, we conduct experiments only on Ped2, as in other studies [38, 47]. The test set contains anomalies of non-pedestrian objects on the walkways, such as cars and skaters, and the density of pedestrians varies.

CUHK Avenue [31]. CUHK consists of normal videos in the training set with several outliers included and a slight camera shake. The test set includes anomalies such as a person walking in the wrong direction, strange actions, substantial motion, and foreground scale variation.

Shanghai Tech Campus (ST) [29]. ST is the largest-volume database, containing 13 different scenes, whereas the two databases above each contain a single scene. It includes diverse anomalous events, such as brawling and loitering, as well as sudden motion, across multiple scenes. It is challenging because of its complex camera angles and lighting conditions.

UCF-Crimes [46]. UCF-Crimes consists of long untrimmed surveillance videos that cover 13 real-world anomalies, including abuse, arrest, arson, assault, accident, burglary, explosion, fighting, robbery, shooting, stealing, shoplifting, and vandalism. This is a large-scale anomaly detection dataset composed of 950 unedited real-world surveillance videos with clear anomalies and 950 normal videos, leading to a total of 1,900 videos with a total duration of 128 hours. Unlike an unsupervised anomaly detection dataset whose training set only consists of normal videos, it contains 1,610 training videos with video-level labels and 290 test videos with frame-level labels of much more complicated and diverse backgrounds.

Live-Videos (LV) [25]. LV consists of 28 surveillance videos with a total duration of 3.93 hours captured at different frame rates in indoor and outdoor scenarios with several illumination changes and some camera motions. The anomalies include panic, robberies, kidnap, fighting, and clashes.

UBI-Fights [7]. UBI-Fights is a large-scale abnormal event dataset that consists of 1,000 videos with a total duration of 80 hours. There are 216 videos containing a fight event and 784 normal daily-life videos captured under various conditions, such as indoor, outdoor, grayscale, RGB, rotating, and moving cameras.

4.2 Implementations

We consider three databases in our experiments: UCSD, CUHK, and ShanghaiTech (ST). For training, we resize the input frames to $256\times 256$ and set $T$ to 16. For UCSD, which has a small foreground scale, we use the original frame size ($240\times 360$). We use the Adam optimizer and a cosine annealing scheduler. In the first training step, the batch sizes are 2, 2, and 8, and the learning rates are 1e-3, 1e-2, and 1e-2 for the UCSD, CUHK, and ST databases, respectively. For the second step, the batch sizes are 8, 5, and 8, and the learning rates are 5e-4, 5e-4, and 1e-4. $\lambda_{L}$ is 0.3, 0.1, and 0.7, respectively. Glow is used for the NF models with $K=32$ and $L=3$ ($L=1$ for UCSD). For the ST dataset, which contains multi-scene videos, each anomaly score is normalized to match the scale of the scores. For the real-world scenario datasets, UCF-Crimes, LV, and UBI-Fights, all settings are the same as for ST. Please refer to the supplementary material for detailed information on the experiments.

4.3 Ablation Studies

4.3.1 Structure of ITAE

Table 2: Ablation studies of the two-path AE on the CUHK database (AUC).
Model | One-path encoder | Two-path encoder
SlowFast-18 AE | 83.3 | 81.1
SlowFast-50 AE | 81.8 | 80.7
ITAE | 86.4 | 87.3

In Table 2, we compare the two-path networks SlowFast [12] and ITAE. In contrast to ITAE, SlowFast is a discriminative model proposed for action recognition and has deep ResNet-style layers. For comparison, we train SlowFast AEs (with ResNet-18 and ResNet-50 backbones) by attaching a decoder with residual layers to SlowFast (the experimental conditions are the same as for ITAE). As the SlowFast-18 and -50 AE results show, the two-path encoder performs worse than the one-path encoder, indicating that increased model complexity does not lead to performance gains. The deep and complex structure of the SlowFast AE instead degrades anomaly detection performance. In contrast, ITAE, with a shallow and non-residual network, performs better and improves markedly when both static and dynamic paths are used together (Fig. 4). This result illustrates that ITAE is better suited to video anomaly detection, where both motion and appearance information are important.

4.3.2 Distribution modeling with ITAE features

Table 3: Normal density estimation using raw frames and ITAE features with NF models on three benchmarks (AUC).
Database | NF on frames | NF on ITAE features (w/o ITAE score) | NF on ITAE features (w/ ITAE score)
UCSD Ped2 [28] (1 scene / 28 clips) | 82.2 | 90.6 | 99.2
CUHK [31] (1 scene / 38 clips) | 79.4 | 74.6 | 88.0
ShanghaiTech [29] (13 scenes / 437 clips) | 69.2 | 72.5 | 76.3

Figure 3: AUC scores on three datasets obtained by selecting values of $\lambda_{L}$ from the set [0.1, 0.3, 0.5, 0.7, 0.9, 1.0].

Our framework estimates normality using the ITAE features instead of the raw frames for distribution modeling. In Table 3, even without the ITAE reconstruction score, NF models on the ITAE static and dynamic features show superior or comparable results to those on the raw frames ($128\times 128$ input size), which proves the effectiveness of modeling with learned ITAE features for surveillance video data. Furthermore, on the ST dataset, i.e., the largest and most diverse among the three benchmarks (it has 13 scenes, whereas the others have 1), the NF model exhibits excellent results without the ITAE score. With extensive and diverse data, NF performs more general distribution learning and copes with complex scenes.

4.3.3 Ablation studies of proposed framework

Table 4: Ablation studies of the static and dynamic encoders in ITAE and the NF models on three benchmark databases (UCSD: 1 scene / 28 clips; CUHK: 1 scene / 38 clips; ST: 13 scenes / 437 clips).
Configuration | UCSD [28] AUC / EER | CUHK [31] AUC / EER | ST [29] AUC / EER
ITAE (static) | 97.7 / 5.6 | 86.4 / 21.3 | 73.1 / 33.1
ITAE (static + dynamic) | 98.7 / 5.5 | 87.3 / 19.6 | 74.8 / 31.8
ITAE + NF (static) | 98.8 / 4.4 | 87.0 / 20.2 | 76.0 / 30.3
ITAE + NF (dynamic) | 99.1 / 4.1 | 87.9 / 19.1 | 76.0 / 31.1
ITAE + NFs (static + dynamic) | 99.2 / 3.9 | 88.0 / 19.0 | 76.3 / 30.6

Table 5: Comparison with state-of-the-art methods on three benchmarks; the comparison result of our model is the performance of ITAE and static and dynamic NF models.
w external dataset · model External Methods UCSD [28] CUHK [31] ST [29]
module AUC ERR AUC ERR AUC ERR
FRCN MT-FRCN [21] 92.2 13.9 - - - -
OpticalFlow AbnormalGAN [43] 93.5 14.0 - - - -
Flownet FFP [29] 95.4 - 84.9 - 72.8 -
Flownet2 AMC [38] 96.2 - 86.9 - - -
OpticalFlow MLAD [48] 97.5 4.7 71.5 36.4 - -
Yolov3 Obj-centric [22] 94.3 - 87.4 - 78.7 -
Resnet-50 AnomalyNet [57] 94.9 10.2 86.1 22.0 - -
Flownet UNet-inte [47] 96.3 10.0 85.1 - 73.0 -
OpticalFlow STFF [52] 92.8 12.5 85.5 20.7 - -
SelFlow, Yolov3 BG-Agnostic [14] 98.7 - 92.3 - 82.7 -
Flownet, Yolov3 MONAD [10] 97.2 - 86.4 - 70.9 -
Yolov3 Multi-task [13] 97.5 - 91.5 - 82.4 -
w/o external dataset/model Methods UCSD [28] CUHK [31] ST [29]
AUC ERR AUC ERR AUC ERR
AMDN [54] 90.8 17.0 - - - -
BMAN [24] 96.6 - 90.0 - 76.2 -
None-Recon. DDGAN [9] 95.6 - 84.9 - 73.7 32.0
Mem-guided [42] 97.0 - 88.5 - 70.5 -
Fewshot [33] 96.2 - 85.8 - 77.9 -
STCEN [19] 96.9 8.0 86.6 19.2 73.8 -
AE-Conv2D [20] 90.0 21.7 70.2 25.1 - -
AE-Conv3D [56] 88.6 20.9 80.9 24.4 - -
TSC [34] 91.0 - 80.6 - 67.9 -
StackRNN [34] 92.2 - 81.7 - 68.0 -
HybridAE [39] 84.3 - 82.8 - - -
Recon. Auto-reg [1] 95.4 - - - 72.5 -
MemAE [15] 94.1 - 83.3 - 71.2 -
LRCCDL [27] 96.6 8.9 - - - -
Mem-guided [42] 90.2 - 82.8 - 69.8 -
DissociateAE [5] 96.7 - 87.1 - 73.7 -
ITAE(ours) 98.7 5.5 87.3 19.6 74.8 31.8
ITAE+NFs(ours) 99.2 3.9 88.0 19.0 76.3 30.6
  • 1 scene/ 28 clips,   1 scene/ 38 clips,   13 scenes/ 437 clips

Table 4 presents the results of ablation studies of the static and dynamic paths of ITAE and the NF models. Using the dynamic encoder produces superior results on all three databases compared with using only the static encoder. On the ST database, which contains the most complex and largest-volume videos, the most significant performance improvement, i.e., 1.7%, is achieved when the dynamic path is added. Furthermore, with the NF models, normality distribution modeling of the ITAE latent features shows the best performance. On CUHK, dynamic feature modeling is more effective than static feature modeling because the anomalous events (throwing a bag or papers, a person running on the walkway) are associated primarily with motion. On the ST database, which has the largest training set and the most varied anomalous events in the test set, the performance improves the most when the NF models are added to ITAE, as flow models learn a more general distribution from a large amount of training data.

4.3.4 Ablation studies of $\lambda_{L}$

Figure 3 shows the AUC and EER results on the three benchmarks according to the hyper-parameter $\lambda_{L}$. In Eq. 14, the anomaly score $S_{t}$ is calculated as the sum of the reconstruction score $R(\boldsymbol{I}_{t},\boldsymbol{\hat{I}}_{t})$ and the NF score $L(\boldsymbol{x}_{t}^{s},\boldsymbol{x}_{t}^{d})$, and $\lambda_{L}$ is the scaling factor of the NF score. We compute the scores of ITAE with the static NF model $NF_{static}$, the dynamic NF model $NF_{dynamic}$, and both NF models while varying $\lambda_{L}$ over the set [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]. For UCSD, CUHK, and ST, we select $\lambda_{L}$ as 0.3, 0.1, and 0.7, respectively. In particular, on the ST dataset, composed of multiple and diverse scenes, the highest performance improvement is achieved with the largest weight, $\lambda_{L}=0.7$, which indicates the effectiveness of the density model on diverse and complex scenes. (Please refer to the supplementary material for detailed results.)

Figure 4: Reconstruction error of the AE and the proposed ITAE on abnormal frames from the CUHK Avenue database [31].

4.4 Comparison

A comparison with other state-of-the-art methods on three benchmarks is presented in Table 5. The proposed approach achieves superior or competitive performance relative to the state-of-the-art methods on the three datasets, and ITAE significantly improves performance over well-designed AEs (AE-Conv3D [56], TSC [34], StackRNN [34], and DissociateAE [5]). In particular, ITAE and ITAE+NFs both show superior performance on UCSD Ped2, with 98.7% and 99.2% AUC, respectively, without any external models or datasets such as an optical flow model (e.g., SelFlow, OpticalFlow, Flownet, or Flownet2 pre-trained on the FlyingThings3D and ChairsSDHom datasets), an object detector (e.g., FRCN or Yolov3 pre-trained on the MS-COCO dataset), or a feature extractor (e.g., ResNet-50 pre-trained on the ImageNet dataset). In addition, compared to the latest methods using an object detector [14, 22, 13], some of their results are slightly higher than ours, but the proposed method is still 0.5%, 4.9%, and 1.7% higher on Ped2, respectively. Furthermore, these methods have a crucial issue: detecting anomalies is impossible for object classes that are not pre-trained on external datasets, as noted in BG-Agnostic [14]. Utilizing an object detector for knowledge distillation by treating object classes unseen during training (e.g., bike, car) as anomalies may be difficult to generalize to real-world scenarios.

Without an external model, the memory-based approach [15, 42] that stores and updates normal query features with a memory module shows good performance, but it performs poorly on the ST database with its complex and multiple scenes, which indicates that storing a fixed number of memory items may not be suitable for the general problem. In contrast, ITAE achieves better performance, i.e., 74.8%, on the ST database, and the performance with NFs is 76.3%, the largest improvement among the three benchmarks; multi-scene anomaly detection is very important from the perspective of generality. For all three databases, the results are superior to those of Auto-reg [1], which performs auto-regressive generative modeling that requires a causal network and data ordering for its sequential operations. It is noteworthy that the ITAE+NFs model performs well on all three databases without using any external dataset or model, and the NFs also demonstrate their generalization ability through the large performance improvement on multi-scene anomaly detection.

5 Discussions

Figure 5: Static and dynamic log-likelihood (expressed in normalized negative bits-per-dimension) histograms within a video clip.
Figure 6: Anomaly scores within a video clip and abnormal frames (errors in red) on three benchmarks. The x-axis is the frame range and the y-axis is the anomaly score. The green, blue, and red lines indicate the ITAE score, NF model score, and total score, respectively. The red highlighted ranges are ground-truth abnormal frames.

5.1 Qualitative Results

Figure 4 illustrates a comparison of the AE (consisting of only a static encoder) and ITAE. As in a previous study [42], the error maps are visualized by marking the pixels whose error is larger than the average error value within the frame. The first row of the figure shows a jumping child whose appearance is normal, which leads to a low reconstruction error with the one-path AE. In contrast, ITAE, which focuses on motion as well as appearance, generates a substantial error owing to the poor reconstruction of inputs whose motion differs from the learned normal motion. In the second row, a person throwing papers, the abnormal appearance of the papers and their flying motion produce a large error in ITAE.

Figure 5 is a histogram of likelihood within a video clip from static and dynamic NF models. The first row (a) is a car and biker on the walkway, and the second row (b) is a person riding a bike on the sidewalk. As depicted in the histogram, the likelihoods of both NF models are low in abnormal frames, with abnormal appearance and motion on the sidewalk.

With the two-path encoder and likelihoods, we compute the anomaly score of each frame in Fig. 6. The two scores complement each other and achieve satisfactory results, even when the pedestrian density is high or low and the foreground scale is small or large. (Please refer to the supplementary material for various qualitative results.)

Table 6: Testing results on cross-domain datasets. Each result is the performance on the target dataset of the model trained on the source dataset.
Source → Target | ITAE (ours) | ITAE w/ NFs (ours) | BG-Agnostic [14] | Fewshot [33]
Ped2 → Ped2 | 98.7 | 99.2 | 98.7 | 96.2
CUHK → Ped2 | 97.0 | 94.3 (-4.9%) | 87.0 (-11.7%) | -
ST → Ped2 | 97.8 | 96.8 (-2.4%) | 90.6 (-8.1%) | 82.0 (-14.2%)
CUHK → CUHK | 87.3 | 88.0 | 92.3 | 85.8
ST → CUHK | 84.3 | 85.1 (-2.9%) | 83.6 (-8.7%) | 71.4 (-14.4%)
ST → ST | 74.8 | 76.3 | 82.7 | -
CUHK → ST | 72.5 | 73.0 (-3.3%) | 76.4 (-6.3%) | -

5.2 Cross-domain Testing Results

When considering real-world scenarios, generality is a crucial issue in surveillance anomaly detection. In order to detect undefined abnormal events, it is necessary to learn a general normal pattern from the normal scenes of the training data and to avoid the many false alarms that result from overfitting to dataset-specific normality. As the unsupervised anomaly detection approach trains with unlabeled data, we investigate the generalization ability by testing a model that learned the normal pattern on one dataset on another dataset. As the Ped2 dataset is grayscale, and the CUHK and ST datasets are composed of RGB frames, the color domains of the source and target datasets might differ. Instead of training a model with the same color domain as the target dataset, we use the model trained on the source dataset as it is. Therefore, for CUHK → Ped2 or ST → Ped2, given that a model trained on CUHK or ST takes 3-channel input, we duplicate the grayscale value to 3 channels for evaluation, following Fewshot [33].

Table 6 presents the result of testing the model, which was trained on the source dataset, on the target dataset. ITAE and ITAE+NFs both show a slight decrease in performance, but this decrease is much smaller than that of the latest methods, i.e., BG-Agnostic [14] and Fewshot [33]. When the target dataset is Ped2 and the source dataset is CUHK or ST, the performance degradation of the proposed method is 4.9% and 2.4%, respectively, that of BG-Agnostic is 11.7% and 8.1%, respectively, and that of Fewshot is 14.2% for CUHK. ITAE and NFs also show the smallest decrease on the other target datasets and the highest performance on the Ped2 and CUHK datasets. The degradation is largest for CUHK → Ped2, because Ped2 is grayscale, unlike CUHK, and the characteristics of the datasets, such as camera angle and object size, differ the most. For this reason, the distribution modeling of the NFs is difficult owing to the large difference between the target and source datasets, and the performance of ITAE+NFs is lower than that of ITAE when Ped2 is the target dataset. ITAE+NFs, which focuses on static and dynamic features and performs normality learning, shows high performance in cross-domain testing and effectiveness in generalization ability.

Figure 7: Performance of ITAE and NF models for 1, 6, and 13 scenes in the ST database.
Table 7: Ablation studies of ITAE and the NF models on three real-world benchmark databases.
Configuration | (a) UCF-Crimes [46] AUC / EER | (b) LV [25] AUC / EER | (c) UBI-Fights [7] AUC / EER
ITAE (static) | 66.0 / 36.7 | 70.6 / 36.9 | 56.2 / 45.4
ITAE (static + dynamic) | 66.3 / 36.9 | 74.6 / 30.2 | 57.7 / 44.2
NFs (static + dynamic) | 67.2 / 38.3 | 77.1 / 29.2 | 53.3 / 48.0
ITAE + NF (static) | 69.7 / 36.7 | 77.2 / 27.6 | 57.9 / 44.3
ITAE + NF (dynamic) | 70.7 / 34.9 | 77.4 / 28.3 | 57.8 / 44.1
ITAE + NFs (static + dynamic) | 70.9 / 35.2 | 77.5 / 27.8 | 57.8 / 44.1
Table 8: Comparison with unsupervised anomaly detection methods on three benchmarks; the result for our model is the performance of ITAE and NFs. DEC indicates the decidability index, which measures how well the intra-class (genuine) and inter-class (impostor) score distributions are separated.

(a) UCF-Crimes [46]
Method | Features | AUC
SVM Baseline | - | 50.0
ConvAE [20] | - | 50.6
S-SVDD [45] | - | 58.5
SCL [31] | C3D RGB | 65.5
BODS [50] | I3D RGB | 68.3
GODS [50] | I3D RGB | 70.5
ITAE (ours) | - | 66.3
ITAE+NF (ours) | - | 70.9

(b) Live-Videos (LV) [25]
Method | AUC | EER
AME [17] | 39.8 | 57.2
SCL [31] | 49.6 | 51.0
MVs [4] | 56.6 | 44.9
CS [26] | 61.8 | 41.0
DeepOC [53] | 70.6 | 35.1
ITAE (ours) | 74.6 | 30.2
ITAE+NF (ours) | 77.5 | 27.8

(c) UBI-Fights [7]
Method | AUC | EER | DEC
Spatiotemporal [6] | 52.8 | 44.6 | 0.194
LTR [20] | 53.3 | 48.4 | 0.147
AbnormalGAN [43] | 54.0 | 47.5 | 0.164
$S^{2}$-VAE [51] | 54.1 | 48.0 | 0.059
Binary SVM | 55.6 | 44.3 | 0.128
ITAE (ours) | 57.7 | 44.2 | 0.098
ITAE+NF (ours) | 57.8 | 44.1 | 0.102

5.3 Analysis According to the Number of Scenes

We compose subsets with 1, 6, and 13 scenes from the ST dataset and compare performance as the data become more diverse and extensive. The sizes of the test sets are proportional, and the train-to-test ratio in each subset is similar. As shown in Fig. 7, given various scenes and a large number of videos, the NF model learns a general normal pattern and shows higher performance on the multi-scene subsets than on the single-scene subset. In contrast, ITAE achieves higher performance when there are fewer scenes because of overfitting, but exhibits limitations as the scenes become more diverse. NFs complement this drawback of the AE and produce a clear performance improvement (1.5%) on the most diverse 13-scene subset. From Fig. 7 and Table 3, it can be observed that distribution learning with the NF models on ITAE features is effective in coping with a complex normality distribution.

5.4 Experimental Result and Analysis on Real-world Scenario Benchmarks

We also consider more realistic and dynamic databases: UCF-Crimes, LV, and UBI-Fights. As our method is an unsupervised-learning-based approach, the framework is trained using only normal videos from the training set of each database. In Table 7 (a) and (b), the performance of the NFs is higher than that of ITAE on the UCF-Crimes and LV datasets. The experimental results in Section 5.3 show that as the number of scenes increases, the performance of the autoencoder deteriorates, while the performance of the NF model remains similar because it learns the distribution of general normal patterns. In the case of UBI-Fights, in Table 7 (c), ITAE and the NFs achieve 57.7% and 53.3% AUC, respectively. ITAE shows better performance than the NFs, which seems to indicate that detecting anomalies with reconstruction error is more effective than distinguishing abnormality via the distribution of latent features when the abnormal event is fighting; a fighting event involves a large change in appearance and motion at the frame level, which results in a large reconstruction error that leads to anomaly detection. In Table 8, the performance of ITAE with NFs on the UCF-Crimes, LV, and UBI-Fights datasets is 70.9%, 77.5%, and 57.8%, respectively, achieving the best results among unsupervised anomaly detection methods.

In Fig. 8, the first row is an example of the UCF-Crimes dataset ‘Robbery048’ clip, which is a scene of one person assaulting another; the second row is the LV dataset ‘murder’ clip, which is a scene showing a person being shot. Qualitative results show that anomalies are well distinguished even in clips of complex and diverse anomalies in real-world scenarios.

Figure 8: Qualitative results of ITAE and NFs. Each column is the visualization of (a) the abnormal frame, (b) the error map from ITAE, and (c) the static and dynamic log-likelihood (expressed in normalized negative bits-per-dimension) histogram within a test video clip.
Figure 9: Samples of false positive and false negative scenes. (a) is a false positive sample only in AE, and (b) is a false positive sample in ITAE but a true negative in ITAE with NFs. (c) is a false negative sample in all cases.

5.5 False Positive and Negative Samples

Figure 9 shows false positive (FP) and false negative (FN) samples. In (a), unlike ITAE which considers appearance and motion factor concurrently, the one-path AE shows a high reconstruction error in the pedestrian carrying the red bag. (b) is an FP sample of ITAE, a scene in which a pedestrian carries a long stick, showing a high error on the strange stick that causes a high anomaly score, but the false alarm is prevented by adding the score of the NF model. (c) is an FN sample of the proposed method, a scene where a pedestrian is loitering. In this scene, the appearance of a pedestrian and the motion of walking are both similar to the normal pattern, so the ITAE and NFs are unable to distinguish between normal and abnormal scenes.

5.6 Running Time and Model Complexity

We measure the computational time of the proposed method on the UCSD Ped2 dataset (resolution: $240\times 360$) following [13]. ITAE infers the anomaly score of a frame in 9.1 milliseconds (ms), while the static and dynamic NF models take 22.2 ms and 20.7 ms, respectively. The total framework takes 52 ms per frame, of which calculating the anomaly score takes 0.5 ms. With all components together, the proposed method runs at 19 FPS on an NVIDIA GeForce RTX 3090 with 24 GB of memory. The numbers of parameters of ITAE, the static NF, and the dynamic NF are 7.74M, 10.16M, and 11.05M, respectively; compared to a one-path AE with 5.76M parameters, adding the motion path to ITAE increases the parameter count by only 1.98M because the dynamic encoder focuses on motion with a high frame rate and a low channel size (1/8 of the static path).

6 Conclusion

In this paper, we proposed an ITAE and the distribution modeling of normal features based on an NF model in an unsupervised manner for video anomaly detection. We designed the ITAE to implicitly capture representative static and dynamic information of normal scenes without using a pre-trained network. For the complex normality, using the latent features of ITAE, we modeled the distribution of appearance and motion normality using an NF model through the tractable likelihood. In an experiment on standard benchmarks, ITAE demonstrated high effectiveness in scenes where motion is abnormal by learning the dynamic information of normal scenes. Furthermore, the normality modeling of the ITAE feature achieved superior results, especially when the database is extensive and composed of diverse scenes. The proposed method can be expected to model a general distribution and solve practical problems through a vast number of real-world videos with unsupervised learning.

As CCTV cameras exist in most places and keep records of our daily lives, their use may raise ethical concerns related to privacy invasion. However, from the perspective of developing computer vision and pattern recognition applications, surveillance anomaly detection can help to quickly detect accidents and crimes or automatically prevent them in advance, so the reduction of human time and labor and the positive social impact are noteworthy. Our work, which is an unsupervised approach, makes it possible to learn general normal patterns from various real-world surveillance scenarios, which is promising and expected to accelerate anomaly detection in contemporary society.

Acknowledgement
This research was supported by the Multi-Ministry Collaborative R&D Program (R&D Program for Complex Cognitive Technology) through the National Research Foundation of Korea (NRF), funded by MSIT, MOTIE, and KNPA (NRF-2018M3E3A1057289).

References

  • [1] Davide Abati, Angelo Porrello, Simone Calderara, and Rita Cucchiara. Latent space autoregression for novelty detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 481–490, 2019.
  • [2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.
  • [3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095, 2(3):4, 2021.
  • [4] Sovan Biswas and R Venkatesh Babu. Real time anomaly detection in H.264 compressed videos. In 2013 Fourth National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG), pages 1–4. IEEE, 2013.
  • [5] Yunpeng Chang, Zhigang Tu, Wei Xie, Bin Luo, Shifu Zhang, Haigang Sui, and Junsong Yuan. Video anomaly detection with spatio-temporal dissociation. Pattern Recognition, 122:108213, 2022.
  • [6] Yong Shean Chong and Yong Haur Tay. Abnormal event detection in videos using spatiotemporal autoencoder. In International Symposium on Neural Networks, pages 189–196. Springer, 2017.
  • [7] Bruno Degardin and Hugo Proença. Human activity analysis: Iterative weak/self-supervised learning frameworks for detecting abnormal events. In 2020 IEEE International Joint Conference on Biometrics (IJCB), pages 1–7. IEEE, 2020.
  • [8] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. 2015.
  • [9] Fei Dong, Yu Zhang, and Xiushan Nie. Dual discriminator generative adversarial network for video anomaly detection. IEEE Access, 8:88170–88176, 2020.
  • [10] Keval Doshi and Yasin Yilmaz. Online anomaly detection in surveillance videos with asymptotic bound on false alarm rate. Pattern Recognition, 114:107865, 2021.
  • [11] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6824–6835, 2021.
  • [12] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. Slowfast networks for video recognition. In IEEE international conference on computer vision, pages 6202–6211, 2019.
  • [13] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. Anomaly detection in video via self-supervised and multi-task learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12742–12752, 2021.
  • [14] Mariana-Iuliana Georgescu, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. A background-agnostic framework with adversarial training for abnormal event detection in video. arXiv preprint arXiv:2008.12328, 2020.
  • [15] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1705–1714, 2019.
  • [16] Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
  • [17] Kishanprasad G Gunale and Prachi Mukherji. Deep learning with a spatiotemporal descriptor of appearance and motion estimation for video anomaly detection. Journal of Imaging, 4(6):79, 2018.
  • [18] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 2, pages 1735–1742. IEEE, 2006.
  • [19] Yi Hao, Jie Li, Nannan Wang, Xiaoyu Wang, and Xinbo Gao. Spatiotemporal consistency-enhanced network for video anomaly detection. Pattern Recognition, 121:108232, 2022.
  • [20] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016.
  • [21] Ryota Hinami, Tao Mei, and Shin’ichi Satoh. Joint detection and recounting of abnormal events by learning deep generic knowledge. In Proceedings of the IEEE International Conference on Computer Vision, pages 3619–3627, 2017.
  • [22] Radu Tudor Ionescu, Fahad Shahbaz Khan, Mariana-Iuliana Georgescu, and Ling Shao. Object-centric auto-encoders and dummy anomalies for abnormal event detection in video. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7842–7851, 2019.
  • [23] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in neural information processing systems, pages 10215–10224, 2018.
  • [24] Sangmin Lee, Hak Gu Kim, and Yong Man Ro. Bman: Bidirectional multi-scale aggregation networks for abnormal event detection. IEEE Transactions on Image Processing, 29:2395–2408, 2019.
  • [25] Roberto Leyva, Victor Sanchez, and Chang-Tsun Li. The LV dataset: A realistic surveillance video dataset for abnormal event detection. In 2017 5th International Workshop on Biometrics and Forensics (IWBF), pages 1–6. IEEE, 2017.
  • [26] Roberto Leyva, Victor Sanchez, and Chang-Tsun Li. Video anomaly detection with compact feature sets for online performance. IEEE Transactions on Image Processing, 26(7):3463–3478, 2017.
  • [27] Ang Li, Zhenjiang Miao, Yigang Cen, Xiao-Ping Zhang, Linna Zhang, and Shiming Chen. Abnormal event detection in surveillance videos based on low-rank and compact coefficient dictionary learning. Pattern Recognition, 108:107355, 2020.
  • [28] Weixin Li, Vijay Mahadevan, and Nuno Vasconcelos. Anomaly detection and localization in crowded scenes. IEEE transactions on pattern analysis and machine intelligence, 36(1):18–32, 2013.
  • [29] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6536–6545, 2018.
  • [30] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. arXiv preprint arXiv:2106.13230, 2021.
  • [31] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In IEEE international conference on computer vision, pages 2720–2727, 2013.
  • [32] Yiwei Lu, K Mahesh Kumar, Seyed shahabeddin Nabavi, and Yang Wang. Future frame prediction using convolutional vrnn for anomaly detection. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–8. IEEE, 2019.
  • [33] Yiwei Lu, Frank Yu, Mahesh Kumar Krishna Reddy, and Yang Wang. Few-shot scene-adaptive anomaly detection. In European Conference on Computer Vision, pages 125–141. Springer, 2020.
  • [34] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE International Conference on Computer Vision, pages 341–349, 2017.
  • [35] Weixin Luo, Wen Liu, Dongze Lian, Jinhui Tang, Lixin Duan, Xi Peng, and Shenghua Gao. Video anomaly detection with sparse coding inspired deep neural networks. IEEE transactions on pattern analysis and machine intelligence, 43(3):1070–1084, 2019.
  • [36] Amir Markovitz, Gilad Sharir, Itamar Friedman, Lihi Zelnik-Manor, and Shai Avidan. Graph embedded pose clustering for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10539–10547, 2020.
  • [37] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440, 2015.
  • [38] Trong-Nguyen Nguyen and Jean Meunier. Anomaly detection in video sequence with appearance-motion correspondence. In Proceedings of the IEEE International Conference on Computer Vision, pages 1273–1283, 2019.
  • [39] Trong Nguyen Nguyen and Jean Meunier. Hybrid deep network for anomaly detection. arXiv preprint arXiv:1908.06347, 2019.
  • [40] Yuqi Ouyang and Victor Sanchez. Video anomaly detection by estimating likelihood of representations. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 8984–8991. IEEE, 2021.
  • [41] Chaewon Park, MyeongAh Cho, Minhyeok Lee, and Sangyoun Lee. Fastano: Fast anomaly detection via spatio-temporal patch transformation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2249–2259, 2022.
  • [42] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14372–14381, 2020.
  • [43] Mahdyar Ravanbakhsh, Moin Nabi, Enver Sangineto, Lucio Marcenaro, Carlo Regazzoni, and Nicu Sebe. Abnormal event detection in videos using generative adversarial nets. In 2017 IEEE international conference on image processing, pages 1577–1581, 2017.
  • [44] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F Núñez, and Jordi Luque. Input complexity and out-of-distribution detection with likelihood-based generative models. In International Conference on Learning Representations, 2019.
  • [45] Fahad Sohrab, Jenni Raitoharju, Moncef Gabbouj, and Alexandros Iosifidis. Subspace support vector data description. In International Conference on Pattern Recognition, 2018.
  • [46] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6479–6488, 2018.
  • [47] Yao Tang, Lin Zhao, Shanshan Zhang, Chen Gong, Guangyu Li, and Jian Yang. Integrating prediction and reconstruction for anomaly detection. Pattern Recognition Letters, 129:123–130, 2020.
  • [48] Hung Vu, Tu Dinh Nguyen, Trung Le, Wei Luo, and Dinh Phung. Robust anomaly detection in videos using multilevel representations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 5216–5223, 2019.
  • [49] Chu Wang, Yan-Ming Zhang, and Cheng-Lin Liu. Anomaly detection via minimum likelihood generative adversarial networks. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 1121–1126. IEEE, 2018.
  • [50] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211, 2019.
  • [51] Tian Wang, Meina Qiao, Zhiwei Lin, Ce Li, Hichem Snoussi, Zhe Liu, and Chang Choi. Generative neural networks for anomaly detection in crowded scenes. IEEE Transactions on Information Forensics and Security, 14(5):1390–1399, 2018.
  • [52] Peng Wu, Jing Liu, Mingming Li, Yujia Sun, and Fang Shen. Fast sparse coding networks for anomaly detection in videos. Pattern Recognition, 107:107515, 2020.
  • [53] Peng Wu, Jing Liu, and Fang Shen. A deep one-class neural network for anomalous event detection in complex scenes. IEEE transactions on neural networks and learning systems, 31(7):2609–2622, 2019.
  • [54] Dan Xu, Yan Yan, Elisa Ricci, and Nicu Sebe. Detecting anomalous events in videos by learning deep representations of appearance and motion. Computer Vision and Image Understanding, 156:117–127, 2017.
  • [55] Mouxing Yang, Yunfan Li, Zhenyu Huang, Zitao Liu, Peng Hu, and Xi Peng. Partially view-aligned representation learning with noise-robust contrastive loss. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1134–1143, 2021.
  • [56] Yiru Zhao, Bing Deng, Chen Shen, Yao Liu, Hongtao Lu, and Xian-Sheng Hua. Spatio-temporal autoencoder for video anomaly detection. In Proceedings of the 25th ACM international conference on Multimedia, pages 1933–1941, 2017.
  • [57] Joey Tianyi Zhou, Jiawei Du, Hongyuan Zhu, Xi Peng, Yong Liu, and Rick Siow Mong Goh. Anomalynet: An anomaly detection network for video surveillance. IEEE Transactions on Information Forensics and Security, 14(10):2537–2550, 2019.