
Bayesian Nonparametric Submodular Video Partition for Robust
Anomaly Detection

Hitesh Sapkota     Qi Yu
Rochester Institute of Technology
{hxs1943, qi.yu}@rit.edu
Abstract

Multiple-instance learning (MIL) provides an effective way to tackle the video anomaly detection problem by modeling it as a weakly supervised problem, as labels are usually only available at the video level while missing for frames due to the expensive labeling cost. We propose a novel Bayesian non-parametric submodular video partition (BN-SVP) to significantly improve MIL model training and offer a highly reliable solution for robust anomaly detection in practical settings that include outlier segments or multiple types of abnormal events. BN-SVP essentially performs dynamic non-parametric hierarchical clustering with an enhanced self-transition that groups segments in a video into temporally consistent and semantically coherent hidden states that can be naturally interpreted as scenes. Each segment is assumed to be generated through a non-parametric mixture process that allows variations of segments within the same scene to accommodate the dynamic and noisy nature of many real-world surveillance videos. The scene and mixture component assignment of BN-SVP also induces a pairwise similarity among segments, resulting in non-parametric construction of a submodular set function. Integrating this function with an MIL loss effectively exposes the model to a diverse set of potentially positive instances to improve its training. A greedy algorithm is developed to optimize the submodular function and support efficient model training. Our theoretical analysis ensures a strong performance guarantee of the proposed algorithm. The effectiveness of the proposed approach is demonstrated over multiple real-world anomaly video datasets with robust detection performance.

1 Introduction

Anomaly detection from videos poses fundamental challenges as abnormal activities are usually rare, complicated, and unbounded in nature [15]. Furthermore, segment or frame labels are typically unavailable due to the high labeling cost and, therefore, detection models have to rely on weak video-level labels [19]. There are two main streams of work to handle the challenging anomaly detection task. The first stream treats anomaly detection as an unsupervised learning problem [21]. It assumes that an event is abnormal if it deviates significantly from a predefined set of normal events included in the training data [2, 26, 24]. However, a model trained on limited normal data is likely to capture only specific characteristics present in the training dataset, and therefore, testing normal events that deviate significantly from the training normal events will lead to a high false alarm rate [27]. The second stream of research has attempted to address this limitation by formulating the problem as multiple instance learning (MIL), which models each video as a bag and its segments (or frames) as instances within the bag [19]. The goal is to learn a model that can effectively make frame-level anomaly predictions relying only on the video-level labels during the training process. One effective MIL learning objective is to maximize the gap between the two instances having the respective highest anomaly scores from a pair of positive and negative bags. The maximum score based MIL (referred to as MMIL) model outperformed the unsupervised approaches and achieved promising performance on real-world long surveillance videos [19].

[Figure 1: Examples of outlier (a-b) and multimodal frames (c-e) from the Avenue dataset: (a) Burglary (Outlier 1), (b) Walking (Outlier 2), (c) Running (Modality 1), (d) Loitering (Modality 2), (e) Walking (Modality 3)]

However, there are two key limitations with the MMIL model. First, the presence of noisy outliers (normal events that differ significantly from other normal events) in both abnormal and normal videos may significantly impact the overall model performance. This is because the objective function solely focuses on individual segments from both positive and negative bags, making the training process sensitive to noise. Figure 1 (a-b) shows example normal frames that are significantly different from other normal ones in real-world surveillance videos. The first frame is from a burglary video and looks similar to an abnormal frame from a video with an arson event. The second frame is from a shooting video and looks similar to a fighting frame. Hence, they may serve as outliers in the corresponding videos.

Second, if multiple types of abnormal events (referred to as multi-modal anomaly) are present in a single abnormal video, MMIL may only detect one type of anomaly while missing other important ones due to the limitation of the objective function. Figure 1 (c)-(e) demonstrates three frames with different anomaly types from an example video in the Avenue dataset [16]. In Figure 1 (c), the person is running, which is regarded as a strange action in that context [16]. In (d), a person is waiting in a place holding some object in the hand, and (e) involves a person walking in the wrong direction. Therefore, a single video may have multiple anomaly frames, leading to a multimodal scenario.

Top-$k$ ranking loss has been adopted in an attempt to address the issues outlined above. It maximizes the gap between the mean score of the top-$k$ predicted instances from a positive bag and that of a negative one [22, 18]. However, there are inherent limitations in using a top-$k$ loss. First, it tends to be extremely sensitive to the selected $k$ value. Figure 2 shows the highly fluctuating detection performance on two real-world surveillance video datasets. Since no frame (or segment) labels are available during model training, setting an optimal $k$ through cross-validation is infeasible or highly costly. Furthermore, given the diverse videos, the number of abnormal instances may vary significantly from one video to another, implying that we should have a different $k$ for each video. Hence, applying the same $k$ to all videos as in the existing approaches fails to capture the nature of the data. Another serious but more subtle issue is that all (or most of) the selected $k$ segments may come from the same sub-sequence of the video. Using a consecutive set of visually similar segments is less effective for model training, making it more likely to suffer from outlier and multimodal scenarios. As a result, top-$k$ approaches fall short of providing a reliable detection performance in most practical settings, as evidenced by our experiments.

[Figure 2: Highly fluctuating detection performance w.r.t. $k$: (a) Avenue, (b) ShanghaiTech]

To address the fundamental limitations of existing solutions, we propose a novel Bayesian non-parametric construction of a submodular set function, which is integrated with multiple instance learning to deliver robust video anomaly detection performance under practical settings. Instead of choosing a set of instances with the highest prediction scores, which are likely from a consecutive sub-sequence, maximizing a specially designed submodular function can involve a more diverse set of instances and expose the model to all potentially abnormal segments for more effective model training. Furthermore, the submodular set function is constructed in a non-parametric way, which induces a pairwise similarity among different segments in a video based on the diverse nature of the data. More specifically, an infinite Hidden Markov Model with a Hierarchical Dirichlet Process prior (HDP-HMM) [20], augmented with an enhanced self-transition, is employed to partition a video through dynamic non-parametric clustering of its segments. To more effectively accommodate the dynamic and noisy nature of real-world surveillance videos, the emission process of the HMM is also governed by a non-parametric mixture model to allow segments within the same hidden state to have visual and spatial variations. This unique design is instrumental to discovering temporally consistent and semantically coherent hidden states that can be naturally interpreted as scenes. Pairwise similarity among different segments is defined according to the state-component structure, which leads to the construction of a submodular set function. We then develop a novel submodularity diversified MIL loss function to ensure robust anomaly detection from real-world surveillance videos with outlier and multimodal scenarios. Our key contributions include:

  • Formulation of a novel submodularity diversified MIL loss that simultaneously extracts a diverse set of potentially positive instances while maximizing the gap between the mean score of these instances from a positive bag and a negative one, respectively.

  • Bayesian non-parametric construction of the submodular set function that infers the diversity from the video data to induce a pairwise similarity among different segments in a video and provide an upper bound on the size of the diverse set.

  • A greedy algorithm that leverages the state-component hierarchical structure resulting from the non-parametric construction for submodular set function optimization and efficient model training.

  • Theoretical results to ensure strong performance guarantee of the greedy algorithm.

The proposed approach achieves the state-of-the-art robust anomaly detection performance on real-world surveillance videos with noisy and multimodal scenarios.

2 Related Work

Encoding and sparse reconstruction-based approaches have been employed for anomaly detection, assuming that abnormal events are rare and deviate from normal patterns. They aim to capture the normal patterns using models such as Gaussian processes (GPs) [12] and HMMs [9] to identify anomalies as outliers based on the reconstruction loss. Sparse representation-based approaches construct a dictionary for normal events and identify events with a high reconstruction error as anomalies [16]. Recent approaches consider both abnormal and normal events in the training process. For video anomaly detection, since only video-level labels are assumed to be available during model training [6], MIL offers a natural solution by modeling each video as a bag and the associated segments (frames) as instances of the bag. Sultani et al. proposed an MIL based approach that maximizes the gap between the highest prediction scores from a positive and a negative bag, respectively [19]. However, this maximum score based MIL model (i.e., MMIL) is insufficient to handle outlier and multimodal scenarios as discussed earlier.

Top-$k$ ranking loss based MIL models have been developed to address the limitations of the MMIL model [22, 18]. These models produce state-of-the-art detection performance given that a suitable $k$ value can be assigned in advance. However, as demonstrated earlier, the detection performance of such models is highly sensitive to the chosen $k$ value. Meanwhile, given the diverse nature of videos, applying the same $k$ value to all videos is sub-optimal. More importantly, since instance-level labels are not available during training, choosing a single $k$ value through cross-validation is infeasible or incurs a high annotation cost. Distributionally Robust Optimization (DRO) has been used to convert the top-$k$ set into an uncertainty set that allows the model to focus on instances in proportion to their prediction scores [18]. This is equivalent to assigning soft memberships to involve instances in the MIL loss function. However, the size of the uncertainty set is controlled by the radius (i.e., $\eta$) of the uncertainty ball, which needs to be manually set. Furthermore, the model may put more focus on a set of consecutive segments with the highest prediction scores and ignore some other potentially positive segments.

The proposed approach constructs a novel submodular set function in a non-parametric way by inferring the diversity from data automatically. By jointly optimizing the submodular function and the MIL loss, it automatically chooses a diverse set of segments and lets the model better differentiate these (potentially positive) segments from those of a negative bag to ensure good detection performance.

3 Methodology

Following the standard MIL assumption, we assume that a positive bag contains at least one abnormal segment, whereas all segments in a negative bag are normal. Table 3 in the Appendix summarizes the major symbols and their descriptions.

[Figure 3: Output of the top-$k$ based approach on a video from the Avenue dataset (missing some of the abnormal segments in the top-$k$)]

3.1 Preliminaries

Let ${\bf x}_{i}^{+}$ denote the $i^{th}$ segment in the positive bag $\mathcal{B}_{pos}$ and ${\bf x}_{j}^{-}$ the $j^{th}$ segment in the negative bag $\mathcal{B}_{neg}$. Also, let $n$ be the total number of instances per bag. The maximum score based MIL (MMIL) model tries to maximize the gap between the maximum prediction score from the positive bag and that from the negative bag [19]:

L(\mathcal{B}_{pos},\mathcal{B}_{neg})=\Bigl[1-\max_{i\in\mathcal{B}_{pos}}f({\bf x}_{i}^{+})+\max_{j\in\mathcal{B}_{neg}}f({\bf x}_{j}^{-})\Bigr]_{+}   (1)

where $f({\bf x}_{i}^{+})$ (or $f({\bf x}_{j}^{-})$) is the prediction score of ${\bf x}_{i}^{+}$ (or ${\bf x}_{j}^{-}$) and $[a]_{+}=\max\{0,a\}$. As mentioned earlier, MMIL is less effective in handling outlier and multimodal scenarios. The top-$k$ ranking loss partially addresses the limitation of MMIL by maximizing the gap between the average of the $k$ highest segment predictions from the positive bag and the maximum segment prediction score from the negative bag:

L(\mathcal{B}_{pos},\mathcal{B}_{neg})=\Bigl[1-\frac{1}{k}\sum_{i=1}^{k}f({\bf x}_{[i]}^{+})+\max_{j\in\mathcal{B}_{neg}}f({\bf x}_{j}^{-})\Bigr]_{+}   (2)

where the positive bag segment predictions are sorted in a non-increasing order, i.e., $f({\bf x}_{[1]}^{+})\geq\dots\geq f({\bf x}_{[k]}^{+})$.
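To make the two objectives concrete, the following is a minimal PyTorch sketch of both losses. The function and variable names (e.g., `mmil_loss`, `pos_scores`) are our own illustration rather than the authors' implementation, and scores are assumed to be 1-D tensors of per-segment predictions.

```python
import torch

def mmil_loss(pos_scores, neg_scores):
    """Maximum score based MIL loss of Eq. (1).

    pos_scores, neg_scores: 1-D tensors of per-segment prediction
    scores f(x) for a positive and a negative bag, respectively.
    """
    gap = 1.0 - pos_scores.max() + neg_scores.max()
    return torch.clamp(gap, min=0.0)  # hinge [.]_+

def topk_rank_loss(pos_scores, neg_scores, k):
    """Top-k ranking loss of Eq. (2): mean of the k highest
    positive-bag scores vs. the maximum negative-bag score."""
    topk_mean = pos_scores.topk(k).values.mean()
    gap = 1.0 - topk_mean + neg_scores.max()
    return torch.clamp(gap, min=0.0)
```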

[Figure 4: Example frames from different scenes in an explosion video from UCF-Crime: (a-b) scene 1 (normal, abnormal), (c) scene 2 (abnormal), (d-e) scene 3 (normal, abnormal)]

There are two major issues associated with the top-$k$ ranking loss. First, choosing an optimal $k$ value is a key challenge as the number of abnormal instances may vary significantly from one video to another, implying a different $k$ value for each video. Second, all the selected top-$k$ instances may come from the same sub-sequence of the video. Including all those visually similar instances does not contribute much to the model training process. Furthermore, concentrating only on a specific sub-sequence may make the approach less effective in handling multimodal and outlier scenarios. Figure 3 presents the output of the top-$k$ based model on the Avenue dataset. It can be seen that the top-$k$ based approach picks consecutive video segments while missing quite a few other abnormal frames.

3.2 Bayesian Non-parametric Submodular Set Function Construction

The proposed Bayesian non-parametric submodular video partition (BN-SVP) approach offers a novel integrated solution to address the above two fundamental challenges simultaneously. In particular, since submodular set functions provide a natural measure for diversity, we design a special submodular set function that enables discovery of a representative set of segments from a video. This avoids choosing only visually similar consecutive video segments as in the top-$k$ approach, which enhances the model's exposure to potentially abnormal instances during model training. As a result, the model's capability to handle multimodal and outlier scenarios can be effectively improved.

However, maximizing a submodular set function still requires specifying the size of the set. As mentioned above, choosing a set with an optimal size in video anomaly detection is highly challenging. To this end, we propose a novel Bayesian non-parametric construction of the submodular set function. The non-parametric construction leverages both the visual features of the video segments and their temporal closeness to derive a similarity measure that allows us to define a submodular set function $F(\mathcal{C}^{+})$, where $\mathcal{C}^{+}$ represents a subset of segments in a video. The size of $\mathcal{C}^{+}$ is automatically determined through Bayesian non-parametric analysis of the video. Intuitively, most videos, especially those with anomalies, usually consist of multiple scenes, where each scene is comprised of a consecutive set of visually similar segments. Figure 4 shows example frames from three different scenes in a video that records an explosion event. Ideally, if a video could be partitioned based on these scenes, we could choose representative (and potentially positive) segments from each scene. Such information can significantly facilitate the optimization of the submodular set function. However, both the number and the types of the scenes are unavailable during model training.

The proposed BN-SVP addresses the above issue through non-parametric partition of a video. It builds upon and extends an HDP-HMM model that places a Hierarchical Dirichlet Process (HDP) prior on the state transition distribution of a Hidden Markov Model (HMM) [20]. By using an HMM to model a video (as a sequence of segments), each discovered hidden state can be naturally interpreted as a scene in the video. The HDP prior allows us to determine the optimal number of states (i.e., scenes) according to the nature of the data. However, real-world videos may be highly noisy, and directly using an HDP-HMM model may extract too many scenes with less significant visual characteristics (e.g., spatial changes of objects or addition/removal of a small number of objects). To address this issue, we follow the sticky HDP-HMM to encourage a stronger self-transition of a state [4]. This results in temporal persistence of states and produces longer and semantically coherent scenes. To further accommodate spatial changes or variations in certain objects, we allow the emission distribution to follow another non-parametric DP that automatically determines the number of mixture components (i.e., sub-scenes) within the same scene. For example, scene 1 in Figure 4 is comprised of two sub-scenes: the first with a clear sky and the second with smoke in the sky.

More specifically, consider a collection of hidden states (i.e., scenes in a video); the transition probability from state $j$ to other states is governed by a DP:

\mathcal{G}_{j}=\sum_{l=1}^{\infty}\hat{\pi}_{jl}\delta_{\hat{\phi}_{jl}},\quad \hat{\boldsymbol{\pi}}_{j}\sim\text{GEM}(\alpha)   (3)

where $\text{GEM}(\alpha)$ is formed through a stick-breaking construction process with parameter $\alpha$ [20], and $\hat{\phi}_{jl}$ is drawn from a base distribution $\mathcal{G}_{0}$, which follows another DP:

\mathcal{G}_{0}=\sum_{k=1}^{\infty}\beta_{k}\delta_{\phi_{k}},\quad \boldsymbol{\beta}\sim\text{GEM}(\gamma),\quad \phi_{k}\sim H   (4)

Because of the discrete nature of $\mathcal{G}_{0}$, multiple $\hat{\phi}_{jl}$'s can take an identical value $\phi_{k}$. Considering the unique set of atoms $\phi_{k}$, we can rewrite $\mathcal{G}_{j}$ as

\mathcal{G}_{j}=\sum_{k=1}^{\infty}\pi_{jk}\delta_{\phi_{k}},\quad \boldsymbol{\pi}_{j}\sim\text{DP}(\alpha,\boldsymbol{\beta}),\quad \phi_{k}\sim H   (5)

Given the highly dynamic and noisy nature of many real-world surveillance videos, directly applying the standard HDP-HMM model to partition a video may result in many redundant scenes and rapid switching among them. This is problematic in our setting, in which it is critical to infer semantically coherent scenes along with a slower transition among them. As a result, it is essential to ensure temporal persistence of the discovered scenes [4]. This can be achieved through enhanced self-transitions. In particular, the transition probability of the $j$-th state is augmented by

\boldsymbol{\pi}_{j}\sim\text{DP}\left(\alpha+\rho,\frac{\alpha\boldsymbol{\beta}+\rho\delta_{j}}{\alpha+\rho}\right)   (6)

This has the effect of increasing the expected probability of staying in the same state:

\mathbb{E}[\pi_{jk}|\boldsymbol{\beta}]=\begin{cases}\frac{\alpha\beta_{j}+\rho}{\alpha+\rho}, & \text{if } k=j\\ \frac{\alpha\beta_{k}}{\alpha+\rho}, & \text{otherwise}\end{cases}   (7)

To allow certain levels of variation within the same scene and accommodate the highly dynamic nature of a video sequence, we propose to model the emission process using a mixture distribution governed by another non-parametric DP. This design offers three unique advantages. First, it further ensures the temporal persistence of a scene: a segment with less significant visual differences can stay in the same scene by switching to a different mixture component instead of transitioning to another (redundant) scene. Second, it offers a fine-grained partition of the video sequence, which is instrumental to separating abnormal segments (e.g., frames (b) and (e) in Figure 4) from normal ones (e.g., frames (a) and (d) in Figure 4) that share a common background. Last, the number of mixture components is automatically determined by the DP (e.g., scenes 1 & 3 have two mixture components while scene 2 only has one). For the $k$-th scene, there is a unique stick-breaking distribution $\boldsymbol{\psi}_{k}\sim\text{GEM}(\tau)$ that defines the weights of the mixture components within the scene. Then, given the scene and mixture component assignment $(z_{i}=k,s_{i}=t)$ of a segment ${\bf x}^{+}_{i}$ in a video, it is drawn from a specific multivariate Gaussian: $\mathcal{N}(\boldsymbol{\mu}_{k,t},\Sigma_{k,t})$.
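To illustrate the effect of the sticky prior, note that with $\alpha=1$, $\rho=5$, and $\beta_{j}=0.1$, Eq. (7) gives an expected self-transition probability of $(0.1+5)/6\approx 0.85$, versus $0.1$ without the sticky term. Below is a hedged sketch of the full generative process using the common weak-limit (truncated) approximation; the truncation levels `K` and `T`, the hyperparameter values, and all names are illustrative assumptions, not the paper's inference code.

```python
import numpy as np

rng = np.random.default_rng(0)

def gem(concentration, trunc):
    """Truncated stick-breaking (GEM) weights."""
    v = rng.beta(1.0, concentration, size=trunc)
    v[-1] = 1.0  # close the stick at the truncation level
    stick_left = np.concatenate([[1.0], np.cumprod(1.0 - v[:-1])])
    return v * stick_left

def generate_video(n_seg, dim, K=10, T=5, alpha=1.0, gamma=1.0, rho=5.0, tau=1.0):
    """Weak-limit sketch of the sticky HDP-HMM with DP-mixture emissions."""
    beta = gem(gamma, K)  # global scene weights, Eq. (4)
    # sticky transition rows, Eq. (6): extra mass rho on the self-transition
    pi = np.stack([rng.dirichlet(alpha * beta + rho * np.eye(K)[j]) for j in range(K)])
    psi = np.stack([gem(tau, T) for _ in range(K)])  # per-scene mixture weights
    mu = rng.normal(0.0, 3.0, size=(K, T, dim))      # sub-scene component means
    z = np.empty(n_seg, dtype=int)                   # scene assignments z_i
    s = np.empty(n_seg, dtype=int)                   # component assignments s_i
    x = np.empty((n_seg, dim))
    z[0] = rng.choice(K, p=beta)
    for i in range(n_seg):
        if i > 0:
            z[i] = rng.choice(K, p=pi[z[i - 1]])     # Markov scene transition
        s[i] = rng.choice(T, p=psi[z[i]])            # sub-scene within the scene
        x[i] = rng.normal(mu[z[i], s[i]], 1.0)       # Gaussian emission (unit cov.)
    return x, z, s
```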

Posterior inference of the augmented HDP-HMM model with a DP mixture for emission can be achieved through direct assignment [20] or blocked sampling with an improved mixing rate [3]. Hyper-parameters can also be inferred by placing a vague prior on them and conducting Gibbs sampling.

The scene and component assignments of BN-SVP induce a pairwise similarity among segments in a video:

S_{i,j}=\begin{cases}({\bf x}_{i}^{+})^{\top}\Sigma^{-1}_{z_{i},s_{i}}{\bf x}_{j}^{+}, & \text{if } s_{i}=s_{j}\wedge z_{i}=z_{j}\\ 0, & \text{otherwise}\end{cases}   (8)

It is worth noting that the similarity between two segments ${\bf x}_{i}^{+}$ and ${\bf x}_{j}^{+}$ is evaluated using the learned feature representations (through a DNN) instead of the raw features. The induced similarity allows us to define a submodular set function [11, 10], summarized by the following proposition.

Proposition 1.

Let $\kappa$ denote the number of unique mixture components across all the discovered states in a bag $\mathcal{B}_{pos}$ and $\mathcal{C}\subset\mathcal{B}_{pos}$ be a subset of $\mathcal{B}_{pos}$ with size $\kappa$. Given the BN-SVP induced pairwise similarity defined in (8), the following function is a submodular set function:

F(\mathcal{C})=\sum_{i\in\mathcal{B}_{pos}}\max_{j\in\mathcal{C}}S_{i,j}   (9)

Based on the definition of $S_{i,j}$ shown above, it is straightforward to show that $F(\mathcal{C})$ is a special instance of the facility location function [14], which is submodular. Since each mixture component captures a unique sub-scene, maximization of $F(\mathcal{C})$ can extract a diverse set of segments that best represent all the scenes (and sub-scenes) in the entire video. By further integrating the margin loss given in (2), we achieve a submodularity diversified MIL loss:

\min_{{\bf w},\,\mathcal{C}^{+}\subseteq\mathcal{B}_{pos},\,|\mathcal{C}^{+}|\leq\kappa} L(\mathcal{C}^{+})-\lambda F(\mathcal{C}^{+})   (10)

where the margin loss is defined over instances in a set $\mathcal{C}^{+}$ with size no larger than $\kappa$:

L(\mathcal{C}^{+})=\Bigl[1-\frac{1}{|\mathcal{C}^{+}|}\sum_{i\in\mathcal{C}^{+}}f({\bf x}^{+}_{i})+\max_{j\in\mathcal{B}_{neg}}f({\bf x}_{j}^{-})\Bigr]_{+}   (11)

In essence, $\mathcal{C}^{+}$ includes the set of instances in a positive bag that participate in the model training. The constraint $|\mathcal{C}^{+}|\leq\kappa$ has the effect of excluding some representative segments from the margin loss, as these segments are less likely to be abnormal (e.g., with a very low predicted score); including them would increase the margin loss. The coefficient $\lambda$ controls the balance between the margin loss and the diversity among the chosen segments.
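A minimal sketch of the induced similarity in (8), the facility location objective in (9), and the margin loss in (11) is given below; the `inv_covs` dictionary keyed by (scene, component) and all function names are our illustrative assumptions.

```python
import numpy as np

def pairwise_similarity(X, z, s, inv_covs):
    """BN-SVP induced similarity of Eq. (8).

    X: (n, d) learned segment representations; z, s: scene and mixture
    component assignments; inv_covs: mapping from (scene, component)
    to the inverse covariance of that Gaussian component.
    """
    n = len(X)
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if z[i] == z[j] and s[i] == s[j]:  # same scene and sub-scene
                S[i, j] = X[i] @ inv_covs[(z[i], s[i])] @ X[j]
    return S

def facility_location(S, C):
    """Submodular set function F(C) of Eq. (9); C is a non-empty index set."""
    return sum(S[i, list(C)].max() for i in range(S.shape[0]))

def diversified_margin_loss(pos_scores, neg_scores, C):
    """Margin loss L(C+) of Eq. (11) over the selected set C."""
    gap = 1.0 - pos_scores[list(C)].mean() + neg_scores.max()
    return max(gap, 0.0)
```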

3.3 Greedy Submodular Function Optimization

We propose a greedy algorithm for optimizing the submodular function in (9) to ensure efficient model training. The proposed algorithm leverages the special structure of the state and mixture component space resulting from the HDP-HMM partition of the video segments. The performance guarantee of the greedy algorithm is ensured by our theoretical result presented at the end of this section.

Recall that we use $s_{i}$ to denote the mixture component of segment ${\bf x}^{+}_{i}$. Let $f^{*}_{s}$ denote the maximum score among all the segments assigned to mixture component $s$ and $i^{*}_{s}$ be the index of the corresponding representative segment:

i^{*}_{s}=\arg\max_{i:\,s_{i}=s}f({\bf x}_{i}^{+}),\quad f_{s}^{*}=f({\bf x}_{i^{*}_{s}}^{+})   (12)

We construct a representative set $\widehat{\mathcal{C}^{+}}$ as follows. Let $\widehat{\mathcal{C}^{+}}=\emptyset$ and, for each mixture component $s$, set

\begin{cases}\widehat{\mathcal{C}^{+}}\leftarrow\widehat{\mathcal{C}^{+}}\cup\{i^{*}_{s}\}, & \text{if } f^{*}_{s}\geq\epsilon\\ \widehat{\mathcal{C}^{+}}\leftarrow\widehat{\mathcal{C}^{+}}, & \text{otherwise}\end{cases}   (13)

where $\epsilon$ is a threshold to exclude segments with a low prediction score, which plays an equivalent role to the constraint $|\mathcal{C}^{+}|\leq\kappa$ in (10). In our experiments, we set $\epsilon$ to the prediction score of the segment at the 35th percentile among the segments of the given video so as to avoid skipping any potentially abnormal segments. Once a representative set $\widehat{\mathcal{C}^{+}}$ is constructed, model training can proceed by solving the following MIL loss:

\min_{{\bf w}}\Bigl[1-\frac{1}{|\widehat{\mathcal{C}^{+}}|}\sum_{i\in\widehat{\mathcal{C}^{+}}}f({\bf x}^{+}_{i})+\max_{j\in\mathcal{B}_{neg}}f({\bf x}_{j}^{-})\Bigr]_{+}   (14)

Given the state and mixture component assignment of each segment in a video, the representative set can be quickly constructed by sorting the segments within each component according to their predicted scores and choosing the representative segment from each component by comparing its score with the threshold $\epsilon$. Next, we provide a strong theoretical guarantee that the greedy algorithm ensures the inclusion of a diverse set of segments for model training.
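Below is a minimal sketch of this greedy construction (Eqs. (12)-(13)); the use of `np.percentile` to realize the 35th-percentile threshold $\epsilon$ is our reading of the described rule, and the function name is illustrative.

```python
import numpy as np

def greedy_representative_set(scores, s, percentile=35):
    """Greedy construction of the representative set (Eqs. (12)-(13)).

    scores: (n,) predicted anomaly scores f(x_i) for segments in a bag.
    s: (n,) mixture component assignment of each segment from BN-SVP.
    Returns the indices of the selected representative segments.
    """
    eps = np.percentile(scores, percentile)  # threshold over this video's segments
    rep = []
    for comp in np.unique(s):
        members = np.where(s == comp)[0]
        i_star = members[np.argmax(scores[members])]  # Eq. (12): best in component
        if scores[i_star] >= eps:                     # Eq. (13): keep if above threshold
            rep.append(i_star)
    return rep
```

The returned indices can then be averaged over in the representative set based MIL loss of (14).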

Theorem 1.

The representative set based MIL loss given in (14) is equivalent to the submodularity diversified MIL loss given in (10). Furthermore, using the proposed greedy algorithm to locate the $\kappa$ representative segments essentially provides a $\kappa$-constrained greedy approximation to the maximization of the submodular set function $F(\mathcal{C})$. As a result, the obtained solution is guaranteed to be no worse than $(1-e^{-1})$ of the optimal solution.

The detailed proof is provided in the Appendix.

4 Experiments

We conduct extensive experiments to evaluate the effectiveness of the proposed BN-SVP approach. Through these experiments, we aim to demonstrate: (i) outstanding anomaly detection performance compared with competitive top-$k$, MIL, and other video anomaly detection models, (ii) robustness to outlier and multimodal scenarios, and (iii) deeper insights into the better detection performance through a qualitative study.

4.1 Datasets and Experimental Settings

Our experimentation includes three video datasets of different scales: ShanghaiTech [17], Avenue [16], and UCF-Crime [19]. Table 4 in the Appendix shows how the videos are partitioned into the training/testing sets in each dataset.

  • ShanghaiTech consists of 437 videos (330 normal and 107 abnormal). In the original setting, all training videos are normal. To fit into our setting, we follow the data split in [27] to assign normal and abnormal videos in both training and testing sets.

  • Avenue consists of 16 training and 21 testing videos. We perform an 80:20 split separately on the abnormal and normal video sets to generate training and testing instances.

  • UCF-Crime consists of 13 different anomaly types with a total of 1900 videos, where 1610 are training videos and 290 are testing videos. In this dataset, frame labels are available only for the testing videos.

To show the robustness of the proposed approach in the multimodal and outlier scenarios, we also generate Multimodal and Outlier datasets. Specifically, we create a multimodal scenario by extending the UCF-Crime dataset. For the outlier scenario, we deliberately inject outliers into the ShanghaiTech dataset. More details of these two datasets are provided in Section 4.3. For evaluation, we report the frame-level receiver operating characteristic (ROC) curve along with the corresponding area under the curve (AUC). The AUC score indicates the robustness of the performance at various thresholds.

[Figure 5: Performance comparison with top-$k$ ranking models: (a) UCF-Crime, (b) ShanghaiTech, (c) Avenue, (d) Outlier, (e) Multimodal]

For the Avenue and ShanghaiTech datasets, we extract visual features from the FC7 layer of a pre-trained C3D network [23]. We re-size each video frame to $240\times 340$ pixels and fix the frame rate to 30 fps. We compute the C3D features for every 16-frame video clip. This may yield a different number of clips (each clip having a 2048-dimensional feature vector) depending on the number of frames in each video. Thus, we fit any number of clips to 32 segments by averaging the clip features within each segment. In the case of UCF-Crime, we extract the features using an I3D network [1], using the pretrained network as described in [25]. For all datasets, we use parallel GCN networks to capture the feature similarity and temporal consistency. The outputs of the parallel branches are combined and passed through a 5-layer LSTM network where each layer has 32 hidden units, followed by batch normalization. Finally, an FC layer with sigmoid activation is applied to bring the prediction score into $(0,1)$. For model training, we use SGD with a learning rate of $0.001$ and $l_{2}$ regularization with parameter $\lambda=0.001$. Detailed information about the network architecture is provided in the Appendix.
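As an illustration of the clip-to-segment mapping described above, the following numpy sketch averages a variable number of clip features into 32 fixed segments; the bin-assignment scheme is our assumption of one reasonable implementation.

```python
import numpy as np

def clips_to_segments(clip_feats, n_segments=32):
    """Average (n_clips, d) clip features into (n_segments, d) segment features."""
    n_clips = len(clip_feats)
    # assign each clip to a temporal segment, then average within each segment
    seg_ids = np.floor(np.linspace(0, n_segments, n_clips, endpoint=False)).astype(int)
    return np.stack([
        clip_feats[seg_ids == k].mean(axis=0) if np.any(seg_ids == k)
        else np.zeros(clip_feats.shape[1])  # empty bin if fewer clips than segments
        for k in range(n_segments)
    ])
```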

4.2 Performance Comparison

Comparison with Top-$k$ Models

We first compare the detection performance with the two most recent top-$k$ based models: Robust Temporal Feature Magnitude learning (RTFM) [22] and the DRO based deep kernel MIL (DRO-DKMIL) [18]. We also compare with a standard average top-$k$ model (Avg Topk) as the baseline. Avg Topk uses the rank loss in (2) with the same network architecture as BN-SVP. For RTFM, we obtain the results by re-running the original implementation for different $k$ values. Similarly, for DRO-DKMIL, we run the original implementation for different $\eta$ values that control the size of the uncertainty set. The proposed BN-SVP removes the dependency on these highly sensitive parameters through non-parametric modeling. Detailed comparison results are shown in Figure 5.

We have several important observations. First, all the top-$k$ models are very sensitive to the selection of the $k$ value (or $\eta$, which defines a soft version of the top-$k$ set). Both RTFM and DRO-DKMIL outperform the standard Avg Topk. DRO-DKMIL achieves relatively more stable performance across all datasets. This may be attributed to its conversion of a discrete optimization problem (i.e., choosing a specific $k$) into a continuous one (i.e., choosing $\eta$). However, for certain datasets (e.g., Avenue), its performance still varies by more than 8%. Second, while in some rare cases RTFM or DRO-DKMIL achieves the best performance for a specific $k$ or $\eta$, they under-perform BN-SVP in most cases. This is mainly because these models tend to choose a consecutive set of segments, which limits the model's exposure to other potentially positive segments. This issue has been effectively addressed by BN-SVP, which extracts a diverse set of potentially positive segments through submodular optimization.

Comparison with Other Models

We also compare with other existing techniques that do not depend on the $k$ value. Specifically, our comparison study includes the maximum score based MIL model (MMIL) by Sultani et al. [19], the attention based deep MIL model proposed by Ilse et al. [7], a dictionary based approach proposed by Lu et al. [16], and an MIL model for soft bags (MILS) proposed by Li & Vasconcelos [13] as common baselines for all datasets. Sultani et al. [19] used the loss function in (1) along with temporal similarity and consistency as a regularizer. Ilse et al. used a permutation invariant aggregation function to detect the positive instances in the bag, where the function operators are learned using an attention network [7]. Li & Vasconcelos used a large-margin based latent support vector machine model with the goal of correctly classifying positive and negative bags [13]. For the approach presented by Zhong et al. [27], we directly report the performance from the original paper for the UCF-Crime and ShanghaiTech datasets. This approach involves multiple rounds of alternating optimization between classification and cleaning and may produce unstable performance [22]. Considering its difficulty in the training and replication process, we do not include it on other datasets.

Table 1 reports the AUC scores of BN-SVP along with the results of the comparison models described above. It can be seen that BN-SVP clearly outperforms the other models on all datasets, and a large margin (i.e., 6-8%) is achieved on both the ShanghaiTech and Avenue datasets. The corresponding ROC curves are shown in Figure 6, which demonstrates a consistent trend. For example, on UCF-Crime, BN-SVP has a more than 10% better True Positive Rate (TPR) compared to MMIL at a False Positive Rate (FPR) of 0.2. Also, at varying FPRs, BN-SVP consistently outperforms the other competitive baselines, which justifies its outstanding detection capability.

Table 1: Comparison with Other Models

UCF-Crime
Approach  AUC (%)
Hasan et al. [5] (C3D)  50.60
Lu et al. [16] (C3D)  65.51
Lu et al. [16] (I3D)  61.98
MMIL [19] (C3D)  75.41
Li & Vasconcelos [13] (I3D)  77.95
Ilse et al. [7] (I3D)  76.52
Zhong et al. [27] (GCN (C3D))  81.08
Zhong et al. [27] ($\text{TSN}^{\text{RGB}}$)  82.12
Zhong et al. [27] ($\text{TSN}^{\text{Optical Flow}}$)  78.08
MMIL [19] (I3D)  79.68
BN-SVP (I3D)  83.39

ShanghaiTech
Approach  AUC (%)
Lu et al. [16] (C3D)  72.90
Li & Vasconcelos [13] (C3D)  90.40
Zhong et al. [27] (GCN (C3D))  76.44
Zhong et al. [27] (GCN ($\text{TSN}^{\text{RGB}}$))  84.44
Zhong et al. [27] (GCN ($\text{TSN}^{\text{Optical Flow}}$))  84.13
Ilse et al. [7] (C3D)  85.78
MMIL [19] (C3D)  92.18
BN-SVP (C3D)  96.00

Avenue
Approach  AUC (%)
Binary SVM (C3D)  69.11
Lu et al. [16] (C3D)  62.14
Li & Vasconcelos [13] (C3D)  72.23
Ilse et al. [7] (C3D)  72.39
MMIL [19]  70.40
BN-SVP (C3D)  80.87
[Figure 6: ROC curves on three video datasets (a)-(c), multimodal (d), and outlier (e): (a) UCF-Crime, (b) ShanghaiTech, (c) Avenue, (d) Multimodal (UCF), (e) Outlier (ShanghaiTech)]

4.3 Detecting Multimodal and Outlier Segments

Multimodal Detection

The original UCF-Crime dataset does not explicitly consider a multimodal scenario. Even though real-world surveillance videos may indeed contain such cases (as evidenced by the superior performance of the BN-SVP model), it is hard to identify actual videos with this specific information. The UCF-Crime dataset contains different types of anomalies. This allows us to explicitly create multimodal scenarios by combining multiple abnormal videos from different activity types. To this end, we randomly select three activity types and form an abnormal bag by concatenating three abnormal videos, one per activity type. The training bags are constructed using the training dataset, whereas the testing bags are constructed using the testing dataset. In total, we construct 50 abnormal and 50 normal training bags. In the testing set, there are 10 normal and 10 abnormal bags. Table 2 shows the AUC scores, and the corresponding ROC curve is shown in Figure 6 (d). BN-SVP achieves superior performance compared to the other baselines. Furthermore, BN-SVP stays consistently on top of the ROC curve, justifying the effectiveness of the approach in the multimodal scenario. As an example, at FPR = 0.1, BN-SVP is at least 20% better than the other approaches on TPR.

Outlier Detection

To assess the robustness to outliers, we extend the ShanghaiTech dataset with outliers. Specifically, we randomly select 120 segments from abnormal videos and replace their features with points drawn from a standard multivariate Gaussian distribution. As shown in Table 2, MMIL suffers heavily from the outliers compared to the proposed BN-SVP. This is because an outlier is likely to yield the maximum prediction score in an abnormal video. As a result, the overall optimization process may be heavily influenced by outliers.

Table 2: AUC Scores on Multimodal and Outlier Detection
Approach  Multimodal AUC (%)  Outlier AUC (%)
Lu et al. [16] (C3D)  58.67  72.90
Li & Vasconcelos [13] (C3D)  70.96  90.95
Ilse et al. [7] (C3D)  66.85  85.65
MMIL [19]  57.08  86.47
BN-SVP  76.53  95.27

4.4 Qualitative Analysis

[Figure 7: Frames from UCF-Crime Stealing019: (a) Frame 1, correctly detected by both BN-SVP and Avg Topk; (b) Frame 2, correctly detected by BN-SVP but missed by Avg Topk]

To show the effectiveness of extracting a diverse set of segments for model training, we present illustrative sample frames from a stealing video in UCF-Crime, where BN-SVP correctly identifies all abnormal frames while a top-$k$ approach (e.g., Avg Topk) misses some of them. In Figure 7, both frames are abnormal, but they occur in two distinct time intervals within the video. The first frame is more obviously part of a stealing event. Consequently, both the proposed BN-SVP and Avg Topk are able to correctly identify it. In contrast, the second frame is less obviously a stealing activity. Nevertheless, it is still chosen by BN-SVP due to its diverse coverage of potentially abnormal frames during the training process. On the other hand, Avg Topk only focuses on frames with high prediction scores that are usually co-located in the same time interval. This narrows the scope of the model being exposed to other abnormal frames. Therefore, Avg Topk is not able to correctly predict the second frame and falsely classifies it as normal. More qualitative analysis demonstrating the robustness of the proposed approach in multimodal and outlier scenarios is provided in the Appendix, along with an ablation study on the prediction score threshold $\epsilon$ defined in (13).

5 Conclusion

In this paper, we propose a novel Bayesian non-parametric submodularity diversified MIL model for robust video anomaly detection in practical settings that involve outlier and multimodal scenarios. By integrating submodular optimization with the minimization of an MIL loss, the proposed approach identifies a diverse set of segments to ensure comprehensive coverage of all potential positive segments for effective model training. The Bayesian non-parametric construction of the submodular set function automatically determines the upper bound on the size of the diverse set, which serves as a key constraint for minimizing the submodularity diversified MIL loss function. The resulting state-component structure also leads to a greedy submodular optimization algorithm to support efficient model training. The effectiveness of the proposed approach is demonstrated through the state-of-the-art robust anomaly detection performance on real-world surveillance videos with noisy and multimodal scenarios.

Acknowledgement

This research was supported in part by an NSF IIS award IIS-1814450 and an ONR award N00014-18-1-2875. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agency.

References

  • [1] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pages 4724–4733, 2017.
  • [2] Y. Cong, J. Yuan, and J. Liu. Sparse reconstruction cost for abnormal event detection. In CVPR 2011, pages 3449–3456, 2011.
  • [3] Yanbo Fan, Siwei Lyu, Yiming Ying, and Baogang Hu. Learning with average top-k loss. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, NIPS, pages 497–505. Curran Associates, Inc., 2017.
  • [4] Emily B. Fox, Erik B. Sudderth, Michael I. Jordan, and Alan S. Willsky. A sticky hdp-hmm with application to speaker diarization. The Annals of Applied Statistics, 5:1020–1056, 2011.
  • [5] M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis. Learning temporal regularity in video sequences. In CVPR, pages 733–742, 2016.
  • [6] Chengkun He, Jie Shao, and Jiayu Sun. An anomaly-introduced learning method for abnormal event detection. Multimedia Tools Appl., 77(22):29573–29588, Nov. 2018.
  • [7] Maximilian Ilse, Jakub M. Tomczak, and Max Welling. Attention-based deep multiple instance learning. In ICML, 2018.
  • [8] Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. In Proceedings of the 5th International Conference on Learning Representations, ICLR ’17, 2017.
  • [9] L. Kratz and K. Nishino. Anomaly detection in extremely crowded scenes using spatio-temporal motion pattern models. In CVPR, pages 1446–1453, 2009.
  • [10] Andreas Krause. Optimizing sensing: Theory and applications. Technical report, 2008.
  • [11] Andreas Krause, H. Brendan McMahan, Carlos Guestrin, and Anupam Gupta. Robust submodular observation selection. Journal of Machine Learning Research, 9(93):2761–2801, 2008.
  • [12] Nannan Li, Xinyu Wu, Huiwen Guo, Dan Xu, Yongsheng Ou, and Yen-Lun Chen. Anomaly detection in video surveillance via gaussian process. Int. J. Pattern Recognit. Artif. Intell., 29:1555011:1–1555011:25, 2015.
  • [13] W. Li and N. Vasconcelos. Multiple instance learning for soft bags via top instances. In CVPR, 2015, pages 4277–4285, 2015.
  • [14] Hui Lin and Jeff Bilmes. How to select a good training-data subset for transcription: Submodular active selection for sequences. Technical report, University of Washington, Seattle, Dept. of Electrical Engineering, 2009.
  • [15] Wen Liu, Weixin Luo, Zhengxin Li, Peilin Zhao, and Shenghua Gao. Margin learning embedded prediction for video anomaly detection with a few anomalies. In IJCAI, 2019.
  • [16] C. Lu, J. Shi, and J. Jia. Abnormal event detection at 150 fps in matlab. In ICCV, pages 2720–2727, 2013.
  • [17] W. Luo, W. Liu, and S. Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In ICCV, pages 341–349, 2017.
  • [18] Hitesh Sapkota, Yiming Ying, Feng Chen, and Qi Yu. Distributionally robust optimization for deep kernel multiple instance learning. In Arindam Banerjee and Kenji Fukumizu, editors, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pages 2188–2196. PMLR, 13–15 Apr 2021.
  • [19] W. Sultani, C. Chen, and M. Shah. Real-world anomaly detection in surveillance videos. In CVPR, pages 6479–6488, 2018.
  • [20] Yee Whye Teh, Michael I Jordan, Matthew J Beal, and David M Blei. Hierarchical dirichlet processes. Journal of the american statistical association, 101(476):1566–1581, 2006.
  • [21] Kai Tian, Shuigeng Zhou, Jianping Fan, and Jihong Guan. Learning competitive and discriminative reconstructions for anomaly detection. In AAAI, pages 5167–5174. AAAI Press, 2019.
  • [22] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W. Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4975–4986, October 2021.
  • [23] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, ICCV ’15, page 4489–4497, USA, 2015. IEEE Computer Society.
  • [24] Hung Vu, T. Nguyen, Trung Le, Wei Luo, and Dinh Q. Phung. Robust anomaly detection in videos using multilevel representations. In AAAI, 2019.
  • [25] X. Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
  • [26] Huan Yang, Baoyuan Wang, Stephen Lin, David Wipf, Minyi Guo, and Baining Guo. Unsupervised extraction of video highlights via robust recurrent auto-encoders. In ICCV, ICCV ’15, page 4633–4641, USA, 2015. IEEE Computer Society.
  • [27] J. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In CVPR, pages 1237–1246, 2019.

Appendix

Organization of Appendix

In this appendix, we first summarize the major notations used in the paper in Table 3 of Appendix A. We then provide the detailed proof of Theorem 1 in Appendix B. Next, we provide additional experimental results in Appendix C. Finally, the link to the source code is provided in Appendix D.

Appendix A Summary of Notations

Table 3 summarizes all the major symbols along with their descriptions.

Table 3: Symbols with Descriptions
Notation  Description
$\mathcal{B}_{pos}$  Positive bag (video)
$\mathcal{B}_{neg}$  Negative bag (video)
$n$  Number of segments in each bag
${\bf x}_{i}^{+}$  Segment in a positive bag
${\bf x}_{[i]}^{+}$  Segment with the $i^{th}$ largest prediction score in a positive bag
${\bf x}_{j}^{-}$  Segment in a negative bag
$M$  Feature dimension of each video segment
${\bf w}$  Network parameters
${\bf b}$  Network bias
$k$  Number of segments considered in the top-$k$ formulation
$\eta$  Learning rate
$\mathcal{C}^{+}$  Set of instances from a positive bag involved in model training
$\mathcal{G}_{0}$  Base distribution in the DP
$\gamma$  Concentration parameter for the distribution $\mathcal{G}_{0}$
$\beta_{k}$  Stick-breaking weight corresponding to atom $\phi_{k}$
$\phi_{k}$  Atom $k$ drawn from the distribution $H$
$\mathcal{G}_{j}$  Transition probability distribution of the $j^{th}$ state
$\hat{\pi}_{jl}$  Stick-breaking weight associated with the $l^{th}$ atom in the $j^{th}$ group
$\alpha$  Concentration parameter for $\hat{\boldsymbol{\pi}}_{j}$
$\hat{\phi}_{jl}$  $l^{th}$ atom corresponding to the $j^{th}$ group
$\rho$  Parameter defining the self-transition strength
$z_{i}$  Scene assignment of the $i^{th}$ segment in a video
$s_{i}$  Mixture component assignment of the $i^{th}$ segment in a video
$\mathcal{N}$  Multivariate Gaussian distribution
$\boldsymbol{\mu}_{k,t}$  Mean of the $k^{th}$ state, $t^{th}$ mixture component
$\Sigma_{k,t}$  Covariance of the $k^{th}$ state, $t^{th}$ mixture component
$S_{i,j}$  Pairwise similarity between the $i^{th}$ and $j^{th}$ segments
$F(\mathcal{C})$  Submodular set function
$f_{s}^{*}$  Maximum output score among segments assigned to the same mixture component
$i_{s}^{*}$  Index of the representative segment
$\widehat{\mathcal{C}^{+}}$  Representative set constructed using the greedy algorithm
$\epsilon$  Threshold to exclude segments with a low prediction score from the representative set
$\kappa$  Upper bound on the number of representative segments

Appendix B Proof of Theorem 1

In this section, we provide the detailed proof of Theorem 1. We first show that the representative set based MIL loss given by (14) is equivalent to the submodularity diversified MIL loss given by Equation (10) with a specific $\lambda$ balancing the MIL loss and the diversity of the set. We then show that the greedy algorithm to locate the $\kappa$ representative segments provides a $\kappa$-constrained greedy approximation to the maximization of the submodular set function $F(\mathcal{C})$, with the solution guaranteed to be no worse than $(1-e^{-1})$ of the optimal solution.

Proof that the representative set based MIL loss in (14) is a special case of the submodularity diversified MIL loss in (10)

We first present a lemma, which is used in the proof.

Lemma 2.

Assume that $\widetilde{\mathcal{C}^{+}}$ with size $\kappa$ is a solution that maximizes $F(\mathcal{C})$ in (9). Then, $\widetilde{\mathcal{C}^{+}}$ should contain one segment from each mixture component (i.e., sub-scene).

Proof.

The lemma can be proved by following the definition of the BN-SVP induced pairwise similarity between segments given by (8) and then using proof by contradiction. Assume that at least two segments, say ${\bf x}^{(t)}_{i},{\bf x}^{(t)}_{j}$, are chosen from the same component $t$. Then, there will be at least one component, say $t^{\prime}$, from which no segments are chosen by $\widetilde{\mathcal{C}^{+}}$. Given the definition of $F(\mathcal{C})$ in (9), for each segment in $t$, either ${\bf x}^{(t)}_{i}$ or ${\bf x}^{(t)}_{j}$ could be used to compute the pairwise similarity based on their closeness to that segment. Since the cohesiveness of each component is guaranteed through the BN-SVP process, both ${\bf x}^{(t)}_{i}$ and ${\bf x}^{(t)}_{j}$ should be close to the mean of their assigned Gaussian component $\mathcal{N}(\boldsymbol{\mu}_{t},\Sigma_{t})$ to ensure a high likelihood optimized by the HDP-HMM. Due to the triangle inequality, ${\bf x}^{(t)}_{i}$ and ${\bf x}^{(t)}_{j}$ should be close to each other. As a result, we can assume that ${\bf x}^{(t)}_{i}$ is always chosen to evaluate the pairwise similarity $S_{i,p}$ with each segment ${\bf x}^{(t)}_{p}$ in component $t$. Next, we replace ${\bf x}^{(t)}_{j}$ with another segment ${\bf x}^{(t^{\prime})}_{j}$ from component $t^{\prime}$ to construct another solution set $\overline{\mathcal{C}^{+}}$. Since ${\bf x}^{(t^{\prime})}_{j}$ has positive similarity with each segment in $t^{\prime}$ while the pairwise similarity between ${\bf x}^{(t)}_{j}$ and all segments in $t^{\prime}$ is zero, we have $F(\overline{\mathcal{C}^{+}})>F(\widetilde{\mathcal{C}^{+}})$, which contradicts the assumption that $\widetilde{\mathcal{C}^{+}}$ maximizes $F(\mathcal{C})$. ∎

Since the representative set $\widehat{\mathcal{C}^{+}}$ is constructed by choosing one segment from each mixture component, it satisfies the necessary condition to be an optimizer of $F(\mathcal{C})$ specified in the above lemma. However, choosing a set of segments with the maximum diversity is not the primary goal, and the overall objective function (10) includes both the MIL loss and the diversity, which are balanced through $\lambda$. Due to the lack of instance-level labels, choosing a $\lambda$ that optimally balances the MIL loss and the set diversity is challenging. We argue that constructing $\widehat{\mathcal{C}^{+}}$ essentially offers an alternative way to set a specific $\lambda$ to balance these two terms. First, since the constraint $|\mathcal{C}^{+}|\leq\kappa$ allows the set to contain fewer than $\kappa$ segments, $\widehat{\mathcal{C}^{+}}$ excludes those segments with low prediction scores. This can be viewed as setting a $\lambda$ that increases $-F(\mathcal{C}^{+})$ while decreasing the MIL loss $L(\mathcal{C}^{+})$. Similarly, instead of choosing the instance with the largest pairwise similarity to all other segments in the same component, we choose the segment with the highest prediction score. Again, this can be viewed as further reducing $\lambda$ to give more preference to the MIL loss, as such segments can further reduce the training MIL loss. Thus, instead of directly setting $\lambda$, which is highly challenging, $\widehat{\mathcal{C}^{+}}$ is constructed by leveraging both the mixture component assignments and the prediction scores of the segments. This is equivalent to implicitly setting a $\lambda$ to balance the MIL loss and the diversity of the representative set $\widehat{\mathcal{C}^{+}}$, which completes the proof of the equivalence of the two objective functions.

Proof of the optimality of the greedy algorithm

We first reformulate (10) as a minimization problem $\min_{{\bf w}}g({\bf w})$ with $g({\bf w})$ defined as

g({\bf w})\stackrel{\Delta}{=}\min_{\mathcal{C}^{+}\subseteq\mathcal{B}_{pos},\,|\mathcal{C}^{+}|\leq\kappa}L(\mathcal{B}_{pos},\mathcal{B}_{neg})-\lambda F(\mathcal{C}^{+})   (15)

The above optimization involves finding a subset $\mathcal{C}^{+}\subseteq\mathcal{B}_{pos}$ that maximizes $F(\mathcal{C}^{+})$. This requires enumerating all ${n\choose\kappa}$ possible subsets, which is expensive when there is a large number of segments in a given video. We define the discrete objective function $G_{{\bf w}}$, where

G_{{\bf w}}(\mathcal{C}^{+})\stackrel{\Delta}{=}L(\mathcal{B}_{pos},\mathcal{B}_{neg})-\lambda F(\mathcal{C}^{+})   (16)

Since $-G_{{\bf w}}(\mathcal{C}^{+})$ is monotone non-decreasing submodular, a fast greedy procedure can be used to approximately optimize $G_{{\bf w}}(\mathcal{C}^{+})$. A typical greedy procedure involves evaluating the similarity between each pair of segments in a video and then choosing the segments with the largest overall similarity to all other segments. We make two important adjustments to this standard greedy process. First, our non-parametric HDP-HMM process follows the clustering based heuristic (Lin and Bilmes 2018) by choosing one segment from each cluster, which avoids evaluating each candidate segment in the video. Different from (Lin and Bilmes 2018), which chooses the data point closest to the cluster centroid, we choose the one with the highest output score. Second, our similarity evaluation takes a linear complexity with respect to the bag size by leveraging the temporal neighborhood of the segments. By leveraging the above greedy procedure, we can show that the obtained approximate solution is guaranteed to be no worse than $(1-e^{-1})$ of the optimal solution according to the standard result of (Nemhauser et al. 1978), which completes the proof of the second part.

Table 4: Video Level Distribution on Different Datasets (Normal/Abnormal)
Split  ShanghaiTech  UCF-Crime  UCF-Crime Multimodal  Avenue
Train  175/63  810/800  150/150  13/17
Test  155/44  150/140  30/30  3/4

Appendix C Additional Experimental Results

In this section, we first show the detailed network architecture used in our training process. Next, we provide an ablation study demonstrating the impact of the hyperparameter $\epsilon$ used in our experimentation. Finally, we provide some additional qualitative analysis justifying the effectiveness of the proposed approach. We also show the effectiveness of the HDP-HMM technique in discovering sub-scenes of different types in a video through qualitative analysis.

C.1 Network Architecture

First, we pass the pre-trained features through two parallel GCN branches. The upper branch captures the feature similarity between segments, and the lower branch captures their temporal consistency, so that nearby segments provide similar predictions. The outputs of the parallel branches are combined and passed through 5 LSTM layers with 32 hidden units each, followed by a BatchNorm layer. Finally, a fully connected layer with sigmoid activation produces the final prediction score.
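The sketch below renders this pipeline in PyTorch under stated assumptions: the class name, the feature dimension D, the branch output dimension M, and the use of concatenation to combine the two branches are illustrative choices rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AnomalyScorer(nn.Module):
    """Minimal sketch of the scoring network; the name and the
    dimensions D (input features) and M (branch output) are assumptions."""

    def __init__(self, D=2048, M=128):
        super().__init__()
        # Trainable weights of the two parallel GCN branches (Eq. 21).
        self.w_feat = nn.Linear(D, M, bias=False)  # feature-similarity branch
        self.w_temp = nn.Linear(D, M, bias=False)  # temporal-consistency branch
        # 5 stacked LSTM layers with 32 hidden units each.
        self.lstm = nn.LSTM(input_size=2 * M, hidden_size=32,
                            num_layers=5, batch_first=True)
        self.bn = nn.BatchNorm1d(32)
        self.fc = nn.Linear(32, 1)  # final fully connected layer

    def forward(self, X, A_feat_hat, A_temp_hat):
        # X: (n, D) segment features; A_*_hat: (n, n) renormalized adjacency.
        h_feat = A_feat_hat @ self.w_feat(X)      # H = A_hat X W (Eq. 21)
        h_temp = A_temp_hat @ self.w_temp(X)
        h = torch.cat([h_feat, h_temp], dim=-1)   # combine parallel branches
        h, _ = self.lstm(h.unsqueeze(0))          # (1, n, 32)
        h = self.bn(h.squeeze(0))                 # BatchNorm over segments
        return torch.sigmoid(self.fc(h)).squeeze(-1)  # per-segment scores
```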

GCN Architecture

Next, we explain the GCN architecture in detail. Let $\mathbf{A}$ be the $n\times n$ adjacency matrix whose $(i,j)$ entry indicates the similarity between segments $i$ and $j$. Mathematically,

$$\mathbf{A}(i,j) = k(\mathbf{x}_{i},\mathbf{x}_{j}) \qquad (17)$$

where $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are the $D$-dimensional representations of the $i$-th and $j$-th segments, respectively. For the feature similarity branch, we use the RBF kernel of the following form:

$$k(\mathbf{x}_{i},\mathbf{x}_{j}) = \exp\left(-\frac{\|\mathbf{x}_{i}-\mathbf{x}_{j}\|^{2}}{2l^{2}}\right) \qquad (18)$$

For the temporal consistency branch, we use the following form between the $i$-th and $j$-th segments:

$$k(\mathbf{x}_{i},\mathbf{x}_{j}) = \exp(-|i-j|) \qquad (19)$$

This drives temporally nearby segments to have similar scores. Based on the adjacency matrix and following Kipf and Welling [8], the graph Laplacian with the renormalization trick can be written as

$$\hat{\mathbf{A}} = \mathbf{D}^{-\frac{1}{2}}(\mathbf{A}+\mathbf{I}_{n})\mathbf{D}^{-\frac{1}{2}} \qquad (20)$$

In the above equation, $\mathbf{D}_{(i,i)}=\sum_{j}(\mathbf{A}+\mathbf{I}_{n})_{(i,j)}$ is the corresponding degree matrix. The output of the feature similarity graph is computed as:

$$\mathbf{H} = \hat{\mathbf{A}}\mathbf{X}\mathbf{W} \qquad (21)$$

where $\mathbf{W}\in\mathbb{R}^{D\times M}$ is a trainable parameter matrix and $\mathbf{X}\in\mathbb{R}^{n\times D}$ contains the video-specific segment features.
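For concreteness, a minimal NumPy sketch of Eqs. (17)-(21) follows; the bandwidth $l$ and the helper names are assumptions made for illustration.

```python
import numpy as np

def build_adjacencies(X, l=1.0):
    """Adjacency matrices of Eqs. (17)-(19); the bandwidth l is assumed."""
    n = X.shape[0]
    # Feature-similarity branch: RBF kernel (Eq. 18).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A_feat = np.exp(-sq_dists / (2.0 * l ** 2))
    # Temporal-consistency branch: exp(-|i - j|) (Eq. 19).
    idx = np.arange(n)
    A_temp = np.exp(-np.abs(idx[:, None] - idx[None, :]))
    return A_feat, A_temp

def renormalize(A):
    """Renormalization trick of Eq. (20): D^{-1/2} (A + I_n) D^{-1/2},
    with D the degree matrix of A + I_n."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

# Graph convolution output of Eq. (21), with W a (D, M) trainable matrix:
#   H = renormalize(A_feat) @ X @ W
```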

C.2 Ablation Study

Impact of $\epsilon$

In this subsection, we show the impact of the error threshold $\epsilon$ on model performance. Note that $\epsilon$ indicates the percentile used to determine the threshold for excluding, with high probability, clusters consisting only of normal segments. For example, $\epsilon=0.1$ indicates that we first determine the output score of the segment lying at the $10^{th}$ percentile when all segment scores are sorted in non-decreasing order; this score is used as the threshold. All representative segments with a predicted score below this threshold are then discarded from the representative set $\widehat{\mathcal{C}}^{+}$. Figure 8 shows the performance variation with respect to different $\epsilon$'s on five datasets. For a relatively low $\epsilon$ (i.e., 20-35%), the performance is fairly stable across all datasets. This is because, with a low $\epsilon$, the model rejects the segments of a given video with sufficiently low output scores, minimizing the chance of including normal segments from abnormal videos. Lowering $\epsilon$ further may admit a good number of normal segments, making the model mis-identify similar normal segments as anomalies. On the other hand, choosing a very high $\epsilon$ causes a drop in performance: some potentially abnormal segments may be missed by the loss function, so the model has less exposure to different types of abnormal frames, degrading performance. In sum, as long as $\epsilon$ stays in a relatively low range (e.g., 20-35%), performance remains stable and nearly optimal.
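As a toy numeric illustration of this percentile thresholding (the scores below are made up):

```python
import numpy as np

# Ten hypothetical segment scores, already in non-decreasing order.
scores = np.array([0.05, 0.10, 0.30, 0.55, 0.60, 0.72, 0.80, 0.88, 0.91, 0.97])

eps = 0.1                                     # use the 10th percentile
threshold = np.percentile(scores, 100 * eps)  # ~0.095 for these scores
# Representative segments scoring below the threshold are discarded.
kept = scores[scores >= threshold]
```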

Figure 8: Performance variation with respect to $\epsilon$; panels: (a) UCF-Crime, (b) ShanghaiTech, (c) Avenue, (d) Multimodal (UCF), (e) Outlier (ShanghaiTech).

Impact of the constraint $|\mathcal{C}|\leq\kappa$

It should be noted that in our approach $\kappa$ only provides an upper bound on the number of selected segments; the actual number is determined by the non-parametric model along with the prediction threshold $\epsilon$. This addresses a fundamental issue of top-$k$ models, in which a fixed $k$ has to be set for all videos. Figure 9 shows that stable performance can be achieved over a wide range of $\kappa$ values, as long as $\kappa$ is not set so small that it excludes some representative abnormal segments.

Figure 9: Performance variation with respect to $\kappa$; panels: (a) UCF-Crime, (b) ShanghaiTech, (c) Avenue, (d) Multimodal (UCF), (e) Outlier (ShanghaiTech).

Impact of $\lambda$

We would like to emphasize that BN-SVP does not require setting $\lambda$ directly, which is highly challenging. By leveraging the prediction scores of instances and their mixture assignments, BN-SVP implicitly sets $\lambda$ to balance the MIL loss and the diversity of the set $\widehat{\mathcal{C}}^{+}$. Specifically, because of the constraint $|\mathcal{C}^{+}|\leq\kappa$, the set contains no more than $\kappa$ segments and excludes segments with low prediction scores, which has the effect of decreasing $\lambda$ to reduce the MIL loss. Similarly, instead of choosing the instance with the largest pairwise similarity to all other instances in the same mixture component, BN-SVP chooses the instance with the highest prediction score. This can also be viewed as choosing a smaller $\lambda$ to reduce the MIL loss.

Table 5: Performance (AUROC) with and without augmentation

Dataset            UCF-Crime   Avenue   Multimodal   ShanghaiTech   Outlier
w/ augmentation    83.39       80.87    76.53        96.00          95.27
w/o augmentation   80.56       76.71    63.23        94.99          94.52

Impact of Augmentation

We compare the performance (AUROC) with augmentation ($\rho=1$) and without augmentation ($\rho=0$). Table 5 shows the results on different datasets. BN-SVP with augmentation consistently outperforms the variant without it on all datasets. Without augmentation, the model transitions from one state to another too quickly in response to small visual changes and may fail to maintain temporal persistence when discovering scenes, which lowers performance. We also demonstrate the significance of augmentation through a qualitative analysis in Appendix C.4.
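To illustrate how such a self-transition augmentation can enter the model, the sketch below adds extra self-transition mass when sampling each state's transition distribution, in the spirit of the sticky HDP-HMM of Fox et al.; this formulation is our assumption for illustration and may differ from BN-SVP's exact construction.

```python
import numpy as np

def augmented_transitions(alpha, beta, rho, rng=None):
    """Transition matrix with self-transition augmentation (assumed form).

    alpha: concentration parameter of the HDP
    beta:  (K,) global state weights from the stick-breaking prior
    rho:   extra self-transition mass; rho = 0 disables augmentation
    Returns a (K, K) row-stochastic transition matrix.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    K = len(beta)
    pi = np.zeros((K, K))
    for j in range(K):
        # Bias the Dirichlet toward staying in state j, which discourages
        # spurious state switches caused by small visual changes.
        concentration = alpha * beta + rho * np.eye(K)[j]
        pi[j] = rng.dirichlet(concentration)
    return pi
```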

C.3 Additional Qualitative Analysis

To show the effectiveness of the proposed approach in handling multimodality, we compare BN-SVP with MMIL using illustrative examples. Figure 10 shows two frames from the TEST06 video in Avenue with different anomaly types: in the first, an object is thrown; in the second, a person is walking in the wrong lane. As the first anomaly is more obvious, both BN-SVP and MMIL correctly predict it as abnormal. For the second, our proposed approach correctly detects it as abnormal while MMIL fails to do so. Due to the submodular diversified loss, BN-SVP is more likely to include even less obvious frames (e.g., Figure 10 (b)) during training, and as a result it can make the correct prediction. On the other hand, MMIL picks only the instance with the maximum score and is therefore more likely to miss those less obvious frames during training, resulting in mis-identifying similar frames as normal.

Figure 10: Example frames from the Avenue TEST06 video; (a) Frame 1: both BN-SVP and MMIL correct, (b) Frame 2: BN-SVP correct, MMIL incorrect.

C.4 Effectiveness of Bayesian Non-Parametric Video Partition

In this section, we present representative frames from the mixture components (i.e., sub-scenes) discovered by the proposed Bayesian non-parametric video partition process. The purpose is to demonstrate that semantically coherent segments are automatically grouped into the same mixture components by BN-SVP. This significantly facilitates the optimization of the submodular function to choose a diverse set of segments and allows some of the most representative segments to participate in the MIL loss. Figure 11 shows frames randomly selected from different mixture components of video 01_0162 from the ShanghaiTech dataset. As shown in Figure 11 (a), the frame does not contain any person, and its associated component (i.e., Component 0) mostly consists of background segments (which the model predicts as normal). In Figure 11 (b), there are multiple people in the frame; someone is riding a bike in the wrong lane while a second person is pointing at another group of people. This frame is assigned to a newly created component (i.e., Component 1) since it looks quite different from the previous frames, and given the abnormal behavior in the frame, the model predicts it as an anomaly. Next, as shown in Figure 11 (c), as the bike starts to vanish from the camera view, the frame looks different from (b) and is therefore assigned to a new Component 2. Although (b) and (c) are both abnormal, the latter is much less obvious than the former; given their distinctions, they are assigned to different mixture components so that both can be chosen when optimizing the submodular function to participate in model training. Finally, in Figure 11 (d), the bike has completely disappeared from the frame and only a group of people walking normally remains, so it is assigned to Component 3 and the model predicts it as normal.

Figure 11: Example frames from the discovered mixture components; (a) Normal Frame (Component 0), (b) Abnormal Frame (Component 1), (c) Abnormal Frame (Component 2), (d) Normal Frame (Component 3).

Appendix D Link to Source Code

For the source code, please click the following link: https://github.com/ritmininglab/BN-SVP.