
Distilling Aggregated Knowledge for Weakly-Supervised Video Anomaly Detection

Jash Dalvi
K J Somaiya Institute of Technology
jashdalvi99@gmail.com
   Ali Dabouei
Carnegie Mellon University
ali.dabouei@gmail.com
Corresponding authors.
   Gunjan Dhanuka
Carnegie Mellon University
gdhanuka@cs.cmu.edu
Min Xu
Carnegie Mellon University
mxu1@cs.cmu.edu
Abstract

Video anomaly detection aims to develop automated models capable of identifying abnormal events in surveillance videos. The benchmark setup for this task is extremely challenging due to: i) the limited size of the training sets, ii) weak supervision provided in terms of video-level labels, and iii) intrinsic class imbalance induced by the scarcity of abnormal events. In this work, we show that distilling knowledge from aggregated representations of multiple backbones into a single-backbone Student model achieves state-of-the-art performance. In particular, we develop a bi-level distillation approach along with a novel disentangled cross-attention-based feature aggregation network. Our proposed approach, DAKD (Distilling Aggregated Knowledge with Disentangled Attention), demonstrates superior performance compared to existing methods across multiple benchmark datasets. Notably, we achieve significant improvements of 1.36%, 0.78%, and 7.02% on the UCF-Crime, ShanghaiTech, and XD-Violence datasets, respectively.

1 Introduction

Video Anomaly Detection (VAD) automates the analysis of video data, addressing the exhaustive labor and time requirements of manual video surveillance. The goal of a practical VAD system is to identify activities that deviate from the normal activities characterized by the training distribution [27].

Figure 1: Left: A brief overview of our approach, which distills the multi-backbone Teacher model's knowledge into the Student model. In the Teacher model, representations from multiple backbones are aggregated using our proposed Temporal Aggregation Module. The single-backbone Student model is then trained with a bi-level fine-grained knowledge distillation framework. Right: Frame-level predictions for individual backbones vs. our proposed feature aggregation method on a Road Accident video from the UCF-Crime test set.

Despite the extensive background of research in VAD [27, 6, 13, 29], developing a robust model capable of accurately detecting anomalies in videos remains difficult. The challenge arises from the difficulty of modeling the spatiotemporal characteristics of abnormal events, particularly those of rare occurrence and significant variability. This complexity is compounded by the labor-intensive process of collecting frame-level annotations for video data, which presents a substantial barrier to developing an effective VAD model for real-world scenarios. Prior works on VAD have adopted a practical approach by employing weakly-supervised learning, which requires only video-level labels to develop a model capable of making frame-level predictions [27, 39, 41, 6, 34, 37, 29, 33, 14, 4, 13].

Although weakly-supervised VAD is an intriguing approach, it suffers from limited supervision during training, resulting from the absence of precise frame-level annotations. To overcome this challenge, previous works [27, 13, 29] have employed knowledge transfer by combining a fixed backbone, pre-trained on general video representation learning, with a dedicated prediction head to perform anomaly detection. Our exploratory evaluations, described in Figure 4, highlight that knowledge transfer has a substantial impact on the performance of weakly-supervised VAD. In particular, these evaluations suggest that the impact of the knowledge transfer is even more critical than the design choice of the prediction head: employing the knowledge of multiple pretrained backbones significantly enhances VAD performance. We attribute this performance boost to the complementary nature of knowledge from different backbones resulting from variations in inductive biases and pretraining datasets. In this work, we further analyze this observation by developing an aggregated model for weakly-supervised VAD.

A careful aggregation of the knowledge from multiple models is essential especially when the training supervision is weak, i.e., video-level supervision rather than frame-level supervision. To this aim, we propose a novel Temporal Aggregation Module (TAM) that combines spatiotemporal information from the backbones through multiple self- and cross-attention mechanisms. This module comprehensively combines spatial (content) and temporal (positional) information from all the backbones to construct an effective aggregated representation of the video.

Our empirical evaluations, discussed later in Section 4, highlight the effectiveness of this aggregated model for weakly-supervised VAD. However, this model is computationally expensive for deployment due to incorporating multiple cumbersome backbones. To address this limitation, we develop a bi-level fine-grained knowledge distillation mechanism, which distills the knowledge from the aggregated Teacher into an efficient Student, demonstrated in Figure 1. The distillation process enforces both prediction-level and feature-level similarity between the Teacher and the Student. In the former, we align the output distributions of the Teacher and Student models to capture the detailed characteristics of the aggregated predictions. In the latter case, we align the representation-level knowledge to distill more complex, higher-order feature dependencies.

Our results highlight that the proposed TAM and knowledge distillation approach are highly beneficial for weakly supervised VAD, where the learning signal is weak. In particular, we propose DAKD (Distilling Aggregated Knowledge with Disentangled Attention) that consists of a disentangled cross-attention-based Temporal Aggregation Module and a bi-level fine-grained knowledge distillation framework.

In summary, the contributions of our work are:

  1. We argue that knowledge transfer plays an important role in the challenging setting of weakly-supervised VAD. To support this, we propose a novel Temporal Aggregation Module to effectively combine knowledge from multiple backbones.

  2. We develop a spatiotemporal knowledge distillation technique that distills the knowledge of the aggregated model into a single backbone to address the efficiency concerns.

  3. Through extensive experiments and ablation studies, we validate the effectiveness of the proposed framework and show that our approach outperforms existing methods on benchmark datasets.

Figure 2: Schematic diagram of the proposed method. Stage 1: the Teacher model is trained on representations from several feature extractors (Section 3.1) using the Temporal Aggregation Module (Section 3.2). Stage 2: feature-level and prediction-level knowledge distillation transfers the knowledge of the complex Teacher model to the Student model (Section 3.3).

2 Related Work

Previous works on VAD can be categorized into two classes: unsupervised VAD [8, 35, 19, 36, 16, 18, 30, 2, 31, 7, 17, 20, 23] and weakly-supervised VAD [27, 39, 41, 6, 34, 37, 29, 33, 14, 4, 13]. Unsupervised VAD approaches, such as one-class classification, assume that only normal videos are available for training and flag videos that deviate considerably from the learned distribution as anomalous [35, 8, 19]. However, the performance of these methods is limited and often results in a high false alarm rate. This can be attributed to the fact that normal videos with novel events closely resemble abnormal events, making the two difficult to differentiate without the related context. Weakly-supervised VAD, on the other hand, leverages video-level labels and has gained popularity for its enhanced performance.

2.1 Weakly Supervised VAD

Sultani et al. [27] proposed a deep Multiple-Instance Learning (MIL) framework that incorporates sparsity and temporal smoothness constraints and knowledge transfer to enhance anomaly localization. Zhong et al. [41] used a graph convolutional network to mitigate label noise, but at a higher computational cost. Feng et al. [6] introduced a two-stage approach to fine-tune a backbone network for domain-specific knowledge. Tian et al. [29] used top-k instances and a multi-scale temporal network for feature magnitude learning. Li et al. [13] employed a scale-aware approach to capture anomalous patterns using patch spatial relations. Zaheer et al. [37] minimized anomaly scores in normal regions with a Normalcy Suppression mechanism and introduced a clustering distance-based loss to improve discrimination. Despite these advances, limited training data and the weakly-supervised setting still restrict model learning, and knowledge transfer remains crucial for anomaly detection performance. Building upon the deep MIL framework [27], we propose architectural changes to enhance performance on unseen data.

2.2 Knowledge Distillation

Knowledge distillation is a technique to transfer knowledge from a complex Teacher model to a simpler Student model. Hinton et al. [11] introduced the concept of aligning the output probabilities of the Teacher and Student models. FitNets [26] extended distillation to intermediate-level hints, focusing on matching intermediate representations. Zagoruyko and Komodakis [12] introduced attention-based distillation to transfer attention maps. Papernot et al. [22] emphasized matching intermediate representations for effective knowledge transfer. Zhang et al. [40] introduced self-distillation, leveraging the Student model as a Teacher to improve generalization. Heo et al. [10] improved knowledge transfer by distilling the activation boundaries formed by hidden neurons.

3 Method

Weakly-supervised VAD aims to train models for frame-level anomaly detection using only video-level supervision. This approach faces challenges due to limited supervision and imbalanced training data, with anomalies typically occupying a small fraction of frames (e.g., 7.3% in UCF-Crime dataset [27]). Previous works [27, 41, 29, 37, 13] address this by using knowledge transfer from large video datasets. We extend this approach by utilizing multiple backbones and introducing a novel fusion method for their representations. To mitigate the increased computational demands, we propose a distillation technique to compress the aggregated model’s knowledge into a single-backbone Student model. The following sections detail our feature extraction process (Section 3.1), temporal network for representation aggregation (Section 3.2), and the proposed distillation approach (Section 3.3).

Figure 3: Schematic diagram of the proposed Temporal Aggregation Module. From the $Q^{c_t}$, $K^{c_t}$, and $V^{c_t}$ vectors obtained from the representations of the $t^{th}$ backbone and the relative position-based vectors $Q^{r}$ and $K^{r}$, four attention matrices are computed. $A_{c\to c}$ is the self content-to-content attention, $A_{c\to c'}$ is the cross content-to-content attention, $A_{c\to p}$ is the content-to-position attention, and $A_{p\to c}$ is the position-to-content attention. The output value is calculated in $H_t$, and sftmx represents the softmax operation.

3.1 Feature Extraction

To alleviate the limited size of the training set and the highly imbalanced distribution of classes in weakly-supervised VAD, we adopt intensive knowledge transfer by employing multiple pre-trained video backbones for feature extraction. Different feature backbones encode different types of information, which can aid in anomaly detection and help circumvent the challenges arising from limited supervision. Additionally, different backbones can help increase the diversity of the extracted features, which can lead to a more comprehensive aggregated representation of the input videos.

Consider that the video dataset consists of $n_v$ pairs $\{(V_i, y_i)\}_{i=1}^{n_v}$, where the $i^{th}$ video, $V_i$, is a sequence of clip instances $\mathbf{v}_{i,j}$ and $y_i \in \{0,1\}$ is the corresponding video-level label. Let $\psi_1, \dots, \psi_T$ denote the set of pre-trained backbones for extracting representations from the input videos. For each input video clip $\mathbf{v}_{i,j}$, we extract features using the $t^{th}$ backbone as $\mathbf{z}_{i,j,t} = \psi_t(\mathbf{v}_{i,j})$, where $\mathbf{z}_{i,j,t} \in \mathbb{R}^{d_t}$ and $d_t$ denotes the dimensionality of the output of the $t^{th}$ backbone. After extracting representations using multiple backbones, we aggregate them using a novel Temporal Aggregation Module, described in the next section.
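To make the extraction step concrete, the sketch below shows how per-segment features could be collected from frozen backbones; the wrapper interface and function name are our own illustration, not the released implementation.

```python
import torch

# Illustrative sketch of multi-backbone feature extraction. `backbones` is assumed
# to be a list of frozen, pre-trained video encoders (e.g., wrappers around I3D,
# S3D, CLIP); the wrapper interface is an assumption, not the paper's code.
@torch.no_grad()
def extract_features(clips: torch.Tensor, backbones: list) -> list:
    """clips: (n_s, C, T, H, W) tensor holding the segments v_{i,j} of one video.
    Returns a list of T feature tensors, one per backbone, each of shape (n_s, d_t)."""
    features = []
    for psi_t in backbones:
        psi_t.eval()                    # backbones stay fixed during training
        features.append(psi_t(clips))   # (n_s, d_t)
    return features
```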

3.2 Temporal Aggregation Module

In weakly-supervised VAD, incorporating relative positional information is vital due to the low-pass temporal characteristics of natural events, where anomalous frames cluster together rather than appearing sporadically. We leverage this information through a disentangled attention mechanism that inherently accounts for relative positions during attention computation. This mechanism [9] employs a relative positional bias, with the maximum relative distance parameterized by $k$. The relative distance between positions $i$ and $j$ is encoded by the function $\gamma(i,j)$, constrained within $[0, 2k)$, which reduces the complexity of the attention model and makes it suitable for low-data and weak-supervision scenarios. Formally, $\gamma(i,j)$ is defined as follows:

\gamma(i,j)=\begin{cases}0&\text{for }i-j\leq-k,\\ 2k-1&\text{for }i-j\geq k,\\ i-j+k&\text{otherwise}.\end{cases} \qquad (1)
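For reference, a minimal implementation of Equation (1) (the function name is ours):

```python
def gamma(i: int, j: int, k: int) -> int:
    """Bucketed relative distance of Eq. (1): clamps i - j to [-k, k) and shifts it into [0, 2k)."""
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k
```

For example, with k = 4, gamma(5, 3, 4) returns 6, i.e., the relative distance 2 shifted into the bucket range [0, 8).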

The disentangled attention has three components: i) content-to-content attention, which attends to the content of the token at position $i$ by interacting with the content of the token at position $j$ within the same sequence; ii) content-to-position attention, which considers the content of token $i$ and its position relative to token $j$, capturing how the content of token $i$ influences its attention weight with respect to token $j$; and iii) position-to-content attention, which assesses the content of token $j$ and its position relative to token $i$, elucidating how the content at position $j$ influences its attention weight with respect to token $i$.

We further disentangle this attention mechanism to adapt it to multi-input scenarios. To this aim, we add a cross-attention module to fuse information from multiple input sequences; Table 5 evaluates the contribution of each of these attention components. Given $T$ input sequences $Z_t$, $t \in \{1,\dots,T\}$, we define the query, key, and value for each representation as:

Q^{c_t} = Z_t W_{q,c_t}, \quad K^{c_t} = Z_t W_{k,c_t}, \quad V^{c_t} = Z_t W_{v,c_t}. \qquad (2)

The shared relative position key and query are also computed as:

Q^{r} = \tilde{Z} W_{q,r}, \quad K^{r} = \tilde{Z} W_{k,r}, \qquad (3)

where $\tilde{Z} \in \mathbb{R}^{2k\times d}$ represents the relative position embedding vectors shared across all layers and backbones (i.e., fixed during forward propagation).

Our aggregation attention mechanism is then formulated as:

\tilde{A}_{i,j,t} = \underbrace{Q_i^{c_t}{K_j^{c_t}}^{\top}}_{\text{self content-to-content}} + \underbrace{\sum_{h\neq t} Q_i^{c_t}{K_j^{c_h}}^{\top}}_{\text{cross content-to-content}} + \underbrace{Q_i^{c_t}{K^{r}_{\gamma(i,j)}}^{\top}}_{\text{content-to-position}} + \underbrace{K_j^{c_t}{Q^{r}_{\gamma(j,i)}}^{\top}}_{\text{position-to-content}}, \qquad (4)

and the output is computed as:

H_t = \mathrm{softmax}\Big(\dfrac{\tilde{A}_t}{\sqrt{(T+2)d}}\Big) V^{c_t}, \qquad (5)

where $T$ is the number of backbones and the final aggregated output is $H = \tfrac{1}{T}\sum_t H_t$. $Q^{c_t}$, $K^{c_t}$, and $V^{c_t}$ are content vectors derived through the projection matrices $W_{q,c_t}, W_{k,c_t}, W_{v,c_t} \in \mathbb{R}^{d\times d}$, where $t \in \{1,\dots,T\}$ indexes the backbone. $Q^{r}$ and $K^{r}$ are the projected relative position vectors, obtained with the projection matrices $W_{q,r}, W_{k,r} \in \mathbb{R}^{d\times d}$, respectively. The Temporal Aggregation Module integrates this disentangled attention mechanism to improve the relative encoding and fusion of representations from multiple backbones.
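A single-head sketch of Equations (2)-(5) is given below, assuming the $T$ backbone representations have already been projected to a common dimension $d$; multi-head attention, layer normalization, and the feedforward head are omitted, and the class and variable names are ours, not from the released implementation.

```python
import math
import torch
import torch.nn as nn


class TemporalAggregationSketch(nn.Module):
    """Single-head sketch of the aggregation attention in Eqs. (2)-(5)."""

    def __init__(self, d: int, num_backbones: int, k: int):
        super().__init__()
        self.T, self.k, self.d = num_backbones, k, d
        # Per-backbone content projections W_{q,c_t}, W_{k,c_t}, W_{v,c_t}.
        self.Wq = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(num_backbones)])
        self.Wk = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(num_backbones)])
        self.Wv = nn.ModuleList([nn.Linear(d, d, bias=False) for _ in range(num_backbones)])
        # Shared relative position embeddings Z_tilde in R^{2k x d} and their projections.
        self.rel = nn.Parameter(torch.randn(2 * k, d) * 0.02)
        self.Wq_r = nn.Linear(d, d, bias=False)
        self.Wk_r = nn.Linear(d, d, bias=False)

    def forward(self, Z):                          # Z: list of T tensors, each (n_s, d)
        n = Z[0].size(0)
        Q = [self.Wq[t](Z[t]) for t in range(self.T)]
        K = [self.Wk[t](Z[t]) for t in range(self.T)]
        V = [self.Wv[t](Z[t]) for t in range(self.T)]
        Qr, Kr = self.Wq_r(self.rel), self.Wk_r(self.rel)      # (2k, d)

        # gamma(i, j) of Eq. (1), computed for all segment pairs at once.
        idx = torch.arange(n, device=Z[0].device)
        delta = (idx[:, None] - idx[None, :]).clamp(-self.k, self.k - 1) + self.k

        H = []
        for t in range(self.T):
            A = Q[t] @ K[t].T                                               # self content-to-content
            A = A + sum(Q[t] @ K[h].T for h in range(self.T) if h != t)     # cross content-to-content
            A = A + (Q[t] @ Kr.T).gather(1, delta)                          # content-to-position
            A = A + ((K[t] @ Qr.T).gather(1, delta)).T                      # position-to-content
            H.append(torch.softmax(A / math.sqrt((self.T + 2) * self.d), dim=-1) @ V[t])
        return torch.stack(H).mean(dim=0)                                   # H = (1/T) sum_t H_t
```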

3.3 Bi-level Fine-grained Knowledge Distillation

Our approach leverages the knowledge of multiple pre-trained backbones, allowing it to benefit from their collective expertise. We employ the MIL ranking approach proposed by Sultani et al. [27] to train the first-stage aggregated model. This intensive knowledge transfer aims to address the scarcity of supervision, which often hinders learning in the current weakly-supervised setup. However, using multiple backbones drastically increases the computational overhead, making the model less suitable for real-world applications. To overcome this issue, we develop a knowledge distillation approach, shown in Figure 2, that distills the knowledge of the aggregated Teacher into the Student model at both the prediction and representation levels.

Prediction-level distillation: In the first level of distillation, we align the output distributions of the Teacher and Student models using the cross-entropy loss function [11]. Weakly-supervised VAD methods [27, 13, 29] generally use a single segment or the top-k segments of the given input during training, since they only have access to video-level annotations. For distillation, however, despite the lack of fine-grained ground-truth labels, we use the Teacher's segment-level predictions to provide robust learning signals. These predictions act as soft pseudo-labels for training the Student model. The trained aggregated model generates scores for anomalous videos, denoted $S^a = \{s_i^a\}_{i=1}^{n_s}$, where $n_s$ is the number of segments. Following [6], to remove jitter and refine the anomaly scores, we apply a convolutional kernel of size $\epsilon$ as a moving-average filter, followed by min-max normalization. Min-max normalization helps focus on the anomalous segments during training. The impact of the moving-average filter and min-max normalization is presented in Table 4. The min-max normalization and moving-average filter are described as:

\hat{y}_i^a = \frac{\tilde{s}_i^a - \min(\tilde{S}^a)}{\max(\tilde{S}^a) - \min(\tilde{S}^a)}, \quad i \in [1, n_s], \qquad \tilde{s}_i^a = \frac{1}{2\epsilon}\sum_{j=i-\epsilon}^{i+\epsilon} s_j^a, \qquad (6)

respectively, where the $\min$ and $\max$ functions compute the minimum and maximum scores in the given set.
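A minimal sketch of this refinement step, assuming the segment scores of one anomalous video are given as a 1D tensor (the value of the kernel half-width is illustrative):

```python
import torch
import torch.nn.functional as F


def refine_pseudo_labels(scores: torch.Tensor, eps: int = 2) -> torch.Tensor:
    """Sketch of Eq. (6): moving-average smoothing with half-width eps, then min-max normalization."""
    s = scores.view(1, 1, -1)
    # Average over the window [i - eps, i + eps]; the 1/(2*eps) factor follows Eq. (6).
    kernel = torch.ones(1, 1, 2 * eps + 1) / (2 * eps)
    smoothed = F.conv1d(F.pad(s, (eps, eps), mode="replicate"), kernel).view(-1)
    # Min-max normalization over the video to emphasize the most anomalous segments.
    return (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min() + 1e-8)
```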

We refine the anomaly scores into $Y^a = \{\hat{y}_i^a\}_{i=1}^{n_s}$ and use these as soft pseudo-labels. Since we are certain about the segment-level annotations of the normal videos, we can combine the soft anomaly labels with the normal videos. Given the nature of the VAD task, which typically involves predictions over two classes, the conventional posterior-matching approach encounters limitations due to the limited support of the distribution. To mitigate this issue, we extend the distillation process to operate at the feature level, employing the InfoNCE loss [21, 28].

Feature-level distillation: For feature-level distillation, we utilize a multilayer perceptron (MLP) with a single hidden layer to transform the input representations $\mathbf{h}_i$ from both the Student and Teacher models into corresponding feature vectors $\mathbf{z}_i = g(\mathbf{h}_i) = W^{(2)}\sigma(W^{(1)}\mathbf{h}_i)$, with $\sigma$ representing the ReLU nonlinearity [1]. Subsequently, leveraging the Teacher model's prediction outputs, we determine class labels for individual features by applying a threshold $\delta$. This yields four distinct feature subsets: $\mathbf{z}_a^T$ (anomaly features of the Teacher), $\mathbf{z}_n^T$ (normal features of the Teacher), $\mathbf{z}_a^S$ (anomaly features of the Student), and $\mathbf{z}_n^S$ (normal features of the Student), where the superscripts T and S denote the Teacher and Student models, respectively. We use cosine similarity, denoted $\text{sim}(\cdot,\cdot)$, as the measure of similarity between feature vectors.

Our feature distillation loss using InfoNCE is defined as:

\mathcal{L}_{nce} = -\log\dfrac{e^{\text{sim}(\mathbf{z}_{a_i}^{T},\mathbf{z}_{a_i}^{S})/\tau}}{\sum_{k=1}^{N} e^{\text{sim}(\mathbf{z}_{a_i}^{T},\mathbf{z}_{n_k}^{S})/\tau} + e^{\text{sim}(\mathbf{z}_{a_i}^{T},\mathbf{z}_{a_i}^{S})/\tau}} - \log\dfrac{e^{\text{sim}(\mathbf{z}_{n_i}^{T},\mathbf{z}_{n_i}^{S})/\tau}}{\sum_{k=1}^{N} e^{\text{sim}(\mathbf{z}_{n_i}^{T},\mathbf{z}_{a_k}^{S})/\tau} + e^{\text{sim}(\mathbf{z}_{n_i}^{T},\mathbf{z}_{n_i}^{S})/\tau}}, \qquad (7)

where $\tau$ represents the temperature parameter. The complete loss function for the distillation is as follows:

\mathcal{L}_d = \mathcal{L}_{bce}(y^T, y^S) + \alpha\tau^2\mathcal{L}_{nce}(T, S), \qquad (8)

where $\mathcal{L}_{nce}$ represents the feature-level distillation loss using InfoNCE, $\mathcal{L}_{bce}$ represents the BCE loss, $\mathcal{L}_d$ represents the combined distillation loss, and $\alpha$ represents the scaling coefficient that controls the contribution of the two loss terms.
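A hedged sketch of the bi-level distillation objective of Equations (7)-(8) is given below; it assumes the projected Teacher and Student features, the Teacher pseudo-labels, and the Student predictions are already aligned per segment, that each class subset is non-empty, and that the hyperparameter defaults follow the ablations in Section 4.4 rather than the released code. All names are ours.

```python
import torch
import torch.nn.functional as F


def bilevel_distillation_loss(z_T, z_S, y_T, y_S, p_T, delta=0.9, tau=10.0, alpha=7.5):
    """z_T, z_S: (n_s, d) projected Teacher/Student features; p_T: Teacher segment
    scores used to split features into anomalous/normal subsets with threshold delta;
    y_T: refined pseudo-labels; y_S: Student predictions in [0, 1]."""
    anom = p_T >= delta
    za_T, zn_T = F.normalize(z_T[anom], dim=-1), F.normalize(z_T[~anom], dim=-1)
    za_S, zn_S = F.normalize(z_S[anom], dim=-1), F.normalize(z_S[~anom], dim=-1)

    def nce(pos_T, pos_S, neg_S):
        # Positive pair: Teacher/Student features of the same segment and class;
        # negatives: Student features of the opposite class (cross negatives).
        pos = torch.exp((pos_T * pos_S).sum(-1) / tau)      # (n_pos,)
        neg = torch.exp(pos_T @ neg_S.T / tau).sum(-1)      # (n_pos,)
        return -torch.log(pos / (neg + pos)).mean()

    l_nce = nce(za_T, za_S, zn_S) + nce(zn_T, zn_S, za_S)   # Eq. (7)
    l_bce = F.binary_cross_entropy(y_S, y_T)                # prediction-level term
    return l_bce + alpha * tau ** 2 * l_nce                 # Eq. (8)
```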

4 Experiments

4.1 Datasets and Metrics

Our model is evaluated on three benchmark datasets for weakly-supervised VAD: UCF-Crime, ShanghaiTech, and XD-Violence. The UCF-Crime dataset [27] contains 1900 untrimmed videos totaling 128 hours, captured by surveillance cameras in diverse real-world settings, with 13 types of anomalies. The ShanghaiTech dataset [16] comprises 437 videos from fixed-angle street cameras, featuring 13 background scenes. We follow Zhong et al.’s [41] approach to adapt it for weakly-supervised learning. The XD-Violence dataset [34] is a comprehensive multiscene collection sourced from various media, containing 4754 untrimmed videos spanning over 217 hours. All three datasets provide video-level labels for training and frame-level labels for testing, allowing for robust evaluation of weakly-supervised VAD models across diverse scenarios and anomaly types.

Temporal Network AUC(%)
MTN 85.31
Multihead Attention 86.28
LSTM 86.97
RNN 86.99
Disentangled Attention 87.09
GRU 87.53
1D CNN 87.72
TAM 88.34
Table 1: Comparison of the proposed Temporal Aggregation Module (TAM) with other variants including the MTN module from RTFM. We observe that TAM is superior to other temporal models at capturing spatio-temporal dependencies.

Evaluation Metrics: To assess the effectiveness of our approach, we use the frame-based receiver operating characteristic (ROC) curve and the area under the curve (AUC), which have been commonly used in previous studies on anomaly detection [27, 29, 6]. Following [34], we use average precision (AP) as the evaluation measure for the XD-Violence dataset.
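A minimal sketch of this evaluation protocol, assuming segment-level scores are expanded to frame resolution by uniform repetition (the expansion scheme and names are our assumptions):

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score


def frame_level_metrics(frame_labels: np.ndarray, segment_scores: np.ndarray) -> dict:
    """Expand segment scores to frame resolution and compare with binary frame labels."""
    n_frames = len(frame_labels)
    repeats = int(np.ceil(n_frames / len(segment_scores)))
    frame_scores = np.repeat(segment_scores, repeats)[:n_frames]
    return {
        "AUC": roc_auc_score(frame_labels, frame_scores),           # UCF-Crime, ShanghaiTech
        "AP": average_precision_score(frame_labels, frame_scores),  # XD-Violence
    }
```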

Method Features T=32 AUC (%)
Sultani et al. [27] C3D-RGB 75.41
Sultani et al. [27] I3D-RGB 77.92
Zhang et al. [39] C3D-RGB 78.66
GCN [41] TSN-RGB - 82.12
MIST [6] I3D-RGB - 82.30
Wu et al. [34] I3D-RGB 82.44
CLAWS [37] C3D-RGB - 83.03
RTFM [29] I3D-RGB 84.30
Wu et al. [33] I3D-RGB 84.89
MSL [14] I3D-RGB 85.30
MSL [14] VSwin-RGB 85.62
S3R [32] I3D-RGB 85.99
SSRL [13] I3D-RGB 86.79
MGFN [4] I3D-RGB 86.98
DAKDT (Ours) Multiple 88.15
DAKDS (Ours) I3D 88.10
DAKDS (Ours) CLIP 88.34
Table 2: Comparison with existing weakly-supervised methods on UCF-Crime dataset. DAKDT and DAKDS denote Teacher and Student models. T=32 indicates 32 non-overlapping video segments. Features column shows backbone used for feature extraction. Asterisk (*) indicates methods for which we could not validate the performance using the official code or our implementation. Results are the average over five independent runs.

4.2 Implementation Details

Our proposed method is implemented using PyTorch [24]. We divide each video into 32 non-overlapping segments to pass through the feature extractors.

Teacher Model: For the Teacher Model, we utilize the I3D [3], S3D [38], and CLIP [25] backbones to obtain representations for the video inputs. Before aggregation, the feature vectors are projected to a common dimension (512) using two-layer MLPs with a ReLU [1] activation after the first layer. The input dimension of the MLPs processing I3D and S3D features is 1024, while it is 512 for the MLP processing CLIP representations. The hidden dimension is 512 for each of the three MLPs.
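A minimal sketch of these projection MLPs, using the dimensions stated above (the helper name and module layout are ours, not from the released code):

```python
import torch.nn as nn


def make_projection(in_dim: int, hidden_dim: int = 512, out_dim: int = 512) -> nn.Module:
    # Two-layer MLP with ReLU after the first layer, projecting to the common dimension.
    return nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, out_dim))


projections = nn.ModuleDict({
    "I3D": make_projection(1024),   # I3D features are 1024-d
    "S3D": make_projection(1024),   # S3D features are 1024-d
    "CLIP": make_projection(512),   # CLIP features are 512-d
})
```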

In our Temporal Aggregation Module, we utilize disentangled attention along with our proposed cross-attention mechanism to combine multiple inputs, as explained in Section 3.2. The disentangled attention module has one hidden layer with a hidden embedding dimension of 1024 and 8 attention heads. Notably, the TAM shares positional embeddings among all backbones. We experimented with values of the maximum relative distance $k$ from 1 to 32, where 32 is the maximum segment index, to determine the value that best constrains the maximum distance between two positions $(i,j)$ in the disentangled attention. The aggregated features are then passed through a feedforward network with hidden dimensions of 512 and 32 and a sigmoid activation in the final layer to obtain the segment-level predictions. The conventional MIL loss [27] serves as the loss function during training (a sketch is given below).
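The MIL ranking objective referenced above could be sketched as follows; the hinge, smoothness, and sparsity terms follow Sultani et al. [27], and the lambda values are those commonly used with that loss rather than values reported in this paper.

```python
import torch


def mil_ranking_loss(scores_anom: torch.Tensor, scores_norm: torch.Tensor,
                     lambda_smooth: float = 8e-5, lambda_sparse: float = 8e-5) -> torch.Tensor:
    """MIL ranking loss over one anomalous/normal video pair of segment scores."""
    # Hinge ranking term on the highest-scoring segment of each bag.
    hinge = torch.clamp(1.0 - scores_anom.max() + scores_norm.max(), min=0.0)
    # Temporal smoothness and sparsity terms on the anomalous bag.
    smoothness = ((scores_anom[1:] - scores_anom[:-1]) ** 2).sum()
    sparsity = scores_anom.sum()
    return hinge + lambda_smooth * smoothness + lambda_sparse * sparsity
```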

Student Model: In the Student Model, we use the CLIP [25] backbone, which has shown notable performance in video analysis tasks [15]. We then pass the representations to the Temporal Module, which uses the disentangled attention mechanism with the cross content-to-content attention term omitted. The attention normalization factor is also adjusted for the single-backbone formulation. The embedding dimensionality and the number of attention heads match the Teacher model. We train the Student model with the bi-level fine-grained knowledge distillation approach discussed in Section 3.3. For representation-level distillation with the contrastive loss, we mask the positive and negative examples using the threshold $\delta = 0.9$ on the Teacher's predictions, then compute the cosine similarity between Teacher and Student samples to obtain the loss over cross-positive and cross-negative examples. For prediction-level distillation, we use the BCE loss to align the output distributions. Finally, we combine the two loss terms linearly using the parameter $\alpha$. We also scale the $\mathcal{L}_{nce}$ loss by $\tau^2$ (Equation 8), which results in better training.

The Teacher and Student models are trained for 100 epochs, using the Adagrad optimizer [5] with a weight decay of 0.001 and a learning rate of 0.0001 for the temporal models and 0.001 otherwise. The training batch size is 60, and each batch consists of 30 normal and 30 anomalous video clips.

Figure 4: Ablation study on the UCF-Crime dataset to investigate the impact of feature backbones used in the Teacher Model. We observe that the involvement of the CLIP backbone significantly boosts the AUC score. The combination of jointly using all three backbones (I3D, S3D, and CLIP) provides the best performance.

4.3 Comparison with the state of the art

Table 2 presents our main results on the UCF-Crime dataset, while Table 3 shows the results for the ShanghaiTech and XD-Violence datasets. DAKDT and DAKDS outperform all existing weakly-supervised methods by a significant margin on all datasets. Remarkably, on the UCF-Crime dataset, DAKDS outperforms the current SOTA methods MGFN [4], SSRL [13], S3R [32], and RTFM [29] by 1.36%, 1.55%, 2.35%, and 4.04%, respectively. DAKDS is also far more efficient than SSRL, which uses multi-scale video crops for training, making it suitable for real-world use. Additionally, DAKD achieves an AUC score of 98.10% on the ShanghaiTech dataset, providing superior performance even though previous methods appear saturated on this dataset. Moreover, DAKD achieves an AP score of 85.61% on the XD-Violence dataset, outperforming existing methods by a significant margin.

Figure 5: Ablation studies on major hyperparameters: the temperature $\tau$ of the contrastive loss, the coefficient $\alpha$ of the total distillation loss, the maximum relative distance $k$ in the disentangled attention mechanism, and the threshold $\delta$ used to determine class labels for the contrastive loss. The ablations are performed on the UCF-Crime dataset.

We further analyze the performance of the models on each anomaly class in UCF-Crime to highlight the effectiveness of DAKD. Figure 6 presents the class-wise AUC scores for the Anomaly classes in UCF-Crime of our method compared with that of Sultani et al. [27] and RTFM [29]. Our approach outperforms existing methods by a significant margin in classes such as Assault, Arrest, Burglary, Explosion, and Vandalism.

4.4 Ablation Study

Analysis of Different Feature Backbones: Ablation studies were conducted to assess the impact of different pre-trained feature backbones on our Teacher Model’s training. Specifically, we utilized three feature backbones: I3D [3], S3D [38], and CLIP [25]. The results of these ablation studies can be found in Figure 4. It is evident from the figure that configurations involving the CLIP backbone consistently outperform other combinations. Notably, the combination involving all three backbones yields the best performance and is consequently adopted for training the Teacher Model. This choice not only enhances the diversity of input features but also mitigates the challenges posed by the limited availability of training data.

Method Features T=32 SHT XDV
Sultani et al. [27] C3D-RGB 86.30 73.20
Zhang et al. [39] C3D-RGB 82.50 -
MIST [6] I3D-RGB - 94.83 -
CLAWS [37] C3D-RGB - 89.67 -
RTFM [29] I3D-RGB 97.21 77.81
Wu et al. [34] I3D-RGB - 78.64
Wu et al. [33] I3D-RGB 97.48 -
MSL [14] I3D-RGB 96.08 78.28
SSRL [13] I3D-RGB 97.04 -
MSL [14] VSwin-RGB 97.32 78.59
DAKDT (Ours) Multiple 98.08 84.78
DAKDS (Ours) I3D 98.02 85.12
DAKDS (Ours) CLIP 98.10 85.61
Table 3: Performance comparison with existing weakly-supervised methods on ShanghaiTech (SHT, AUC score) and XD-Violence (XDV, AP score) datasets. DAKDT and DAKDS denote Teacher and Student models. Other notations as in Table 2.

Analysis of Different Temporal Modules: We study the impact of the proposed Temporal Aggregation Module compared to prominent temporal networks. The feature representations obtained from the I3D [3], S3D [38], and CLIP [25] backbones are passed to the specified temporal module. The results are presented in Table 1, which shows that the proposed TAM outperforms other popular temporal networks in terms of the AUC score. Notably, TAM outperforms both vanilla disentangled attention and the multihead attention mechanism, showing its efficacy.

Analysis of Other Parameters: We investigated the effects of several key training parameters, as shown in Figure 5 and Table 4. Increasing the temperature $\tau$ in the $\mathcal{L}_{nce}$ loss beyond $\tau = 10$ reduced AUC scores, as a higher $\tau$ weakens the penalty on hard negatives, while a smaller $\tau$ enhances feature separation. The scaling factor $\alpha$, which weights $\mathcal{L}_{nce}$ in the overall loss $\mathcal{L}_d$, achieved the highest AUC at $\alpha = 7.5$, with higher values prioritizing the feature-level distillation objective. Ablation studies on the maximum relative distance $k$ in $\gamma(i,j)$ showed optimal performance at $k = 25$, balancing attention model complexity and generalization to a higher number of segments. Similarly, $\delta = 0.9$ yielded the best results for assigning class-based labels to the segment-level features in the feature-level distillation objective. In Table 4, we also compare the impact of the key components of our framework, namely TAM, the bi-level distillation objective, and the pseudo-label refinement. The results highlight the importance of each component and show that all of them are crucial for achieving optimal performance.

Figure 6: AUC Scores with respect to individual anomaly classes on the UCF-Crime dataset. We compare our results with Sultani et al. [27] and RTFM [29] and observe significant improvements in multiple classes, notably Assault, Arrest, Burglary and Explosion.
Method UCF SHT
Baseline 77.92 86.30
Ours w/o TAM 82.27 91.34
Ours w/o $\mathcal{L}_{nce}$ 86.91 96.52
Ours w/o $\mathcal{L}_{bce}$ 87.60 96.89
Ours w/o Min-Max Normalization 87.90 97.21
Ours w/o Moving Average Filter 88.10 97.63
Ours 88.34 98.10
Table 4: Ablation studies on UCF-Crime and ShanghaiTech datasets, examining key components of our framework. Baseline: MIL method [27]. Variants: without TAM, without representation-level distillation ($\mathcal{L}_{nce}$), and without prediction-level distillation ($\mathcal{L}_{bce}$). Also includes ablations on the min-max normalization and moving-average filter used for pseudo-label refinement.
Attention Mechanism UCF SHT
Content-to-Position 86.12 95.72
Position-to-Content 86.47 96.34
Cross Content-to-Content 86.81 96.58
Self Content-to-Content 87.56 97.46
All Components 88.34 98.10
Table 5: Ablation studies on the components of the aggregation attention mechanism described in Section 3.2. The table presents the AUC scores for the UCF-Crime (UCF) and ShanghaiTech (SHT) datasets.

4.5 Qualitative Results

From Figure 1, it is clear that while individual backbones struggle to align with the ground-truth frame-level annotations, our approach of using aggregated features correctly localizes the anomaly. The combined features leverage the strengths of the individual backbones and remain effective when the dataset size is limited. Our approach also shows a comparatively smoother transition between normal and anomalous regions, demonstrating that it accounts for the temporal localization of an event.

5 Conclusion

In this work, we introduced DAKD to address weakly-supervised VAD challenges, particularly the scarcity of frame-level labeled data. Our approach features a Temporal Aggregation Module (TAM) that combines diverse representations from multiple backbones using disentangled cross-attention. To mitigate computational costs, DAKD employs a bi-level knowledge distillation mechanism, transferring the aggregated model’s knowledge to a single-backbone Student. Extensive evaluations on UCF-Crime, ShanghaiTech, and XD-Violence datasets demonstrate the effectiveness of our aggregated model and show that the distilled Student consistently outperforms existing methods, achieving state-of-the-art performance in weakly-supervised VAD.

6 Acknowledgement

This work was supported in part by U.S. NIH grants R01GM134020 and P41GM103712, NSF grants DBI-1949629, DBI-2238093, IIS-2007595, IIS-2211597, and MCB-2205148. This work was supported in part by Oracle Cloud credits and related resources provided by Oracle for Research, and the computational resources support from AMD HPC Fund.

References

  • [1] Abien Fred Agarap. Deep learning using rectified linear units (relu), 2019.
  • [2] Ruichu Cai, Hao Zhang, Wen Liu, Shenghua Gao, and Zhifeng Hao. Appearance-motion memory consistency network for video anomaly detection. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 938–946, 2021.
  • [3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
  • [4] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitude-contrastive glance-and-focus network for weakly-supervised video anomaly detection. arXiv preprint arXiv:2211.15098, 2022.
  • [5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  • [6] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist: Multiple instance self-training framework for video anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14009–14018, 2021.
  • [7] Mariana-Iuliana Georgescu, Antonio Barbalau, Radu Tudor Ionescu, Fahad Shahbaz Khan, Marius Popescu, and Mubarak Shah. Anomaly detection in video via self-supervised and multi-task learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12742–12752, 2021.
  • [8] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 733–742, 2016.
  • [9] Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. Deberta: Decoding-enhanced bert with disentangled attention. arXiv preprint arXiv:2006.03654, 2020.
  • [10] Byeongho Heo, Minsik Lee, Sangdoo Yun, and Jin Young Choi. Knowledge transfer via distillation of activation boundaries formed by hidden neurons. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3779–3787, 2019.
  • [11] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • [12] Nikos Komodakis and Sergey Zagoruyko. Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR, 2017.
  • [13] Guoqiu Li, Guanxiong Cai, Xingyu Zeng, and Rui Zhao. Scale-aware spatio-temporal relation learning for video anomaly detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV, pages 333–350. Springer, 2022.
  • [14] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 1395–1403, 2022.
  • [15] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. In European Conference on Computer Vision, pages 388–404. Springer, 2022.
  • [16] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6536–6545, 2018.
  • [17] Zhian Liu, Yongwei Nie, Chengjiang Long, Qing Zhang, and Guiqing Li. A hybrid video anomaly detection framework via memory-augmented flow reconstruction and flow-guided frame prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13588–13597, 2021.
  • [18] Yiwei Lu, K Mahesh Kumar, Seyed shahabeddin Nabavi, and Yang Wang. Future frame prediction using convolutional vrnn for anomaly detection. In 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–8. IEEE, 2019.
  • [19] Weixin Luo, Wen Liu, and Shenghua Gao. Remembering history with convolutional lstm for anomaly detection. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 439–444. IEEE, 2017.
  • [20] Weixin Luo, Wen Liu, and Shenghua Gao. A revisit of sparse coding based anomaly detection in stacked rnn framework. In Proceedings of the IEEE international conference on computer vision, pages 341–349, 2017.
  • [21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  • [22] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE symposium on security and privacy (SP), pages 582–597. IEEE, 2016.
  • [23] Hyunjong Park, Jongyoun Noh, and Bumsub Ham. Learning memory-guided normality for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14372–14381, 2020.
  • [24] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
  • [25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • [26] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
  • [27] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world anomaly detection in surveillance videos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6479–6488, 2018.
  • [28] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. arXiv preprint arXiv:1910.10699, 2019.
  • [29] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh, Johan W Verjans, and Gustavo Carneiro. Weakly-supervised video anomaly detection with robust temporal feature magnitude learning. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4975–4986, 2021.
  • [30] Jue Wang and Anoop Cherian. Gods: Generalized one-class discriminative subspaces for anomaly detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8201–8211, 2019.
  • [31] Ziming Wang, Yuexian Zou, and Zeming Zhang. Cluster attention contrast for video anomaly detection. In Proceedings of the 28th ACM international conference on multimedia, pages 2463–2471, 2020.
  • [32] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann Fuh, and Tyng-Luh Liu. Self-supervised sparse representation for video anomaly detection. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIII, pages 729–745. Springer, 2022.
  • [33] Peng Wu and Jing Liu. Learning causal temporal relation and feature discrimination for anomaly detection. IEEE Transactions on Image Processing, 30:3513–3527, 2021.
  • [34] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, and Zhiwei Yang. Not only look, but also listen: Learning multimodal violence detection under weak supervision. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 322–339. Springer, 2020.
  • [35] Dan Xu, Elisa Ricci, Yan Yan, Jingkuan Song, and Nicu Sebe. Learning deep representations of appearance and motion for anomalous event detection. arXiv preprint arXiv:1510.01553, 2015.
  • [36] Guang Yu, Siqi Wang, Zhiping Cai, En Zhu, Chuanfu Xu, Jianping Yin, and Marius Kloft. Cloze test helps: Effective video anomaly detection via learning to complete video events. In Proceedings of the 28th ACM International Conference on Multimedia, pages 583–591, 2020.
  • [37] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella Astrid, and Seung-Ik Lee. Claws: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXII 16, pages 358–376. Springer, 2020.
  • [38] Da Zhang, Xiyang Dai, Xin Wang, and Yuan-Fang Wang. S3d: single shot multi-span detector via fully 3d convolutional networks. arXiv preprint arXiv:1807.08069, 2018.
  • [39] Jiangong Zhang, Laiyun Qing, and Jun Miao. Temporal convolutional network with complementary inner bag loss for weakly supervised anomaly detection. In 2019 IEEE International Conference on Image Processing (ICIP), pages 4030–4034. IEEE, 2019.
  • [40] Zhilu Zhang and Mert Sabuncu. Self-distillation as instance-specific label smoothing. Advances in Neural Information Processing Systems, 33:2184–2195, 2020.
  • [41] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu, Thomas H Li, and Ge Li. Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1237–1246, 2019.