
Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing

Jie Fu, Junyu Gao, and Changsheng Xu Jie Fu is with Zhengzhou University, ZhengZhou 450001, China, and also with the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China. (email: fujie@gs.zzu.edu.cn). Junyu Gao and Changsheng Xu are with the State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences, Beijing 100190, P. R. China, and with School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China. Changsheng Xu is also with Peng Cheng Laboratory, ShenZhen 518055, China. (e-mail: junyu.gao@nlpr.ia.ac.cn; csxu@nlpr.ia.ac.cn).
Abstract

Weakly-supervised audio-visual video parsing (WS-AVVP) aims to localize the temporal extents of audio, visual and audio-visual event instances and to identify the corresponding event categories with only video-level category labels for training. Most previous methods focus on refining the supervision for each modality or extracting fruitful cross-modality information for more reliable feature learning, but none of them consider the imbalanced feature learning between the two modalities in this task. In this paper, to balance the feature learning processes of the two modalities, a dynamic gradient modulation (DGM) mechanism is explored, in which a novel and effective metric function is designed to measure the imbalanced feature learning between the audio and visual modalities. Furthermore, our analysis indicates that mixed multimodal computation hampers the precise measurement of multimodal imbalanced feature learning, which in turn weakens the effectiveness of the DGM mechanism. To cope with this issue, a modality-separated decision unit (MSDU) is designed for a more precise measurement of the imbalanced feature learning between the audio and visual modalities. Comprehensive experiments on public benchmarks demonstrate the effectiveness of our proposed method.

Index Terms:
Imbalance-aware, Gradient modulation, Weakly-supervised, Audio-visual video parsing.

I Introduction

Recently, many different audio-visual video understanding tasks such as audio-visual action recognition [1, 2, 3], audio-visual separation [4, 5, 6, 7, 8] and audio-visual event localization [9, 10, 11, 12, 13, 14] have been proposed and have achieved impressive progress. These audio-visual models are all learned under the assumption that the audio and visual modalities are temporally aligned and that fine-grained frame-level annotations for each modality are provided for training.

Figure 1: Simple illustration of our motivation. During model training, the modality containing more obvious semantic information dominates the training process and receives more optimization attention, leaving the other modality suboptimally trained. Consequently, we attempt to balance the multimodal feature learning by modulating the gradients.

However, in real-world scenes, the synchronization between the audio and visual modalities is not always satisfied, and annotating frame-level labels for massive numbers of videos is time-consuming and often infeasible. To mitigate these issues, the weakly-supervised audio-visual video parsing (WS-AVVP) task [15] has been proposed. In this task, only video-level event category labels are provided for training; during inference, the model must identify the starting and ending timestamps of each event instance and predict the corresponding event categories for each modality (i.e., audio, visual, or both).

Inspired by the prior knowledge that multiple modalities provide more effective information from different aspects than a single one, most existing WS-AVVP methods [15, 16, 17] follow the same pipeline to aggregate multimodal information for more precise audio-visual video parsing: first, HAN cross-attention [15] is utilized to enhance the audio and visual feature representations. Afterwards, the enhanced features of the two modalities are fed into a shared attention module and classification head to generate modality-specific attention weights and classification scores, which are aggregated to produce the final video-level classification predictions. Throughout training, a uniform learning objective and a joint training strategy are utilized to optimize the sub-networks of the different modalities.

Looking deeper into the pipeline mentioned above, an intuitive deficiency emerges: the audio and visual modalities are optimized equally and the natural discrepancy between them is overlooked. Concretely, as shown in Figure 1, it is difficult to judge whether ‘Basketball bounce’ is happening by just listening to the audio stream, but it is obvious in the visual data. Consequently, during training, the modality conveying more salient semantic information dominates the training process and obtains more optimization attention, while the modality containing relatively confusing information is not fully optimized [18]. Ultimately, different modalities tend to converge at different rates, which further results in an uncoordinated convergence issue [19, 20, 21].

To cope with the above issue, a Dynamic Gradient Modulation (DGM) mechanism is explored to balance the feature learning processes of the audio and visual modalities. Concretely, a novel metric function is first designed to measure the degree of imbalanced feature learning between the two modalities; this imbalance degree is then utilized to modulate the backward gradients of the modality-specific sub-networks, so as to drive the AVVP model to pay more optimization attention to the suboptimal modality. A similar imbalance metric function [22], based on the predicted scores of the correct categories in different modalities, has been proposed previously. However, it neglects the global prediction distribution over all categories, which is also beneficial for precise imbalance assessment. Consequently, our DGM adopts a more thorough imbalance metric function that considers both kinds of information.

Furthermore, in the traditional WS-AVVP pipeline, a cross-attention operation is always utilized to exchange audio and visual information, which mixes the information of the two modalities. In this case, it is hard to measure the imbalanced feature learning of each modality in isolation, which further damages the effectiveness of our proposed DGM. To address this issue, guided by an analysis of the pipeline, we design a modality-separated decision unit (MSDU), which is embedded between the modality-specific feature encoders and the cross-attention block of the traditional pipeline for a more precise measurement of imbalanced feature learning between the modalities. In the MSDU, the computations of the two modalities are completely separated, which proves beneficial for enlarging the performance gains brought by our DGM mechanism.

Ultimately, to evaluate the performance of our proposed method, we conduct extensive comparison and ablation studies on several widely used audio-visual benchmarks. Experimental results show that our proposed method achieves state-of-the-art performance. To summarize, our main contributions are three-fold:

  • We analyze the imbalanced feature learning issue in the WS-AVVP task. To this end, a dynamic gradient modulation mechanism is proposed to modulate the gradients of the sub-networks for different modalities according to their contributions to the learning objective, so as to make the multimodal framework pay more attention to the suboptimal modality.

  • We observe that the mixed computation of the two modalities disturbs the precise measurement of multimodal imbalanced feature learning, which further damages the effectiveness of our proposed DGM mechanism. We therefore design a Modality-Separated Decision Unit, which cooperates with the proposed DGM for a significant improvement.

  • Comprehensive experimental results show that our proposed model outperforms all current state-of-the-art methods, which verifies the effectiveness of our proposed DGM mechanism and MSDU structure.

II Related Work

In this section, we review the most related work to our method including audio-visual learning and understanding as well as weakly-supervised audio-visual video parsing.

II-A Audio-visual learning and understanding

As the two most common and fundamental forms of sensory information, visual and auditory data have attracted a large amount of research attention and have given rise to many different audio-visual understanding tasks, such as audio-visual feature representation learning [23, 24, 25, 26, 27, 28, 29], audio-visual action recognition [1, 2, 3], sound source localization [30, 31, 32, 33], audio-visual video captioning [34, 35, 36], audio-visual event localization [9, 10, 11, 12, 13, 14] and audio-visual sound separation [4, 5, 6, 7, 8]. Most audio-visual learning methods are designed under the assumption that the audio and visual modalities are synchronized and temporally correlated. Concretely, [27, 25, 28, 29, 32] attempt to learn audio-visual feature representations jointly by utilizing the temporal alignment between the audio and visual modalities as self-supervised guidance. In [26], a novel pretext task is proposed to extract corresponding features for correlated video frames and audio, where audio-visual feature pairs belonging to the same temporal snippet are pulled together and features from unpaired video snippets are pushed apart. In [24], unsupervised multimodal clustering information is utilized as the supervision for cross-modality feature correlation learning. To enhance object detection capability, the correlation between audio and object motion is also taken into consideration [5, 37, 38, 39]. The methods mentioned above have achieved progress by utilizing the synchronization between modalities, which is not always satisfied in realistic scenes.

II-B Weakly-supervised audio-visual video parsing

WS-AVVP aims to parse an arbitrary untrimmed video into a group of event instances associated with semantic categories, temporal boundaries and modalities (i.e., audio, visual and audio-visual), where only the video-level labels are provided as supervision for training. Tian et al. [15] first propose the WS-AVVP setting and design a hybrid attention network in a multimodal multiple instance learning (MMIL) pipeline, where intra- and cross-modality attention strategies are utilized to capture cross-modality contextual information. Afterwards, Lin et al. [16] further explore cross-video and cross-modality complementary information to facilitate WS-AVVP, where both the common and diverse event semantics across videos are exploited to identify audio, visual and audio-visual events. Thereafter, to refine the event labels individually for each modality, Wu et al. [40] propose a novel method that swaps the audio or visual track of a video with that of an unrelated video. JoMoLD [17] utilizes cross-modality loss patterns to help remove noisy event labels for each modality. Although the methods mentioned above have enhanced cross-modality feature learning and mitigated the issue of modality-specific noisy labels, none of them have noticed the imbalanced multimodal feature learning.

II-C Imbalanced audio-visual learning

Compared with uni-modal learning, multimodal learning can integrate more information from different aspects. However, there always exists a discrepancy between modalities, which makes extracting and fusing information from different modalities more challenging. Concretely, the information volume and complexity often differ between modalities, so the training difficulty of the corresponding networks also differs, which leads to an uncoordinated optimization issue [41, 19, 20, 21]: the dominant modality [18], which conveys salient information, receives more optimization attention and achieves better performance throughout training, suppressing the optimization progress of the other modality. To cope with this issue, Wang et al. [21] propose a metric measuring the overfitting-to-generalization ratio (OGR) and design a novel training scheme that minimizes OGR via an optimal blend of multiple supervised signals. To enhance the feature embedding capability of the suboptimal modality, Du et al. [41] propose to distill reliable knowledge from the well-optimized model to strengthen the optimization of the other one. Most similar to our work, OGM [22] measures the optimization discrepancy between modalities by the ratio of predicted scores for the correct categories, which is then utilized to modulate the gradients of the modality-specific models. However, OGM only takes the prediction scores of the correct event categories into consideration, which is not sufficient for reliable imbalance assessment. Consequently, we further consider the global prediction distribution over all categories. Meanwhile, to mitigate the negative effect of mixed multimodal computation on our proposed DGM mechanism, we design the MSDU, which cooperates with DGM for a more significant improvement.

III Our approach

Figure 2: Overview of our proposed framework. DGM is utilized to modulate the gradients of modality-specific feature encoders according to their contributions to the learning objective, so as to balance the feature learning of different modalities. The MSDU structure is proposed to separate the calculation of different modalities, which can cooperate with DGM for more significant improvement.

III-A Problem Formulation

In this paper, we follow the standard protocol of WS-AVVP [15]: formally, given a multimodal training video sequence $\{a^{t}, v^{t}\}_{t=1}^{T}$ with $T$ snippets, only the coarse video-level label $Y\in\{0,1\}^{C}$ is available during training, where $a^{t}$ and $v^{t}$ denote the $t$-th audio and visual snippets and $C$ denotes the number of event categories. In the testing phase, the trained model should predict the snippet-level semantic category labels $\mathbf{y}_{t}=\{y_{t}^{a}, y_{t}^{v}, y_{t}^{av}\}$ for the audio, visual and audio-visual modalities respectively, where $y_{t}^{av}=y_{t}^{a}\times y_{t}^{v}$. In other words, an audio-visual event happens only when the corresponding audio and visual events of the same semantic category occur at the same time. Notably, the video-level event category label (a multi-hot vector) can be obtained by aggregating the snippet-level event category annotations along the temporal dimension of each video. Due to the lack of fine-grained snippet-level supervision during training, current WS-AVVP models are all designed in the multimodal multiple instance learning (MMIL) pipeline: during training, the snippet-level predictions from the different modalities of the same video are aggregated to form the video-level prediction, which is supervised by the annotated video-level label.
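To make the MMIL pipeline concrete, the snippet below sketches how snippet-level predictions and attention weights from the two modalities could be pooled into a video-level prediction and supervised with the multi-hot label. It is a minimal PyTorch-style illustration, not the exact attentive pooling used by HAN or JoMoLD; all tensor shapes and function names are our own assumptions.

```python
import torch
import torch.nn.functional as F

def video_level_prediction(p_a, p_v, att_a, att_v):
    """Pool snippet-level class probabilities into one video-level prediction.

    p_a, p_v:     (T, C) snippet-level probabilities for audio / visual.
    att_a, att_v: (T, C) non-negative attention weights for audio / visual.
    Returns a (C,) video-level probability vector in [0, 1].
    """
    num = (att_a * p_a + att_v * p_v).sum(dim=0)      # attention-weighted sum over time and modality
    den = (att_a + att_v).sum(dim=0).clamp(min=1e-8)  # normalize by the total attention mass
    return num / den

# Toy usage: one 10-snippet video with 25 event categories.
T, C = 10, 25
p_a, p_v = torch.rand(T, C), torch.rand(T, C)
att_a = torch.softmax(torch.randn(T, C), dim=0)
att_v = torch.softmax(torch.randn(T, C), dim=0)
y_video = torch.zeros(C); y_video[3] = 1.0            # multi-hot video-level label
p_video = video_level_prediction(p_a, p_v, att_a, att_v)
loss = F.binary_cross_entropy(p_video, y_video)       # weak video-level supervision
```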

III-B Optimization Analysis and Our DGM

The pipeline of our proposed framework is illustrated in Figure 2. Compared with our framework, the traditional WS-AVVP pipeline (i.e., our framework without the component in the light green background box) contains three main components: audio and visual feature encoders $\psi_{a/v}(\cdot)$ for snippet-level feature learning, a cross-attention mechanism $cro\_att(\cdot)$ for exchanging multimodal information, as well as multimodal attention $Atn_{a/v}(\cdot)$ and classification $\varphi_{a/v}(\cdot)$ modules for snippet-level action prediction.

During training, an arbitrary training video containing audio $a$ and visual $v$ modalities is input into $\psi_{a/v}(\cdot)$ to produce modality-specific snippet-level features, which are then fed into the cross-attention function $cro\_att(\cdot)$ to extract cross-attended multimodal feature representations. Afterwards, $Atn_{a/v}(\cdot)$ and $\varphi_{a/v}(\cdot)$ are utilized to generate the modality-specific attention weights and classification predictions. Ultimately, the attention weights and classification predictions of the two modalities are combined to produce the video-level action classification prediction, which is supervised by the annotated video-level labels. The optimization objective of the traditional pipeline can be formulated as follows:

L = -\frac{1}{NC}\sum_{n=1}^{N}\sum_{c=1}^{C}\big[Y_{n}[c]\log P_{n}[c] + (1 - Y_{n}[c])\log(1 - P_{n}[c])\big]   (1)

where $N$ denotes the number of training samples in a mini-batch and $Y[c]$ indicates the $c$-th element of $Y$. $P_{n}$ is the classification prediction for the $n$-th video, obtained via a sigmoid function. Concretely, the video-level classification prediction $P_{n}$ can be formally calculated as follows:

P_{n} = \varphi(f^{n}_{av}) = \alpha^{w}[f^{n}_{a}, f^{n}_{v}] + [\alpha^{b}_{a}, \alpha^{b}_{v}] = \alpha^{w}_{a}\cdot f^{n}_{a} + \alpha^{b}_{a} + \alpha^{w}_{v}\cdot f^{n}_{v} + \alpha^{b}_{v} = \alpha^{w}_{a}\cdot\pi_{a}(a_{n};\theta_{a}) + \alpha^{b}_{a} + \alpha^{w}_{v}\cdot\pi_{v}(v_{n};\theta_{v}) + \alpha^{b}_{v}   (2)

where $\alpha^{w}$ and $\alpha^{b}$ denote the weights and biases of the modality-specific classifiers, $\pi(\cdot)$ formalizes the feature embedding function consisting of the modality-specific feature encoder $\psi(\cdot)$ and the cross-attention block $cro\_att(\cdot)$, and $\theta$ indicates the learnable parameters of $\pi(\cdot)$. In the following, we omit the bias term $\alpha^{b}$ for brevity. During optimization, gradient descent is utilized to update the parameters of our model. Formally, the updates of the different parameters can be formulated as follows (the superscripts and subscripts $a/v$ are omitted for brevity):

\alpha^{w} \leftarrow \alpha^{w} + \lambda\frac{\partial L}{\partial\varphi(f_{av})}\frac{\partial\varphi(f_{av})}{\partial\alpha^{w}} = \alpha^{w} + \lambda\frac{1}{N}\sum_{n=1}^{N}\frac{\partial L}{\partial\varphi(f^{n}_{av})}\cdot f^{n}   (3)

\theta \leftarrow \theta + \lambda\frac{\partial L}{\partial\varphi(f_{av})}\frac{\partial\varphi(f_{av})}{\partial\theta} = \theta + \lambda\frac{1}{N}\sum_{n=1}^{N}\frac{\partial L}{\partial\varphi(f^{n}_{av})}\frac{\partial\varphi(f^{n}_{av})}{\partial\theta}   (4)

where $\lambda$ is the learning rate of the optimizer. Obviously, the common term in the above two equations is (refer to the Supplementary Material for the detailed derivation of Eq. (5)):

\frac{\partial L}{\partial\varphi(f_{av}^{n})}[c] = \frac{1}{1+e^{-\varphi(f_{av}^{n})}}[c] - Y_{n}[c] = \frac{1}{1+e^{-[\alpha^{w}_{a}\cdot\pi_{a}(a_{n};\theta_{a}) + \alpha^{w}_{v}\cdot\pi_{v}(v_{n};\theta_{v})]}}[c] - Y_{n}[c]   (5)

From the above formulations, we can intuitively conclude that if the semantic information contained in the audio modality is much more obvious than in the visual one, the audio branch will achieve higher predicted confidence scores. Furthermore, the back-propagated gradient (Eq. (5)) is then contributed mostly by $\alpha^{w}_{a}\cdot\pi_{a}(a_{n};\theta_{a})$, which makes the audio branch receive more optimization attention. Consequently, the visual modality will have relatively lower prediction confidence and only limited optimization effort will be devoted to it during training. Ultimately, even though the training of the whole multimodal model has converged, the modality containing relatively weak semantic information may not be fully optimized.

To cope with this issue, we propose a simple but effective dynamic gradient modulation (DGM) strategy (i.e., the component shown in the light green background box of Figure 2) to balance the feature learning of the audio and visual modalities. Specifically, inspired by the fact that a fully optimized model predicts higher classification scores for the correct categories and yields a larger discrepancy between the predicted scores of correct and wrong categories, we assess the relative optimization progress between the visual and audio modalities as follows:

\omega_{v-a} = \frac{\sum_{n}\sum_{c}s_{n}^{v}[c]\cdot Y_{n}[c] + \sum_{n}\triangle\bar{s}_{n}^{v}}{\sum_{n}\sum_{c}s_{n}^{a}[c]\cdot Y_{n}[c] + \sum_{n}\triangle\bar{s}_{n}^{a}}   (6)

where $s_{n}^{v}$ and $s_{n}^{a}$ are the visual and audio classification confidence scores of the $n$-th sample, and $\triangle\bar{s}_{n}$ denotes the corresponding discrepancy between the average prediction scores of the correct and wrong event categories. Similarly, $\omega_{a-v}$ is the reciprocal of $\omega_{v-a}$. If $\omega_{v-a}$ is larger than $1$, the optimizer is paying more attention to the visual modality, which results in suboptimal audio feature learning. Accordingly, the balance coefficients of the two modalities are designed as follows:

\mu^{v} = \begin{cases}1-\tanh\left(\gamma\cdot\omega_{v-a}\right), & \text{if}\ \omega_{v-a}>1\\ 1, & \text{if}\ \omega_{v-a}\leq 1\end{cases}   (7)

where $\tanh(\cdot)$ denotes the activation function and $\gamma$ is a hyper-parameter controlling the modulation degree. $\mu^{a}$ can be obtained in a similar way, which we omit for brevity. Afterwards, we utilize the balance coefficients $\mu$ to modify the gradients of the different sub-networks during back-propagation:

W \leftarrow W + \lambda\cdot\mu\cdot\frac{\partial L}{\partial W}   (8)

where $W$ indicates the parameters to be optimized (i.e., $\theta$ and $\alpha^{w}$ in our model).
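As a concrete reference, the following PyTorch-style sketch implements one possible reading of Eqs. (6)-(8): it computes the imbalance ratio from the modality-separated scores, converts it into balance coefficients, and rescales the gradients of each modality-specific encoder after the backward pass. Function and variable names are illustrative, and the score-gap term is a simplified per-sample approximation rather than the exact quantity used in our implementation.

```python
import torch

def imbalance_ratio(s_v, s_a, y, eps=1e-8):
    """Eq. (6): relative optimization progress of the visual branch over the audio branch.

    s_v, s_a: (N, C) per-sample class confidence scores of the two branches.
    y:        (N, C) multi-hot video-level labels.
    """
    def branch_term(s):
        correct = (s * y).sum()                                           # scores of correct classes
        pos = (s * y).sum(dim=1) / y.sum(dim=1).clamp(min=1)              # mean score of correct classes
        neg = (s * (1 - y)).sum(dim=1) / (1 - y).sum(dim=1).clamp(min=1)  # mean score of wrong classes
        return correct + (pos - neg).sum()                                # add the average-score gap
    return branch_term(s_v) / branch_term(s_a).clamp(min=eps)

def balance_coefficients(omega_va, gamma=0.1):
    """Eq. (7): shrink the gradient of whichever modality currently dominates."""
    mu_v = 1.0 - torch.tanh(gamma * omega_va) if omega_va > 1 else torch.tensor(1.0)
    omega_av = 1.0 / omega_va
    mu_a = 1.0 - torch.tanh(gamma * omega_av) if omega_av > 1 else torch.tensor(1.0)
    return mu_a, mu_v

def modulate_gradients(parameters, mu):
    """Eq. (8): scale the already-computed gradients of one encoder in place."""
    for p in parameters:
        if p.grad is not None:
            p.grad.mul_(mu)

# Typical use inside a training step, after loss.backward() and before optimizer.step():
#   omega = imbalance_ratio(scores_v, scores_a, labels)
#   mu_a, mu_v = balance_coefficients(omega)
#   modulate_gradients(audio_encoder.parameters(), mu_a)
#   modulate_gradients(visual_encoder.parameters(), mu_v)
```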

Moreover, according to [42, 43, 22], the back-propagated gradients in each batch follow a Gaussian distribution, and an appropriately large gradient covariance leads to better generalization. However, the gradient covariance modified by our DGM mechanism becomes $\mu^{2}\cdot\sigma^{2}(\frac{\partial L}{\partial W})$, which is smaller than the original $\sigma^{2}(\frac{\partial L}{\partial W})$ because $\mu\in(0,1]$. To compensate for this, we add an extra noise term to the update as follows:

W \leftarrow W + \lambda\cdot\mathrm{E}\Big(\frac{\partial L}{\partial W}\Big) + \lambda\varepsilon, \quad \text{where}\ \varepsilon\sim\mathcal{N}\Big(0,\ (\mu^{2}+1)\cdot\sigma^{2}\Big(\frac{\partial L}{\partial W}\Big)\Big)   (9)

where $\mathrm{E}(\cdot)$ and $\sigma^{2}(\cdot)$ are the expectation and covariance, respectively.
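A minimal sketch of how the noise term in Eq. (9) could be realized in practice is given below; the batch-wise gradient covariance $\sigma^{2}(\frac{\partial L}{\partial W})$ is approximated here by the element-wise standard deviation of each gradient tensor, which is an assumption on our part rather than the exact procedure of our implementation.

```python
import torch

def add_generalization_noise(parameters, mu):
    """Eq. (9), one possible realization: add zero-mean Gaussian noise whose
    standard deviation is sqrt(mu^2 + 1) times an estimate of the gradient
    standard deviation, so the effective gradient covariance is enlarged
    rather than shrunk by the modulation."""
    scale = (mu ** 2 + 1.0) ** 0.5
    for p in parameters:
        if p.grad is None:
            continue
        std = p.grad.std() * scale                   # rough per-tensor estimate of sigma
        p.grad.add_(torch.randn_like(p.grad) * std)  # epsilon ~ N(0, (mu^2 + 1) * sigma^2)
```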

(a) Traditional pipeline. (b) Our proposed framework pipeline.
Figure 3: Traditional pipeline vs. our proposed framework.

III-C Modality-Separated Imbalance Measurement

Obviously, the core idea of our proposed DGM mechanism is to utilize the discrepancy between the predictions of different modalities to assess the imbalanced feature learning between the audio and visual modalities, which is then applied to modulate the gradient of each modality-specific feature encoder during optimization. However, almost all existing WS-AVVP models [15, 16, 40, 17] follow the pipeline shown in Figure 3(a): audio $a$ and visual $v$ data are first input into the feature encoders $\psi_{a/v}(\cdot)$ to produce modality-specific features. Thereafter, the cross-modality attention mechanism $cro\_att(\cdot)$ is utilized to exchange related information between the modalities and produce the cross-attended features $f_{a}$ and $f_{v}$. Afterwards, $f_{a}$ and $f_{v}$ are separately fed into the classifier $\varphi(\cdot)$ and attention module $Atn(\cdot)$ shared by the audio and visual modalities to produce the snippet-level classification probabilities (i.e., $P_{a}$ and $P_{v}$) and temporal attention weights (i.e., $A_{a}$ and $A_{v}$) for the two modalities. Ultimately, $P_{a}$, $P_{v}$, $A_{a}$ and $A_{v}$ are aggregated to generate the video-level event classification prediction $P$, which is supervised by the annotated video-level event category label. Formally, the above pipeline can be written as follows:

f_{a},\ f_{v} = cro\_att(\psi_{a}(a),\ \psi_{v}(v))
P_{a},\ P_{v} = \varphi(f_{a}),\ \varphi(f_{v}),\quad A_{a},\ A_{v} = Atn(f_{a}),\ Atn(f_{v})
P = Agg_{T}(A_{a}*P_{a} + A_{v}*P_{v})   (10)

where ‘$*$’ and ‘$+$’ denote broadcast multiplication and element-wise addition, and $Agg_{T}(\cdot)$ is the aggregation operation along the temporal dimension. In the traditional case, $P_{a/v}$ and $A_{a/v}$ are utilized to calculate $s^{a/v}$ and $\triangle\bar{s}^{a/v}$ for the assessment of imbalanced feature learning between the modalities.

Looking deeper into the traditional pipeline, we find that, owing to the cross-attention operation between the audio and visual modalities, both $f_{a}$ and $f_{v}$ encode information from both modalities. Consequently, the information of the two modalities is mixed, which makes it hard to purely measure the imbalanced feature learning between the modalities by directly using the attention weights $A_{a/v}$ and classification predictions $P_{a/v}$ produced from the mixed multimodal features $f_{a/v}$.

Following the above analysis, we modify the traditional pipeline and propose a modality-separated decision unit (MSDU) for a purer assessment of imbalanced feature learning between the modalities. Concretely, the MSDU is embedded between the modality-specific feature encoders $\psi_{a/v}(\cdot)$ and the cross-attention block $cro\_att(\cdot)$. Our proposed pipeline is shown in Figure 3(b). Apart from the proposed MSDU (i.e., the component shown in the yellow background box), the rest is similar to the traditional WS-AVVP pipeline: audio $a$ and visual $v$ data are first input into the modality-specific feature encoders to produce the modality-specific features $e_{a/v}$, which are then fed into the cross-attention module for more robust feature extraction. Afterwards, the cross-attended multimodal features $f_{a}$ and $f_{v}$ are taken as the inputs of the modality-specific attention blocks $Atn_{a/v}(\cdot)$ and classification modules $\varphi_{a/v}(\cdot)$, which produce the snippet-level action attention weights $A_{a/v}$ and classification scores $P_{a/v}$. To achieve a purer measurement of imbalanced multimodal feature learning, our proposed MSDU contains another group of modality-specific attention blocks $Atn_{a/v}^{ms}(\cdot)$ and classification modules $\varphi_{a/v}^{ms}(\cdot)$ that predict the snippet-level action attentions $A_{a/v}^{ms}$ and classification scores $P_{a/v}^{ms}$ from the pure audio and visual features $e_{a/v}$ respectively. Formally, the pipeline of our method can be written as follows:

e_{a},\ e_{v} = \psi_{a}(a),\ \psi_{v}(v)
P_{a}^{ms},\ P_{v}^{ms} = \varphi_{a}^{ms}(e_{a}),\ \varphi_{v}^{ms}(e_{v}),\quad A_{a}^{ms},\ A_{v}^{ms} = Atn_{a}^{ms}(e_{a}),\ Atn_{v}^{ms}(e_{v})
f_{a},\ f_{v} = cro\_att(e_{a},\ e_{v})
P_{a},\ P_{v} = \varphi_{a}(f_{a}),\ \varphi_{v}(f_{v}),\quad A_{a},\ A_{v} = Atn_{a}(f_{a}),\ Atn_{v}(f_{v})
P = Agg_{T}(A_{a}*P_{a} + A_{v}*P_{v})   (11)

In our pipeline, $P_{a/v}^{ms}$ and $A_{a/v}^{ms}$ are utilized to calculate $s^{a/v}$ and $\triangle\bar{s}^{a/v}$ in Eq. (6), while $P_{a/v}$ and $A_{a/v}$ serve as the final predictions of our model. In this manner, $P_{a/v}^{ms}$ and $A_{a/v}^{ms}$ are both computed from single-modality data, which leads to a purer measurement of multimodal feature learning.
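The sketch below outlines how the MSDU of Eq. (11) could sit between the encoders and the cross-attention block in code. The encoder, attention and classification modules are placeholders (the feature dimensions, the use of nn.MultiheadAttention for cross-attention, and the final pooling are our assumptions), so this illustrates the data flow rather than the exact HAN/JoMoLD architecture.

```python
import torch
import torch.nn as nn

class MSDUPipeline(nn.Module):
    """Data-flow sketch of Eq. (11): a modality-separated decision unit (MSDU)
    placed between the modality-specific encoders and the cross-attention block.
    All layer choices and dimensions are illustrative placeholders."""

    def __init__(self, dim_a=128, dim_v=2560, dim=512, num_classes=25):
        super().__init__()
        self.psi_a = nn.Linear(dim_a, dim)   # audio encoder psi_a (placeholder)
        self.psi_v = nn.Linear(dim_v, dim)   # visual encoder psi_v (placeholder)
        self.cro_att = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # MSDU heads: act on the pure single-modality features e_a / e_v.
        self.phi_a_ms, self.phi_v_ms = nn.Linear(dim, num_classes), nn.Linear(dim, num_classes)
        self.atn_a_ms, self.atn_v_ms = nn.Linear(dim, num_classes), nn.Linear(dim, num_classes)
        # Final heads: act on the cross-attended features f_a / f_v.
        self.phi_a, self.phi_v = nn.Linear(dim, num_classes), nn.Linear(dim, num_classes)
        self.atn_a, self.atn_v = nn.Linear(dim, num_classes), nn.Linear(dim, num_classes)

    def forward(self, a, v):                 # a: (B, T, dim_a), v: (B, T, dim_v)
        e_a, e_v = self.psi_a(a), self.psi_v(v)
        # Modality-separated predictions, used only to measure the imbalance in Eq. (6).
        p_a_ms, p_v_ms = torch.sigmoid(self.phi_a_ms(e_a)), torch.sigmoid(self.phi_v_ms(e_v))
        a_a_ms, a_v_ms = torch.softmax(self.atn_a_ms(e_a), dim=1), torch.softmax(self.atn_v_ms(e_v), dim=1)
        # Cross-attention exchanges information between the two modalities.
        f_a, _ = self.cro_att(e_a, e_v, e_v)
        f_v, _ = self.cro_att(e_v, e_a, e_a)
        # Final predictions used for parsing.
        p_a, p_v = torch.sigmoid(self.phi_a(f_a)), torch.sigmoid(self.phi_v(f_v))
        a_a, a_v = torch.softmax(self.atn_a(f_a), dim=1), torch.softmax(self.atn_v(f_v), dim=1)
        num = (a_a * p_a + a_v * p_v).sum(dim=1)                 # temporal aggregation Agg_T
        p_video = num / (a_a + a_v).sum(dim=1).clamp(min=1e-8)
        return p_video, (p_a_ms, p_v_ms), (a_a_ms, a_v_ms)
```

In a training step, the modality-separated outputs would feed the imbalance measure of Eq. (6) for DGM, while p_video is supervised by the video-level label.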

IV Experiments

TABLE I: Comparison with all the recent state-of-the-art methods on LLP. Our proposed model achieves the best performance on all audio-visual video parsing sub-tasks under both the segment-level and event-level metrics. ‘Co-teaching’ and ‘JoCoR’ indicate the variants of [44, 45], which are reproduced by the authors of [17]. ‘CL’ indicates the contrastive feature learning mechanism proposed in MA [40].
Methods Segment-level Event-level
A V AV Type Event A V AV Type Event
TALNet [46] 50.0 - - - - 41.7 - - - -
STPN [47] - 46.5 - - - - 41.5 - - -
CMCS [48] - 48.1 - - - - 45.1 - - -
AVE [9] 49.9 37.3 37.0 41.4 43.6 43.6 32.4 32.6 36.2 37.4
AVSDN [49] 47.8 52.0 37.1 45.7 50.8 34.1 46.3 26.5 35.6 37.7
HAN [15] 60.1 52.9 48.9 54.0 55.4 51.3 48.9 43.0 47.7 48.0
HAN+Co-teaching [44] 59.4 56.7 52.0 56.0 56.3 50.7 53.9 46.6 50.4 48.7
HAN+JoCoR [45] 61.0 58.2 53.1 57.4 57.7 52.8 54.7 46.7 51.4 50.3
MA [40] w/o CL 59.8 57.5 52.6 56.6 56.6 52.1 54.4 45.8 50.8 49.4
MA [40] w/ CL 60.3 60.0 55.1 58.9 57.9 53.6 56.4 49.0 53.0 50.6
JoMoLD [17] w/o CL 60.6 62.2 56.0 59.6 58.6 53.1 58.9 49.4 53.8 51.4
JoMoLD [17] w/ CL 61.3 63.8 57.2 60.8 59.9 53.9 59.9 49.6 54.5 52.5
MM-Pyramid [50] 61.1 60.3 55.8 59.7 59.1 53.8 56.7 49.4 54.1 51.2
CVCM [16] 60.8 63.5 57.0 60.5 59.5 53.8 58.9 49.5 54.0 52.1
Ours 63.5 65.0 57.8 62.1 62.0 55.6 61.3 50.5 55.8 54.4

IV-A Experimental Settings

Dataset. Following the standard protocol, we mainly evaluate the performance of our proposed method on the WS-AVVP task using the LLP dataset [15]. LLP contains 11849 YouTube video clips spanning 25 semantic categories, where each clip is 10 seconds long. Following common practice, we split the dataset into a training set with 10000 videos, a validation set with 649 videos and a testing set with 1200 videos.

To comprehensively investigate the effectiveness of our proposed DGM strategy, we also conduct experiments on the CREMA-D [51] and AVE [9] datasets. CREMA-D is an audio-visual benchmark for speech emotion recognition containing 7442 video clips from 91 actors; it is divided into training and validation subsets of 6698 and 744 videos respectively. AVE is an audio-visual benchmark for audio-visual event localization containing 4143 10-second videos from 28 event categories. In our experiments, the split of this dataset follows [9].

Evaluation Metrics. To quantitatively investigate the effectiveness of our proposed method, we evaluate our model on all kinds of events (audio, visual and audio-visual) under both segment-level and event-level metrics. The segment-level metric evaluates the snippet-level prediction performance. In addition, to measure the event-level F-score, we concatenate consecutive positive snippets with the same event category to produce event-level predictions, where 0.5 is selected as the mIoU threshold to determine the positive predictions. Afterwards, the event-level F-score for each event prediction is calculated. To comprehensively assess the overall performance of our model, aggregated results are also measured: Type@AV denotes the average of the audio, visual and audio-visual event localization results, while Event@AV measures the results considering all audio and visual events rather than simply averaging the three metrics. For brevity, in all experimental tables, ‘A’, ‘V’ and ‘AV’ represent the audio, visual and audio-visual events respectively, and ‘Type’ and ‘Event’ denote the Type@AV and Event@AV metrics. On AVE and CREMA-D, we utilize the same evaluation metrics as [52] and [22] respectively.
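For clarity, here is an illustrative sketch of how consecutive positive snippets can be merged into event-level predictions and matched against ground truth at an IoU threshold of 0.5; the official evaluation code may differ in details.

```python
def snippets_to_events(flags):
    """Merge consecutive positive snippet flags (for one class) into (start, end) events, end exclusive."""
    events, start = [], None
    for t, flag in enumerate(flags):
        if flag and start is None:
            start = t
        elif not flag and start is not None:
            events.append((start, t))
            start = None
    if start is not None:
        events.append((start, len(flags)))
    return events

def temporal_iou(e1, e2):
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0, min(e1[1], e2[1]) - max(e1[0], e2[0]))
    union = max(e1[1], e2[1]) - min(e1[0], e2[0])
    return inter / union if union > 0 else 0.0

pred = snippets_to_events([0, 1, 1, 1, 0, 0, 1, 0, 0, 0])   # [(1, 4), (6, 7)]
gt = [(1, 4)]
# A predicted event counts as a true positive if it matches a ground-truth event
# of the same class with IoU >= 0.5; F-scores are then computed from these matches.
matched = [p for p in pred if any(temporal_iou(p, g) >= 0.5 for g in gt)]
```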

Implementation Details. Following the standard protocol, given an arbitrary 10-second video, we first uniformly divide it into 10 non-overlapping snippets, each containing 8 frames. For the visual input, the pre-trained ResNet152 [53] and R(2+1)D [54] networks are utilized to produce the snippet-level appearance and motion features respectively, which are concatenated along the channel dimension to form the visual input feature. For the audio input, a pre-trained VGGish [55] model is adopted to extract the audio feature. In all experiments, we utilize JoMoLD [17] as the baseline and the training batch size is set to 128. Each model is trained for 25 epochs using the Adam optimizer. $D$ and $\gamma$ are set to 512 and 0.1 respectively. The initial learning rate is 5e-4 and drops by a factor of 0.25 every 6 epochs. In the experiments on AVE and CREMA-D, our reproduced PSP [52] and OGM [22] methods are utilized as the baselines respectively.
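The optimizer and schedule described above can be reproduced in a few lines; this is a hedged sketch that reads "drops by a factor of 0.25" as multiplying the learning rate by 0.25, and `model` is a stand-in for the full WS-AVVP network rather than our actual architecture.

```python
import torch

model = torch.nn.Linear(10, 25)   # stand-in for the full WS-AVVP network
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.25)

for epoch in range(25):
    # ... iterate over mini-batches of 128 videos: forward pass, video-level loss,
    # loss.backward(), DGM gradient modulation (Eqs. (6)-(9)), optimizer.step() ...
    optimizer.step()   # placeholder step so the sketch runs end to end
    scheduler.step()
```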

TABLE II: Effectiveness of DGM on CREMA-D dataset.
Methods Accuracy
Baseline [22] 61.1
DGM 61.8 (+0.7)
TABLE III: Effectiveness of DGM on AVE.
Methods Accuracy
Baseline [52] 75.57
DGM 76.14 (+0.57)

IV-B Comparison with State-of-the-art Methods

To comprehensively validate the effectiveness of our model, we compare it with different state-of-the-art methods, including the weakly-supervised sound event detection model TALNet [46], weakly-supervised video action detection models CMCS [48] and STPN [47], modified audio-visual event localization algorithms AVE [9] and AVSDN [49], as well as the recent weakly-supervised audio-visual video parsing methods HAN [15], MA [40], CVCM [16], MM-Pyramid [50] and JoMoLD [17]. For fair comparison, all WS-AVVP models are trained by using the LLP training set with the same videos and features.

Detailed experimental results of our proposed model and other state-of-the-art methods on the LLP testing subset are reported in Table I. Our proposed model performs favorably against all compared methods and achieves the best performance on all audio-visual video parsing sub-tasks under both the segment-level and event-level evaluation metrics. Concretely, compared with the most recent JoMoLD model, our model achieves an average performance gain of 1.48 points over the Audio, Visual, Audio-visual, Type@AV and Event@AV sub-tasks under the segment-level metric, and an average improvement of 1.44 points under the event-level metric. These results clearly demonstrate the effectiveness of our proposed method.

Additionally, to comprehensively investigate the effectiveness of our proposed method, more experiments are conducted on different audio-visual benchmarks including CREMA-D and AVE. The corresponding experimental results are summarized in Table II and Table III. Consistent performance improvement verifies the effectiveness and good generalization of our proposed method.

TABLE IV: Ablation study on the DGM mechanism and MSDU. Both segment-level and event-level parsing results are reported. The best results are highlighted in bold.
Event type Methods Segment-level Event-level
A JoMoLD 61.3 53.9
JoMoLD+DGM 61.9 54.4
JoMoLD+DGM+MSDU 63.5 55.6
V JoMoLD 63.8 59.9
JoMoLD+DGM 64.4 60.9
JoMoLD+DGM+MSDU 65.0 61.3
AV JoMoLD 57.2 49.6
JoMoLD+DGM 56.8 49.3
JoMoLD+DGM+MSDU 57.8 50.5
Type JoMoLD 60.8 54.5
JoMoLD+DGM 61.0 54.9
JoMoLD+DGM+MSDU 62.1 55.8
Event JoMoLD 59.9 52.5
JoMoLD+DGM 60.6 53.4
JoMoLD+DGM+MSDU 62.0 54.4
TABLE V: Ablation study on different optimization discrepancy measure functions.
Event type Methods Segment-level Event-level
A score 63.1 56.0
discrepancy 62.4 55.1
fusion 63.5 55.6
V score 64.4 60.9
discrepancy 64.4 60.6
fusion 65.0 61.3
AV score 57.5 49.9
discrepancy 57.0 49.6
fusion 57.8 50.5
Type score 61.7 55.6
discrepancy 61.2 55.1
fusion 62.1 55.8
Event score 61.7 54.7
discrepancy 60.9 53.8
fusion 62.0 54.4

IV-C Ablation Studies

Effectiveness of each component. In this part, to further investigate the effectiveness of each component of our method, we conduct comprehensive ablation studies on the LLP dataset. The corresponding experimental results are reported in Table IV. From the results, we observe that our proposed DGM consistently improves the audio-visual video parsing performance of different network structures (i.e., JoMoLD and our proposed MSDU structure). Concretely, when we insert our proposed DGM mechanism into JoMoLD, the average performance improvements for the Audio, Visual, Type@AV and Event@AV sub-tasks are 0.525 and 0.7 under the segment-level and event-level evaluation metrics respectively. However, the performance on the audio-visual sub-task drops slightly, which can be attributed to the fact that the mixed multimodal feature leads to unreliable imbalance measurement.

Compared with ‘JoMoLD+DGM’, the MSDU brings additional average performance improvements of 1.14 and 0.94 on the different sub-tasks under the segment-level and event-level metrics respectively. Moreover, our proposed DGM mechanism brings more performance gains for ‘JoMoLD+MSDU’ than for JoMoLD. We attribute this to the fact that the computations of the two modalities are mixed in JoMoLD, which is not beneficial for a pure assessment of imbalanced feature learning between the modalities. In our designed MSDU block, by contrast, the computation of each modality is separated and pure, which boosts the effectiveness of DGM.

To sum up, we can draw the following conclusions from the above experimental results: (1) Our proposed DGM mechanism can improve the audio-visual video parsing performance by balancing the feature learning between the modalities; (2) The MSDU structure can effectively separate the computations of the two modalities, thus further promoting the effectiveness of our proposed DGM mechanism; (3) Our proposed DGM mechanism can be adapted to different backbone structures, achieving consistent performance improvements on different WS-AVVP sub-tasks.

Optimization imbalance between different modalities. To reliably measure the imbalanced feature learning between the audio and visual modalities, we investigate three different methods: (1) For each training video, we first calculate the sum of predicted scores for the correct categories in the audio and visual modalities separately. Afterwards, the ratio of the summed predicted scores is calculated as the imbalance degree between the two modalities. (2) We first calculate the discrepancy between the average predicted scores for the correct and false semantic categories in each modality. Thereafter, the ratio of the discrepancies of the two modalities is utilized to measure the imbalanced feature learning between them. (3) The above two methods are combined to assess the imbalanced optimization between the audio and visual feature encoders. Detailed experimental results are listed in Table V, where ‘score’, ‘discrepancy’ and ‘fusion’ denote the three methods mentioned above. From the experimental results, we conclude that all three methods can effectively measure the optimization imbalance between the audio and visual feature encoders. When we combine the two kinds of information used in the first and second strategies, our DGM mechanism achieves the largest performance gains (i.e., ‘JoMoLD+DGM+MSDU’ in Table IV). We attribute this to the fact that the ‘score’ strategy only takes the predictions for the correct categories into consideration and neglects the global distribution of the classification predictions. Similarly, the ‘discrepancy’ method only considers the discrepancy between the predictions for correct and false event categories while ignoring the original prediction scores for the correct categories. When the two strategies are combined, the more exhaustive information provides a more reliable assessment of the optimization imbalance between the feature encoders, which better balances the multimodal feature learning and improves the audio-visual video parsing performance.
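To make the comparison concrete, the three measurement strategies can be expressed as a single function; this restates the ‘score’, ‘discrepancy’ and ‘fusion’ variants under the same simplified per-sample approximation used in the earlier sketch, with illustrative names and no claim to match our exact implementation.

```python
import torch

def optimization_imbalance(s_v, s_a, y, mode="fusion", eps=1e-8):
    """Ratio-based imbalance between the visual and audio branches.

    'score'       : ratio of summed scores of the correct categories.
    'discrepancy' : ratio of (mean correct score - mean wrong score).
    'fusion'      : sum of both terms per modality, as in Eq. (6)."""
    def parts(s):
        correct = (s * y).sum()
        pos = ((s * y).sum(dim=1) / y.sum(dim=1).clamp(min=1)).sum()
        neg = ((s * (1 - y)).sum(dim=1) / (1 - y).sum(dim=1).clamp(min=1)).sum()
        return correct, pos - neg
    (c_v, d_v), (c_a, d_a) = parts(s_v), parts(s_a)
    if mode == "score":
        return c_v / c_a.clamp(min=eps)
    if mode == "discrepancy":
        return d_v / d_a.clamp(min=eps)
    return (c_v + d_v) / (c_a + d_a).clamp(min=eps)
```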

TABLE VI: Ablation study on the effect of different values of $\gamma$ in Eq. (7). Bold and underlined represent the optimal and sub-optimal performance.
$\gamma$ Segment-level Event-level AVG
A V AV Type Event A V AV Type Event
0.1 63.5 65.0 57.8 62.1 62.0 55.6 61.3 50.5 55.8 54.4 58.80
0.2 63.6 64.5 57.7 61.9 62.0 56.1 60.8 50.2 55.7 54.6 58.71
0.3 63.2 65.4 57.6 62.1 61.9 55.2 61.5 50.0 55.6 54.1 58.66
0.4 63.6 64.8 57.6 62.0 62.1 55.7 60.7 49.8 55.4 54.2 58.59
0.5 63.6 64.8 57.3 61.9 62.2 55.7 60.9 49.5 55.3 54.5 58.57
0.6 62.9 64.9 57.6 61.8 61.7 54.9 61.3 49.9 55.4 53.9 58.43
0.7 62.9 64.5 57.5 61.7 61.2 55.5 60.5 49.9 55.3 53.8 58.28
0.8 62.6 65.0 57.3 61.6 61.1 54.5 61.1 49.7 55.1 53.2 58.12
0.9 62.7 64.5 57.0 61.4 61.6 55.1 60.9 49.4 55.2 54.1 58.19
(a) Video ‘9LNuWqOQP5Q’; (b) Video ‘K9KdTtdCRWw’; (c) Video ‘Mzp5uTDaKog’.
Figure 4: Illustration of the groundtruths as well as the predictions of the baseline method and our proposed full model. ‘V GT’ and ‘A GT’ denote the visual and audio event groundtruths, ‘BS’ indicates the baseline structure, and ‘Our’ is our proposed method (i.e., JoMoLD equipped with our proposed MSDU and DGM).
Figure 5: Losses during training before and after the gradients of the different modalities are modulated by our proposed DGM mechanism. The x-coordinate and y-coordinate indicate the training epoch and loss value respectively. Dashed and solid lines denote the losses before and after the gradients are modulated by our proposed DGM mechanism, and different colors represent the different modalities.
TABLE VII: Ablation study on the generalization of our proposed method in different WS-AVVP models.
Method Segment-level Event-level AVG
A V AV Type Event A V AV Type Event
HAN 60.1 52.9 48.9 54.0 55.4 51.3 48.9 43.0 47.7 48.0 51.02
HAN + Ours 61.0 55.2 51.5 55.9 56.2 53.4 51.0 45.1 49.8 50.0 52.91
JoMoLD 61.3 63.8 57.2 60.8 59.9 53.9 59.9 49.6 54.5 52.5 57.34
JoMoLD + Ours 63.5 65.0 57.8 62.1 62.0 55.6 61.3 50.5 55.8 54.4 58.80

Effect of $\gamma$ in Eq. (7). To analyze the effect of $\gamma$ in Eq. (7) on our proposed DGM mechanism, we conduct comprehensive ablations; the corresponding results are reported in Table VI. From the results, we conclude that when $\gamma$ equals 0.1, our model achieves the best audio-visual video parsing performance. Additionally, when $\gamma$ increases from 0.2 to 0.9, the performance decreases slightly. However, compared with the ‘JoMoLD’ model in Table IV, our proposed DGM mechanism always improves the performance regardless of $\gamma$, which verifies that DGM can robustly balance the optimization processes of the two modalities and further improve the performance of our WS-AVVP model.

Balanced optimization between different modalities. To intuitively verify that our proposed DGM mechanism can balance the optimization between the audio and visual modalities, we analyze the training losses of the two modalities before and after the gradients are modulated by DGM. As illustrated in Figure 5, the x-coordinate and y-coordinate denote the training epoch and the corresponding loss respectively. We observe a large gap between the losses of the audio and visual modalities before the gradients of our WS-AVVP model are modulated. After the audio and visual sub-networks are modulated by our proposed DGM, the training losses of the two modalities become balanced, which strongly suggests that DGM makes the suboptimal modality receive more optimization attention and thereby balances the feature learning processes of the different modalities.

Generalization of our proposed method. To verify the generalization ability of our proposed DGM strategy, we embed it into different WS-AVVP models including HAN [15] and JoMoLD [17]. The corresponding experimental results are summarized in Table VII. Our proposed method consistently improves the audio-visual video parsing performance of the different baseline models, which proves that it can cooperate with various model structures for better AVVP performance without being limited to a specific model structure.

Qualitative Results. To investigate the effectiveness of our proposed method more intuitively, we compare the qualitative audio-visual parsing predictions produced by the baseline and our complete model (i.e., JoMoLD+MSDU+DGM) in Figure 4. From the results, we can conclude that our complete method localizes more accurate audio, visual and audio-visual event instances than the baseline structure, which verifies that our proposed DGM can effectively improve WS-AVVP performance. In addition, we observe that, compared with the baseline model, our proposed DGM mechanism helps detect more precise event instances in the relatively weak modality. For example, as shown in subfigure (c) of Figure 4, it is hard to judge whether ‘Frying Food’ is happening by just listening to the audio, but it is obvious in the visual modality. Consequently, the baseline method can only detect accurate event instances in the visual modality but not in audio, whereas our proposed method precisely localizes the events in both the visual and audio modalities. We can therefore conclude that our DGM improves video parsing performance in relatively weak modalities, which further proves that DGM balances the optimization processes of the different modalities and makes the weak modalities receive more optimization attention.

V Conclusion

In this paper, we first analyze the imbalanced feature learning between different modalities in the WS-AVVP task. To mitigate this issue, a dynamic gradient modulation strategy is designed to modulate the gradients of the feature encoders of the different modalities, so as to make the model pay more optimization attention to the suboptimal branch. Meanwhile, to address the negative effect of the mixed multimodal feature on our proposed DGM, we design a modality-separated decision unit for a more precise measurement of the imbalanced feature learning between the audio and visual modalities. Comprehensive experiments verify the effectiveness of our proposed method.

References

  • [1] M. Planamente, C. Plizzari, E. Alberti, and B. Caputo, “Domain generalization through audio-visual relative norm alignment in first person action recognition,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2022, pp. 1807–1818.
  • [2] R. Gao, T.-H. Oh, K. Grauman, and L. Torresani, “Listen to look: Action recognition by previewing audio,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 457–10 467.
  • [3] E. Kazakos, A. Nagrani, A. Zisserman, and D. Damen, “Epic-fusion: Audio-visual temporal binding for egocentric action recognition,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 5492–5501.
  • [4] E. Tzinis, S. Wisdom, T. Remez, and J. R. Hershey, “Audioscopev2: Audio-visual attention architectures for calibrated open-domain on-screen sound separation,” in European Conference on Computer Vision.   Springer, 2022, pp. 368–385.
  • [5] C. Gan, D. Huang, H. Zhao, J. B. Tenenbaum, and A. Torralba, “Music gesture for visual sound separation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 478–10 487.
  • [6] E. Tzinis, S. Wisdom, A. Jansen, S. Hershey, T. Remez, D. Ellis, and J. R. Hershey, “Into the wild with audioscope: Unsupervised audio-visual separation of on-screen sounds,” in International Conference on Learning Representations, 2020.
  • [7] Y. Tian, D. Hu, and C. Xu, “Cyclic co-learning of sounding object visual grounding and sound separation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2745–2754.
  • [8] R. Gao and K. Grauman, “Visualvoice: Audio-visual speech separation with cross-modal consistency,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).   IEEE, 2021, pp. 15 490–15 500.
  • [9] Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in unconstrained videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 247–263.
  • [10] Y. Wu, L. Zhu, Y. Yan, and Y. Yang, “Dual attention matching for audio-visual event localization,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6292–6300.
  • [11] H. Xuan, Z. Zhang, S. Chen, J. Yang, and Y. Yan, “Cross-modal attention network for temporal inconsistent audio-visual event localization,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 279–286.
  • [12] B. Duan, H. Tang, W. Wang, Z. Zong, G. Yang, and Y. Yan, “Audio-visual event localization via recursive fusion by joint co-attention,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 4013–4022.
  • [13] C. Xue, X. Zhong, M. Cai, H. Chen, and W. Wang, “Audio-visual event localization by learning spatial and semantic co-attention,” IEEE Transactions on Multimedia, 2021.
  • [14] H. Xu, R. Zeng, Q. Wu, M. Tan, and C. Gan, “Cross-modal relation-aware networks for audio-visual event localization,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3893–3901.
  • [15] Y. Tian, D. Li, and C. Xu, “Unified multisensory perception: Weakly-supervised audio-visual video parsing,” in European Conference on Computer Vision.   Springer, 2020, pp. 436–454.
  • [16] Y.-B. Lin, H.-Y. Tseng, H.-Y. Lee, Y.-Y. Lin, and M.-H. Yang, “Exploring cross-video and cross-modality signals for weakly-supervised audio-visual video parsing,” Advances in Neural Information Processing Systems, vol. 34, pp. 11 449–11 461, 2021.
  • [17] H. Cheng, Z. Liu, H. Zhou, C. Qian, W. Wu, and L. Wang, “Joint-modal label denoising for weakly-supervised audio-visual video parsing,” European Conference on Computer Vision, 2022.
  • [18] K. Parida, N. Matiyali, T. Guha, and G. Sharma, “Coordinated joint multimodal embeddings for generalized audio-visual zero-shot classification and retrieval of videos,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020, pp. 3251–3260.
  • [19] A. A. Ismail, M. Hasan, and F. Ishtiaq, “Improving multimodal accuracy through modality pre-training and attention,” arXiv preprint arXiv:2011.06102, 2020.
  • [20] Y. Sun, S. Mai, and H. Hu, “Learning to balance the learning rates between various modalities via adaptive tracking factor,” IEEE Signal Processing Letters, vol. 28, pp. 1650–1654, 2021.
  • [21] W. Wang, D. Tran, and M. Feiszli, “What makes training multi-modal classification networks hard?” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 12 695–12 705.
  • [22] X. Peng, Y. Wei, A. Deng, D. Wang, and D. Hu, “Balanced multimodal learning via on-the-fly gradient modulation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8238–8247.
  • [23] Y. Aytar, C. Vondrick, and A. Torralba, “Soundnet: Learning sound representations from unlabeled video,” Advances in neural information processing systems, vol. 29, 2016.
  • [24] D. Hu, F. Nie, and X. Li, “Deep multimodal clustering for unsupervised audiovisual learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9248–9257.
  • [25] B. Korbar, D. Tran, and L. Torresani, “Cooperative learning of audio and video models from self-supervised synchronization,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [26] R. Arandjelovic and A. Zisserman, “Look, listen and learn,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 609–617.
  • [27] Y. Cheng, R. Wang, Z. Pan, R. Feng, and Y. Zhang, “Look, listen, and attend: Co-attention network for self-supervised audio-visual representation learning,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3884–3892.
  • [28] S. Ma, Z. Zeng, D. McDuff, and Y. Song, “Active contrastive learning of audio-visual video representations,” in International Conference on Learning Representations, 2020.
  • [29] P. Morgado, Y. Li, and N. Nvasconcelos, “Learning representations from audio-visual spatial alignment,” Advances in Neural Information Processing Systems, vol. 33, pp. 4733–4744, 2020.
  • [30] K. K. Rachavarapu, V. Sundaresha, A. Rajagopalan et al., “Localize to binauralize: Audio spatialization from visual sound source localization,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 1930–1939.
  • [31] H. Xuan, Z. Wu, J. Yang, Y. Yan, and X. Alameda-Pineda, “A proposal-based paradigm for self-supervised sound source localization in videos,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1029–1038.
  • [32] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-supervised multisensory features,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 631–648.
  • [33] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon, “Learning to localize sound source in visual scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4358–4366.
  • [34] C. Hori, T. Hori, T.-Y. Lee, Z. Zhang, B. Harsham, J. R. Hershey, T. K. Marks, and K. Sumi, “Attention-based multimodal fusion for video description,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 4193–4202.
  • [35] T. Rahman, B. Xu, and L. Sigal, “Watch, listen and tell: Multi-modal weakly supervised dense event captioning,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 8908–8917.
  • [36] Y. Tian, C. Guan, J. Goodman, M. Moore, and C. Xu, “An attempt towards interpretable audio-visual video captioning,” arXiv preprint arXiv:1812.02872, 2018.
  • [37] C. Gan, H. Zhao, P. Chen, D. Cox, and A. Torralba, “Self-supervised moving vehicle tracking with stereo sound,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 7053–7062.
  • [38] H. Zhao, C. Gan, W.-C. Ma, and A. Torralba, “The sound of motions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 1735–1744.
  • [39] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba, “The sound of pixels,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 570–586.
  • [40] Y. Wu and Y. Yang, “Exploring heterogeneous clues for weakly-supervised audio-visual video parsing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 1326–1335.
  • [41] C. Du, T. Li, Y. Liu, Z. Wen, T. Hua, Y. Wang, and H. Zhao, “Improving multi-modal learning with uni-modal teachers,” arXiv preprint arXiv:2106.11059, 2021.
  • [42] S. Mandt, M. D. Hoffman, and D. M. Blei, “Stochastic gradient descent as approximate bayesian inference,” arXiv preprint arXiv:1704.04289, 2017.
  • [43] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey, “Three factors influencing minima in sgd,” arXiv preprint arXiv:1711.04623, 2017.
  • [44] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama, “How does disagreement help generalization against label corruption?” in International Conference on Machine Learning.   PMLR, 2019, pp. 7164–7173.
  • [45] H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels by agreement: A joint training method with co-regularization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 726–13 735.
  • [46] Y. Wang, J. Li, and F. Metze, “A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 31–35.
  • [47] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox, “Self: Learning to filter noisy labels with self-ensembling,” arXiv preprint arXiv:1910.01842, 2019.
  • [48] D. Liu, T. Jiang, and Y. Wang, “Completeness modeling and context separation for weakly supervised temporal action localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1298–1307.
  • [49] Y.-B. Lin, Y.-J. Li, and Y.-C. F. Wang, “Dual-modality seq2seq network for audio-visual event localization,” in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2019, pp. 2002–2006.
  • [50] J. Yu, Y. Cheng, R.-W. Zhao, R. Feng, and Y. Zhang, “Mm-pyramid: multimodal pyramid attentional network for audio-visual event localization and video parsing,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 6241–6249.
  • [51] H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,” IEEE transactions on affective computing, vol. 5, no. 4, pp. 377–390, 2014.
  • [52] J. Zhou, L. Zheng, Y. Zhong, S. Hao, and M. Wang, “Positive sample propagation along the audio-visual event line,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 8436–8444.
  • [53] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [54] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [55] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “Cnn architectures for large-scale audio classification,” in 2017 ieee international conference on acoustics, speech and signal processing (icassp).   IEEE, 2017, pp. 131–135.