
A Novel Long-term Iterative Mining Scheme for Video Salient Object Detection

Chenglizhao Chen1     Hengsen Wang2    Yuming Fang3    Chong Peng1∗
1China University of Petroleum (East China)  2Qingdao University  3Jiangxi University of Finance and Economics
Code and Data: https://github.com/qduOliver/LIMVSOD
Corresponding author: Chong Peng (pchong1991@163.com)
Abstract

The existing state-of-the-art (SOTA) video salient object detection (VSOD) models have widely followed the short-term methodology, which dynamically determines the balance between spatial and temporal saliency fusion by considering only a limited number of current consecutive frames. However, the short-term methodology has one critical limitation: it conflicts with the real mechanism of our visual system, which is a typical long-term methodology. As a result, failure cases keep showing up in the results of the current SOTA models, and the short-term methodology has become the major technical bottleneck. To solve this problem, this paper proposes a novel VSOD approach that performs VSOD in a completely long-term way. Our approach converts VSOD, a conventionally sequential task, into a data mining problem, i.e., decomposing the input video sequence into object proposals in advance and then mining as many salient object proposals as possible in an easy-to-hard way. Since all object proposals are simultaneously available, the proposed approach is a complete long-term approach, which can alleviate some difficulties rooted in conventional short-term approaches. In addition, we devise an online updating scheme that can grasp the most representative and trustworthy pattern profile of the salient objects, outputting framewise saliency maps with rich details and both spatial and temporal smoothness. The proposed approach outperforms almost all SOTA models on five widely used benchmark datasets.

Index Terms:
Long-term Information; Video Salient Object Detection.
Figure 1: The major difference between the current SOTA models and our approach. As shown in subfigure-A, conventional VSOD models usually take several consecutive video frames as input; thus, their VSOD methodology follows a short-term manner, where the current saliency decision is derived only from the current consecutive frames. In sharp contrast, our approach is based on object-level clustering, where all object proposals belonging to different frames (including beyond-scope frames) are simultaneously available when making the saliency prediction. This is a typical long-term manner, which is more robust than the short-term manner when the salient objects undergo large appearance or movement changes. Subfigure-B visualizes the salient object proposal mining process.

I Introduction and Motivation

The video salient object detection (VSOD), also known as zero-shot video segmentation [1, 2, 3, 4, 5, 6], has received extensive research attention in recent years, whose primary objective is to segment video objects that attract the human visual attention most [7, 8, 9]. Different from the widely studied image salient object detection (ISOD) using spatial information only [10, 11, 12], the temporal information provided by the video data makes the saliency detection task more difficult [13, 14, 15], and we give an in-depth discussion regarding this issue to clearly demonstrate our motivation.

Generally, compared with the spatial saliency cue, the human visual system is more easily attracted by the temporal saliency cue [16, 17, 18]. Taking an image with a complex scene for instance, we tend to focus on image regions that are spatially contrastive to their nearby surroundings in terms of colors or textures [19, 20, 21]. However, our attention can promptly shift to another region if it involves sudden movement, even if this movement is very slight. Nevertheless, a preferable VSOD approach should not overemphasize temporal information because it is not always trustworthy; for example, the temporal saliency cue would be completely absent if the salient object stays static for a long period of time. Therefore, the key to obtaining high-performance VSOD lies in how to balance the spatial and temporal saliency cues.

However, most of the current state-of-the-art (SOTA) deep models [22, 23, 24, 25] have followed the short-term methodology, where the spatiotemporal balance is simply determined by the ‘current’ multiple frames. Actually, this short-term manner has one critical limitation:
Due to the varying nature of object movements, it is very difficult for the existing SOTA models [26, 27] to achieve an optimal spatiotemporal balance if only consecutive short-term frames are considered. For example, from the short-term perspective, an object might be temporarily salient in both the spatial and temporal sources; however, from the long-term view, it might be a nonsalient object. In fact, the human visual system basically follows the long-term methodology because human beings tend to automatically remember the most representative appearances in the passing video content. Therefore, a deep model that follows the short-term methodology will easily produce failure cases when the current spatial and temporal saliency cues are less trustworthy.

Moreover, considering the VSOD task, we argue that the widely followed methodology - designing a very strong deep model to handle all cases - might be problematic and unnecessary. The main reason is that for video frames belonging to an identical video sequence, the spatial scene tends to be relatively stable; thus, even a very simple deep model can perform very well on these frames if it has been updated/fine-tuned on trustworthy saliency cues mined in an online manner.

Recently, several works [28, 29] have attempted to explore long-term information, yet our method differs from them in two key aspects. First, neither [28] nor [29] can make full use of long-term information, while our approach is global, where both the keyframe selection and the iteration processes follow a completely long-term scheme. Second, compared with [29] in method design, our approach is clearly more comprehensive, with more elegant submodule designs.

Considering all issues mentioned above, this paper performs the VSOD task in a fully long-term manner (see Fig. 1-B). Given a video sequence, we use the off-the-shelf object detector [30] to acquire all potential object proposals in advance. Thus, the conventional frame-by-frame VSOD task can be converted to a data mining problem: iteratively finding as many salient object proposals as possible in an easy-to-hard way. Since all object proposals are simultaneously available during the mining process, our approach is clearly a typical long-term approach; thus, it does not need to consider the complicated spatiotemporal balance mentioned above.

In our approach, clear salient object proposals - the easy ones - are localized in the early mining iterations; the consistency of these easy cases is then learned and later used as auxiliary knowledge to help find the harder cases, as fully detailed in Sec. III. Once all salient object proposals have been mined, we use an existing ISOD model, updated/fine-tuned online on the salient object proposals, to output framewise saliency maps. The technical details regarding the online fine-tuning are given in Sec. IV.

Figure 2: Method pipeline demonstration. SEOD: the off-the-shelf object detector [30]. The proposed part 1 iteratively mines salient object proposals, and part 2 converts these salient object proposals into frame-wise saliency maps.

The key contributions of this paper can be summarized into the following aspects:

  • This is one of the first attempts to convert the VSOD task, a conventionally sequential task, into an orderless process, so that long-term information can be well utilized to promote VSOD performance.

  • We have devised a novel object proposal based mining scheme, which iteratively localizes salient objects in an easy-to-hard manner from a completely global perspective, thus avoiding being trapped in the complicated spatiotemporal balance issue.

  • A series of online saliency refinement strategies have been proposed, all of which ensure that our saliency inference process can take full advantage of the long-term mining process, producing accurate detection results with crisp object boundaries.

  • Comprehensive quantitative evaluations and comparisons have been conducted to illustrate the effectiveness and superiority of the proposed approach.

  • Both the code and results are publicly available, which has great potential to benefit the research community in the future.

II Related Work

II-A Hand-crafted VSOD Approaches

Although the main scope of this paper is deep learning-based VSOD, we give a brief review of the most representative conventional approaches. Liu et al. [31] performed the VSOD task via superpixel-based graph smoothing. By using color and global motion histograms, both spatial and temporal saliency cues are obtained in advance and then diffused via temporal propagation. Xi et al. [32] devised a novel method for computing the background prior, which is obtained by averaging the spatial [33] and temporal [34] background priors. Li et al. [35] proposed a motion-based bilateral network to estimate the backgrounds in a framewise manner, which are later embedded into a graph for saliency propagation. Chen et al. [36] proposed a novel solution to distinguish moving salient objects from diverse changing background regions. Zhou et al. [37] considered both the temporal consistency and the correlation among adjacent frames to compute the temporal saliency map; the spatial and temporal saliency maps are then fused via the proposed spatiotemporal refinement. Guo et al. [38] designed a primitive approach to identify the salient object by ranking and selecting salient proposals. Chen et al. [7] adopted a low-rank strategy to alleviate the false-alarm error accumulation induced by the widely used saliency propagation. However, this approach basically follows a batchwise methodology, which fails to take full advantage of the beyond-scope saliency consistency, producing massive false alarms when the majority of the intrabatch saliency clues are incorrect. To further improve it, the same authors [39] devised bilevel metric learning to include more spatial-temporal saliency cues in the current saliency prediction. Guo et al. [40] proposed a fast VSOD method that uses the principal motion vectors to represent the corresponding motion patterns; these motion cues, coupled with color cues, are fed into a multiclue optimization framework to achieve spatiotemporal VSOD. Zhao et al. [41] proposed a weakly-supervised VSOD model based on eye-fixation annotation. Compared with the fully-supervised VSOD models, the proposed new annotation method dramatically reduces annotation time.

II-B Deep Learning-based VSOD Models

In view of the video instance segmentation task [42, 43], Liu et al. [44] presented a one-stage, end-to-end, and proposal-free deep model, whose key highlight is its capability of dividing instances into sub-regions dynamically and performing segmentation on each region for spatial granularity. Thus, it achieves a more appealing segmentation behavior, as it can enrich object details and produce masks with more accurate edges. In view of the video object detection (VOD) task [45], Cui et al. [46] proposed to depict temporal feature relations and blend valuable neighboring features, so that the spatiotemporal feature representation can be enhanced. Also targeting the VOD task, Liu et al. [47] proposed an approach that aggregates features calibrated at both the pixel and instance levels, thereby achieving better detection performance. More works on the VOD task can be found in the recent survey paper [48].

Let us now move to the most representative deep learning-based SOTA models, whose performances have significantly outperformed the conventional models. Wang et al[49] adopted an end-to-end network to compute spatial saliency, which was later coupled with two consecutive frames to serve as the input of another network to compute the spatiotemporal saliency. Li et al[23] adopted the optical flow results as the network input to compute motion saliency. Because motion saliency could only occasionally benefit the VSOD task, this work developed an attention mechanism, which is empowered by gate logic to filter the less trustworthy motion saliency cues when performing spatial and temporal saliency fusion. Tokmakov et al[50] fed the concatenated spatial and temporal deep features into the ConvLSTM for spatiotemporal saliency fusion.

Following the bistream structure, Le et al. [51] computed the motion saliency via 3D convolutions, and the VSOD results were derived simply by concatenating and convolving the spatial and temporal deep features. Chen et al. [52] proposed a new version of 3D convolution, which involves multiple 3D convolutions with a newly proposed temporal shuffle operation to enhance the network's ability to sense temporal information. This modification also enables the spatial branch to fully interact with the temporal branch in a multiscale manner. Song et al. [26] devised a bi-LSTM network to sense multiscale spatiotemporal information. This work also adopted pyramid dilated convolutions to extract multiscale spatial saliency features, which are fed into the abovementioned bi-LSTM network to achieve multiscale VSOD. Wang et al. [53] proposed coattention to enhance the consistency between different frames. Fan et al. [22] developed an attention-shift baseline and released a large-scale saliency-shift-aware dataset for the VSOD problem. Zhou et al. [54] presented a novel end-to-end zero-shot video segmentation network. The proposed network follows the traditional bi-stream structure, yet, different from previous works, it devises a novel module that lets the temporal branch interact with the spatial branch. Generally, this method follows a typical coarse-to-fine rationale. The temporal information is treated as a coarse indicator to locate the target object. Then, based on the coarsely localized regions, the proposed fusion module further learns the spatial appearance to refine the segmentation process, ensuring that the output has a crisp object boundary. Recently, Zhou et al. [55] devised a novel bottom-up instance discrimination network that takes advantage of the temporal context information in videos for more accurate segmentation. In view of methodology innovation, the authors converted the video segmentation task into a tracking paradigm, which can better capture the appearance information of the target objects, yielding better segmentation results.

Recently, Gu et al. [56] learned nonlocal motion dependencies across several frames and then followed a pyramid structure to capture spatiotemporal saliency clues at various scales. The major highlight of this paper is the proposed constrained self-attention operation, which can capture motion cues via the prior that objects always move in a continuous trajectory, achieving a very high FPS rate. Ren et al. [24] proposed a novel triple-stream network, which includes a spatiotemporal network, a spatial network, and a temporal network, where the saliency cues computed from the spatial and temporal networks serve as attention to enhance the main network. Jiao et al. [57] proposed a Guidance and Teaching Network (GTNet), which introduces a temporal modulator to bridge features from motion into appearance, and a motion-guided mask is used to propagate the explicit cues during feature aggregation. Bi et al. [58] devised STEG-Net, which uses the extracted edge cues to guide the extraction of spatial-temporal cues and combines deep texture cues with shallow edge cues. This strategy can simultaneously retain edge information and enhance the object's global cues, making the object localization more accurate.

II-C Early Explorations Regarding Long-term Information

Clearly, most of the current SOTA models mentioned above have one critical problem, i.e., their methodologies tend to be short-term in essence, whose limitations have been mentioned in the introduction. Several previous works have become aware of this problem, and multiple modifications have been attempted to alleviate it.

Lu et al. [59] presented a GNN-based zero-shot video object segmentation (ZVOS) network. By using the newly devised attentive graph neural network (AGNN), ZVOS can be treated as an end-to-end, message-passing-based graph information fusion procedure. The major highlight of this approach is its capability to capture high-order relations among frames. In the later journal version [60], the AGNN was extended to diverse segmentation tasks with additional in-depth discussions and explanations. To further improve the SOTA models that primarily focus on learning discriminative foreground representations in a short-term manner, Lu et al. [61] proposed to re-design ZVOS in a holistic fashion. The authors devised co-attention layers to learn global correlations and scene context, so that short-term spatiotemporal feature fragments can interact in a joint feature space, achieving significant performance improvement.

Chen et al. [62] used SIFT-Flow to introduce beyond-scope saliency cues into the current problem domain, making its spatiotemporal saliency computation relatively 'long-term'. Li et al. [63] utilized tracking consistency to mine keyframes, and then the high-quality spatiotemporal saliency cues could be diffused from each keyframe to the other normal frames. The keyframe selection scheme proposed in this work belongs to the long-term scope, but the saliency diffusion process is still a short-term method. Another representative work is [64], in which the keyframes were selected offline by measuring the consistency degree between spatial and temporal saliency cues, and then an ISOD model was fine-tuned on these keyframes to adapt to the given sequence. It is worth mentioning that this idea is inspired by [65], where a model fine-tuning scheme was also used, while the major difference lies in whether the annotation of the 1st frame is given. Wang et al. [66] proposed a novel attentive graph neural network to explore the relationship between different frames. Compared with the previous short-term methods, the saliency sensing scope of this work was expanded significantly. However, because the network training and testing processes still followed the framewise sequential order, this work still belongs to the short-term category. Following the online model matching and updating schemes that are widely used in object trackers, Wang et al. [1] proposed the episodic graph memory network, where multiple submodels are trained, stored, and updated to adapt the VSOD model to the current video sequence in a long-term manner. However, the VSOD task carried out by this work proceeds solely along the temporal direction; thus, not all spatiotemporal saliency cues embedded in the video are simultaneously available to the current frame. Wang et al. [67] converted the framewise VSOD task into a graph problem, which converts the video frames to supervoxels, and a graph convolutional network, taking the supervoxels as graph nodes, was used to explore long-term spatiotemporal VSOD. However, because this graph structure is still constrained by the temporally neighboring topology, the long-term information explored by this work might be rather weak in essence. Zhang et al. [68] proposed a dynamic context-sensitive filtering network (DCFNet). This network generates dynamic convolution kernels containing rich context information at multiple scales by estimating location correlation weights, improving the model's adaptability to dynamic video scenes. Chen et al. [69] proposed a new concept, named motion quality, to re-balance the complementary fusion between spatial information and temporal information.

III Salient Object Proposal Localization

III-A Method Overview

Given a video sequence, we use the off-the-shelf object detection tool SEOD [30] to acquire object proposals. For a single frame, we rank all its object proposals according to their objectness confidences, and a maximum of 10 object proposals are kept (we also tested keeping more than 10 object proposals, but no obvious performance improvement was observed). Thus, a video sequence with a total of $T$ frames can now be expressed as $N=T\times 10$ object proposals. We use $P_{i}$ to denote the $i$-th object proposal, and all proposals can be represented as $\mathcal{P}=\{P_{1},\ldots,P_{N}\}$.
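To make the decomposition concrete, the following minimal Python sketch (not the authors' released code) shows one way such a proposal pool could be assembled; the bounding-box tuple format, the input structure, and the function names are illustrative assumptions, while the top-10-per-frame ranking follows the paper.

```python
# A minimal sketch of assembling the proposal pool P: for every frame we keep
# at most the 10 highest-scoring detections returned by the off-the-shelf
# detector (SEOD in the paper).
from typing import List, Tuple

BBox = Tuple[int, int, int, int]               # (x1, y1, x2, y2), an assumed format
Proposal = Tuple[int, BBox]                     # (frame index, bounding box)

def build_proposal_pool(detections_per_frame: List[List[Tuple[BBox, float]]],
                        max_per_frame: int = 10) -> List[Proposal]:
    """detections_per_frame[t] holds (bbox, objectness) pairs of frame t."""
    pool: List[Proposal] = []
    for t, detections in enumerate(detections_per_frame):
        # rank this frame's detections by objectness confidence (descending)
        ranked = sorted(detections, key=lambda d: d[1], reverse=True)
        for bbox, _score in ranked[:max_per_frame]:
            pool.append((t, bbox))
    return pool                                 # |pool| <= T x 10 proposals
```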

The aim of this section is to find which object proposals are the salient ones (see 'Part 1' in Fig. 2). Our key idea is to iteratively mine all potential salient object proposals from $\mathcal{P}$ in an easy-to-hard order, where in each mining iteration only a small group of proposals in $\mathcal{P}$ is determined to be either salient or nonsalient. Initially, we place the accuracy rate as the highest priority, while the recall rate is gradually ensured by the subsequent iterative mining steps. The proposed iterative mining process is visualized in Fig. 1-B, where salient object proposals that are difficult to determine in the early iterations can be correctly mined later, because our mining approach can make full use of the knowledge derived from the previous iterations.

Once all salient object proposals have been determined, we utilize the online fine-tuning scheme to convert these salient object proposals to framewise saliency maps (see ‘Part 2’ in Fig. 2).

III-B Coarse-level Localization for Salient Object Proposals

For each object proposal in $\mathcal{P}$, we use a pretrained feature backbone (ResNet18) to extract the corresponding semantic deep feature ($f_{i}$), and the deep features of $\mathcal{P}$ can be represented as $\textbf{F}=\{f_{1},\ldots,f_{N}\}$.

Based on the L2 distance between each pair of features in $\textbf{F}$, all object proposals ($\mathcal{P}$) can be clustered into $K$ clusters. Considering the limitation of GPU memory, long video sequences should be divided into multiple short sequences in advance. It is also rare for typical short video sequences (less than 100 frames) to contain more than three salient objects. The clustering process can be represented as Eq. 1.

\mathcal{C}=\{C_{1},\ldots,C_{K}\}=Clustering(\textbf{F},K),  (1)

where we use $C_{i}$ to denote the $i$-th cluster, $i\in\{1,2,\ldots,K\}$. Any off-the-shelf clustering tool can be applied here; we simply choose K-means.
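As a concrete illustration, a minimal sketch of Eq. 1 could look as follows, assuming the proposal features are stacked into an (N, D) array; scikit-learn's KMeans is one possible off-the-shelf clustering tool, and the function name is ours.

```python
# A minimal sketch of Eq. 1: cluster the N proposal features (extracted by a
# pretrained ResNet18) into K groups with K-means.
import numpy as np
from sklearn.cluster import KMeans

def cluster_proposals(F: np.ndarray, K: int = 3, seed: int = 0) -> np.ndarray:
    """F: (N, D) feature matrix. Returns a length-N vector of cluster indices."""
    kmeans = KMeans(n_clusters=K, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(F)     # cluster assignment per proposal
    return labels                      # centroids are in kmeans.cluster_centers_
```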

To coarsely locate salient object proposals, the most intuitive method might be computing the 'cluster-level saliency degree' first and then selecting the top-ranked clusters as the salient clusters. Our rationale is that a salient cluster's object proposals tend to exhibit very strong saliency cues. Following this rationale, for each cluster, we use the averaged 'motion saliency' to represent the cluster-level saliency degree. Generally, compared with spatial saliency, motion saliency has one clear advantage: fewer false alarms; for example, at worst, motion saliency becomes absent, while spatial saliency might be totally incorrect for unpredictable reasons. To continue introducing the technical details of the clustering-based coarse localization of salient object proposals, we leave the motion saliency computation to the next subsection.

We select $\nu$ clusters from $\mathcal{C}$ as the salient clusters, where we empirically constrain $\nu<\lfloor 0.5\times K\rfloor$. The exact value of $\nu$ can be automatically determined via Eq. 2.

\nu=\arg\min_{i}\Big\{\breve{ams}(i)\Big\},  (2)
\breve{ams}\leftarrow\nabla\big(\mathcal{DES}(ams)\big),  (3)
ams(i)=avg(PMS_{i}),  (4)

where $PMS_{i}=\{PMS_{i}(1),\ldots,PMS_{i}(g)\}$ is a set in which each element represents a motion saliency value, e.g., $PMS_{i}(1)$ denotes the (1-dimensional) motion saliency value of the 1st object proposal in the $i$-th cluster, and $g$ is the number of object proposals in this cluster; the function $avg(\cdot)$ returns the averaged motion saliency value of all object proposals in the $i$-th cluster, thus $ams$ is a $K$-dimensional vector; the function $\mathcal{DES}(\cdot)$ sorts its input elements in descending order; $\nabla$ is the first-order derivative operation.

Clearly, the rationale of the dynamic $\nu$ is to provide an initial 'localization' regarding which clusters, and hence which object proposals within each cluster $C_{i}$, are very likely to be salient.
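The following minimal sketch illustrates one way Eqs. 2-4 could be implemented; the exact indexing of the derivative and the clipping to roughly half of the clusters are our assumptions rather than the authors' exact implementation.

```python
# A minimal sketch of Eqs. 2-4: the number of salient clusters (nu) is placed at
# the sharpest drop of the descending-sorted average motion saliency.
import numpy as np

def select_salient_clusters(pms_per_cluster):
    """pms_per_cluster: list of K arrays, each holding the proposal-level motion
    saliency values (PMS) of one cluster. Returns the ids of salient clusters."""
    ams = np.array([np.mean(p) for p in pms_per_cluster])   # Eq. 4
    order = np.argsort(-ams)                                 # DES: descending sort
    if len(ams) < 2:
        return order                                         # degenerate case
    drops = np.diff(ams[order])                              # Eq. 3: first-order derivative
    nu = int(np.argmin(drops)) + 1                           # Eq. 2: sharpest drop
    nu = max(1, min(nu, int(0.5 * len(ams))))                # empirical bound (~0.5 * K)
    return order[:nu]                                        # ids of salient clusters
```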

III-C Motion Saliency Cues

Intuitively, one can obtain motion saliency cues directly via an off-the-shelf image salient object detection (ISOD) model if it has been fine-tuned on the optical flow data, i.e., $\{OF_{j},GT_{j}\}$ (see below).

Assume two consecutive video frames are represented as $\textbf{I}_{j}$ and $\textbf{I}_{j+1}$; the corresponding optical flow $OF_{j}$, which will be used for ISOD model fine-tuning, can be computed by Eq. 5.

OF_{j}=ce\Big\{FlowNet(\textbf{I}_{j},\textbf{I}_{j+1})\Big\},  (5)

where ‘FlowNet’ represents the off-the-shelf optical flow tool [70], whose input is two consecutive video frames and whose outputs are two gradient matrices representing the spatial displacement along the vertical and horizontal directions; $ce\{\cdot\}$ denotes the widely used color encoding tool, which converts the abovementioned two optical flow matrices to a three-dimensional matrix ($\mathbb{R}^{W\times H\times 3}$), in which all gradient directions and magnitudes are uniformly encoded by different colors (see the third column of Fig. 3).

Thus, $OF_{j}$ and the corresponding raw ISOD ground truth $GT_{j}$ can be used to fine-tune the off-the-shelf ISOD model. The loss function of the ISOD deep model fine-tuning can be represented as Eq. 6, for which we use a small learning rate ($10^{-7}$).

L_{\rm ISOD}=-\sum_{j}\Big[GT_{j}\times\log OF_{j}+(1-GT_{j})\times\log(1-OF_{j})\Big].  (6)

After being fine-tuned, the adopted ISOD model ($\mathcal{M}$) takes $OF$ as input and outputs a motion saliency map (see the last column in Fig. 3). This process can be represented by the following equation:

MS_{j}=\mathcal{M}(\Theta,OF_{j}),  (7)

where $\Theta$ denotes the learnable hidden parameters, $\mathcal{M}$ can be any off-the-shelf end-to-end ISOD model following the typical encoder-decoder structure (we choose CPD19 [71]), and $MS_{j}$ is the motion saliency map (see the last column of Fig. 3).
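A minimal PyTorch sketch of this fine-tuning and inference step (Eqs. 6-7) is given below; `model` stands in for any encoder-decoder ISOD network such as CPD, `flow_loader` is a hypothetical loader yielding (OF, GT) pairs, and only the small learning rate (1e-7) is taken from the paper.

```python
# A minimal sketch (not the authors' code): fine-tune an ISOD model on
# color-encoded optical flow images with a binary cross-entropy loss, then use
# it to predict motion saliency maps.
import torch
import torch.nn as nn

def finetune_motion_saliency(model: nn.Module, flow_loader, epochs: int = 1):
    """flow_loader yields (OF, GT): 3xHxW flow images and 1xHxW float masks."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-7)   # small lr from the paper
    bce = nn.BCEWithLogitsLoss()                          # cross-entropy of Eq. 6
    model.train()
    for _ in range(epochs):
        for of, gt in flow_loader:
            opt.zero_grad()
            loss = bce(model(of), gt)                     # L_ISOD
            loss.backward()
            opt.step()
    return model

@torch.no_grad()
def motion_saliency(model: nn.Module, of: torch.Tensor) -> torch.Tensor:
    """Eq. 7: MS_j = M(Theta, OF_j), squashed to [0, 1]."""
    model.eval()
    return torch.sigmoid(model(of))
```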

Figure 3: Pictorial demonstrations of the color-encoded optical flow results (OF) and the motion saliency maps (MS).

The motion saliency maps are usually blurry at object boundaries, which is mainly determined by the optical flow quality. To the best of our knowledge, there currently exists no better method for alleviating this problem. Thus, the motion saliency maps can only be used for the coarse localization of salient object proposals.

In addition, motion saliency, a typically unstable saliency cue, tends to vary from frame to frame; for example, it may become helpless if the salient object is temporarily static, as demonstrated by frame #54 in Fig. 3. However, the motion saliency cue might be very helpful in other scenarios with clear object movements, e.g., frame #57. Therefore, based on $MS$ (Eq. 7) and $\mathcal{P}$ (Sec. III-A), we detail the computation of $PMS$, which is mentioned in Eq. 4, as below.

Taking $PMS_{i}(u)$ as an example, suppose this object proposal is the $h$-th object proposal in $\mathcal{P}$ and it belongs to the $j$-th video frame; thus, the exact value of $PMS_{i}(u)$ can be obtained by:

PMS_{i}(u)=avg\big(crop(MS_{j},P_{h})\big),  (8)

where the $crop$ operation crops a patch from $MS_{j}$ according to the coordinates provided by the object proposal $P_{h}$.
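In code, Eq. 8 reduces to a crop-and-average over the motion saliency map, as in the following minimal sketch (variable names are ours):

```python
# A minimal sketch of Eq. 8: the proposal-level motion saliency PMS is the mean
# of the motion saliency map inside the proposal's bounding box.
import numpy as np

def proposal_motion_saliency(ms: np.ndarray, bbox) -> float:
    """ms: (H, W) motion saliency map in [0, 1]; bbox: (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = bbox
    patch = ms[y1:y2, x1:x2]                               # crop(MS_j, P_h)
    return float(patch.mean()) if patch.size else 0.0      # avg(.)
```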

III-D Fine-level Localization for Salient Object Proposals

By performing the previous steps (e.g., Eq. 2), $\nu$ salient clusters have been selected. The remaining problem includes two aspects: 1) these $\nu$ salient clusters might contain some nonsalient object proposals, and 2) some salient object proposals might belong to the $K-\nu$ nonsalient clusters. To precisely separate these two types of object proposals (we call them the 'undetermined' proposals), we resort to an auxiliary binary classifier.

Our idea is that, for the 'undetermined' object proposals, the binary classifier serves as an indicator to tell which object proposals are more likely to be salient. In practice, this binary classifier can be weakly trained on instances belonging to the salient or nonsalient clusters. The main reason that we call this training process 'weakly supervised' is that, for each video sequence, both the positive and negative instances for the binary classifier training are determined online. Without knowing the ground truths, some of the most trustworthy object proposals in the salient clusters are selected as the positive instances ($PseudoGT=1$), and the most trustworthy negative instances ($PseudoGT=0$) are selected from the nonsalient clusters. To focus on the main topic of this subsection, we leave the technical details regarding instance selection to the next subsection.

Based on the positive and negative instances mentioned above ($Q$ instances in total, $Q\leq N$), the loss function of the binary classifier, a typical cross-entropy loss, can be represented as Eq. 9.

\mathcal{L}_{BC}=-\frac{1}{Q}\times\sum_{q}\Big[PseudoGT_{q}\times\log\big(FC(f_{q})\big)+(1-PseudoGT_{q})\times\log\big(1-FC(f_{q})\big)\Big],  (9)

where $FC(\cdot)$ denotes the widely used multilayer fully connected layers, $f_{q}$ represents the high-dimensional deep feature of the $q$-th object proposal selected under the current salient and nonsalient cluster partition, and $PseudoGT_{q}$ is the corresponding binary label of $f_{q}$. Note that the training process of this binary classifier can be performed very quickly; thus, it can be applied to the VSOD task. In the next subsection, we detail the computation of $PseudoGT$.
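A minimal PyTorch sketch of this weakly supervised classifier is given below; the hidden layer width and the learning rate are illustrative assumptions, while the 20-epoch updating budget follows the implementation details in Sec. V-B.

```python
# A minimal sketch of Eq. 9: a small stack of fully connected layers over the
# 512-d ResNet18 proposal features, trained with cross-entropy on the online
# pseudo labels.
import torch
import torch.nn as nn

class ProposalClassifier(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(inplace=True),
            nn.Linear(128, 1))                     # FC(.) of Eq. 9 (salient logit)

    def forward(self, f):
        return self.fc(f).squeeze(-1)

def train_classifier(clf, feats, pseudo_gt, epochs: int = 20, lr: float = 1e-3):
    """feats: (Q, 512) tensor; pseudo_gt: (Q,) tensor of 0/1 pseudo labels."""
    opt = torch.optim.Adam(clf.parameters(), lr=lr)
    bce = nn.BCEWithLogitsLoss()                   # L_BC of Eq. 9
    for _ in range(epochs):
        opt.zero_grad()
        bce(clf(feats), pseudo_gt.float()).backward()
        opt.step()
    return clf
```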

III-E Training Instances for the Binary Classifier

Considering that the performance of the binary classifier is positively related to the quality of the pseudolabels (i.e., $PseudoGT$), we should ensure that less trustworthy object proposals are excluded from the pseudotraining set. To achieve this, we utilize two measurements: 1) the distance to the cluster's centroid and 2) the motion saliency degree of each object proposal.

The rationale of measurement 1) is that the degree to which an object proposal can be trusted as a real salient proposal should be positively related to its confidence of belonging to its cluster. This confidence can be measured by the feature similarity ($sim$) between the cluster's average profile ($AP_{i}$) and the object proposal (see Eq. 10), where $AP_{i}$ can be automatically obtained during the K-means clustering process.

sim_{i}(j)=\big|\big|AP_{i}-f_{j}\big|\big|_{2},\ \ AP_{i}=\frac{1}{g}\sum_{l=1}^{g}f_{l},  (10)

where $f_{j}$ represents the deep feature of the $j$-th object proposal in its cluster $C_{i}$, $||\cdot||_{2}$ is the $L_{2}$-norm, and $AP_{i}$ represents the clustering profile of $C_{i}$, which shares an identical size with $f_{j}$; $sim_{i}(j)$ measures the feature distance of the $j$-th object proposal to the cluster centroid. The primary objective of Eq. 10 is to serve the proposed mining process in filtering out less trustworthy object proposals in a given cluster. Our motivation is clear and intuitive, i.e., given a salient cluster, an object proposal that belongs to it yet has a large feature distance to the cluster's average profile tends to be nonsalient. Thus, this object proposal should have a large chance of being filtered out during the subsequent mining process. Once $sim$ (Eq. 10) is obtained, we use it to rank all object proposals in ascending order.
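In code, Eq. 10 is simply the L2 distance of each proposal feature to its cluster's mean profile, as in the following minimal sketch:

```python
# A minimal sketch of Eq. 10: each proposal's distance to its cluster's average
# profile AP_i; smaller distances indicate more trustworthy cluster membership.
import numpy as np

def centroid_distances(cluster_feats: np.ndarray) -> np.ndarray:
    """cluster_feats: (g, D) features of one cluster. Returns sim_i(j) for all j."""
    ap = cluster_feats.mean(axis=0)                      # AP_i
    return np.linalg.norm(cluster_feats - ap, axis=1)    # L2 distance per proposal
```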

The rationale for measurement 2) relies on the fact that object proposals with clear movements are more likely to be salient proposals. Thus, all object proposals in each cluster can also be ranked in descending order according to their motion saliency degrees, i.e., $PMS$ (Eq. 8).

Based on the abovementioned factors, the less trustworthy object proposals in the salient clusters ($\mathcal{C}^{+}$) or the nonsalient clusters ($\mathcal{C}^{-}$) can be filtered by using the equations documented below; $\mathcal{C}=\mathcal{C}^{+}\cup\mathcal{C}^{-}$, where $\cup$ is the union operation.

SI_{i}^{+}=\mathcal{ASD}\big(C_{i}^{+},SIM_{i}\big),\ \ MI_{i}^{+}=\mathcal{DES}\big(C_{i}^{+},PMS_{i}\big),  (11)
SI_{i}^{-}=\mathcal{ASD}\big(C_{i}^{-},SIM_{i}\big),\ \ MI_{i}^{-}=\mathcal{ASD}\big(C_{i}^{-},PMS_{i}\big),  (12)

where we use the superscripts $+$ and $-$ to distinguish the salient and nonsalient clusters determined by the previous steps; $SIM_{i}=[sim_{i}(1),\ldots,sim_{i}(g)]\in\mathbb{R}^{1\times g}$, and $sim_{i}(1)$ denotes the similarity degree of the 1st object proposal in the $i$-th cluster computed via Eq. 10; $PMS_{i}$ can be found in Eq. 8; $SI_{i},MI_{i}\in\mathbb{R}^{1\times g}$ are two subscript vectors of object proposals in the $i$-th salient cluster; $\mathcal{ASD}(C_{i},SIM_{i})$ returns the object proposals' subscripts reranked via $sim_{i}$ in ascending order, and $\mathcal{DES}(C_{i},PMS_{i})$ returns the object proposals' subscripts reranked via $PMS_{i}$ in descending order.

Next, the intersection between the top-$\alpha$ elements in $SI_{i}$ and the top-$\beta$ elements in $MI_{i}$ gives the most trustworthy object proposals in the current cluster, where $\alpha$ and $\beta$ can be obtained via Eqs. 13-16.

\alpha^{+}_{i}=\arg\max_{\xi\leq j<g}\Big\{\nabla\big(Re(SIM_{i},\ SI_{i}^{+})\big)\Big\}(j),  (13)
\alpha^{-}_{i}=\arg\max_{\xi\leq j<g}\Big\{\nabla\big(Re(SIM_{i},\ SI_{i}^{-})\big)\Big\}(j),  (14)
\beta^{+}_{i}=\arg\max_{\xi\leq j<g}\Big\{\nabla\big(Re(PMS_{i},\ MI_{i}^{+})\big)\Big\}(j),  (15)
\beta^{-}_{i}=\arg\max_{\xi\leq j<g}\Big\{\nabla\big(Re(PMS_{i},\ MI_{i}^{-})\big)\Big\}(j),  (16)

where the definitions of $SI$ and $MI$ can be found in Eq. 11 and Eq. 12; the operation $Re(SIM_{i},\ SI_{i}^{+})$ reorders the elements in $SIM_{i}$ according to $SI_{i}^{+}$; similar to Eq. 3, $\nabla(\cdot)$ computes the gradient of its input vector, i.e., the output of $Re(\cdot,\cdot)$; $\xi$ is the lower bound, which we empirically set to $0.6\times g$; and $g$ is the total proposal number in the $i$-th cluster.

The filtering process, which outputs trustworthy positive and negative instances, can be detailed as Eq. 17 and Eq. 18.

Pos\leftarrow\Big\{TOP\big(\mathcal{ASD}(C_{i}^{+},SIM_{i}),\alpha^{+}_{i}\big)\Big\}\cap\Big\{TOP\big(\mathcal{DES}(PMS_{i}),\beta_{i}^{+}\big)\Big\},  (17)
Neg\leftarrow\Big\{TOP\big(\mathcal{ASD}(C_{i}^{-},SIM_{i}),\alpha^{-}_{i}\big)\Big\}\cap\Big\{TOP\big(\mathcal{ASD}(PMS_{i}),\beta_{i}^{-}\big)\Big\},  (18)

where $\mathcal{ASD}/\mathcal{DES}$ are identical to those in the previous equations; $\alpha/\beta$ can be found in Eqs. 13-16; $TOP(SIM_{i},\alpha)$ returns the top-$\alpha$ elements of $SIM_{i}$; $SIM/PMS$ can be referred from Eq. 11; and $\cap$ is the operation that returns the intersection of two sets. Compared with the original object proposals in $\mathcal{C}^{+}$ and $\mathcal{C}^{-}$, the samples in $Pos$ or $Neg$ are clearly more trustworthy in general, and the total sample numbers of $Pos$ and $Neg$ are less than the total object proposal number ($N$) of $\mathcal{P}$.
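The following minimal sketch illustrates, for one salient cluster, how the two rankings, the dynamic cut-offs, and the intersection of Eqs. 11-18 could be realized; the gradient and indexing conventions are our assumptions.

```python
# A minimal sketch of Eqs. 11-18 for one salient cluster: rank by centroid
# distance (ascending) and by motion saliency (descending), place the dynamic
# cut-offs at the largest gradient above the lower bound xi = 0.6 * g, and take
# the intersection of the two top lists as the trustworthy positives.
import numpy as np

def dynamic_cutoff(scores_reordered: np.ndarray, xi_ratio: float = 0.6) -> int:
    g = len(scores_reordered)
    lo = int(xi_ratio * g)
    grad = np.abs(np.diff(scores_reordered))       # |gradient| of the score curve
    if lo >= len(grad):
        return g                                   # degenerate small cluster
    return lo + int(np.argmax(grad[lo:])) + 1      # Eq. 13/15 style cut-off

def trustworthy_positives(sim: np.ndarray, pms: np.ndarray) -> np.ndarray:
    """sim, pms: per-proposal scores of one salient cluster (length g)."""
    si = np.argsort(sim)                           # ASD: ascending centroid distance
    mi = np.argsort(-pms)                          # DES: descending motion saliency
    alpha = dynamic_cutoff(sim[si])                # Eq. 13
    beta = dynamic_cutoff(pms[mi])                 # Eq. 15
    return np.intersect1d(si[:alpha], mi[:beta])   # Eq. 17 intersection
```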

Figure 4: The detailed pipeline of the framewise refinement (Sec. IV).

III-F Iteratively Mining More Trustworthy Object Proposals

Based on both $Pos$ and $Neg$, the binary classifier can be well trained. By using this classifier as an indicator, we can determine the saliency labels (salient or nonsalient) for more object proposals that are relatively less trustworthy than those belonging to either $Pos$ or $Neg$; we call them the 'uncertain' object proposals.

For each uncertain object proposal in the salient clusters ($\mathcal{C}^{+}$), we add it to the $Pos$ set if the binary classifier predicts it as 'salient'. Similarly, for each uncertain object proposal in the nonsalient clusters ($\mathcal{C}^{-}$), we add it to the $Neg$ set if the binary classifier predicts it as 'nonsalient'. By using the mining step mentioned above, both sets ($Pos$ and $Neg$) can be gradually expanded to comprise more feature patterns. However, the binary classifier is not always trustworthy, and its predictions might occasionally be incorrect, making both the $Pos$ and $Neg$ sets noisy.

To alleviate this problem, we propose two additional constraints to ensure that the uncertain object proposals added to the $Pos$ and $Neg$ sets are the most trustworthy ones. We use $Pos+$ and $Neg+$ to denote these two types of uncertain proposals. Thus, only those object proposals that belong to the $Pos+$ or $Neg+$ set and meet the constraint addressed below are eventually added into the $Pos$ or $Neg$ set. Taking $Pos+$ as an example, all its proposals are reordered via their similarity degrees to the mean profile of their cluster (Eq. 10); we denote the reordered $Pos+$ as $Pos+^{\prime}$. We constrain only the top-$\gamma\%$ proposals in $Pos+^{\prime}$ to be added to the $Pos$ set. We empirically choose a relatively slack value (60%) for $\gamma$ to balance diversity and trustworthiness.

Figure 5: Visualization of the iterative data mining process. For each video sequence in the Davis dataset, all its object proposals are ordered according to their 'trustworthiness degrees'. We measure the trustworthiness degree via the average number of nonzero ground truth pixels inside the object proposal. Thus, all object proposals in a video sequence can be ordered in descending order according to the trustworthiness degree, which can be represented in the form of a 1-dimensional vector. Since the object proposal numbers tend to vary with different video sequences, we resize this 1-dimensional vector to a fixed size (400). Then, each 1-dimensional vector is reshaped to a 2-dimensional matrix. In each iteration, each object proposal can take one of two statuses: selected (1) or not (0). Thus, the mining status of each video sequence can be visualized as a binary matrix, where all elements are 1 or 0. We average all these binary matrices of the Davis dataset into one probability matrix to show the probability of object proposals with different trustworthiness degrees being selected in each mining iteration.

In our implementation, we repeat the above mining process multiple times, where both $Pos$ and $Neg$ can be expanded each time. Specifically, to acquire more knowledge, the binary classifier is updated/fine-tuned after both $Pos$ and $Neg$ are expanded. Clearly, the binary classifier becomes more powerful as the above mining process continues.
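A minimal sketch of this iterative expansion is given below; it reuses the hypothetical `train_classifier` helper from the earlier classifier sketch, applies the top-γ% constraint to the positive side only (as described above), and treats the index sets as plain Python sets.

```python
# A minimal sketch of the iterative mining loop (Sec. III-F): re-fit the
# classifier on the current Pos/Neg sets, accept uncertain proposals the
# classifier agrees with, and additionally keep only the top gamma% of the new
# positives ranked by their distance to the cluster profile.
import torch

def mine_iteratively(clf, feats, pos, neg, uncertain_pos, uncertain_neg,
                     sim_to_centroid, iterations: int = 3, gamma: float = 0.6):
    """pos/neg/uncertain_*: sets of proposal indices; feats: (N, D) tensor;
    sim_to_centroid: length-N array of distances to each proposal's cluster profile."""
    for _ in range(iterations):
        idx = sorted(pos) + sorted(neg)
        labels = torch.tensor([1.0] * len(pos) + [0.0] * len(neg))
        clf = train_classifier(clf, feats[idx], labels)        # update the classifier
        with torch.no_grad():
            pred = torch.sigmoid(clf(feats)) > 0.5             # salient / nonsalient
        cand_pos = [i for i in uncertain_pos if pred[i]]       # classifier agrees
        cand_neg = [i for i in uncertain_neg if not pred[i]]
        cand_pos.sort(key=lambda i: sim_to_centroid[i])        # most trustworthy first
        keep = cand_pos[: int(gamma * len(cand_pos))]          # top gamma% constraint
        pos |= set(keep); uncertain_pos -= set(keep)
        neg |= set(cand_neg); uncertain_neg -= set(cand_neg)
    return pos, neg
```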

We visualize the iterative mining process regarding the salient object proposals in Fig. 5. As seen in 'iteration 1', the mining process tends to simply select the most trustworthy object proposals (the top-left cell is the most trustworthy object proposal, and the bottom-right cell is the one with the smallest trustworthiness degree). Then, in 'iteration 2', some of the less trustworthy object proposals are selected to enhance the diversity of the $Pos$ set. Finally, in 'iteration 3', some of the most trustworthy object proposals that were missed by the previous iterations can again be selected. Compared with the previous two iterations, only a very small group of object proposals is selected in 'iteration 3'; thus, we omit 'iteration 4' to ensure efficiency.

IV Framewise Refinement via Online Fine-tuning

Considering that the object proposals belonging to the $Pos$ set (Sec. III-F) are very likely to be real salient proposals, we propose fine-tuning an existing ISOD model online on the knowledge embedded in the $Pos$ set. Thus, this model, which has been shown some clips of the salient objects, can segment the salient objects well in the current video sequence. The proposed online model fine-tuning is demonstrated in Fig. 4, which includes three major components: 1) the patchwise refiner, 2) the fast keyframe selection, and 3) the framewise refiner.

IV-A Patchwise Refiner

Previous steps (Sec. III) can only tell us which object proposals are salient. To obtain a framewise saliency map, we simply resort to the existing image salient object detection (ISOD) model, i.e., each object proposal is fed into the ISOD model to obtain a patchwise saliency map.

TABLE I: Quantitative comparisons with current SOTA methods. The top three results are marked by red, green and blue, respectively.
Dataset Metrics Ours DCFNet[68] MQP[69] TENet[24] U2Net[72] PCSA[56] LSTI[62] SSAV[22] MGA[23] COS[53] CPD[71] PDB[26] MBN[35] SCO[36] SFLR[7]
(Publication years: DCFNet, MQP: 2021; TENet, U2Net, PCSA, LSTI: 2020; SSAV, MGA, COS, CPD: 2019; PDB, MBN, SCO: 2018; SFLR: 2017)
maxF 0.911 0.900 0.904 0.881 0.839 0.880 0.850 0.861 0.892 0.875 0.778 0.855 0.861 0.783 0.727
Davis[73] S-M 0.922 0.914 0.916 0.905 0.876 0.902 0.876 0.893 0.910 0.902 0.859 0.882 0.887 0.832 0.790
MAE 0.016 0.016 0.018 0.017 0.027 0.022 0.034 0.023 0.023 0.020 0.032 0.028 0.031 0.064 0.056
maxF 0.899 0.839 0.841 0.810 0.775 0.810 0.858 0.801 0.821 0.801 0.778 0.800 0.716 0.764 0.745
Segv2[74] S-M 0.921 0.883 0.882 0.868 0.843 0.865 0.870 0.851 0.865 0.850 0.841 0.864 0.809 0.815 0.804
MAE 0.013 0.015 0.018 0.025 0.042 0.025 0.025 0.023 0.030 0.020 0.023 0.024 0.026 0.030 0.037
maxF 0.953 0.953 0.939 0.949 0.958 0.940 0.905 0.939 0.933 0.966 0.941 0.888 0.883 0.831 0.779
Visal[2] S-M 0.947 0.952 0.942 0.949 0.952 0.946 0.916 0.943 0.936 0.965 0.942 0.907 0.898 0.762 0.814
MAE 0.011 0.010 0.016 0.012 0.011 0.017 0.033 0.020 0.017 0.011 0.016 0.032 0.020 0.122 0.062
maxF 0.725 0.791 0.703 0.697 0.620 0.655 0.585 0.603 0.640 0.614 0.608 0.572 0.520 0.464 0.478
DAVSOD[22] S-M 0.792 0.846 0.770 0.779 0.728 0.741 0.695 0.724 0.738 0.725 0.724 0.698 0.637 0.599 0.624
MAE 0.064 0.060 0.075 0.070 0.103 0.086 0.106 0.092 0.084 0.096 0.092 0.116 0.159 0.220 0.132
maxF 0.822 0.660 0.768 0.781 0.748 0.747 0.649 0.742 0.735 0.724 0.735 0.742 0.670 0.690 0.546
VOS[75] S-M 0.844 0.741 0.828 0.845 0.815 0.827 0.695 0.819 0.792 0.798 0.818 0.818 0.742 0.712 0.624
MAE 0.060 0.074 0.069 0.052 0.076 0.065 0.115 0.073 0.075 0.065 0.068 0.078 0.099 0.162 0.145

Compared with the conventional ISOD, the patchwise SOD task here is much simpler because it has a relatively smaller problem domain. Thus, we prefer off-the-shelf ISOD models with lightweight designs. In our implementation, we choose the CPD [71], and we fine-tune it on the MSRA10K set, where we converted the original image-level training instances to patchwise instances. The well-segmented patchwise saliency maps can be visualized in the second row of Fig. 4. Almost all salient objects or tiny parts have been well segmented by the patchwise refiner. However, failure cases still exist, such as the middle case in the far left column, where only the dog’s leg was detected, while the main body of the dog was missed.

Figure 6: Qualitative comparison between the FS (Eq. 19) and the final results (‘Final’) obtained by the framewise refiner.

Intuitively, the frame-level saliency map can be easily obtained by 'pasting' all its patchwise saliency maps. However, due to the abovementioned failure cases, most of which are missed detections (false-alarm cases are very rare), the 'pasting scheme' based on the widely used averaging operation might not be suitable here because it can lead to frame-level saliency maps suffering from missed detections. Therefore, we formulate the 'pasting process' as follows:

FS_{i}=\max_{j}\Big\{proj\Big(Z\big(\mathcal{PR}(\Theta,crop(\textbf{I}_{i},P_{j}))\big)\Big)\Big\},  (19)

where $FS_{i}$ represents the frame-level saliency map of the $i$-th frame $\textbf{I}_{i}$; assuming $\textbf{I}_{i}$ contains $b$ salient object proposals, the function $crop$ crops a patch from $\textbf{I}_{i}$ according to the coordinates provided by the object proposal $P_{j}$, $1\leq j\leq b$; $\mathcal{PR}$ denotes the proposed patchwise refiner, $\Theta$ is its learnable hidden parameters, and $\mathcal{PR}$ outputs the patchwise saliency map; $Z(\cdot)$ is a typical min-max normalization function; and the function $proj$ pastes its input back onto the current video frame, so the output of this function is a matrix with the same height and width as the current frame.
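A minimal sketch of Eq. 19 is given below; it assumes each patchwise saliency map already matches the size of its bounding box and uses the pixel-wise maximum, rather than averaging, for fusion.

```python
# A minimal sketch of Eq. 19: every patchwise saliency map is min-max
# normalized, projected back to frame coordinates, and the frame-level map FS_i
# is the pixel-wise maximum over all proposals of that frame.
import numpy as np

def paste_max(frame_hw, patch_maps, bboxes):
    """patch_maps[k] has shape (y2-y1, x2-x1) for bboxes[k] = (x1, y1, x2, y2)."""
    H, W = frame_hw
    fs = np.zeros((H, W), dtype=np.float32)
    for pm, (x1, y1, x2, y2) in zip(patch_maps, bboxes):
        z = (pm - pm.min()) / (pm.max() - pm.min() + 1e-8)   # Z(.): min-max norm
        canvas = np.zeros((H, W), dtype=np.float32)
        canvas[y1:y2, x1:x2] = z                              # proj(.): paste back
        fs = np.maximum(fs, canvas)                           # max_j fusion
    return fs
```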

The qualitative demonstration of $FS_{i}$ can be seen in the middle part of Fig. 4. Compared with the existing ISOD models [76, 77, 78, 79, 80, 81, 82, 83], which consider only short-term information, our framewise saliency map is clearly built from the long-term perspective, which has a clear advantage in terms of robustness.

IV-B Framewise Refinement via Online Model Fine-tuning

Despite the merits mentioned before, the framewise saliency maps ($FS$) still have two problems. First, the quality of $FS$ is heavily dependent on the previous salient object proposal localization steps. Second, this quality is also influenced by the 'patchwise refiner'. As a result, imperfect $FS$ can be observed frequently, such as in the far-right column of Fig. 4. Thus, we propose the 'framewise refiner', where only the most trustworthy $FS$ serve as the teacher to guide the online learning process.

TABLE II: Component evaluation results. 'SOPM': salient object proposal mining scheme (Sec. III); 'BC': binary classifier (Sec. III-D); 'OLF': online fine-tuning scheme (Sec. IV); 'KFS': the fast keyframe selection (Sec. IV-B); 'MAX' and 'AVE' respectively denote the maximization-based and averaging-based pasting processes (Eq. 19). The corresponding qualitative demonstrations can be found in Fig. 8.
Dataset Davis [73] Segv2 [74] Visal [2] DAVSOD [22] VOS [75]
Metrics maxF S-M MAE maxF S-M MAE maxF S-M MAE maxF S-M MAE maxF S-M MAE
Motion Saliency (Eq. 7) 0.798 0.854 0.044 0.648 0.760 0.054 0.627 0.738 0.079 0.450 0.613 0.148 0.405 0.566 0.167
CPD Baseline [71] 0.778 0.859 0.032 0.778 0.841 0.023 0.941 0.942 0.016 0.608 0.724 0.092 0.735 0.818 0.068
+SOPM(MAX) 0.878 0.891 0.025 0.852 0.882 0.021 0.930 0.924 0.019 0.684 0.757 0.076 0.789 0.827 0.063
+SOPM(MAX)+BC 0.888 0.903 0.022 0.864 0.899 0.018 0.943 0.940 0.017 0.696 0.772 0.071 0.809 0.842 0.062
+SOPM(AVE) +BC 0.883 0.899 0.023 0.862 0.895 0.019 0.943 0.939 0.017 0.685 0.764 0.073 0.808 0.841 0.062
+SOPM(MAX)+BC+OLF 0.903 0.915 0.019 0.865 0.901 0.017 0.946 0.947 0.016 0.697 0.775 0.070 0.810 0.842 0.060
+SOPM(MAX)+BC+OLF(KFS) 0.911 0.922 0.016 0.899 0.921 0.013 0.953 0.947 0.011 0.725 0.792 0.064 0.822 0.844 0.060

In practice, the framewise refiner can be any ISOD model using only spatial information; we simply continue using the CPD as the framewise refiner due to its lightweight design. Note that if the framewise refiner is taught by some of the high-quality $FS$, it can perform very well on the remaining unseen frames, producing very high-quality saliency maps. We show the difference between $FS$ and the final saliency map in Fig. 6.

Outwardly, using only the high-quality $FS$ to fine-tune the framewise refiner omits the temporal information completely. Nevertheless, it should be noted that the temporal information is implicitly embedded in $FS$ because the previous salient object proposal localization steps fully consider the motion saliency cues (i.e., Eq. 8).

Let us now move to the details regarding how to select the high-quality $FS$, i.e., the proposed fast keyframe selection scheme. Compared with spatial cues, motion cues are more likely to attract our visual attention. Thus, in some cases, motion saliency cues can be a very effective indicator for locating real salient objects. However, the motion saliency cue has one critical drawback, i.e., it is unstable in essence, as we have noted multiple times before. Therefore, we consider both spatial and temporal saliency cues to derive robust VSOD.

Considering the common attribute of high-quality $FS$, both the corresponding spatial and temporal saliency maps tend to be of high quality, and the reverse usually holds as well. Hence, for each frame, we can use the similarity between the spatial saliency ($FS$) and the temporal saliency ($MS$) to measure the quality of $FS$, because frames with low-quality $FS$ are unlikely to have an extremely high spatiotemporal saliency consistency.

To be more specific, our keyframe selection process can be summarized into the following two steps:
First, for each frame, we compute the consistency between its $FS$ and $MS$, where we use the S-measure [84] as the similarity measurement because the S-measure focuses more on the overall structural similarity rather than tiny saliency details, which are quite rare in $MS$.
Second, for every $B$ frames (the exact choice of $B$ can be found in the ablation study), only the one with the largest spatiotemporal saliency consistency degree is selected as the keyframe. For example, given a sequence with $N$ frames in total, there are $N/B$ keyframes to be selected, and all these keyframes are used to fine-tune the framewise refiner. The fine-tuning process can be performed very quickly (10 epochs), and the final VSOD results can be obtained by feeding each nonkeyframe into the framewise refiner. To maintain strong generalization ability, the framewise refiner is reset to its initial status for each sequence.
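The selection rule can be sketched as follows; `similarity` stands in for the S-measure (any structural similarity function with the same signature could be plugged in), and the block size B follows the ablation study.

```python
# A minimal sketch of the fast keyframe selection (Sec. IV-B): within every
# block of B frames, keep the frame whose spatial map FS agrees best with its
# motion map MS.
from typing import Callable, List
import numpy as np

def select_keyframes(fs_maps: List[np.ndarray], ms_maps: List[np.ndarray],
                     B: int,
                     similarity: Callable[[np.ndarray, np.ndarray], float]) -> List[int]:
    scores = [similarity(fs, ms) for fs, ms in zip(fs_maps, ms_maps)]
    keyframes = []
    for start in range(0, len(scores), B):
        block = scores[start:start + B]
        keyframes.append(start + int(np.argmax(block)))   # best-consistency frame
    return keyframes
```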

To facilitate readers' understanding, we provide a method flow chart in which all key steps are included. The detailed flow chart can be found in Algorithm 1.

Algorithm 1 Key Steps of Our Approach.
0:  a video sequence;
0:  high-quality VSOD results;
1:  object proposal computation via the off-the-shelf tool SEOD;
2:  color saliency computation using the off-the-shelf tool CPD;
3:  optical flow computation (Eq. 5);
4:  train motion saliency model (Eq. 6) and output motion saliency maps (Eq. 7);
5:  cluster all object proposals into $K$ clusters (Eq. 1);
6:  coarsely localize $\nu$ salient clusters (Eq. 2-4); For iter=1:3
7:  rank each cluster’s object proposals (Eq. 11-12) via SIM (Eq. 10) and PMS (Eq. 8);
8:  dynamically threshold each cluster’s object proposals (Eq. 13-16);
9:  formulate pos. and neg. training samples (Eq. 17-18);
10:  train binary classifier (Eq. 9); End
11:  train patch refiner using Pos. (identical to Eq. 6);
12:  generate frame-level saliency map via patch refiner (Eq. 19);
13:  select a key frame from every $B$ frames (Sec. IV-B);
14:  train frame-wise refiner using the selected key frames (identical to Eq. 6);
15:  utilize frame-wise refiner to output VSOD result;
Figure 7: Qualitative comparisons with the current SOTA methods. Due to the limited space, we only list several of the most representative comparisons here.

V Experiments

V-A Datasets

Following the conventional experimental setting, we evaluated the proposed approach on five widely used publicly available datasets, including Davis [73], Segtrack-v2 [74], VISal [2], DAVSOD [22], and VOS [75].

  • The Davis dataset contains 50 video sequences with 3,455 frames in total, and most of its sequences only contain moderate motions.

  • The Segtrack-v2 dataset contains 13 video sequences (excluding the penguin sequence) with 1,024 frames in total, featuring complex backgrounds and variable motion patterns, which makes it generally more challenging than the Davis dataset. This dataset is dominated by temporal cues with fast object movements.

  • The VISal dataset contains 17 video sequences with 963 frames in total, and this dataset is relatively simple. It is dominated by spatial cues, and most of the existing SOTA ISOD models using spatial information alone perform very well on it.

  • The DAVSOD dataset contains 226 video sequences with 23,938 frames in total, which is the most challenging dataset, involving various object instances, different motion patterns, and saliency shifting between different objects. Since most SOTA VSOD models have failed to consider the attention shifting mechanism, their quantitative scores over this set are very low.

  • The VOS dataset contains 40 video sequences with 24,177 frames in total, yet only 1,540 frames are well annotated; all sequences were captured in indoor scenes.

V-B Implementation Details

We implemented our method on a PC with an Intel(R) Xeon(R) CPU, an NVIDIA GTX2080Ti GPU (with 11G RAM), and 64G RAM. We choose SEOD [30] as the object detector to obtain object proposals. The patchwise and framewise refiners (Sec. IV and Fig. 4) follow an identical structure to the CPD [71]. The patchwise refiner is retrained on patchwise data generated from the widely used MSRA10K [85]. The framewise refiner follows the online learning scheme: it is trained on DAVIS-TR [73] in advance and then fine-tuned online on the keyframes mined from the current input sequence. Note that for each input video sequence, we perform the online fine-tuning process only once. The motion saliency network (Eq. 6) also follows an identical structure to the CPD, which is initially trained on the MSRA10K set and then fine-tuned on the DAVIS-TR set. The binary classifier is a multilayer fully connected network built on the lightweight ResNet18 backbone, and its updating process takes 20 epochs. To avoid overfitting, we adopt random horizontal flips for data augmentation.

V-C Evaluation Metrics

To accurately measure the consistency between the predicted VSOD and the manually annotated ground truth, we adopt five commonly used evaluation metrics, including the precision rate, recall rate, maximum F-measure value (maxF) [86], mean absolute error (MAE) [87], and structure measure value (S-measure) [84].
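For reference, minimal sketches of two of these metrics (standard definitions, not the authors' evaluation code) are given below; maxF sweeps 256 binarization thresholds with the conventional beta^2 = 0.3.

```python
# Minimal sketches of MAE and the maximum F-measure for a predicted saliency
# map `pred` in [0, 1] and a binary ground truth mask `gt`.
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    return float(np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean())

def max_f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    gt = gt > 0.5
    best = 0.0
    for thr in np.linspace(0, 1, 256):
        binary = pred >= thr
        tp = np.logical_and(binary, gt).sum()
        prec = tp / (binary.sum() + 1e-8)
        rec = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * prec * rec / (beta2 * prec + rec + 1e-8)
        best = max(best, float(f))
    return best
```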

Figure 8: Qualitative demonstrations regarding the different major components. The corresponding quantitative results of this demonstration can be found in Table II.
TABLE III: More quantitative evidence regarding the effectiveness of the proposed iterative mining scheme. 'Iter0-PA' and 'Iter0-OL' respectively denote the performance of $FS$ (Eq. 19) and the final result (Sec. IV-B) in 'Iteration 0'.
Sets Metrics Iter0-PA Iter0-OL Iter1-PA Iter1-OL Iter2-PA Iter2-OL Iter3-PA Iter3-OL
Davis maxF 0.878 0.892 0.902 0.908 0.899 0.884 0.907 0.911
S-M 0.907 0.909 0.910 0.921 0.909 0.921 0.916 0.922
MAE 0.017 0.020 0.017 0.017 0.018 0.016 0.016 0.016
Segv2 maxF 0.843 0.855 0.848 0.874 0.848 0.866 0.849 0.899
S-M 0.881 0.894 0.888 0.905 0.889 0.900 0.890 0.921
MAE 0.021 0.018 0.020 0.016 0.019 0.016 0.019 0.013
VISal maxF 0.927 0.931 0.945 0.952 0.944 0.948 0.943 0.953
S-M 0.921 0.930 0.941 0.943 0.940 0.945 0.942 0.947
MAE 0.018 0.014 0.016 0.012 0.016 0.012 0.015 0.011
DAVSOD maxF 0.669 0.646 0.694 0.679 0.696 0.675 0.702 0.725
S-M 0.739 0.734 0.768 0.763 0.769 0.763 0.775 0.792
MAE 0.072 0.077 0.071 0.070 0.070 0.071 0.071 0.064
VOS maxF 0.817 0.785 0.830 0.805 0.828 0.814 0.821 0.822
S-M 0.845 0.827 0.857 0.841 0.855 0.849 0.850 0.844
MAE 0.050 0.063 0.049 0.057 0.051 0.053 0.058 0.060
TABLE IV: Ablation study on the online fine-tuning epochs for the ‘Framewise Refiner’ (Sec. IV-B).
Metrics 2 epochs 4 epochs 6 epochs 8 epochs 10 epochs 12 epochs 14 epochs 16 epochs 18 epochs
maxF 0.795 0.823 0.864 0.883 0.886 0.895 0.899 0.894 0.894
S-M 0.855 0.874 0.902 0.912 0.913 0.918 0.921 0.919 0.919
MAE 0.034 0.029 0.015 0.014 0.014 0.013 0.013 0.013 0.013
Time (seconds) 2.000 4.000 6.000 8.000 10.00 12.00 14.00 16.00 18.00

V-D Quantitative Comparisons

We compared our method with 14 SOTA approaches (see Table I), including DCFNet [68], MQP [69], TENet20 [24], U2Net20 [72], PCSA20 [56], LSTI20 [62], SSAV19 [22], MGA19 [23], COS19 [53], CPD19 [71], PDB18 [26], MBN18 [35], SCO18 [36], and SFLR17 [7]. Among these competitors, TENet20 and LSTI20 are explicitly designed to introduce more long-term information into the current problem domain. However, the long-term attribute of our approach is clearly stronger than theirs, because all object proposals in the input sequence are simultaneously available to the current prediction, whereas the underlying methodology of both TENet20 and LSTI20 remains short-term in essence. For example, the graph structure adopted by TENet20 is clearly short-term, and the long-term alignment process of LSTI20 can be quite limited because long-term spatial alignment becomes very difficult when the spatial appearance of the salient object changes significantly.

Compared with other short-term models (e.g., PCSA20, SSAV19, and PDB18), our method outperforms them significantly, especially on the SegTrack-v2 set, where the salient objects tend to exhibit fast movements and change their spatial appearance very quickly; thus, the conventional short-term methodology easily produces failure detections when both the current spatial and temporal saliency cues are incorrect. In sharp contrast, the proposed long-term method can utilize the beyond-scope information to help the current saliency prediction.

Specifically, we noticed that our method fails to achieve the best performance on the VISal set. This can be explained by the nature of the VISal set: all video sequences are dominated by spatial saliency cues, while the temporal saliency cues are almost uninformative. This is evidenced by the fact that the ISOD model CPD19, which considers spatial information only, still performs very well on the VISal set. In addition, we present the qualitative results in Fig. 7.

V-E Component Evaluations

In this subsection, we verify the effectiveness of each major component in the proposed approach, and the quantitative results are detailed in Table II.

Both baseline models, i.e., the motion saliency deep model (Eq. 6) and the original CPD [71], exhibit the worst results. By using the proposed clustering-based coarse localization (i.e., ‘+SOPM(MAX)’) to find salient object proposals, the overall performance improves significantly and is already comparable to some SOTA models published in 2019. By further applying the proposed data mining scheme, i.e., the binary classifier, the overall performance improves again; the quantitative evidence can be seen by comparing ‘+SOPM(MAX)’ with ‘+SOPM(MAX)+CL’, which shows the effectiveness of the proposed iterative mining scheme. Since the binary classifier-based data mining scheme is the major component of this paper, we provide an additional quantitative evaluation to further verify its effectiveness in the next subsection.

In our implementation, all object proposal-based saliency maps are pasted back into the original video frame via the maximizing operation (Eq. 19). We also tested the averaging operation, shown as ‘+SOPM(AVE)+CL’. Clearly, the maximizing operation suits the pasting scheme better than the averaging operation.
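The two pasting schemes can be sketched as follows; the patch coordinates and boundary handling are simplified assumptions, and only the max/average distinction reflects the comparison above.

```python
import numpy as np

def paste_proposals(frame_hw, proposals, mode="max"):
    # proposals: list of (saliency_patch, (y0, x0)) pairs, where each patch has
    # already been resized to its proposal box; boxes are assumed to fit the frame.
    canvas = np.zeros(frame_hw, dtype=np.float32)
    count = np.zeros(frame_hw, dtype=np.float32)
    for patch, (y0, x0) in proposals:
        h, w = patch.shape
        view = canvas[y0:y0 + h, x0:x0 + w]
        if mode == "max":                       # '+SOPM(MAX)': pixelwise maximum
            np.maximum(view, patch, out=view)
        else:                                   # '+SOPM(AVE)': average overlapping patches
            view += patch
            count[y0:y0 + h, x0:x0 + w] += 1.0
    if mode != "max":
        canvas /= np.maximum(count, 1.0)
    return canvas
```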

To verify the effectiveness of the proposed online fine-tuning scheme, we test ‘+SOPM(MAX)+CL+OLF’, where, for every five video frames, one frame is randomly selected as a keyframe, and the ‘framewise refiner’ is fine-tuned on these keyframes using their FSFS (Eq. 19) as pseudo-GTs. Because this random selection easily introduces some less trustworthy FSFS into the fine-tuning process, the performance gain achieved by this online updating is very marginal. By using the proposed keyframe selection strategy, denoted as ‘+SOPM(MAX)+CL+OLF(KFS)’, the overall performance improves by an average of 2%.
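The two keyframe policies compared above can be summarized by the following sketch, where the trustworthiness score used by the selection strategy is an assumed placeholder for the criterion detailed in Sec. IV-B.

```python
import random

def select_keyframes(num_frames, trust_scores=None, batch=5):
    # One keyframe per batch of 'batch' frames: random selection when no scores
    # are given ('+OLF'), otherwise the most trustworthy frame ('+OLF(KFS)').
    keyframes = []
    for start in range(0, num_frames, batch):
        idx = list(range(start, min(start + batch, num_frames)))
        if trust_scores is None:
            keyframes.append(random.choice(idx))
        else:
            keyframes.append(max(idx, key=lambda i: trust_scores[i]))
    return keyframes
```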

V-F More Quantitative Evidence of the Effectiveness of the Proposed Iterative Mining Scheme

The overall performance of the final saliency maps obtained with and without the online fine-tuning scheme is shown in Table III, where ‘Iter0-PA’ and ‘Iter0-OL’ respectively denote the performance of FSFS (Eq. 19) and the final result (Sec. IV-B) in ‘Iteration 0’. We also observe that the overall performance reaches a plateau after ‘Iteration 3’, where some quantitative metrics even decrease slightly, e.g., the maxF on the Segv2 set (0.893 → 0.885). Therefore, we omit ‘Iteration 4’ and set the maximum iteration number to 3 to ensure computational efficiency.
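The iterative mining loop evaluated in Table III can be summarized by the following high-level sketch; the callables stand for the binary classifier, the FSFS fusion (Eq. 19), the framewise refiner (Sec. IV-B), and the classifier updating step, and the control flow is an assumed simplification of our pipeline rather than the released code.

```python
def iterative_mining(proposals, classify, fuse_to_frames, refine,
                     update_classifier, max_iter=3):
    # classify(p) -> bool, fuse_to_frames(list) -> FSFS maps,
    # refine(maps) -> final maps, update_classifier(clf, list) -> new classifier.
    result = None
    for it in range(max_iter + 1):                         # Iteration 0 .. Iteration 3
        salient = [p for p in proposals if classify(p)]    # mine salient proposals
        fsfs = fuse_to_frames(salient)                     # 'Iter{it}-PA' saliency maps
        result = refine(fsfs)                              # 'Iter{it}-OL' final maps
        classify = update_classifier(classify, salient)    # easy-to-hard expansion
    return result
```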

V-G Ablation Study on the Online Fine-tuning Epochs

The proposed online fine-tuning process (Sec. IV-B) incurs some additional computational time. However, in our case, only the framewise refiner is adapted to the current video sequence, and the amount of training data is relatively small. In addition, to avoid overfitting, and considering that the framewise refiner already performs very well once it has been weakly tuned by taking the FSFS (Eq. 19) of the trustworthy frames as the learning objective, the online fine-tuning process only requires several epochs. As shown in Table IV, we chose 10 epochs as the optimal choice to ensure overall computational efficiency, because the performance gains achieved by additional epochs are very marginal.

V-H Ablation Study on the Clustering Parameter K

The hyperparameter K determines the initial clustering number, and we tested multiple choices, i.e., K = {4, 6, 8, 10}, where K = 8 is the empirically selected default. The quantitative results are shown in Table V. Clearly, K = 8 is the best choice, and other choices of K may lead to some performance degeneration. Specifically, we noticed that the overall performance decreased significantly for K = 10 on the SegTrack set. The main reason is that the entire birdfall sequence is completely mis-detected: since the falling bird in this low-resolution sequence is a very small object, our clustering process assigns multiple background proposals with similar appearances to the bird into the salient clusters. For the other choices of K, the overall performance stays stable. Another reason why K = 8 is the best choice may be that all other parameters (e.g., the number of mining iterations) were selected under the default K = 8; thus, it is reasonable for other choices of K to exhibit inferior results.
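For concreteness, the following sketch shows the role K plays, using scikit-learn’s KMeans as a stand-in for the clustering step of Sec. III-B; the proposal descriptors are an assumed input.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_proposals(features, k=8, seed=0):
    # features: (num_proposals, dim) array of object-proposal descriptors
    # (appearance/motion features assumed); returns a cluster label per proposal.
    features = np.asarray(features, dtype=np.float64)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    return km.labels_, km.cluster_centers_
```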

TABLE V: Ablation study on the pre-given clustering number K, where we tested K = {4, 6, 8, 10} and set K = 8 as the optimal choice.
Dataset Davis [73] Segv2 [74] Visal [2]
Metrics maxF S-M MAE maxF S-M MAE maxF S-M MAE
K=4 0.893 0.908 0.019 0.890 0.912 0.014 0.951 0.947 0.013
K=6 0.905 0.915 0.017 0.894 0.917 0.014 0.949 0.947 0.013
K=8 0.911 0.922 0.016 0.899 0.921 0.013 0.953 0.947 0.011
K=10 0.912 0.923 0.015 0.853 0.891 0.021 0.952 0.949 0.011

V-I Ablation Study on the Keyframe Sampling Parameter B

The hyperparameter B is the frame batch size, which determines the total number of keyframes used during the model fine-tuning process. Generally, a large B corresponds to sparse keyframe selection, while a small B leads to dense keyframe sampling. The ablation study on B is shown in Table VI, where we tested multiple choices, B = {3, 5, 8, 10}, with B = 5 as the default. As shown in the table, the overall performance is very stable across different choices of B, although B = 5 slightly outperforms the others. The main reason is a clear trade-off: a large number of keyframes may make it difficult for the fine-tuning process to converge, while a small number of keyframes may result in overfitting. Therefore, we keep B = 5 as the best choice.
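As a quick illustration of this trade-off, the number of candidate keyframes scales roughly as ⌈N/B⌉; the 80-frame clip in the example below is hypothetical.

```python
import math

def num_keyframes(num_frames, b):
    # one keyframe per batch of b frames
    return math.ceil(num_frames / b)

# e.g., for a hypothetical 80-frame clip:
# B=3 -> 27, B=5 -> 16, B=8 -> 10, B=10 -> 8 candidate keyframes
```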

TABLE VI: Ablation study on the keyframe sampling parameter B, where we tested B = {3, 5, 8, 10} and set B = 5 as the optimal choice, i.e., at least one frame out of every five frames is selected as a keyframe during the online model refinement process.
Dataset Davis [73] Segv2 [74] Visal [2]
Metrics maxF S-M MAE maxF S-M MAE maxF S-M MAE
B=3 0.909 0.918 0.016 0.885 0.909 0.014 0.952 0.947 0.011
B=5 0.911 0.922 0.016 0.899 0.921 0.013 0.953 0.947 0.011
B=8 0.908 0.918 0.017 0.842 0.892 0.015 0.951 0.948 0.011
B=10 0.902 0.917 0.018 0.848 0.898 0.016 0.949 0.948 0.013
TABLE VII: Quantitative results of our method after using other object proposal (YOLOv4 [88]) and optical flow (LiteFlow [89]) methods.
Dataset Davis [73] Segv2 [74] Visal [2]
Metrics maxF S-M MAE maxF S-M MAE maxF S-M MAE
MS Baseline 0.798 0.854 0.044 0.648 0.760 0.054 0.627 0.738 0.079
SEOD [30]\rightarrowYOLOv4 [88] 0.907 0.919 0.016 0.858 0.899 0.017 0.933 0.931 0.018
FlowNet [70]\rightarrowLiteFlow [89] 0.910 0.922 0.015 0.887 0.913 0.014 0.941 0.939 0.014
SEOD [30]+FlowNet [70] 0.911 0.922 0.016 0.899 0.921 0.013 0.953 0.947 0.011
TABLE VIII: Detailed average time cost for a single frame. This result was obtained on a PC with an Intel(R) Xeon(R) CPU, an NVIDIA GTX2080Ti GPU (with 11 GB of video memory), and 64 GB of RAM. This experiment was carried out on the Visal set only.
MainSteps Seconds
Optical Flow (Sec. III-A) 0.060s
Object Proposal (Sec. III-A) 0.121s
Low-level Saliency (Sec. III-A) 0.050s
Clustering and Ranking (Sec. III-B) 0.020s
Binary Classifier Training & Updating (Sec. III-D) 0.057s
Patchwise Refiner (Sec. IV-A) 0.030s
Framewise Refiner (Sec. IV-B) 0.080s
Total 0.418s

V-J How Do Different Object Proposal and Optical Flow Methods Affect the Overall Performance?

Both the object proposal and optical flow methods affect the overall performance of our approach. The main reason is clear: all other parameters were assigned based on the combination of SEOD + FlowNet, so new choices of object proposal and optical flow methods may leave our model in a sub-optimal configuration. As shown in Table VII, using either YOLOv4 or LiteFlow leads to some performance degeneration, as expected. However, the quantitative results in Table VII also suggest that our approach is generally stable: although the overall performance decreases, the two modified versions (i.e., YOLOv4+FlowNet and SEOD+LiteFlow) still significantly outperform the motion saliency (MS) baseline.

V-K Failure Cases and Limitations

We show some failure cases in Fig. 9. These failure cases are mainly induced by three reasons: 1) the binary classifier is not always correct, 2) the patchwise refiner can produce some missed detections, and 3) the proposed keyframe strategy might include some low-quality FSFS in the online fine-tuning process. One possible way to improve is to implement our approach fully end-to-end, which deserves our future investigation.

The major limitation of our approach is its offline data mining methodology, which includes multiple sequential steps relying on several off-the-shelf tools such as optical flow estimators, object detectors, and ISOD models. Since our approach is not an end-to-end model, it is quite time consuming; we detail the time cost of each major step in Table VIII. Compared with the most representative SSAV19 running at 20 FPS (frames per second), the FPS rate of our approach is only 2.39.

Refer to caption
Figure 9: Demonstrations of some of the most representative failure cases.

VI Conclusion and Future Work

In this paper, we devised a novel long-term scheme for the VSOD task, which iteratively mines salient object proposals. The major highlight of the proposed approach is that it converts the conventional framewise VSOD task into an object-level data mining problem. Moreover, we provided an in-depth analysis and discussion of the rationale of the proposed long-term scheme, which has the potential to benefit the research community. Additionally, we conducted an extensive component evaluation to verify the effectiveness of each major component in our approach, and quantitative comparisons to the SOTA models demonstrate the superiority of the proposed approach.

In the near future, we are particularly interested in reimplementing our approach in a fully end-to-end manner, so that the time-consumption limitation can be alleviated and some empirical parameter settings can be avoided, making our method more robust.

Acknowledgments. This research was supported in part by the National Natural Science Foundation of China (62172246), the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems (VRLAB2021A05), and Youth Innovation and Technology Support Plan of Colleges and Universities in Shandong Province (2021KJ062).

References

  • [1] X. Lu, W. Wang, M. Danelljan, T. Zhou, J. Shen, and L. V. Gool, “Video object segmentation with episodic graph memory networks,” in European Conference on Computer Vision, 2020, pp. 207–223.
  • [2] W. Wang, J. Shen, and L. Shao, “Consistent video saliency using local gradient flow optimization and global refinement,” IEEE Transactions on Image processing, vol. 24, no. 11, pp. 4185–4196, 2015.
  • [3] B. Liu, K. Mu, M. Xu, F. Wang, and L. Feng, “A novel spatiotemporal attention enhanced discriminative network for video salient object detection,” Applied Intelligence, pp. 1–16, 2021.
  • [4] H. Huang, C. Liu, L. Tian, J. Mu, and X. Jing, “A novel fcns-convlstm network for video salient object detection,” International Journal of Circuit Theory and Applications, vol. 49, no. 4, pp. 1050–1060, 2021.
  • [5] W. Wang, H. Song, S. Zhao, J. Shen, S. Zhao, S. C. Hoi, and H. Ling, “Learning unsupervised video object segmentation through visual attention,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3064–3074.
  • [6] W. Wang, J. Shen, X. Dong, A. Borji, and R. Yang, “Inferring salient objects from human fixations,” IEEE transactions on pattern analysis and machine intelligence, vol. 42, no. 8, pp. 1913–1927, 2019.
  • [7] C. Chen, S. Li, Y. Wang, H. Qin, and A. Hao, “Video saliency detection via spatial-temporal fusion and low-rank coherency diffusion,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3156–3170, 2017.
  • [8] G. Ma, S. Li, C. Chen, A. Hao, and H. Qin, “Stage-wise salient object detection in 360 omnidirectional image via object-level semantical saliency ranking,” IEEE Transactions on Visualization and Computer Graphics, vol. 26, no. 12, pp. 3535–3545, 2020.
  • [9] A. Kompella and R. V. Kulkarni, “A semi-supervised recurrent neural network for video salient object detection,” Neural Computing and Applications, vol. 33, no. 6, pp. 2065–2083, 2021.
  • [10] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and category-specific object detection: A survey,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 84–100, 2018.
  • [11] J. Han, G. Cheng, Z. Li, and D. Zhang, “A unified metric learning-based framework for co-saliency detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2473–2483, 2017.
  • [12] C. Chen, J. Wei, C. Peng, W. Zhang, and H. Qin, “Improved saliency detection in rgb-d images using two-phase depth estimation and selective deep fusion,” IEEE Transactions on Image Processing, vol. 29, pp. 4296–4307, 2020.
  • [13] R. Cong, J. Lei, H. Fu, M. Cheng, W. Lin, and Q. Huang, “Review of visual saliency detection with comprehensive information,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 10, pp. 2941–2959, 2018.
  • [14] Y. Kong, Y. Wang, A. Li, and Q. Huang, “Self-sufficient feature enhancing networks for video salient object detection,” IEEE Transactions on Multimedia, 2021.
  • [15] M. Xu, P. Fu, B. Liu, and J. Li, “Multi-stream attention-aware graph convolution network for video salient object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 4183–4197, 2021.
  • [16] W. Wang, J. Shen, J. Xie, M. Cheng, H. Ling, and A. Borji, “Revisiting video saliency prediction in the deep learning era,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 1, pp. 220–237, 2021.
  • [17] W. Wang, Q. Lai, H. Fu, J. Shen, H. Ling, and R. Yang, “Salient object detection in the deep learning era: An in-depth survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 6, pp. 3239–3259, 2021.
  • [18] H.-B. Bi, D. Lu, H.-H. Zhu, L.-N. Yang, and H.-P. Guan, “Sta-net: spatial-temporal attention network for video salient object detection,” Applied Intelligence, vol. 51, no. 6, pp. 3450–3459, 2021.
  • [19] C. Chen, S. Li, H. Qin, and A. Hao, “Structure-sensitive saliency detection via multilevel rank analysis in intrinsic feature space,” IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2303–2316, 2015.
  • [20] G. Ma, C. Chen, S. Li, C. Peng, A. Hao, and H. Qin, “Salient object detection via multiple instance joint re-learning,” IEEE Transactions on Multimedia, vol. 22, no. 2, pp. 324–336, 2020.
  • [21] Y.-W. Chen, X. Jin, X. Shen, and M.-H. Yang, “Video salient object detection via contrastive features and attention modules,” in IEEE Winter Conference on Applications of Computer Vision, 2022, pp. 1320–1329.
  • [22] D. Fan, W. Wang, M. Cheng, and J. Shen, “Shifting more attention to video salient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8554–8564.
  • [23] H. Li, G. Chen, G. Li, and Y. Yu, “Motion guided attention for video salient object detection,” in IEEE International Conference on Computer Vision, 2019, pp. 7274–7283.
  • [24] S. Ren, C. Han, X. Yang, G. Han, and S. He, “Tenet: Triple excitation network for video salient object detection,” in European Conference on Computer Vision, 2020, pp. 212–228.
  • [25] Y. Tang, Y. Li, and G. Xing, “Video salient object detection via adaptive local-global refinement,” arXiv preprint arXiv:2104.14360, 2021.
  • [26] H. Song, W. Wang, S. Zhao, J. Shen, and K. Lam, “Pyramid dilated deeper convlstm for video salient object detection,” in IEEE International Conference on Computer Vision, 2018, pp. 715–731.
  • [27] P. Chen, J. Lai, G. Wang, and H. Zhou, “Confidence-guided adaptive gate and dual differential enhancement for video salient object detection,” in 2021 IEEE International Conference on Multimedia and Expo (ICME).   IEEE, 2021, pp. 1–6.
  • [28] L. Hong, W. Zhang, L. Chen, W. Zhang, and J. Fan, “Adaptive selection of reference frames for video object segmentation,” IEEE Transactions on Image Processing, vol. 31, pp. 1057–1071, 2022.
  • [29] Y. Lee, H. Seong, and E. Kim, “Iteratively selecting an easy reference frame makes unsupervised video object segmentation easier,” in AAAI Conference on Artificial Intelligence, 2022.
  • [30] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 781–10 790.
  • [31] Z. Liu, J. Li, L. Ye, G. Sun, and L. Shen, “Saliency detection for unconstrained videos using superpixel-level graph and spatiotemporal propagation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 27, no. 12, pp. 2527–2542, 2016.
  • [32] T. Xi, W. Zhao, H. Wang, and W. Lin, “Salient object detection with spatiotemporal background priors for video,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3425–3436, 2016.
  • [33] C. Liu, J. Yuen, and A. Torralba, “Sift flow: Dense correspondence across scenes and its applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 978–994, 2011.
  • [34] W. Zhu, S. Liang, Y. Wei, and J. Sun, “Saliency optimization from robust background detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2814–2821.
  • [35] S. Li, B. Seybold, A. Vorobyov, X. Lei, and C.-C. J. Kuo, “Unsupervised video object segmentation with motion-based bilateral networks,” in European Conference on Computer Vision, 2018, pp. 207–223.
  • [36] Y. Chen, W. Zou, Y. Tang, X. Li, C. Xu, and N. Komodakis, “Scom: Spatiotemporal constrained optimization for salient object detection,” IEEE Transactions on Image Processing, vol. 27, no. 7, pp. 3345–3357, 2018.
  • [37] X. Zhou, Z. Liu, C. Gong, and L. Wei, “Improving video saliency detection via localized estimation and spatiotemporal refinement,” IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 2993–3007, 2018.
  • [38] F. Guo, W. Wang, J. Shen, L. Shao, J. Yang, D. Tao, and Y. Tang, “Video saliency detection using object proposals,” IEEE Transactions on Cybernetics, vol. 48, no. 11, pp. 3159–3170, 2017.
  • [39] C. Chen, S. Li, H. Qin, Z. Pan, and G. Yang, “Bilevel feature learning for video saliency detection,” IEEE Transactions on Multimedia, vol. 20, no. 12, pp. 3324–3336, 2018.
  • [40] F. Guo, W. Wang, Z. Shen, J. Shen, L. Shao, and D. Tao, “Motion-aware rapid video saliency detection,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4887–4898, 2020.
  • [41] W. Zhao, J. Zhang, L. Li, N. Barnes, N. Liu, and J. Han, “Weakly supervised video salient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 16 826–16 835.
  • [42] W. Wang, T. Zhou, F. Yu, J. Dai, E. Konukoglu, and L. Van Gool, “Exploring cross-image pixel contrast for semantic segmentation,” in IEEE International Conference on Computer Vision, 2021, pp. 7303–7313.
  • [43] W. Wang, G. Sun, and L. Van Gool, “Looking beyond single images for weakly supervised semantic segmentation learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [44] D. Liu, Y. Cui, W. Tan, and Y. Chen, “Sg-net: Spatial granularity network for one-stage video instance segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 9816–9825.
  • [45] X. Lu, W. Wang, J. Shen, Y.-W. Tai, D. J. Crandall, and S. C. Hoi, “Learning video object segmentation from unlabeled videos,” in IEEE conference on computer vision and pattern recognition, 2020, pp. 8960–8970.
  • [46] Y. Cui, L. Yan, Z. Cao, and D. Liu, “Tf-blender: Temporal feature blender for video object detection,” in IEEE International Conference on Computer Vision, 2021, pp. 8138–8147.
  • [47] D. Liu, Y. Cui, Y. Chen, J. Zhang, and B. Fan, “Video object detection for autonomous driving: Motion-aid feature calibration,” Neurocomputing, vol. 409, pp. 1–11, 2020.
  • [48] W. Wang, T. Zhou, F. Porikli, D. Crandall, and L. Van Gool, “A survey on deep learning technique for video segmentation,” arXiv preprint arXiv:2107.01153, 2021.
  • [49] W. Wang, J. Shen, and L. Shao, “Video salient object detection via fully convolutional networks,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 38–49, 2017.
  • [50] P. Tokmakov, K. Alahari, and C. Schmid, “Learning video object segmentation with visual memory,” in IEEE International Conference on Computer Vision, 2017, pp. 4481–4490.
  • [51] T.-N. Le and A. Sugimoto, “Deeply supervised 3d recurrent fcn for salient object detection in videos,” in British Machine Vision Conference, vol. 1, 2017, p. 3.
  • [52] C. Chen, G. Wang, C. Peng, Y. Fang, D. Zhang, and H. Qin, “Exploring rich and efficient spatial temporal interactions for real time video salient object detection,” IEEE Transactions on Image Processing, vol. 1, no. 1, pp. 1–1, 2021.
  • [53] X. Lu, W. Wang, C. Ma, J. Shen, L. Shao, and F. Porikli, “See more, know more: Unsupervised video object segmentation with co-attention siamese networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3623–3632.
  • [54] T. Zhou, J. Li, S. Wang, and R. Tao, “Matnet: Motion-attentive transition network for zero-shot video object segmentation,” IEEE Transactions on Image Processing, vol. 29, pp. 8326–8338, 2020.
  • [55] T. Zhou, J. Li, X. Li, and L. Shao, “Target-aware object discovery and association for unsupervised video multi-object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2021.
  • [56] Y. Gu, L. Wang, Z. Wang, Y. Liu, M. Cheng, and S. Lu, “Pyramid constrained self-attention network for fast video salient object detection,” in The AAAI Conference on Artificial Intelligence, 2020, pp. 10 869–10 876.
  • [57] Y. Jiao, X. Wang, Y.-C. Chou, S. Yang, G.-P. Ji, R. Zhu, and G. Gao, “Guidance and teaching network for video salient object detection,” in 2021 IEEE International Conference on Image Processing (ICIP).   IEEE, 2021, pp. 2199–2203.
  • [58] H. Bi, L. Yang, H. Zhu, D. Lu, and J. Jiang, “Steg-net: Spatio-temporal edge guidance network for video salient object detection,” IEEE Transactions on Cognitive and Developmental Systems, 2021.
  • [59] X. Lu, W. Wang, J. Shen, D. J. Crandall, and L. Shao, “Zero-shot video object segmentation via attentive graph neural networks,” in IEEE International Conference on Computer Vision, 2019.
  • [60] X. Lu, W. Wang, J. Shen, D. J. Crandall, and L. V. Gool, “Segmenting objects from relational visual data,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [61] X. Lu, W. Wang, J. Shen, D. Crandall, and J. Luo, “Zero-shot video object segmentation with co-attention siamese networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, pp. 2228–2242, 2022.
  • [62] C. Chen, G. Wang, C. Peng, X. Zhang, and H. Qin, “Improved robust video saliency detection based on long-term spatial-temporal information,” IEEE Transactions on Image Processing, vol. 29, pp. 1090–1100, 2020.
  • [63] Y. Li, S. Li, C. Chen, A. Hao, and H. Qin, “Accurate and robust video saliency detection via self-paced diffusion,” IEEE Transactions on Multimedia, vol. 22, no. 5, pp. 1153–1167, 2019.
  • [64] ——, “A plug-and-play scheme to adapt image saliency deep model for video data,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 1, no. 1, pp. 1–1, 2021.
  • [65] P. Voigtlaender and B. Leibe, “Online adaptation of convolutional neural networks for video object segmentation,” in British Machine Vision Conference, 2017, pp. 1–16.
  • [66] W. Wang, X. Lu, J. Shen, D. Crandall, and L. Shao, “Zero-shot video object segmentation via attentive graph neural networks,” in IEEE International Conference on Computer Vision, 2019.
  • [67] B. Wang, W. Liu, G. Han, and S. He, “Learning long-term structural dependencies for video salient object detection,” IEEE Transactions on Image Processing, vol. 29, pp. 9017–9031, 2020.
  • [68] M. Zhang, J. Liu, Y. Wang, Y. Piao, S. Yao, W. Ji, J. Li, H. Lu, and Z. Luo, “Dynamic context-sensitive filtering network for video salient object detection,” in IEEE International Conference on Computer Vision, 2021, pp. 1553–1563.
  • [69] C. Chen, J. Song, C. Peng, G. Wang, and Y. Fang, “A novel video salient object detection method via semisupervised motion quality perception,” IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  • [70] D. Sun, X. Yang, M. Liu, and J. Kautz, “Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8934–8943.
  • [71] Z. Wu, L. Su, and Q. Huang, “Cascaded partial decoder for fast and accurate salient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3907–3916.
  • [72] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. Zaiane, and M. Jagersand, “U2-net: Going deeper with nested u-structure for salient object detection,” Pattern Recognition, vol. 106, p. 107404, 2020.
  • [73] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 724–732.
  • [74] F. Li, T. Kim, A. Humayun, D. Tsai, and J. Rehg, “Video segmentation by tracking many figure-ground segments,” in IEEE International Conference on Computer Vision, 2013, pp. 2192–2199.
  • [75] J. Li, C. Xia, and X. Chen, “A benchmark dataset and saliency-guided stacked autoencoders for video-based salient object detection,” IEEE Transactions on Image Processing, vol. 27, no. 1, pp. 349–364, 2017.
  • [76] Q. Hou, M. Cheng, X. Hu, A. Borji, Z. Tu, and P. Torr, “Deeply supervised salient object detection with short connections,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 4, pp. 815–828, 2019.
  • [77] Z. Wu, S. Li, C. Chen, A. Hao, and H. Qin, “Recursive multi-model complementary deep fusion for robust salient object detection via parallel sub-networks,” Pattern Recognition, vol. 121, p. 108212, 2022.
  • [78] X. Qin, Z. Zhang, C. Huang, C. Gao, M. Dehghan, and M. Jagersand, “Basnet: Boundary-aware salient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7479–7489.
  • [79] Z. Wu, S. Li, C. Chen, A. Hao, and H. Qin, “A deeper look at salient object detection: Bi-stream network with a small training dataset,” IEEE Transactions on Multimedia, vol. 24, pp. 73–86, 2022.
  • [80] X. Wang, S. Li, C. Chen, Y. Fang, A. Hao, and H. Qin, “Data-level recombination and lightweight fusion scheme for rgb-d salient object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 458–471, 2020.
  • [81] C. Chen, J. Wei, C. Peng, and H. Qin, “Depth-quality-aware salient object detection,” IEEE Transactions on Image Processing, vol. 30, pp. 2350–2363, 2021.
  • [82] G. Wang, C. Chen, D. Fan, A. Hao, and H. Qin, “From semantic categories to fixations: A novel weakly-supervised visual-auditory saliency detection approach,” in IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 15 119–15 128.
  • [83] G. Ma, S. Li, C. Chen, A. Hao, and H. Qin, “Rethinking image salient object detection: Object-level semantic saliency reranking first, pixelwise saliency refinement later,” IEEE Transactions on Image Processing, vol. 30, pp. 4238–4252, 2021.
  • [84] D. Fan, M. Cheng, Y. Liu, T. Li, and A. Borji, “Structure-measure: A new way to evaluate foreground maps,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4548–4557.
  • [85] M. Cheng, N. J. Mitra, X. Huang, P. Torr, and S. Hu, “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, 2015.
  • [86] R. Achanta, S. Hemami, F. Estrada, and S. Susstrunk, “Frequency-tuned salient region detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 1597–1604.
  • [87] F. Perazzi, P. Krähenbühl, Y. Pritch, and A. Hornung, “Saliency filters: Contrast based filtering for salient region detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 733–740.
  • [88] A. Bochkovskiy, C.-Y. Wang, and H.-Y. M. Liao, “Yolov4: Optimal speed and accuracy of object detection,” arXiv preprint arXiv:2004.10934, 2020.
  • [89] T.-W. Hui, X. Tang, and C. C. Loy, “Liteflownet: A lightweight convolutional neural network for optical flow estimation,” in IEEE Conference on Computer Vision and Pattern Recognition, June 2018.