3rd Place Solution for MOSE Track in CVPR 2024 PVUW workshop: Complex Video Object Segmentation
Abstract
Video Object Segmentation (VOS) is a vital task in computer vision, focusing on distinguishing foreground objects from the background across video frames. Our work draws inspiration from the Cutie model, and we investigate the effects of object memory, the total number of memory frames, and input resolution on segmentation performance. This report validates the effectiveness of our inference method on the coMplex video Object SEgmentation (MOSE) dataset, which features complex occlusions. Our experimental results demonstrate that our approach achieves a J&F score of 0.8139 on the test set, securing the third position in the final ranking. These findings highlight the robustness and accuracy of our method in handling challenging VOS scenarios.
1 Introduction
The Pixel-level Video Understanding in the Wild (PVUW) Challenge is a computer vision competition that encompasses four different tracks. Its overarching aim is to advance the field of pixel-level scene understanding, which involves classifying object categories, generating masks, and assigning semantic labels to every pixel within an image. Given that the real world is dynamic and not limited to static images, the ability to segment and understand videos is of greater relevance and practicality for real-world applications. The PVUW Challenge is designed to spur the development of technologies that can accurately interpret and segment video content in natural settings, thereby pushing the boundaries of computer vision in dynamic scene understanding.
The 3rd PVUW Challenge adds two new tracks. The MOSE Track includes additional videos and annotations featuring challenging elements such as the disappearance and reappearance of objects, inconspicuous small objects, heavy occlusions, and crowded environments [8]. The Motion Expression guided Video Segmentation (MeViS) Track is designed to advance natural language-guided video understanding in complex environments; through new videos, sentences, and annotations, it aims to foster a more comprehensive and robust pixel-level understanding of realistic video scenes [7].
In the realm of Video Object Segmentation (VOS), especially within the semi-supervised paradigm, the objective is to track and segment objects from a broad range of categories based solely on an initial frame's annotation. VOS methodologies find extensive application in fields such as robotics [13], video editing [5], and reducing the burden of data annotation [1]. A VOS dataset comprises training and test sets in which each video frame carries corresponding annotation data. These annotations are represented as two-dimensional matrices over the frame's height and width. In the pixel-wise annotation matrix, each element records the pixel information (e.g., RGB channels) of the corresponding pixel in the video frame; in the classification result matrix, each element is a one-hot vector whose length equals the number of object categories in the VOS task, indicating the class of the corresponding pixel (see Fig. 1).
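As a concrete illustration of the annotation representation described above, the following sketch (our own illustrative code, not part of any dataset toolkit) converts an integer label mask into the per-pixel one-hot classification matrix:

```python
import numpy as np

def mask_to_one_hot(label_mask: np.ndarray, num_objects: int) -> np.ndarray:
    """Convert an (H, W) integer label mask into an (H, W, num_objects + 1)
    one-hot matrix, where channel 0 is the background."""
    one_hot = np.zeros((*label_mask.shape, num_objects + 1), dtype=np.float32)
    for obj_id in range(num_objects + 1):
        one_hot[..., obj_id] = (label_mask == obj_id).astype(np.float32)
    return one_hot

# Toy 2x3 frame with two annotated objects (IDs 1 and 2); 0 is background.
labels = np.array([[0, 1, 1],
                   [2, 2, 0]])
print(mask_to_one_hot(labels, num_objects=2).shape)  # (2, 3, 3)
```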
Recent VOS methods utilize a memory-based paradigm [10, 4, 14, 2], where a memory representation is built from previously segmented frames (either provided or generated by the model). New query frames then access this memory to retrieve features for segmentation. These methods predominantly employ pixel-level matching for memory reading, whether through a single [10] or multiple matching layers [14], and construct segmentation from the pixel memory readout. Pixel-level matching independently maps each query pixel to a linear combination of memory pixels (e.g., using an attention layer). However, this approach often lacks high-level consistency and is vulnerable to matching noise, especially when distractors are present.
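The following minimal sketch illustrates the pixel-level matching idea in the abstract: each query pixel attends over all memory pixels and receives a linear combination of their values. It is a simplified illustration, not the exact readout of any of the cited methods.

```python
import torch
import torch.nn.functional as F

def pixel_memory_readout(query_keys, memory_keys, memory_values):
    """Attention-style pixel matching: every query pixel is mapped to a
    linear combination of memory pixels.

    query_keys:    (Ck, Nq)  -- Nq query pixels
    memory_keys:   (Ck, Nm)  -- Nm memory pixels
    memory_values: (Cv, Nm)
    returns:       (Cv, Nq)  readout feature per query pixel
    """
    # Similarity between every query pixel and every memory pixel.
    affinity = memory_keys.t() @ query_keys / query_keys.shape[0] ** 0.5  # (Nm, Nq)
    weights = F.softmax(affinity, dim=0)          # normalize over memory pixels
    return memory_values @ weights                # (Cv, Nq)

readout = pixel_memory_readout(torch.randn(64, 100), torch.randn(64, 500), torch.randn(512, 500))
print(readout.shape)  # torch.Size([512, 100])
```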
The complex video object segmentation task focuses on the tracking and segmentation of objects within intricate scenes. MOSE is specifically designed to test VOS models in complex environments featuring multiple objects, frequent occlusions, and numerous distractors. This dataset poses a significant challenge due to its high variability and realistic scenes, making it a more rigorous benchmark than traditional datasets like DAVIS-2017 [12]. In fact, recent methods [4, 14] show a performance drop of over 20 points in J&F when evaluated on MOSE compared to DAVIS-2017. This significant performance decline on MOSE highlights the limitations of current memory-based VOS techniques. In scenarios with high object density and frequent interactions, pixel-level matching struggles with segmentation accuracy due to noise and distractors. These issues emphasize the need for more robust and context-aware approaches in memory-based VOS to effectively manage the complexities of datasets like MOSE.
Recently, the Cutie approach [3] has restructured the video object segmentation task into three distinct subprocesses: image object segmentation, tracking/alignment, and refinement. Furthermore, Cutie constructs a compact object memory to summarize object features over the long term; these features are retrieved as target-specific object-level representations during querying, showcasing significant advancements in the field of Video Object Segmentation (VOS). Leveraging the exceptional capabilities of Cutie, our team secured the 3rd position in the complex video object segmentation track of the 3rd PVUW Challenge at CVPR 2024, all without the need for additional training.

2 Method
Our approach is inspired by recent work on video object segmentation, particularly the Cutie framework by Cheng et al. [3]. Cutie operates in a semi-supervised video object segmentation (VOS) setting, where it takes a first-frame segmentation as input and processes subsequent frames sequentially.
Cutie encodes segmented frames into a high-resolution pixel memory and a high-level object memory. These memories are used for segmenting future frames. When segmenting a new frame, Cutie first retrieves an initial pixel readout from the pixel memory using the encoded query features. This initial readout is typically noisy because it relies on low-level pixel matching.
To enhance this initial readout, Cutie enriches it with object-level semantics using information from the object memory and the object queries. This is done through an object transformer composed of multiple transformer blocks. The final enriched readout is then passed to the decoder to generate the output mask.
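The overall segmentation step can be summarized with the following structural pseudocode; all function and module names are illustrative placeholders rather than Cutie's actual API.

```python
# High-level flow of the segmentation step described above; every name below
# is an illustrative placeholder, not Cutie's actual interface.
def segment_frame(frame, pixel_memory, object_memory, object_queries,
                  query_encoder, object_transformer, decoder):
    # 1. Encode the query frame and read the (noisy) pixel-level readout.
    query_features = query_encoder(frame)
    pixel_readout = pixel_memory.read(query_features)

    # 2. Enrich the pixel readout with object-level semantics via the
    #    object transformer, conditioned on object memory and object queries.
    enriched_readout = object_transformer(pixel_readout, object_memory, object_queries)

    # 3. Decode the enriched readout into the output mask.
    return decoder(enriched_readout, query_features)
```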
In summary, Cutie introduces three main components: the object transformer, masked attention, and the object memory.

The 'Cutie-base' model corresponds to the 'base' variant, utilizing ResNet-50 as the query encoder backbone; its channel width, number of object transformer blocks, and number of object queries follow the default base configuration of Cutie [3].
The query and mask encoders are built on ResNets [9]. Following previous studies [11, 4], we discard the final convolutional stage and employ the stride-16 features.
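As an illustration of this truncation, a ResNet-50 from torchvision can be cut after its third residual stage so that the deepest retained features have stride 16. This is a generic sketch, not the authors' exact encoder code.

```python
import torch
import torchvision

# Truncate ResNet-50 after layer3 so the deepest feature map has stride 16
# (layer4, the final stage with stride 32, is discarded). Random weights are
# used here purely for the shape demonstration.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,  # stride 4
    resnet.layer1,                                          # stride 4
    resnet.layer2,                                          # stride 8
    resnet.layer3,                                          # stride 16, 1024 channels
)

features = backbone(torch.randn(1, 3, 480, 854))
print(features.shape)  # torch.Size([1, 1024, 30, 54])
```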
Each object transformer block integrates both a query FFN and a pixel FFN. The query FFN comprises a two-layer MLP, while the pixel FFN uses two convolutions with a reduced hidden size to minimize computational overhead. The ReLU activation function is employed throughout the network.
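A rough sketch of the two FFN types is shown below; the channel and hidden sizes are placeholders chosen only to make the code runnable and are not the exact values used in Cutie-base.

```python
import torch.nn as nn

C = 256  # placeholder feature width, not necessarily Cutie-base's value

# Query FFN: a 2-layer MLP applied to the object queries.
query_ffn = nn.Sequential(
    nn.Linear(C, 4 * C),
    nn.ReLU(inplace=True),
    nn.Linear(4 * C, C),
)

# Pixel FFN: two convolutions with a reduced hidden size to keep the
# per-pixel computation cheap.
pixel_ffn = nn.Sequential(
    nn.Conv2d(C, C // 2, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(C // 2, C, kernel_size=3, padding=1),
)
```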
3 Experiment
3.1 Inference
When testing, the input video is upscaled to a resolution of 720p, which provides a higher density of pixel information than lower resolutions such as 480p. This increased resolution enhances the detail and clarity of the video, allowing for more precise segmentation and tracking of objects. The choice of 720p as the testing resolution balances the level of detail required for accurate segmentation against the computational resources required to process the video data.
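A simple resizing routine along these lines is sketched below, assuming that "720p" here means scaling the input so that its shorter side is 720 pixels; the exact resizing convention used for the submission may differ.

```python
import torch.nn.functional as F

def resize_to_720p(frame, mask=None):
    """Upscale a frame (and, if given, its mask) so the shorter side is 720 px,
    keeping the aspect ratio. `frame` is a (C, H, W) float tensor."""
    h, w = frame.shape[-2:]
    scale = 720 / min(h, w)
    new_size = (round(h * scale), round(w * scale))
    frame = F.interpolate(frame[None], size=new_size, mode='bilinear',
                          align_corners=False)[0]
    if mask is not None:
        # Nearest-neighbour interpolation keeps the mask labels discrete.
        mask = F.interpolate(mask[None, None].float(), size=new_size,
                             mode='nearest')[0, 0].long()
    return frame, mask
```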
For memory frame encoding, we update both the pixel memory and the object memory every third frame, following the same interval used in the XMem framework [4]. This interval strikes a balance between capturing the temporal evolution of the scene and maintaining computational efficiency. In the attention component of the pixel memory, we permanently retain the keys and values from the first frame, which is provided by the user and serves as a reference point for the video sequence. For subsequent memory frames, we employ a First-In-First-Out (FIFO) strategy, so the most recent information is retained while older data is gradually phased out, keeping the memory footprint manageable and focused on the most relevant frames. Capping the total number of memory frames at 15 is a practical compromise: it avoids excessive memory usage and preserves real-time performance while still capturing sufficient temporal evolution of the scene. A history of 15 frames is generally adequate for exploiting temporal correlations in VOS tasks, providing enough context for object tracking and appearance prediction without imposing excessive computational overhead or compromising system responsiveness; extending the limit further yields diminishing returns, as additional frames rarely improve performance but do increase the computational load.
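The memory-frame policy described above (permanent first frame, insertion every third frame, FIFO eviction with at most 15 stored frames) can be summarized as follows; whether the first frame counts toward the limit of 15 is an implementation detail, and this sketch keeps it separate.

```python
from collections import deque

class MemoryFrameBank:
    """Minimal sketch of the memory-frame policy: the first (user-provided)
    frame is kept permanently, new memory frames are added every
    `interval`-th frame, and at most `max_frames` recent frames are kept
    in a FIFO queue."""

    def __init__(self, interval=3, max_frames=15):
        self.interval = interval
        self.first_frame = None                  # permanent reference frame
        self.recent = deque(maxlen=max_frames)   # FIFO: oldest dropped first

    def maybe_add(self, frame_index, encoded_frame):
        if frame_index == 0:
            self.first_frame = encoded_frame
        elif frame_index % self.interval == 0:
            self.recent.append(encoded_frame)    # evicts the oldest when full

    def memory_frames(self):
        first = [self.first_frame] if self.first_frame is not None else []
        return first + list(self.recent)
```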
The noise in the pixel memory grows with its size, and performance can degrade on long sequences. Based on this observation, we filter the affinities to retain only the top-k entries, which effectively suppresses noise regardless of sequence length while keeping the overhead of the filtering itself negligible. In our implementation, we apply top-k filtering [6] with k = 60 to the pixel memory: for each query, only the k memory entries with the highest attention scores contribute to the readout. Prioritizing the most relevant pixel memories in this way is crucial for maintaining accurate segmentation over time and prevents the memory from being overwhelmed with less significant information, which is especially important for consistent and reliable results on lengthy video sequences.
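A minimal sketch of this top-k readout is given below: per query pixel, only the k = 60 highest-scoring memory entries survive the softmax, and all other affinities are zeroed out. This illustrates the idea of [6] rather than reproducing its exact implementation.

```python
import torch
import torch.nn.functional as F

def topk_softmax_readout(affinity, memory_values, k=60):
    """Keep only the top-k memory entries per query pixel before the softmax,
    suppressing low-similarity (noisy) matches.

    affinity:      (Nm, Nq) similarity between memory and query pixels
    memory_values: (Cv, Nm)
    returns:       (Cv, Nq) readout
    """
    k = min(k, affinity.shape[0])
    topk_vals, topk_idx = affinity.topk(k, dim=0)   # (k, Nq)
    weights = F.softmax(topk_vals, dim=0)           # renormalize over top-k only
    # Scatter the top-k weights back into a sparse (Nm, Nq) weight matrix.
    full_weights = torch.zeros_like(affinity).scatter_(0, topk_idx, weights)
    return memory_values @ full_weights             # (Cv, Nq)
```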
In the final testing phase, we employed Test-Time Augmentation (TTA), a strategy that enhances the robustness and accuracy of predictions by incorporating augmented versions of the input data. TTA helps mitigate overfitting and improve generalization by simulating variations the model might encounter in real-world scenarios. The core idea is to make predictions not only on the original test samples but also on perturbed versions of them, using augmentations such as scaling, cropping, flipping, and rotation. These augmentations cover a broader range of transformations that the objects of interest might undergo, improving the model's ability to recognize and segment them accurately. TTA is particularly beneficial for video segmentation because of the dynamic nature of video data: frames can vary significantly due to camera motion, object motion, lighting changes, and other environmental factors. By providing the model with multiple perspectives of the same scene, TTA leads to more consistent and reliable segmentations over time.
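As one concrete (and deliberately simple) TTA variant, predictions on a frame and on its horizontal mirror can be averaged before taking the final label; the augmentation set actually used for the challenge submission may be larger.

```python
import torch

def tta_predict(model, frame):
    """Average the model's soft mask prediction over the original frame and
    its horizontal flip. `model` is a placeholder callable returning an
    (num_objects, H, W) tensor of soft mask scores."""
    prob = model(frame)                                   # predict on the original frame
    prob_flip = model(torch.flip(frame, dims=[-1]))       # predict on the mirrored frame
    prob = (prob + torch.flip(prob_flip, dims=[-1])) / 2  # un-flip and average
    return prob.argmax(dim=0)                             # final hard segmentation
```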
4 Evaluation Metrics
To evaluate the performance of our model, we compute the Jaccard value (J), the F-Measure (F), and the mean of J and F.
4.1 Jaccard Value (J).
The Jaccard value, also known as Intersection over Union (IoU), measures the similarity between two sets. For a predicted segmentation mask $P$ and a ground truth segmentation mask $G$, the Jaccard value is defined as:

$$\mathcal{J} = \frac{|P \cap G|}{|P \cup G|} = \frac{\sum_{i} p_i\, g_i}{\sum_{i} \left( p_i + g_i - p_i\, g_i \right)} \qquad (1)$$

where $p_i$ and $g_i$ denote the value of the $i$-th pixel in the predicted and ground truth masks, respectively. The Jaccard value ranges from 0 to 1, with higher values indicating better performance.
4.2 F-Measure (F).
The F-Measure is a metric that combines Precision and Recall, commonly used to evaluate the performance of binary classification models. It is calculated as follows:
$$\mathcal{F} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (2)$$

where

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)$$

and

$$\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (4)$$

with $TP$, $FP$, and $FN$ denoting the numbers of true positive, false positive, and false negative pixels, respectively. The F-Measure also ranges from 0 to 1, with higher values indicating better model performance in handling positive and negative samples.
4.3 Mean of J and F.
To comprehensively evaluate the model’s performance, we compute the mean of the Jaccard value (J) and the F-Measure (F):
$$\mathcal{J}\&\mathcal{F} = \frac{\mathcal{J} + \mathcal{F}}{2} \qquad (5)$$
These metrics together provide a robust assessment of the segmentation model’s accuracy and consistency, offering insights into its performance in predicting segmentation masks.
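For reference, the sketch below computes these metrics for a pair of binary (boolean) masks using the pixel-level definitions given above; the official benchmark tooling may compute F on mask boundaries, so this is an illustrative approximation.

```python
import numpy as np

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (1): intersection over union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def f_measure(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eqs. (2)-(4): harmonic mean of pixel precision and recall."""
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, np.logical_not(gt)).sum()
    fn = np.logical_and(np.logical_not(pred), gt).sum()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0

def j_and_f(pred: np.ndarray, gt: np.ndarray) -> float:
    """Eq. (5): mean of J and F."""
    return (jaccard(pred, gt) + f_measure(pred, gt)) / 2
```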
Table 1: Leaderboard of the development phase.

| User | J | F | J&F |
|---|---|---|---|
| PCL_MDS | 0.7235 (1) | 0.8044 (1) | 0.7640 (1) |
| yahooo | 0.7101 (2) | 0.7899 (3) | 0.7500 (3) |
| valgab | 0.7095 (3) | 0.7923 (2) | 0.7509 (2) |
| xzs123456 | 0.7011 (4) | 0.7779 (4) | 0.7395 (4) |
| ISS | 0.6892 (5) | 0.7705 (5) | 0.7299 (5) |
Table 2: Leaderboard of the test phase.

| User | J | F | J&F |
|---|---|---|---|
| PCL_MDS | 0.8101 (1) | 0.8789 (1) | 0.8445 (1) |
| Yao_Xu_MTLab | 0.8007 (2) | 0.8683 (2) | 0.8345 (2) |
| ISS | 0.7879 (3) | 0.8559 (3) | 0.8219 (3) |
| xsong2023 | 0.7873 (4) | 0.8544 (4) | 0.8208 (4) |
| yangdonghan50 | 0.7799 (5) | 0.8480 (5) | 0.8139 (5) |
5 Comparison with Other Methods
In the 1st Complex Video Object Segmentation Challenge, our (ISS) method demonstrated strong performance in both the development and test phases. The leaderboards for the development and test phases are presented in Tables 1 and 2, respectively. Our method achieved Jaccard values (J) and F-Measures (F) that outperformed most other participants. Specifically, in the development phase, our method attained a Jaccard value of 0.6892 and an F-Measure of 0.7705, resulting in a combined J&F score of 0.7299. Similarly, in the test phase, our method achieved a Jaccard value of 0.7799, an F-Measure of 0.8480, and a combined J&F score of 0.8139. These results highlight the effectiveness and robustness of our method.
6 Conclusion
In this study, we developed a method for Video Object Segmentation (VOS) drawing inspiration from the Cutie model. We examined key factors such as object memory management, the number of memory frames, and input resolution, assessing their impact on segmentation performance. Our approach was rigorously evaluated on the MOSE dataset, achieving a notable J&F score of 0.8139, which earned us third place in the test phase. These results underscore the robustness of our method in addressing the challenges posed by complex occlusions in video sequences.
References
- [1] Ali Athar, Jonathon Luiten, Paul Voigtlaender, Tarasha Khurana, Achal Dave, Bastian Leibe, and Deva Ramanan. Burst: A benchmark for unifying object recognition, segmentation and tracking in video. In WACV, 2023.
- [2] Maksym Bekuzarov, Ariana Bermudez, Joon-Young Lee, and Hao Li. Xmem++: Production-level video segmentation from few annotated frames. In ICCV, 2023.
- [3] Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, and Alexander Schwing. Putting the object back into video object segmentation. arXiv preprint, 2023.
- [4] Ho Kei Cheng and Alexander G Schwing. Xmem: Long-term video object segmentation with an atkinson-shiffrin memory model. In ECCV, 2022.
- [5] Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. In CVPR, 2021.
- [6] Ho Kei Cheng, Yu-Wing Tai, and Chi-Keung Tang. Modular interactive video object segmentation: Interaction-to-mask, propagation and difference-aware fusion. In CVPR, 2021.
- [7] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. MeViS: A large-scale benchmark for video segmentation with motion expressions. In ICCV, 2023.
- [8] Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, Philip HS Torr, and Song Bai. MOSE: A new dataset for video object segmentation in complex scenes. arXiv preprint, 2023.
- [9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [10] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, 2019.
- [11] Seoung Wug Oh, Joon-Young Lee, Ning Xu, and Seon Joo Kim. Video object segmentation using space-time memory networks. In ICCV, 2019.
- [12] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.
- [13] Vladimír Petrík, Mohammad Nomaan Qureshi, Josef Sivic, and Makar Tapaswi. Learning object manipulation skills from video via approximate differentiable physics. In IROS, 2022.
- [14] Zongxin Yang, Yunchao Wei, and Yi Yang. Associating objects with transformers for video object segmentation. In NeurIPS, 2021.