Evaluating SAM2 for Video Semantic Segmentation

📅 2025-12-01
🤖 AI Summary
This work addresses three key challenges in video semantic segmentation (VSS): low spatial precision, poor temporal consistency, and difficulty modeling complex multi-object boundaries and scale variations. To this end, we introduce the Segment Anything Model 2 (SAM2) into dense VSS for the first time, proposing two novel fusion paradigms: (1) a parallel architecture that leverages SAM2 to generate high-fidelity initial masks, jointly optimized with a semantic segmentation network; and (2) a feature-driven architecture that extracts temporally aligned region features using SAM2 masks, followed by lightweight semantic classification and multi-frame result fusion. We further propose three innovations: mask generation with explicit temporal awareness, cross-frame feature alignment, and adaptive mask fusion. Experiments demonstrate significant improvements in boundary accuracy and inter-frame consistency across multiple VSS benchmarks, particularly for fine-grained object segmentation and long-term tracking scenarios.
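The first (parallel) paradigm can be sketched as a fusion step: a semantic segmentation network predicts per-pixel logits, and SAM2's high-fidelity object masks snap the labels to object boundaries. This is a minimal illustration only; the function name and the majority-vote fusion rule are assumptions, not the paper's exact design:

```python
import numpy as np

def refine_with_sam2_masks(seg_logits, masks):
    """Hypothetical sketch of the parallel paradigm: snap per-pixel
    predictions to SAM2 mask boundaries via majority vote per mask.

    seg_logits: (H, W, num_classes) logits from the semantic network
    masks: list of (H, W) boolean SAM2 object masks
    """
    pred = seg_logits.argmax(axis=-1)  # initial per-pixel prediction
    refined = pred.copy()
    for m in masks:
        if not m.any():
            continue
        # Assign the whole mask region its most frequent predicted label,
        # so label boundaries follow SAM2's precise object boundaries.
        labels, counts = np.unique(pred[m], return_counts=True)
        refined[m] = labels[np.argmax(counts)]
    return refined
```

In practice the paper jointly optimizes the two branches rather than fusing them post hoc; the vote above only conveys why precise SAM2 boundaries help.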

📝 Abstract
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos, capable of storing object-aware memories and transferring them temporally through memory blocks. While SAM2 excels in video object segmentation by providing dense segmentation masks based on prompts, extending it to dense Video Semantic Segmentation (VSS) poses challenges due to the need for spatial accuracy, temporal consistency, and the ability to track multiple objects with complex boundaries and varying scales. This paper explores the extension of SAM2 for VSS, focusing on two primary approaches and highlighting firsthand observations and common challenges faced during this process. The first approach involves using SAM2 to extract unique objects as masks from a given image, with a segmentation network employed in parallel to generate and refine initial predictions. The second approach utilizes the predicted masks to extract unique feature vectors, which are then fed into a simple network for classification. The resulting classifications and masks are subsequently combined to produce the final segmentation. Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
Problem

Research questions and friction points this paper is trying to address.

Extending SAM2 for dense Video Semantic Segmentation
Addressing spatial accuracy and temporal consistency challenges
Tracking multiple objects with complex boundaries and varying scales
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extracting unique object masks using SAM2
Classifying features with a simple network
Combining classifications and masks for segmentation
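The second (feature-driven) pipeline above can be sketched in a few lines: pool one feature vector per SAM2 mask, classify it with a lightweight head, and paint the class back onto the mask region. The function name, the mean-pooling choice, and the linear classifier are illustrative assumptions, not the paper's stated components:

```python
import numpy as np

def masked_pool_classify(features, masks, classifier_weights):
    """Hypothetical sketch of the feature-driven paradigm: one pooled
    feature vector per SAM2 mask, classified and painted back.

    features: (H, W, C) per-pixel feature map from a backbone
    masks: list of (H, W) boolean SAM2 object masks
    classifier_weights: (C, num_classes) lightweight linear classifier
    """
    h, w, _ = features.shape
    semantic = np.zeros((h, w), dtype=np.int64)  # 0 = unassigned/background
    for m in masks:
        if not m.any():
            continue
        region_feat = features[m].mean(axis=0)     # (C,) pooled region vector
        logits = region_feat @ classifier_weights  # (num_classes,)
        semantic[m] = int(np.argmax(logits)) + 1   # semantic classes 1-based
    return semantic
```

The paper additionally fuses results across frames; this single-frame sketch only shows how per-mask classification and the masks combine into a semantic map.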
👥 Authors
Syed Hesham Syed Ariff
School of EEE, Nanyang Technological University, Singapore 639798, Singapore.
Yun Liu
College of Computer Science, Nankai University, Tianjin 300350, China.
Guolei Sun
ETH Zurich
Jing Yang
State Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China.
Henghui Ding
Fudan University
Xue Geng
Institute for Infocomm Research, A*STAR, Singapore 138632, Singapore.
Xudong Jiang
School of EEE, Nanyang Technological University, Singapore 639798, Singapore.