🤖 AI Summary
This work addresses the high cost of pixel-level annotations in video semantic segmentation by proposing an efficient learning paradigm that leverages unlabeled video frames alongside coarse-grained labels. The approach uses the Segment Anything Model (SAM) and its successor, SAM 2, to automatically generate and refine segmentation masks. Systematic evaluation demonstrates that this method reduces human annotation effort by roughly a third while maintaining performance comparable to fully supervised baselines. Furthermore, the study finds that inter-frame diversity exerts a substantially greater influence on model performance than the sheer number of frames, underscoring the critical role of data diversity in weakly supervised video segmentation.
📝 Abstract
Present-day deep neural networks for video semantic segmentation require large numbers of fine-grained pixel-level annotations to achieve the best possible results. Obtaining such annotations, however, is very expensive. On the other hand, raw, unannotated video frames are practically free to obtain, and coarse annotations, which do not require precise boundaries, are also much cheaper. This paper investigates approaches to reducing the annotation cost of video segmentation datasets by utilising these resources. We show that with state-of-the-art segmentation foundation models, the Segment Anything Model (SAM) and Segment Anything Model 2 (SAM 2), we can exploit both unannotated frames and coarse annotations to alleviate the effort of manually annotating video segmentation datasets by automating mask generation. Our investigation suggests that, used appropriately, these resources can reduce the need for annotation by a third while achieving similar video semantic segmentation performance. More significantly, our analysis suggests that the variety of frames in a dataset matters more than the number of frames for obtaining the best performance.