🤖 AI Summary
Video semantic segmentation faces several challenges: high annotation costs, poor temporal consistency when image-based models are applied frame by frame, and the limited semantic understanding and high computational overhead of foundation models such as SAM2. This work proposes DiTTA, a novel framework that, for the first time, leverages temporal knowledge distillation from SAM2 under fully unsupervised conditions to initialize a video-aware model. By combining a lightweight temporal fusion module with a test-time adaptation strategy, DiTTA efficiently transforms static image segmentation models into temporally coherent video models without requiring any video annotations. Evaluated on VSPW and Cityscapes, the method achieves performance comparable to or even surpassing that of fully supervised approaches, while substantially outperforming zero-shot baselines that rely on repeated SAM2 invocations.
📝 Abstract
Fully supervised Video Semantic Segmentation (VSS) relies heavily on densely annotated video data, which limits its practical applicability. Alternatively, applying pre-trained Image Semantic Segmentation (ISS) models frame by frame avoids annotation costs but ignores crucial temporal coherence. Recent foundation models such as SAM2 enable high-quality mask propagation, yet they remain impractical for direct VSS because of their limited semantic understanding and computational overhead. In this paper, we propose DiTTA (Distillation-assisted Test-Time Adaptation), a novel framework that converts an ISS model into a temporally aware VSS model through efficient test-time adaptation (TTA), without annotated videos. DiTTA distills SAM2's temporal segmentation knowledge into the ISS model during a brief, single-pass initialization phase, complemented by a lightweight temporal fusion module that aggregates cross-frame context. Crucially, DiTTA generalizes robustly even when adapting on highly limited partial video snippets (e.g., the initial 10%), significantly outperforming zero-shot refinement approaches that repeatedly invoke SAM2 during inference. Extensive experiments on VSPW and Cityscapes demonstrate DiTTA's effectiveness: it achieves competitive or superior performance relative to fully supervised VSS methods, providing a practical, annotation-free solution for real-world VSS tasks.
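To make the distillation idea concrete, the following toy sketch shows one way class-agnostic mask tracks could be combined with per-frame class predictions to produce temporally consistent pseudo-labels. It is an illustration under stated assumptions, not DiTTA's actual procedure: the abstract does not specify how targets are built, and the majority-vote rule, the function name `track_consistent_labels`, and the flat-list data layout are all hypothetical. No SAM2 or ISS model is called; their outputs are stood in for by small hand-written lists.

```python
from collections import Counter


def track_consistent_labels(frame_preds, frame_tracks):
    """Assign each mask track the majority class voted by its pixels.

    frame_preds:  list of frames; each frame is a flat list of per-pixel
                  class ids (stand-in for an image model's predictions).
    frame_tracks: same shape; each entry is a track id (stand-in for a
                  class-agnostic mask propagated over time) or None for
                  pixels that belong to no track.
    Returns pseudo-labels of the same shape. Pixels inside a track take
    the track's majority class over all frames; other pixels keep their
    original prediction. (Hypothetical rule, chosen for illustration.)
    """
    votes = {}
    for preds, tracks in zip(frame_preds, frame_tracks):
        for cls, tid in zip(preds, tracks):
            if tid is not None:
                votes.setdefault(tid, Counter())[cls] += 1

    # One class per track: the most frequent prediction across the video.
    majority = {tid: c.most_common(1)[0][0] for tid, c in votes.items()}

    return [
        [majority[tid] if tid is not None else cls
         for cls, tid in zip(preds, tracks)]
        for preds, tracks in zip(frame_preds, frame_tracks)
    ]


# Three 4-pixel frames: the image model flickers between classes 1 and 2
# inside a single object (track 7); the last pixel is untracked background.
frame_preds = [[1, 1, 2, 0], [1, 2, 2, 0], [1, 1, 1, 0]]
frame_tracks = [[7, 7, 7, None]] * 3

pseudo_labels = track_consistent_labels(frame_preds, frame_tracks)
```

In this toy input, class 1 receives six of the nine votes inside track 7, so every tracked pixel is relabeled to class 1 and the untracked pixel keeps class 0. Pseudo-labels of this kind could then serve as targets for a standard cross-entropy loss during a single adaptation pass, which is one plausible reading of "distilling temporal segmentation knowledge" into an image model.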