🤖 AI Summary
This work addresses few-shot referring video object segmentation in both single-object (FS-RVOS) and multi-object (FS-RVMOS) settings, tackling three challenges: language–vision alignment, referential ambiguity among multiple objects, and cross-frame consistency modeling. We propose a method that jointly integrates cross-modal affinity modeling and instance-level temporal matching within a unified Transformer framework: cross-modal attention drives affinity graph construction, while contrastive learning guides instance sequence matching, and both components are optimized end-to-end. A lightweight few-shot adaptation module further enhances generalization with minimal additional parameters. The approach supports single- and multi-object scenarios under joint optimization, improving both segmentation accuracy and cross-video generalization. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, with mAP gains exceeding 5.2% in complex multi-object settings.
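The cross-modal affinity idea above can be illustrated with a minimal single-head cross-attention sketch: visual tokens act as queries over language tokens, and the row-normalized score matrix serves as a vision–language affinity map. This is an assumption-laden simplification; the actual module's learned projections, multi-head structure, and graph construction details are not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_affinity(visual, text):
    """Single-head cross-attention sketch (hypothetical helper):
    visual tokens (queries) attend to language tokens (keys/values);
    the softmaxed score matrix is a vision-language affinity map."""
    d_k = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d_k)   # (N_vis, N_txt) affinity logits
    affinity = softmax(scores, axis=-1)       # each visual token's distribution over words
    attended = affinity @ text                # language-conditioned visual features
    return affinity, attended

rng = np.random.default_rng(0)
visual = rng.standard_normal((6, 8))  # e.g. 6 spatial tokens, feature dim 8
text = rng.standard_normal((3, 8))    # e.g. 3 word tokens, same dim
aff, out = cross_modal_affinity(visual, text)
```

In a real model the two modalities would first pass through learned query/key/value projections; the sketch skips those to keep the affinity computation itself visible.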
📝 Abstract
Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We study its few-shot setting and propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy; the matching strategy further extends the model to few-shot multi-object segmentation (FS-RVMOS). Experiments show that FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
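To make the instance sequence matching strategy concrete, here is a hedged, brute-force stand-in: per-frame instance embeddings are linked across frames by choosing the assignment that maximizes total cosine similarity. The paper's actual strategy (e.g. its contrastive objective and matching algorithm) is not reproduced here; the function name and greedy brute-force search are illustrative assumptions.

```python
import numpy as np
from itertools import permutations

def match_sequences(prev_emb, curr_emb):
    """Hypothetical sketch: bipartite matching of instance embeddings
    between two frames by exhaustively scoring permutations of the
    current frame's instances against the previous frame's.
    Returns perm where perm[i] is the current-frame index matched
    to previous-frame instance i."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    n = len(prev_emb)
    best_perm, best_score = None, -np.inf
    for perm in permutations(range(n)):
        score = sum(cos(prev_emb[i], curr_emb[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)

# toy check: current frame holds the same 3 instances, reordered
prev = np.eye(3)            # three orthogonal instance embeddings
curr = prev[[2, 0, 1]]      # same instances, permuted order
assignment = match_sequences(prev, curr)
```

Exhaustive permutation search is only viable for a handful of instances; a practical implementation would use Hungarian matching (e.g. `scipy.optimize.linear_sum_assignment`) for the same bipartite objective.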