🤖 AI Summary
This work addresses few-shot referring video object segmentation in both single-object (FS-RVOS) and multi-object (FS-RVMOS) settings, tackling three challenges: language–vision alignment, referential ambiguity among multiple objects, and cross-frame consistency modeling. We propose a method that jointly integrates cross-modal affinity modeling and instance-level temporal matching within a unified Transformer framework: cross-modal attention drives affinity graph construction, while contrastive learning guides instance sequence matching, and both components are optimized end-to-end. A lightweight few-shot adaptation module further enhances generalization with minimal additional parameters. The approach supports single- and multi-object scenarios under joint optimization, improving both segmentation accuracy and cross-video generalization. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, with mAP gains exceeding 5.2% in complex multi-object settings.
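The cross-modal affinity idea above can be illustrated with a minimal single-head cross-attention sketch: visual tokens act as queries over language tokens, and the row-normalized score matrix serves as a vision–language affinity map. This is an assumption-laden simplification; the actual module's learned projections, multi-head structure, and graph construction details are not specified here.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_affinity(visual, text):
    """Single-head cross-attention sketch (hypothetical helper):
    visual tokens (queries) attend to language tokens (keys/values);
    the softmaxed score matrix is a vision-language affinity map."""
    d_k = visual.shape[-1]
    scores = visual @ text.T / np.sqrt(d_k)   # (N_vis, N_txt) affinity logits
    affinity = softmax(scores, axis=-1)       # each visual token's distribution over words
    attended = affinity @ text                # language-conditioned visual features
    return affinity, attended

rng = np.random.default_rng(0)
visual = rng.standard_normal((6, 8))  # e.g. 6 spatial tokens, feature dim 8
text = rng.standard_normal((3, 8))    # e.g. 3 word tokens, same dim
aff, out = cross_modal_affinity(visual, text)
```

In a real model the two modalities would first pass through learned query/key/value projections; the sketch skips those to keep the affinity computation itself visible.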
📝 Abstract
Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We study its few-shot setting and propose FS-RVOS, a Transformer-based model with two key components: a cross-modal affinity module and an instance sequence matching strategy; the matching strategy further extends the model to few-shot multi-object segmentation (FS-RVMOS). Experiments show that FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
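To make the instance sequence matching strategy concrete, here is a hedged, brute-force stand-in: per-frame instance embeddings are linked across frames by choosing the assignment that maximizes total cosine similarity. The paper's actual strategy (e.g. its contrastive objective and matching algorithm) is not reproduced here; the function name and greedy brute-force search are illustrative assumptions.

```python
import numpy as np
from itertools import permutations

def match_sequences(prev_emb, curr_emb):
    """Hypothetical sketch: bipartite matching of instance embeddings
    between two frames by exhaustively scoring permutations of the
    current frame's instances against the previous frame's.
    Returns perm where perm[i] is the current-frame index matched
    to previous-frame instance i."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    n = len(prev_emb)
    best_perm, best_score = None, -np.inf
    for perm in permutations(range(n)):
        score = sum(cos(prev_emb[i], curr_emb[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_perm, best_score = perm, score
    return list(best_perm)

# toy check: current frame holds the same 3 instances, reordered
prev = np.eye(3)            # three orthogonal instance embeddings
curr = prev[[2, 0, 1]]      # same instances, permuted order
assignment = match_sequences(prev, curr)
```

Exhaustive permutation search is only viable for a handful of instances; a practical implementation would use Hungarian matching (e.g. `scipy.optimize.linear_sum_assignment`) for the same bipartite objective.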