Few-Shot Referring Video Single- and Multi-Object Segmentation via Cross-Modal Affinity with Instance Sequence Matching

📅 2025-04-18
🤖 AI Summary
This work addresses few-shot referring video object segmentation in both single-object (FS-RVOS) and multi-object (FS-RVMOS) settings, tackling challenges in language–vision alignment, referential ambiguity among multiple objects, and cross-frame consistency modeling. We propose a method that integrates cross-modal affinity modeling and instance-level temporal matching within a single Transformer framework. Specifically, we build an affinity graph driven by cross-modal attention and use contrastive learning to guide instance sequence matching; the two components are optimized jointly, end to end. A lightweight few-shot adaptation module is also incorporated to improve generalization with few additional parameters. The approach handles single- and multi-object scenarios under joint optimization, improving segmentation accuracy and cross-video generalization. Experiments report state-of-the-art performance across multiple benchmarks, with mAP gains exceeding 5.2% in complex multi-object settings.

📝 Abstract
Referring video object segmentation (RVOS) aims to segment objects in videos guided by natural language descriptions. We propose FS-RVOS, a Transformer-based model for the few-shot setting with two key components: a cross-modal affinity module and an instance sequence matching strategy. The matching strategy extends FS-RVOS to multi-object segmentation (FS-RVMOS). Experiments show that FS-RVOS and FS-RVMOS outperform state-of-the-art methods across diverse benchmarks, demonstrating superior robustness and accuracy.
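
As a rough illustration of what a cross-modal affinity module computes, the sketch below applies plain scaled dot-product attention from visual tokens to word tokens. It is a minimal sketch and not the paper's implementation: the function names (`softmax`, `cross_modal_affinity`), the toy 2-D features, and the absence of learned projections are all assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_affinity(visual, text):
    """Attend from each visual token to every text token.

    visual: list of N feature vectors (e.g. one per patch), each of length d
    text:   list of M feature vectors (e.g. one per word), each of length d
    Returns (affinity, fused):
      affinity: N x M matrix whose rows sum to 1
      fused:    N language-aware vectors (affinity-weighted sums of text)
    """
    d = len(text[0])
    scale = 1.0 / math.sqrt(d)
    affinity, fused = [], []
    for v in visual:
        # Scaled dot product between this visual token and each word.
        scores = [scale * sum(a * b for a, b in zip(v, t)) for t in text]
        weights = softmax(scores)
        affinity.append(weights)
        fused.append([sum(w * t[k] for w, t in zip(weights, text))
                      for k in range(d)])
    return affinity, fused

# Toy data (hypothetical): two visual tokens, three word tokens, d = 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
affinity, fused = cross_modal_affinity(visual_tokens, text_tokens)
```

In this toy case each visual token puts most of its weight on the word whose feature points in the same direction. A real model would add learned query, key, and value projections and multiple heads.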

Problem

Research questions and friction points this paper is trying to address.

Segment objects in videos using language descriptions
Improve few-shot video object segmentation accuracy
Extend single-object to multi-object segmentation capability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based model for few-shot referring video object segmentation
Cross-modal affinity module for language–vision alignment
Instance sequence matching enables multi-object segmentation
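
To make the idea of matching whole instance sequences concrete, the sketch below pairs candidate mask sequences with referred objects by minimizing a sequence-level cost. It is a minimal sketch under stated assumptions and not the paper's procedure: the cost (one minus mean per-frame IoU), the exhaustive search over assignments, and all names (`mask_iou`, `sequence_cost`, `match_sequences`) are hypothetical choices made for illustration.

```python
from itertools import permutations

def mask_iou(a, b):
    """IoU of two binary masks given as flat lists of 0/1."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

def sequence_cost(pred_seq, target_seq):
    """1 - mean per-frame IoU between two equal-length mask sequences."""
    ious = [mask_iou(p, t) for p, t in zip(pred_seq, target_seq)]
    return 1.0 - sum(ious) / len(ious)

def match_sequences(pred_seqs, target_seqs):
    """One-to-one matching of targets to candidates with minimum total cost.

    Returns (assignment, total) where assignment[i] is the index of the
    candidate sequence matched to target i. Assumes there are at least as
    many candidates as targets. The search is exhaustive, so it only suits
    a handful of objects; a Hungarian solver would replace it at scale.
    """
    cost = [[sequence_cost(p, t) for p in pred_seqs] for t in target_seqs]
    best, best_total = None, float("inf")
    for perm in permutations(range(len(pred_seqs)), len(target_seqs)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best_total:
            best, best_total = list(perm), total
    return best, best_total

# Toy data (hypothetical): 2 frames of 4 pixels, 2 targets, 3 candidates.
targets = [
    [[1, 1, 0, 0], [1, 1, 0, 0]],  # object A
    [[0, 0, 1, 1], [0, 0, 0, 1]],  # object B
]
candidates = [
    [[0, 0, 1, 1], [0, 0, 1, 1]],  # overlaps object B
    [[1, 0, 0, 0], [1, 1, 0, 0]],  # overlaps object A
    [[0, 0, 0, 0], [0, 0, 0, 0]],  # empty sequence
]
assignment, total = match_sequences(candidates, targets)
```

Because the cost is computed over the whole sequence instead of per frame, each object keeps a single identity across frames, which is what allows one set of candidates to serve several referred objects at once.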

Authors

Heng Liu
Guangxi Minzu University
adaptive fuzzy control, fractional-order system, nonlinear system, robust control, neural network
Guanghui Li
School of Computer Science and Technology, Anhui University of Technology, Maxiang Road, Ma'anshan, 243032, China.
Mingqi Gao
Department of Computer Science and Engineering, Southern University of Science and Technology, Xueyuan Avenue, Shenzhen, 518055, China.
Xiantong Zhen
United Imaging
Medical Image Analysis, Machine Learning, Computer Vision
Feng Zheng
Department of Computer Science and Engineering, Southern University of Science and Technology, Xueyuan Avenue, Shenzhen, 518055, China.
Yang Wang
School of Computer Science and Information Engineering, Hefei University of Technology, Feicui Road, Hefei, 230601, China.