🤖 AI Summary
This work addresses the limitations of existing referring video object segmentation methods, which either rely on large-scale supervised fine-tuning—compromising generalization—or perform poorly in zero-shot settings. To overcome these challenges, we propose a multi-agent collaborative framework that decomposes the task into a stepwise reasoning process through an alternating mechanism of inference and reflection, enhanced by a self-feedback loop for iterative refinement. Key innovations include a coarse-to-fine frame sampling strategy, a dynamic focus layout, and a question-answering-style chain of reflection, enabling seamless integration of emerging multimodal foundation models without any fine-tuning. Our approach achieves state-of-the-art performance across five established benchmarks, significantly outperforming both supervised fine-tuned and zero-shot counterparts while offering plug-and-play extensibility.
📝 Abstract
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability in the face of the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods due to their simplistic workflow designs. To address these limitations, we propose **Refer-Agent**, a collaborative multi-agent system with an alternating reasoning-reflection mechanism. This system decomposes RVOS into a step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generate feedback for the next round of reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning cost. Code will be released at https://github.com/iSEE-Laboratory/Refer-Agent.
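The alternating reasoning-reflection workflow with a self-feedback loop, as described in the abstract, could be sketched roughly as follows. Every function and criterion here is a hypothetical toy stand-in for the MLLM-backed agents (frame selection, Questioner-Responder reflection), not the authors' actual implementation:

```python
# Toy sketch of an alternating reasoning-reflection loop with self-feedback.
# All agents are simulated with simple deterministic rules for illustration.

def reason(query, frames, rejected):
    """Hypothetical reasoning agent: propose the first frame that matches a
    toy 'relevance' check (divisibility by the query value) and was not
    rejected in an earlier round (the self-feedback)."""
    for f in frames:
        if f not in rejected and f % query == 0:
            return f
    return None

def reflect(candidate):
    """Hypothetical Questioner-Responder pair: verify the intermediate
    result with a second toy check; rejection becomes feedback."""
    return candidate is not None and candidate >= 10

def refer_agent(query, frames, max_rounds=5):
    """Alternate reasoning and reflection; each rejected candidate is fed
    back into the next reasoning round until a result is accepted."""
    rejected = set()
    candidate = None
    for _ in range(max_rounds):
        candidate = reason(query, frames, rejected)
        if reflect(candidate):
            return candidate          # verified result
        rejected.add(candidate)       # feedback for the next round
    return candidate                  # best effort after the final round
```

For example, `refer_agent(3, [4, 6, 9, 12])` rejects 6 and 9 in the first two rounds before accepting 12, illustrating how reflection feedback steers later reasoning rounds.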