🤖 AI Summary
This work addresses the limitations of existing referring video object segmentation methods, which either rely on large-scale supervised fine-tuning—compromising generalization—or perform poorly in zero-shot settings. To overcome these challenges, we propose a multi-agent collaborative framework that decomposes the task into a stepwise reasoning process through an alternating mechanism of inference and reflection, enhanced by a self-feedback loop for iterative refinement. Key innovations include a coarse-to-fine frame sampling strategy, a dynamic focus layout, and a question-answering-style chain of reflection, enabling seamless integration of emerging multimodal foundation models without any fine-tuning. Our approach achieves state-of-the-art performance across five established benchmarks, significantly outperforming both supervised fine-tuned and zero-shot counterparts while offering plug-and-play extensibility.
📝 Abstract
Referring Video Object Segmentation (RVOS) aims to segment objects in videos based on textual queries. Current methods mainly rely on large-scale supervised fine-tuning (SFT) of Multi-modal Large Language Models (MLLMs). However, this paradigm suffers from heavy data dependence and limited scalability in the face of the rapid evolution of MLLMs. Although recent zero-shot approaches offer a flexible alternative, their performance remains significantly behind SFT-based methods due to their simplistic workflow designs. To address these limitations, we propose **Refer-Agent**, a collaborative multi-agent system with an alternating reasoning-reflection mechanism. This system decomposes RVOS into a step-by-step reasoning process. During reasoning, we introduce a Coarse-to-Fine frame selection strategy to ensure frame diversity and textual relevance, along with a Dynamic Focus Layout that adaptively adjusts the agent's visual focus. Furthermore, we propose a Chain-of-Reflection mechanism, which employs a Questioner-Responder pair to generate a self-reflection chain, enabling the system to verify intermediate results and generate feedback for the next round of reasoning refinement. Extensive experiments on five challenging benchmarks demonstrate that Refer-Agent significantly outperforms state-of-the-art methods, including both SFT-based models and zero-shot approaches. Moreover, Refer-Agent is flexible and enables fast integration of new MLLMs without any additional fine-tuning cost. Code will be released at https://github.com/iSEE-Laboratory/Refer-Agent.
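The alternating reasoning-reflection workflow with a self-feedback loop, as described in the abstract, could be sketched roughly as follows. Every function and criterion here is a hypothetical toy stand-in for the MLLM-backed agents (frame selection, Questioner-Responder reflection), not the authors' actual implementation:

```python
# Toy sketch of an alternating reasoning-reflection loop with self-feedback.
# All agents are simulated with simple deterministic rules for illustration.

def reason(query, frames, rejected):
    """Hypothetical reasoning agent: propose the first frame that matches a
    toy 'relevance' check (divisibility by the query value) and was not
    rejected in an earlier round (the self-feedback)."""
    for f in frames:
        if f not in rejected and f % query == 0:
            return f
    return None

def reflect(candidate):
    """Hypothetical Questioner-Responder pair: verify the intermediate
    result with a second toy check; rejection becomes feedback."""
    return candidate is not None and candidate >= 10

def refer_agent(query, frames, max_rounds=5):
    """Alternate reasoning and reflection; each rejected candidate is fed
    back into the next reasoning round until a result is accepted."""
    rejected = set()
    candidate = None
    for _ in range(max_rounds):
        candidate = reason(query, frames, rejected)
        if reflect(candidate):
            return candidate          # verified result
        rejected.add(candidate)       # feedback for the next round
    return candidate                  # best effort after the final round
```

For example, `refer_agent(3, [4, 6, 9, 12])` rejects 6 and 9 in the first two rounds before accepting 12, illustrating how reflection feedback steers later reasoning rounds.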