AI Summary
This work addresses the limitations of existing language-guided segmentation methods, which rely heavily on large-scale training and lack explicit visual-spatial reasoning, and therefore struggle to accurately segment arbitrarily described targets in zero-shot settings. To overcome these challenges, the authors propose Seg-Agent, a training-free framework that enables explicit multimodal reasoning through an iterative loop of generation, selection, and refinement. By integrating Set-of-Mark visual prompts, Seg-Agent orchestrates collaborative spatial reasoning between a multimodal large language model and a foundation segmentation model such as SAM. The authors present this as the first explicit multimodal chain-of-reasoning mechanism, one that moves beyond purely textual inference. Without any parameter updates, Seg-Agent achieves performance comparable to state-of-the-art trained methods. The authors also introduce Various-LangSeg, a new benchmark encompassing semantic, generic object, and complex reasoning segmentation tasks.
Abstract
Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.
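The generation, selection, and refinement loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract gives no interfaces, so every name here (`Candidate`, `mark_candidates`, `seg_agent_loop`, `toy_segmenter`, `toy_chooser`) is hypothetical. Regions are simplified to bounding boxes rather than masks, the segmentation model and the MLLM are replaced by plain callables, and the convergence rule (stop when the selection no longer changes) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); a stand-in for a mask


@dataclass
class Candidate:
    mark: int  # numeric label that would be drawn on the image (Set-of-Mark style)
    box: Box


# Hypothetical interfaces, not real library APIs:
# a segmenter proposes regions inside an optional focus region,
# a chooser (standing in for the MLLM) returns the mark it selects.
Segmenter = Callable[[Optional[Box]], List[Box]]
Chooser = Callable[[str, List[Candidate]], Optional[int]]


def mark_candidates(boxes: List[Box]) -> List[Candidate]:
    """Assign marks 1..N to the proposed regions."""
    return [Candidate(mark=i + 1, box=b) for i, b in enumerate(boxes)]


def seg_agent_loop(instruction: str, segmenter: Segmenter, chooser: Chooser,
                   max_rounds: int = 3) -> Optional[Box]:
    """Iterate generation -> selection -> refinement (assumed stopping rules)."""
    region: Optional[Box] = None  # None means "the whole image"
    selected: Optional[Box] = None
    for _ in range(max_rounds):
        # Generation: propose candidate regions and attach marks.
        candidates = mark_candidates(segmenter(region))
        if not candidates:
            break
        # Selection: the chooser picks one mark given the instruction.
        mark = chooser(instruction, candidates)
        chosen = next((c for c in candidates if c.mark == mark), None)
        if chosen is None or chosen.box == selected:
            break  # nothing chosen, or the selection has stopped changing
        # Refinement: focus the next round on the chosen region.
        selected = chosen.box
        region = selected
    return selected


def toy_segmenter(region: Optional[Box]) -> List[Box]:
    """Toy stand-in: split the focus region into left and right halves."""
    x0, y0, x1, y1 = region if region is not None else (0, 0, 100, 50)
    if x1 - x0 <= 25:
        return [(x0, y0, x1, y1)]
    mid = (x0 + x1) // 2
    return [(x0, y0, mid, y1), (mid, y0, x1, y1)]


def toy_chooser(instruction: str, candidates: List[Candidate]) -> Optional[int]:
    """Toy stand-in for the MLLM: always pick the rightmost candidate."""
    return max(candidates, key=lambda c: c.box[2]).mark
```

Calling `seg_agent_loop("the rightmost object", toy_segmenter, toy_chooser)` narrows the selection over successive rounds until the toy segmenter returns the same region again. In a real system the chooser would receive the image with the marks rendered on it, which is the visual feedback the abstract contrasts with text-only reasoning.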