RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) exhibit strong reasoning capabilities but lack explicit visual grounding and segmentation, leading to a disconnect between cognitive reasoning and visual perception. To address this, we propose a two-stage reasoning-driven structured visual segmentation framework: (1) a multimodal chain-of-thought visual prompting stage that generates interpretable region proposals; and (2) a vision-language segmentation module (VLSM) that refines these proposals into pixel-accurate masks. We introduce the novel "reasoning-guided segmentation" paradigm, explicitly modeling the synergy between multimodal reasoning and segmentation to unify cognitive inference with structured visual representation. Our method achieves +6.5 gIoU and +9.2 cIoU improvements on ReasonSeg, and attains 49.7 mAP on SegInW in the zero-shot setting, substantially outperforming prior approaches.

📝 Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable reasoning capability, yet they lack explicit mechanisms for visual grounding and segmentation, creating a gap between cognitive reasoning and visual perception. To bridge this gap, we introduce Reasoning Segmentation via Visual Prompting (RSVP), a novel framework that unifies multi-step multimodal reasoning with grounded visual understanding. RSVP is a two-stage structured framework that integrates reasoning-driven localization with segmentation refinement. In the reasoning stage, RSVP employs multimodal chain-of-thought visual prompts to help MLLMs understand queries and infer targets, generating interpretable region proposals that enhance visual grounding. In the segmentation stage, RSVP refines these proposals with a Vision-Language Segmentation Module (VLSM), which seamlessly integrates textual and visual cues to produce precise segmentation masks. By explicitly modeling the interaction between multimodal reasoning and segmentation, RSVP introduces a new paradigm for interpretable reasoning segmentation. It exploits MLLMs' inherent localization capabilities, enabling the models not only to reason about objects but also to generate structured visual representations. Our extensive experiments demonstrate that RSVP achieves state-of-the-art performance, surpassing prior methods by up to +6.5 gIoU and +9.2 cIoU on ReasonSeg, and reaching 49.7 mAP on SegInW under zero-shot settings. These results validate RSVP as an effective and scalable framework for integrating cognitive reasoning with structured visual understanding.
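The two-stage flow described in the abstract (reasoning-driven localization, then VLSM refinement) can be sketched as follows. This is a minimal illustration of the data flow only: the class, function, and method names (`RegionProposal`, `rsvp_pipeline`, `propose_region`, `refine`) are hypothetical stand-ins, not the authors' API.

```python
from dataclasses import dataclass

@dataclass
class RegionProposal:
    box: tuple          # (x0, y0, x1, y1) in image coordinates
    rationale: str      # chain-of-thought text explaining why this region was chosen

def rsvp_pipeline(image, query, mllm, vlsm):
    """Illustrative sketch of RSVP's two stages (names are hypothetical).

    Stage 1: the MLLM reasons over the query with chain-of-thought visual
    prompting and emits an interpretable region proposal.
    Stage 2: the Vision-Language Segmentation Module refines the proposal,
    combining textual and visual cues, into a pixel-level mask.
    """
    proposal = mllm.propose_region(image, query)    # reasoning stage (assumed interface)
    mask = vlsm.refine(image, proposal.box, query)  # segmentation stage (assumed interface)
    return proposal, mask
```

The key design point the abstract emphasizes is that the intermediate proposal is interpretable: the reasoning trace survives as an explicit artifact rather than being absorbed into end-to-end mask prediction.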
Problem

Research questions and friction points this paper is trying to address.

Bridges the gap between cognitive reasoning and visual perception in MLLMs
Unifies multimodal reasoning with grounded visual understanding via RSVP
Enhances segmentation precision through reasoning-driven localization and refinement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework for reasoning and segmentation
Multimodal chain-of-thought visual prompts
Vision-Language Segmentation Module integration
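The gIoU and cIoU numbers reported above are the standard reasoning-segmentation metrics: gIoU averages per-image IoU, while cIoU pools intersections and unions over the whole dataset. A minimal sketch, assuming binary mask inputs and the convention used by the ReasonSeg benchmark (function name is illustrative):

```python
import numpy as np

def giou_ciou(preds, gts):
    """Compute gIoU (mean of per-image IoUs) and cIoU (cumulative
    intersection over cumulative union) for lists of binary masks."""
    ious, inter_sum, union_sum = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)
        inter_sum += inter
        union_sum += union
    giou = float(np.mean(ious))
    ciou = inter_sum / union_sum if union_sum > 0 else 1.0
    return giou, ciou
```

Note that cIoU weights large objects more heavily than gIoU, which is why papers typically report both.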
Authors
Yi Lu (Opus AI Research, University of Toronto)
Jiawang Cao (Opus AI Research)
Yongliang Wu (Southeast University)
Bozheng Li (Opus AI Research, Brown University)
Licheng Tang (Opus AI Research)
Yangguang Ji (Opus AI Research)
Chong Wu (City University of Hong Kong)
Jay Wu (Opus AI Research)
Wenbo Zhu (Opus AI Research)