Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought

📅 2025-05-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current multimodal large language models (MLLMs) often attend to the wrong visual regions, which biases their reasoning on vision-centric tasks. To address this, the paper proposes Argus, which uses object-centric grounding as a visual chain-of-thought signal: object-level visual grounding, language-conditioned attention modeling, and multi-stage vision–language joint reasoning together enable goal-directed, explicit region focusing. By embedding fine-grained object localization directly into the reasoning chain, Argus improves both visual fidelity and interpretability. Evaluated on vision grounding and multimodal reasoning benchmarks—including RefCOCO+, GQA, and POPE—Argus consistently surpasses state-of-the-art methods. These results support the claim that a vision-centric paradigm, grounded in explicit object representations, yields real improvements in multimodal intelligence, particularly on visually dominated tasks.

📝 Abstract
Recent advances in multimodal large language models (MLLMs) have demonstrated remarkable capabilities in vision-language tasks, yet they often struggle with vision-centric scenarios where precise visual focus is needed for accurate reasoning. In this paper, we introduce Argus to address these limitations with a new visual attention grounding mechanism. Our approach employs object-centric grounding as visual chain-of-thought signals, enabling more effective goal-conditioned visual attention during multimodal reasoning tasks. Evaluations on diverse benchmarks demonstrate that Argus excels in both multimodal reasoning tasks and referring object grounding tasks. Extensive analysis further validates various design choices of Argus, and reveals the effectiveness of explicit language-guided visual region-of-interest engagement in MLLMs, highlighting the importance of advancing multimodal intelligence from a visual-centric perspective. Project page: https://yunzeman.github.io/argus/
Problem

Research questions and friction points this paper is trying to address.

Addresses vision-centric reasoning limitations in MLLMs
Improves visual attention grounding for accurate reasoning
Enhances multimodal intelligence with visual-centric focus
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual attention grounding mechanism for MLLMs
Object-centric grounding as visual chain-of-thought
Language-guided visual region-of-interest engagement
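The mechanism the bullets describe can be sketched as a simple two-stage loop: the model first predicts a language-conditioned region of interest, then reasons over the re-sampled crop as an intermediate chain-of-thought signal. A minimal illustrative sketch follows; all names (`predict_roi`, `crop`, `answer_with_visual_cot`) are hypothetical and not the paper's actual API.

```python
# Hypothetical sketch of "object-centric grounding as visual
# chain-of-thought". Images are represented as nested lists of
# pixels for simplicity; a real system would use tensors and a
# trained MLLM.

from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box in normalized [0, 1] coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float


def predict_roi(image, question):
    """Stage 1 (assumed): the model decodes a bounding box for the
    region the question refers to (language-guided RoI engagement).
    Here we return a fixed placeholder box instead of a prediction."""
    return Box(0.25, 0.25, 0.75, 0.75)


def crop(image, box):
    """Re-sample the RoI so later reasoning attends to it directly."""
    h, w = len(image), len(image[0])
    return [row[int(box.x0 * w):int(box.x1 * w)]
            for row in image[int(box.y0 * h):int(box.y1 * h)]]


def answer_with_visual_cot(image, question):
    """Stage 2 (assumed): condition reasoning on the original image
    plus the grounded crop, treating the box as an explicit
    intermediate step in the reasoning chain. This sketch just
    returns the intermediate artifacts."""
    roi = predict_roi(image, question)
    focused = crop(image, roi)
    return roi, focused
```

The key design point the bullets emphasize is that the box is not a side output: it sits inside the reasoning chain, so the answer is conditioned on an explicitly grounded region rather than on diffuse global attention.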