🤖 AI Summary
Large multimodal models (LMMs) often drift from image content during long-chain visual reasoning, over-relying on textual logic and thereby introducing visual hallucinations and erroneous conclusions. To address this, we propose a training-free, decoupled visual reasoning framework that separates logical inference, performed by large language models (LLMs), from visual perception, invoked on demand from large vision models (LVMs). Our approach integrates dynamic visual question-answering queries, chain-of-thought reasoning, and context-aware reorganization to tightly coordinate the two modules. This work presents the first plug-and-play decoupling of reasoning and perception, ensuring that every reasoning step is grounded in actual visual evidence. Evaluated across multiple visual reasoning benchmarks, our method significantly reduces vision-irrelevant errors and substantially improves reasoning accuracy, demonstrating both its effectiveness in preserving visual fidelity and its strong generalization capability.
📝 Abstract
Significant advancements in the reasoning capabilities of Large Language Models (LLMs) are now driven by test-time scaling laws, particularly those leveraging extended Chain-of-Thought (CoT) reasoning. Inspired by these breakthroughs, researchers have extended these paradigms to Large Multimodal Models (LMMs). However, a critical limitation emerges: as their reasoning chains extend, LMMs increasingly rely on textual logic, progressively losing grounding in the underlying visual information. This leads to reasoning paths that diverge from the image content, culminating in erroneous conclusions. To address this, we introduce a strikingly simple yet effective training-free visual-reasoning pipeline. The core concept is to decouple the reasoning and perception processes. A powerful LLM orchestrates the high-level reasoning, strategically interrogating an LMM to extract the specific visual information required for its logical chain. The LMM, in turn, functions exclusively as a visual question-answering engine, supplying the necessary perceptual details on demand. This lightweight, plug-and-play approach requires no additional training or architectural changes. Comprehensive evaluations validate that our framework effectively governs the visual reasoning process, leading to a significant reduction in visually unfounded reasoning steps and a substantial improvement in reasoning fidelity.
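The orchestration loop described above (an LLM reasoner that alternates logical steps with on-demand perception queries to an LMM) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: `llm_reason` and `lmm_answer` are hypothetical stand-ins for real model calls, and the toy image is a dict rather than pixels.

```python
def llm_reason(question, evidence):
    """Toy reasoner step: either request more visual evidence or conclude.

    Returns ("ASK", visual_query) when a perceptual fact is missing,
    or ("ANSWER", final_answer) once the logical chain is grounded.
    """
    if "color" not in evidence:
        return ("ASK", "What color is the object?")
    return ("ANSWER", f"The object is {evidence['color']}.")


def lmm_answer(image, visual_query):
    """Toy VQA engine: answers a single perception query about the image."""
    return image.get("color", "unknown")


def decoupled_reasoning(image, question, max_steps=5):
    """Alternate high-level reasoning (LLM) with on-demand perception (LMM).

    `evidence` accumulates the visual facts gathered so far, so each new
    reasoning step is conditioned on actual visual answers rather than on
    the model's unverified textual assumptions.
    """
    evidence = {}
    for _ in range(max_steps):
        action, payload = llm_reason(question, evidence)
        if action == "ANSWER":
            return payload
        # Ground the next reasoning step in a fresh perceptual answer.
        evidence["color"] = lmm_answer(image, payload)
    return "unresolved"


print(decoupled_reasoning({"color": "red"}, "What color is the object?"))
```

In a real system the two toy functions would be calls to a strong text-only LLM and a multimodal VQA model respectively; the key design choice shown here is that the reasoner never sees the image directly, only the evidence dictionary of answered queries.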