Imagine while Reasoning in Space: Multimodal Visualization-of-Thought

📅 2025-01-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the weakness of purely language-based chain-of-thought (CoT) reasoning on complex spatial reasoning tasks, this paper proposes Multimodal Visualization-of-Thought (MVoT), a paradigm in which a multimodal large language model generates image visualizations of its reasoning trace alongside the text during inference. Its key contributions are: (1) a reasoning paradigm that produces combined text-image reasoning traces rather than text alone; (2) a token discrepancy loss, introduced into autoregressive MLLMs, that improves the visual coherence and fidelity of the generated images; and (3) an evaluation on several dynamic spatial reasoning tasks. Across these tasks MVoT performs competitively with CoT, and it shows robust improvements in the most challenging scenarios, where CoT fails.

📝 Abstract
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
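
The abstract names a token discrepancy loss but does not give its form. The following is a minimal sketch under one plausible reading: the loss penalizes probability mass that the model places on visual tokens whose codebook embeddings lie far from the embedding of the ground-truth token. The function names, the use of mean squared error as the distance, and the averaging over positions are all assumptions made for illustration, not the paper's stated definition.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mse(a, b):
    """Mean squared error between two embedding vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def token_discrepancy_loss(logits, target_ids, codebook):
    """Hypothetical token discrepancy loss (illustrative assumption).

    logits:     one list of scores over the visual vocabulary per image-token position
    target_ids: ground-truth visual token id for each position
    codebook:   one embedding vector per visual token id
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        probs = softmax(scores)
        # Distance from the ground-truth token's embedding to every codebook entry.
        distances = [mse(codebook[target], emb) for emb in codebook]
        # Expected distance under the predicted distribution.
        total += sum(p * d for p, d in zip(probs, distances))
    # Averaging over positions is an assumption of this sketch.
    return total / len(target_ids)


# Toy codebook with three visual tokens; token 1 is near token 0, token 2 is far.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
near = token_discrepancy_loss([[10.0, 0.0, 0.0]], [0], codebook)
close = token_discrepancy_loss([[0.0, 10.0, 0.0]], [0], codebook)
far = token_discrepancy_loss([[0.0, 0.0, 10.0]], [0], codebook)
```

In this toy setup the loss grows as the predicted token moves further from the target in embedding space, so a visually similar wrong token is penalized less than a dissimilar one. In training, a term of this kind would typically be added to the usual cross-entropy objective.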
Problem

Research questions and friction points this paper is trying to address.

Artificial Intelligence
Spatial Understanding
Chain-of-Thought (CoT) Limitations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Visual Thinking
Chain-of-Thought Reasoning
Spatial Problem Solving
🔎 Similar Papers