Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers

📅 2025-06-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing text-based chain-of-thought (CoT) reasoning treats visual input as static context, creating a “semantic gap” between perception and symbolic reasoning. To bridge this gap, we describe a paradigm shift from “thinking about images” to “thinking with images” and introduce the first systematic three-stage framework in which vision serves as a dynamic cognitive workspace: visual input → manipulable intermediate representation → autonomous generation and operation. The methods surveyed combine programmable visual operations, intrinsic imagination mechanisms, cross-modal alignment, and textual CoT, elevating vision from passive input to an active medium for reasoning. The framework establishes a theoretical foundation for “thinking with images,” moves multimodal AI toward human-like cognitive autonomy, and provides a roadmap for evaluation design, validation of key techniques, and future research directions.

📝 Abstract

Recent progress in multimodal reasoning has been driven largely by textual Chain-of-Thought (CoT), a paradigm where models conduct reasoning within language. This text-centric approach, however, treats vision as a static, initial context, creating a fundamental "semantic gap" between rich perceptual data and discrete symbolic thought. Human cognition often transcends language, utilizing vision as a dynamic mental sketchpad. A similar evolution is now unfolding in AI, marking a fundamental paradigm shift from models that merely think about images to those that can truly think with images. This emerging paradigm is characterized by models leveraging visual information as intermediate steps in their thought process, transforming vision from a passive input into a dynamic, manipulable cognitive workspace. In this survey, we chart this evolution of intelligence along a trajectory of increasing cognitive autonomy, which unfolds across three key stages: from external tool exploration, through programmatic manipulation, to intrinsic imagination. To structure this rapidly evolving field, our survey makes four key contributions. (1) We establish the foundational principles of the thinking-with-images paradigm and its three-stage framework. (2) We provide a comprehensive review of the core methods that characterize each stage of this roadmap. (3) We analyze the critical landscape of evaluation benchmarks and transformative applications. (4) We identify significant challenges and outline promising future directions. By providing this structured overview, we aim to offer a clear roadmap for future research towards more powerful and human-aligned multimodal AI.

Problem

Research questions and friction points this paper is trying to address.

Bridging the semantic gap between vision and symbolic reasoning

Evolving AI from thinking about images to thinking with images

Developing a dynamic visual cognitive workspace in models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Leveraging visual information as intermediate cognitive steps

Transforming vision into a dynamic, manipulable workspace

Three-stage framework for cognitive autonomy evolution
