Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

📅 2025-03-16
🤖 AI Summary
Multimodal chain-of-thought (MCoT) reasoning currently lacks a systematic survey, and the field suffers from conceptual ambiguity, methodological fragmentation, and incomplete modality coverage. This work establishes a unified taxonomy and classification framework for MCoT covering image, video, speech, audio, 3D, and structured data. We formalize foundational definitions and synthesize core methodologies, including cross-modal alignment, stepwise multimodal reasoning modeling, interpretability analysis, and task-driven evaluation. Furthermore, we identify pathways toward multimodal artificial general intelligence (AGI) and articulate key open challenges. The survey analyzes over 200 works, offering a methodological guide and technology roadmap for domains such as robotics, healthcare, and autonomous driving. To our knowledge, this is the first systematic survey dedicated to MCoT.

📝 Abstract
By extending the advantages of chain-of-thought (CoT) reasoning, with its human-like step-by-step process, to multimodal contexts, multimodal CoT (MCoT) reasoning has recently garnered significant research attention, especially in its integration with multimodal large language models (MLLMs). Existing MCoT studies design various methodologies and innovative reasoning paradigms to address the unique challenges of different modalities, including image, video, speech, audio, 3D, and structured data, achieving extensive success in applications such as robotics, healthcare, autonomous driving, and multimodal generation. However, MCoT still presents distinct challenges and opportunities that require further attention for the field to keep thriving, and, unfortunately, an up-to-date review of this domain is lacking. To bridge this gap, we present the first systematic survey of MCoT reasoning, elucidating the relevant foundational concepts and definitions. We offer a comprehensive taxonomy and an in-depth analysis of current methodologies from diverse perspectives across various application scenarios. Furthermore, we provide insights into existing challenges and future research directions, aiming to foster innovation toward multimodal AGI.
Problem

Research questions and friction points this paper is trying to address.

Extends chain-of-thought reasoning to multimodal contexts.
Addresses challenges in integrating image, video, speech, and 3D data.
Provides a systematic survey and future directions for MCoT research.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends CoT reasoning to multimodal contexts.
Integrates with multimodal large language models.
Addresses challenges across diverse data modalities.