Investigating Inference-time Scaling for Chain of Multi-modal Thought: A Preliminary Study

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior chain-of-thought (CoT) research in multimodal reasoning has focused predominantly on textual inputs, neglecting joint visual–linguistic modeling during inference. Method: We propose the first inference-time scaling paradigm for multimodal CoT, integrating multimodal large language models with cross-modal alignment strategies and designing a consistency-enhanced verifier to jointly guide sampling-based methods (e.g., Self-Consistency) and search-based methods (e.g., Tree-of-Thought). Evaluation spans 10 cross-domain multimodal tasks. Results: Multimodal CoT substantially outperforms text-only baselines, and hybrid reasoning paths improve both diversity and accuracy. However, visual inputs incur significant token overhead, exposing a fundamental trade-off between performance and computational efficiency. Contribution: This work pioneers the integration of vision and language into a unified inference-time scaling framework and introduces a verifiable, scalable mechanism for multimodal thought generation and selection.
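The summary above describes a consistency-enhanced verifier that guides sampling-based scaling such as Self-Consistency. As a minimal sketch of that idea (not the paper's implementation), the following code performs verifier-weighted answer voting over sampled reasoning paths; the `sample_fn`/`verify_fn` interface and the stub functions are hypothetical:

```python
from collections import Counter
import itertools

def self_consistency(sample_fn, verify_fn, question, n=8):
    """Sample n candidate reasoning paths and return the answer with the
    highest verifier-weighted support (hypothetical interface sketch)."""
    scores = Counter()
    for _ in range(n):
        path, answer = sample_fn(question)          # one sampled chain of thought
        scores[answer] += verify_fn(path, answer)   # verifier score, e.g. in [0, 1]
    return scores.most_common(1)[0][0]

# Toy demo with stub sampler/verifier: two of three paths agree on "A".
answers = itertools.cycle([("p1", "A"), ("p2", "B"), ("p3", "A")])
result = self_consistency(lambda q: next(answers), lambda p, a: 1.0, "q", n=3)
# result == "A"
```

With a uniform verifier this reduces to plain majority voting; a non-trivial verifier instead weights each path's vote by its estimated consistency.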

📝 Abstract
Recently, inference-time scaling of chain-of-thought (CoT) has been demonstrated as a promising approach for addressing multi-modal reasoning tasks. While existing studies have predominantly centered on text-based thinking, the integration of both visual and textual modalities within the reasoning process remains unexplored. In this study, we pioneer the exploration of inference-time scaling with multi-modal thought, aiming to bridge this gap. To provide a comprehensive analysis, we systematically investigate popular sampling-based and tree search-based inference-time scaling methods on 10 challenging tasks spanning various domains. In addition, we uniformly adopt a consistency-enhanced verifier to ensure effective guidance for both methods across different thought paradigms. Results show that multi-modal thought yields better performance than conventional text-only thought, and blending the two types of thought fosters more diverse thinking. Despite these advantages, multi-modal thoughts necessitate higher token consumption for processing richer visual inputs, which raises concerns in practical applications. We hope that our findings on the merits and drawbacks of this line of research will inspire future works in the field.
Problem

Research questions and friction points this paper is trying to address.

Explores multi-modal thought scaling.
Compares text-only vs. multi-modal reasoning.
Analyzes token consumption in visual inputs.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Inference-time scaling exploration
Multi-modal thought integration
Consistency-enhanced verifier application
Authors

Yujie Lin — Shandong University (Artificial Intelligence)
Ante Wang — School of Informatics, Xiamen University, China
Moye Chen — Baidu Inc., Beijing, China
Jingyao Liu — School of Informatics, Xiamen University, China
Hao Liu — Baidu Inc., Beijing, China
Jinsong Su — Xiamen University (Natural Language Processing, Deep Learning, Neural Machine Translation)
Xinyan Xiao — Baidu (Natural Language Processing, Statistical Machine Translation)