Let's Think with Images Efficiently! An Interleaved-Modal Chain-of-Thought Reasoning Framework with Dynamic and Precise Visual Thoughts

📅 2026-03-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing Interleaved Chain-of-Thought (ICoT) approaches suffer from inefficient reasoning and semantic inconsistency due to static visual insertion and incoherent multimodal representations. To address these limitations, this work proposes the DaP-ICoT framework, which introduces a novel dynamic, on-demand visual integration mechanism coupled with a precise visual grounding strategy. This enables adaptive selection of when to inject visual tokens during reasoning, thereby generating contextually aligned and semantically coherent multimodal representations. Experimental results demonstrate that DaP-ICoT achieves state-of-the-art performance across multiple benchmarks and model architectures, significantly reducing image query frequency and cutting token consumption by 72.6%, thus substantially improving both reasoning efficiency and representational consistency.

📝 Abstract
Recently, Interleaved-modal Chain-of-Thought (ICoT) reasoning has achieved remarkable success by leveraging both multimodal inputs and outputs, attracting increasing attention. While achieving promising performance, current ICoT methods still suffer from two major limitations: (1) Static Visual Thought Positioning, which statically inserts visual information at fixed steps, resulting in inefficient and inflexible reasoning; and (2) Broken Visual Thought Representation, which involves discontinuous and semantically incoherent visual tokens. To address these limitations, we introduce Interleaved-modal Chain-of-Thought reasoning with Dynamic and Precise Visual Thoughts (DaP-ICoT), which incorporates two key components: (1) Dynamic Visual Thought Integration adaptively introduces visual inputs based on reasoning needs, reducing redundancy and improving efficiency. (2) Precise Visual Thought Guidance ensures semantically coherent and contextually aligned visual representations. Experiments across multiple benchmarks and models demonstrate that DaP-ICoT achieves state-of-the-art performance. In addition, DaP-ICoT significantly reduces the number of inserted images, leading to a 72.6% decrease in token consumption, enabling more efficient ICoT reasoning.
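The core contrast with static positioning can be sketched in a few lines: instead of inserting an image at every fixed reasoning step, a visual thought is injected only when the current step signals that it is needed. This is a minimal, hypothetical illustration; the function names (`needs_visual_input`, `extract_visual_thought`) and the keyword-based trigger are stand-ins, not the paper's actual mechanism, which the abstract describes only at a high level.

```python
# Hypothetical sketch of dynamic, on-demand visual-thought insertion.
# All names and the trigger heuristic are illustrative stand-ins.

def needs_visual_input(step_text: str) -> bool:
    # Toy trigger: insert a visual thought only when the reasoning step
    # explicitly refers back to the image. A real system would use a
    # learned signal (e.g. model confidence or attention) instead.
    visual_cues = ("look at", "region", "in the image", "compare the")
    return any(cue in step_text.lower() for cue in visual_cues)

def extract_visual_thought(step_text: str) -> str:
    # Stand-in for grounding the relevant image region and encoding it
    # as visual tokens.
    return f"<visual tokens for: {step_text!r}>"

def dynamic_icot_trace(reasoning_steps):
    """Interleave text steps with visual thoughts only where needed,
    instead of inserting one image per step as in static positioning."""
    trace = []
    for step in reasoning_steps:
        trace.append(("text", step))
        if needs_visual_input(step):
            trace.append(("visual", extract_visual_thought(step)))
    return trace

steps = [
    "The question asks which object is larger.",
    "Look at the region around the two objects.",
    "The left object occupies more pixels, so it is larger.",
]
trace = dynamic_icot_trace(steps)
# Only the step that references the image receives a visual thought,
# which is where the claimed token savings would come from.
```

Under static positioning, this trace would carry three images; here it carries one, illustrating how on-demand insertion reduces visual-token consumption.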
Problem

Research questions and friction points this paper is trying to address.

Interleaved-modal Chain-of-Thought
Static Visual Thought Positioning
Broken Visual Thought Representation
Multimodal Reasoning
Visual Thought Coherence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved-modal Chain-of-Thought
Dynamic Visual Thought Integration
Precise Visual Thought Guidance
Multimodal Reasoning
Token Efficiency
Xu Liu
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen
Yongheng Zhang
M.S. Student @ CSU | Research Intern @ Tencent
Artificial Intelligence · Large Language Model · World Model
Qiguang Chen
Harbin Institute of Technology
Chain-of-Thought · Reasoning · Multilingual LLM · Multi-modal LLM
Yao Li
Shanghai Aviation Electric Co., Ltd, Aviation Industry Corporation of China, Shanghai
Sheng Wang
Shanghai Aviation Electric Co., Ltd, Aviation Industry Corporation of China, Shanghai
Libo Qin
Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen