Interleaved-Modal Chain-of-Thought

📅 2024-11-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models (VLMs) struggle to generate intermediate reasoning steps that are finely aligned with image regions during chain-of-thought (CoT) inference, limiting both interpretability and accuracy. To address this, we propose Interleaved-modal Chain-of-Thought (ICoT), a multimodal reasoning framework that constructs stepwise, interleaved text-image reasoning sequences. Our key contributions are: (1) a training-free, plug-and-play Attention-Driven Selection (ADS) mechanism that leverages native VLM attention maps to dynamically localize salient visual regions; and (2) a paired image-text prompting strategy integrated with a multimodal CoT generation architecture. Evaluated on three major benchmarks, ICoT achieves up to 14% absolute accuracy improvement over prior methods while significantly enhancing fine-grained alignment between reasoning steps and image content, thereby improving both transparency and fidelity of multimodal reasoning.

📝 Abstract
Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transitioning to vision-language models (VLMs), their text-only rationales struggle to express fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which is hard for current VLMs to fulfill. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image to generate the interleaved-modal reasoning steps with negligible additional latency. ADS relies solely on the attention map of VLMs without the need for parameterization, and is therefore a plug-and-play strategy that generalizes to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial performance improvements (up to 14%) and interpretability gains over existing multimodal CoT prompting methods.
Problem

Research questions and friction points this paper is trying to address.

Enhance vision-language models' reasoning with paired visual-textual steps.
Generate fine-grained interleaved-modal content for improved interpretability.
Propose Attention-driven Selection for efficient multimodal reasoning integration.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interleaved-modal Chain-of-Thought (ICoT) for VLMs
Attention-driven Selection (ADS) for image integration
Plug-and-play strategy for multimodal reasoning
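To make the ADS idea concrete, here is a minimal sketch (not the authors' implementation) of attention-driven region selection: given a 2D map of per-patch attention weights taken from a VLM, slide a fixed-size window over the map and return the window with the highest total attention, which would then be cropped from the input image and interleaved into the reasoning sequence. The window size and the use of a plain sliding-window search are illustrative assumptions.

```python
import numpy as np

def ads_select_region(attention_map: np.ndarray, window=(2, 2)):
    """Return the top-left (row, col) of the window with the highest
    summed attention. `attention_map` is a 2D array of per-patch
    attention weights; `window` is the (rows, cols) size of the crop.
    """
    H, W = attention_map.shape
    wh, ww = window
    best_score, best_pos = -np.inf, (0, 0)
    # Exhaustive sliding-window search over all valid positions.
    for r in range(H - wh + 1):
        for c in range(W - ww + 1):
            score = attention_map[r:r + wh, c:c + ww].sum()
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos

# Toy 4x4 attention map whose hottest 2x2 block starts at (1, 2).
attn = np.zeros((4, 4))
attn[1:3, 2:4] = 1.0
print(ads_select_region(attn))  # (1, 2)
```

Because the selection uses only attention weights the model already produces, no extra parameters or training are needed, which is what makes the strategy plug-and-play.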
Jun Gao
School of Computer Science and Technology, Soochow University
Yongqing Li
Department of Computer Science, The Hong Kong Polytechnic University
Ziqiang Cao
Soochow University
Wenjie Li
Department of Computer Science, The Hong Kong Polytechnic University