🤖 AI Summary
Existing vision-language models (VLMs) struggle to generate intermediate reasoning steps that are finely aligned with image regions during chain-of-thought (CoT) inference, limiting both interpretability and accuracy. To address this, we propose Interleaved-modal Chain-of-Thought (ICoT), a multimodal reasoning paradigm that constructs stepwise reasoning sequences of paired visual and textual rationales. Our key contributions are: (1) a training-free, plug-and-play Attention-driven Selection (ADS) mechanism that uses the VLM's native attention maps to localize salient regions of the input image; and (2) an interleaved image-text prompting paradigm that inserts the selected regions into the generated rationale, realizing ICoT on existing VLMs. Evaluated on three benchmarks, ICoT achieves up to 14% accuracy improvement over prior multimodal CoT prompting methods while markedly improving fine-grained alignment between reasoning steps and image content, enhancing both the transparency and fidelity of multimodal reasoning.
📝 Abstract
Chain-of-Thought (CoT) prompting elicits large language models (LLMs) to produce a series of intermediate reasoning steps before arriving at the final answer. However, when transferred to vision-language models (VLMs), their text-only rationales struggle to express fine-grained associations with the original image. In this paper, we propose an image-incorporated multimodal Chain-of-Thought, named Interleaved-modal Chain-of-Thought (ICoT), which generates sequential reasoning steps consisting of paired visual and textual rationales to infer the final answer. Intuitively, ICoT requires VLMs to generate fine-grained interleaved-modal content, which current VLMs struggle to fulfill. Considering that the required visual information is usually part of the input image, we propose Attention-driven Selection (ADS) to realize ICoT over existing VLMs. ADS intelligently inserts regions of the input image into the generated rationale to form interleaved-modal reasoning steps, with negligible additional latency. ADS relies solely on the attention maps of VLMs and introduces no additional parameters, so it is a plug-and-play strategy that generalizes to a spectrum of VLMs. We apply ADS to realize ICoT on two popular VLMs of different architectures. Extensive evaluations on three benchmarks show that ICoT prompting achieves substantial performance improvements (up to 14%) and interpretability gains compared to existing multimodal CoT prompting methods.
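To make the attention-driven selection idea more concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of how attention weights over image-patch tokens could be converted into a pixel-space crop of the most salient region. The function name, the row-major patch grid, and the top-k heuristic are illustrative assumptions only.

```python
import numpy as np

def select_salient_region(attn_over_patches, grid_size, image_size, top_k=4):
    """Pick the image region most attended to at the current reasoning step.

    attn_over_patches: 1-D array of attention weights from the most recently
        generated text token to each image-patch token (length grid_size**2),
        assuming row-major patch ordering.
    grid_size: number of patches per side in the patch grid.
    image_size: (width, height) of the original input image in pixels.
    Returns a pixel-space bounding box (x0, y0, x1, y1) covering the
    top-k most-attended patches.
    """
    # Indices of the k most-attended patches.
    top_idx = np.argsort(attn_over_patches)[-top_k:]
    rows, cols = top_idx // grid_size, top_idx % grid_size

    # Convert the grid extent of those patches into pixel coordinates.
    patch_w = image_size[0] / grid_size
    patch_h = image_size[1] / grid_size
    x0, x1 = cols.min() * patch_w, (cols.max() + 1) * patch_w
    y0, y1 = rows.min() * patch_h, (rows.max() + 1) * patch_h
    return int(x0), int(y0), int(x1), int(y1)

# Toy usage: a 24x24 patch grid with attention concentrated near one patch.
rng = np.random.default_rng(0)
attn = rng.random(24 * 24)
attn[24 * 3 + 5] += 10.0  # pretend patch (row 3, col 5) is highly attended
print(select_salient_region(attn, grid_size=24, image_size=(672, 672)))
```

In this sketch, the returned crop would be fed back to the VLM as an additional visual input paired with the next textual rationale; because only an attention readout and a crop are involved, no extra parameters or training would be required, which mirrors the plug-and-play property described in the abstract.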