🤖 AI Summary
Multimodal large language models (MLLMs) struggle to capture the temporal structure and narrative logic of image sequences, often treating consecutive frames as independent static inputs. To address this, we propose ImageChain, a framework that models image sequences as temporally aware, multi-turn vision-language dialogues. Our approach combines multimodal multi-turn instruction tuning, interleaved image-text encoding, and temporally controllable dialogue modeling to enable context-aware next-scene description generation. This framework bridges static image understanding and dynamic temporal reasoning, and supports zero-shot cross-domain transfer (e.g., comics, robotic vision). Measured by SimRate, a metric of semantic similarity to human-annotated references, our method improves by an average of 15.3 percentage points (from 3.7% to 19.0%), demonstrating substantial gains in semantic consistency and generalization robustness.
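As a rough illustration of how a SimRate-style metric could be computed, the sketch below scores the fraction of generated descriptions whose similarity to a human reference clears a threshold. This is an assumption about the metric's shape, not the paper's definition: the actual work presumably uses learned sentence embeddings, whereas here a toy bag-of-words cosine similarity stands in so the example is self-contained; the `threshold=0.5` value and all function names are hypothetical.

```python
def cosine(u, v):
    # Cosine similarity between two count vectors; 0.0 if either is empty.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def bow_vector(text, vocab):
    # Toy bag-of-words embedding: word counts over a shared vocabulary.
    # A real SimRate implementation would use sentence embeddings instead.
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def sim_rate(predictions, references, threshold=0.5):
    """Percentage of predictions whose similarity to the matching
    reference meets the threshold (hypothetical reading of SimRate)."""
    vocab = sorted({w for t in predictions + references
                    for w in t.lower().split()})
    hits = sum(
        cosine(bow_vector(p, vocab), bow_vector(r, vocab)) >= threshold
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)
```

With this toy similarity, a paraphrase such as `sim_rate(["a robot opens the door"], ["the robot opens a door"])` scores 100.0, while an unrelated prediction scores 0.0.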
📝 Abstract
Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task, achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally aware reasoning.
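The interleaving described above can be sketched as a conversation builder: each prior frame and its description become a user/assistant exchange, and the final user turn presents the upcoming scene for next-scene description. Everything here is an illustrative assumption, not the paper's actual data format: the `Turn` structure, the `<image:…>` placeholder convention, and the prompt wording are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str   # text, with "<image:...>" as an image placeholder

def build_image_chain(image_ids, descriptions):
    """Interleave frames and their descriptions as a multi-turn dialogue.

    `image_ids` has one more entry than `descriptions`: the final image
    is the upcoming scene the model must describe in context.
    """
    turns = []
    for img, desc in zip(image_ids, descriptions):
        turns.append(Turn("user", f"<image:{img}> Describe this scene."))
        turns.append(Turn("assistant", desc))
    # Final user turn: the new frame, prompting a context-aware
    # next-scene description conditioned on all preceding turns.
    turns.append(Turn("user",
                      f"<image:{image_ids[-1]}> Describe the next scene."))
    return turns

dialogue = build_image_chain(
    ["frame_1", "frame_2", "frame_3"],
    ["A robot approaches a door.", "The robot grips the handle."],
)
```

The multi-turn layout is what lets an off-the-shelf chat-tuned MLLM attend to earlier frames and descriptions when generating the final answer, rather than seeing each image in isolation.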