ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) struggle to capture the temporal structure and narrative logic of image sequences, often treating consecutive frames as independent static inputs. To address this, the paper proposes ImageChain, a framework that models image sequences as temporally aware, multi-turn vision-language dialogues. The approach combines multimodal multi-turn instruction tuning, interleaved image-text encoding, and controlled dialogue modeling to enable context-aware next-scene description generation. This design moves MLLMs from static image understanding toward temporal reasoning and supports zero-shot cross-domain transfer (e.g., comics, robotic vision). Measured by SimRate, a metric of semantic similarity to human-annotated ground truths, the method improves the average score by 15.3 percentage points (from 3.7% to 19.0%), indicating substantial gains in semantic consistency and generalization robustness.
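The paper does not spell out here how SimRate is computed. A minimal stand-in, assuming SimRate is the fraction of generated descriptions whose similarity to the human-annotated reference exceeds a threshold, with token-level Jaccard overlap substituted for a learned semantic-similarity model (both the threshold and the overlap measure are assumptions, not the paper's definition):

```python
def sim_rate(generated, references, threshold=0.5):
    """Fraction of predictions judged similar enough to their reference.

    Token-level Jaccard overlap stands in for the semantic-similarity
    model the metric presumably uses; the threshold is likewise assumed.
    """
    def jaccard(a, b):
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

    hits = sum(jaccard(g, r) >= threshold for g, r in zip(generated, references))
    return hits / len(references)
```

Swapping in an embedding-based similarity (e.g., cosine similarity of sentence embeddings) would bring the sketch closer to a genuinely semantic measure.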

📝 Abstract
Reasoning over sequences of images remains a challenge for multimodal large language models (MLLMs). While recent models incorporate multi-image data during pre-training, they still struggle to recognize sequential structures, often treating images independently. This work introduces ImageChain, a framework that enhances MLLMs with sequential reasoning capabilities over image data by modeling visual sequences as a multi-turn conversation. In ImageChain, images are interleaved with corresponding textual descriptions to form a controlled dialogue that explicitly captures temporal dependencies and narrative progression. Our method optimizes for the task of next-scene description, where the model generates a context-aware description of an upcoming scene based on preceding visual and textual cues. We demonstrate that our approach improves performance on the next-scene description task -- achieving an average improvement from 3.7% to 19% in SimRate, a metric that quantifies semantic similarity to human-annotated ground truths. Moreover, ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics. Extensive experiments validate that instruction-tuning in a multimodal, multi-turn conversation design is key to bridging the gap between static image understanding and temporally-aware reasoning.
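The interleaving described above can be sketched as follows. The message schema, role names, and prompt wording are illustrative assumptions in a generic chat format, not the paper's actual implementation:

```python
def build_imagechain_conversation(frames, descriptions):
    """Interleave frames with their descriptions as a multi-turn chat.

    frames: image references for each scene; the last frame is the one
        whose description the model must generate (next-scene description).
    descriptions: ground-truth descriptions for all but the last frame.
    """
    assert len(frames) == len(descriptions) + 1
    messages = []
    for frame, desc in zip(frames, descriptions):
        # Each past scene becomes a user turn (image) plus an assistant
        # turn (text), so the model sees an explicit temporal ordering
        # of scene/description pairs rather than independent images.
        messages.append({"role": "user", "content": [
            {"type": "image", "image": frame},
            {"type": "text", "text": "Describe this scene."},
        ]})
        messages.append({"role": "assistant", "content": desc})
    # Final turn: the new frame, for which the model must produce a
    # description conditioned on the preceding visual and textual cues.
    messages.append({"role": "user", "content": [
        {"type": "image", "image": frames[-1]},
        {"type": "text", "text": "Describe this scene, given the story so far."},
    ]})
    return messages
```

For instruction tuning as described in the abstract, each such conversation would be paired with the human-annotated description of the final scene as the training target.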
Problem

Research questions and friction points this paper is trying to address.

Enhance sequential image reasoning
Improve next-scene description accuracy
Enable zero-shot out-of-domain performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sequential image-text reasoning
Multi-turn conversation modeling
Context-aware scene description