🤖 AI Summary
Text-to-image diffusion models struggle to generate coherent image sequences for visual storytelling, largely because it is difficult to model the history of text–image pairs that provides the context needed for inter-frame consistency and precise text alignment. To address this, we propose ViSTA, a multimodal history adapter for text-to-image diffusion models. ViSTA combines a multimodal history fusion module that extracts relevant history features with a history adapter that conditions generation on them; at inference, a salient history selection strategy picks the most informative text–image pair to improve conditioning. For evaluation, we adopt TIFA, a VQA-based metric, to provide interpretable, narrative-level text–image alignment assessment. Crucially, ViSTA operates as a plug-and-play component requiring no end-to-end retraining of the base diffusion model. Evaluated on StorySalon and FlintStonesSV, ViSTA significantly improves both inter-frame consistency and text fidelity. It achieves strong generalizability across diverse diffusion backbones, high computational efficiency, and transparent, interpretable evaluation, demonstrating a practical, scalable solution for coherent visual storytelling.
📝 Abstract
Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past text-image pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose a multi-modal history adapter for text-to-image diffusion models, ViSTA. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, where the most salient history text-image pair is selected, improving the quality of the conditioning. Furthermore, we propose to employ TIFA, a Visual Question Answering-based metric, to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model generates image sequences that are not only consistent across frames but also well aligned with the narrative text descriptions.
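To make the salient history selection strategy concrete, here is a minimal sketch of how such a step could work: each past frame contributes a fused text-image embedding, and the pair most similar to the current prompt's embedding is chosen as the conditioning signal. The function names, the dictionary layout, and the use of cosine similarity are illustrative assumptions, not the paper's actual implementation.

```python
def cosine(a, b):
    # Cosine similarity between two embedding vectors (plain Python lists).
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_salient_history(current_text_emb, history):
    """Pick the history text-image pair whose fused embedding is most
    similar to the current frame's text embedding (assumed criterion)."""
    return max(history, key=lambda pair: cosine(current_text_emb, pair["fused_emb"]))

# Usage: each history entry carries a hypothetical fused text+image embedding.
history = [
    {"frame": 0, "fused_emb": [0.9, 0.1, 0.0]},
    {"frame": 1, "fused_emb": [0.2, 0.8, 0.1]},
]
current_text_emb = [0.1, 0.9, 0.2]
salient = select_salient_history(current_text_emb, history)
print(salient["frame"])  # frame 1 is closest to the current prompt
```

The selected pair's features would then be injected into the diffusion model through the history adapter, e.g. as an extra cross-attention condition, while the base model's weights stay frozen.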