ViSTA: Visual Storytelling using Multi-modal Adapters for Text-to-Image Diffusion Models

📅 2025-06-13
📈 Citations: 0
Influential: 0

🤖 AI Summary
Text-to-image diffusion models struggle to generate coherent image sequences for visual storytelling, largely because it is hard to use the history of earlier text-image pairs to keep frames consistent while staying aligned with the narrative text. To address this, the authors propose ViSTA, a multi-modal history adapter. It combines a multi-modal history fusion module, which extracts relevant features from the history, with a history adapter that conditions generation on those features. As an adapter, it is positioned as an alternative to auto-regressive methods that require extensive training. At inference, a salient history selection strategy picks the most salient history text-image pair to improve the quality of the conditioning. The authors also adopt TIFA, a Visual Question Answering (VQA)-based metric, for a more targeted and interpretable assessment of text-image alignment. Evaluated on StorySalon and FlintStonesSV, ViSTA is reported to be both consistent across frames and well aligned with the narrative text.

📝 Abstract
Text-to-image diffusion models have achieved remarkable success, yet generating coherent image sequences for visual storytelling remains challenging. A key challenge is effectively leveraging all previous text-image pairs, referred to as history text-image pairs, which provide contextual information for maintaining consistency across frames. Existing auto-regressive methods condition on all past image-text pairs but require extensive training, while training-free subject-specific approaches ensure consistency but lack adaptability to narrative prompts. To address these limitations, we propose ViSTA, a multi-modal history adapter for text-to-image diffusion models. It consists of (1) a multi-modal history fusion module to extract relevant history features and (2) a history adapter to condition the generation on the extracted relevant features. We also introduce a salient history selection strategy during inference, in which the most salient history text-image pair is selected to improve the quality of the conditioning. Furthermore, we propose to employ TIFA, a Visual Question Answering-based metric, to assess text-image alignment in visual storytelling, providing a more targeted and interpretable assessment of generated images. Evaluated on the StorySalon and FlintStonesSV datasets, our proposed ViSTA model generates images that are not only consistent across frames but also well aligned with the narrative text descriptions.
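
The salient history selection step described in the abstract can be pictured with a small sketch. The abstract does not say how salience is computed, so the criterion used here (cosine similarity between an embedding of the current prompt and an embedding of each history text-image pair) is an assumption made only for illustration, and the function names `cosine_similarity` and `select_salient_history` are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_salient_history(
    current_text_emb: Sequence[float],
    history_embs: List[Sequence[float]],
) -> Tuple[int, List[float]]:
    """Return the index of the highest-scoring history pair and all scores.

    ASSUMPTION: salience is approximated by cosine similarity to the current
    prompt embedding. The paper's actual criterion may differ.
    """
    if not history_embs:
        raise ValueError("history must contain at least one text-image pair")
    scores = [cosine_similarity(current_text_emb, h) for h in history_embs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for real encoder outputs.
    current = [1.0, 0.0]
    history = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
    index, scores = select_salient_history(current, history)
    print(index, scores)
```

In a real pipeline the selected pair would then be the one passed to the conditioning path, while the remaining history is ignored for that frame.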

Problem

Research questions and friction points this paper is trying to address.

Generating coherent image sequences for visual storytelling
Effectively leveraging history text-image pairs for consistency
Ensuring adaptability to narrative prompts in text-to-image generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-modal history adapter for diffusion models
Salient history selection strategy
VQA-based metric for text-image alignment
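
To make the VQA-based metric concrete: TIFA-style evaluation scores an image by the fraction of questions derived from the text that a VQA model answers correctly when shown the image. The sketch below shows only that final accuracy computation. The question-answer pairs and the VQA model are supplied by the caller, and the names `QAPair`, `tifa_style_score`, and `toy_vqa` are hypothetical helpers for illustration, not part of any released TIFA or ViSTA code.

```python
from typing import Any, Callable, List, NamedTuple


class QAPair(NamedTuple):
    """A question about the image and the answer implied by the text prompt."""
    question: str
    answer: str


def tifa_style_score(
    image: Any,
    qa_pairs: List[QAPair],
    vqa_answer: Callable[[Any, str], str],
) -> float:
    """Fraction of questions the VQA model answers as the text implies.

    ASSUMPTION: answers are compared by case-insensitive exact match after
    trimming whitespace. A real evaluation may use a different matching rule.
    """
    if not qa_pairs:
        raise ValueError("at least one question-answer pair is required")
    correct = sum(
        1
        for qa in qa_pairs
        if vqa_answer(image, qa.question).strip().lower() == qa.answer.strip().lower()
    )
    return correct / len(qa_pairs)


def toy_vqa(image: Any, question: str) -> str:
    """Stand-in for a real VQA model: looks answers up in a fixed table."""
    canned = {
        "What animal is shown?": "Dog",
        "What color is the ball?": "blue",
        "Is it raining?": "no",
    }
    return canned.get(question, "unknown")


if __name__ == "__main__":
    pairs = [
        QAPair("What animal is shown?", "dog"),
        QAPair("What color is the ball?", "red"),
        QAPair("Is it raining?", "no"),
    ]
    # The image argument is unused by the toy model, so None is enough here.
    print(tifa_style_score(None, pairs, toy_vqa))
```

Because each question targets one element of the prompt, the per-question results show which parts of the narrative text a generated frame missed, which is what makes this kind of score interpretable.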