LogiStory: A Logic-Aware Framework for Multi-Image Story Visualization

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing visual sequence generation models often suffer from action discontinuities and narrative fragmentation due to insufficient modeling of logical structure. This work proposes the first multi-image story generation framework that explicitly models visual logic by integrating structured story planning with multimodal image synthesis to enhance both narrative coherence and visual fidelity. The core innovation lies in a multi-agent system that collaboratively performs character grounding, causal chain extraction, and story-level consistency verification. Furthermore, the authors introduce LogicTale, the first benchmark specifically designed for evaluating causal reasoning in visual storytelling. Experimental results demonstrate that the proposed approach significantly outperforms current state-of-the-art methods on both automatic metrics and human evaluations, effectively improving the logical consistency and perceptual quality of generated stories.
📝 Abstract
Generating coherent and communicative visual sequences, such as image sequences and videos, remains a significant challenge for current multimodal systems. Despite advances in visual quality and the integration of world knowledge, existing models still struggle to maintain logical flow, often resulting in disjointed actions, fragmented narratives, and unclear storylines. We attribute these issues to the lack of attention to visual logic, a critical yet underexplored dimension of visual sequence generation that we define as the perceptual and causal coherence among characters, actions, and scenes over time. To bridge this gap, we propose a logic-aware multi-image story visualization framework, LogiStory. The framework is built around the central innovation of explicitly modeling visual logic in story visualization. To realize this idea, we design a multi-agent system that grounds roles, extracts causal chains, and verifies story-level consistency, transforming narrative coherence from an implicit byproduct of image generation into an explicit modeling objective. This design effectively bridges structured story planning with visual generation, enhancing both narrative clarity and visual quality in story visualization. Furthermore, to evaluate generation capacity, we construct LogicTale, a benchmark of richly annotated stories emphasizing causal reasoning and visual-logic interpretability. We establish comprehensive automatic and human evaluation protocols designed to measure both visual logic and perceptual quality. Experiments demonstrate that our approach significantly improves the narrative logic of generated visual stories. This work provides a foundational step towards modeling and enforcing visual logic in general image sequence and video generation tasks.
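The abstract describes a three-stage multi-agent pipeline: character (role) grounding, causal-chain extraction, and story-level consistency verification. A minimal sketch of how such a pipeline could be wired together is shown below; all data structures, function names, and the specific consistency rule are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the three agent stages described in the abstract.
# Everything here (Scene, StoryPlan, the agent functions, the toy story)
# is an assumption for illustration, not LogiStory's real code.
from dataclasses import dataclass, field


@dataclass
class Scene:
    characters: list  # canonical character identities present in the scene
    action: str       # the action depicted
    effect: str       # the state change the action causes


@dataclass
class StoryPlan:
    scenes: list = field(default_factory=list)
    causal_chain: list = field(default_factory=list)


def ground_characters(raw_scenes, character_registry):
    """Agent 1: map free-text character mentions to canonical identities."""
    grounded = []
    for chars, action, effect in raw_scenes:
        canonical = [character_registry.get(c, c) for c in chars]
        grounded.append(Scene(canonical, action, effect))
    return grounded


def extract_causal_chain(scenes):
    """Agent 2: link each scene's action to the action it enables next."""
    return [(scenes[i].action, scenes[i + 1].action)
            for i in range(len(scenes) - 1)]


def verify_consistency(plan):
    """Agent 3: flag consecutive scenes with no shared character,
    a simple proxy for action continuity across images."""
    issues = []
    for i in range(len(plan.scenes) - 1):
        a = set(plan.scenes[i].characters)
        b = set(plan.scenes[i + 1].characters)
        if not a & b:
            issues.append(f"scenes {i}->{i + 1}: no shared character")
    return issues


# Toy three-scene story (Aesop-style), run through all three stages.
registry = {"the fox": "Fox", "a crow": "Crow"}
raw = [(["the fox"], "spots cheese", "cheese noticed"),
       (["the fox", "a crow"], "flatters the crow", "crow opens beak"),
       (["a crow"], "drops the cheese", "fox eats")]
scenes = ground_characters(raw, registry)
plan = StoryPlan(scenes=scenes, causal_chain=extract_causal_chain(scenes))
```

In this sketch the verifier returns an empty issue list, since each pair of consecutive scenes shares a grounded character; a real system would presumably apply far richer checks (pose, scene layout, causal preconditions) before handing the plan to the image-synthesis stage.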
Problem

Research questions and friction points this paper is trying to address.

visual logic · story visualization · narrative coherence · multi-image generation · causal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

visual logic · multi-agent system · causal reasoning · story visualization · narrative coherence
Chutian Meng
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Fan Ma
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Chi Zhang
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China
Jiaxu Miao
Sun Yat-Sen University
Deep Learning · Video Segmentation · Federated Learning
Yi Yang
Zhejiang University
multimedia · computer vision · machine learning
Yueting Zhuang
College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China