🤖 AI Summary
Visual narrative generation faces dual challenges of image fidelity and cross-image consistency, primarily due to the absence of explicitly modelable narrative planning knowledge. To address this, we introduce VinaBench—the first benchmark specifically designed for visual narrative evaluation—featuring systematic annotations of commonsense and discourse-level constraints to provide learnable narrative structure for image sequence generation. Our contributions are threefold: (1) fine-grained, constraint-aware data annotation; (2) a consistency modeling method integrating both commonsense and discourse relations; and (3) a multidimensional evaluation metric jointly measuring text–image alignment and inter-image coherence. We fine-tune three representative generative models on VinaBench, achieving improvements of 18.7% in fidelity and 22.3% in narrative coherence over baselines. These results empirically validate that explicit knowledge constraints effectively guide visual narrative generation.
📝 Abstract
Visual narrative generation transforms textual narratives into sequences of images illustrating the content of the text. However, generating visual narratives that are faithful to the input text and self-consistent across generated images remains an open challenge, due to the lack of knowledge constraints used for planning the stories. In this work, we propose a new benchmark, VinaBench, to address this challenge. Our benchmark annotates the underlying commonsense and discourse constraints in visual narrative samples, offering systematic scaffolds for learning the implicit strategies of visual storytelling. Based on the incorporated narrative constraints, we further propose novel metrics to closely evaluate the consistency of generated narrative images and the alignment of generations with the input textual narrative. Our results across three generative vision models demonstrate that learning with VinaBench's knowledge constraints effectively improves the faithfulness and cohesion of generated visual narratives.