🤖 AI Summary
This survey addresses the critical multimodal task of vision-driven story generation, systematically reviewing representative works from 2015 to 2024. Motivated by three key problems (the lack of a unified analytical framework, ambiguous task boundaries, and outdated evaluation protocols), we propose, for the first time, a unified cross-task framework encompassing image/video captioning, visual question answering (VQA), and story generation, clarifying patterns of methodological transfer and the fundamental distinctions among these tasks. We critically examine prevalent datasets and metrics (e.g., BLEU, CIDEr, SPICE), exposing their limitations in capturing explainability, controllability, and long-range narrative coherence, and we advocate for evaluation reforms aligned with these dimensions. Synthesizing advances in deep learning, multimodal alignment (e.g., CLIP-style architectures), and sequence modeling (e.g., Transformers), we identify current bottlenecks and chart a roadmap toward robust, trustworthy visual storytelling, providing both theoretical foundations and practical guidance for next-generation research.
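To make the critique of n-gram metrics concrete, the sketch below (a hypothetical illustration, not code or data from the survey) uses NLTK's sentence-level BLEU to score two candidate stories against a reference: a faithful paraphrase with little word overlap, and an incoherent shuffle of the reference's own words. The story strings and the choice of NLTK are assumptions made here for illustration only.

```python
# Hypothetical illustration of a limitation of n-gram metrics such as BLEU:
# surface overlap is rewarded even when narrative coherence is destroyed.
# Assumes NLTK is installed (pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ("the dog found a ball in the park and "
             "carried it home to its owner").split()

# A coherent paraphrase of the same story, with little lexical overlap.
paraphrase = ("a puppy discovered a toy outside and "
              "brought it back to its family").split()

# An incoherent shuffle that reuses the reference's exact vocabulary.
shuffled = ("home the ball a dog the park found in and "
            "its owner to carried it").split()

smooth = SmoothingFunction().method1  # avoids zero scores for short texts
print("paraphrase BLEU:", sentence_bleu([reference], paraphrase,
                                        smoothing_function=smooth))
print("shuffled BLEU:  ", sentence_bleu([reference], shuffled,
                                        smoothing_function=smooth))
# The incoherent shuffle typically outscores the faithful paraphrase,
# since BLEU measures n-gram overlap rather than narrative quality.
```

The shuffled text scores higher despite being unreadable, which illustrates why surface-overlap metrics struggle to reflect long-range narrative coherence in story generation.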
📝 Abstract
Creating engaging narratives from visual data is crucial for automated digital media consumption, assistive technologies, and interactive entertainment. This survey covers methodologies for generating such narratives, focusing on their principles, strengths, and limitations. It also covers tasks related to automatic story generation, such as image and video captioning and visual question answering, as well as story generation without visual inputs. These tasks share common challenges with visual story generation and have inspired many of the techniques used in the field. We analyze the main datasets and evaluation metrics, providing a critical perspective on their limitations.