🤖 AI Summary
This paper addresses core challenges in video storytelling—namely, poor temporal synchronization between narration and visual content, weak semantic coherence, and insufficient integration of external knowledge—by introducing the novel task of *Synchronized Video Storytelling*. To support this task, the authors construct E-SyncVidStory, the first benchmark dataset for this task, featuring multi-granularity annotations. They propose VideoNarrator, a storyline-guided two-stage framework integrating multimodal large language model fine-tuning, video segment representation learning, structured storyline generation, and cross-segment alignment. Additionally, they design specialized evaluation metrics—including SyncScore and CoherenceScore—to quantify synchronization fidelity and narrative quality. On E-SyncVidStory, the method achieves a 12.7% improvement over state-of-the-art methods in automatic evaluation and attains top performance in human evaluations for both synchronization accuracy and storytelling quality. All code, data, and evaluation tools are publicly released.
📝 Abstract
Video storytelling is an engaging form of multimedia content that combines video with accompanying narration to attract an audience; a key challenge is creating narrations for recorded visual scenes. Previous studies on dense video captioning and video story generation have made some progress. However, practical applications typically require synchronized narrations for ongoing visual scenes. In this work, we introduce the new task of Synchronized Video Storytelling, which aims to generate synchronous and informative narrations for videos. These narrations, one per video clip, should relate to the visual content, integrate relevant knowledge, and have a word count appropriate to the clip's duration. In particular, a structured storyline helps guide the generation process, ensuring coherence and integrity. To support exploration of this task, we introduce a new benchmark dataset, E-SyncVidStory, with rich annotations. Since existing multimodal LLMs cannot effectively address this task in one-shot or few-shot settings, we propose a framework named VideoNarrator that generates a storyline for input videos and simultaneously generates narrations guided by the generated or a predefined storyline. We further introduce a set of evaluation metrics to thoroughly assess the generated output. Both automatic and human evaluations validate the effectiveness of our approach. Our dataset, code, and evaluation tools will be released.