Facilitating Video Story Interaction with Multi-Agent Collaborative System

📅 2025-05-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video storytelling interaction methods rely on predefined narratives, lacking personalization and dynamic evolution capabilities. This paper proposes a multi-agent collaborative interactive system for video narrative generation, featuring a three-tier architecture integrating vision-language models (VLMs), retrieval-augmented generation (RAG), and multi-agent systems (MAS). The system enables user-intent-driven character development, emergent social behaviors, and on-demand scene expansion. Departing from passive viewing paradigms, it supports cross-modal narrative understanding and generation, endowing characters with long-term memory and explicit social relationship modeling. Evaluated on the *Harry Potter* series, generated characters exhibit significant developmental progression and social coherence, while narrative scenes support high-fidelity, customizable visual interaction. To our knowledge, this work is the first to deeply couple RAG with MAS within a video narrative framework, establishing a novel paradigm for personalized, dynamically evolving interactive video storytelling.

📝 Abstract
Video story interaction enables viewers to engage with and explore narrative content for personalized experiences. However, existing methods are limited to user selection and specially designed narratives, and they lack customization. To address this, we propose an interactive system based on user intent. Our system uses a Vision Language Model (VLM) to enable machines to understand video stories, and combines Retrieval-Augmented Generation (RAG) with a Multi-Agent System (MAS) to create evolving characters and scene experiences. It includes three stages: 1) video story processing, which uses a VLM and prior knowledge to simulate human understanding of stories across three modalities; 2) multi-space chat, which creates growth-oriented characters through MAS interactions based on user queries and story stages; and 3) scene customization, which expands and visualizes the story scenes mentioned in dialogue. Applied to the Harry Potter series, our study shows that the system effectively portrays emergent character social behavior and growth, enhancing the interactive experience in the video story world.
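The stage-dependent character memory described in stage 2 can be illustrated with a small sketch. This is not the authors' implementation: the `Memory` and `CharacterAgent` classes, the integer stage index, the sample memories, and the word-overlap scoring (a toy stand-in for embedding-based retrieval) are all assumptions made for illustration. The call to a language model that would consume the prompt is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    stage: int  # hypothetical index of the story stage where the event occurs
    text: str   # short description of the event


@dataclass
class CharacterAgent:
    name: str
    memories: list = field(default_factory=list)

    def retrieve(self, query, current_stage, k=2):
        # Only memories from stages the story has already reached are visible,
        # so the character cannot refer to later events.
        visible = [m for m in self.memories if m.stage <= current_stage]
        query_words = set(query.lower().split())
        # Word overlap is a toy stand-in for embedding similarity (assumption).
        scored = [(len(query_words & set(m.text.lower().split())), m) for m in visible]
        relevant = [pair for pair in scored if pair[0] > 0]
        relevant.sort(key=lambda pair: pair[0], reverse=True)
        return [m.text for _, m in relevant[:k]]

    def build_prompt(self, query, current_stage):
        # Assemble retrieved memories and the user query into one prompt string.
        context = self.retrieve(query, current_stage)
        lines = [f"You are {self.name} at story stage {current_stage}."]
        lines += [f"Memory: {text}" for text in context]
        lines.append(f"User: {query}")
        return "\n".join(lines)


# Hypothetical character with two placeholder memories.
agent = CharacterAgent(
    name="Character A",
    memories=[
        Memory(stage=1, text="arrived at the school"),
        Memory(stage=3, text="won a match at the school"),
    ],
)
early_prompt = agent.build_prompt("tell me about the school", current_stage=1)
late_prompt = agent.build_prompt("tell me about the school", current_stage=3)
```

Filtering by stage before ranking is one simple way to make the same character answer differently as the story progresses; a real system would likely replace the overlap score with a vector store lookup.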
Problem

Research questions and friction points this paper is trying to address.

Enhancing video story interaction through multi-agent collaboration
Overcoming limitations in user selection and narrative customization
Enabling dynamic character growth and scene visualization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Vision Language Model for video story understanding
Combines RAG and Multi-Agent System for dynamic interactions
Customizes scenes and characters based on user queries
Yiwen Zhang
The Hong Kong University of Science and Technology (Guangzhou), China
Jianing Hao
The Hong Kong University of Science and Technology (Guangzhou)
Human-AI collaboration, Time-series representation, Visual analysis, Recommendation system
Zhan Wang
The Hong Kong University of Science and Technology (Guangzhou), China
Hongling Sheng
The Hong Kong University of Science and Technology (Guangzhou), China
Wei Zeng
The Hong Kong University of Science and Technology (Guangzhou), China