Narrative Weaver: Towards Controllable Long-Range Visual Consistency with Multi-Modal Conditioning

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing generative models struggle to maintain narrative coherence and visual consistency over long sequences, limiting their applicability in domains such as film production and e-commerce advertising. To address this challenge, this work proposes Narrative Weaver, a novel framework that integrates multimodal large language model–driven high-level narrative planning, fine-grained control via a dynamic memory bank, and a progressive multi-stage training strategy to enable end-to-end generation of controllable, long-horizon, visually consistent content. As part of this contribution, we introduce EAVSD, the first storyboard dataset for e-commerce advertising videos. Extensive experiments demonstrate state-of-the-art performance across controllable multi-scene generation, autonomous storytelling, and e-commerce advertising tasks, validating both the effectiveness and practical utility of the proposed approach.

📝 Abstract
We present "Narrative Weaver", a novel framework that addresses a fundamental challenge in generative AI: achieving multi-modal controllable, long-range, and consistent visual content generation. While existing models excel at generating high-fidelity short-form visual content, they struggle to maintain narrative coherence and visual consistency across extended sequences, a critical limitation for real-world applications such as filmmaking and e-commerce advertising. Narrative Weaver introduces the first holistic solution that seamlessly integrates three essential capabilities: fine-grained control, automatic narrative planning, and long-range coherence. Our architecture combines a Multimodal Large Language Model (MLLM) for high-level narrative planning with a novel fine-grained control module featuring a dynamic Memory Bank that prevents visual drift. To enable practical deployment, we develop a progressive, multi-stage training strategy that efficiently leverages existing pre-trained models, achieving state-of-the-art performance even with limited training data. Recognizing the absence of suitable evaluation benchmarks, we construct and release the E-commerce Advertising Video Storyboard Dataset (EAVSD), the first comprehensive dataset for this task, containing over 330K high-quality images with rich narrative annotations. Through extensive experiments across three distinct scenarios (controllable multi-scene generation, autonomous storytelling, and e-commerce advertising), we demonstrate our method's superiority while opening new possibilities for AI-driven content creation.
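The abstract does not detail how the dynamic Memory Bank works, but the general idea of such a component can be sketched as follows: store embeddings of entities seen in earlier scenes, blend in new observations, and retrieve the closest match when conditioning a new scene so the entity's appearance stays anchored. This is a minimal illustrative sketch, not the paper's actual implementation; the class name, EMA update rule, and cosine-similarity retrieval are all assumptions.

```python
import math


class MemoryBank:
    """Hypothetical dynamic memory bank for cross-scene visual consistency.

    Stores one embedding per entity; new observations are blended in with an
    exponential moving average (EMA), and retrieval returns the stored entity
    most similar to a query embedding. All design choices here are
    illustrative assumptions, not the method described in the paper.
    """

    def __init__(self, momentum=0.9):
        self.entries = {}          # entity name -> embedding (list of floats)
        self.momentum = momentum   # EMA weight: how strongly old appearance persists

    @staticmethod
    def _cosine(a, b):
        # Cosine similarity between two plain-list vectors.
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def update(self, name, embedding):
        """Insert a new entity, or blend the observation into the stored one."""
        if name in self.entries:
            old = self.entries[name]
            self.entries[name] = [
                self.momentum * o + (1.0 - self.momentum) * e
                for o, e in zip(old, embedding)
            ]
        else:
            self.entries[name] = list(embedding)

    def retrieve(self, query):
        """Return (name, embedding) of the stored entity closest to `query`."""
        if not self.entries:
            return None
        best = max(self.entries, key=lambda n: self._cosine(self.entries[n], query))
        return best, self.entries[best]
```

In a generation loop, the retrieved embedding would be fed back as a conditioning signal for the next scene, so that an entity drifts toward its stored appearance rather than away from it; the EMA update lets the bank adapt slowly to deliberate changes (e.g. a costume change) without losing identity.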
Problem

Research questions and friction points this paper is trying to address.

long-range visual consistency
narrative coherence
multi-modal controllable generation
visual drift
AI-driven content creation
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-range visual consistency
multimodal conditioning
narrative planning
dynamic Memory Bank
controllable generation