FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current video generation models struggle to simultaneously ensure cross-shot consistency in characters and backgrounds while maintaining flexibility in video length and shot count. To address this, we propose a cache-guided autoregressive diffusion framework featuring a novel two-tier caching mechanism, comprising shot-level memory and temporal memory, that decouples cross-shot consistency modeling from intra-shot temporal coherence modeling. Our approach supports multi-concept injection, dynamic shot expansion, and multi-round interactive synthesis. We also introduce a pipeline for constructing a high-quality multi-shot video training dataset. Quantitative and qualitative evaluations demonstrate that our method significantly improves cross-shot character stability, background consistency, and motion smoothness, while preserving high aesthetic quality. It achieves state-of-the-art performance across multiple consistency and generation quality metrics.

📝 Abstract
Current video generation models perform well at single-shot synthesis but struggle with multi-shot videos, facing critical challenges in maintaining character and background consistency across shots and flexibly generating videos of arbitrary length and shot count. To address these limitations, we introduce FilmWeaver, a novel framework designed to generate consistent, multi-shot videos of arbitrary length. First, it employs an autoregressive diffusion paradigm to achieve arbitrary-length video generation. To address the challenge of consistency, our key insight is to decouple the problem into inter-shot consistency and intra-shot coherence. We achieve this through a dual-level cache mechanism: a shot memory caches keyframes from preceding shots to maintain character and scene identity, while a temporal memory retains a history of frames from the current shot to ensure smooth, continuous motion. The proposed framework allows for flexible, multi-round user interaction to create multi-shot videos. Furthermore, due to this decoupled design, our method demonstrates high versatility by supporting downstream tasks such as multi-concept injection and video extension. To facilitate the training of our consistency-aware method, we also developed a comprehensive pipeline to construct a high-quality multi-shot video dataset. Extensive experimental results demonstrate that our method surpasses existing approaches on metrics for both consistency and aesthetic quality, opening up new possibilities for creating more consistent, controllable, and narrative-driven video content. Project Page: https://filmweaver.github.io
Problem

Research questions and friction points this paper is trying to address.

Maintaining character and background consistency across multiple video shots
Generating multi-shot videos of arbitrary length and shot count
Ensuring smooth motion within shots while preserving inter-shot identity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive diffusion for arbitrary-length video generation.
Dual-level cache mechanism ensures inter-shot and intra-shot consistency.
Supports multi-concept injection and video extension via decoupled design.
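The dual-level cache described above can be sketched as a simple data structure. This is an illustrative reading, not the authors' implementation: the class name, the fixed-size temporal window, and the keyframe-promotion step are assumptions; in the actual model these memories would condition the diffusion denoiser rather than hold raw frames.

```python
from collections import deque

class DualLevelCache:
    """Illustrative two-tier cache (hypothetical, not FilmWeaver's code):
    - shot_memory: keyframes from completed shots, for cross-shot
      character/scene identity.
    - temporal_memory: the most recent frames of the current shot,
      for smooth intra-shot motion."""

    def __init__(self, temporal_window: int = 4):
        self.shot_memory = []  # grows as shots are finished
        self.temporal_memory = deque(maxlen=temporal_window)  # sliding window

    def add_frame(self, frame):
        # Each newly generated frame enters the temporal memory;
        # old frames fall out once the window is full.
        self.temporal_memory.append(frame)

    def end_shot(self, keyframe):
        # On a shot boundary, promote a representative keyframe to
        # shot memory and reset the temporal memory for the next shot.
        self.shot_memory.append(keyframe)
        self.temporal_memory.clear()

    def context(self):
        # Conditioning context for the next autoregressive step:
        # cross-shot identity references plus local motion history.
        return list(self.shot_memory), list(self.temporal_memory)
```

Because shot memory persists across shot boundaries while temporal memory is reset, the two consistency problems stay decoupled, which is what makes extensions like multi-concept injection (seeding shot memory with reference images) and video extension (continuing from the temporal memory) natural fits.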