Long Context Tuning for Video Generation

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models excel at synthesizing long single-shot videos but struggle to maintain visual and motion consistency across the multiple shots of a narrative sequence. To address this, the paper proposes Long Context Tuning (LCT), a training paradigm that expands the context window of a pre-trained single-shot diffusion model so that full attention spans all shots in a scene, learning scene-level consistency directly from data. The method combines interleaved 3D position embeddings, an asynchronous noise strategy, and an optional context-causal attention fine-tuning stage that enables efficient autoregressive inference with a KV-cache, all without adding parameters. Experiments show that lightweight LCT fine-tuning turns off-the-shelf single-shot models into capable multi-shot generators that produce coherent scene-level narratives, with emerging capabilities such as compositional generation and interactive shot extension.

📝 Abstract
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.
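The abstract's core mechanism is that full attention is expanded from individual shots to all shots in a scene, with an interleaved 3D position embedding keeping tokens from different shots distinguishable. The sketch below is a minimal illustration of that idea, not the authors' code: it assumes each shot is tokenized into a (time, height, width) latent grid and that shots are interleaved by offsetting each shot along the temporal axis (the `time_gap` offset is an assumption for illustration).

```python
# Hypothetical sketch: interleaved 3D positions for scene-level attention.
# Each shot's latent grid is (t, h, w); all shots share one attention
# context, and shot i's frames are offset along the time axis so positions
# from different shots do not collide.

def scene_positions(num_shots, t, h, w, time_gap=1):
    """Return (shot_id, (t_pos, h_pos, w_pos)) for every token in the scene."""
    positions = []
    for shot in range(num_shots):
        t_offset = shot * (t + time_gap)  # per-shot temporal offset (assumption)
        for ti in range(t):
            for hi in range(h):
                for wi in range(w):
                    positions.append((shot, (t_offset + ti, hi, wi)))
    return positions

# Example: 3 shots, each a 2x2x2 latent grid, form one scene-level sequence.
pos = scene_positions(3, 2, 2, 2)
print(len(pos))        # 24 tokens in a single full-attention context
print(pos[0], pos[8])  # (0, (0, 0, 0)) (1, (3, 0, 0))
```

In a real model these 3D positions would parameterize a rotary or learned position embedding over the concatenated token sequence; the point of the interleaving is that the same attention weights serve one shot or many without parameter growth.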
Problem

Research questions and friction points this paper is trying to address.

How to extend single-shot video models to multi-shot scenes.
How to ensure visual and dynamic consistency across shots.
How to enable coherent multi-shot generation without extra parameters.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expands the context window to model scene-level consistency.
Uses an interleaved 3D position embedding strategy.
Enables joint and auto-regressive shot generation.
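The asynchronous noise strategy mentioned above is what lets one model cover both generation modes: each shot receives its own diffusion timestep, and a shot whose timestep is zero acts as clean conditioning context. The following sketch illustrates that scheme under stated assumptions; the function name and the 1000-step range are illustrative, not the paper's API.

```python
import random

# Hypothetical sketch of asynchronous per-shot noise levels. Clean context
# shots get timestep 0; remaining shots get independent noise levels, so
# joint generation (all shots noised) and auto-regressive shot extension
# (history clean, new shot noised) use the same model and no new parameters.

def sample_timesteps(num_shots, num_clean, max_t=1000, rng=random):
    """Return one diffusion timestep per shot; the first num_clean are 0."""
    ts = [0] * num_clean
    ts += [rng.randint(1, max_t) for _ in range(num_shots - num_clean)]
    return ts

# Joint generation: every shot is noised independently.
joint = sample_timesteps(4, num_clean=0)
# Auto-regressive extension: three clean history shots, one shot to denoise.
ar = sample_timesteps(4, num_clean=3)
print(ar[:3])  # [0, 0, 0]
```

With context-causal attention added on top, the clean history shots can also be cached as keys/values so each new shot attends to a fixed KV-cache instead of recomputing the whole scene.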