Long Context Tuning for Video Generation

📅 2025-03-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing video generation models excel at synthesizing long single-shot videos but struggle to maintain visual and motion consistency across the multiple shots of a narrative sequence. To address this, the paper proposes Long Context Tuning (LCT), a training paradigm that expands the context window of a pre-trained single-shot diffusion model so that full attention spans all shots in a scene, learning scene-level consistency directly from data. The method combines interleaved 3D position embeddings, an asynchronous noise strategy, and an optional context-causal attention fine-tuning stage that enables efficient autoregressive inference with a KV-cache, all without adding parameters. Experiments show that lightweight LCT fine-tuning turns off-the-shelf single-shot models into capable multi-shot generators that produce coherent scene-level narratives, with emerging capabilities such as compositional generation and interactive shot extension.

📝 Abstract
Recent advances in video generation can produce realistic, minute-long single-shot videos with scalable diffusion transformers. However, real-world narrative videos require multi-shot scenes with visual and dynamic consistency across shots. In this work, we introduce Long Context Tuning (LCT), a training paradigm that expands the context window of pre-trained single-shot video diffusion models to learn scene-level consistency directly from data. Our method expands full attention mechanisms from individual shots to encompass all shots within a scene, incorporating interleaved 3D position embedding and an asynchronous noise strategy, enabling both joint and auto-regressive shot generation without additional parameters. Models with bidirectional attention after LCT can further be fine-tuned with context-causal attention, facilitating auto-regressive generation with efficient KV-cache. Experiments demonstrate single-shot models after LCT can produce coherent multi-shot scenes and exhibit emerging capabilities, including compositional generation and interactive shot extension, paving the way for more practical visual content creation. See https://guoyww.github.io/projects/long-context-video/ for more details.
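The abstract's core mechanism is that full attention is expanded from individual shots to all shots in a scene, with an interleaved 3D position embedding keeping tokens from different shots distinguishable. The sketch below is a minimal illustration of that idea, not the authors' code: it assumes each shot is tokenized into a (time, height, width) latent grid and that shots are interleaved by offsetting each shot along the temporal axis (the `time_gap` offset is an assumption for illustration).

```python
# Hypothetical sketch: interleaved 3D positions for scene-level attention.
# Each shot's latent grid is (t, h, w); all shots share one attention
# context, and shot i's frames are offset along the time axis so positions
# from different shots do not collide.

def scene_positions(num_shots, t, h, w, time_gap=1):
    """Return (shot_id, (t_pos, h_pos, w_pos)) for every token in the scene."""
    positions = []
    for shot in range(num_shots):
        t_offset = shot * (t + time_gap)  # per-shot temporal offset (assumption)
        for ti in range(t):
            for hi in range(h):
                for wi in range(w):
                    positions.append((shot, (t_offset + ti, hi, wi)))
    return positions

# Example: 3 shots, each a 2x2x2 latent grid, form one scene-level sequence.
pos = scene_positions(3, 2, 2, 2)
print(len(pos))        # 24 tokens in a single full-attention context
print(pos[0], pos[8])  # (0, (0, 0, 0)) (1, (3, 0, 0))
```

In a real model these 3D positions would parameterize a rotary or learned position embedding over the concatenated token sequence; the point of the interleaving is that the same attention weights serve one shot or many without parameter growth.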
Problem

Research questions and friction points this paper is trying to address.

How to extend single-shot video models to multi-shot scenes.
How to ensure visual and dynamic consistency across shots.
How to enable coherent multi-shot generation without extra parameters.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expands the context window to model scene-level consistency.
Uses an interleaved 3D position embedding strategy.
Enables joint and auto-regressive shot generation.
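The asynchronous noise strategy mentioned above is what lets one model cover both generation modes: each shot receives its own diffusion timestep, and a shot whose timestep is zero acts as clean conditioning context. The following sketch illustrates that scheme under stated assumptions; the function name and the 1000-step range are illustrative, not the paper's API.

```python
import random

# Hypothetical sketch of asynchronous per-shot noise levels. Clean context
# shots get timestep 0; remaining shots get independent noise levels, so
# joint generation (all shots noised) and auto-regressive shot extension
# (history clean, new shot noised) use the same model and no new parameters.

def sample_timesteps(num_shots, num_clean, max_t=1000, rng=random):
    """Return one diffusion timestep per shot; the first num_clean are 0."""
    ts = [0] * num_clean
    ts += [rng.randint(1, max_t) for _ in range(num_shots - num_clean)]
    return ts

# Joint generation: every shot is noised independently.
joint = sample_timesteps(4, num_clean=0)
# Auto-regressive extension: three clean history shots, one shot to denoise.
ar = sample_timesteps(4, num_clean=3)
print(ar[:3])  # [0, 0, 0]
```

With context-causal attention added on top, the clean history shots can also be cached as keys/values so each new shot attends to a fixed KV-cache instead of recomputing the whole scene.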