🤖 AI Summary
Existing long video generation methods are limited by teacher models that possess only short-term memory, hindering student models from effectively learning global temporal dependencies and compromising long-term consistency. To address this, this work proposes the Context Forcing framework, which enables aligned supervision between teacher and student models over extended contexts for the first time. Additionally, a Slow-Fast Memory mechanism is introduced to efficiently compress historical information, overcoming the context-length bottleneck in streaming training. The proposed approach supports effective context modeling beyond 20 seconds—achieving a 2–10× improvement over prior methods such as LongLive and Infinite-RoPE—and demonstrates significant gains across multiple metrics of long-video temporal consistency.
📝 Abstract
Recent approaches to real-time long video generation typically employ streaming tuning strategies, attempting to train a long-context student using a short-context (memoryless) teacher. In these frameworks, the student performs long rollouts but receives supervision from a teacher limited to short 5-second windows. This structural discrepancy creates a critical **student-teacher mismatch**: the teacher's inability to access long-term history prevents it from guiding the student on global temporal dependencies, effectively capping the student's context length. To resolve this, we propose **Context Forcing**, a novel framework that trains a long-context student via a long-context teacher. By ensuring the teacher is aware of the full generation history, we eliminate the supervision mismatch, enabling the robust training of models capable of long-term consistency. To make this computationally feasible for extreme durations (e.g., 2 minutes), we introduce a context management system that transforms the linearly growing context into a **Slow-Fast Memory** architecture, significantly reducing visual redundancy. Extensive experiments demonstrate that our method enables effective context lengths exceeding 20 seconds -- 2 to 10 times longer than state-of-the-art methods like LongLive and Infinite-RoPE. By leveraging this extended context, Context Forcing preserves superior consistency across long durations, surpassing state-of-the-art baselines on various long video evaluation metrics.
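To make the Slow-Fast Memory idea concrete, here is a minimal sketch of how a linearly growing frame history can be split into a full-rate "fast" window of recent frames plus a pooled "slow" memory of older frames. All names and parameters (`slow_fast_memory`, `fast_window`, `slow_stride`, mean-pooling as the compressor) are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def slow_fast_memory(frame_tokens, fast_window=16, slow_stride=4):
    """Hypothetical sketch of a slow-fast context memory.

    Recent frames (the "fast" memory) are kept at full temporal
    resolution; older frames (the "slow" memory) are mean-pooled in
    groups of `slow_stride`, so the stored context grows sublinearly
    with video length instead of linearly.
    """
    if len(frame_tokens) <= fast_window:
        return list(frame_tokens)
    old = frame_tokens[:-fast_window]       # history to compress
    recent = frame_tokens[-fast_window:]    # kept at full rate
    # Compress the history: one pooled token per `slow_stride` frames.
    slow = [np.mean(old[i:i + slow_stride], axis=0)
            for i in range(0, len(old), slow_stride)]
    return slow + list(recent)
```

With `fast_window=16` and `slow_stride=4`, a 32-frame history is reduced to 20 entries (4 pooled "slow" tokens plus 16 recent "fast" frames), which is the kind of redundancy reduction the abstract attributes to the Slow-Fast Memory design.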