Deep Forcing: Training-Free Long Video Generation with Deep Sink and Participative Compression

📅 2025-12-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing autoregressive video diffusion models suffer from temporal repetition, motion drift, and deceleration during long-sequence streaming generation; directly adapting StreamingLLM-style attention sinks further degrades fidelity and induces motion stagnation. This paper proposes Deep Forcing, a training-free method for ultra-long video extrapolation built on deep contextual stabilization and critical-information preservation. Its core innovations are Deep Sink (persistent sink tokens occupying half of the sliding window, with their temporal RoPE phase re-aligned to the current timeline) combined with Participative Compression (importance-aware KV-cache pruning that retains only tokens actively participating in recent attention), ensuring long-term temporal consistency and real-time inference without fine-tuning. Experiments demonstrate over 12× temporal extrapolation (e.g., a model trained on 5 s clips generating 60+ s videos), outperforming LongLive and RollingForcing in image quality, aesthetic score, and motion richness.

📝 Abstract
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, two training-free mechanisms that address these failures without any fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV-cache pruning that preserves only tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution length generation. Together, these components enable over 12× extrapolation (e.g., a 5 s-trained model generating 60+ s videos) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, near-parity in overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressive streaming long-video generation.
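The RoPE phase realignment in Deep Sink relies on a standard property of rotary embeddings: rotations compose additively in position, so a cached sink key can be moved to the current timeline by applying the rotation for the position delta. A minimal numpy sketch (function names are our own; the paper's exact formulation may differ):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate an even-dimensional vector x to RoPE position `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def realign_sink_keys(sink_keys, old_pos, new_pos):
    """Re-align cached sink keys from their original temporal positions to the
    current timeline: because RoPE rotations compose additively, applying a
    rotation of (new - old) moves a key cached at `old` to position `new`."""
    return np.stack([rope(k, n - o) for k, o, n in zip(sink_keys, old_pos, new_pos)])
```

Since `rope(rope(x, a), b) == rope(x, a + b)` holds exactly, realignment never needs the un-rotated keys; the cached (already-rotated) keys suffice.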
Problem

Research questions and friction points this paper is trying to address.

Address temporal repetition, drift, and motion deceleration in long video generation
Prevent fidelity degradation and motion stagnation from naive attention sink methods
Enable training-free long video extrapolation with improved quality and consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Deep Sink stabilizes context with persistent tokens and RoPE realignment
Participative Compression prunes the KV cache via importance-aware token preservation
Training-free mechanisms enable long video extrapolation with real-time generation
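The importance-aware pruning behind Participative Compression can be sketched as a simple scoring pass: rank each cached token by the attention mass it receives from the most recent queries and keep only the top participants, in temporal order. This is a hypothetical sketch (the paper's actual scoring rule and thresholds may differ):

```python
import numpy as np

def participative_prune(keys, values, recent_queries, keep):
    """Keep the `keep` cached tokens that participate most in recent attention.
    keys, values: (n_cached, d); recent_queries: (n_recent, d)."""
    logits = recent_queries @ keys.T / np.sqrt(keys.shape[-1])
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    importance = attn.mean(axis=0)                  # per-token participation score
    kept = np.sort(np.argsort(importance)[-keep:])  # top-k, temporal order preserved
    return keys[kept], values[kept], kept
```

Sorting the surviving indices keeps the pruned cache in temporal order, which matters when positions are subsequently realigned against the current timeline.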