🤖 AI Summary
Existing autoregressive video diffusion models suffer from temporal repetition, motion drift, and deceleration during long-sequence streaming generation; directly adapting StreamingLLM-style attention sinks further degrades fidelity and induces dynamic stagnation. This paper proposes Deep Forcing, a training-free method for ultra-long video extrapolation built on deep contextual stabilization and critical-information preservation. Its core innovation is the combination of Deep Sink (a sliding-window persistent sink-token mechanism) with Participative Compression (importance-aware KV pruning coupled with temporal RoPE phase realignment), which together ensure long-term temporal consistency and real-time inference without fine-tuning. Experiments demonstrate over 12× temporal extrapolation (e.g., 5 s → 60+ s), outperforming LongLive and RollingForcing in imaging quality, aesthetic score, and motion richness.
📝 Abstract
Recent advances in autoregressive video diffusion have enabled real-time frame streaming, yet existing solutions still suffer from temporal repetition, drift, and motion deceleration. We find that naively applying StreamingLLM-style attention sinks to video diffusion leads to fidelity degradation and motion stagnation. To overcome this, we introduce Deep Forcing, which consists of two training-free mechanisms that require no fine-tuning. Specifically, 1) Deep Sink dedicates half of the sliding window to persistent sink tokens and re-aligns their temporal RoPE phase to the current timeline, stabilizing global context during long rollouts. 2) Participative Compression performs importance-aware KV-cache pruning that preserves only the tokens actively participating in recent attention while safely discarding redundant and degraded history, minimizing error accumulation under out-of-distribution-length generation. Together, these components enable over 12× extrapolation (e.g., a model trained on 5-second clips generating 60+ seconds) with better imaging quality than LongLive, better aesthetic quality than RollingForcing, nearly preserved overall consistency, and substantial gains in dynamic degree, all while maintaining real-time generation. Our results demonstrate that training-free KV-cache management can match or exceed training-based approaches for autoregressive streaming long-video generation.
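The abstract describes a KV-cache policy: reserve half the sliding window for persistent sink tokens, keep only the most "participative" recent tokens by attention mass, and remap the retained positions onto a contiguous timeline for RoPE. The paper does not publish code here, so the following is only a minimal NumPy sketch of that policy under assumed shapes; the function names `manage_kv_cache` and `realign_rope_phase`, the `sink_frac` parameter, and the per-token attention-mass importance score are all illustrative assumptions, not the authors' API.

```python
import numpy as np

def manage_kv_cache(keys, values, attn_mass, window, sink_frac=0.5):
    """Hypothetical sketch of Deep-Forcing-style cache management.

    keys, values: (T, d) cached entries for T past tokens.
    attn_mass:    (T,) attention mass each cached token received recently
                  (assumed importance score for Participative Compression).
    window:       total KV budget; a sink_frac share is reserved for
                  persistent sink tokens (Deep Sink), the rest for the
                  most actively attended recent tokens.
    """
    T = keys.shape[0]
    if T <= window:
        return keys, values, np.arange(T)
    n_sink = int(window * sink_frac)   # oldest tokens act as persistent sinks
    n_keep = window - n_sink           # budget for participative recent tokens
    sink_idx = np.arange(n_sink)
    rest = np.arange(n_sink, T)
    # importance-aware pruning: keep tokens with the highest recent attention
    top = rest[np.argsort(attn_mass[rest])[::-1][:n_keep]]
    keep = np.concatenate([sink_idx, np.sort(top)])
    return keys[keep], values[keep], keep

def realign_rope_phase(positions, current_start=0):
    # Temporal RoPE re-alignment: collapse the retained (now gapped)
    # positions onto a contiguous, in-distribution timeline.
    ranks = np.argsort(np.argsort(positions))
    return current_start + ranks
```

The key design point the sketch mirrors is that pruning leaves gaps in the position sequence, so the retained tokens must be re-indexed contiguously before RoPE is applied, keeping positions inside the range the model saw at training time.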