🤖 AI Summary
Video diffusion Transformers generalize poorly beyond their training sequence length, producing periodic content repetition and degraded visual quality. Through an analysis of attention maps, this work traces both failures to “attention dispersion”: tokens beyond the training window dilute the attention patterns learned during training. We propose a training-free, plug-and-play attention suppression method: during inference, a constant decay factor attenuates attention to tokens beyond the training window. The method requires no fine-tuning. At 4× length extrapolation, it improves Dynamic Degree by 233% and Imaging Quality by 40.5% over the previous best method. It also transfers to downstream tasks such as controllable video synthesis and editing.
📝 Abstract
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This dilution causes quality degradation; repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines by a large margin across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
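To make the idea of suppressing out-of-window attention concrete, the following is a minimal NumPy sketch and not the UltraViCo implementation. Several choices here are assumptions made for illustration: the function name `decayed_attention`, the 1D token positions (real video tokens are spatio-temporal), the definition of "beyond the training window" as a relative distance of at least `train_window`, and applying the constant factor `decay` to the unnormalized softmax weights before renormalizing.

```python
import numpy as np

def decayed_attention(q, k, v, train_window, decay=0.1):
    """Self-attention with out-of-window keys down-weighted by a constant.

    Hypothetical sketch: q, k, v have shape (n, d) and token i sits at
    1D position i. Keys at relative distance >= train_window are treated
    as lying beyond the training window (an assumption).
    """
    n, d = q.shape
    logits = q @ k.T / np.sqrt(d)

    # Boolean mask of (query, key) pairs farther apart than the window.
    pos = np.arange(n)
    outside = np.abs(pos[:, None] - pos[None, :]) >= train_window

    # Adding log(decay) to a logit multiplies its unnormalized softmax
    # weight by `decay`; decay=1.0 recovers standard attention.
    logits = np.where(outside, logits + np.log(decay), logits)

    # Numerically stable softmax over keys.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With `decay=1.0` the function reduces to ordinary softmax attention, so the factor acts as a single knob: smaller values shift attention mass back toward tokens inside the window, which is the dilution that the abstract calls attention dispersion working in reverse. How the factor is chosen, and which tokens count as out-of-window in a 3D video layout, are not specified here.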